What affects the speed of C++ programs and how to achieve it with a high level of code? The lead developer of the CatBoost library, Evgeny Petrov, answered these questions using examples and illustrations from the experience of working on CatBoost for x86_64.
Video report

- Hi all. I'm doing CPU optimization for the CatBoost machine learning library. The main part of our library is written in C++. Today I will tell you in what simple ways we achieve speed.

Computation speed consists of two parts. The first part is the algorithm. If we make a mistake with the choice of the algorithm, then you can’t make it work quickly later. The second part is how our algorithm is optimized for the computing system that we have, with its performance and throughput.

Separately, data exchange and calculations have to be taken into account due to the large difference in their speed. If we take the speed of memory as the speed of a pedestrian, then the speed of calculations is approximately the cruising speed of a passenger plane.
To smooth out this difference, the architecture has several levels of caching. The fastest and smallest is the L1 cache. Then there is a larger and slower cache, the second level. And there is a very large cache, which can be tens of megabytes, a third-level cache, but it is the slowest.

Due to the different data exchange rates of computations, the computational code is divided into two classes. One class is limited by bandwidth, that is, the speed of data exchange. The second class is limited by the speed of the processor. The boundary between them is set depending on the number of operations that are performed with one data byte. This is usually a constant for specific code.
Most of the computationally heavy code is long written, very well optimized, and there are a large number of libraries, so it makes sense if you see heavy computations in your code, look for a library that can do them for you.

Of the rest, compilers do not know how to do everything, since a very limited percentage of resources is spent on their development. Which of them are developing more or less actively today, that is, they support standards and try to follow them? This is the frontend EDG, which is used in various derivatives, for example, the Intel compiler; LLVM; GNU and frontend Microsoft.
Since there are few of them, compilers only support frequency control patterns and data dependencies. If we look at control, then these are linear sections and simple cycles, that is, a sequence of instructions and repetition. They learn frequency dependences on data from reduction, when we, say, sum many elements into one, collapse and perform element-by-element actions on one or more arrays.
What is left for developers? This can be roughly divided into four parts. The first is the architecture of the application, compilers are simply not able to come up with it for us.

Parallelization is also a tricky thing for compilers. Working with memory - because it is really difficult: you need to take into account the architecture, and parallelization, and all together. In addition, compilers do not know how to properly evaluate the quality of optimization, how fast the code turns out. We, the developers, also have to do this, make a decision - to optimize further or stop.
In the architecture part, we'll look at the amortization of overhead, virtual calls, on which the architecture is based in a large number of cases.
Let's leave parallelization out of the equation. Regarding the use of memory: this is also, in a sense, depreciation and correct work with data, their correct placement in memory. In terms of evaluating efficiency, let's talk about profiling and how to look for bottlenecks in the code.
The use of interfaces and abstract data types is one of the main design methods. Consider a similar computational code from machine learning. This is a conditional code that updates the forecast with a gradient method.

If we look a little inside and try to understand what is happening inside, then we have an IDerCalcer interface for calculating the derivatives of the loss function, and a function that shifts the forecast (our prediction) according to the gradient of the loss function.
On the right side of the slide you see what this means for the 10D case. And in machine learning, the size of the forecast is not two or three, but millions, tens of millions of elements. Let's see how good this code is for a vector of about XNUMX million elements.

Let's take the standard deviation as the objective function and measure how it works, how much it will shift this forecast. The derivative of this objective function is shown on the slide. The running time on the conditional machine, which then remains fixed, is 40 ms.

Let's try to understand what is wrong here. The first thing that attracts attention is virtual calls. If you look at the profiler, you can see that, depending on the number of parameters, this is about five to ten instructions. And if, as in our case, the calculation of the derivative itself is just two arithmetic operations, then this can easily turn out to be a significant overhead. For a large body when calculating derivatives, this is approx. For a short body that calculates a derivative - say, not even 500 instructions, but 20, 50 or even less - this will already be a significant percentage of the time. What to do? Let's try to buffer the virtual function call by changing the interface.

Initially, we calculated the derivatives point by point, for each element of the vector separately. Let's pass from element-by-element processing to processing by vectors. Let's take the standard C++ template, which allows you to work with a view on a vector. Or, if your compiler doesn't support the latest standard, you can use a simple homemade class that holds a pointer to the data and size. How will the code change? We are left with one call that calculates derivatives, and then we have to add a loop that will actually update the forecast.

In addition to adding a cycle, we still have to look at the data a second time, that is, read the forecast vector itself and the gradient vector that we just calculated a second time.

We will measure again on the same machine and see what turned out worse, something went wrong. Let's understand what happened to the code.

It makes no sense to suspect a cycle of something, since this is exactly the same frequency pattern that compilers recognize and optimize well. The number of operations per data element there will be less than the cost of a virtual call.

But the creation of a large vector and a second pass through it - in this place it is worth suspecting a problem. To understand why this is bad and leads to slowdowns, you need to imagine what happens in memory when the code that we see on the slide on the right is running.

When the vector of derivatives is calculated, it comes to a cycle that shifts the forecast. Before this cycle, only a very small portion of the data will remain in the fast LXNUMX cache, which runs at the processor frequency. On the slide, it's green at the traffic light. The rest of the data will be pushed out of the cache into memory, and when the loop goes to update the predictions, the data will have to be read from memory a second time. And it works for us, in general, very slowly, at the speed of a pedestrian.

When we update the forecast, we don't have to read all the derivatives at once. It is enough to count them in large bundles to absorb virtual calls. Therefore, it makes sense to split the calculation of derivatives and updating the forecast into small blocks and mix these two actions. What will this lead to if you look at where the data will be read from?

This will lead to the fact that we will take data all the time, and to the fact that the data will remain in the L1 cache and will not have time to go to slow memory. And then we need to understand who will tell us this block size.

It is logical to entrust the derivative calculator itself, because only he knows how much cache he needs. Next, you need to rewrite the loop that we looked through the array. It needs to be divided into two. The outer loop will go through the blocks, and inside we will go through the elements of the block twice.

Here it is, external by blocks.

And here is the inner element of the block.

We take into account that the last block may be incomplete.

Let's see what comes of it. We see that we guessed correctly, correctly understood what was happening, and at the cost of rather small changes in the code, we reduced the time of work by eight percent. But we can do even more. We need to take another critical look at what we have written. Look at the function that calculates the derivatives for us. It returns us a vector of derivatives, access to the elements of which in adverse situations will be slow.

There are two reasons. First, allocating a vector on the heap. Chances are good that this vector will be repeatedly created and destroyed. The second minus in terms of speed is that each time we will receive memory, probably at a new address. This memory will be "cold" from the point of view of the cache, that is, before writing to it, the processor will perform an auxiliary read to initialize the data in the cache.
To fix this, you need to take the allocation out of the loop. To do this, we will have to change the interface again, stop returning vectors and start writing derivatives into the memory that we receive from the calling code.

This is a standard technique - the removal of all manipulations with resources from bottlenecks in the computational code. Let's add one more parameter to the CalcDer method, view on the vector where the derivatives should fall.

The code will also change in an obvious way. The derivative vector will be one, outside of all loops, and a new parameter will simply be added to the method.

We look. It turns out that we have won about another eight percent compared to the previous version, and compared to the base version - already 15%.
It is clear that optimizations are not limited to the amortization of overhead costs, that there are other types of bottlenecks.

To illustrate how to look for bottlenecks, we need one more simple experimental code. For example, I took the matrix transposition. We have an approx matrix and an approxByCol matrix where we need to put the transposed data. And a simple nest of two cycles. There are no virtual calls here to create vectors. It's just data transfer. The loop is relatively convenient for the compiler.
Let's measure how this code works on a sufficiently large matrix and on a specific machine.

For example, I took the number of rows 1000, the number of columns 100. The machine is an Intel server, one core. Memory is exactly like that, it is important for us, because all work with memory and speed will depend on the speed of the memory. We measured and got 000 s. Is it a lot or a little? What can we do during this time?

We have time to read 800 megabytes, this is not a transposed matrix, but the original one. And also read and write 1,6 GB, this is already a transposed matrix. The processor performs an auxiliary read before writing to initialize the data in the cache.

Let's calculate how much bandwidth we used with benefit. It turns out that the throughput of our code was 1,7 GB / s.

It was a theoretical calculation. Then you can take a profiler that can measure the speed of working with memory. I took VTune. Let's see what it shows. Shows a similar figure - 1,8 GB. In principle, it agrees well, because in our calculation we did not take into account that we have to read row addresses and column addresses. Plus, VTune registers background activity with the operating system. Therefore, our model is consistent with reality.
To understand whether 1,7 GB is a lot or a little, you need to find out what maximum bandwidth is available to us.
To do this, you need to read the specs on the processor. There is a special site ark.intel.com where you can find out everything about any processor. If we look specifically at our server, we can see that it has eight cores and for the fastest DDR3 memory it supports, transfers at about 60 GB / s one way.

But here we must take into account that we use only one core and our memory is slower, that is, we need to scale these 60 GB to our conditions in proportion to the number of cores and memory frequency.
It turns out that our code could use 5,3 GB one way. And since you can read and write in parallel, then ideally, if we just copied data from place to place, we would reach 10,6. Since we have two reads and one write, it should be about 8 GB / s. We remember that we got 1,7. That is, we used somewhere around 20%.
Why is it so? Again, you need to deal with the architecture. The fact is that the data between the memory and the cache is not transferred in arbitrary packets, but in 64 bytes exactly, no more, no less. This is the first consideration.

The second consideration is that we write the transposed data not sequentially, but randomly, since the rows of the matrix are located in memory in an unpredictable way.
It turns out that before writing one real number, we have to subtract 64 bytes of data. If we denote the size of the matrix N, then instead of the optimal operating time (N / 5,3 + N / 10,6), we get (8 * N / 5,3 + N / 10,6). Somewhere four to five times more, which explains this efficiency of 20%.

What to do with it? You need to stop writing data in one column and start writing as many columns as fit into one cache line (64 bytes). To do this, we split the cycle over columns into a cycle over cache lines and a nested loop over cache line elements.

Here they are, iterations over the cache lines.

And here they are, iterations inside the cache line. Here, for simplicity, we assume that the data is aligned to the border of the cache line. Now let's check with VTune what happens.

We see what happened close to the calculated eight gigabytes per second - 7,6. But it’s not yet a fact that all these 7,6 are useful work. Maybe some of them are overhead.
To understand how much benefit we got, let's measure the running time after optimization. It turns out 0,5 s on the same machine. The throughput, which is accounted for by the transposition itself, has become 4,8 GB / s. It can be seen that there is still a margin that we did not choose, but anyway, we got 20 percent from 60 percent efficiency.
With the help of a profiler, you can figure out why we did not get 80% or 95%.

The fact is that we store matrices as a vector of vectors, that is, we use memory access with double indirection.

With VTune, you can see which instructions are generated to access array elements. Instructions that subtract the addresses of the columns of the transposed matrix are highlighted in yellow on the left. And these are, firstly, additional instructions, and secondly, additional data transfers. But we will not optimize further, we will stop and summarize.

What did I tell you today? A useful tip for working with computational code is to process in blocks, to amortize overhead costs that are associated, for example, with virtual calls. Plus, due to blocking, data locality improves, and we get a higher access speed.
The removal of allocations from bottlenecks is also their depreciation. And also an increase in access speed by fixing temporary buffers in memory.
About profiling. First, profiling is a useful technique for finding bottlenecks "in the general case." Secondly, it allows us to evaluate the efficiency of the code, decide whether we are satisfied with the speed or want to optimize further, and shows in which direction to move.
That's all for me. If you use CatBoost or heard about it for the first time and want to know what it is, read come visit us for , write to . Thank you very much for your attention.
Source: habr.com
