A friend recently asked me for “the simplest optimization problem I could think of”. This led to a fun discussion of low-level optimization and how the CPU executes your code. And so I decided to share it here.
Let’s make it clear though, that the following will have very little practical use. We’re not just into “it doesn’t make a measurable difference” territory, but also deep into “the compiler will do this for you”. So please, don’t try to apply these “optimizations” to your real-world code to save a clock cycle.
This is merely intended as a thought experiment, illustrating some of the factors that makes performance so difficult to predict. And now, with that disclaimer in place, let’s get on with it:
The problem
The “problem” I came up with was the evaluation of x+x+x+x. This was the simplest snippet of code I could think of for which optimization is possible. For the sake of this discussion, let us assume that x is an integer.
A naive compiler will evaluate this code as ((x+x)+x)+x. In other words, it will evaluate one addition, feed that result to the second addition, and then finally feed the result of that to the third addition.
Optimization #1
The optimization I suggested was to evaluate it as (x+x)+(x+x) instead. And why is this faster?
A modern CPU is superscalar — that is, it is able to execute multiple instructions every clock cycle. Depending on the CPU model, it can probably execute three or four instructions belonging from the same thread every cycle.
So where the original version would take three times the duration of an add instruction to execute, my optimization can be done in two times the duration: Both the initial subexpressions can be evaluated in parallel. And so, after only the duration of one add instruction, we’ve got the result of two of the additions, and can perform the third and final one. So in this very simple case, we actually reduced the run time by 33%. Not bad, eh?
Optimization #2
My friend then asked if x*4 would be an optimization as well. And now it gets a bit more interesting.
First, of course, x*4 is just a single multiplication. Is that faster than three additions? Is it faster than two additions (which is the time it’d take for my “optimized” version to run)?
That depends on the speed of a multiplication instruction. On common CPU’s, a moment’s research tells us that add has a latency of 1 cycle, and mul has a latency of 3 cycles.1, so the multiplication takes as long as the original unoptimized version.
Evaluation
So what does this mean? That at a glance, optimization #1 is faster than #2, certainly. #1 yields a result after two clock cycles, where #2 takes a whopping three cycles.
But there are other factors at play. Sometimes the multiplication version may be more efficient. The CPU has a limited number of execution units. It can also only decode a limited number of instructions at a time.
The version using addition requires three instructions to be decoded, and uses two execution units during the first cycle, and one unit in the second. All in all, we’re occupying three “execution-unit cycles”. The version using multiplication does take three cycles, but because modern CPU’s are pipelined, it only occupies the execution unit during the first cycle. In the second cycle, the execution unit is able to begin on a new instruction, while continuing to process the mul instruction. So this version only requires one “execution-unit cycle”. In other words, we’ll free up other execution units so they can execute other instructions. We’re also freeing the front-end from having to decode three instructions.
So we now know that:
- If we need the result as soon as possible, the optimized
addversion will be more efficient because it finishes sooner. - But if we need to execute a lot of other instructions as well, the
mulversion will be more efficient because it uses fewer hardware resources on the CPU
What if we have a lot of instructions we want to execute and we need the result soon? Or if we only have these instructions to execute, and we don’t care about when we’ll get the result (perhaps the next operation is to add the result to that of an ongoing division, which is very slow, so it won’t matter if we take 2, 3 or 15 cycles to get ready)? Hard to say. Either one may be preferable.
Of course on x86 CPUs we also have to take the variable instruction length into account. How many bytes does a mul instruction take? What about three adds? That affects both how much data has to be read from memory and how much space will be taken up in CPU cache, and so that should be taken into account as well.
So what can we learn from this? Mainly that performance is nontrivial. Never assume that you can tell whether some code is “fast” or “slow”. And be especially careful with assumptions about how it can be improved. It is very possible that your “optimization” will actually run slower.
Whenever you optimize code, do as the pros: measure, measure and measure. Measure the speed of the original code. Measure the result of the optimized code. Be careful with the many ways in which your measurement can be invalidated (by the compiler optimizing away the code you wanted to test, or by the CPU cache changing the result in your test case from what you’d expect in the real world by caching — or not caching — the data you’re operating on).
And when performing low-level optimizations, another vital piece of advice is to understand the hardware. Know which instructions are being executed, know the cost of instructions on the relevant hardware, and know what other tricks the hardware uses (Your CPU is probably superscalar and pipelined, and processes instructions out of order. It probably also has a cache of a certain size, with a specific cache line size, and a certain associativity. It has a fixed number of execution units, a known pipeline length and so on. And while we’re at it, the memory subsystem matters too. How long does it take to access RAM? How can the CPU reorder reads and writes? What is its policy for writes? When are they pushed from cache to RAM? If you want to optimize your code on the instruction level, you need to know your CPU. Even the simplest code is affected by dozens such factors, any of which might make a difference.
-
For the sake of this example, let us assume that simple multiplication and addition instructions are used. Some CPU’s may have more complex instructions that, for example, can perform the multiplication faster if the second operand is a power of two. And of course we could implement the multiplication as
x << 2too. ↩
Tags: assembly, cpu, low-level, optimization, performance





All good points! If you haven’t seen it before, “Producing Wrong Data Without Doing Anything Obviously Wrong!” [http://www-plan.cs.colorado.edu/diwan/asplos09.pdf] is a good look at the problems involved with measuring performance at a a more global level. The paper points out that changing “innocuous” things like link order and environment size can have a larger performance impact than the change from moderate to heavy compiler optimization!
There’s a Google Tech Talk called “We have it easy, but do we have it right?” covering the results, for anyone interested. I recommend both the paper and the video!
Other good videos include Elizabeth Bradley’s “Chaos in Computer Dynamics” and Michael Hind’s “The Impact of Multicore Architecture on Software.” Both of those are UWash colloquium videos: [http://norfolk.cs.washington.edu/htbin-post/unrestricted/colloq/search.cgi]
One interesting result from the Hind video is that, while –O3 overall slightly improves performance at the application level, things look very different at the method level. Most methods are not affected, a few are made much faster, and a few are made much slower! This suggests that JIT and/or tracing compilers will be much better positioned for effective optimization than traditional static compilers.
I’m just finishing up with a computer architecture course, so the complexities of modern CPUs are not (entirely) lost on me. But at a certain point, I can’t help but wonder how much good the extra knowledge brings, precisely because performance is so context-sensitive. An optimization on a Core i7 may be a pessimization on an Atom. Like you said, measure measure measure…
Thanks for the comment! :)
True, but there are still areas where such micro-optimizations may be useful. First, if you know what the CPU actually does to your code, you can make some more general assumptions about what is fast and what isn’t. Taking the example in my post above, the exact latency might vary, but I doubt you’ll find a CPU where integer multiplication has less latency than addition. So we can still reason about what is preferable inside a tight loop where the combined latency is what’s holding us back. Or knowing that almost every modern CPU is superscalar, we can determine that a higher instruction count isn’t necessarily worse for performance. We can try to split up our dependencies so that subexpressions can be evaluated in parallel.
But of course you’re right, at a certain point we get down to the really CPU-specific stuff, and that’s probably dangerous to rely on, unless you know the exact hardware configuration on which the program is going to run. (You might be targeting a Playstation 3 specifically, for example, and then you don’t care that your optimizations wouldn’t work on other CPU’s)
But I thought the point was not to make assumptions in the first place? ;-)
I agree that remembering what the combination of OoO and superscalar can do is important, and easy to forget. Witness Mike Pall and LuaJIT2 for the payoff…
All things considered, cache layout is probably the most practical performance-related “thing” to keep in mind for most programmers. It’s much easier to reason about memory access patterns than asm dependency chains. But of course I wish I had measurements to back my intuition up.
My favorite example of the effect of a (relatively) obscure hardware structure on code is that inline caches make more sense than C++ style vtbls on a processor without a BTB, but the BTB helps the vtbl more than the inline cache. So modern processors are, in effect, optimized for C++-style virtual dispatch!
You’re right, cache layout is generally easier to reason about, and very important performance-wise. I did a project at university a couple of years ago where we saw a 2x speedup simply from changing from column– to row-major traversal of a 2D array. It’s definitely the first thing to check if you need to squeeze better performance out of your code. (partly because much of the ASM hackery can be done by the compiler, but it can’t do much about the cache layout)
I didn’t really intend this post to be a guide to “useful” optimizations though. It’s just intended to highlight some of the quirks and complexities of low-level optimization. If anything, it should be taken as a warning that trying to optimize by fiddling with individual ASM instructions is going to cause a lot of headaches, and you’ll see some unexpected results in many cases.