Adventures in Microoptimizations

A friend recently asked me for “the sim­plest opti­miza­tion prob­lem I could think of”. This led to a fun dis­cus­sion of low-level opti­miza­tion and how the CPU exe­cutes your code. And so I decided to share it here.

Let’s make it clear though, that the fol­low­ing will have very lit­tle prac­ti­cal use. We’re not just into “it doesn’t make a mea­sur­able dif­fer­ence” ter­ri­tory, but also deep into “the com­piler will do this for you”. So please, don’t try to apply these “opti­miza­tions” to your real-world code to save a clock cycle.

This is merely intended as a thought exper­i­ment, illus­trat­ing some of the fac­tors that makes per­for­mance so dif­fi­cult to pre­dict. And now, with that dis­claimer in place, let’s get on with it:

The prob­lem

The “prob­lem” I came up with was the eval­u­a­tion of x+x+x+x. This was the sim­plest snip­pet of code I could think of for which opti­miza­tion is pos­si­ble. For the sake of this dis­cus­sion, let us assume that x is an integer.

A naive com­piler will eval­u­ate this code as ((x+x)+x)+x. In other words, it will eval­u­ate one addi­tion, feed that result to the sec­ond addi­tion, and then finally feed the result of that to the third addition.

Opti­miza­tion #1

The opti­miza­tion I sug­gested was to eval­u­ate it as (x+x)+(x+x) instead. And why is this faster? A mod­ern CPU is super­scalar — that is, it is able to exe­cute mul­ti­ple instruc­tions every clock cycle. Depend­ing on the CPU model, it can prob­a­bly exe­cute three or four instruc­tions belong­ing from the same thread every cycle.

So where the orig­i­nal ver­sion would take three times the dura­tion of an add instruc­tion to exe­cute, my opti­miza­tion can be done in two times the dura­tion: Both the ini­tial subex­pres­sions can be eval­u­ated in par­al­lel. And so, after only the dura­tion of one add instruc­tion, we’ve got the result of two of the addi­tions, and can per­form the third and final one. So in this very sim­ple case, we actu­ally reduced the run time by 33%. Not bad, eh?

Opti­miza­tion #2

My friend then asked if x*4 would be an opti­miza­tion as well. And now it gets a bit more interesting.

First, of course, x*4 is just a sin­gle mul­ti­pli­ca­tion. Is that faster than three addi­tions? Is it faster than two addi­tions (which is the time it’d take for my “opti­mized” ver­sion to run)?

That depends on the speed of a mul­ti­pli­ca­tion instruc­tion. On com­mon CPU’s, a moment’s research tells us that add has a latency of 1 cycle, and mul has a latency of 3 cycles.1, so the mul­ti­pli­ca­tion takes as long as the orig­i­nal unop­ti­mized version.

Eval­u­a­tion

So what does this mean? That at a glance, opti­miza­tion #1 is faster than #2, cer­tainly. #1 yields a result after two clock cycles, where #2 takes a whop­ping three cycles.

But there are other fac­tors at play. Some­times the mul­ti­pli­ca­tion ver­sion may be more effi­cient. The CPU has a lim­ited num­ber of exe­cu­tion units. It can also only decode a lim­ited num­ber of instruc­tions at a time.

The ver­sion using addi­tion requires three instruc­tions to be decoded, and uses two exe­cu­tion units dur­ing the first cycle, and one unit in the sec­ond. All in all, we’re occu­py­ing three “execution-unit cycles”. The ver­sion using mul­ti­pli­ca­tion does take three cycles, but because mod­ern CPU’s are pipelined, it only occu­pies the exe­cu­tion unit dur­ing the first cycle. In the sec­ond cycle, the exe­cu­tion unit is able to begin on a new instruc­tion, while con­tin­u­ing to process the mul instruc­tion. So this ver­sion only requires one “execution-unit cycle”. In other words, we’ll free up other exe­cu­tion units so they can exe­cute other instruc­tions. We’re also free­ing the front-end from hav­ing to decode three instructions.

So we now know that:

  • If we need the result as soon as pos­si­ble, the opti­mized add ver­sion will be more effi­cient because it fin­ishes sooner.
  • But if we need to exe­cute a lot of other instruc­tions as well, the mul ver­sion will be more effi­cient because it uses fewer hard­ware resources on the CPU

What if we have a lot of instruc­tions we want to exe­cute and we need the result soon? Or if we only have these instruc­tions to exe­cute, and we don’t care about when we’ll get the result (per­haps the next oper­a­tion is to add the result to that of an ongo­ing divi­sion, which is very slow, so it won’t mat­ter if we take 2, 3 or 15 cycles to get ready)? Hard to say. Either one may be preferable.

Of course on x86 CPUs we also have to take the vari­able instruc­tion length into account. How many bytes does a mul instruc­tion take? What about three adds? That affects both how much data has to be read from mem­ory and how much space will be taken up in CPU cache, and so that should be taken into account as well.

So what can we learn from this? Mainly that per­for­mance is non­triv­ial. Never assume that you can tell whether some code is “fast” or “slow”. And be espe­cially care­ful with assump­tions about how it can be improved. It is very pos­si­ble that your “opti­miza­tion” will actu­ally run slower.

When­ever you opti­mize code, do as the pros: mea­sure, mea­sure and mea­sure. Mea­sure the speed of the orig­i­nal code. Mea­sure the result of the opti­mized code. Be care­ful with the many ways in which your mea­sure­ment can be inval­i­dated (by the com­piler opti­miz­ing away the code you wanted to test, or by the CPU cache chang­ing the result in your test case from what you’d expect in the real world by caching — or not caching — the data you’re oper­at­ing on).

And when per­form­ing low-level opti­miza­tions, another vital piece of advice is to under­stand the hard­ware. Know which instruc­tions are being exe­cuted, know the cost of instruc­tions on the rel­e­vant hard­ware, and know what other tricks the hard­ware uses (Your CPU is prob­a­bly super­scalar and pipelined, and processes instruc­tions out of order. It prob­a­bly also has a cache of a cer­tain size, with a spe­cific cache line size, and a cer­tain asso­cia­tiv­ity. It has a fixed num­ber of exe­cu­tion units, a known pipeline length and so on. And while we’re at it, the mem­ory sub­sys­tem mat­ters too. How long does it take to access RAM? How can the CPU reorder reads and writes? What is its pol­icy for writes? When are they pushed from cache to RAM? If you want to opti­mize your code on the instruc­tion level, you need to know your CPU. Even the sim­plest code is affected by dozens such fac­tors, any of which might make a difference.


  1. For the sake of this exam­ple, let us assume that sim­ple mul­ti­pli­ca­tion and addi­tion instruc­tions are used. Some CPU’s may have more com­plex instruc­tions that, for exam­ple, can per­form the mul­ti­pli­ca­tion faster if the sec­ond operand is a power of two. And of course we could imple­ment the mul­ti­pli­ca­tion as x << 2 too. 

Share and Enjoy: These icons link to social book­mark­ing sites where read­ers can share and dis­cover new web pages.
  • Digg
  • del.icio.us
  • StumbleUpon
  • Reddit
  • Technorati

Tags: , , , ,

4 Responses to Adventures in Microoptimizations

  1. Ben Karel says:

    All good points! If you haven’t seen it before, “Pro­duc­ing Wrong Data With­out Doing Any­thing Obvi­ously Wrong!” [http://www-plan.cs.colorado.edu/diwan/asplos09.pdf] is a good look at the prob­lems involved with mea­sur­ing per­for­mance at a a more global level. The paper points out that chang­ing “innocu­ous” things like link order and envi­ron­ment size can have a larger per­for­mance impact than the change from mod­er­ate to heavy com­piler optimization!

    There’s a Google Tech Talk called “We have it easy, but do we have it right?” cov­er­ing the results, for any­one inter­ested. I rec­om­mend both the paper and the video!

    Other good videos include Eliz­a­beth Bradley’s “Chaos in Com­puter Dynam­ics” and Michael Hind’s “The Impact of Mul­ti­core Archi­tec­ture on Soft­ware.” Both of those are UWash col­lo­quium videos: [http://norfolk.cs.washington.edu/htbin-post/unrestricted/colloq/search.cgi]

    One inter­est­ing result from the Hind video is that, while –O3 over­all slightly improves per­for­mance at the appli­ca­tion level, things look very dif­fer­ent at the method level. Most meth­ods are not affected, a few are made much faster, and a few are made much slower! This sug­gests that JIT and/or trac­ing com­pil­ers will be much bet­ter posi­tioned for effec­tive opti­miza­tion than tra­di­tional sta­tic compilers.

    I’m just fin­ish­ing up with a com­puter archi­tec­ture course, so the com­plex­i­ties of mod­ern CPUs are not (entirely) lost on me. But at a cer­tain point, I can’t help but won­der how much good the extra knowl­edge brings, pre­cisely because per­for­mance is so context-sensitive. An opti­miza­tion on a Core i7 may be a pes­simiza­tion on an Atom. Like you said, mea­sure mea­sure measure…

  2. jalf says:

    Thanks for the comment! :)

    True, but there are still areas where such micro-optimizations may be use­ful. First, if you know what the CPU actu­ally does to your code, you can make some more gen­eral assump­tions about what is fast and what isn’t. Tak­ing the exam­ple in my post above, the exact latency might vary, but I doubt you’ll find a CPU where inte­ger mul­ti­pli­ca­tion has less latency than addi­tion. So we can still rea­son about what is prefer­able inside a tight loop where the com­bined latency is what’s hold­ing us back. Or know­ing that almost every mod­ern CPU is super­scalar, we can deter­mine that a higher instruc­tion count isn’t nec­es­sar­ily worse for per­for­mance. We can try to split up our depen­den­cies so that subex­pres­sions can be eval­u­ated in parallel.

    But of course you’re right, at a cer­tain point we get down to the really CPU-specific stuff, and that’s prob­a­bly dan­ger­ous to rely on, unless you know the exact hard­ware con­fig­u­ra­tion on which the pro­gram is going to run. (You might be tar­get­ing a Playsta­tion 3 specif­i­cally, for exam­ple, and then you don’t care that your opti­miza­tions wouldn’t work on other CPU’s)

  3. Ben Karel says:

    But I thought the point was not to make assump­tions in the first place? ;-)

    I agree that remem­ber­ing what the com­bi­na­tion of OoO and super­scalar can do is impor­tant, and easy to for­get. Wit­ness Mike Pall and LuaJIT2 for the payoff…

    All things con­sid­ered, cache lay­out is prob­a­bly the most prac­ti­cal performance-related “thing” to keep in mind for most pro­gram­mers. It’s much eas­ier to rea­son about mem­ory access pat­terns than asm depen­dency chains. But of course I wish I had mea­sure­ments to back my intu­ition up.

    My favorite exam­ple of the effect of a (rel­a­tively) obscure hard­ware struc­ture on code is that inline caches make more sense than C++ style vtbls on a proces­sor with­out a BTB, but the BTB helps the vtbl more than the inline cache. So mod­ern proces­sors are, in effect, opti­mized for C++-style vir­tual dispatch!

  4. jalf says:

    You’re right, cache lay­out is gen­er­ally eas­ier to rea­son about, and very impor­tant performance-wise. I did a project at uni­ver­sity a cou­ple of years ago where we saw a 2x speedup sim­ply from chang­ing from col­umn– to row-major tra­ver­sal of a 2D array. It’s def­i­nitely the first thing to check if you need to squeeze bet­ter per­for­mance out of your code. (partly because much of the ASM hack­ery can be done by the com­piler, but it can’t do much about the cache layout)

    I didn’t really intend this post to be a guide to “use­ful” opti­miza­tions though. It’s just intended to high­light some of the quirks and com­plex­i­ties of low-level opti­miza­tion. If any­thing, it should be taken as a warn­ing that try­ing to opti­mize by fid­dling with indi­vid­ual ASM instruc­tions is going to cause a lot of headaches, and you’ll see some unex­pected results in many cases.

Leave a Reply

Name and Email Address are required fields. Your email will not be published or shared with third parties.