<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>jalf.dk &#187; performance</title>
	<atom:link href="http://jalf.dk/blog/tag/performance/feed/" rel="self" type="application/rss+xml" />
	<link>http://jalf.dk/blog</link>
	<description>Musings and thoughts on programming and other geeky stuff</description>
	<lastBuildDate>Sat, 07 Jan 2012 15:42:18 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Adventures in Microoptimizations</title>
		<link>http://jalf.dk/blog/2009/12/adventures-in-microoptimizations/</link>
		<comments>http://jalf.dk/blog/2009/12/adventures-in-microoptimizations/#comments</comments>
		<pubDate>Sun, 20 Dec 2009 07:10:49 +0000</pubDate>
		<dc:creator>jalf</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[assembly]]></category>
		<category><![CDATA[cpu]]></category>
		<category><![CDATA[low-level]]></category>
		<category><![CDATA[optimization]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://jalf.dk/blog/?p=425</guid>
		<description><![CDATA[A friend recently asked me for “the simplest optimization problem I could think of”. This led to a fun discussion of low-level optimization and how the CPU executes your code. And so I decided to share it here. Let’s make it clear though, that the following will have very little practical use. We’re not just [...]]]></description>
			<content:encoded><![CDATA[<p>A friend recently asked me for “the simplest optimization problem I could think of”. This led to a fun discussion of low-level optimization and how the CPU executes your code. And so I decided to share it here.<span id="more-425"></span></p>

<p>Let’s make it clear though, that the following will have very little practical use. We’re not just into “it doesn’t make a measurable difference” territory, but also deep into “the compiler will do this for you”. So please, don’t try to apply these “optimizations” to your real-world code to save a clock cycle.</p>

<p>This is merely intended as a thought experiment, illustrating some of the factors that make performance so difficult to predict. And now, with that disclaimer in place, let’s get on with it:</p>

<h1>The problem</h1>

<p>The “problem” I came up with was the evaluation of <code>x+x+x+x</code>. This was the simplest snippet of code I could think of for which optimization is possible. For the sake of this discussion, let us assume that <code>x</code> is an integer.</p>

<p>A naive compiler will evaluate this code as <code>((x+x)+x)+x</code>. In other words, it will evaluate one addition, feed that result to the second addition, and then finally feed the result of that to the third addition.</p>

<h1>Optimization #1</h1>

<p>The optimization I suggested was to evaluate it as <code>(x+x)+(x+x)</code> instead. And why is this faster?
A modern CPU is superscalar — that is, it is able to execute multiple instructions every clock cycle. Depending on the CPU model, it can probably execute three or four instructions from the same thread every cycle.</p>

<p>So where the original version would take three times the duration of an <code>add</code> instruction to execute, my optimization can be done in two times the duration: Both the initial subexpressions can be evaluated <em>in parallel</em>. And so, after only the duration of <em>one</em> <code>add</code> instruction, we’ve got the result of two of the additions, and can perform the third and final one. So in this very simple case, we actually reduced the run time by 33%. Not bad, eh?</p>
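
<p>As a minimal sketch, the two evaluation orders can be written out explicitly in C++. The function names are made up for illustration, and <code>std::uint32_t</code> is an assumption chosen so that overflow wraps instead of being undefined. An optimizing compiler will most likely turn both functions into the same machine code, so this only shows the shape of the dependency chains:</p>

```cpp
#include <cassert>
#include <cstdint>

// ((x+x)+x)+x: every addition waits for the previous one,
// so the dependency chain is three additions long.
std::uint32_t sum_chained(std::uint32_t x) {
    std::uint32_t a = x + x;  // step 1
    std::uint32_t b = a + x;  // step 2, needs a
    return b + x;             // step 3, needs b
}

// (x+x)+(x+x): the two partial sums do not depend on each other,
// so a superscalar CPU may compute them in the same cycle.
std::uint32_t sum_balanced(std::uint32_t x) {
    std::uint32_t lo = x + x;  // step 1
    std::uint32_t hi = x + x;  // also step 1, independent of lo
    return lo + hi;            // step 2, needs lo and hi
}

// Both orders must agree for every input (unsigned arithmetic wraps).
void check_equivalent(std::uint32_t x) {
    assert(sum_chained(x) == sum_balanced(x));
}
```

<p>Calling <code>check_equivalent</code> on a handful of values is a cheap way to confirm that regrouping the additions did not change the result.</p>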

<h1>Optimization #2</h1>

<p>My friend then asked if <code>x*4</code> would be an optimization as well. And now it gets a bit more interesting.</p>

<p>First, of course, <code>x*4</code> is just a single multiplication. Is that faster than three additions? Is it faster than two additions (which is the time it’d take for my “optimized” version to run)?</p>

<p>That depends on the speed of a multiplication instruction. On common CPUs, a moment’s research tells us that <code>add</code> has a latency of 1 cycle, and <code>mul</code> has a latency of 3 cycles<sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup>, so the multiplication takes as long as the original unoptimized version.</p>

<h1>Evaluation</h1>

<p>So what does this mean? At a glance, optimization #1 is certainly faster than #2: #1 yields a result after two clock cycles, where #2 takes a whopping <em>three</em> cycles.</p>

<p>But there are other factors at play. Sometimes the multiplication version may be more efficient. The CPU has a limited number of execution units. It can also only decode a limited number of instructions at a time.</p>

<p>The version using addition requires three instructions to be decoded, and uses two execution units during the first cycle, and one unit in the second. All in all, we’re occupying three “execution-unit cycles”. The version using multiplication does take three cycles, but because modern CPUs are pipelined, it only occupies the execution unit during the first cycle. In the second cycle, the execution unit is able to begin on a new instruction, while continuing to process the <code>mul</code> instruction. So this version only requires one “execution-unit cycle”. In other words, we’ll free up other execution units so they can execute other instructions. We’re also freeing the front-end from having to decode three instructions.</p>

<p>So we now know that:</p>

<ul>
<li>If we need the result as soon as possible, the optimized <code>add</code> version will be more efficient because it finishes sooner.</li>
<li>But if we need to execute a lot of other instructions as well, the <code>mul</code> version will be more efficient because it uses fewer hardware resources on the CPU.</li>
</ul>
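
<p>The trade-off in the list above can be tabulated with a small cost model. This is only a sketch: the latencies are the 1-cycle <code>add</code> and 3-cycle <code>mul</code> figures quoted earlier, and the <code>Cost</code> struct and function names are hypothetical, not measurements of any real CPU:</p>

```cpp
#include <cassert>

// Rough cost of one way of computing 4*x, in the terms used in the text.
struct Cost {
    int latency;       // cycles until the result is available
    int unit_cycles;   // "execution-unit cycles" occupied
    int instructions;  // instructions the front-end must decode
};

constexpr int kAddLatency = 1;  // assumed, as quoted in the text
constexpr int kMulLatency = 3;  // assumed, as quoted in the text

// ((x+x)+x)+x: three dependent adds.
Cost chained_adds() { return {3 * kAddLatency, 3, 3}; }

// (x+x)+(x+x): two adds overlap in the first cycle, one add follows.
Cost balanced_adds() { return {2 * kAddLatency, 3, 3}; }

// x*4: one pipelined multiply that blocks its unit for a single cycle.
Cost single_mul() { return {kMulLatency, 1, 1}; }

// Sanity checks mirroring the numbers in the text.
void check_model() {
    assert(balanced_adds().latency < single_mul().latency);
    assert(single_mul().unit_cycles < balanced_adds().unit_cycles);
}
```

<p>Under these assumptions the balanced additions win on latency (2 cycles against 3), while the multiplication wins on occupied execution units and decoded instructions (1 against 3).</p>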

<p>What if we have a lot of instructions we want to execute <em>and</em> we need the result soon? Or if we only have these instructions to execute, and we don’t care about when we’ll get the result (perhaps the next operation is to add the result to that of an ongoing division, which is <em>very</em> slow, so it won’t matter if we take 2, 3 or 15 cycles to get ready)? Hard to say. Either one may be preferable.</p>

<p>Of course on x86 CPUs we also have to take the variable instruction length into account. How many bytes does a <code>mul</code> instruction take? What about three <code>add</code>s? That affects both how much data has to be read from memory and how much space will be taken up in CPU cache, and so that should be taken into account as well.</p>

<p>So what can we learn from this? Mainly that performance is nontrivial. Never assume that you can tell whether some code is “fast” or “slow”. And be especially careful with assumptions about how it can be improved. It is very possible that your “optimization” will actually run slower.</p>

<p>Whenever you optimize code, do as the <a href="http://blogs.msdn.com/ricom/archive/2003/12/02/40779.aspx">pros</a>: <a href="http://blogs.msdn.com/ricom/archive/2007/06/13/partly-sunny-chance-of-showers-bring-an-umbrella.aspx"><em>measure, measure and measure</em></a>. Measure the speed of the original code. Measure the result of the optimized code. Be careful with the many ways in which your measurement can be invalidated (by the compiler optimizing away the code you wanted to test, or by the CPU cache changing the result in your test case from what you’d expect in the real world by caching — or not caching — the data you’re operating on).</p>
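
<p>A minimal timing harness might look like the sketch below. The names <code>time_ns</code>, <code>add_version</code> and <code>mul_version</code> are made up for illustration. The <code>volatile</code> sink is one common way to stop the compiler from discarding the loop, but the compiler may still emit identical code for both versions, and a single run like this is far too noisy to draw conclusions from:</p>

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Calls f(i) for i in [0, iterations) and returns the elapsed time in
// nanoseconds. The running sum goes through a volatile variable so the
// loop cannot be optimized away; it is also handed back to the caller.
template <typename F>
std::int64_t time_ns(F f, std::uint32_t iterations, std::uint32_t& result_out) {
    assert(iterations > 0);
    volatile std::uint32_t sink = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::uint32_t i = 0; i < iterations; ++i) {
        sink = sink + f(i);
    }
    const auto stop = std::chrono::steady_clock::now();
    result_out = sink;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// The two candidates from the text.
inline std::uint32_t add_version(std::uint32_t x) { return (x + x) + (x + x); }
inline std::uint32_t mul_version(std::uint32_t x) { return x * 4u; }
```

<p>Comparing the two <code>result_out</code> values before looking at the timings guards against the most embarrassing mistake of all: benchmarking two functions that do not compute the same thing.</p>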

<p>And when performing low-level optimizations, another vital piece of advice is to <em>understand the hardware</em>. Know which instructions are being executed, know the cost of instructions on the relevant hardware, and know what other tricks the hardware uses. Your CPU is probably superscalar and pipelined, and processes instructions out of order. It probably also has a cache of a certain size, with a specific cache line size, and a certain associativity. It has a fixed number of execution units, a known pipeline length and so on. And while we’re at it, the memory subsystem matters too. How long does it take to access RAM? How can the CPU reorder reads and writes? What is its policy for writes? When are they pushed from cache to RAM? If you want to optimize your code on the instruction level, you <em>need</em> to know your CPU. Even the simplest code is affected by dozens of such factors, any of which might make a difference.</p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:1">
<p>For the sake of this example, let us assume that simple multiplication and addition instructions are used. Some CPUs may have more complex instructions that, for example, can perform the multiplication faster if the second operand is a power of two. And of course we could implement the multiplication as <code>x &lt;&lt; 2</code> too. <a href="#fnref:1" rev="footnote">↩</a></p>
</li>

</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://jalf.dk/blog/2009/12/adventures-in-microoptimizations/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Houston, we have a (performance) problem</title>
		<link>http://jalf.dk/blog/2009/12/houston-we-have-a-performance-problem/</link>
		<comments>http://jalf.dk/blog/2009/12/houston-we-have-a-performance-problem/#comments</comments>
		<pubDate>Tue, 15 Dec 2009 13:49:43 +0000</pubDate>
		<dc:creator>jalf</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[c++]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[stm]]></category>
		<category><![CDATA[thesis]]></category>
		<category><![CDATA[transactional-memory]]></category>

		<guid isPermaLink="false">http://jalf.dk/blog/?p=403</guid>
		<description><![CDATA[Ouch. These last few days, I’ve been fixing a few lingering bugs in my STM system, and last night, I finally nailed them. Specifically, it is now possible to open variables within a transaction as read-only. An obvious optimization, right? At least that’s the idea. Less work is required by the STM system if we [...]]]></description>
			<content:encoded><![CDATA[<p>Ouch. These last few days, I’ve been fixing a few lingering bugs in my STM system, and last night, I finally nailed them. Specifically, it is now possible to open variables within a transaction as <em>read-only</em>. An obvious optimization, right? At least that’s the idea. Less work is required by the STM system if we can trust that the variable isn’t modified by this transaction.
<span id="more-403"></span></p>

<p>Well, my test case for this feature now takes <em>ages</em> to run. As I mentioned previously, a simple transaction modifying two integer variables under heavy contention can pull off almost two million transactions per second on my laptop.</p>

<p>My new test, in which each thread takes four variables and alternates between modifying two of them and reading the other two, runs perhaps ten thousand (!) times slower.</p>

<p>Of course I have several leads on how to fix this. The problem is largely all the performance-related “extras” I’ve been leaving out. For example, if a transaction fails to acquire a variable it needs, it simply aborts and immediately retries. In many cases, a better approach would be to block the thread, waiting for that variable to actually become available.</p>

<p>There are several other cases where I have a similar problem: I have to choose between delaying the thread for a moment with <code>sleep()</code> before attempting to continue, blocking it until some condition is true, or aborting the transaction entirely and starting over from scratch. At the moment, I generally just pick the easiest solutions (typically abort, and <em>occasionally</em> call <code>sleep()</code> a few times before we resort to that). Again, implementing some actual meaningful policies here would make a big difference. And tweaking these policies should help still more.</p>
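
<p>One possible shape for such a policy is a bounded retry loop that gives up its time slice between attempts and only then reports that the transaction should abort. This is a sketch, not the code in my STM system: <code>acquire_with_backoff</code>, <code>Outcome</code> and the <code>try_acquire</code> callback are hypothetical names, and using <code>std::this_thread::yield()</code> instead of a timed sleep is an assumption about what a cheaper delay could look like:</p>

```cpp
#include <cassert>
#include <functional>
#include <thread>

enum class Outcome { Acquired, Abort };

// Tries to acquire a variable up to max_attempts times.
// try_acquire is a hypothetical callback that returns true on success.
// Between failed attempts the thread yields instead of sleeping, so it
// becomes runnable again as soon as the scheduler allows.
Outcome acquire_with_backoff(const std::function<bool()>& try_acquire,
                             int max_attempts) {
    assert(max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (try_acquire()) {
            return Outcome::Acquired;
        }
        std::this_thread::yield();
    }
    // Caller is expected to roll the transaction back and start over.
    return Outcome::Abort;
}
```

<p>The number of attempts is exactly the kind of knob that needs tweaking: too few and the transaction aborts needlessly, too many and the thread burns time it could have spent redoing the work.</p>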

<p>Another problem is that currently, I do not enforce a consistent global order when acquiring objects during a commit. This means I risk livelocks, again causing excessive rollbacks when multiple threads are competing over access to the same variables.</p>
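
<p>A common way to get a consistent global order is to sort the things to be acquired by address before locking them. The sketch below uses plain <code>std::mutex</code> objects as stand-ins for STM variables, which is an assumption for illustration only. The function names are hypothetical too:</p>

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <mutex>
#include <vector>

// Locks every mutex in ascending address order. If all threads use this
// function, no two threads can hold locks while waiting for each other
// in a cycle. Duplicates are removed so no mutex is locked twice.
void lock_in_global_order(std::vector<std::mutex*>& locks) {
    std::sort(locks.begin(), locks.end(), std::less<std::mutex*>());
    locks.erase(std::unique(locks.begin(), locks.end()), locks.end());
    for (std::mutex* m : locks) {
        assert(m != nullptr);
        m->lock();
    }
}

// Releases the locks in reverse order of acquisition.
void unlock_all(const std::vector<std::mutex*>& locks) {
    for (auto it = locks.rbegin(); it != locks.rend(); ++it) {
        (*it)->unlock();
    }
}
```

<p>Sorting on every commit is not free, so whether this pays off depends on how many variables a typical transaction touches and how often threads actually collide.</p>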

<p>So I’m still optimistic. It should be possible to get performance back on track. But man, it’s depressing watching performance plummet like this.</p>

<p><strong>Edit</strong><br />
And an update. After poking around a bit, it turned out that most of the time was being spent sleeping. When a transaction attempts to commit, if it cannot acquire all the variables it needs, it retries a few times with a short delay (a couple of milliseconds) in between. If it doesn’t succeed after a few tries, it rolls back the entire transaction and starts over.</p>

<p>It turned out that these few, short <code>sleep()</code> calls brought CPU utilization down to something like 0.01%, and totally destroyed performance. Simply turning the <code>sleep()</code> call into a <em>no-op</em> brought me back to something more or less reasonable. I still need to improve on the above shortcomings, but now at least I can run my tests in less than an hour.</p>
]]></content:encoded>
			<wfw:commentRss>http://jalf.dk/blog/2009/12/houston-we-have-a-performance-problem/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

