<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Adventures in Microoptimizations</title>
	<atom:link href="http://jalf.dk/blog/2009/12/adventures-in-microoptimizations/feed/" rel="self" type="application/rss+xml" />
	<link>http://jalf.dk/blog/2009/12/adventures-in-microoptimizations/</link>
	<description>Musings and thoughts on programming and other geeky stuff</description>
	<lastBuildDate>Sat, 07 Jan 2012 20:30:11 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: jalf</title>
		<link>http://jalf.dk/blog/2009/12/adventures-in-microoptimizations/comment-page-1/#comment-331</link>
		<dc:creator>jalf</dc:creator>
		<pubDate>Sun, 27 Dec 2009 02:52:35 +0000</pubDate>
		<guid isPermaLink="false">http://jalf.dk/blog/?p=425#comment-331</guid>
		<description>&lt;p&gt;You&#039;re right, cache layout is generally easier to reason about, and &lt;em&gt;very&lt;/em&gt; important performance-wise. I did a project at university a couple of years ago where we saw a 2x speedup simply from changing from column- to row-major traversal of a 2D array. It&#039;s definitely the first thing to check if you need to squeeze better performance out of your code. (partly because much of the ASM hackery can be done by the compiler, but it can&#039;t do much about the cache layout)&lt;/p&gt;

&lt;p&gt;I didn&#039;t really intend this post to be a guide to &quot;useful&quot; optimizations though. It&#039;s just intended to highlight some of the quirks and complexities of low-level optimization. If anything, it should be taken as a warning that trying to optimize by fiddling with individual ASM instructions is going to cause a lot of headaches, and you&#039;ll see some unexpected results in many cases.&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>You’re right, cache layout is generally easier to reason about, and <em>very</em> important performance-wise. I did a project at university a couple of years ago where we saw a 2x speedup simply from changing from column- to row-major traversal of a 2D array. It’s definitely the first thing to check if you need to squeeze better performance out of your code. (partly because much of the ASM hackery can be done by the compiler, but it can’t do much about the cache layout)</p>

<p>I didn’t really intend this post to be a guide to “useful” optimizations though. It’s just intended to highlight some of the quirks and complexities of low-level optimization. If anything, it should be taken as a warning that trying to optimize by fiddling with individual ASM instructions is going to cause a lot of headaches, and you’ll see some unexpected results in many cases.</p>]]></content:encoded>
	</item>
	<item>
		<title>By: Ben Karel</title>
		<link>http://jalf.dk/blog/2009/12/adventures-in-microoptimizations/comment-page-1/#comment-313</link>
		<dc:creator>Ben Karel</dc:creator>
		<pubDate>Sun, 20 Dec 2009 23:28:36 +0000</pubDate>
		<guid isPermaLink="false">http://jalf.dk/blog/?p=425#comment-313</guid>
		<description>&lt;p&gt;But I thought the point was not to make assumptions in the first place? ;-)&lt;/p&gt;

&lt;p&gt;I agree that remembering what the combination of OoO and superscalar can do is important, and easy to forget. Witness Mike Pall and LuaJIT2 for the payoff...&lt;/p&gt;

&lt;p&gt;All things considered, cache layout is probably the most practical performance-related &quot;thing&quot; to keep in mind for most programmers. It&#039;s much easier to reason about memory access patterns than asm dependency chains. But of course I wish I had measurements to back my intuition up.&lt;/p&gt;

&lt;p&gt;My favorite example of the effect of a (relatively) obscure hardware structure on code is that inline caches make more sense than C++ style vtbls on a processor without a BTB, but the BTB helps the vtbl more than the inline cache. So modern processors are, in effect, optimized for C++-style virtual dispatch!&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>But I thought the point was not to make assumptions in the first place? ;-)</p>

<p>I agree that remembering what the combination of OoO and superscalar can do is important, and easy to forget. Witness Mike Pall and LuaJIT2 for the payoff…</p>

<p>All things considered, cache layout is probably the most practical performance-related “thing” to keep in mind for most programmers. It’s much easier to reason about memory access patterns than asm dependency chains. But of course I wish I had measurements to back my intuition up.</p>

<p>My favorite example of the effect of a (relatively) obscure hardware structure on code is that inline caches make more sense than C++ style vtbls on a processor without a BTB, but the BTB helps the vtbl more than the inline cache. So modern processors are, in effect, optimized for C++-style virtual dispatch!</p>]]></content:encoded>
	</item>
	<item>
		<title>By: jalf</title>
		<link>http://jalf.dk/blog/2009/12/adventures-in-microoptimizations/comment-page-1/#comment-312</link>
		<dc:creator>jalf</dc:creator>
		<pubDate>Sun, 20 Dec 2009 21:51:12 +0000</pubDate>
		<guid isPermaLink="false">http://jalf.dk/blog/?p=425#comment-312</guid>
		<description>&lt;p&gt;Thanks for the comment! :)&lt;/p&gt;

&lt;p&gt;True, but there are still areas where such micro-optimizations may be useful. First, if you know what the CPU actually does to your code, you can make some more general assumptions about what is fast and what isn&#039;t. Taking the example in my post above, the exact latency might vary, but I doubt you&#039;ll find a CPU where integer multiplication has less latency than addition. So we can still reason about what is preferable inside a tight loop where the combined latency is what&#039;s holding us back. Or knowing that almost every modern CPU is superscalar, we can determine that a higher instruction count isn&#039;t necessarily worse for performance. We can try to split up our dependencies so that subexpressions can be evaluated in parallel.&lt;/p&gt;

&lt;p&gt;But of course you&#039;re right, at a certain point we get down to the really CPU-specific stuff, and that&#039;s probably dangerous to rely on, unless you know the exact hardware configuration on which the program is going to run. (You might be targeting a PlayStation 3 specifically, for example, and then you don&#039;t care that your optimizations wouldn&#039;t work on other CPUs)&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>Thanks for the comment! :)</p>

<p>True, but there are still areas where such micro-optimizations may be useful. First, if you know what the CPU actually does to your code, you can make some more general assumptions about what is fast and what isn’t. Taking the example in my post above, the exact latency might vary, but I doubt you’ll find a CPU where integer multiplication has less latency than addition. So we can still reason about what is preferable inside a tight loop where the combined latency is what’s holding us back. Or knowing that almost every modern CPU is superscalar, we can determine that a higher instruction count isn’t necessarily worse for performance. We can try to split up our dependencies so that subexpressions can be evaluated in parallel.</p>

<p>But of course you’re right, at a certain point we get down to the really CPU-specific stuff, and that’s probably dangerous to rely on, unless you know the exact hardware configuration on which the program is going to run. (You might be targeting a PlayStation 3 specifically, for example, and then you don’t care that your optimizations wouldn’t work on other CPUs)</p>]]></content:encoded>
	</item>
	<item>
		<title>By: Ben Karel</title>
		<link>http://jalf.dk/blog/2009/12/adventures-in-microoptimizations/comment-page-1/#comment-310</link>
		<dc:creator>Ben Karel</dc:creator>
		<pubDate>Sun, 20 Dec 2009 18:40:03 +0000</pubDate>
		<guid isPermaLink="false">http://jalf.dk/blog/?p=425#comment-310</guid>
		<description>&lt;p&gt;All good points! If you haven&#039;t seen it before, &quot;Producing Wrong Data Without Doing Anything Obviously Wrong!&quot; [http://www-plan.cs.colorado.edu/diwan/asplos09.pdf] is a good look at the problems involved with measuring performance at a more global level. The paper points out that changing &quot;innocuous&quot; things like link order and environment size can have a larger performance impact than the change from moderate to heavy compiler optimization!&lt;/p&gt;

&lt;p&gt;There&#039;s a Google Tech Talk called &quot;We have it easy, but do we have it right?&quot; covering the results, for anyone interested. I recommend both the paper and the video!&lt;/p&gt;

&lt;p&gt;Other good videos include Elizabeth Bradley&#039;s &quot;Chaos in Computer Dynamics&quot; and Michael Hind&#039;s &quot;The Impact of Multicore Architecture on Software.&quot; Both of those are UWash colloquium videos: [http://norfolk.cs.washington.edu/htbin-post/unrestricted/colloq/search.cgi]&lt;/p&gt;

&lt;p&gt;One interesting result from the Hind video is that, while -O3 overall slightly improves performance at the application level, things look very different at the method level. Most methods are not affected, a few are made much faster, and a few are made much slower! This suggests that JIT and/or tracing compilers will be much better positioned for effective optimization than traditional static compilers.&lt;/p&gt;

&lt;p&gt;I&#039;m just finishing up with a computer architecture course, so the complexities of modern CPUs are not (entirely) lost on me. But at a certain point, I can&#039;t help but wonder how much good the extra knowledge brings, precisely because performance is so context-sensitive. An optimization on a Core i7 may be a pessimization on an Atom. Like you said, measure measure measure...&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>All good points! If you haven’t seen it before, “Producing Wrong Data Without Doing Anything Obviously Wrong!” [http://www-plan.cs.colorado.edu/diwan/asplos09.pdf] is a good look at the problems involved with measuring performance at a more global level. The paper points out that changing “innocuous” things like link order and environment size can have a larger performance impact than the change from moderate to heavy compiler optimization!</p>

<p>There’s a Google Tech Talk called “We have it easy, but do we have it right?” covering the results, for anyone interested. I recommend both the paper and the video!</p>

<p>Other good videos include Elizabeth Bradley’s “Chaos in Computer Dynamics” and Michael Hind’s “The Impact of Multicore Architecture on Software.” Both of those are UWash colloquium videos: [http://norfolk.cs.washington.edu/htbin-post/unrestricted/colloq/search.cgi]</p>

<p>One interesting result from the Hind video is that, while -O3 overall slightly improves performance at the application level, things look very different at the method level. Most methods are not affected, a few are made much faster, and a few are made much slower! This suggests that JIT and/or tracing compilers will be much better positioned for effective optimization than traditional static compilers.</p>

<p>I’m just finishing up with a computer architecture course, so the complexities of modern CPUs are not (entirely) lost on me. But at a certain point, I can’t help but wonder how much good the extra knowledge brings, precisely because performance is so context-sensitive. An optimization on a Core i7 may be a pessimization on an Atom. Like you said, measure measure measure…</p>]]></content:encoded>
	</item>
</channel>
</rss>

