Houston, we have a (performance) problem

Ouch. These last few days, I’ve been fixing a few lingering bugs in my STM system, and last night, I finally nailed them. Specifically, it is now possible to open variables within a transaction as read-only. An obvious optimization, right? At least that’s the idea. Less work is required by the STM system if we can trust that the variable isn’t modified by this transaction.

Well, my test case for this feature now takes ages to run. As I mentioned previously, a simple transaction modifying two integer variables under heavy contention can pull off almost two million transactions per second on my laptop.

My new test, in which each thread takes four variables and alternates between modifying two of them and reading the other two, runs perhaps ten thousand (!) times slower.

Of course, I have several leads on how to fix this. The problem is largely all the performance-related “extras” I’ve been leaving out. For example, if a transaction fails to acquire a variable it needs, it simply aborts and immediately retries. In many cases, a better approach would be to block the thread until that variable actually becomes available.
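
To make the difference concrete, here is a rough sketch in Python of a variable that supports both styles. It is purely illustrative: `OwnedVar`, `try_acquire`, `acquire_blocking` and `release` are made-up names, not the actual STM code, and it assumes a variable is owned by at most one transaction at a time.

```python
import threading


class OwnedVar:
    """Hypothetical transactional variable owned by one transaction at a time."""

    def __init__(self, value):
        self.value = value
        self._owner = None                  # id of the owning transaction, or None
        self._cond = threading.Condition()  # signalled whenever the owner releases

    def try_acquire(self, tx_id):
        """Non-blocking: the caller aborts and retries if this returns False."""
        with self._cond:
            if self._owner is None:
                self._owner = tx_id
                return True
            return False

    def acquire_blocking(self, tx_id, timeout=None):
        """Blocking: wait until the variable is free, or until the timeout expires."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._owner is None, timeout):
                return False                # timed out; the caller may abort now
            self._owner = tx_id
            return True

    def release(self, tx_id):
        """Give up ownership and wake any transactions waiting for this variable."""
        with self._cond:
            if self._owner == tx_id:
                self._owner = None
                self._cond.notify_all()
```

With `try_acquire`, a losing transaction burns CPU re-running itself; with `acquire_blocking`, it sleeps until `release` wakes it. The timeout is there so a waiter can still fall back to aborting.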

There are several other cases where I have a similar problem: I have to choose between delaying the thread for a moment with sleep() before attempting to continue, blocking it until some condition is true, or aborting the transaction entirely and starting over from scratch. At the moment, I generally just pick the easiest solution (typically abort, and occasionally call sleep() a few times before resorting to that). Again, implementing some actual meaningful policies here would make a big difference. And tweaking these policies should help still more.
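
One possible shape for such a policy, sketched in Python: retry immediately a few times, then sleep with a growing delay, then give up and abort. The names (`Action`, `on_conflict`) and every threshold here are assumptions chosen for illustration, not values taken from the real system.

```python
from enum import Enum


class Action(Enum):
    RETRY_NOW = "retry immediately"
    SLEEP = "sleep, then retry"
    ABORT = "abort and restart the transaction"


def on_conflict(attempt, spin_limit=3, sleep_limit=6, base_delay=0.0005):
    """Pick a reaction to the attempt-th failed acquisition (0-based).

    Returns (action, delay_in_seconds). All thresholds are illustrative.
    """
    if attempt < spin_limit:
        return Action.RETRY_NOW, 0.0
    if attempt < sleep_limit:
        # Exponential backoff: the delay doubles with each sleeping attempt.
        return Action.SLEEP, base_delay * 2 ** (attempt - spin_limit)
    return Action.ABORT, 0.0
```

Keeping the decision in one small function makes the "tweaking" part cheap: the thresholds can be changed, or the whole function swapped out, without touching the commit logic.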

Another problem is that currently, I do not enforce a consistent global order when acquiring objects during a commit. This means I risk livelocks, again causing excessive rollbacks when multiple threads are competing over access to the same variables.
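
A minimal sketch of what a global order could look like, in Python. It assumes each variable carries a unique id and its own lock; `LockedVar`, `acquire_in_order` and `release_all` are hypothetical names. Because every thread locks in ascending id order, two commits over the same variables can never each hold something the other needs, so it is safe to block here instead of aborting.

```python
import itertools
import threading

_next_id = itertools.count()  # source of unique, totally ordered ids


class LockedVar:
    """Hypothetical transactional variable with a unique id and its own lock."""

    def __init__(self, value):
        self.uid = next(_next_id)
        self.value = value
        self.lock = threading.Lock()


def acquire_in_order(variables):
    """Lock every distinct variable in ascending uid order; return that order."""
    ordered = sorted(set(variables), key=lambda v: v.uid)
    for var in ordered:
        var.lock.acquire()
    return ordered


def release_all(ordered):
    """Release the locks taken by acquire_in_order, in reverse order."""
    for var in reversed(ordered):
        var.lock.release()
```

Whatever order a transaction happened to touch its variables in, the commit always locks them in the same sequence. Duplicates are dropped first so a variable is never locked twice by the same commit.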

So I’m still optimistic. It should be possible to get performance back on track. But man, it’s depressing watching performance plummet like this.

Edit
And an update. After poking around a bit, it turned out that most of the time was being spent sleeping. When a transaction attempts to commit, if it cannot acquire all the variables it needs, it retries a few times with a short delay (a couple of milliseconds) in between. If it doesn’t succeed after a few tries, it rolls back the entire transaction and starts over.
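
In rough Python, the commit loop described above looks something like this. It is a sketch under assumptions: the real system is not built on `threading.Lock`, and `try_acquire_all`, `commit`, `max_attempts` and `delay` are illustrative names and values.

```python
import threading
import time


def try_acquire_all(locks):
    """Take every lock without blocking, or take none of them."""
    held = []
    for lock in locks:
        if lock.acquire(blocking=False):
            held.append(lock)
        else:
            for h in held:          # back out so other threads can proceed
                h.release()
            return False
    return True


def commit(locks, apply_writes, max_attempts=3, delay=0.002):
    """Return True if committed, False if the caller should roll back."""
    for attempt in range(max_attempts):
        if try_acquire_all(locks):
            try:
                apply_writes()
            finally:
                for lock in locks:
                    lock.release()
            return True
        if attempt + 1 < max_attempts:
            time.sleep(delay)       # the "short delay" between attempts
    return False
```

Note that every failed attempt except the last pays the full `delay`, whether or not the contended variable was freed a microsecond later.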

It turned out that these few, short sleep() calls brought CPU utilization down to something like 0.01%, and totally destroyed performance. Simply turning the sleep() call into a no-op brought me back to something more or less reasonable. I still need to improve on the above shortcomings, but now at least I can run my tests in less than an hour.
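
A back-of-the-envelope calculation shows why such a small sleep hurts so much. The figures are assumptions for illustration: a 2 ms sleep (taking "a couple of milliseconds" literally), one sleep per commit on average, and the two-million-per-second baseline treated as the rate to compare against.

```python
def max_commits_per_second(sleep_seconds, sleeps_per_commit=1.0):
    """Upper bound on one thread's commit rate if each commit sleeps this long."""
    return 1.0 / (sleep_seconds * sleeps_per_commit)


baseline = 2_000_000                     # transactions/s, the figure quoted earlier
ceiling = max_commits_per_second(0.002)  # assumed: one 2 ms sleep per commit
slowdown = baseline / ceiling

print(f"at most {ceiling:.0f} commits/s per thread, about {slowdown:.0f}x slower")
```

Under those assumptions a thread tops out at roughly 500 commits per second, a slowdown in the thousands before counting any rollbacks. A sleep can also last longer than requested, depending on the operating system's timer granularity, which would only make things worse.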
