Friday, May 15, 2009

On Performance and Scaling

Finally, here's what you've been waiting for...

Performance is about getting down to the bone with a butcher's knife for that last bit of flesh (it takes me great effort to say that as a vegetarian...so savor and ruminate on this key point). It is about getting the last drop of milk from that udder. Think instruction counts. Think about the shortest path between two points. Think about latency and real-time systems.

Scalability, interestingly, lands you on a different planet. If you had 256 cores underneath, could you get 25,600% CPU utilization? If this is of principal interest to you, then don't worry too much about performance as defined above. Use coarse-grained locks, look at the wait time of locks (how long you have to wait to get inside mutually exclusive code), and worry about pruning your critical sections (mutex-protected code) later. Scalability is mainly about throughput. Performance is, roughly, about latency. Eventually we will combine both and have some real fun.
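To make that concrete, here is a minimal sketch of the "coarse lock first" mindset in C++ (all names and numbers are mine, invented for illustration): one big mutex guards everything, and we time how long the threads wait to get inside the critical section.

#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

std::mutex big_lock;                  // one coarse-grained lock for all shared state
long long shared_counter = 0;         // stand-in for the real shared structure
std::chrono::nanoseconds total_wait{0};
std::mutex stats_lock;                // protects total_wait

void worker(int iterations) {
    std::chrono::nanoseconds my_wait{0};
    for (int i = 0; i < iterations; ++i) {
        auto start = std::chrono::steady_clock::now();
        std::lock_guard<std::mutex> guard(big_lock);          // wait starts here...
        my_wait += std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now() - start);        // ...and ends here
        ++shared_counter;                                     // the critical section
    }
    std::lock_guard<std::mutex> g(stats_lock);
    total_wait += my_wait;
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t)
        threads.emplace_back(worker, 100000);
    for (auto& th : threads) th.join();
    std::printf("counter = %lld, total lock wait = %lld us\n",
                shared_counter,
                static_cast<long long>(total_wait.count() / 1000));
}

If the measured wait time stays small, the coarse lock is good enough and you have no business prying it apart yet.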

To get the jargon right - we will use the term performance for a single thread of execution, and scalability for the "parallel performance" of multiple threads of execution.

Designing MT code from scratch can be a bitch if you haven't trained your mind to do so. In my company, PixBlitz Studios, we have our vision engineers (not hard-core programmers, but good scientists and mathematicians) solve the algorithm correctly using simple single-threaded logic. Once the algorithm has been nailed, it is pseudo-coded and then parallelized. Sure, there are trade-offs in taking this route, the big positive being that the vision engineers can produce some kick-ass algorithms without worrying about the systems side of life. Not all algorithms are naturally parallelizable. Knowledge of this fact can greatly reduce risk and agony. There are cases, however, where such a non-parallelizable problem may start looking interesting (say, to someone like my colleagues Arvind Nithrakayshap or Tudor Bosman from my past lives, or even me for that matter), especially when we know that solving it has much value. Exceptions apart, it is important to solve the problem for correctness before multi-threading it. Before you kill me...
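For a flavor of that workflow, here is a hypothetical toy example in C++ (the problem and names are made up, not a real PixBlitz algorithm): step 1 is the kind of plain single-threaded loop a vision engineer would write; step 2 is the parallel version derived from it.

#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

// Step 1: the single-threaded algorithm, solved for correctness first.
double mean_brightness(const std::vector<float>& pixels) {
    return std::accumulate(pixels.begin(), pixels.end(), 0.0) / pixels.size();
}

// Step 2: the parallel version, derived from the logic above: split the
// pixels across threads, sum each chunk independently, combine at the end.
double mean_brightness_mt(const std::vector<float>& pixels, int nthreads) {
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> threads;
    size_t chunk = pixels.size() / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        size_t begin = t * chunk;
        size_t end = (t == nthreads - 1) ? pixels.size() : begin + chunk;
        threads.emplace_back([&partial, &pixels, t, begin, end] {
            partial[t] = std::accumulate(pixels.begin() + begin,
                                         pixels.begin() + end, 0.0);
        });
    }
    for (auto& th : threads) th.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0) / pixels.size();
}

int main() {
    std::vector<float> image(1 << 20, 0.5f);   // a fake 1-megapixel image
    std::printf("single: %f  parallel: %f\n",
                mean_brightness(image), mean_brightness_mt(image, 4));
}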

Even in the general case of a solvable MT problem, just because you have nailed the single-threaded algorithm does not mean you can readily multi-thread it. A good approach is to start from a clean sheet of paper and write the parallel algorithm using the single-threaded algorithm as pseudo-code. A sound understanding of the algorithm and the underlying problem domain quickly gives a good sense of what parts can be written as concurrent code and what is intrinsically non-parallelizable. In the difficult case of non-parallelizable logic, it is helpful to understand the effect of sync points (points in the algorithm where some or all threads must synchronize) on the parallel logic. When writing parallel (or scalable) code, the sync points play a key role in determining the scalability of the system.
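Here is a small sketch of what a sync point looks like in code, using C++20's std::barrier (the two phases here are illustrative, not from any real system): every thread must arrive at the barrier before any thread proceeds, so the slowest thread in phase 1 gates everyone's entry into phase 2.

#include <barrier>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int nthreads = 4;
    std::barrier sync_point(nthreads);   // no thread passes until all arrive

    auto work = [&](int id) {
        // Phase 1: independent, embarrassingly parallel work.
        std::printf("thread %d: phase 1 done\n", id);
        sync_point.arrive_and_wait();    // the sync point: slowest thread gates everyone
        // Phase 2: may safely assume every thread finished phase 1.
        std::printf("thread %d: phase 2 done\n", id);
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) threads.emplace_back(work, t);
    for (auto& th : threads) th.join();
}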

You'll often hear me say - what is the effect of the slow path on the fast path? If the slow path has a systemic feedback loop into the fast path of the system, you have a pathological problem to deal with and should be out fishing instead.
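A tiny, hypothetical instance of that pathology in C++ (names invented for the sketch): the slow path - say, flushing stats to disk - holds the same lock the fast path needs, so slow-path cost feeds straight back into fast-path latency.

#include <chrono>
#include <mutex>
#include <thread>

std::mutex state_lock;
long long hits = 0;

void fast_path() {                        // called constantly; must stay quick
    std::lock_guard<std::mutex> g(state_lock);
    ++hits;
}

void slow_path() {                        // called rarely, but...
    std::lock_guard<std::mutex> g(state_lock);   // ...it holds the fast path's lock
    std::this_thread::sleep_for(std::chrono::milliseconds(100));  // simulated flush
}

int main() {
    std::thread slow(slow_path);
    std::thread fast([] { for (int i = 0; i < 1000; ++i) fast_path(); });
    slow.join();
    fast.join();
}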

Phew, was that much to digest? Let's focus on the easy case...the code is naturally parallelizable with no frickin gotchas. Should you be worrying about the execution time of a single thread of execution? Or go after the lock wait times? Put coarse locks in your code and get to 25,600% CPU utilization first? Let's nail these questions. Haha. We can then come back with the pruning shears and start shaving off instructions!!

Without being too cynical, we must wait to milk the instruction-count udder until we actually need to. In all likelihood your code is running on today's latest multi-core CPUs and GPUs; contrary to the urban legend - where the Gods of Performance will write you a eulogy for a lifetime of instruction pruning - your time is best spent ensuring that your algorithm is parallel before going after the instruction counts of a single thread of execution.
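Amdahl's law makes this concrete (the numbers below are my own back-of-the-envelope arithmetic, not from any measurement): the serial fraction of your code, not the instruction count of the parallel part, caps what 256 cores can give you.

#include <cstdio>

// Maximum speedup on n cores when a fraction s of the work is serial.
double amdahl_speedup(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}

int main() {
    // With just 1% of the work serial, 256 cores give ~72x, not 256x.
    std::printf("1%% serial, 256 cores: %.1fx\n", amdahl_speedup(0.01, 256));
    // At 5% serial you are down to ~19x; no amount of instruction
    // shaving in the parallel part buys that back.
    std::printf("5%% serial, 256 cores: %.1fx\n", amdahl_speedup(0.05, 256));
}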

Let's leave it at this for now.

To be continued in many posts from now on, with real examples, code, etc.

1 comment:

  1. Vikram,

    It is great to see you blogging. I'm enjoying the content and look forward to more installments.

    You likely don't remember me, but we did have some interactions in the mid-to-late 90s while I was working through the acceptance of some OSD-level Sequent-specific optimizations in our port of the Oracle Database. I seem to recall we were analyzing some odd spinlocking pathologies exhibited under certain workloads.

    All that aside, keep up the good work...I may even blogroll this one :-)
