Tuesday, January 8, 2008

Multicore: Not Just a Software Crisis

By Roderick Flores

I have been following the exchange over the existence of a multicore crisis with a great deal of interest. The central question is whether software systems as we know them today will continue to show the performance gains they have historically enjoyed by riding Moore's Law. As you know, clock speeds have remained fairly flat over the past few years, and there are physical limits to the number of transistors that can be packed onto a chip. The leading manufacturers have therefore begun placing more than one processing core on a single chip.



So, can software developers continue to rely on the computational power of multicore architectures to improve performance? Or will they need radically new toolsets to produce applications that can harness a chip with a large number of cores on board? In other words, will the trend toward multicore lead to stagnation in application performance? Well, I couldn't just stand by without diving into this fray.



I will not get into the details of the many sophisticated arguments on either side of this debate, but I think it is important to note a few. To begin with, there just are not that many embarrassingly parallel problems that lend themselves easily to multicore processing. The other major problem sets (or dwarfs, as they are termed in The Landscape of Parallel Computing Research: A View from Berkeley) are not easy to program. As many of you know, parallel programming is typically hard, and the tools to make it easy simply do not exist. A number of people have also suggested that universities have not adequately trained programmers for the parallelism that multicore requires. They argue that these architectures are essentially the same as the SMP systems people have been using for a very long time, and that, since threads have been around for a while, programmers should already know how to use them.
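
To make the "threads are old hat" point concrete, here is a tiny, hypothetical illustration (mine, not from any of the arguments above) of why multithreaded code is harder than it looks: without the mutex, the shared counter silently loses updates, and the bug only shows up intermittently.

    /* A minimal pthreads sketch: two threads updating shared state.
       Remove the lock/unlock calls and the final count is usually wrong. */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);   /* counter++ alone is a data race */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expected 2000000)\n", counter);
        return 0;
    }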



Personally, I think this is all well and good but somewhat misleading. What if we had an embarrassingly parallel problem that could be multithreaded and written without any breakthroughs in toolsets? Would that mean we could just chug along, realizing massive gains from multicore processors without hitting the figurative brick wall?



My most recent parallel programming experience involved an embarrassingly parallel algorithm. Essentially, the process was to load as much of the dataset into memory as you could, perform a large number of calculations on it, and then write the results. I will not go into too much detail because the algorithm is proprietary. In any event, to get speed-up, all you had to do was multithread the software and let it loose; the more cores and memory I threw at the problem, the faster it ran. I remember our excitement when we got near-linear speed-up on a system with a total of four cores.
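
Since the algorithm itself is proprietary, the following is only a generic sketch of the pattern (the kernel and the data sizes are invented): read the records into memory, let every available core chew on its own independent slice, and then write the results out.

    /* Embarrassingly parallel skeleton: no thread ever needs another
       thread's results, so one OpenMP pragma does all the threading. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N_RECORDS 1000000          /* made-up dataset size */

    static double kernel(double x)     /* stand-in for the real calculation */
    {
        double y = x;
        for (int i = 0; i < 1000; i++)
            y = 0.999 * y + 0.001;
        return y;
    }

    int main(void)
    {
        double *in  = malloc(N_RECORDS * sizeof *in);
        double *out = malloc(N_RECORDS * sizeof *out);

        for (int i = 0; i < N_RECORDS; i++)   /* "load the dataset" */
            in[i] = (double)i;

        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N_RECORDS; i++)   /* independent calculations */
            out[i] = kernel(in[i]);

        printf("%f\n", out[N_RECORDS - 1]);   /* a real run streams its results to disk here */
        free(in);
        free(out);
        return 0;
    }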



So what would happen if I could have the quad-processor system with 64 cores per processor that I hear is just around the corner? You know, exactly the kind of system that most applications will not be able to utilize. I could get by with 16 GB of memory and several terabytes of disk – those resources are already available. Shouldn't I expect my embarrassingly parallel problem to run 256 cores / 4 cores, or 64 times, faster? In other words, a standard model that takes 40 hours on a single four-core node might finish in well under 40 minutes. If that were the case, we could have saved an enormous amount of money on hardware instead of buying our cluster.
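
The ideal-scaling arithmetic, spelled out (assuming perfectly linear speed-up, which is precisely the assumption I am about to question):

    /* Back-of-the-envelope: perfect scaling from 4 cores to 256 cores. */
    #include <stdio.h>

    int main(void)
    {
        double baseline_hours = 40.0;              /* runtime on a 4-core node */
        int    baseline_cores = 4;
        int    total_cores    = 4 * 64;            /* 4 processors x 64 cores  */

        double speedup = (double)total_cores / baseline_cores;   /* 64x   */
        double minutes = baseline_hours * 60.0 / speedup;        /* ~37.5 */

        printf("%d cores -> %.0fx speed-up -> %.1f minutes\n",
               total_cores, speedup, minutes);
        return 0;
    }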



What I can tell you is that this is a fantasy. You can only throw so many cores at a problem because of limitations in memory access speeds and latency, bus speeds, network speeds and latency, and SAN throughput. First of all, storing the nearly one terabyte that a run of the aforementioned problem would produce in under 40 minutes already requires some pretty impressive sustained throughput. Now imagine doubling the number of cores to 128 per processor: can we push that much data to disk in less than 20 minutes? In other words, will ordinary compute nodes be writing to storage at close to a gigabyte per second by the time 128-core chips are in production? Secondly, imagine 256 threads accessing the data held in memory. Each thread will be in a different part of the calculation, so memory access is going to be essentially random. I do not believe that today's memory systems can handle that kind of load without major contention. In short, you will not see anything like the speed-up you are hoping for.
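
My own rough numbers for the storage side, using the one-terabyte output figure above (decimal units, and ignoring everything else on the machine):

    /* Sustained write bandwidth needed to land ~1 TB of results on disk. */
    #include <stdio.h>

    int main(void)
    {
        double output_gb = 1000.0;                /* ~1 TB of results per run  */
        double minutes[] = { 37.5, 18.75 };       /* 256-core and 512-core runs */

        for (int i = 0; i < 2; i++) {
            double gb_per_s = output_gb / (minutes[i] * 60.0);
            printf("%5.2f min -> %.2f GB/s (%.1f Gbit/s) sustained\n",
                   minutes[i], gb_per_s, gb_per_s * 8.0);
        }
        return 0;
    }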



Similarly, I believe you can see this problem on a grid, where you do not need a massively parallel program to make full use of your hardware. For example, imagine 200 four-way, 64-core-per-processor computational nodes connected to high-end storage through a low-latency network: you have a grid with over 50,000 cores!!! A good resource manager will happily schedule single-threaded or parallel jobs onto every one of those cores. If your grid has the high utilization rates that are typical (build it and they will use it), then you can easily see the same problem I described earlier: tens of thousands of processes contending for severely over-taxed resources.
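
To show the scale of that contention (the per-core write rate here is a purely hypothetical, and modest, figure):

    /* Aggregate demand on shared storage from a fully loaded grid. */
    #include <stdio.h>

    int main(void)
    {
        long cores = 200L * 4 * 64;           /* 51,200 cores in the grid   */
        double per_core_mb_s = 10.0;          /* hypothetical per-job write */

        double aggregate_gb_s = cores * per_core_mb_s / 1000.0;
        printf("%ld cores x %.0f MB/s each = %.0f GB/s at the SAN\n",
               cores, per_core_mb_s, aggregate_gb_s);
        return 0;
    }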



So I see the crisis as more than just a matter of complicated parallel algorithms or a lack of good programming tools. Rather, the crisis is rooted in the entire computing ecosystem, not just in software. To make use of systems with numerous cores, we will need significant improvements in internal communication, networking, memory access, and probably a number of other areas that I have not even considered, let alone imagined. That is not to say that the arguments surrounding the multicore crisis are not well founded. I am simply saying that there is much more to be concerned about than programming.
