The State of Multicore Software Development
One of the big issues in multicore is specialism vs generalism. Some multicore processor providers design a multicore processor for a specific job (e.g. Ageia made a processor for simulating physics in computer games, Icera make a software-defined modem for 3G). The advantage of making a special-purpose processor is that you can cut out all the features you don't need for the application and concentrate on getting the maximum performance for it. You can have your own software developers working on optimizing the software specifically for your application. The advantage of it still being a programmable processor is that you can improve the software later, and maybe even apply the processor to a different application which has similar requirements. Some companies (like our neighbours Critical Blue) take this even further and have tools that take an application and make a processor specially for it.
On the other hand, companies like Intel, AMD and Sun make general-purpose multicore processors. The advantage of this is that there is a huge market, so they can spend far more on processor design, tools and applications, and make much larger chips. For applications that are already multi-threaded (like server applications), you get a speed-up very easily. The disadvantage is that if you are forcing people to write multi-threaded applications when they wouldn't naturally do that, then there is a lot of software development work to do for relatively little gain. The maximum performance improvement of a special-purpose processor is much greater than that of a general-purpose multicore processor (in theory), so if you're making software developers do lots of multicore optimization work, there needs to be a big benefit.
What I find interesting is the companies half-way in between these 2 extremes. The GPU makers (like NVIDIA and the former ATI) can sell huge numbers of GPUs, which are very specialized processors.
But as the GPUs get more general, people have become interested in using their incredible raw performance for general-purpose processing. The problem is that GPUs really are just processors for graphics, so you have to do a bit of work to fit normal problems into a GPU. Their floating-point accuracy is quite low (although more than adequate for graphics) and their model of processing is optimized for speed, not flexibility. NVIDIA are tackling it by providing a C programming model called CUDA, and both GPU makers are hinting at full double-precision support in the near future.
IBM/Toshiba/Sony's Cell processor is also somewhere in between maximum performance and maximum generality. It achieves incredible floating-point performance and also very high bandwidth. The bandwidth issue seems to be the most exciting to the people we speak to. The floating-point performance is disappointing, though, if you want high levels of accuracy (full double-precision). But because the Cell processor is inside the PlayStation 3 games console, it has been designed at huge cost and will be manufactured in huge numbers for some time to come. I would expect Cell processors with fast double-precision floating-point support for scientific and engineering applications to be available within the near future too, making it a very powerful number-crunching processor.
This variety is all very exciting for a techy like me. It reminds me of when I started getting interested in computing in the 8-bit microcomputer days. Then, everyone designed their own computers, although only some survived. Now, everyone seems to be designing their own processors. But this poses a number of problems, which have to be dealt with to be able to convert all this technology into something valuable for the customer.
Problem 1: The size of software
This is such a big problem, I'm going to come back to it later. But for now, I'd like you to stop and think for a moment.
If you're a processor designer, I'd like you to stop and think for a long moment, because in my experience, processor designers aren't thinking enough about this issue. Someone recently put it to me this way: there are 10 times as many software developers in the world as hardware developers. So, any problem you can solve at the hardware level will be 10 times more effective than if solved at the software level. But he was massively understating the case. There must be vastly more software developers than processor developers (does anyone have the numbers?).
But the really big problem is the length of time over which software has been developed. Throwing away all your source code and starting again from scratch is rare in the commercial world. Any software worth buying is either very new (quite rare nowadays) or based on software that's been in development for years and years, has had extra features added, and includes libraries from third parties. You could just parallelize the bits that are easy to parallelize, but only a tiny number of features would have increased performance, so the user might not notice any difference.
Most people who speak to me put it this way: "we have a million or so lines of source and we want to parallelize it. The people who wrote the source don't work for us anymore so we don't understand it. And we use some libraries that we don't have the source code for. We think that some of it might be parallelizable. But we don't know." You might think "what rubbish software developers", and processor designers do seem to think that. But, often, it's the people who have the best applications (from a user's point of view) who come to us with this statement. So I don't really see that we can criticize people for developing large-scale, feature-rich, user-friendly, and commercially successful applications in the real world. The problem needs to be solved much closer to the hardware level.
Problem 2: A solution looking for a problem
I often meet processor designers who are looking for applications to run on their processors. As I said above, there seem to be 2 sensible approaches: design a processor for an application; or adapt an application to work on a suitable processor. Designing a processor and then hoping something works on it seems to be doing things the wrong way around.
Problem 3: Amdahl's law
Amdahl's law is the inconvenient truth of computing. It states that, to get a worthwhile performance improvement on an application, you need to parallelize a large percentage of the software. It's a slightly odd rule in that it's not about the number of lines of source, but the percentage of total processing time spent in the section of source you have chosen to parallelize. This means that if your software spends 90% of its time in 10% of its source, then you need to parallelize that 10% to get a maximum (theoretical) 10x performance improvement. This poses 2 big problems: if your application is a million lines, then that's 100,000 lines of code to parallelize; and (this is the bit of Amdahl's law people conveniently forget until it's too late) even inside the bits you parallelize, there will be sections that can only be partially parallelized. People get around Amdahl's law in one of 2 ways: they find the bit they can parallelize and then do lots of that and hope it's useful (this is called Gustafson's Law); or they hand the problem to someone else and hope it goes away. There are certain situations where leaning on Gustafson's Law is perfectly reasonable: computer game graphics, for example. In computer games, it doesn't matter if objects aren't drawn to 100% accuracy, as long as it looks good. So games developers just fiddle things to make it run fast and look good. I know, I used to be one.
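To put numbers on the 90% example, here is a minimal sketch of both laws in Python. The function names are my own, and the Gustafson version assumes the parallel fraction is measured on the parallel machine and that the parallel work grows with the core count, so treat it as an illustration, not a performance predictor.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speed-up of a fixed-size job when only part of its run time is parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def amdahl_limit(parallel_fraction):
    """Ceiling on the speed-up, however many cores you add."""
    return 1.0 / (1.0 - parallel_fraction)


def gustafson_speedup(parallel_fraction, cores):
    """Scaled speed-up: the parallel part grows to fill the extra cores."""
    return (1.0 - parallel_fraction) + parallel_fraction * cores


if __name__ == "__main__":
    # 90% of the run time is parallelizable, as in the example in the text.
    for cores in (2, 4, 16, 1024):
        print(cores,
              round(amdahl_speedup(0.9, cores), 2),
              round(gustafson_speedup(0.9, cores), 2))
    print("Amdahl ceiling:", round(amdahl_limit(0.9), 2))
```

With 90% of the time parallelizable, 16 cores give roughly 6.4x under Amdahl's law, well short of the 10x ceiling. If the "parallel" 90% is itself only partly parallelizable, the effective fraction you pass in drops and the ceiling falls with it.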
Problem 4: Modes of parallelism
Amdahl's Law tells us we have to parallelize large parts of our applications. We might hope that we could pick one parallelization technique and then apply it throughout our application. Unfortunately, this just isn't possible. In real applications, there are various different kinds of parallelism and different parts of our application require completely different types of parallelism. People have tried to classify these different types of parallelism. The version I like to use is the one on the website of the University of Illinois at Urbana-Champaign (UIUC). They break parallelism down into 10 different "patterns" and I think they've got it about right. Recently, I've read a lot of articles and spoken to people who talk about a new kind of parallelism called "data parallelism". Data parallelism looks to me a lot like what used to be called "Embarrassingly Parallel". It's called "Master/Slave" on the UIUC page, but if you click on the "Master/Slave" link you'll see it uses the "Embarrassingly Parallel" name. There are a few applications that come under the Embarrassingly Parallel group: Mandelbrot renderers, a lot of simple graphics operations, some financial modeling applications, and some mathematical problems. However, if there really were a lot of these types of applications, do you think that anyone would have come up with the name "Embarrassingly Parallel"? We use these kinds of demos all the time, because they're simple to write and give good results. But I don't pretend that our customers have a lot of this type of code.
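For anyone who hasn't seen one, this is roughly what an embarrassingly parallel demo looks like: a small Mandelbrot renderer where every row can be computed without looking at any other row. It's a sketch, not anyone's production code; the function names, image size and iteration limit are all made up for illustration. Note that in CPython a thread pool shows the structure but won't give a real speed-up for this kind of pure-Python arithmetic; a process pool would be the thing to try for that.

```python
from concurrent.futures import ThreadPoolExecutor


def mandelbrot_row(y, width=64, height=32, max_iter=50):
    """Iteration counts for one image row; needs no data from any other row."""
    row = []
    for x in range(width):
        # Map the pixel onto the region [-2, 1] x [-1, 1] of the complex plane.
        c = complex(-2.0 + 3.0 * x / width, -1.0 + 2.0 * y / height)
        z = 0j
        n = 0
        while abs(z) <= 2.0 and n < max_iter:
            z = z * z + c
            n += 1
        row.append(n)
    return row


def render(width=64, height=32, workers=4):
    """Hand each row to a worker and collect the rows back in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda y: mandelbrot_row(y, width, height),
                             range(height)))


if __name__ == "__main__":
    image = render()
    for row in image:
        print("".join("#" if n == 50 else "." for n in row))
```

The whole parallelization is the single `pool.map` call, which is exactly why these demos are so popular and so unrepresentative of a million-line application.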
Problem 5: Memory Bandwidth
So, your shiny new multicore processor has 16 cores on it. It should, therefore, be able to run at the same speed as 16 processors. But it doesn't, because memory bandwidth is the big issue. You are now asking the memory bus to deliver data at least 16 times faster. So, your cores spend ages just waiting for data to come in from memory. This is why the Cell processor uses a DMA streaming system. It allows much faster memory bandwidth, at the cost of being much harder to program. AMD solved this with their Barcelona chip by having 2 memory buses. Our tests here show it to work quite well, but can it work for 16 cores and above? It's expensive, because memory buses aren't that cheap. Cell needed to have 7-9 cores on a chip and operate within a very tight power budget, so this option wasn't available to them. When using an accelerator card on a PCI/PCI-X/PCIe bus, this problem becomes even more pronounced. You have to get the PC to send data to and from the card. The card can't easily request data from main memory, because it doesn't have access to the data inside the PC's main processor cache, or the operating system's virtual memory table. And the bandwidth of even the fastest PCI buses doesn't compare to the bandwidth between an x86 processor and main memory. It's a tough problem to solve and one that Cell deals with by having everything on the same chip. It's the problem that faces people pushing PC accelerator cards (whether they be pure numerical accelerators like Clearspeed's, GPUs, or FPGAs). Certain applications can be separated into sections with clearly-defined data-movement requirements (3D graphics sends streams of data from CPU to GPU, and Ageia's PhysX only needs to send the physics data that's changed). But once you try to take advantage of high-performance floating-point processing in a data-parallel way, then you have to have a lot of data, otherwise it isn't worth parallelizing.
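A back-of-envelope way to see the effect is to ask how many cores the bus can actually keep fed. The sketch below is a deliberately crude model of my own, not a measurement: it assumes a purely bandwidth-bound workload, a single shared bus, and invented figures for per-core demand and bus bandwidth.

```python
def effective_speedup(cores, core_demand_gb_s, bus_gb_s):
    """Speed-up over one fully fed core when all cores share one memory bus.

    Assumes a purely bandwidth-bound workload: a core only runs at full
    speed while the bus can deliver core_demand_gb_s to it.
    """
    cores_the_bus_can_feed = bus_gb_s / core_demand_gb_s
    return min(float(cores), cores_the_bus_can_feed)


if __name__ == "__main__":
    # Hypothetical numbers: each core wants 2 GB/s, the bus delivers 10 GB/s.
    for cores in (1, 4, 16):
        print(cores, "cores, one bus:", effective_speedup(cores, 2.0, 10.0))
    # Doubling the bus bandwidth (e.g. a second memory bus) doubles the ceiling.
    print("16 cores, two buses:", effective_speedup(16, 2.0, 20.0))
```

In this toy model, 16 cores on the 10 GB/s bus behave like 5, and a second bus only lifts that to 10. Real chips have caches and prefetchers that soften this, but the shape of the problem is the same.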
Problem 6: Offloading costs
All over your code, you'll find simple loops that are trivial to parallelize. But you need to take into account that there is a certain cost to starting threads. The more processors, the more cost. And if you have to send code to those processors (as with the SPUs on Cell, or on a GPU) then that is an even bigger cost. So, your loop needs to do enough work for it to be worthwhile. But if your loop is too big, then you can end up running out of fast local memory to store all the code and data in. It's a tricky balancing act, and the balance changes as you develop your program.
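The break-even point can be sketched with a simple cost model. Everything here is an assumption for illustration: the start-up cost is taken to grow linearly with the number of cores, the work is assumed to split perfectly evenly, the times are in made-up microseconds, and the local-memory limit isn't modeled at all.

```python
def offload_time(work_us, cores, startup_us_per_core, transfer_us=0.0):
    """Estimated wall time for a loop split evenly over `cores`.

    transfer_us covers sending code and data to the cores, where needed.
    """
    return work_us / cores + startup_us_per_core * cores + transfer_us


def worth_offloading(work_us, cores, startup_us_per_core, transfer_us=0.0):
    """True if the parallel version beats just running the loop serially."""
    parallel = offload_time(work_us, cores, startup_us_per_core, transfer_us)
    return parallel < work_us


if __name__ == "__main__":
    # A 1000 us loop on 8 cores with 50 us start-up per core: worth it.
    print(offload_time(1000, 8, 50), worth_offloading(1000, 8, 50))
    # A 200 us loop with the same overheads: the start-up cost dominates.
    print(offload_time(200, 8, 50), worth_offloading(200, 8, 50))
    # The 1000 us loop again, but with 500 us of code/data transfer.
    print(offload_time(1000, 8, 50, 500), worth_offloading(1000, 8, 50, 500))
```

Even in this toy version, the answer flips as soon as the loop body shrinks or a transfer cost appears, which is why the balance keeps moving while you develop.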
Problem 7: Choosing a programming model
So, you've decided to parallelize your application. Your first job is to choose a programming model to use to write your new parallel application. Codeplay sells a programming model (or 2) and so we're not exactly an independent recommender. But we do have some experience of looking at programming models and trying out our own ideas. The balance you need to strike is this: one programming model might limit you to only one type of processor architecture (shared memory, for example), another might only work on certain types of algorithms, and another might require you to change your data structures. Changing data structures is a big problem if your software is big, because it may involve changing all of your software. Choosing a programming model that only works on certain types of algorithms is not always fatal. It's not uncommon for people to use a mixture of MPI and OpenMP in the parallel programming world. As long as the different programming models can share data and processors, then it is possible to make it work. And of course, you need your programmers to understand more than one model, otherwise how can you maintain your code? Choosing a programming model that requires shared memory (e.g. OpenMP, Intel's Threading Building Blocks, pthreads) does limit you to shared-memory architectures, which means no to Cell, Clearspeed, FPGAs and GPUs. That might change in the future. There seems to be a lot of talk about almost-shared-memory processors: processors that can run existing threaded applications and appear to share memory, but run much faster if you recognize that each processor does have a separate local store. This seems like a good balance, but I haven't seen it working yet, so I can't comment on the performance implications.
My predictions for the future
OK, so I'm going to do the stupid thing of trying to predict the future.
I make no promises that I've got it right!
* Programmers are going to have to start thinking about memory bandwidth in their applications. Seeing as it's very hard to work out what this might mean, I think programmers will at least need to have tools to measure memory bandwidth in some way.
* The shared-memory and local-store memory models both have problems.
* Multicore will be most successful in games, servers, mobile phones, networking, scientific/engineering/medical/financial modeling. In that order. Just because financial modeling seems to be the easiest to parallelize doesn't mean its users will be the earliest adopters.
* The magic application that justifies data-parallel processors will appear. But the algorithm will be so simple that someone will then implement the application in silicon.
* C++ will not die, it will just be extended. In a myriad of incompatible ways (sorry, we're partly responsible, but what can we do? People need solutions for existing multicore problems now).
* PCs will get smaller, cheaper, quieter, less power-consuming. It's been a long time coming, but I think the time is finally (nearly) right for mass-market single-chip PCs.
* The GPU will (eventually) be integrated onto the motherboard, even for high-end gaming systems, because the CPU-GPU bandwidth requirement of next-generation 3D graphics and games is vastly higher than it is now.