March 10, 2008
Concurrency at SD West: James Reinders
SD West, the premier software developer conference in the U.S., was held last week in Santa Clara. While the C++ and Java tracks were packed with expert programmers eager to learn new tricks from their favorite gurus and authors, there was as yet no overt organizational focus on concurrency. SD West advisory board members told me they had recommended that the conference add a concurrency focus in the future, but admitted they had only a very short list of potential candidates to fill a track on parallel programming.
That said, throughout the week there were smatterings of concurrency, with Herb Sutter offering the most in-depth and varied takes on the topic. Bjarne Stroustrup alluded to the return of clusters via the manycore revolution, while Brian Goetz gave a meaty tutorial on Java concurrency. But Intel evangelist James Reinders's keynote on Thursday (which might better have been slotted early in the week) aimed for a gut-level appeal with "Parallel or Perish!! -- Are you Ready?"
Reinders began his keynote by reminding developers that while multicore laptops are already here, an explosion in core counts, as steep as the exponential clock-speed gains of the Moore's law era, was just around the corner.
"Somewhere around eight or 16 or 32 processors, something happens with computer design because of the latencies and interconnects. Intel sometimes calls it terascale. Some call it manycore." His point? Scalability is suddenly going to be a huge issue for developers who want to wring additional performance out of their programs. Otherwise, we'll watch previously linear speedups (Sutter's famous "free lunch") slow to a far less aggressive climb -- one no longer in direct proportion to the additional CPUs future platforms may boast.
The first order of business, according to Reinders, is to embrace abstractions and avoid hand-coded threading. Despite the existence of threading libraries in various languages, OpenMP (1997) was the first major tool for managing multicore. Intel's Threading Building Blocks for C++ (2006), now open source at TBB.org, is another leap forward for multicore developers.
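To make the contrast concrete, here is a minimal sketch of hand-coded threading next to a pool abstraction. It uses Python's standard `concurrent.futures` module as a stand-in; it is not OpenMP or the TBB API, and `count_words` and both `total_words_*` functions are hypothetical names invented for the illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def count_words(text: str) -> int:
    """Stand-in for one unit of independent work (hypothetical task)."""
    return len(text.split())


def total_words_hand_threaded(chunks: list) -> int:
    """Hand-coded threading: create, start, join, and lock everything yourself."""
    total = 0
    lock = threading.Lock()

    def worker(chunk: str) -> None:
        nonlocal total
        n = count_words(chunk)
        with lock:  # leaving this lock out would be a data race on `total`
            total += n

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def total_words_with_pool(chunks: list, max_workers: int = 4) -> int:
    """Abstraction: describe the work and let the pool schedule it."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() hands chunks to worker threads and yields results in order,
        # so no explicit lock or shared counter is needed.
        return sum(pool.map(count_words, chunks))
```

Both functions return the same total; the second simply has no thread management or locking for the caller to get wrong. In standard CPython the global interpreter lock keeps these threads from executing bytecode in parallel, so the sketch shows the difference in programming model, not a speedup.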
Half of all developers recently surveyed by Intel, Reinders said, were already using abstractions for threading. And in the last year, far fewer have complained that threading is simply too difficult to learn.
"It’s become fashionable to blame not using concurrency in 2007 on schedules -- people told us they had no time to implement it. It's no longer as common to deny the need for it," Reinders said. However, some 14% said concurrency was too difficult, and 27% were still in denial -- er, crossing their fingers that concurrency was just a passing fad.
In closing, Reinders urged audience members to check out Whatif.intel.com for the chip maker's prototype concurrency technologies for transactional memory and its C++ Parallel Exploration Compiler, which introduces four keywords: __parallel, __spawn, __par, and __critical.
A wide-ranging audience Q&A followed Reinders's keynote -- here's a sample of the lively interaction:
Q: Why is programming more difficult when you get past eight or 32 cores?
A: Dual- and quad-core have the same time-to-memory. With more cores, you're dealing with the problem of distributed memory and nonuniform memory access.
The other thing is, programs either scale or they don’t. And that brings up the question, should we build some out-of-order cores that take the same die area, or should we build 400 smaller cores? I think we’re going to build chips with a few bigger cores, but it's more power- and die-efficient to use homogeneous smaller cores. There's going to be a very interesting tradeoff in the future.
Q: You mentioned TBB. Are there libraries for Java concurrency?
A: I’m not aware of any that have caught on really. A great deal of Java is run on servers and gets parallelism by being instantiated multiple times.
Q: We've been using TBB, and doing task stealing because we don’t want to use locks. But internally, TBB does use locks. Doesn't that defeat the purpose of TBB?
A: TBB does use locks -- whenever you have some form of synchronization, you need them. However, they can be very difficult to introduce and debug. We think the value is in our testing them and reducing the need for new source code. You can still end up with programs that don’t scale if you code them yourself. We also have a tool that visualizes what the processor is doing: Why it’s delaying, what part is stuck on locks, what the cores are actually doing.
Q: We hear a lot about CPU-bound apps, but often you have a huge amount of data you need to process. What about parallelizing I/O-bound apps?
A: You do hear the most talk about CPU-bound, but for input/output processing, asynchronous I/O is extraordinarily popular [for continuous processing while the I/O operation is performed in the background]. FORTRAN is a more disciplined language for asynchronous I/O with its format statements. This is the biggest request we get from users of TBB.
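The idea of continuing to compute while an I/O operation completes in the background can be sketched with a future. This is only one of several ways to get the effect, not something shown in the keynote: `slow_read` is a hypothetical function that simulates a blocking read with `time.sleep`, and the checksum loop is made-up CPU work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_read(name: str, delay: float = 0.05) -> str:
    """Simulated blocking I/O; the sleep stands in for a disk or network wait."""
    time.sleep(delay)
    return f"contents of {name}"


def read_while_computing(name: str):
    """Start the read in the background, compute, then collect the data."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(slow_read, name)  # the read starts on a worker thread
        checksum = sum(range(100_000))          # CPU work proceeds in the meantime
        data = pending.result()                 # block only when the data is needed
    return data, checksum


if __name__ == "__main__":
    data, checksum = read_while_computing("big.dat")
    print(data, checksum)
```

The caller pays for the wait only at `pending.result()`, after the unrelated computation has already been done.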
Q: Are there approaches that Intel is looking at for hardware support for threads, such as locking sections of memory?
A: The area that the industry thinks is most profitable in that respect is transactional memory. I think it’s too far from a solved problem to add to hardware right now. We currently have the most ambitious transactional memory compiler out there.
Of course, there are multiple groups in Intel that believe they know exactly what they can add to the hardware to solve these problems [laughing]. We take a look at how programs are running, and if we’re able to optimize the processor to make locks run better without changing the instruction sets, we do it.
Q: TBB focuses on threads, but is multiprocessing the better approach?
A: Multiprocessing is probably a better thing to do. In general, when you go too fine-grained, that causes a lot of trouble. However, it requires more discipline to break your program up for multiprocessing. Definitely go multiprocess whenever possible. It’s easier to debug.
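As a rough illustration of the multiprocess style, the sketch below splits a sum across separate operating-system processes, each with its own address space, so there is no shared state to lock. The helper name `parallel_sum` and the `CHILD_CODE` one-liner are assumptions made for this example; it launches extra Python interpreters through the standard `subprocess` module.

```python
import subprocess
import sys

# Program run by each child interpreter: print the sum of integers in [start, stop).
CHILD_CODE = (
    "import sys; a, b = int(sys.argv[1]), int(sys.argv[2]); "
    "print(sum(range(a, b)))"
)


def parallel_sum(n: int, workers: int = 4) -> int:
    """Compute sum(range(n)) using up to `workers` separate processes."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    if n <= 0:
        return 0
    step = -(-n // workers)  # ceiling division so the ranges cover [0, n)
    children = []
    for start in range(0, n, step):
        stop = min(start + step, n)
        children.append(subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE, str(start), str(stop)],
            stdout=subprocess.PIPE, text=True))
    total = 0
    for child in children:
        out, _ = child.communicate()  # wait for the child and read its answer
        if child.returncode != 0:
            raise RuntimeError("worker process failed")
        total += int(out)
    return total
```

Each worker can be run and debugged on its own from the command line, which is part of the appeal Reinders describes. The cost is the discipline of cutting the problem into pieces that need no shared memory.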
Q: Do you see Intel moving toward a loosely coupled processor model (similar to the Transputer from way back when)?
A: Lots of people at Intel are thinking about that. I have strong personal opinions. After a while, communicating through memory is a convenient programming model, but it may not be exactly what you’re doing. One approach is using MPI, for example. I definitely think we’ll see more MPI. By the way, everyone in high performance computing hates MPI, but it’s the best thing we’ve got.
The most painful thing for developers is going to be if we wake up and say there’s no more shared memory. This is definitely an interesting time for processor architecture.
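For readers who have not used message passing, here is a small sketch of the style. It is not MPI: it uses Python's standard `queue` module between threads, and `worker`, `square_all`, and the `STOP` sentinel are names invented for the example. The point is only that the workers never touch a shared variable; everything arrives and leaves as a message.

```python
import queue
import threading

STOP = None  # sentinel message telling a worker to shut down


def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Receive numbers and send back (number, square) pairs."""
    while True:
        msg = inbox.get()
        if msg is STOP:
            break
        outbox.put((msg, msg * msg))


def square_all(values: list, n_workers: int = 3) -> dict:
    """Square each value by exchanging messages with a set of workers."""
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, outbox))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for v in values:
        inbox.put(v)       # "send" one work item
    for _ in threads:
        inbox.put(STOP)    # one stop message per worker
    for t in threads:
        t.join()
    results = {}
    while not outbox.empty():  # safe here: every producer has already exited
        key, val = outbox.get()
        results[key] = val
    return results
```

Because the only coupling is the pair of queues, the same structure could in principle be moved onto separate processes or machines, which is the property that makes message passing attractive if shared memory ever does go away.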
Posted by Alexandra Weber Morales on March 10, 2008 1:55 PM
