Notes from SC10 - WT8P's Notes to Self

King Neptune, outside Mardi Gras Land

A few weeks ago, I spent a week in New Orleans at Supercomputing 2010. (Sometimes my job has perks.) I wrote a really long summary of this for internal use, but thought I’d share some of my notes:

Jack Dongarra of Oak Ridge National Laboratory offered his perspective on HPC (High-Performance Computing), past and future.

1988 – first GigaFLOP (one billion floating point operations per second) machine, Cray Y-MP, 8 processors, used to solve static finite element cases. For comparison, the Nintendo Wii I purchased last year is supposed to have up to 61 GigaFLOPs of processing power. Its games are a lot more fun than the ones on the Cray Y-MP.
1998 – first TeraFLOP machine, Cray T3E, 1480 processors, used to simulate magnetism. Sony claims the PlayStation 3 has a TeraFLOP of computing power.
2008 – first PetaFLOP machine, Cray XT (“Jaguar”), 150,000 processors, used to model atoms in a superconductor. (Interestingly, the IBM Roadrunner also claims to be the first, for hybrid computers.) A PetaFLOP is roughly 10x the estimated parallel processing power of the human brain, after a cup of coffee. Here is a time-lapse of the Jaguar being assembled, after a cup of coffee:

2018 – first ExaFLOP machine is expected. He anticipates it will have 10^7 (10 million) processors and effectively 10^9 (one billion) threads. With it, you’ll be able to see perspiration working its way down individual hair follicles of players in Madden 2030. (Why you would want to do this, I have no idea… but you can.)
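
As a rough check on the pace of that timeline: each milestone is 1,000x the previous one and arrives about ten years later. The sketch below works out what that implies per year. The function names are made up for illustration, and it assumes smooth exponential growth between milestones, which real hardware only approximates.

```python
import math

def annual_growth_factor(total_factor: float, years: float) -> float:
    """Per-year multiplier that compounds to total_factor over the given years."""
    return total_factor ** (1.0 / years)

def doubling_time_years(total_factor: float, years: float) -> float:
    """Years needed to double at the growth rate implied by total_factor."""
    return years * math.log(2) / math.log(total_factor)

# 1988 GigaFLOP -> 1998 TeraFLOP -> 2008 PetaFLOP: 1000x per 10 years.
growth = annual_growth_factor(1000.0, 10.0)
doubling = doubling_time_years(1000.0, 10.0)
print(f"growth per year: {growth:.2f}x, doubling time: {doubling:.2f} years")
```

In other words, the milestones amount to top-end performance roughly doubling every year.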

Areas for more graduate student research:

1) New algorithms. This was also echoed by Steve Wallach (CEO of Convey Computers), the plenary speaker. For supercomputing, it’s a great research project. For commercial products, we’ve had more effective results from putting an application in a performance analyzer and identifying bottlenecks: unnecessary serialization, putting too much/too little on the heap, doing too much work up front. There are a lot of psychological tricks one can do in applications to give the perception of speed that we could still tap into.
2) Effective use of multi-core and hybrid architectures. Many of the Top 500 are hybrid architectures where CPUs (general purpose, can interface with I/O) farm tasks out to (massively parallel) GPUs. These are notoriously difficult to program. The Parallel Computing 101 tutorial lead bristled every time someone asked about these computers.
3) Mixed-precision in algorithms. Single precision calculations are 2x faster on CPUs, 10x faster on GPUs. A coworker and I had discussed this separately, hypothesizing we might benefit from variable precision – to avoid using double precision when single precision plus a couple of bits for accuracy would do fine.
4) Self-adaptive/auto-tuning software. (This reminds me of a question John Bennett asked us in COMP 425: “How many of you have ever written self-modifying programs?” (hands go up) “How many intended to do this?”)
5) Fault-tolerant algorithms. This is the obvious fifth bullet, as systems achieve their scalability by throwing more hardware at problems. Unfortunately, having more components also means a higher statistical likelihood of a failure. In one of the sessions, I heard they expect some component failure to occur as often as once a day. He didn’t elaborate on what ideas there are for dealing with this.
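
To make item 3 concrete, here is a minimal sketch of one flavor of mixed precision: keep the data in single precision but accumulate in double. This is not something shown in the talk. The helper names (`to_f32`, `sum_single`, `sum_mixed`) are hypothetical, and single precision is emulated by rounding through `struct`, so it only illustrates the accuracy side, not the speed difference.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_single(values):
    """Accumulate in single precision: round the running total after every add."""
    total = 0.0
    for v in values:
        total = to_f32(total + to_f32(v))
    return total

def sum_mixed(values):
    """Inputs stored in single precision, running total kept in double."""
    total = 0.0
    for v in values:
        total += to_f32(v)
    return total

values = [0.1] * 100_000           # the exact sum would be 10000
err_single = abs(sum_single(values) - 10000.0)
err_mixed = abs(sum_mixed(values) - 10000.0)
print(f"all-single error: {err_single:.6f}")
print(f"mixed error:      {err_mixed:.6f}")
```

The inputs are equally imprecise in both versions; only the accumulator changes. That is the "single precision plus a couple of bits" idea in miniature: spend the extra precision where rounding errors pile up, not everywhere.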

I know I parked the car here last night.

I thought it was amusing that he had a slide on the typical breakdown of time spent building supercomputers:

20% what to build
80% building it
0% understanding its effectiveness to make changes for the next supercomputer

I think he left out: 25% write new grant proposals.

One of my favorite pet topics was how metrics can be abused and selectively chosen to make anything look good. (Well, duh, that’s the basis for marketing, right?):

FLOP/s (floating point operations per second) – not useful for comparing among machines and applications because there’s an incentive to undo optimizations of the compiler to get “credit” for work that’s unnecessary.
Efficiency – artificially inflated and biased towards slower systems. One could make a supercomputer by stringing together a bunch of iPhones and Android phones. It would be *very* power efficient, but take thousands of years to finish a weather model.
Speedup – only a characteristic. Amdahl’s law models the expected speedup of a parallelized implementation of an algorithm relative to the serial algorithm, under the assumption that the problem size remains the same. They cited Amdahl’s law + Murphy’s law: if a system component can damage performance, it will.
Application scaling – comes in two forms, weak scaling (how the solution time varies with the number of processors for a fixed problem size per processor) and strong scaling (how the solution time varies with the number of processors for a fixed total problem size). Pitfalls include having an unrealistic problem size.
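
For reference, the usual textbook form of Amdahl’s law is easy to play with. The sketch below assumes a fixed total problem size (the strong scaling case) and a made-up example where 95% of the work parallelizes perfectly. The function name `amdahl_speedup` and the 95% figure are illustrative, not from the session.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Expected speedup over the serial run for a fixed problem size."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time; only the parallel part shrinks.
    return 1.0 / (serial_fraction + parallel_fraction / processors)

# Hypothetical workload: 95% of the run time parallelizes perfectly.
for n in (1, 10, 100, 1000):
    print(f"{n:>5} processors: speedup {amdahl_speedup(0.95, n):6.2f}")
```

With 5% of the work stuck in serial, the speedup can never exceed 1 / 0.05 = 20x, no matter how many processors get thrown at it. That ceiling is why an unrealistic problem size can make a scaling chart look better or worse than real use would.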

Recommendations include:

Focus on first-order effects: there is no need to know the performance of each transistor.
Split modeling effort: work on single processor execution time, then processor count, topology, network, etc. In other words, get it working on one processor first. Or, as we say in real world product delivery: get it working correctly first, then optimize.

Marie Laveau‘s tomb

When not enjoying the arguments between the CUDA acolytes and the OpenCL disciples (“Every time a programmer uses CUDA, [deity] smites a kitten”), I spent a lot of time walking around the French Quarter, clearing out every geocache I could. There are some pretty areas and unexpected finds, like this mini-park under a busy road:

Margaret of New Orleans

The best geocache in the city was View Carre’. I went into the building and asked the front desk security guard if they could “take me to the geocache.” The cache owner, Team Bamboozle, is the chief engineer, but had gone home a few minutes before my visit. However, another engineer, Shawn, took me up to the top floor, then to the secret engineering elevator to the 32nd floor maintenance room. It was a pretty easy find:

This is soooo not an urban microcache

After signing the log and dropping a travel bug, we went up on the roof where he gave me a visual tour of the city. Outstanding! I

John

2010-12-11 at 06:13

Jim: I don’t know how I managed to miss this when it was first written. Of your comments above, the one I found most interesting from a practical standpoint is mixed precision computing. Seems obvious in hindsight that sometimes you just don’t need a 64-bit floating point number. Of course, the trick is figuring out when that is.

I also like the debate about CUDA vs. OpenCL. I’d like to compare notes with you at some future date – January, for example.
