Queue grainy PSA video: “Here at the Intel Bit labs, we’re working on some exciting new things…”
Sufficed to say, multi-core programming is getting bigger and bigger – and is here to stay for the foreseeable future. We started back in the day with a single monolithic compute engine that people queued up jobs for. In the 60’s and 70’s, those big monolithic servers got many cores/threads – but you still queued up your individual jobs to run on them. Then, in the 80’s – single core computers caught on. Now, in the 2000’s, we bring many cores to the desktop masses. Now nVidia and ATI have graphics cards with hundreds of programmable ‘stream’ processors with massive floating point math support (dot products, 4×4 matrix multiplication, etc). And I’ll wager this is likely to be the direction for the next 10+ years. Gone are the halcyonic days of yore when one could write monolithic code for a single processor. And with multiple cores, comes multiple new challenges.
The first is algorithms and data structures. I just finished writing my first lock-less, thread-safe data structure. Granted, it was a simple lockless stack, but lockless and thread-safe none-the-less and was used to feed jobs to several cores. Seeing it in action is quite impressive. I had multiple producer threads filling it, while multiple consumer threads tried to empty it. Consumer/producer threads could switch back and forth being producers if the job queue emptied out too fast, or switch over and become a consumer if the stack was filling up too fast. All without loss of a single data element, and all without a single semaphore.
If you’re a computer programmer, you owe it to yourself to start thinking and writing in a thread-safe manners from here on out – and learning when you can ply your trade – and where you can’t. I know in my time, parallel algorithms were kind of a ho-hum addition towards the end of your algorithms courses. The 90’s was when all the big super-computer companies were going belly up. A lot of the research had already been done, and the basic take-away was that for some operations, parallel algorithms scale by the number of cores nicely, others stubbornly don’t scale at all. Best crack that old book and get yourself familiar with which ones do/don’t again.
There are two ways to go about thread safety – using the inherent structures of an API/system (SQL or other systems that inherently are built to handle multiple requests in a threadsafe way), or using hardware intrinsics down at the core. You should learn them both. Knowledge of high-level system abstractions is good, but when you find that suddenly your getting the same element twice, or one dissappears, or you get garbage at random intervals – you need to know how it works underneath – in spades.
Debugging parallel algorithms and data structures can be very painful. A very brilliant guy here worked for several weeks on a lock-based queue with memory/node reuse. The bugs were tenacious and very hard to duplicate. Some took a week or better to show up. The simplest change would fix or completely destroy the system. Memory allocation was something of voodoo art (anyone want to become way famous by writing up memory allocation routines for parallel/lock-less algorithms??). You think malloc/free bugs were hard to find before…
At least familiarize yourself with these topics:
- New atomic hardware instructions – atomic increment/decrement, compare-and-swap, etc. There are new commands coming out with each new processor – go find out what they are.
- What new services are provided by my OS/language for synchronization? I actually have had to use inline assembly for new hw intrinsics because many compilers don’t expose the new commands just yet but certain lock-less data structures and algorithms can’t be written without them.
- What API’s/OS calls are thread safe at the app level, the thread level, and/or the os level?
- Learn how to debug a multi-threaded app
- What algorithms (searching, sorting, counting, etc) parallelize nicely, which do not
- What kinds of data structures can be lockless? Which can’t?
- In what novel ways can I combine lower-level atomic operations to build new atomic instructions (such as using atomic INC and CAS to build an n-way CAS)
- How do I handle allocation/deallocation in a multi-threaded app?