
What Makes Parallel Programming Hard?

One of the challenges of multi-core and tera-scale architectures is making parallel programming "easier". But what makes it hard in the first place? I thought it might be worth describing some of our experiences with this as a prelude to explaining how we're solving it. I've ranked the things that make parallel programming hard in roughly increasing order of difficulty:

1. Finding The Parallelism

2. Avoiding The Bugs

3. Tuning Performance

4. Future Proofing

5. Using Modern Programming Methods

Finding The Parallelism: The first problem confronting the programmer is identifying where parallelism exists in their application. The solution might be as simple as "Hmm… I'm calling function A after function B and they don't interact in any way, so I'll turn them into parallel tasks" or "I'm doing the same thing to this entire array of elements, so let me process it all in parallel." Sometimes, however, it's a little trickier: the algorithm you use for good sequential (non-parallel) performance might not be easy to parallelize. To use a real-world example: a couple of years ago we were trying to parallelize an MPEG-4 video encoder. It seemed easy enough on the surface: there's a lot of parallelism in those video frames! However, a funny thing happened on the way to getting sequential performance…

One of the tricks of video compression is to predict where a particular chunk of a previous video frame will move in the next frame. This is called "motion estimation", and it allows us to represent the chunk much more compactly. However, it is a really, really compute-intensive part of the code. So, one of the tricks clever video hackers used is to predict the current motion vector from neighboring motion vectors. This makes intuitive sense: pixels next to each other are likely to be moving in the same direction in the video stream (think of an object moving through the scene, or a camera pan). This greatly narrowed the search space for each chunk. The results were amazing: with little perceptible loss in video quality, you could speed up video compression by up to a couple of orders of magnitude!

But this causes problems for parallelism: the optimization sped up sequential performance by creating a dependence between motion vector computations within each frame, so we could no longer (easily) parallelize them. The quality of the sequential result had raised the bar so high that we had to rethink the algorithm a little (which we did, successfully, by the way). The sketch below illustrates the dependence.
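To make that dependence concrete, here is a minimal sketch in C++. This is not our encoder's actual code: the tiny frame size and the predict and refine functions are illustrative stand-ins (real encoders typically seed the search with a median of several neighboring vectors).

// Minimal sketch of the dependence introduced by neighbor-predicted
// motion estimation. All names and sizes are illustrative.
#include <cstdio>

constexpr int H = 4, W = 4;  // blocks per frame (tiny, for illustration)
struct MotionVec { int dx, dy; };

// Hypothetical predictor: seed the search from the left and top neighbors.
MotionVec predict(const MotionVec mv[H][W], int y, int x) {
    MotionVec left = (x > 0) ? mv[y][x - 1] : MotionVec{0, 0};
    MotionVec top  = (y > 0) ? mv[y - 1][x] : MotionVec{0, 0};
    return { (left.dx + top.dx) / 2, (left.dy + top.dy) / 2 };
}

// Stand-in for the expensive search around the seed.
MotionVec refine(MotionVec seed, int y, int x) {
    return { seed.dx + (x & 1), seed.dy + (y & 1) };
}

int main() {
    MotionVec mv[H][W];
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            // Read-after-write dependence on mv[y][x-1] and mv[y-1][x]:
            // a naive parallel loop over the blocks is no longer legal.
            mv[y][x] = refine(predict(mv, y, x), y, x);
    std::printf("mv[%d][%d] = (%d, %d)\n", H - 1, W - 1,
                mv[H - 1][W - 1].dx, mv[H - 1][W - 1].dy);
}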

Avoiding The Bugs: Threads executing at the same time and accessing data at the same time introduce a new class of bugs that are really difficult to reason about and, in some cases, to reliably find. These are called "data races", termed such because the result of the computation often depends on the outcome of multiple threads racing to write a shared object. Programmers have to be careful to make sure that any accesses to shared data objects are well coordinated. There are some pretty subtle issues here, but even a simple example illustrates the complex interactions. If two threads each attempt to swap two variables, one of which is shared, as illustrated in the code below, several outcomes are possible. First, the two swaps might not overlap at all, so the outcome is correct. However, there are several possible interleavings should they overlap. One bad outcome is when the assignments "*B = *A" (both writing the shared "a") are interleaved between the two swaps: we'd end up losing the value of either "c" or "b".

void swap(int *A, int *B) {
    int tmp = *B;
    *B = *A;
    *A = tmp;
}

int a, b, c;  /* "a" is shared between the two threads */

/* thread 1 */
swap(&c, &a);

/* thread 2 */
swap(&b, &a);
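One conventional fix, sketched below in C++, is to guard the shared variable with a lock so the two swaps cannot interleave (a mutex is just one approach; the initial values here are arbitrary):

// Guarding the shared variable with a mutex removes the data race.
#include <cstdio>
#include <mutex>
#include <thread>

int a = 1, b = 2, c = 3;
std::mutex a_lock;  // protects the shared variable "a"

void locked_swap(int *A, int *B) {
    std::lock_guard<std::mutex> guard(a_lock);  // serializes conflicting swaps
    int tmp = *B;
    *B = *A;
    *A = tmp;
}

int main() {
    std::thread t1([] { locked_swap(&c, &a); });
    std::thread t2([] { locked_swap(&b, &a); });
    t1.join();
    t2.join();
    std::printf("a=%d b=%d c=%d\n", a, b, c);
}

The order of the two swaps still varies from run to run, so the printed values differ between runs, but every outcome now corresponds to some sequential ordering of whole swaps; no value is ever lost.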

Tuning Performance: When we worked on the video compression project I mentioned earlier, we probably spent 95% of our time tuning performance. Why? Well, in the world of multi-cores, things like cache behavior, how the cores are connected, and so on make a much bigger difference to performance. For example, in linear algebra algorithms (*), "blocking" is used to improve the caching behavior of the algorithm. In layman's terms, this means that we modify the algorithm to work progressively on chunks of the problem that fit in the various levels of the cache. So, it's useful to know what the cache sizes are. A sketch of blocking appears below.
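Here is a minimal sketch of blocking for matrix multiply, in C++. The matrix size and BLOCK are assumptions for illustration; in practice the block size is tuned so a few tiles fit in the targeted cache level (and N is assumed to be a multiple of BLOCK):

// Cache blocking ("tiling"): process the matrices one tile at a time so
// each tile's working set stays resident in cache while it is reused.
constexpr int N = 512;     // matrices are N x N, row-major; C is zeroed by the caller
constexpr int BLOCK = 64;  // assumed tile size; tuned to the cache in practice

void matmul_blocked(const double *A, const double *B, double *C) {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int kk = 0; kk < N; kk += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                // All reads and writes below stay within three tiles.
                for (int i = ii; i < ii + BLOCK; ++i)
                    for (int k = kk; k < kk + BLOCK; ++k)
                        for (int j = jj; j < jj + BLOCK; ++j)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}

Picking BLOCK well requires knowing the cache sizes, which is exactly where multi-core complicates things, as described next.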

With multi-core, this gets a little trickier because some of the cache levels are shared (e.g., the L2 cache in the Core 2 Duo). So, you have to know how much of the L2 cache you can use… and you may not even know what's running on the other core or how much cache it wants to use! Looking forward, this gets even more complex as we consider distributed shared caches, more complex ways of connecting cores, and so on.

(*) Why is linear algebra important? It pops up in a lot of places… for example, it permeates those video games your kids (or you) are playing.

Future Proofing: Sequential applications benefit from the increasing performance of single cores from processor generation to processor generation. The same should be true for parallel applications, but it's not so easy. Everything I said for item 3 above is exacerbated by increasing core counts: we know that cache sizes, interconnects, and so on will change; we just can't say exactly how for products we haven't yet begun to design. And this will affect performance.

For example, one of the things that will likely change is how many cycles it takes for cores to communicate with each other. More importantly, the amount of arithmetic a core can do while communicating with another core will probably increase over time. So, if I implement an algorithm that does 100 arithmetic operations (in 100 cycles) while synchronizing with another core, and the synchronization takes 50 cycles, then we'll keep the cores nice and busy. However, if I run that same binary on a system with a higher clock frequency, so that the relative synchronization time increases to 100 or more cycles, then I'll spend an increasing amount of time waiting around doing no arithmetic.
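The arithmetic above is easy to capture in a back-of-envelope model. The sketch below is my own illustration (not a measurement of any product) and assumes synchronization overlaps perfectly with arithmetic, so a core idles only when synchronization outlasts the compute:

// If sync overlaps compute, the busy fraction per iteration is
// compute / max(compute, sync).
#include <algorithm>
#include <cstdio>

double busy_fraction(double compute_cycles, double sync_cycles) {
    return compute_cycles / std::max(compute_cycles, sync_cycles);
}

int main() {
    std::printf("100 compute, 50 sync:  %3.0f%% busy\n", 100 * busy_fraction(100, 50));
    std::printf("100 compute, 100 sync: %3.0f%% busy\n", 100 * busy_fraction(100, 100));
    std::printf("100 compute, 200 sync: %3.0f%% busy\n", 100 * busy_fraction(100, 200));
}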

Even without knowing what future architectures precisely look like, we still must be able to future-proof our code through “forward-scaling” programming models, tools, and algorithms.

Using Modern Programming Methods: I learned to program in the "loop-y" days of C and Fortran. This meant that the compute-intensive parts of most programs were confined to well-understood, easy-to-identify chunks of code (typically, loops) that we call "kernels". How things have changed…

These days, most software development happens in an object-oriented programming language like C++, Java, C#, Ruby, or Python. The abstraction methods that make these languages so attractive to software developers (and, indeed, arguably raised software engineering to an art) also make it quite difficult to find the compute-intensive code we want to target for parallelization. Instead of loops, we have highly abstracted iterators, which are themselves composed of many (virtual) function calls. Moreover, processing a large collection of data (which we parallelization folks look for) often involves traversing object hierarchies and complex data structures in varying and unpredictable orders. These aren't the loops your parents knew and loved… and I don't think we are going to (or should) reverse this trend. The contrast is sketched below.
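To illustrate the contrast, here is a small C++ sketch (the Particle type and its step method are hypothetical). The first version is the old "loop-y" kernel, where the hot loop is obvious; the second does the same work behind an iterator-based traversal of a container, the kind of abstraction that hides the kernel:

#include <algorithm>
#include <vector>

struct Particle {
    float x, v;
    void step(float dt) { x += v * dt; }
};

// The "loop-y" kernel: the compute-intensive loop is easy to spot.
void step_all_loop(Particle *p, int n, float dt) {
    for (int i = 0; i < n; ++i)
        p[i].step(dt);
}

// The abstracted version: iterators and a member call hide the same kernel.
// Real code often traverses object graphs rather than flat arrays, making
// the parallelism far harder to spot.
void step_all_abstract(std::vector<Particle> &ps, float dt) {
    std::for_each(ps.begin(), ps.end(), [dt](Particle &p) { p.step(dt); });
}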

Here’s a slide presented by a game developer at a data parallel programming workshop we co-sponsored earlier in the year that highlights this issue.

[Slide image: gameslide.png]

[Editor's note: this thread continues in Anwar's next blog]

