Monday, 29 March 2010

Easy OpenCL Library Makes Feeding Babies More Difficult Than Writing OpenCL

Ha ha! Yeah, right! It's interesting to see libraries become available for OpenCL. I've seen two so far: Simplified OpenCL (SOCL) from Matt Reilly and libstdcl from Brown Deer Technology. Both are linked from the AMD page here.

How useful are these? I have to confess I don't know. I want to get to grips with the nuts and bolts of OpenCL and parallel programming before I give in to temptation and use other libraries.

The other thing is that it's easy to get confused. You still have to learn the wrapper library, and if you do that without understanding what's going on under the hood, it just makes solving problems more difficult.

So I think I'll wait until I really feel comfortable with the OpenCL APIs. At that stage, when things start to feel a bit repetitive and tiresome, I'll check out those libraries to see how they can make life easier.

There's another point I'd like to make. Libraries are undoubtedly useful for performing standard tasks. For example, NVIDIA provides the CUDPP library for CUDA. That could do everything I want in the scan example. After all, I just want to add a few numbers together. But for other tasks, if you want to make the most of the acceleration you can get from the GPU, it seems to me that there is no easy option. You have to think deeply about your problem.

Imagine that you are the manager at a distribution centre. Until now you have had one worker assigned to you. One day, your boss arrives and tells you that the investment capital they always dreamed of is now available. Six thousand workers will be assigned under your control. Helping them realise their potential will require careful consideration.

Parallel computing is a new paradigm. You now have thousands of hardware threads at your disposal. But that doesn't mean you can just do thousands of the same thing the same old way at the same time. Instead, you have to reconsider everything. The resources available and the problem at hand are inextricably linked in realising the optimal solution. Libraries are another tool - albeit a very useful one - in your OpenCL arsenal.

And in any case, who said feeding babies was easy?

Friday, 12 March 2010

Parallel Programming Tutorials

Just had a look at the Parallel Programming Tutorial Series at multicoreinfo.com and it looks really good.

I'm a newbie at OpenCL. But then isn't everyone?

What I'm discovering is that the tricky part is the algorithms. In my twenty years as an IT professional, I've hardly ever had to deal with something that looked like a real algorithm. Almost everything in Java land had already been written; all you needed to do was call whatever the API was.

In parallel computing and therefore, in OpenCL, algorithms are the core of what you do. Setting up your contexts, kernels, read and write buffers, etc., all that stuff is the easy bit.

My impression is that writing the kernel algorithms is the key skill. My guess is that if you can master a fundamental suite of parallel computing algorithms - in the same way that with Java you master common design patterns - then you will be golden.

If this is indeed the case, my hope would be that these fundamental algorithms can be applied to solve most of the common parallel computing problems.

But that's a lot of hoping and guessing. Let's see how it goes.

Here's some new stuff I just found on the NVIDIA site. There are PDFs and WMV movie files from older web seminars, and you can register for forthcoming events.

GPU Computing Online Seminars

More tutorials and resources to follow when I get a mo...

Debugging OpenCL

"The dusk was repeating them (his very last words) in a persistent whisper all around us, in a whisper that seemed to swell menacingly like the first whisper of a rising wind. 'The horror! the horror!'"

Heart of Darkness by Joseph Conrad.

I could sit here and tell you that you just set a nice little breakpoint wherever you fancy in your OpenCL kernel, then mouse over whatever variable you like to inspect it. But then, I'd be lying.

The sad truth is that debugging OpenCL kernels is a frustrating, black-box process. Yes, you can put a printf statement in there, provided you run on the CPU rather than the GPU. The problem is, your work-group size will have to be one. To me, that seems to render printf useless for all but the simplest kernels.

What I have discovered to be useful is to add extra global memory parameters solely for the purpose of debugging. Write out whatever variables you're interested in to these. It's almost as good as printf!
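As a sketch of what I mean (the kernel and variable names here are made up for illustration, not from any SDK):

```c
__kernel void myKernel(__global float *output,
                       __global float *input,
                       __global float *dbg)   // extra buffer purely for debugging
{
    int gid = get_global_id(0);

    float intermediate = input[gid] * 2.0f;   // whatever your kernel computes

    dbg[gid] = intermediate;                  // snapshot the value you want to inspect

    output[gid] = intermediate + 1.0f;
}
```

After the kernel has run, read dbg back with clEnqueueReadBuffer and print it out on the host. Crude, but it works on the GPU, which printf doesn't.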

All that said, there is light at the end of the tunnel. Personally, I'm gagging to get my hands on a beta version of the gDEBugger tool promised by Graphic Remedy. If it does what it says on the tin, it will be a fantastically useful product.

Scan? Which Scan?

I always thought a scan was something you went for if you had stomach pains.

That was, until I posted a question on macresearch.org about how to do cumulative addition.

Seems easy enough. You have five numbers and you want to add them in a cumulative fashion. The result will also be five numbers, each being the running sum up to and including its position.

For example, let's say you have 1, 3, 2, 4, 1.

The result will be 1, 1 + 3, 1 + 3 + 2, 1 + 3 + 2 + 4, and 1 + 3 + 2 + 4 + 1.

Not too shabby. In good ol' fashioned procedural code you could code that faster than a blindfolded fanatical cave dwelling warlord could re-assemble a Kalashnikov. But not so with parallel computing. For that, you need a scan.

But which scan? Inclusive or exclusive? Do you use the Hillis and Steele algorithm (thanks Geoffrey), or do you go for something even faster, as documented in NVIDIA's scan.pdf document? Not only that, but in your SDK folder you'll find about four different scan kernel examples. Apple have some too.

Some are naive, others are optimised. Some will handle big arrays, others small.

It's a bit like your wife asking for a cosmetic set for Christmas. You find yourself in a strange aisle of the city centre pharmacy surrounded by a multitude of threatening and alien concepts. You start sweating. Women begin eyeing you strangely. In-store cameras rotate and are now trained on you. Panic sets in.

What can you do? What can save you from all this, and more importantly, what can save you from yourself? Stay tuned for the next exciting episode of....The Joy of OpenCL.

Thursday, 11 March 2010

Parallel Prefix Sum (Scan) with CUDA

Right, here goes! The first problem I have is with this example. Before reading any of this, make sure you have studied the scan.pdf document by Mark Harris detailing the Parallel Prefix Sum example. Then consider the issues below.

It's worth saying that I am 99.99% confident that the problem is my understanding, not the example! Mark is most likely a parallel computing expert, and he works for NVIDIA. I on the other hand am a humble newbie trying to learn the trade.

Actually, perhaps I'll email Mark and ask him to have a look at this. It would be great to get some authoritative feedback.

Anyway let's get to the code. Converting to OpenCL seems pretty straightforward....

__kernel void scanExclusiveNaive(
    __global uint *d_Dst,
    __global uint *d_Src,
    __local  uint *l_Data,
    uint n
){
    int thid = get_local_id(0);
    int offset = 1;

    int ai = thid;
    int bi = thid + (n / 2);

    l_Data[ai] = d_Src[ai];
    l_Data[bi] = d_Src[bi]; // [1]

    // Up-sweep (reduce) phase: build sums in place up the tree
    for (int d = n >> 1; d > 0; d >>= 1) {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (thid < d) {
            int ai = offset * (2 * thid + 1) - 1;
            int bi = offset * (2 * thid + 2) - 1;
            l_Data[bi] += l_Data[ai];
        }
        offset *= 2;
    }

    // Down-sweep phase
    // clear the last element
    if (thid == 0) {
        l_Data[n - 1] = 0;
    }

    // traverse down the tree and build the scan
    for (int d = 1; d < n; d *= 2) {
        offset >>= 1;
        barrier(CLK_LOCAL_MEM_FENCE);
        if (thid < d) {
            int ai = offset * (2 * thid + 1) - 1;
            int bi = offset * (2 * thid + 2) - 1;
            uint t = l_Data[ai];
            l_Data[ai] = l_Data[bi];
            l_Data[bi] += t;
        }
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    d_Dst[ai] = l_Data[ai]; // [2]
    d_Dst[bi] = l_Data[bi];
}


But here are a couple of weird things. First, look at the code marked by [1]. bi is calculated as:


int bi = thid + (n/2);


That means if we are processing, say, 32 numbers and our thread ID thid is 31, bi will index into global memory at:


31 + (32 / 2) = 47


Index [47] does not exist, so hopefully at worst the read comes back as zero (strictly speaking, an out-of-bounds read is undefined behaviour).


Next, at the end of the kernel, each work-item writes two values to global memory at different places. To me, this seems flawed for a couple of reasons. Again, to reiterate, it is doubtless my understanding, not the code, that is flawed.


First, why not have each thread write one value? That would give coalesced writes, which are more efficient.


I can only guess that this example is intended to process more data than there are threads, for example 16 threads performing a scan over 32 numbers. That's why each thread must write two values.


Curiously, when I ran the kernel multiple times, it generated spurious values. Running once was fine; running more than once gave incorrect results. Changing the kernel to write one value at index [thid] resolved the issue:


// d_Dst[ai] = l_Data[ai];
// d_Dst[bi] = l_Data[bi];
d_Dst[thid] = l_Data[thid]; // write results to device memory

Getting Started With OpenCL

To learn OpenCL, I have been busy working through some of the examples provided by Apple and NVIDIA. My initial experience of working through the code has induced suicidal thoughts, deep soul searching about my ability as a software developer, and face slapping exercises in front of the mirror combined with primal scream therapy.

Having worked through this, I have come to the conclusion that OpenCL is no more difficult, or easy, than anything else. It's simply new. Anything new always appears difficult at first. Keep going, and you will master it.

Part of the problem has been help, or rather, the lack of it. Macresearch.org are brilliant and they have some great tutorials. You can also find some good stuff in the NVIDIA SDK examples.
The issue is that when you start working on things on your own, with specific programming issues related to your parallel algorithms, the forums often go dead.

It's not surprising. Answering a question about how to call a kernel, or why your program won't compile is relatively straightforward. Sitting down and spending several hours working through your code to find out why it doesn't work, that's another matter.

Anyway, I digress. What I want to do here is work through some OpenCL sample code and encourage collaboration. If I'm wondering why a bit of code is the way it is, perhaps others are too. Hopefully we can help each other.