Monday 29 March 2010
Easy OpenCL Library Makes Feeding Babies More Difficult Than Writing OpenCL
How useful are these wrapper libraries? I have to confess I don't know. I want to get to grips with the nuts and bolts of OpenCL and parallel programming before I give in to temptation and use them.
The other thing is that it's easy to get confused. You have to learn the wrapper library. If you do that without understanding what's going on under the hood, it just makes solving problems more difficult.
So I think I'll get to the point where I really feel comfortable with the OpenCL APIs. At that stage, when things start to feel a bit repetitive and tiresome, I'll check out those libraries to see how they can make life easier.
There's another point I'd like to make. Libraries are undoubtedly useful for performing standard tasks. For example, NVIDIA provide the CUDPP library for CUDA. That could do everything I want in the scan example; after all, I just want to add a few numbers together. But for other tasks, if you want to make the most of the acceleration you can get from the GPU, it seems to me that there is no easy option. You have to think deeply about your problem.
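To be fair, the CUDPP route really does reduce the scan to a couple of calls. Here's a minimal sketch, based on the CUDPP 1.x API as I understand it (the wrapper function and variable names are my own):

#include <cudpp.h>

/* Sketch only: exclusive sum scan of numElements floats using CUDPP.
   d_out and d_in are device pointers already allocated and filled
   via the CUDA runtime. */
void scan_with_cudpp(float *d_out, const float *d_in, size_t numElements)
{
    CUDPPConfiguration config;
    config.op        = CUDPP_ADD;
    config.datatype  = CUDPP_FLOAT;
    config.algorithm = CUDPP_SCAN;
    config.options   = CUDPP_OPTION_FORWARD | CUDPP_OPTION_EXCLUSIVE;

    CUDPPHandle plan = 0;
    if (cudppPlan(&plan, config, numElements, 1, 0) == CUDPP_SUCCESS) {
        cudppScan(plan, d_out, d_in, numElements); /* the whole scan */
        cudppDestroyPlan(plan);
    }
}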
Imagine that you are the manager at a distribution centre. Until now you have had one worker assigned to you. One day, your boss arrives and tells you that the investment capital they always dreamed of is now available. Six thousand workers will be placed under your control. Helping them realise their potential will require careful consideration.
Parallel computing is a new paradigm. You now have thousands of hardware threads at your disposal. But that doesn't mean you can just do thousands of the same thing the same old way at the same time. Instead, you have to reconsider everything. The resources available and the problem at hand are inextricably linked in realising the optimal solution. Libraries are another tool - albeit a very useful one - in your OpenCL arsenal.
And in any case, who said feeding babies was easy?
Friday 12 March 2010
Parallel Programming Tutorials
I'm a newbie at OpenCL. But then isn't everyone?
What I'm discovering is that the tricky part is the algorithms. In my twenty years of experience as an IT Professional, I've hardly ever had to deal with something that looked like a real algorithm. Almost everything in Java land had already been written. All you needed to do was to call whatever the API was.
In parallel computing and therefore, in OpenCL, algorithms are the core of what you do. Setting up your contexts, kernels, read and write buffers, etc., all that stuff is the easy bit.
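To show just how mechanical that part is, here's a stripped-down sketch of the host-side setup, with all error checking omitted. kernelSource, hostData and n are placeholders of my own, and the kernel name is borrowed from the scan post below:

#include <CL/cl.h>

/* Minimal host-side plumbing: one GPU device, one queue, one kernel.
   Error checking omitted for brevity. */
void setup(const char *kernelSource, const cl_uint *hostData, size_t n)
{
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);

    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

    cl_program program =
        clCreateProgramWithSource(context, 1, &kernelSource, NULL, NULL);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "scanExclusiveNaive", NULL);

    /* Read and write buffers */
    cl_mem d_Src = clCreateBuffer(context,
                                  CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  n * sizeof(cl_uint), (void *)hostData, NULL);
    cl_mem d_Dst = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                  n * sizeof(cl_uint), NULL, NULL);

    (void)queue; (void)kernel; (void)d_Src; (void)d_Dst; /* used later on */
}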
My impression is that writing the kernel algorithms is the key skill. My guess is that if you can master a fundamental suite of parallel computing algorithms - in the same way that with Java you master common design patterns - then you will be golden.
If this is indeed the case, my hope would be that these fundamental algorithms can be applied to solve most of the common parallel computing problems.
But that's a lot of hoping and guessing. Let's see how it goes.
Here's some new stuff I just found on the NVIDIA site. There are PDFs and WMV movie files from older web seminars, and you can register for forthcoming events.
GPU Computing Online Seminars
More tutorials and resources to follow when I get a mo...
Debugging OpenCL
Recommended reading: Heart of Darkness by Joseph Conrad.
I could lie here and tell you that you just set a nice little breakpoint wherever you fancy in your OpenCL kernel then mouse over whatever variable to inspect it. But then, I'd be lying.
The sad truth is that debugging OpenCL kernels is a frustrating, black-box-like process. Yes, you can put a printf statement in there if you run on the CPU rather than the GPU. The problem is, your work-group size will have to be one. To me, that seems to render printf useless for all but the simplest kernels.
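For completeness, that approach looks something like this. It assumes your CPU implementation supports printf inside kernels, which not all do at the time of writing:

// Only sane with a work-group size of one, on a CPU device whose
// OpenCL implementation supports printf inside kernels.
__kernel void peek(__global const uint *src)
{
    uint gid = get_global_id(0);
    printf("work item %u saw value %u\n", gid, src[gid]);
}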
What I have discovered to be useful is to add extra global memory parameters solely for the purpose of debugging. To these, write out whatever variables you're interested in. It's almost as good as printf!
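A minimal sketch of the trick, with a hypothetical dbg parameter bolted on purely for debugging:

// The dbg buffer exists only for debugging: each work item snapshots
// an intermediate value, and the host reads the buffer back to inspect it.
__kernel void myKernel(__global uint *dst,
                       __global const uint *src,
                       __global uint *dbg)
{
    uint gid = get_global_id(0);
    uint intermediate = src[gid] * 2u; // stand-in for the real work
    dbg[gid] = intermediate;           // write it out for the host to see
    dst[gid] = intermediate + 1u;
}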
All that said, there is light at the end of the tunnel. Personally I'm gagging to get my hands on a beta version of the gDEBugger tool promised by Graphic Remedy. If it does what it says on the tin it will be a fantastically useful product.
Scan? Which Scan?
Cumulative addition. It seemed easy enough, at least until I posted a question on macresearch.org about how to do it in OpenCL.
You have five numbers and you want to add them in a cumulative fashion. The result will also be five numbers, each being the sum of everything up to and including its position.
For example, let's say you have 1, 3, 2, 4, 1.
The result will be 1, 1 + 3, 1 + 3 + 2, 1 + 3 + 2 + 4, and 1 + 3 + 2 + 4 + 1; in other words, 1, 4, 6, 10, 11.
Not too shabby. In good ol' fashioned procedural code you could write that faster than a blindfolded fanatical cave-dwelling warlord could re-assemble a Kalashnikov. But not so with parallel computing. For that, you need a scan.
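To prove the point, here's the whole job in plain, sequential C:

/* Cumulative (running) sum: out[i] = in[0] + ... + in[i]. */
void cumulative_sum(const int *in, int *out, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += in[i];
        out[i] = sum;
    }
}
/* 1, 3, 2, 4, 1  ->  1, 4, 6, 10, 11 */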
But which scan? Inclusive or exclusive? Do you use the Hillis and Steele algorithm (thanks Geoffrey), or do you go for something even faster, as documented in the NVIDIA scan.pdf document? Not only that, but in your SDK folder you'll find about four different scan kernel examples. Apple have some too.
Some are naive, others are optimised. Some will handle big arrays, others small.
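The inclusive/exclusive part, at least, is easy to pin down. The cumulative sum above is the inclusive flavour: each element is included in its own total. The exclusive flavour shifts everything along by one, so each output is the sum of everything strictly before it. In sequential C again:

/* Exclusive scan: out[i] = in[0] + ... + in[i-1], so out[0] is 0
   and the grand total falls off the end. */
void exclusive_sum(const int *in, int *out, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        out[i] = sum;  /* everything before position i */
        sum += in[i];
    }
}
/* 1, 3, 2, 4, 1  ->  0, 1, 4, 6, 10 */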
It's a bit like your wife asking for a cosmetic set for Christmas. You find yourself in a strange aisle of the city-centre pharmacy, surrounded by a multitude of threatening and alien concepts. You start sweating. Women begin eyeing you strangely. In-store cameras rotate and are now trained on you. Panic sets in.
What can you do? What can save you from all this, and more importantly, what can save you from yourself? Stay tuned for the next exciting episode of....The Joy of OpenCL.
Thursday 11 March 2010
Parallel Prefix Sum (Scan) with CUDA
Here's the exclusive scan kernel I've been playing with: an OpenCL take on the up-sweep/down-sweep tree algorithm described in NVIDIA's scan.pdf.
__kernel void scanExclusiveNaive(
    __global uint *d_Dst,
    __global uint *d_Src,
    __local uint *l_Data,
    uint n
){
    int thid = get_local_id(0);
    int offset = 1;

    // Each work item loads two elements into local memory
    int ai = thid;
    int bi = thid + (n/2);
    l_Data[ai] = d_Src[ai];
    l_Data[bi] = d_Src[bi]; // [1]

    // Up-sweep (reduce) phase: build the sum in place up the tree
    for (int d = n >> 1; d > 0; d >>= 1) {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (thid < d) {
            int ai = offset * (2 * thid + 1) - 1;
            int bi = offset * (2 * thid + 2) - 1;
            l_Data[bi] += l_Data[ai];
        }
        offset *= 2;
    }

    // Down-sweep phase: clear the last element...
    if (thid == 0) {
        l_Data[n - 1] = 0;
    }

    // ...then traverse down the tree and build the scan
    for (int d = 1; d < n; d *= 2) {
        offset >>= 1;
        barrier(CLK_LOCAL_MEM_FENCE);
        if (thid < d) {
            int ai = offset * (2 * thid + 1) - 1;
            int bi = offset * (2 * thid + 2) - 1;
            uint t = l_Data[ai];
            l_Data[ai] = l_Data[bi];
            l_Data[bi] += t;
        }
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    // Each work item writes its two results back to global memory
    d_Dst[ai] = l_Data[ai]; // [2]
    d_Dst[bi] = l_Data[bi];
}
But here are a couple of weird things. First, look at the code marked [1]. bi is calculated as:
int bi = thid + (n/2);
That means that if we are processing, say, 32 numbers and our thread ID thid reaches 31, bi will index global memory at:
31 + (32 / 2) = 47
Element [47] does not exist, so hopefully at worst it will read as zero.
Next, at the end of the kernel, each work item writes two values to global memory in different places. To me, this seems flawed for a couple of reasons. To be clear, it is doubtless my understanding, not the code, that is flawed.
First, why not have each thread write one value? This would mean coalesced writes, which are more efficient.
I can only guess that this example is intended to process an array larger than the number of work items: for example, 16 threads performing a scan over 32 numbers. That's why each thread must write two values.
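If that guess is right, the host would launch the kernel with n/2 work items per work group, so that thid tops out at 15 for 32 numbers and bi at 31. Something like this (my reconstruction, not the SDK's host code; kernel, queue, d_Dst, d_Src and a cl_uint n are assumed to exist already):

// Assumed launch: n/2 work items, each loading two of the n elements,
// so bi = thid + n/2 stays within [0, n-1].
size_t localSize  = n / 2;
size_t globalSize = n / 2; // a single work group in this sketch
clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_Dst);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_Src);
clSetKernelArg(kernel, 2, n * sizeof(cl_uint), NULL); // __local l_Data
clSetKernelArg(kernel, 3, sizeof(cl_uint), &n);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize,
                       0, NULL, NULL);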
Curiously, when I ran the kernel multiple times, it generated spurious values. Running it once was fine; running it more than once gave incorrect results. Changing the kernel to write one value at index [thid] resolved the issue:
// d_Dst[ai] = l_Data[ai];
// d_Dst[bi] = l_Data[bi];
d_Dst[thid] = l_Data[thid]; // write results to device memory