Source for this entry can be found at: https://github.com/NeilRobbins/CudaHack/tree/master/arrayscalar
In the last post, I demonstrated that you had
We called that method and executed some code on the device to do a really trivial task, which gave us no benefit at all (apart from perhaps treating the GPU as "yet another CPU" in our machine).
The GPU is NOT just an additional CPU though. It has an entirely different architecture (to my understanding), which effectively boils down to this: it was originally created for the single purpose of performing a series of mathematical operations on a large set of data in parallel.
Let’s take a trivial example, again on the CPU:
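As a rough sketch of the kind of CPU code meant here: the function name `addOneToEach` and the add-one operation are assumptions for illustration and may not match the linked source. It is plain host code, so it compiles as-is under NVCC.

```cuda
// Hypothetical CPU version: walks the array serially and adds one to each
// element in place. The name and the operation are illustrative assumptions.
void addOneToEach(int* numbers, int length) {
    // Each iteration reads and writes only numbers[x], so no iteration
    // depends on the result of any other.
    for (int x = 0; x < length; x++) {
        numbers[x] += 1;
    }
}
```

Calling it with an array of ten ints and a length of 10 leaves every element one larger than it started.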
The key aspect of this operation to note is that each result in that array is computed independently of the others; that is, instead of performing the loop in serial, this task could easily be parallelised. In C# it looks something like this:
The problem with the above is that it only takes advantage of the CPU, and the benefits very seldom outweigh the overhead of setting up the parallel operation in the first place. Parallelisation is only really useful when you have sets of data large enough to justify it, and hardware capable of massively parallelising that task rather than, say, creating 2-3 threads for it.
Anyway, as in the last entry, let’s slowly start to modify this method and get it running on the GPU. First up, let’s look at how we’re actually going to invoke this method. This is what it looks like in our CPU implementation:
The first thing of note is that we can’t pass numbers straight into our GPU implementation, because it is declared in system memory rather than device memory. So our first job is to create a buffer on the device and copy our original data into that buffer:
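A minimal sketch of that step, using the CUDA runtime calls `cudaMalloc` and `cudaMemcpy`. The helper name `copyToDevice` and its error handling are assumptions for illustration, not something taken from the linked source.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: allocates a buffer on the device and copies `length`
// ints from host (system) memory into it. Returns nullptr if either step fails.
int* copyToDevice(const int* hostNumbers, int length) {
    int* deviceNumbers = nullptr;
    size_t bytes = length * sizeof(int);

    // Reserve space in device memory.
    if (cudaMalloc((void**)&deviceNumbers, bytes) != cudaSuccess) {
        return nullptr;
    }

    // Copy host -> device; the last argument gives the direction of the copy.
    if (cudaMemcpy(deviceNumbers, hostNumbers, bytes, cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(deviceNumbers);
        return nullptr;
    }

    return deviceNumbers;  // the caller releases this with cudaFree
}
```

The returned pointer is only meaningful on the device: the host can hand it to a kernel or to `cudaMemcpy`, but should not dereference it directly.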
Next up is something we glossed over in the last entry: the numbers inside the strange <<<1,1>>> syntax for invoking the global method.
Without going into too much detail, it is this syntax and those numbers that determine first how many blocks the work is split across, and second how many threads run within each block.
Ignoring threads for now, a good first step would be to say that as we have 10 numbers, we can split our operation across 10 blocks, and parallelise it that way, like so:
Now, you’ll notice I’ve removed the “length” parameter from the function call, and that’s because we need to make a small change to our original method definition.
In our CPU implementation, the method itself was responsible for iterating through the array and performing the calculations – in our GPU implementation, the iteration task has been replaced by a parallelisation task and the GPU is going to be responsible for calling our method however many times (per block/thread) is necessary, so check this out:
First up, we already know about __global__ as a way of allowing our CPU code to invoke GPU code. What should stand out here is that we have gotten ‘x’ from a magical built-in variable, blockIdx, that hasn’t actually been declared anywhere. This is another NVCC peculiarity: its value comes from the numbers we used inside the angle brackets to set up the call in the first place.
The method will get called 10 times, with x being every value between 0 and 9 inclusive – and in parallel. Because we told it to.
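As a minimal sketch of a kernel in that shape: the name `addOneToEachKernel` and the add-one operation are assumptions for illustration. It assumes a launch with exactly one block per array element, such as `<<<10, 1>>>` for ten numbers.

```cuda
// Hypothetical kernel: no loop and no length parameter. Each block handles
// the single element whose index matches its own block index.
__global__ void addOneToEachKernel(int* numbers) {
    // blockIdx is supplied by the runtime; with <<<10, 1>>> its x component
    // takes each value from 0 to 9, once per block.
    int x = blockIdx.x;
    numbers[x] += 1;
}
```

Because there is no bounds check, launching more blocks than there are elements would write past the end of the buffer, so the block count has to match the array length.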
blockIdx is actually a vector containing x, y and z, and that gives us the ability to divide up our parallel operation in a multitude of ways that make sense to our logic. I’ll talk more about that in the coming entries.
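To give a small taste of the extra components, here is a hedged sketch that launches a 4 x 3 grid of blocks and has each block record its own position. The names `recordBlockPosition` and `fillBlockPositions`, the grid size, and the `y * 100 + x` encoding are all made up for illustration; `dim3` and `gridDim` are standard CUDA built-ins.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each block writes an encoding of its (x, y) position
// into a row-major array that is gridDim.x entries wide.
__global__ void recordBlockPosition(int* out) {
    int index = blockIdx.y * gridDim.x + blockIdx.x;
    out[index] = blockIdx.y * 100 + blockIdx.x;
}

// Launches a 4 x 3 grid of single-thread blocks and copies the 12 results
// into hostOut. Returns true if the CUDA calls succeeded.
bool fillBlockPositions(int* hostOut) {
    const int columns = 4;
    const int rows = 3;
    size_t bytes = columns * rows * sizeof(int);

    int* deviceOut = nullptr;
    if (cudaMalloc((void**)&deviceOut, bytes) != cudaSuccess) {
        return false;
    }

    // blockIdx.x runs over 0..3, blockIdx.y over 0..2, blockIdx.z stays 0.
    dim3 blocks(columns, rows);
    recordBlockPosition<<<blocks, 1>>>(deviceOut);

    // This copy waits for the kernel to finish before reading the results.
    bool ok = cudaMemcpy(hostOut, deviceOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(deviceOut);
    return ok;
}
```

The same twelve blocks could have been launched as a flat `<<<12, 1>>>`; the 2D form just saves the kernel from working out rows and columns by hand.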
Our entire program therefore looks something like this:
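Putting the pieces together, a sketch of the whole flow might look like the following. It is wrapped in a helper rather than `main` so it can be called from anywhere; the names `addOneToEachKernel` and `addOneOnGpu`, the add-one operation, and the error handling are assumptions for illustration rather than the code in the linked source.

```cuda
#include <cuda_runtime.h>

// Assumed kernel: one block per element, adding one in place.
__global__ void addOneToEachKernel(int* numbers) {
    int x = blockIdx.x;
    numbers[x] += 1;
}

// Hypothetical host-side driver: copy in, launch, copy out, free.
// Returns true only if every CUDA call succeeded.
bool addOneOnGpu(int* numbers, int length) {
    size_t bytes = length * sizeof(int);
    int* deviceNumbers = nullptr;

    // Create a buffer on the device.
    if (cudaMalloc((void**)&deviceNumbers, bytes) != cudaSuccess) {
        return false;
    }

    // Copy the original data from system memory into the device buffer.
    bool ok = cudaMemcpy(deviceNumbers, numbers, bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        // One block per element, one thread per block.
        addOneToEachKernel<<<length, 1>>>(deviceNumbers);

        // Check the launch, then copy the results back to the host.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(numbers, deviceNumbers, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(deviceNumbers);
    return ok;
}
```

A `main` would then only need to declare the ten numbers, call `addOneOnGpu(numbers, 10)` and print the array. Note that a length of zero makes the launch invalid, so the helper returns false in that case.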
2015 © Rob Ashton. ALL Rights Reserved.