CUDA - A basic parallelised task

Published on 2011-01-21

Source for this entry can be found at: https://github.com/NeilRobbins/CudaHack/tree/master/arrayscalar

In the last post, I demonstrated how to mark a method up with __global__ and invoke it from the CPU. We called that method and executed some code on the device to do a really trivial task – giving us no benefit at all (apart from perhaps treating the GPU as "yet another CPU" in our machine).

The GPU is NOT just an additional CPU though: it has an entirely different architecture (to my understanding), which effectively boils down to the fact that it was originally created for the single purpose of performing a series of mathematical operations on a large set of data in parallel.

Let’s take a trivial example, again on the CPU:

void multiplyNumbersByAScalar(float numbers[], int length, float scalar)
{
    for(int x = 0; x < length; x++) {
        numbers[x] = numbers[x] * scalar;
    }
}

The key aspect of this operation to note is that each result in that array is computed independently of the others – that is, instead of performing the loop serially, this task could easily be parallelised. In C# it looks something like this:

numbers.AsParallel().Select(x => x * scalar).ToArray();

The problem with the above is that it only takes advantage of the CPU, and the benefits very seldom outweigh the overhead of setting up the parallel operation in the first place. Parallelisation is only really useful when you have sets of data large enough to justify it, and hardware capable of massively parallelising the task rather than, say, creating 2-3 threads for it.

Anyway, as in the last entry, let's slowly modify this method and get it running on the GPU. First up, let's look at how we're actually going to invoke this method – this is what it looks like in our CPU implementation:

float numbers[] = { 0, 1, 2, 3, 4, 5 };
multiplyNumbersByAScalar( numbers, 6, 2.0f );

The first thing to note is that we can't pass numbers straight into our GPU implementation, because it is declared in system memory – so our first job is to create a buffer on the device and copy our original data into that buffer:

float numbersInSystemMemory[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
float* numbersInDeviceMemory;
 
// Allocate some memory on the device
cudaMalloc( (void**)&numbersInDeviceMemory, sizeof(float) * 10);
 
// And upload our data from system memory to device memory
cudaMemcpy( numbersInDeviceMemory, numbersInSystemMemory, sizeof(float) * 10, cudaMemcpyHostToDevice );
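
One thing worth mentioning in passing – and this is a sketch, not something the code above does – cudaMalloc and cudaMemcpy both return a cudaError_t, and checking it can save a lot of head-scratching when something goes wrong:

cudaError_t result = cudaMalloc( (void**)&numbersInDeviceMemory, sizeof(float) * 10);
if(result != cudaSuccess) {
    // cudaGetErrorString turns the error code into a readable message
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(result));
    return 1;
}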

Next up is something we glossed over in the last entry – the numbers inside the strange <<<1,1>>> syntax for invoking the global method.

Without going into too much detail, it is this syntax and those numbers that determine how the work gets divided up: the first number is the number of blocks to launch, and the second is the number of threads per block.

Ignoring threads for now, a good first step would be to say that as we have 10 numbers, we can split our operation across 10 blocks and parallelise it that way, like so:

multiplyNumbersByAScalar<<<10,1>>>(numbersInDeviceMemory, 2.0f);

Now, you'll notice I've dropped the "length" parameter from the function call – and that's because we need to make a small change to our original method definition.

In our CPU implementation, the method itself was responsible for iterating through the array and performing the calculations. In our GPU implementation, the iteration has been replaced by parallelisation: the GPU is going to call our method as many times as necessary (once per block/thread), so check this out:

__global__ void multiplyNumbersByAScalar(float numbers[], float scalar) {
        int x = blockIdx.x;
        numbers[x] = numbers[x] * scalar;
}

First up, we already know about __global__ as a way of allowing our CPU code to invoke GPU code. What should stand out here is that we have read 'x' from a magical local variable that hasn't actually been declared anywhere. This is another NVCC peculiarity: its value comes from the numbers we used inside the angle brackets to set up the call in the first place.

The method will get called 10 times, with x being every value between 0 and 9 inclusive – and in parallel. Because we told it to.
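
As an aside, this only works because our block count exactly matches the number of elements. If it didn't – say we launched 16 blocks for 10 numbers – we'd want to keep the length parameter after all and guard against out-of-range indices. A minimal sketch of that defensive variant:

__global__ void multiplyNumbersByAScalar(float numbers[], int length, float scalar) {
        int x = blockIdx.x;

        // Only blocks that map to a real element do any work
        if(x < length) {
                numbers[x] = numbers[x] * scalar;
        }
}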

blockIdx is actually a vector containing x, y and z – and that gives us the ability to divide up our parallel operation in a multitude of ways that make sense to our logic. I'll talk more about that in the coming entries.
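
For example – and this is just a sketch, since multi-dimensional grids are a topic for those coming entries – we could split our 10 elements across a 5x2 grid of blocks and combine the two block coordinates back into a single index:

__global__ void multiplyNumbersByAScalar2D(float numbers[], float scalar) {
        // gridDim is another built-in vector, holding the grid dimensions we chose at launch (5 and 2 here)
        int x = blockIdx.y * gridDim.x + blockIdx.x;
        numbers[x] = numbers[x] * scalar;
}

// Launched with CUDA's dim3 type in place of a plain block count:
// dim3 grid(5, 2);
// multiplyNumbersByAScalar2D<<<grid, 1>>>(numbersInDeviceMemory, 2.0f);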

Our entire program therefore looks something like this:

#include <stdio.h>

__global__ void multiplyNumbersByAScalar(float numbers[], float scalar) {
        
        // So yeah, this magic variable is given to us by CUDA magic
        int x = blockIdx.x;
        numbers[x] = numbers[x] * scalar;
}
 
 
int main(int argc, char** argv)
{
        float numbersInSystemMemory[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
        float* numbersInDeviceMemory;
 
        // Allocate memory on the device and upload our data to it
        cudaMalloc( (void**)&numbersInDeviceMemory, sizeof(float) * 10);
        cudaMemcpy( numbersInDeviceMemory, numbersInSystemMemory, sizeof(float) * 10, cudaMemcpyHostToDevice );
 
        // Call the method 10 times with values of x between 0 and 9 inclusive
        multiplyNumbersByAScalar<<<10,1>>>(numbersInDeviceMemory, 2.0f);
 
        // Copy the results back into system memory
        cudaMemcpy(  numbersInSystemMemory, numbersInDeviceMemory, sizeof(float) * 10, cudaMemcpyDeviceToHost );
        
        // And free the memory allocated
        cudaFree( numbersInDeviceMemory );
 
        // Standard C again
        for(int x = 0; x < 10; x++) {
                printf("%f ", numbersInSystemMemory[x]);
        }
 
 
        return 0;
}

The maximum value of x, by the way, is 65536 on my laptop – now that's what I call splitting my task up and parallelising it. On really large data sets, having this kind of functionality is really powerful.
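
If you want to know what that limit is on your own hardware, the runtime can tell you – a minimal sketch using cudaGetDeviceProperties:

#include <stdio.h>

int main(int argc, char** argv)
{
        // Ask the runtime about the first device
        cudaDeviceProp properties;
        cudaGetDeviceProperties( &properties, 0 );

        // maxGridSize holds the maximum number of blocks in each grid dimension
        printf("%s: max blocks in x = %d\n", properties.name, properties.maxGridSize[0]);

        return 0;
}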
