CUDA - A basic parallelised task

Published on 2011-1-21

Source for this entry can be found at: https://github.com/NeilRobbins/CudaHack/tree/master/arrayscalar

In the last post, I demonstrated that you had

System memory which is not accessible to the GPU
Device memory which is not accessible to the CPU
User defined methods marked with __global__ called from the CPU to run on the device

We called that method and executed some code on the server to do a really trivial task - giving us no benefit at all (apart from perhaps treating the GPU as "yet another CPU" in our machine).

The GPU is NOT just an additional CPU though, it has an entirely different architecture (to my understanding), which effectively boils down to that it was originally created for the single purpose of being able to perform a series of mathematical operations on a large set of data in parallel.

Let’s take a trivial example, again on the CPU:

void multiplyNumbersByAScalar(float[] numbers, int length, float scalar)

    for(int x = 0 ; x < length ; x++){

        numbers[x] = numbers[x] * scalar;

The key aspect of this operation to note, is that each result in that array is achieved independently of any others – that is, instead of performing a loop in serial, this task could easily be parallelised. In C# it looks something like this:

numbers.AsParallel().Select(x=> x * scalar).ToArray();

The problem with the above, is that it is only taking advantage of the CPU, and the benefits very seldom outweigh the overhead of setting up the parallel operation in the first place. Parallelisation is only really useful when you have sets of data large enough to justify it, and hardware capable of massively parallelising that task rather than say, creating 2-3 threads for it.

Anyway, as in the last entry, let’s slowly start to modify this method and get it running on the GPU, first up let’s look at how we’re actually going to invoke this method – this is what it looks like in our CPU implementation:

float numbers[] = { 0, 1, 2, 3, 4, 5 };

multiplyNumbersByAScalar( numbers, 6, 2.0f );

First thing of note, is that we can’t pass numbers into our GPU implementation when we write it, it is declared in system memory – so our first job is to create a buffer on the device and copy our original data into that buffer:

float numbersInSystemMemory[] = { 0, 1, 2 , 3 , 4 , 5 , 6 ,7 ,8 , 9};

float* numbersInDeviceMemory;

// Allocate some memory on the device

cudaMalloc( (void**)&numbersInDeviceMemory, sizeof(float) * 10);

// And upload our data from system memory to device memory

cudaMemcpy( numbersInDeviceMemory, numbersInSystemMemory, sizeof(float) * 10, cudaMemcpyHostToDevice );

Next thing up, is something we glossed over in the last entry – the numbers inside the strange <<<1,1>>> syntax for invoking the global method.

Without going into too much detail, it is this syntax and those numbers that determine first

How many “blocks” to separate the task into
How many “threads” to use per “block”

Ignoring threads for now, a good first step would be to say that as we have 10 numbers, we can split our operation across 10 blocks, and parallelise it that way, like so:

multiplyNumbersByAScalar<<<10,1>>>(numbersInDeviceMemory, 2.0f);

Now, you’ll notice I’ve dropped the “length” parameter into the function call – and that’s because we need to perform a small change to our original method definition.

In our CPU implementation, the method itself was responsible for iterating through the array and performing the calculations – in our GPU implementation, the iteration task has been replaced by a parallelisation task and the GPU is going to be responsible for calling our method however many times (per block/thread) is necessary, so check this out:

__global__ void multiplyNumbersByAScalar(float numbers[], float scalar) {

        int x = blockIdx.x;

        numbers[x] = numbers[x] * scalar;

First up, we already know about __global__ as a way of allowing our CPU code to invoke GPU code, what should stand out here is we have gotten ‘x’ from a magical local variable that hasn’t actually been declared anywhere. This is another NVCC peculiarity, and this has come from the numbers we used inside the angle brackets in order to set up the call in the first place.

The method will get called 10 times, with x being every value between 0 and 9 inclusive – and in parallel. Because we told it to.

BlockIdx is actually a vector, containing x y and z – and that gives us the ability to divide up our parallel operation in a multitude of ways that make sense to our logic – I’ll talk more about that in the coming entries.

Our entire program therefore looks something like this:

__global__ void multiplyNumbersByAScalar(float numbers[], float scalar) {

        // So yeah, this magic variable is given to us by CUDA magic

        int x = blockIdx.x;

        numbers[x] = numbers[x] * scalar;

int main(int argc, char** args)

        float numbersInSystemMemory[] = { 0, 1, 2 , 3 , 4 , 5 , 6 ,7 ,8 , 9};

        float* numbersInDeviceMemory;

        // Allocate memory on the device and upload our data to it

        cudaMalloc( (void**)&numbersInDeviceMemory, sizeof(float) * 10);

        cudaMemcpy( numbersInDeviceMemory, numbersInSystemMemory, sizeof(float) * 10, cudaMemcpyHostToDevice );

        // Call the method 10 times with values of x between 0 and 9 inclusive

        multiplyNumbersByAScalar<<<10,1>>>(numbersInDeviceMemory, 2.0f);

        // Copy the results back into system memory

        cudaMemcpy(  numbersInSystemMemory, numbersInDeviceMemory, sizeof(float) * 10, cudaMemcpyDeviceToHost );

        // And free the memory allocated

        cudaFree( numbersInDeviceMemory );

        // Standard C again

        for(int x = 0; x < 10 ; x++){

                printf("%f ", numbersInSystemMemory[x]);

        return 1;

Maximum value of x by the way, on my laptop is 65536 – now that’s what I call splitting my task up and parallelising it. On really large data sets having this kind of functionality is really powerful.

Index Subscribe Respond

Rob Ashton

CUDA - A basic parallelised task

Published on 2011-1-21