CUDA - A brief looksee into the world of generalised GPU computing

Published on 2011-1-17

Code for this entry can be found on Github at: https://github.com/NeilRobbins/CudaHack/tree/master/1

This weekend saw me ditch my normal routine and head into Belgium to visit @neilrobbins for a couple of days coding against an Amazon EC2 Cluster GPU Compute instance. This is big stuff for the future, and it’s nice to do something different (I’ve spent the past few weeks chasing up leads on jobs and generally not writing much code except at work).

Rather than do our knockabout in a “cool” language like Ruby, our daily-life language (C#), or something else, we opted to stay old school and do our work in C to get a better feel for what is actually going on when talking to the GPU. I have previously written shader code against DirectX (SM1 and SM2) and wanted to see how different this really is.

NVCC and GCC

CUDA C is a bit different from ordinary C, in that you use a different compiler (NVCC) which takes the special CUDA variant of C and compiles it into standard object files, which can then be linked as usual against your typical GCC-created object files.

In our experiments, we kept a single file (main.cu) which we compiled and linked using NVCC, and decided to write our standard C and CUDA C next to each other and get on with things.
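To give a concrete idea of what that build step looks like, compiling and linking a single file like ours is as simple as something along these lines (the output name is just an example):

nvcc main.cu -o main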

A simple C Routine

Rather than parallelise anything to begin with,  we opted to see what happens when writing a simple bit of code to execute on the GPU, and how you pass data to it. Consider the following:

int addTwoNumbers(int x, int y)
{
    return x + y;
}

Yeah, I really mean that simple. As it stands, this is just ordinary C, and it runs happily on the CPU.

Now, there are some differences if you want this code to run on the GPU rather than the CPU.

Getting output from a GPU function

First up, functions you define for running on the GPU seemingly have to be void – which leads us to the following version of this method:

void addTwoNumbers(int x, int y, int* output)
{
    *output = x + y;
}

In this, we pass in a pointer to an area of memory to write to, along with the two values to be added; we then write the result of the operation to that area of memory. Standard fare really if you’re a C programmer and you’re au fait with pointers and that sort of thing.

Calling this method (still running it on the CPU) would then look something like this:

int result;
addTwoNumbers(5, 6, &result);

This is not all there is to it, though. In the above example we are creating ‘result’ in system memory accessible to the CPU; if we pass this into a GPU method, the GPU won’t be able to see that data. We actually have to allocate memory on the GPU and pass a pointer to that in.

In the same vein, memory allocated on the GPU cannot be seen by code running on the CPU (obviously), so we start to look at calls to methods running on the GPU as remote calls: we have to upload data to the other side in order for the remote call to succeed, and then download data from the other side to get the result.

This therefore gives us:

// Created in system memory
int cpuVisibleResult;
 
// Just an uninitialized pointer
int* gpuVisibleResult;
 
// Initialize some memory and set the pointer to it
cudaMalloc( (void**)&gpuVisibleResult , sizeof(int));
 
// TODO: Make the remote call
 
// Copy the result into CPU visible memory
cudaMemcpy(&cpuVisibleResult, gpuVisibleResult, sizeof(int), cudaMemcpyDeviceToHost);
 
// TODO: Do stuff with the cpuVisibleResult
 
// Free that memory again
cudaFree( gpuVisibleResult );

If you’re familiar with C you’ll recognise that this is a CUDA version of the standard malloc call: it allocates a block of memory and gives you a pointer to that memory, doing exactly the same thing but on the GPU.
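For comparison, the plain C equivalent (my own illustration, not part of our hack) would look like this:

// Allocate an int's worth of system memory, visible only to the CPU
int* cpuBuffer = (int*)malloc(sizeof(int));

// ... do something with it ...

// Give the memory back
free(cpuBuffer);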

There is one final thing we have to do in order to make that call: we have to attribute the method so the CUDA compiler knows it is designed to be run on the GPU side of things. This gives us the final version of our method:

Actually being able to run the function on the GPU

__global__ void addTwoNumbers(int x, int y, int* output)
{
    *output = x + y;
}

That __global__ attribute is a special CUDA thing telling the compiler that this is a method we can call from the host (CPU), which will run on the device (GPU). Invoking the method looks like this:

addTwoNumbers<<<1,1>>>(2, 5, gpuVisibleResult);

The two 1s? Well, that’ll be explained in the next blog entry as I go through the examples we created while teaching ourselves.
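In the meantime, here is a minimal sketch of all the pieces above stitched into a single main.cu (the printf and the hard-coded numbers are mine for illustration, and error checking is omitted):

#include <stdio.h>

__global__ void addTwoNumbers(int x, int y, int* output)
{
    *output = x + y;
}

int main(void)
{
    // Created in system memory
    int cpuVisibleResult;

    // Allocate an int's worth of memory on the GPU
    int* gpuVisibleResult;
    cudaMalloc( (void**)&gpuVisibleResult, sizeof(int));

    // Make the remote call: one block, one thread
    addTwoNumbers<<<1,1>>>(2, 5, gpuVisibleResult);

    // Copy the result back into CPU visible memory
    cudaMemcpy(&cpuVisibleResult, gpuVisibleResult, sizeof(int), cudaMemcpyDeviceToHost);

    printf("2 + 5 = %d\n", cpuVisibleResult);

    // Free that memory again
    cudaFree(gpuVisibleResult);

    return 0;
}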
