Sunday, 16 June 2013

CUDA in .NET with Cudafy

Cudafy allows developers to write GPGPU code in .NET and target CUDA- and OpenCL-capable devices. In this post I revisit the WpfBlender application to show how to integrate Cudafy into your app.

To begin, in order to use CUDA with Cudafy you must have the CUDA 5 toolkit installed and the Microsoft C++ compiler (cl.exe) in your system PATH. This is because, behind the scenes, what Cudafy really does is language translation: it turns your .NET kernel code into a CUDA C kernel, which is then compiled into PTX by NVIDIA's nvcc compiler. PTX is NVIDIA's intermediate GPU assembly language; the driver compiles it to native code for your particular GPU at load time.
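To make that concrete, the one-time setup looks roughly like this (a sketch using Cudafy's CudafyTranslator and CudafyHost APIs; the variable names are my own):

```csharp
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

// Translate the [Cudafy]-attributed .NET methods in this assembly to
// CUDA C and have nvcc compile them to PTX (this is the step that
// needs cl.exe on the PATH).
CudafyModule km = CudafyTranslator.Cudafy();

// Pick the first CUDA device and load the compiled module onto it.
GPGPU _gpu = CudafyHost.GetDevice(eGPUType.Cuda, 0);
_gpu.LoadModule(km);
```

After LoadModule, kernels in the module can be invoked through `_gpu.Launch(...)`.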

The WpfBlender code has just one very simple kernel. That is good as an introduction, but it does not let us show any real benefit of GPGPU computing here: the transfers between host (system) memory and device memory required by Cudafy, plus the extra copying .NET requires between Bitmaps and integer arrays, add massive overhead and remove most of the speed gains. Pinned memory is used where possible to speed up host-to-device transfers.

CUDA C lets you page-lock (pin) existing host memory with cuMemHostRegister to speed up transfers considerably. It is not clear whether Cudafy exposes this call, but it does claim that transfers are faster with pinned memory, so perhaps it does.

Bitmap, Image and WriteableBitmap do not expose their raw memory directly, so extra copying was required: the underlying buffer was copied to an int array and then back again.
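That round trip through WPF bitmap types looks something like this (a sketch using the standard BitmapSource.CopyPixels and WriteableBitmap.WritePixels APIs; the variable names are illustrative):

```csharp
// Copy the source image's pixels into a managed int array
// (one int per 32-bit pixel).
int stride = (source.PixelWidth * source.Format.BitsPerPixel + 7) / 8;
int[] pixels = new int[source.PixelHeight * stride / 4];
source.CopyPixels(pixels, stride, 0);

// ... process the array (on the CPU or GPU) ...

// Copy the result back into a WriteableBitmap for display.
var result = new WriteableBitmap(source.PixelWidth, source.PixelHeight,
                                 source.DpiX, source.DpiY,
                                 source.Format, null);
result.WritePixels(new Int32Rect(0, 0, source.PixelWidth, source.PixelHeight),
                   pixels, stride, 0);
```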

Here is what the WpfBlenderCuda application looks like:


We now have a Cuda button that runs the CUDA kernel, and as you can see there is some speedup over a sequential CPU approach. (This is a Core i7 920 with a GTX 295.)

Our CUDA kernel is very simple:
[Cudafy]
public static void thekernel(GThread thread, int[] src1, int[] src2, int[] dst)
{
    // One block per pixel: the block coordinates identify the pixel.
    int x = thread.blockIdx.x;
    int y = thread.blockIdx.y;

    // Linear offset into the row-major pixel buffers.
    int offset = x + y * thread.gridDim.x;

    // Blend the two source pixels and write the result.
    int num = ComputeColorCuda(src1[offset], src2[offset]);
    dst[offset] = num;
}

The kernel takes two input int arrays, one for each image to blend, and an output buffer. The CUDA block and thread indices are passed into the kernel via the GThread parameter.

Using this information, each kernel thread computes just one blended pixel and returns.
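ComputeColorCuda itself is not shown above, but a 50/50 blend of two 32-bit ARGB pixels might look like the following (my own sketch, not necessarily the exact blend the app uses):

```csharp
[Cudafy]
public static int ComputeColorCuda(int c1, int c2)
{
    // Average each 8-bit channel of the two ARGB pixels separately,
    // then reassemble the result into one 32-bit pixel.
    int a = ((((c1 >> 24) & 0xFF) + ((c2 >> 24) & 0xFF)) / 2) << 24;
    int r = ((((c1 >> 16) & 0xFF) + ((c2 >> 16) & 0xFF)) / 2) << 16;
    int g = ((((c1 >> 8) & 0xFF) + ((c2 >> 8) & 0xFF)) / 2) << 8;
    int b = (((c1 & 0xFF) + (c2 & 0xFF)) / 2);
    return a | r | g | b;
}
```

Keeping this a plain static method with the [Cudafy] attribute means it can be called both from the kernel and from CPU code.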

Here is how CUDA processing with Cudafy is set up:
protected override void Process()
{
    _gpu.SetCurrentContext();

    // Buffer size in ints (4 bytes per pixel).
    int buffSize = height * buffStride / 4;

    // Allocate device buffers for the two source images and the result.
    int[] src1_dev_bitmap = _gpu.Allocate<int>(buffSize);
    int[] src2_dev_bitmap = _gpu.Allocate<int>(buffSize);
    int[] dst_dev_bitmap = _gpu.Allocate<int>(buffSize);

    // Allocate pinned (page-locked) host buffers for faster transfers.
    IntPtr host_ptr1 = _gpu.HostAllocate<int>(buffSize);
    IntPtr host_ptr2 = _gpu.HostAllocate<int>(buffSize);
    buffer = new int[buffSize];

    int stride = (img1.PixelWidth * img1.Format.BitsPerPixel + 7) / 8;

    // Copy the WPF bitmaps into the pinned host buffers.
    img1.CopyPixels(new Int32Rect(0, 0, width, height), host_ptr1, buffSize * 4, stride);
    img2.CopyPixels(new Int32Rect(0, 0, width, height), host_ptr2, buffSize * 4, stride);

    // Host-to-device transfers.
    _gpu.CopyToDevice(host_ptr1, 0, src1_dev_bitmap, 0, buffSize);
    _gpu.CopyToDevice(host_ptr2, 0, src2_dev_bitmap, 0, buffSize);

    // Launch one single-thread block per pixel.
    _gpu.Launch(new dim3(width, height), 1).thekernel(src1_dev_bitmap, src2_dev_bitmap, dst_dev_bitmap);

    // Device-to-host transfer of the blended result.
    _gpu.CopyFromDevice(dst_dev_bitmap, 0, buffer, 0, buffSize);

    // Release device memory and the pinned host buffers.
    _gpu.FreeAll();
    _gpu.HostFree(host_ptr1);
    _gpu.HostFree(host_ptr2);
    GC.Collect();
}
   

GPU and host buffers are allocated, then we copy from the WPF BitmapSource to the host buffer and from the host buffer to the device. (Too many copies!)

Then the kernel is launched with a width × height grid of single-thread blocks, one per pixel.
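One thread per block wastes most of the parallelism within each multiprocessor. A common alternative (a sketch of my own, not how WpfBlenderCuda does it) is to launch blocks of, say, 256 threads and derive a flat pixel index from the block and thread indices:

```csharp
[Cudafy]
public static void thekernel256(GThread thread, int[] src1, int[] src2, int[] dst)
{
    // Flat index: each block covers blockDim.x consecutive pixels.
    int offset = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;

    // Guard against the last, partially filled block running past the buffer.
    if (offset < dst.Length)
        dst[offset] = ComputeColorCuda(src1[offset], src2[offset]);
}

// Launch: enough 256-thread blocks to cover every pixel, e.g.
// _gpu.Launch((buffSize + 255) / 256, 256).thekernel256(src1_dev_bitmap, src2_dev_bitmap, dst_dev_bitmap);
```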

After the kernel has finished, the results are copied back from the GPU device output buffer to the host array buffer. Finally we free the buffers that were allocated.

During compilation Cudafy does the language translation and embeds the PTX in the assembly; at launch time it runs that PTX code on the GPU. We are now using our graphics chip as a co-processor!

Only a limited speedup was observed here, but speedups of up to 100 times over single-threaded code are possible with GPGPU.

You can get the source here.