Wednesday, 5 September 2012

Parallel Processing for Multicore in C#








Manycore architectures are now common on most consumer devices, and many traditional programming languages tend to be designed primarily for writing single threaded code.

Writing multi-threaded code has long been perceived as a task for advanced developers.

This article aims to address how can we write efficient code for the many core era by using C# modern idioms for parallel programming.

In order to present various basic patterns for parallel programming a WPF application has been created, the program 'blends' two images to create a third image.

First a traditional single threaded approach was used as a baseline, then additional parallel implementations based on Threads, ThreadPool, 'TPL ForkJoin' and 'Parallel For' were created.

Screenshot





Sequential

 unsafe protected override void Process()  
 {  
   int buffSize = height * buffStride / 4;  
   buffer = new int[buffSize];  
   int stride = (img1.PixelWidth * img1.Format.BitsPerPixel + 7) / 8;  
   int[] rgbim1 = new int[width];  
   int[] rgbim2 = new int[width];  
   for (int row = 0; row < height; row++)  
   {  
     fixed (int* p1 = rgbim1)  
     {  
       img1.CopyPixels(new Int32Rect(0, row, width, 1), (IntPtr)p1, stride, stride);  
     }  
     fixed (int* p2 = rgbim2)  
     {  
       img2.CopyPixels(new Int32Rect(0, row, width, 1), (IntPtr)p2, stride, stride);  
     }  
     for (int col = 0; col < width; col++)  
     {  
       int num = ComputeColor(rgbim1[col], rgbim2[col]);  
       int index = row * buffStride / 4 + col;  
       buffer[index] = num;  
     }  
   }  
 }  

Here you can see the basic sequential algorithm where the image is produced by iterating throught the rows and then columns and each output pixel colour is computed one by one.

The display WritableBitmap is owned by the UI thread in the application, .NET provides basic concurrency correctness by defining thread ownership on certain objects. In order to work on the same image we must therefore have some boilerplate code that first copies it into a buffer and then copies it back.

Threaded

 override protected void Process()  
 {  
   int processorCount = Environment.ProcessorCount;  
   Thread[] blenderThreads = new Thread[processorCount];  
   int buffSize = height * buffStride / 4;  
   buffer = new int[buffSize];  
   for (int i = 0; i < processorCount; i++)  
   {  
     BlenderThread blender = new BlenderThread(this, i * height / processorCount,  
         (i + 1) * height / processorCount);  
     blenderThreads[i] = new Thread(blender.ThreadRun);  
     blenderThreads[i].Start();  
   }  
   for (int i = 0; i < processorCount; i++)  
   {  
     blenderThreads[i].Join();  
   }  
 }  

Here you can see a basic Threaded approach, we create a thread per processor and then let each thread process a block of the image. In order to ensure all threads are complete before we copy the buffer back to the UI we use multiple calls to the Join method to wait on the current thread until all launched threads have completed.

Thread Pool

 override protected void Process() {  
   _taskCount = 100;  
   doneEvent.Reset();  
   int buffSize = height * buffStride / 4;  
   buffer = new int[buffSize];  
   for (int i = 0; i < _taskCount; i++) {  
     int taskSize = height / _taskCount;  
     BlenderTask bt = new BlenderTask(this, i * taskSize, (i + 1) * taskSize, doneEvent);  
     ThreadPool.QueueUserWorkItem(bt.ThreadPoolCallback, i);  
   }  
   doneEvent.WaitOne();  
 }  

The ThreadPool implementation has a similar strategy but instead focuses on creating enough parallelism for the Task library to queue up enough work for the cores to process in parallel. We arbitrarily create 100 tasks but splitting up the image into 100 parts and queue those tasks for processing. In order to determine when task processing has completed we use the WaitOne method. WaitAny would seem like a more logical choice but limitations in .NET mean this is unsuitable for our purposes, so instead WaitOne is used in combination with an atomic shared counter.

TPL Fork Join

 static void MyParallelInvoke(params Action[] actions)  
 {  
   Task.Factory.StartNew(() =>  
   {  
     for (int i = 0; i < actions.Length; i++)  
     {  
       int cur = i;  
       Task.Factory.StartNew(  
         () => actions[cur](), TaskCreationOptions.AttachedToParent);  
     }  
   }).Wait();  
 }  

 unsafe public void Compute() {  
   if (_to - _from > 1000) {  
     var taskLeft = new BlenderTask(outer, _from, (_to + _from) / 2);  
     var taskRight = new BlenderTask(outer, (_to + _from) / 2, _to);  
     MyParallelInvoke(taskLeft.Compute, taskRight.Compute);   
     return;  
   }  

   ....

One interesting parallel pattern is the ForkJoin pattern, it is similar to a binary tree of recursive calls. Every time we find a reason to increase parallelism we 'Fork' and have a thread for the left branch and a thread for the right. When finished we must coalesce the results so we must 'Join'. This does not directly map to any primitives in C# but we can approximate it with the Task Parallel Library Task.Factory.StartNew and TaskCreationOptions.AttachedToParent.

Parallel For

 unsafe protected override void Process() {  
   int buffSize = height * buffStride / 4;  
   buffer = new int[buffSize];  
   Parallel.For(0, height, row =>  
   {  
     int[] rgbim1 = new int[width];  
     int[] rgbim2 = new int[width];  
     int stride = (img1.PixelWidth * img1.Format.BitsPerPixel + 7) / 8;  
     fixed (int* p1 = rgbim1)  
     {  
       img1.CopyPixels(new Int32Rect(0, row, width, 1), (IntPtr)p1, stride, stride);  
     }  
     fixed (int* p2 = rgbim2)  
     {  
       img2.CopyPixels(new Int32Rect(0, row, width, 1), (IntPtr)p2, stride, stride);  
     }  
     for (int col = 0; col < width; col++)  
     {  
       int num = ComputeColor(rgbim1[col], rgbim2[col]);  
       int index = row*buffStride/4 + col;  
       buffer[index] = num;  
     }  
   });  
 }  

C# has inbuilt support for Paralell For loops, direct support exists as long as the processing operates on distinct elements, which is the case in our example. Here we can simply take the sequential version and change the outer loop to be parallel.

Program design and Other notes

A base class of blender was created and all the other blenders inherited from it, this allowed reuse avoiding duplication of common code. An MVVM class of BlenderData was created to allow data-binding, reduce UI code and minimize dependencies.

The button handler uses the latest C# 5 await async primitive in order to stop the UI thread from being blocked when the long running blending operations are in progress.

Source available here

Based on Java 7 sample code originally from I2PC Summer School on Multicore Programming held at the University of Illinois at Urbana-Champaign.

No comments:

Post a Comment