Sunday, 16 June 2013

CUDA in .NET with Cudafy

Cudafy lets .NET developers write GPGPU code that targets CUDA- and OpenCL-capable devices. In this post I revisit the WpfBlender application to show how to integrate Cudafy into your app.

To use CUDA with Cudafy you must have the latest CUDA 5 toolkit installed and the Microsoft C compiler on your system PATH. This is because what Cudafy really does in the background is language translation: it turns your .NET kernel code into a CUDA C kernel, which is then compiled into PTX by NVIDIA's nvcc CUDA C compiler. PTX is NVIDIA's intermediate instruction format; the driver compiles it to native GPU code when the module is loaded.

The WpfBlender code has just one very simple kernel. This is good as an introduction, but in this case it does not let us show any real benefit from GPGPU computing. The transfers between host (system) memory and device memory required by Cudafy, plus the extra copying that .NET requires between Bitmaps and integer arrays, add massive overhead and remove most of the speed gains. Pinned memory is used where possible to speed up host-to-device transfers.

CUDA C can speed up transfers considerably by page-locking existing host memory with cuMemHostRegister. It is not clear whether Cudafy supports this; it claims that transfers are faster with pinned memory, so maybe it does.

Bitmap, Image and WriteableBitmap do not support raw memory access so extra memory copying was required. The underlying buffer was copied to an int array and then back again.
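The round trip between a raw pixel buffer and an int array can be sketched as follows. This is only an illustration, not the WpfBlender code: the class and method names are hypothetical, and it assumes a 32-bit pixel format (4 bytes per pixel) with no row padding.

```csharp
using System;

public static class PixelBufferSketch
{
    // Reinterpret a raw 32-bit-per-pixel byte buffer as one int per pixel.
    public static int[] ToIntArray(byte[] raw)
    {
        if (raw.Length % 4 != 0)
            throw new ArgumentException("Expected 4 bytes per pixel.");
        var pixels = new int[raw.Length / 4];
        Buffer.BlockCopy(raw, 0, pixels, 0, raw.Length); // byte-wise copy
        return pixels;
    }

    // Copy the ints back into a byte buffer of the same total size.
    public static byte[] ToByteArray(int[] pixels)
    {
        var raw = new byte[pixels.Length * 4];
        Buffer.BlockCopy(pixels, 0, raw, 0, raw.Length);
        return raw;
    }

    public static void Main()
    {
        byte[] raw = { 1, 2, 3, 4, 5, 6, 7, 8 };   // two pixels
        int[] pixels = ToIntArray(raw);
        byte[] back = ToByteArray(pixels);
        Console.WriteLine(pixels.Length);           // 2
        Console.WriteLine(back[5]);                 // 6
    }
}
```

Each direction is a full pass over the image, which is why these copies show up in the overall timing.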

Here is what the WpfBlenderCuda application looks like:

We now have a Cuda button that runs the CUDA kernel. As you can see, there is some speedup over a sequential CPU approach. (This is a Core i7 920 with a GTX 295.)

Our CUDA kernel is very simple:
[Cudafy]
public static void thekernel(GThread thread, int[] src1, int[] src2, int[] dst)
{
    // One block per pixel: the block index gives the pixel coordinates.
    int x = thread.blockIdx.x;
    int y = thread.blockIdx.y;
    int offset = x + y * thread.gridDim.x;
    int num = ComputeColorCuda(src1[offset], src2[offset]);
    dst[offset] = num;
}

The kernel takes two int arrays holding the two input images to blend and writes the result to an output buffer. The CUDA thread and block indices are passed into the kernel through the GThread parameter.

Using this information we just compute one blended pixel with each kernel thread and return.
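To make the indexing concrete, here is a CPU-only sketch of the same mapping from (x, y) to a flat offset. It is a rough equivalent for illustration only: `Blend` is a hypothetical stand-in for ComputeColorCuda (whose body is not shown in this post) that simply averages each 8-bit channel, and the class name is made up.

```csharp
using System;

public static class BlendIndexSketch
{
    // Hypothetical stand-in for ComputeColorCuda: average each 8-bit channel.
    public static int Blend(int a, int b)
    {
        int result = 0;
        for (int shift = 0; shift < 32; shift += 8)
        {
            int ca = (a >> shift) & 0xFF;
            int cb = (b >> shift) & 0xFF;
            result |= ((ca + cb) / 2) << shift;
        }
        return result;
    }

    // CPU equivalent of running one thread per pixel on a width x height grid.
    public static int[] BlendAll(int[] src1, int[] src2, int width, int height)
    {
        var dst = new int[width * height];
        for (int y = 0; y < height; y++)        // plays the role of blockIdx.y
        {
            for (int x = 0; x < width; x++)     // plays the role of blockIdx.x
            {
                int offset = x + y * width;     // gridDim.x equals the width
                dst[offset] = Blend(src1[offset], src2[offset]);
            }
        }
        return dst;
    }

    public static void Main()
    {
        int[] a = { 0x10, 0x20, 0x30, 0x40 };
        int[] b = { 0x30, 0x40, 0x50, 0x60 };
        int[] dst = BlendAll(a, b, 2, 2);
        Console.WriteLine(dst[3].ToString("X"));  // 50
    }
}
```

On the GPU the two loops disappear: every (x, y) pair is handled by its own thread, and only the loop body remains in the kernel.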

Here is how CUDA processing with Cudafy is set up:
 protected override void Process()
 {
   // Number of 32-bit pixels in the image
   int buffSize = height * buffStride / 4;
   // Device buffers
   int[] src1_dev_bitmap = _gpu.Allocate<int>(buffSize);
   int[] src2_dev_bitmap = _gpu.Allocate<int>(buffSize);
   int[] dst_dev_bitmap = _gpu.Allocate<int>(buffSize);
   // Pinned host staging buffers
   IntPtr host_ptr1 = _gpu.HostAllocate<int>(buffSize);
   IntPtr host_ptr2 = _gpu.HostAllocate<int>(buffSize);
   buffer = new int[buffSize];
   int stride = (img1.PixelWidth * img1.Format.BitsPerPixel + 7) / 8;
   img1.CopyPixels(new Int32Rect(0, 0, width, height), host_ptr1, buffSize*4, stride);
   img2.CopyPixels(new Int32Rect(0, 0, width, height), host_ptr2, buffSize*4, stride);
   _gpu.CopyToDevice(host_ptr1, 0, src1_dev_bitmap, 0, buffSize);
   _gpu.CopyToDevice(host_ptr2, 0, src2_dev_bitmap, 0, buffSize);
   // One block per pixel, one thread per block
   _gpu.Launch(new dim3(width, height), 1).thekernel(src1_dev_bitmap, src2_dev_bitmap, dst_dev_bitmap);
   _gpu.CopyFromDevice(dst_dev_bitmap, 0, buffer, 0, buffSize);
   // Free the device buffers and the pinned host buffers
   _gpu.FreeAll();
   _gpu.HostFreeAll();
 }

Device buffers and pinned host buffers are allocated. We copy from each WPF bitmap into a host buffer and then from the host buffer to the device. (Too many copies!)

Then the kernel is launched on a width by height grid of single-thread blocks, which gives one thread per pixel.

After the kernel has finished, the results are copied back from the GPU output buffer to the host array buffer. Finally we free the buffers that were allocated.

During compilation Cudafy performs the language translation and embeds the PTX in the assembly; on launch it runs the PTX code on the GPU. We are now using our graphics chip as a co-processor!

Only a limited speedup was observed here; however, speedups of up to 100 times over single-threaded code are possible with GPGPU.

You can get the source here.

Saturday, 5 January 2013

Market Event Simulator in C#

I recently completed a Coursera course on Computational Finance. Part of the course was to customize a Market Event Simulator written in Python and Pandas.

I decided to do a conversion of the event simulator to C# and contribute it to a codeplex project called

Here is the output graph of an Event study on historical data from the NYSE.

Additional details on the Event Profiler can be found on the Python QSTK website.

You can get the latest source from CodePlex using TFS, Subversion or a zip download.

Tuesday, 18 December 2012

Updating Enterprise Logging Block config dynamically

We have a ClickOnce application that uses the Microsoft Enterprise Logging Block. It's fairly easy to set up the block from XML in the app.config; however, we wanted our log path to be dynamic so that we could change it at runtime.

Here is our sample app.config.

  <!-- =================================== -->
  <!-- logging configuration        -->
  <!-- =================================== -->
  <exceptionHandling>
    <exceptionPolicies>
      <add name="LoggingPolicy">
        <exceptionTypes>
          <add name="All Exceptions" type="System.Exception" postHandlingAction="ThrowNewException">
            <exceptionHandlers>
              <add name="Logging Exception Handler" type="Microsoft.Practices.EnterpriseLibrary.ExceptionHandling.Logging.LoggingExceptionHandler, Microsoft.Practices.EnterpriseLibrary.ExceptionHandling.Logging"
               logCategory="General" eventId="100" severity="Error" title="Enterprise Library Exception Handling"
               formatterType="Microsoft.Practices.EnterpriseLibrary.ExceptionHandling.TextExceptionFormatter, Microsoft.Practices.EnterpriseLibrary.ExceptionHandling"
               priority="0" />
            </exceptionHandlers>
          </add>
        </exceptionTypes>
      </add>
    </exceptionPolicies>
  </exceptionHandling>
  <loggingConfiguration name="" tracingEnabled="true" defaultCategory="General">
    <listeners>
      <add name="Rolling Flat File Trace Listener" type="Microsoft.Practices.EnterpriseLibrary.Logging.TraceListeners.RollingFlatFileTraceListener, Microsoft.Practices.EnterpriseLibrary.Logging"
       listenerDataType="Microsoft.Practices.EnterpriseLibrary.Logging.Configuration.RollingFlatFileTraceListenerData, Microsoft.Practices.EnterpriseLibrary.Logging"
       fileName="./Logs/Client.log" footer="" formatter="Text Formatter"
       header="" rollSizeKB="20000" />
      <add name="XML Trace Listener" type="Microsoft.Practices.EnterpriseLibrary.Logging.TraceListeners.XmlTraceListener, Microsoft.Practices.EnterpriseLibrary.Logging"
       listenerDataType="Microsoft.Practices.EnterpriseLibrary.Logging.Configuration.XmlTraceListenerData, Microsoft.Practices.EnterpriseLibrary.Logging"
       fileName="trace-xml.log" traceOutputOptions="None" filter="Verbose" />
      <add name="Event Log Trace Listener" type="Microsoft.Practices.EnterpriseLibrary.Logging.TraceListeners.FormattedEventLogTraceListener, Microsoft.Practices.EnterpriseLibrary.Logging"
       listenerDataType="Microsoft.Practices.EnterpriseLibrary.Logging.Configuration.FormattedEventLogTraceListenerData, Microsoft.Practices.EnterpriseLibrary.Logging"
       source="Application Error" formatter="Text Formatter" log="Application" filter="Information" />
    </listeners>
    <formatters>
      <add type="Microsoft.Practices.EnterpriseLibrary.Logging.Formatters.TextFormatter, Microsoft.Practices.EnterpriseLibrary.Logging"
       template="{timestamp(local)} : {severity} : {keyvalue(namespace)} :{message}&#xA;"
       name="Text Formatter" />
    </formatters>
    <categorySources>
      <add switchValue="information" name="General">
        <listeners>
          <!-- <add name="Rolling Flat File Trace Listener" /> -->
          <add name="Event Log Trace Listener" />
        </listeners>
      </add>
    </categorySources>
    <specialSources>
      <allEvents switchValue="All" name="All Events" />
      <notProcessed switchValue="All" name="Unprocessed Category">
        <listeners>
          <add name="Rolling Flat File Trace Listener" />
        </listeners>
      </notProcessed>
      <errors switchValue="Verbose" name="Logging Errors &amp; Warnings">
        <listeners>
          <add name="Rolling Flat File Trace Listener" />
          <add name="Event Log Trace Listener" />
        </listeners>
      </errors>
    </specialSources>
  </loggingConfiguration>

We can then dynamically update the config to add our new path:

     private static DictionaryConfigurationSource ConfigureLoggingLocationForClickOnce()
     {
       string logFileName = @".\Logs\Client.log";
       if (ApplicationDeployment.IsNetworkDeployed)
       {
         // ClickOnce deployments log to the per-user data directory
         logFileName = ApplicationDeployment.CurrentDeployment.DataDirectory + @"\Logs\Client.log";
       }
       //Create the config builder for the Fluent API
       var source = new FileConfigurationSource(AppDomain.CurrentDomain.SetupInformation.ConfigurationFile);
       ConfigurationSourceBuilder configBuilder = new ConfigurationSourceBuilder();
       //Get the existing logging config section
       var logginConfigurationSection = (LoggingSettings)source.GetSection("loggingConfiguration");
       logginConfigurationSection.RevertImpersonation = false;
       var _rollingFileListener = new RollingFlatFileTraceListenerData("Rolling Flat File Trace Listener", logFileName, "", "",
                20000, "MM/dd/yyyy", RollFileExistsBehavior.Increment,
                RollInterval.Day, TraceOptions.Callstack | TraceOptions.None,
                "Text Formatter", SourceLevels.Information);
       _rollingFileListener.MaxArchivedFiles = 1;
       //Add trace listener to current config
       //Configure the category source section of config for rolling file
       var _rollingFileCategorySource = logginConfigurationSection.TraceSources.Get("General");
       //Must be named exactly the same as the flat file trace listener above.
       _rollingFileCategorySource.TraceListeners.Add(new TraceListenerReferenceData("Rolling Flat File Trace Listener"));
       //Add category source information to current config
       //Add the loggingConfiguration section to the config.
       configBuilder.AddSection("loggingConfiguration", logginConfigurationSection);
       //Required code to update the EntLib Configuration with settings set above.
       var configSource = new DictionaryConfigurationSource();
       configBuilder.UpdateConfigurationWithReplace(configSource);
       //Set the Enterprise Library Container for the inner workings of EntLib to use when logging
       EnterpriseLibraryContainer.Current = EnterpriseLibraryContainer.CreateDefaultContainer(configSource);
       return configSource;
     }

Finally, if you are using Unity and Prism, you can set up both loggers like so:

       var configurator = new UnityContainerConfigurator(Container);
       // Read the configuration files and set up the container.
       var configSource = ConfigureLoggingLocationForClickOnce();
       EnterpriseLibraryContainer.ConfigureContainer(configurator, configSource);
       // Set up logging with both IOC containers...
       var oldLogger = EnterpriseLibraryContainer.Current.GetInstance<LogWriter>();
       if (oldLogger != null)
       {
         // ...
       }
       var newLogger = Container.Resolve<LogWriter>();
       EnterpriseLibraryContainer.Current = ServiceLocator.Current;
       this.Container.RegisterType<ILoggerFacade, LoggingService>(new ContainerControlledLifetimeManager());
       LoggingService logger = (LoggingService)this.Container.Resolve<ILoggerFacade>();
       this.Container.RegisterInstance<ILoggingService>(logger, new ContainerControlledLifetimeManager());

Monday, 17 December 2012

Forcing WCF generated proxies to share POJOs

We recently had a problem at work that seemed quite tough to solve. We have multiple SOAP endpoints exposed from a Java Enterprise Edition (JEE) container. The services were generated from Java code using Axis2, and they share many POJOs.

We wanted to consume these services in a .NET client. Client proxies were generated in the standard manner, but this caused an issue: we ended up with the same POJOs converted into multiple POCOs in different namespaces on the client.

I tried svcutil without any success until I tried the following approach.

Open the Reference.svcmap file and add multiple endpoints to one service definition.
   <MetadataSource Address="" Protocol="http" SourceId="1" />  
   <MetadataSource Address="" Protocol="http" SourceId="2" />  
   <MetadataSource Address="" Protocol="http" SourceId="3" />  
   <MetadataFile FileName="BananaService.xsd" MetadataType="Schema" ...  
   <MetadataFile FileName="BananaService.wsdl" MetadataType="Wsdl" ...  
   <MetadataFile FileName="OrangeService.xsd" MetadataType="Schema" ...  
   <MetadataFile FileName="OrangeService.wsdl" MetadataType="Wsdl" ...  
   <MetadataFile FileName="AppleService.wsdl" MetadataType="Wsdl" ...  
   <MetadataFile FileName="AppleService.xsd" MetadataType="Schema" ...  

Now choose 'Update Service Reference' and bingo: one C# source file with all client proxies in one namespace, all sharing the same POCOs!

Making WPF DataGrid Keyboard friendly

The out-of-the-box DataGrid does a great job of presenting data, but the default configuration leaves a little to be desired for those stuck with a keyboard.

I aim to show you how to make it a little friendlier with the help of a custom behavior.

First we create a custom behavior.
   public class DataGridListBehavior : Behavior<DataGrid>
   {
     private bool _endingEdit;
     protected IList Model { get; private set; }
     protected override void OnAttached()
     {
       // Hook the grid events we want to customize
       AssociatedObject.DataContextChanged += HandleDataContextChanged;
       AssociatedObject.MouseDoubleClick += HandleMouseDoubleClick;
       AssociatedObject.CellEditEnding += HandleCellEditEnding;
       AssociatedObject.PreviewKeyDown += PreviewKeyDown;
       AssociatedObject.PreviewMouseLeftButtonDown += HandleLeftButtonDown;
       AssociatedObject.GotFocus += CellGotFocus;
     }
     protected override void OnDetaching()
     {
       // Unhook the handlers added in OnAttached (not shown)
     }
   }

This allows us to hook important grid events and add in our custom code.

Most of the work is done by CellGotFocus().

     private void CellGotFocus(object sender, RoutedEventArgs args)
     {
       // Only react when the element that received focus is a cell
       DataGridCell gridCell = args.OriginalSource as DataGridCell;
       if (gridCell != null)
         DataGridHelper.EnterEdit(gridCell, AssociatedObject);
     }

EnterEdit does the major work: it makes sure we enter the cell and then focuses the editing control automatically.

     public static void EnterEdit(DataGridCell gridCell, DataGrid grid, bool bFocus = true)
     {
       if (gridCell != null && !gridCell.IsEditing)
       {
         // enables editing on single click
         if (!gridCell.IsFocused)
           gridCell.Focus();
         if (!gridCell.IsSelected && grid.SelectionUnit == DataGridSelectionUnit.Cell)
           gridCell.IsSelected = true;
         if (bFocus)
         {
           var control = FindChildElement<Control>(gridCell);
           if (control != null)
             control.Focus();
         }
       }
     }

You can get the code here.

Tuesday, 2 October 2012

Monitor Pulse/Wait, ResetEvent and SpinWait

After some lively discussion about the merits of Monitor Pulse/Wait as a thread synchronization primitive in C#, I decided to test the results for myself.

One argument was that Monitor Pulse / Wait performs better than ManualResetEvent due to it running in User mode without the need for Kernel objects and therefore should be used as the preferred thread synchronization mechanism in C# for performance reasons.

Counter-arguments were:

1. The code would be less maintainable.
2. Pulses can be missed in real-world production code.
3. If you are keen on optimum speed through micro-optimizations, then maybe 'lock free programming' should be considered.
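The missed-pulse risk in the second point comes from Monitor.Pulse having no memory: if nobody is waiting when it is called, the signal is gone. A common guard is to record the signal in a flag and re-check it in a loop around Monitor.Wait. The following is a minimal sketch of that pattern with hypothetical names; it is not the code used in the benchmark.

```csharp
using System;
using System.Threading;

public sealed class PulseGate
{
    private readonly object _sync = new object();
    private bool _signalled;   // records the signal so an early pulse is not lost

    public void Signal()
    {
        lock (_sync)
        {
            _signalled = true;
            Monitor.PulseAll(_sync);
        }
    }

    public void WaitForSignal()
    {
        lock (_sync)
        {
            // Wait releases the lock while blocked and re-acquires it before
            // returning, so the flag is always checked under the lock.
            while (!_signalled)
                Monitor.Wait(_sync);
        }
    }

    public static void Main()
    {
        var gate = new PulseGate();
        gate.Signal();                          // signal arrives before anyone waits
        var waiter = new Thread(gate.WaitForSignal);
        waiter.Start();
        waiter.Join();                          // returns because the flag was recorded
        Console.WriteLine("done");
    }
}
```

Without the flag, the waiter in this example would block forever, because the pulse was sent before it started waiting.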

It's hard to draw real conclusions on a system with only 4 cores; to really determine the effect of massive parallelism you would need a NUMA machine with many more cores.

However, I tried to test the theory anyway:

Here is a run with more iterations but the same amount of concurrency and contention (Small runs are unreliable due to general background multi-tasking on the machine. There is also a CLR/JIT warm up cost but I've also ignored this to keep things simple.).

AMD Processor (4 cores)

ManualResetEvent suffers a lot showing the advantage in performance of MonitorWaitAndPulse.

Kernel objects do become expensive if you use a LOT of them, which can be seen here.

What would make more sense would be to up the concurrency, however I cannot up the true concurrency on my hardware, only the total number of threads which would then be time sliced.

The second run was with 10,000 iterations of 120 (40x3) tasks. In real world use I'd recommend using ManualResetEvent until you find you need more performance. Premature optimization is usually a bad idea.
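For reference, the basic ManualResetEvent signalling pattern looks roughly like this. It is a minimal sketch with hypothetical names, not an excerpt from the benchmark.

```csharp
using System;
using System.Threading;

public static class ResetEventSketch
{
    public static int RunOnce()
    {
        int result = 0;
        using (var ready = new ManualResetEvent(false))
        {
            var worker = new Thread(() =>
            {
                result = 42;     // do the work
                ready.Set();     // signal; the event stays set until Reset is called
            });
            worker.Start();
            ready.WaitOne();     // returns once Set has been called, even if that happened earlier
            worker.Join();
        }
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(RunOnce());   // 42
    }
}
```

Because the event latches its state, there is no equivalent of a missed pulse, which is a large part of why it is the easier primitive to maintain.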

MonitorWaitAndPulse offers good mid performance as long as you can ensure you don't miss pulses and have low contention.

As you can see from this example with mild contention MonitorWaitAndPulse is not a clear winner over lock free code.

As contention rises, the locks in MonitorWaitAndPulse will cause its performance to degrade, favoring lock-free code such as MySpinLock or the Flag with MemoryFence and SpinWait.

MySpinLock makes use of 'lock free programming' using CompareAndSwap (CAS).

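 #pragma warning disable 0420
 using System.Threading;
 namespace ThreadCommunication
 {
   struct MySpinLock
   {
     volatile int _taken;   // 0 = free, 1 = held
     public void Enter()
     {
       // CAS: set _taken to 1 only if it is currently 0, otherwise yield and retry
       while(Interlocked.CompareExchange(ref _taken, 1, 0) != 0) Thread.Yield() /*spin*/;
     }
     public void Exit() { _taken = 0; }
   }
 }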

This allows us to create a spin lock without using CLR or kernel locking primitives. Normally you should only spin for short periods; a more typical use would be to simulate a critical section. When using spin locks for short periods you would not want to yield. In this case, however, without the yield we would effectively waste a thread's quantum every time slice until the work is done.

We should be able to do better with SpinWait, which lets us wait for a condition to be set. It also applies a 'backing off' algorithm, yielding or sleeping for longer periods if our wait is 'too busy'.
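A minimal sketch of waiting on a flag with SpinWait might look like the following. The class, field and method names are hypothetical and this is not the benchmark code; it only shows the SpinOnce loop that provides the backing-off behaviour.

```csharp
using System;
using System.Threading;

public static class SpinWaitSketch
{
    private static int _flag;   // 0 = not set, 1 = set

    public static void SetFlag()
    {
        Volatile.Write(ref _flag, 1);
    }

    public static int WaitForFlag()
    {
        var spinner = new SpinWait();
        while (Volatile.Read(ref _flag) == 0)
        {
            // Busy-spins at first, then starts yielding (and eventually
            // sleeping) as the wait drags on.
            spinner.SpinOnce();
        }
        return spinner.Count;   // number of SpinOnce calls that were made
    }

    public static void Main()
    {
        var setter = new Thread(() =>
        {
            Thread.Sleep(10);
            SetFlag();
        });
        setter.Start();
        WaitForFlag();
        setter.Join();
        Console.WriteLine("flag observed");
    }
}
```

The framework also offers SpinWait.SpinUntil, which wraps the same loop around a condition delegate.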

Note: The .NET Framework also comes with its own SpinLock class. In production code you should use the framework types where possible; the MySpinLock code is provided only to demonstrate CAS programming principles. The framework SpinLock is specifically designed for short-running operations and favors thread affinity.
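For comparison, typical use of the framework SpinLock looks roughly like this. It is a sketch with hypothetical names that protects a shared counter; the Enter(ref bool) and Exit() calls are the documented usage pattern.

```csharp
using System;
using System.Threading;

public static class SpinLockSketch
{
    // SpinLock is a struct: keep it in a non-readonly field and never copy it.
    private static SpinLock _lock = new SpinLock();
    private static int _counter;

    public static void Increment()
    {
        bool taken = false;
        try
        {
            _lock.Enter(ref taken);
            _counter++;             // keep the critical section short
        }
        finally
        {
            if (taken) _lock.Exit();
        }
    }

    public static int Run(int threads, int perThread)
    {
        _counter = 0;
        var workers = new Thread[threads];
        for (int i = 0; i < threads; i++)
        {
            workers[i] = new Thread(() =>
            {
                for (int n = 0; n < perThread; n++) Increment();
            });
            workers[i].Start();
        }
        foreach (var w in workers) w.Join();
        return _counter;
    }

    public static void Main()
    {
        Console.WriteLine(Run(4, 10000));   // 40000
    }
}
```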

Code can be downloaded here.

Monday, 24 September 2012

Which WCF binding is best?

Following on from Rick Rainey's excellent WCF binding article I decided to have a quick look myself.

Obviously you should choose a binding that suits your transport needs; for example, if you need security or transaction support then you will need to add that. If you want peak performance, it's also likely that you would go for a lighter-weight communications stack than WCF and, in the most extreme cases, possibly not use the TCP/IP protocol at all.

The following results were obtained by running a modified version of Rick's code:

 Payload Size: 35000, Iterations: 1  
   Binding Elements:  
 Results (milliseconds): 10  
 CustomBinding (HTTP/Binary/ReliableSession(ordered)/NoSecurity)  
 Payload Size: 35000, Iterations: 1  
   Binding Elements:  
 Results (milliseconds): 9  
 CustomBinding (TCP/Binary/NoSecurity)  
 Payload Size: 35000, Iterations: 1  
   Binding Elements:  
 Results (milliseconds): 6  

It appears at least in this simple case that the number of BindingElements is the limiting factor on performance.

Example code here.

Based on Rick Rainey's original work.