Monday, 24 September 2012

Which WCF binding is best ?

Following on from Rick Rainey's excellent WCF binding article I decided to have a quick look myself.

Obviously you should choose a binding that suits your transport needs, for example if you need security or transaction support then you will need to add that. If you want peak performance its also likely that you would go for a lighter weight communications stack than WCF and possibly maybe even not use the TCP/IP protocol at all in the most extreme cases.

The following results were obtained by running a modified version of Rick's code :-

 Payload Size: 35000, Iterations: 1  
   Binding Elements:  
 Results (milliseconds): 10  
 CustomBinding (HTTP/Binary/ReliableSession(ordered)/NoSecurity)  
 Payload Size: 35000, Iterations: 1  
   Binding Elements:  
 Results (milliseconds): 9  
 CustomBinding (TCP/Binary/NoSecurity)  
 Payload Size: 35000, Iterations: 1  
   Binding Elements:  
 Results (milliseconds): 6  

It appears at least in this simple case that the number of BindingElements is the limiting factor on performance.

Example code here.

Based on Rick Rainey's original work.

Actors in C# with NAct

The Actor Model is a popular way to get reliable currency into your application through isolated immutability.

Here we will see how to create a basic Actor and use that Actor to calculate primes just as in the previous Java and C# examples.

private static void Main(string[] args)  
    if(args.Length < 2)  
        Console.WriteLine("Usage: number numberOfParts");  
    int primesBound = int.Parse(args[0]);  
    int concurrency = int.Parse(args[1]);  
    Stopwatch stopwatch = Stopwatch.StartNew();  
    int count = CountPrimes(primesBound, concurrency);  
    Console.WriteLine("Found {0} primes under {1}", count, primesBound);  
    Console.WriteLine("It took {0} seconds", (stopwatch.ElapsedMilliseconds)/1.0e3f);  

Here we simply take in the upper bound of the number range in which we want to find primes and the level of concurrency like before in our Java example.

Here we will see how to use the open source library NAct to create Actors in C#.First we must define an actor interface :-
public interface IPrimesActor : IActor  
    Task<int> CountPrimesInRange(int lower, int upper);  

Then we must implement our actor interface :-

public class PrimesActor : IPrimesActor  
    async Task<int> IPrimesActor.CountPrimesInRange(int lower, int upper)  
        return await Task.Factory.StartNew(() => PrimesComputation.CountPrimesInRange(lower, upper));  

And finally, you need to tell NAct to turn your object into an actor, as you create it by using the ActorWrapper.
The object you get back from the wrapper is an NAct proxy. Whenever you call a method on it, NAct will call the same method on your object at some point in the right thread.

Here is CountPrimes implemented using NAct Actors.
internal static int CountPrimes(int bound, int concurrency)  
    var tasks = CountPrimesAsync(bound, concurrency);  
    var sums = tasks.GetAwaiter().GetResult();  
    return sums.Sum();  
private async static Task<int[]> CountPrimesAsync(int bound, int concurrency)  
    int chunks = bound / concurrency;  
    var actors = new List<Task<int>>();  
    for (int i = 0; i < concurrency; i++)  
        int lower = i * chunks;  
        int upper = (i + 1) * chunks;  
        var actor = ActorWrapper.WrapActor<IPrimesActor>(() => new PrimesActor());  
        var task = actor.CountPrimesInRange(lower, upper);  
    var primeCounts = await Task.WhenAll(actors);  
    return primeCounts;  

And here is the result :-
 Found 664579 primes under 10000000  
 It took 3.636 seconds  

Source code here.

You can find out more about NAct here.
You can install NAct using NuGet using this command.

Debugging Concurrent Code in VS 2012

Visual Studio 2012 comes with some enhanced debugger features that help when debugging parallel code.

We can see this by running the C# Blender example.

You can see Threads and Tasks using the debug windows.

Here is the Parallel Tasks view, you can get it with Debug > Windows > Parallel Tasks

You can also launch a Parallel Watch window with Debug > Windows > Parallel Watch
You can enter watch expressions and see their evaluation under multiple tasks, here we examine num.

Another feature is the Parallel Stacks, here you can see the main thread stack along with the fork join example and its spawned threads.

There is also a concurrency visualizer that allows to see if you are making good use of you machines cores, it mainly shows occupancy, but you can also see events and locks.

The following picture shows the result of running the Thread pool blender followed by the sequential blender. You can see that we get good occupancy with the thread pool followed by a period of inactivity, followed by the sequential version which fully occupies just one core.

This image shows the four threads in green that are occupying the four cores of the machine, along with 9 blocked threads in red, one UI thread in brown, and one thread sleeping. 

That concludes a brief overview of the concurrency features in the Visual Studio 2012 debugger.

Saturday, 22 September 2012

Quickstart Task-based WCF service

This post shows you how to create a task based WCF Service using Visual Studio 2012.

1. Create a new project using File > New > Project.
In the wizard select the Visual C# \ WCF \ WCF Service Application

2. Alter the service definition to return Task
public async Task<string> GetData(int value)   
   Thread.Sleep(4); // Simulate a slow call...  
   return await Task.Factory.StartNew( 
       () => string.Format("You entered: {0}", value));  

You will also need to alter the interface to return Task also.

We use a sleep to simulate a lengthy operation and make the asynchronous behavior very obvious, this would not exist in real code.

3. Right click the solution select Add > New Project
In the wizard select Visual C# \ Windows \ WPF Application

4. Right click the WPF application
Select Add Service Reference
Click Discover, select Service1
Click Advanced, ensure 'Allow generation of asynchronous operations' and 'Generate task-based operations' are enabled.

5. Edit the MainWindow.xaml for the WPF application, make it look like the following :-
<Window x:Class="WpfApplication1.MainWindow"  
    Title="MainWindow" Height="100" Width="200">  
    <RowDefinition />  
    <RowDefinition />  
  <TextBlock x:Name="txtMessage" Grid.Row="0">Service Not Yet Called</TextBlock>  
  <Button x:Name="btnAsync" Click="btnAsync_Click" Grid.Row="1">Call Service</Button>  

6. Now implement the buttom event handler in the MainWindow.xaml.cs
async private void btnAsync_Click(object sender, RoutedEventArgs e)
  using(var client = new ServiceReference1.Service1Client("BasicHttpBinding_IService1"))
    txtMessage.Text = await client.GetDataAsync(5);

7. Now right click the solution and select Set Startup projects
Select 'Multiple Strtup projects' and set both projects to 'Start'.

8. Now press F5 to launch the debugger

The WCF service will be started and hosted by IISExpress, you will see the root folder listing appear in your browser.
At the same time the client will be launched and you will see the WPF user interface.
Click the 'Call Service' button and you should see the message change to 'You entered: 5' indicating a successful service call!
To prove your client really is making an async call and not blocking the UI thread, try moving the window with the mouse after clicking the 'Call Service' button.

Well done! You've created your first Task based WCF service...

WCF Serialization performance test

This is an application I wrote some time ago to try to determine the performance benefits of protobuf-net serialization when used with WCF.

(If running the program in Visual Studio be sure to use 'Run as Administrator' to start VS.)

You can configure various aspects like the number of client proxies and the number of calls to make to each proxy.

 WcfClient /?  
 Wcf Test Client...  
 WcfClient [Verbose=False] [NamedPipes=True] [ServerAddress=localhost] [ProxyCount=1] [RunCount=1]  
 Possible arguments:  
 Help      Display Usage Help information.  
 Verbose   Turn on verbose logging to console.  
 NamedPipes   Use NamedPipes to talk to a local server process.  
 ServerAddress Servername / IP Address to use for remote server when not using NamedPipes  
 ProxyCount   The number of client proxies to run in parallel.  
 RunCount    The total number of test WCF calls to make across all proxies.  

Here is a basic run with just one proxy and one call.

 Wcf Test Client...  
 Running with ProxyCount=1, RunCount=1...  
 ---------- Run Basic Orders Tests ----------  
 Testing NetTcp Protobuf v2.4.0a ...  
 Setup test endpoint net.tcp://localhost:8754/WcfLoadTest/Proto  
 Time: 383ms, Bytes sent: 406, Bytes received: 585,507  
 Testing NetTcp ServiceStack v2.05 TypeSerializer ...  
 Setup test endpoint net.tcp://localhost:8755/WcfLoadTest/ServiceType  
 Time: 842ms, Bytes sent: 416, Bytes received: 1,080,166  
 Testing DataContract NetTcp ...  
 Setup test endpoint net.tcp://localhost:8752/WcfLoadTest/Tuned  
 Time: 90ms, Bytes sent: 406, Bytes received: 585,507  
 Testing NetDataContract NetTcp ...  
 Setup test endpoint net.tcp://localhost:8753/WcfLoadTest/Tuned  
 Time: 223ms, Bytes sent: 495, Bytes received: 415,674  
 Testing BasicHttp ...  
 Setup test endpoint http://localhost:8750/WcfLoadTest/Vanilla  
 Time: 159ms, Bytes sent: 365, Bytes received: 2,050,499  
 Testing Json WebHttp ...  
 Setup test endpoint http://localhost:8751/WcfLoadTest/Json  
 Time: 764ms, Bytes sent: 277, Bytes received: 1,207,660  
 ---------- Run Complex Graph Tests ----------  
 Testing NetTcp Protobuf v2.4.0a ...  
 Setup test endpoint net.tcp://localhost:8754/WcfLoadTest/Proto  
 Time: 234ms, Bytes sent: 416, Bytes received: 1,102,680  
 Testing DataContract NetTcp ...  
 Setup test endpoint net.tcp://localhost:8752/WcfLoadTest/Tuned  
 Time: 138ms, Bytes sent: 416, Bytes received: 1,102,680  
 Testing NetDataContract NetTcp ...  
 Setup test endpoint net.tcp://localhost:8753/WcfLoadTest/Tuned  
 Time: 130ms, Bytes sent: 501, Bytes received: 150,334  
 Testing BasicHttp ...  
 Setup test endpoint http://localhost:8750/WcfLoadTest/Vanilla  
 Time: 92ms, Bytes sent: 344, Bytes received: 1,662,004  
 Press any key to exit...  

Here is a run with 20 proxies and 80 service calls.

 Wcf Test Client...  
 Running with ProxyCount=20, RunCount=80...  
 ---------- Run Basic Orders Tests ----------  
 Testing NetTcp Protobuf v2.4.0a ...  
 Setup test endpoint net.tcp://localhost:8754/WcfLoadTest/Proto  
 Time: 3505ms, Bytes sent: 21,620, Bytes received: 46,814,400  
 Testing NetTcp ServiceStack v2.05 TypeSerializer ...  
 Setup test endpoint net.tcp://localhost:8755/WcfLoadTest/ServiceType  
 Time: 7010ms, Bytes sent: 22,360, Bytes received: 86,407,760  
 Testing DataContract NetTcp ...  
 Setup test endpoint net.tcp://localhost:8752/WcfLoadTest/Tuned  
 Time: 3510ms, Bytes sent: 21,620, Bytes received: 46,814,400  
 Testing NetDataContract NetTcp ...  
 Setup test endpoint net.tcp://localhost:8753/WcfLoadTest/Tuned  
 Time: 4992ms, Bytes sent: 24,240, Bytes received: 33,217,140  
 Testing BasicHttp ...  
 Setup test endpoint http://localhost:8750/WcfLoadTest/Vanilla  
 Time: 4923ms, Bytes sent: 27,376, Bytes received: 164,039,900  
 Testing Json WebHttp ...  
 Setup test endpoint http://localhost:8751/WcfLoadTest/Json  
 Time: 8056ms, Bytes sent: 20,360, Bytes received: 96,612,800  
 ---------- Run Complex Graph Tests ----------  
 Testing NetTcp Protobuf v2.4.0a ...  
 Setup test endpoint net.tcp://localhost:8754/WcfLoadTest/Proto  
 Time: 4002ms, Bytes sent: 21,820, Bytes received: 88,195,440  
 Testing DataContract NetTcp ...  
 Setup test endpoint net.tcp://localhost:8752/WcfLoadTest/Tuned  
 Time: 3577ms, Bytes sent: 21,820, Bytes received: 88,195,440  
 Testing NetDataContract NetTcp ...  
 Setup test endpoint net.tcp://localhost:8753/WcfLoadTest/Tuned  
 Time: 2931ms, Bytes sent: 24,360, Bytes received: 11,999,420  
 Testing BasicHttp ...  
 Setup test endpoint http://localhost:8750/WcfLoadTest/Vanilla  
 Time: 2249ms, Bytes sent: 27,520, Bytes received: 132,960,300  
 Press any key to exit...  

There are two types of serialization tested basic flat non cyclic data and cyclic graphs. It can be seen from these results that in real world scenarios there is very little advantage to using Protobuf over the standard Microsoft serializers.

Some people have reported an improvement of 1 millisecond per call with large basic payloads using protobuf v1 (without graph support), however you would need to carefully consider your system to determine if this will be suitable in your system, for most people the MS serializers should be more suitable.

The actual test is rather more complicated than many similar tests, this is because it tries to more closely mimic a real world scenario. There are separate processes for the client and the server and the client passes the binding to the server using a management WCF service. The server (WCF host) will create and tear down services as requested by the management service. In this manner the client can automate and contol the server in order to control the number of endpoints and the binding types.

One issue with this approach is not all WCF bindings are serializable, however most are and I believe the test covers enough of the standard bindings to still be useful.

You can get the code here.

Friday, 21 September 2012

Simple Calculators

Basic calculator apps that demonstrate how to write simple C++ MFC applications and perform simple GUI I/O. Second example also demonstrates basic recursive descent parsing and expression tree evaluation. Operators supported are binary addition and subtraction. Integral number operands in decimal and hexadecimal are supported.

Download Source

Download Executables


Ruminations on C++ by Andrew Koenig & Barbara Moo, ISBN: 0-201-42339-1, Addison Wesley.
The C++ Standard Library by Nicolai M. Josuttis, ISBN: 0-201-37926-0, Addison Wesley.
Recursive descent parsing - Html parsing article.

Thermodynanic Equations for Water and Steam

C Library to perform thermodynamic water calculations. Code based on FORTRAN 77 code from Steam book.
Quantities measured in units other than temperature, pressure or density are stored in dimensionless units until output time.

Download Source


NBS/NRC STEAM TABLES: Thermodynamic and Transport Properties and Computer Programs for Vapor and Liquid States of Water in SI Units by Haar L, Gallagher JS, Kell GS, ISBN: 0-89116-353-0, Hemisphere Publishing Co.

OpenGL molecule viewer

This another old project of mine that demonstrates basic OpenGL, it also makes use of WTL and the Boost Graph Library.

Main Interface

Molecule Colour Key


Atilla - ATL/WTL framework by The Sells Brothers.
OpenGL Superbible by Richard S. Wright, Jr. & Michael Sweet, ISBN: 1-57169-164-2, Waite Group Press.
OpenGL III: Building an OpenGL C++ Class by Dale Rogerson, MSDN.
OpenGL VI: Rendering on DIBs with PFD_DRAW_TO_BITMAP by Dale Rogerson, MSDN.
Animation Techniques for Win32 by Nigel Thompson, ISBN: 1-55615-669-3, Microsoft Press.
(Nigel's Animate Library).
The Boost Graph Library by Jeremy G. Siek, Lie-Quan Lee, Andrew Lumsdaine, ISBN: 0-201-72914-8, Addison Wesley.
WTL Makes UI Programming a Joy, Part 1: The Basics Doc with samples, DevelopMentor.
WTL Makes UI Programming a Joy, Part 2: The bells and whistles Doc with samples, DevelopMentor.

The source is here.

PID Controller in C++

This is an old project I did to teach myself basic control theory, its a PID controller written in Visual C++ 6 using MFC.

The project also contains its own Graph MFC control which is based on MFC GDI primitives.

Its a simulation of a basic PID controller.

PID Control Theory

Proportional-Integral-Derivative (PID) is a control algorithm used to set and maintain a measurement at a set-point. The PID algorithm calculates its output by summing three terms.

  • Proportional - A value proportional to the error of the measurement. 
  • Integral - A value proportional to the integral of the error over time. 
  • Derivative - A value proportional to the rate of change of the error.

The general form of the PID equation is thus:

m(t) = K* (e(t) + Kiòe(t)dt + Kdde)
          proportional   integral       derivative


m(t) = controller output
Kc = proportional gain
Ki = integral time constant
Kd = derivative time constant
e(t)  = error as a function of time = S(t) - X(t)
S(t) = current process setpoint.
X(t) = measured input

Variables Kc, Ki and Kd are adjustable for a given application of the controller.
Computer PID algorithms are based on data sampled on a cyclic basis.

The sample based PID equation is thus:
m(i) = K* (e(i) + T*Kiåe(k) + (Kd/T)*(e(i)-e(i-1)))
           proportional    integral            derivative


T = sampling interval
e(i) = error at ith sampling interval = S(t)-X(t)
e(i-1) = error at pervious sampling interval
e(k) = error at k where k increments by 1 through the range 0 <= k <= i
m(i) = controller output
Kc = proportional gain
Ki = integral action time
Kd = derivative action time

Complex Error

In practice, the first order difference term, (e(i) - e(i-1)), is very susceptible to noise problems. In most practical systems this term is replaced by the more stable, higher order equation:

                De = ( e(i) + 3*e(i-1) - 3*e(i-2) - e(i-3) ) / 6

Ramp Limit

Many systems implement a limit on the maximum rate at which the output value can change in order to stop any physical output devices from being ‘thrashed’.


Practical systems stop the summation of error terms if the current PID output level is outside a user specified range of high and low output values. This limiting of the summation term is commonly referred to as anti-reset-windup.


The general flow of a PID control will be as follows:

PID Control Diagram

An individual PID control  will function as follows: Automatic PID control A desired value or setpoint is entered. The input will be read. The difference between the input and the setpoint or error is calculated. An output value is calculated using this information and the PID algorithm. The output is optionally limited buy various methods. The output ramping is optionally limited.

Ramp Limit

The ramp limiting is optional when the PID calculation is active. If the user output is set the ramp limiting will always be applied. The ramp limiting is specified in degrees representing the angle the output makes with the horizontal on a graph. The time axis resolution is therefore critical to the significance of this parameter. A limit of zero degrees would make output ramping impossible and a limit of 90 degrees or greater would allow maximum ramping.


Anti reset windup is only applied when selected and the PID calculation is active. The anti reset windup upper and lower bounds are specified as a percentage of the entire output range. A limit of zero percent would make output ARW impossible and a limit of 50 percent or greater would allow maximum ARW.


Tuning and control loop performance
by Gregory K. McMillan
ISBN: 1-55617-49261
ISA - The Instrumentation, Systems, and Automation Society.

Tuning of industrial control systems
by Armando B. Corripio
ISBN: 1-55617-6996
ISA - The Instrumentation, Systems, and Automation Society.

Quinn-Curtis, Real-time graphics tools for windows manual.


Source code here.
Executable here.

Thursday, 20 September 2012

Troubleshooting Kerberos

Here are some tools that can be used to diagnose Kerberos issues.

This is the main tool to use on Windows to display ticket information for a given computer running the Kerberos protocol. It displays information from the Windows ticket cache.

This command-line tool allows you to manage the Service Principal Names (SPN) property for an Active Directory™ directory service account. SPNs are used to locate a target principal name for running a service.
Useful when creating test accounts for use with Kerberos on Windows.

Process Monitor
Process Monitor is an advanced monitoring tool for Windows that shows real-time file system, Registry and process/thread activity. It combines the features of two legacy Sysinternals utilities, Filemon and Regmon, and adds an extensive list of enhancements including rich and non-destructive filtering, comprehensive event properties such session IDs and user names, reliable process information, full thread stacks with integrated symbol support for each operation, simultaneous logging to a file, and much more. Its uniquely powerful features will make Process Monitor a core utility in your system troubleshooting and malware hunting toolkit.

Process Explorer
Ever wondered which program has a particular file or directory open? Now you can find out. Process Explorer shows you information about which handles and DLLs processes have opened or loaded.
The Process Explorer display consists of two sub-windows. The top window always shows a list of the currently active processes, including the names of their owning accounts, whereas the information displayed in the bottom window depends on the mode that Process Explorer is in: if it is in handle mode you'll see the handles that the process selected in the top window has opened; if Process Explorer is in DLL mode you'll see the DLLs and memory-mapped files that the process has loaded. Process Explorer also has a powerful search capability that will quickly show you which processes have particular handles opened or DLLs loaded.
The unique capabilities of Process Explorer make it useful for tracking down DLL-version problems or handle leaks, and provide insight into the way Windows and applications work.
You should be able to see what security tokens a process has by using properties on a process and clicking the security tab. (Make sure you are running process explorer as administrator.)

This allows you to authenticate with a principal and generate tickets.
This can be used on Windows and linux. Its use on Windows is limited however as it works off its own file cache, not the Windows ticket cache.

View and deleting the Kerberos tickets granted to the current logon session.
This can be used on windows and linux. Its use on windows is limited however as it works off its own file cache, not the Windows ticket cache.

Enable Kerberos Logging
Windows offers the capability of tracing detailed Kerberos events through the event log mechanism. You can use this information too troubleshoot Kerberos. This article describes how to enable Kerberos event logging.

Insight for Active Directory v1.01
ADInsight is an LDAP (Light-weight Directory Access Protocol) real-time monitoring tool aimed at troubleshooting Active Directory client applications. Use its detailed tracing of Active Directory client-server communications to solve Windows authentication, Exchange, DNS, and other problems.
ADInsight uses DLL injection techniques to intercept calls that applications make in the Wldap32.dll library, which is the standard library underlying Active Directory APIs such LDAP API and ADSI . Unlike network monitoring tools, ADInsight intercepts and interprets all client-side APIs, including those that do not result in transmission to a server. ADInsight monitors any process into which it can load it’s tracing DLL, which means that it does not require administrative permissions, however, if run with administrative rights, it will also monitor system processes, including windows services.

Logman creates and manages Event Trace Session and Performance logs and supports many functions of Performance Monitor from the command line.  Filters can be added to log Kerberos events.

MIT Kerberos Client
Network Identity Manager (NetIdMgr) is a graphical tool designed to simplify the management of network identities and their credentials which are used by network authentication protocols while providing secure access to network services.
When NetIDMgr is used with Kerberos v5 each network identity is a unique Kerberos principal name and the credentials are Kerberos v5 tickets. Kerberos v5 tickets can be used by NetIDMgr to obtain Andrew File System (AFS) tokens and X.509 public key certificates if the appropriate plug-ins are installed.

This is an ASP.NET application used to help troubleshoot and configure IIS and Active Directory to allow Kerberos and delegating Kerberos credentials.

Kerberos SPN Viewer
Simplify listing the ServicePrincipalName (SPN) and an integrated helper tool which can help us find out what SPN should we set based on the configuration that we are using.

Open source packet sniffer for Windows and Unix.

Microsoft Network Monitor
Microsofts packet sniffer, allows capturing and protocol analysis of network traffic.

This tool will compute the maximum token size and is used to test whether a system may exhibit the issue described in KB article 327825.

This GUI tool is a Lightweight Directory Access Protocol (LDAP) client that allows users to perform operations (such as connect, bind, search, modify, add, delete) against any LDAP-compatible directory, such as Active Directory. LDP is used to view objects stored in Active Directory along with their metadata, such as security descriptors and replication metadata.

Active Directory Users and Computers
Active Directory® Users and Computers is a Microsoft Management Console (MMC) snap-in that you can use to administer and publish information in the directory.

NTP commands

These can be useful if you are experiencing NTP issues, this can be common with some virtual machines.
w32tm /resync
net start w32time

Troubleshooting Kerberos Problems

Programming for Multicore with Java 7

With the age of Amdals Law almost at an end mulitcore processors are set to become commonplace and concurrent programming is going to become important to utilize those cores.

In this article we will look at how you can program for multicore era in Java.


First we will code an algorithm in a sequential fashion and then see how various approaches can be used to get speedup.
public class PrimesSequential extends PrimesComputation {  
    public Boolean[] computePrimes(int upto) {  
        Boolean[] results = new Boolean[upto];  
        for (int x = 0; x < results.length; x++)  
            results[x] = isPrime(x);  
        return results;  


Here we can see the basic Java Threads implementation, we have created one thread per processor.
We (roughly) partition the work across the threads. Then we wait all threads to complete by calling join() on all the threads from the main thread.
public Boolean[] computePrimes(int upto) throws InterruptedException {  
    int processorCount = Runtime.getRuntime().availableProcessors();  
    PrimesComputationThread[] threads = new PrimesComputationThread[processorCount];  
    Boolean[] results = new Boolean[upto];  
    for (int i = 0; i < processorCount; i++) {  
        threads[i] = new PrimesComputationThread(i * upto / processorCount,  
                (i + 1) * upto / processorCount, results);  
    for (int i = 0; i < processorCount; i++) {  
    return results;  
static class PrimesComputationThread extends Thread {  
    public void run() {  
        for (int x = from; x < to; x++)  
            results[x] = isPrime(x);  


Here you can see a similar approach to our threads version but this time using a thread pool. Thread pools give us better control over a collection of threads, as well as allowing us to manage threads over time and amortize thread overheads.
public Boolean[] computePrimes(int upto) throws InterruptedException {  
    int processorCount = Runtime.getRuntime().availableProcessors();  
    ExecutorService threadPool = Executors.newFixedThreadPool(processorCount);  
    int noTasks = 100;  
    Boolean[] results = new Boolean[upto];  
    for (int i = 0; i < noTasks; i++) {  
        threadPool.execute(new PrimesComputationTask(i * results.length  
                / noTasks, (i + 1) * results.length / noTasks, results));  
    threadPool.awaitTermination(100, TimeUnit.SECONDS);  
    return results;  


One problem with our partitioning scheme is that we do not evenly division up the work because calculating some primes will be harder than others, what is worse we cannot easily predict the difficulty of finding a given prime. There is however a basic solution, there is a well known scheduling algorithm called work stealing, idle threads can steal work from busy threads therefore helping to spread the load. 
Another advantage of the ForkJoin pattern is it maps well to certain types of real world divide and conquer algorithms.
public class PrimesForkJoin extends PrimesComputation {  
    public Boolean[] computePrimes(int upto) throws Exception {  
        ForkJoinPool pool = new ForkJoinPool();  
        Boolean[] results = new Boolean[upto];  
        pool.invoke(new PrimesTask(0, results.length, results));  
        return results;  
    static class PrimesTask extends RecursiveAction {  
        public void compute() {  
            if (to - from < 1000)  
                for (int x = from; x < to; x++)  
                    results[x] = isPrime(x);  
            else {  
                PrimesTask left = new PrimesTask(from, (to + from) / 2, results);  
                PrimesTask right = new PrimesTask((to + from) / 2, to, results);  


Java 8 will add support for ParallelArray's, you can get a glimpse of what this might look like by using the jsr166y libray jars. The idea is to eventually allow simple loop parallel like C#s Paralell.For.

The code is functional in nature, replaceWithMappedIndex() replaces each element in the array with the results of applying the given mapping. In this case our mapping is simply the primality status of the number.

public Boolean[] computePrimes(int upto) {  
    ParallelArray<Boolean> p = ParallelArray.create(upto, Boolean.class, 
    p.replaceWithMappedIndex(new IntToObject<Boolean>() {  
        public Boolean op(int x) {  
            return isPrime(x);  
    return p.getArray();  

 Sequential version  
 #   primes: 476648  
 Time: 15916ms  

 Threads version  
 #   primes: 476648  
 Time: 4818ms  
 Speed-up: 3.30  

 Pool version  
 #   primes: 476648  
 Time: 3450ms  
 Speed-up: 4.61  

 Fork-join version  
 #   primes: 476648  
 Time: 3380ms  
 Speed-up: 4.71  

 Parallel version  
 #   primes: 476648  
 Time: 4888ms  
 Speed-up: 3.26  

Here you can see the relative speed-up of the different approaches and how there is a trade off between implementation complexity and speed-up. In this case the sweet spot seems to be with the Pool version which is not too complicated but also gives good performance.

Parallel arrays ideally should provide the simplest implementation if there is sufficient language support, but it seems likely they will not quite match a hand crafted approach.

Correctly working with shared mutable state

Next we will look at the basic sorts of issues that arise with multi threaded applications with shared mutable state.

We are going to try to count the number of primes using a noLivePrimes integer field.

 public class PrimesLive extends PrimesComputation implements  
         LiveResults<Integer[]> {  
     Integer[] livePrimes;  
     Integer noLivePrimes;  
     public Boolean[] computePrimes(int upto) throws InterruptedException {  
         // Compute in Parallel ...
     class PrimesComputationTask implements Runnable {  
         public void run() {  
             for (int x = from; x < to; x++) {  
                 a[x] = isPrime(x);  
                 if (a[x])  
                     livePrimes[noLivePrimes++] = x;  

There is a problem with this code, the new functionality has introduced mutable shared state, we have introduced a Data Race.

A first attempt with synchronization...

The standard solution to this is use locks to serialise access to the shared state, in Java we normally do this with the synchronized keyword.

         public void run() {  
             for (int x = from; x < to; x++) {  
                 a[x] = isPrime(x);  
                 if (a[x]) {  
                     int tmp;  
                     synchronized (livePrimes) {  
                         tmp = noLivePrimes;  
                         noLivePrimes = tmp + 1;  
                     livePrimes[tmp] = x;  

Contended locks however hurt performance, we lose parallelism as threads have to wait for free access to the shared state.

 Live version  
 #   primes: 476648  
 # live primes: 471033  
 Time: 4147ms  
 Speed-up: 3.84  

 Live version with synchronization  
 #   primes: 476648  
 # live primes: 476648  
 Time: 3879ms  
 Speed-up: 4.10  

Here we can see the performance loss caused by the locks.

Atomics to the rescue...

AtomicInteger noLivePrimes;  
class PrimesComputationTask implements Runnable {  
    public void run() {  
        for (int x = from; x < to; x++) {  
            a[x] = isPrime(x);  
            if (a[x]) {  
                int tmp = noLivePrimes.getAndIncrement();  
                livePrimes[tmp] = x;  

There is a very simple solution in this case, since the shared state is a single primitive we can use an AtomicInteger to guard the  shared state.

Atomics are actually a form of lock free programming using special Compare and Swap (CAS) instructions supported in hardware. This works similarly to optimistic locking or STM where the processor proceeds on the basis of no conflict and then 'rolls back' if a conflict occurs.

 Live version with atomic
 #   primes: 476648  
 # live primes: 476648  
 Time: 3686ms  
 Speed-up: 4.32  

Here you can see we got our performance back.

You can find the source code here.

Based on I2PC Summer School on Multicore Programming course held at the University of Illinois at Urbana-Champaign.

Wednesday, 19 September 2012

Useful programs 2012

Here is a list of general free software that I use to make my life easier.

Ubuntu / Mint / Wubi - Popular Linux distros
VLC - Media player with many in built codecs
WinScp - SFTP client
WinWget - GUI based Wget
Wireshark - Packet sniffer
Fiddler - HTTP debug proxy
Http Ping - URL probing
Microsoft Security Essentials - Virus Scanner
Chrome - Web browser
Adblock - Ad blocker
Firefox  - Web browser
Firebug - DHTML debugger
Windows Live Essentials - Windows Extras
TeamViewer - Remote desktop
Skype - Voice Over IP client
PEBrowse - Portable Executable browser
IDAPro Free - Disassembler
WinDbg - Kernel mode debugger
Notepad2 - Notepad replacement
7-Zip - Archiver
SqlServer Express - RDBMS
Visual Studio Express - Integrated Development Environment
IIS Express - Web server
Eclipse -  Integrated Development Environment
Depends - DLL dependency viewer
Just Decompile - CIL decompiler
Java Decompiler - Java decompiler
Microsoft Windows SDK - Windows libraries and tools
ReferenceSource NET - Reference source for .NET
Microsoft Windows Performance Toolkit
XamlPad - XAML Editor
Kaxaml - XAML Editor
Tomcat - Web server
Tortoise SVN - Version Control client
Slik Subversion - Version Control client
AnkSvn - Version Control IDE plugin
CCleaner - Disk Cleanup
WinDirStat - Disk usage
SysInternals Suite - Utilities
ExactFile - Checksum tool
CCProxy - Personal proxy
Paint .NET - Paint package
EasyThumbnails - Thumbnail creator
Virtual CloneDrive - ISO mount tool
Microsoft Mathematics - Programmable Calculator
Speccy - System information
DiffMerge - Diff tool
EasyBCD - Boot editor
VirtualBox - Virtualization
VMware Player - Virtualization
CPU-Z - System information
doPDF - PDF Converter
HDTune - Hard disk benchmark
Windows Product Key Finder Professional - License key finder
RevoUninstaller - Uninstaller
LockHunter - File unlocker
ImgBurn - DVD Burner
Start8 - Get the 'start' button back on Windows 8 and Windows Server 2012

TDD with Mockito

You may not be getting the most out of your TDD and JUnit, Mockito is an effective mocking framework that works great.

Mockito Verify

Verify allows you to set up expectations on the mock methods that are 'probed' by the test case in a whitebox manner.
public void testMockitoVerify()  
    //mock creation  
    List<String> mockedList = mock(List.class);  
    //using mock object  

Mockito Stubbing (toReturn, toThrow)

Mockito also supports stubs by allowing you to specify return values when certain methods are called. Once stubbed, the method will always return stubbed value regardless of how many times it is called.
public void testMockitoStubbing()  
     //You can mock concrete classes, not only interfaces  
     LinkedList mockedList = mock(LinkedList.class);  
     stub(mockedList.get(1)).toThrow(new RuntimeException());  
     //following prints "first"  
     //following prints "null" because get(999) was not stubbed  

Mockito Invocations (times, atLeast, atMost)

You can verify a certain number of calls by using times for an exact number of calls, atMost for upto a number and atLeast for above a number. 
public void testMockitoInvocations()  
    //mock creation  
    List<String> mockedList = mock(List.class);  
    //using mock  
    mockedList.add("three times");  
    mockedList.add("three times");  
    mockedList.add("three times");  
    //verification, times(1) is used by default  
    //verification using never(). never() is an alias to times(0)  
    verify(mockedList, never()).add("never happened");  
    //verification using atLeast()  
    verify(mockedList, atLeastOnce()).add("three times");  
    verify(mockedList, times(3)).add("three times");  
    //verify using atMost  
    verify(mockedList, atMost(3)).add("three times");  

Mockito Whitebox (when / thenReturn)

You can specify mock return values using when and thenReturn.
public void testMockitoWhitebox() {  
    //mock creation  
    Calendar mock = mock(Calendar.class);  
    TimeZone tz = mock(TimeZone.class);  
    String result = mock.getTimeZone().getID();  
    assertEquals("PST", result);  

Mockito Throws

You can set up exception throwing behaviour for you mocks using doThrow.  
@Test(expected = RuntimeException.class)   
public void testMockitoThrows() {  
    //mock creation  
    List<String> mockedList = mock(List.class);  
    doThrow(new RuntimeException()).when(mockedList).clear();        

Mockito Verification Equals

You can verify mock method parameters using eq.
public void testMockitoEq() {  
    //mock creation  
    List mock = mock(List.class);  

Mockito Verification Any

If you do not care to verify a specific parameter values being passed to a mock you can use 'any'.
public void testMockitoAny() {  
    //mock creation  
    List<String> mockedList = mock(List.class);  

Mockito Verification ArgThat

If you want to use a custom matcher to verify an argument then you can use argThat.
public void testMockitoArgThat() throws IOException{          
    OutputStream mock=mock(OutputStream.class);  
    OutputStreamWriter osw=new OutputStreamWriter(mock);  
    // can't do this as we don't know how long the array is going to be  
    // verify(mock).write(new byte[]{'a'},0,1);  
    BaseMatcher arrayStartingWithA=new BaseMatcher(){  
        public void describeTo(Description description) {  
            // nothing  
        // check that first character is A  
        public boolean matches(Object item) {  
            byte[] actual=(byte[]) item;  
            return actual[0]=='a';  
    // check that first character of the array is A, 
    // and that the other two arguments are 0 and 1  
    verify(mock).write((byte[]) argThat(arrayStartingWithA), eq(0),eq(1));      

You can get Mockito here.
Example source here.

TDD with Hamcrest

You may not be getting the most out of your TDD and JUnit, Hamcrest provides some additions. Hamcrest allows you to make your assertions more fluent and get better error messages.

Hamcrest equalTo()

Although not necessary, some developers like to use is and equalTo together because it feels more fluent to them. This is the very reason for is's existence: to make use of other matchers more fluent.

   public void testWithHamcrestEqualTo()  
       final int result = 5*2;  
       assertThat(result, equalTo(10));  

Hamcrest sameInstance()

You can test for object identity using sameInstance, this test also shows a not matcher that negates a logical condition.
   public void testWithHamcrestNotSameInstance()  
       final BigDecimal expectedResult = new BigDecimal(6*2);  
       final BigDecimal result = new BigDecimal(6*2);  
       assertThat(result, not(sameInstance(expectedResult)));  

Hamcrest nullValue() and notNullValue()  

You can test for null values like so :-

   public void testHamcrestIsNull()  
       final String s = null;  
       assertThat(s, is(nullValue()));  
   public void testHamcrestNotNull()  
       final String s = "String";  
       assertThat(s, notNullValue());  

Hamcrest greaterThan(), etc...

Hamcrest has the usual relational operators you might expect.

   public void testHamcrestGreaterThan() {  
     assertThat(2, greaterThan(1));      

Hamcrest instanceOf()

You can test for the class type of an object using the instanceOf matcher.

   public void testHamcrestInstanceOf()  
       Date now = new Date();  
       assertThat(now, instanceOf(Date.class));  

Hamcrest containsInAnyOrder()

This is useful for checking collections contain certain elements when you don't care about the entire collection or the order.
     @Test public void testHamcrestContainsInAnyOrder() {  
       final Set<String> set = new HashSet<String>();   
       assertThat(set, containsInAnyOrder("one", "two"));  

Hamcrest hasProperty()

This is useful to access JavaBean properties.

   @Test public void testHamcrestHasProperty() {  
       Button test = new Button("Hello");  
     assertThat(test, hasProperty("label", equalTo("Hello")));  

The Hamcrest works with JUnit 4 and you can download it here.
Example source here.

Monday, 17 September 2012

Basic Actor Model using TPL Dataflow in C#

The Actor Model is a popular way to get reliable currency into your application through isolated immutability.

Microsoft does not provide an Actor class, but they do provide a part of the Task Parallel Library called Dataflow. This is most useful for building asynchronous Pipelines but can also be used to create basic agents.

Here we will see how to create a basic non transactional Actor and use that Actor to calculate primes just as in the Java Akka example.

private static void Main(string[] args)  
    if(args.Length < 2)  
        Console.WriteLine("Usage: number numberOfParts");  
    int primesBound = int.Parse(args[0]);  
    int concurrency = int.Parse(args[1]);  
    Stopwatch stopwatch = Stopwatch.StartNew();  
    int count = CountPrimes(primesBound, concurrency);  
    Console.WriteLine("Found {0} primes under {1}", count, primesBound);  
    Console.WriteLine("It took {0} seconds", (stopwatch.ElapsedMilliseconds)/1.0e3f);  

Here we simply take in the upper bound of the number range in which we want to find primes and the level of concurrency like before in our Java example.

public static int CountPrimes(int bound, int concurrency)  
    int chunks = bound / concurrency;  
    var actors = new List<PrimesActor<Tuple<int, int>, int>>();  
    for(int i=0; i < concurrency; i++) {  
        int lower = i * chunks;  
        int upper = (i+1) * chunks;  
        var actor = PrimesActor<Tuple<int, int>, int>.MakeActor(PrimesComputation.CountPrimesInRange);  
        actor.Post(new Tuple<int, int>(lower, upper));  
    return actors.Sum(a => a.Receive());  

You can see we create the number of Actors we need to meet our desired concurrency level. We add each actor to a list and we post a message to each actor. Finaly we sum the results for all actors by using LINQ.

This is a very basic example to show the fundamentals of TPL Dataflow, in a real implementation we would most likely want to have less actors and have them do more work. In such a situation we would most likely use other Dataflow blocks to perform the sum. If we wanted a more complete Actor model however we may find the TPL Dataflow inadequate compared to Akka or Stact.

 internal class PrimesActor<I, O> where I : Tuple<O, O>   
   private readonly BufferBlock<I> _incomingMessages = new BufferBlock<I>();  
   private readonly BufferBlock<O> _outgoingMessages = new BufferBlock<O>();  
   private readonly Func<O, O, O> _processFunc;  
   private PrimesActor(Func<O, O, O> processFunc)  
     _processFunc = processFunc;  
   public static PrimesActor<I, O> MakeActor(Func<O, O, O> processFunc)  
     var actor = new PrimesActor<I, O>(processFunc);  
     return actor;  
   internal async void ProcessMessages()  
       var message = await _incomingMessages.ReceiveAsync();  
       O count = _processFunc(message.Item1, message.Item2);  
   internal void Post(I message)  
   internal O Receive()  
     return _outgoingMessages.Receive();  

Here we can see the definition of our Actor class, it simply pulls messages off the input queue, executes a Func delegate and then puts the result on the output queue.

 Found 664579 primes under 10000000  
 It took 2.368 seconds  

Here you can see again we get a reasonable degree of concurrency.

Get the source code here.

References :-

Read about TPL Dataflow here.
Ensure you read this document.
Install TPL Dataflow with Nuget.

Sunday, 16 September 2012

Actor Based Concurrency in Java

Often the best way to implement concurrency in your program is with isolated mutability. A popular technique to achieve isolated mutability is by using the Actor model. This is not a new idea and was implemented in Smalltalk in the 1970's.

The core idea is based on message passing, essentially an Actor is an object with an internal message queue, and a single fibre that reads from that queue. Other objects can post messages to the actor and these will be processed sequentially. The concurrency arises from the ability to have multiple actors working asynchronously.

This example demonstrates how to implement basic actor support using Java and a Scala library called Akka.

We will create a program that takes a number range say 0-10 and computes the number of primes in that range, for example there are 8 primes in the range 0-10.

Here you can see our main Java entry point :-
public static void main(final String[] args) {  
    if(args.length < 2) {  
        out.println("Usage: number numberOfParts");  
    final int primesBound = parseInt(args[0]);  
    final int concurrency = parseInt(args[1]);  
    final long start = System.nanoTime();  
    final int count = countPrimes(primesBound, concurrency);  
    final long end = System.nanoTime();  
    out.println(format("Found {0} primes under {1}", count, primesBound));  
    out.println(format("It took {0} seconds", (end - start)/1.0e9f));  

We can see here we take the number range upper bound and also we can specify a concurrency level that determines the amount of partitioning we use.

Firstly to define our class as an Actor we must subclass from an Akka Actor base class, here we use an UntypedActor, this leaves messages untyped a leaves the responsibility of deserialization on us.
public class PrimesActor extends UntypedActor {  

Next we must implement the message processing by implementing onReceive.
public void onReceive(final Object boundsList) {  
    final List<Integer> bounds = (List<Integer>) boundsList;  
    int count = PrimesComputation.countPrimesInRange(bounds.get(0), bounds.get(1));  

Here you can see we kick of our main task of prime computation by calling countPrimesInRange(). We then return the result as a Future by calling tell(count).

Now it gets interesting, lets see how we use Akka to implement our actor :-
public static int countPrimes(final int bound, final int concurrency) {  
    final int chunks = bound / concurrency;  
    final List<Future<?>> futures = new ArrayList<Future<?>>();  
    final ActorSystem system = ActorSystem.create();  
    final FiniteDuration duration = Duration.create(3, SECONDS);  
    final Timeout timeout = new Timeout(duration);  
    for(int i=0; i < concurrency; i++) {  
        final int lower = i * chunks;  
        final int upper = (i+1) * chunks;  
        final List<Integer> bounds = Collections.unmodifiableList(
            Arrays.asList(lower, upper));  
        final ActorRef primeFinder = system.actorOf(new Props(PrimesActor.class));  
        futures.add(ask(primeFinder, bounds, timeout));  
    int count = 0;  
    for(Future<?> f : futures) {  
        try {  
            count += (Integer)(Await.result(f, Duration.create(1, SECONDS)));  
        } catch (Exception e) {  
    return count;  

First we must initialise the Actor system by calling ActorSystem.create().

We then break the problem down into 'chunks' based on our concurrency level.

Messages in an Actor based system MUST be immutable, we therefore ensure our message is passed as an unmodifiable List. We create an actor for each chunk, we post the message using ask() and store its future in a future List.

Finally we iterate through our futures waiting for them to finish, the Await is an asynchronous blocking operation. We 'Join' these results into a total count and return the count. The Actor system must be shut down to terminate any remaining actors and free resources.

Here is are some sample runs :-
 java PrimesActor 10 10  
 Found 8 primes under 10  
 It took 1.242 seconds

 java PrimesActor 10000000 100  
 Found 664,579 primes under 10,000,000  
 It took 6.597 seconds  

You can see we get pretty good speedup.

Download source here (Eclipse project).

Sunday, 9 September 2012

Under the hood with VisualVM

Few people realise that an excellent lightweight profiler comes with the Oracle JDK.

VisualVM specializes in JVM monitoring and performance analysis.

Most people should have it installed with the JDK, on my machine its in :-
C:\Program Files\Java\jdk1.7.0_05\bin\jvisualvm.exe

Otherwise if you can't find it simply go and download it here, unzip the archive to a suitable place.

You can then run VisualVM using \bin\jvisualvm.exe or \bin\visualvm.exe

Goto Tools\Plugins and first install the default plugins :-

VisualVM Extensions
Visual GC
Threads Inspector
Tracer-Swing Probes
Tracer-JVM Probes
Tracer-Collections Probes
Tracer-Jvmstat Probes
Tracer-Monitor Probes

For concurrency analysis you can get additional ThreadDumpAnalyzer plugins here.

Logfile Plugin Version 2.2
TDA Library Plugin Version 2.2
TDA VisualVM Plugin Version 2.2
TDA 2.2 Final - Optional

Use Tools\Plugins\Downloaded to install the custom plugins (*.nbm files).

After installing the plugins then restart VisualVM.


After starting VisualVM you see a list of running JVM processes on the left. Double clicking a process will open the main tabbed interface. The tabs allow you to investigate different aspects of your programs runtime behaviour, like memory usage, running threads, GC generations, etc.

Heap Dump

You can generate a Java heap dump by using a right click menu on the process or from a button on the Monitor tab.

You can then analyse the heap dump using VisualVM.
You can see the objects on the heap, and how much memory they use. You can find the root reference that is causing the instance to remain alive by looking in the references window.

Garbage Collection

You can observe the garbage collector and the size of the generations using the VisualGC plugin.


The Profiler and Sampler tabs give basic profiling functionality.
The Profiler allows code to be instrumented and then uses this to measure how much time is spent in each method.
The ‘Sampler’ looks at the execution stack at a certain sample rate and from there infer time spent / memory used by function.

Threads Inspector

You can observe running threads in the threads inspector, these include both JVM threads like the finalizer thread as well as your own application threads.

Tracer Plugins

The ‘Tracer’-plugins allows monitoring of specific things in the VM, mostly stuff that is exposed using JMX, things like IO usage and JIT compilation.


Sure there is the excellent free jmap for statistics and heap dumps useful for triage in production, there is the Eclipse open source TPTP, then there are commerical tools like JProfiler and YourKit.
However free is hard to beat, and VisualVM is a useful tool to have in anyones toolbox!