Fizz Buzz gets chained…

There’s a simple children’s game called Fizz Buzz that is also used as a test for developers to show basic coding skills. Jeff Atwood even mentioned on his blog the use of this game as a simple test to weed out bad developers before hiring new ones. And there are still developers being hired these days who would fail the simple Fizz Buzz test. And that’s a shame.

So, what is Fizz Buzz? Well, basically you count from 1 to 100, but when you encounter a multiple of three, you say “Fizz” instead of the number. And if it’s a multiple of five, you say “Buzz” instead. So if you encounter a multiple of both three and five, you’ll have to say “Fizz Buzz”. It’s a simple math challenge for children and a coding challenge for developers. A good developer should be able to write this code within 10 minutes or so, which makes it an excellent test during the hiring process…
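Just as a baseline, a plain ten-minute version could look like this. (A minimal sketch of my own, before we overengineer things; the chained solution is what the rest of this post is about.)

```csharp
using System;
using System.Linq;

public static class PlainFizzBuzz
{
    // The straightforward rules: multiples of both first, then three, then five.
    public static string Say(int n) =>
        (n % 15 == 0) ? "Fizz Buzz" :
        (n % 3 == 0) ? "Fizz" :
        (n % 5 == 0) ? "Buzz" : n.ToString();

    // One line of output per number, 1 through 100.
    public static void Run() =>
        Enumerable.Range(1, 100).ToList().ForEach(n => Console.WriteLine(Say(n)));
}
```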

Now, the website “Rosetta Code” has an excellent collection of solutions to this problem, and many others, in a lot of different programming languages. These include several solutions for C#. (With C# being my most popular choice these days.) But I want to show how to FizzBuzz in a complex way, with lots of data analysis, and solve it through method chaining. In my previous post I already mentioned why methods that return void will break a method chain. This post will show how to split even a simple problem into many smaller parts, so each piece can be solved and managed on its own, but also reused in other projects.

So, let’s analyse the data we need and the data we will generate. Basically, we start with an array of numbers that gets translated to a single string. So, here we go:

  • We need to make a list of numbers.
  • Create pairs so we can store each call.
  • We need to return “Fizz”.
  • We need to return “Buzz”.
  • We need to return numbers as string values.
  • We need a condition telling us when to Fizz.
  • We need a condition telling us when to Buzz.
  • We need a condition telling us when to use a number.
  • We need to select just the string values.
  • We have to join the string values into a comma-separated string.

So, I generally start with creating a static class as I’m going to make a bunch of extension methods for C#. Why? Because extension methods allow me to add functionality to objects and classes without modifying them! But first let’s make a few friendlier names for data types that we will be “using”… (Pun intended!)

using IntEnum = IEnumerable<int>;
using Pair = KeyValuePair<int, string>;
using PairEnum = IEnumerable<KeyValuePair<int, string>>;
using StringEnum = IEnumerable<string>;

That’s the data we will need. We will need an enumeration of integers. Those will be converted to pairs so we can maintain the proper order and know which number belongs to which result. These pairs are also enumerated and in the end, they will be converted to strings before we make a single string out of them all…

So now we’re going through the list of functions:

Make a list of numbers.

This is an easy one. But while many might just create a for-loop to walk through, I prefer to create an enumeration instead, as this allows me to “flow” all the data. So my first method:

public static IntEnum MakeArray(this int count, int start = 0) => Enumerable.Range(start, count);

This is a simple trick that not many developers are familiar with. Instead of using a for-loop, I use Enumerable.Range() to create an enumeration to start my method chain. I can now use 50.MakeArray(10) to make an enumeration of 50 numbers that starts at 10 (and thus ends at 59).

Create pairs so we can store each call.

Here we will create a few methods. First, we’ll start with:

public static Pair NewPair(this int key) => new Pair(key, string.Empty);

This code takes an integer and returns a pair with that integer as key and an empty string as value. Then we’ll create:

public static Pair NewPair(this Pair pair, string data) => new Pair(pair.Key, data.Trim());

Here we create a new pair based on the old pair. Thus we won’t change existing pairs, as those might still be used for other purposes. We also trim the string value to remove excess spaces.
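The full chain at the end also calls a MakePairs() method that works on the whole enumeration. Its definition doesn’t appear in this post, but given the aliases above it would presumably be a one-liner along these lines (a reconstruction, so treat it as a sketch):

```csharp
using System.Linq;
using Pair = System.Collections.Generic.KeyValuePair<int, string>;
using IntEnum = System.Collections.Generic.IEnumerable<int>;
using PairEnum = System.Collections.Generic.IEnumerable<System.Collections.Generic.KeyValuePair<int, string>>;

public static class PairExtensions
{
    // From the post: wrap an integer in a pair with an empty string as value.
    public static Pair NewPair(this int key) => new Pair(key, string.Empty);

    // Reconstructed MakePairs(): turn every number in the enumeration into a pair.
    public static PairEnum MakePairs(this IntEnum data) => data.Select(NewPair);
}
```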

We need to return “Fizz”.

Seriously? Well, yeah.

public static Pair DoFizz(this Pair data) => data.NewPair("Fizz " + data.Value);

The reasoning behind it is that we want to add the word “Fizz” in front of the string that we already have. As we start with empty strings and trim the value when creating a new pair, we should just get “Fizz” as result. But if the result already contains “Buzz”, we will get “Fizz Buzz” as result. Including the space!

We need to return “Buzz”.

This is oh, so simple. 🙂

public static Pair DoBuzz(this Pair data) => data.NewPair(data.Value + " Buzz");

The difference between this method and the DoFizz() method is that this one adds “Buzz” at the end of the result, not at the start. So it doesn’t matter if we first Fizz or Buzz, the result is always “Fizz Buzz”. Including the space…
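You can verify that claim in isolation: whichever of the two runs first, a multiple of fifteen ends up as “Fizz Buzz”. (A self-contained sketch repeating the methods above:)

```csharp
using Pair = System.Collections.Generic.KeyValuePair<int, string>;

public static class FizzBuzzPairs
{
    public static Pair NewPair(this int key) => new Pair(key, string.Empty);
    public static Pair NewPair(this Pair pair, string data) => new Pair(pair.Key, data.Trim());

    // "Fizz" goes in front of the current value, "Buzz" goes at the back.
    public static Pair DoFizz(this Pair data) => data.NewPair("Fizz " + data.Value);
    public static Pair DoBuzz(this Pair data) => data.NewPair(data.Value + " Buzz");
}
```

Both 15.NewPair().DoFizz().DoBuzz() and 15.NewPair().DoBuzz().DoFizz() produce the value “Fizz Buzz”, including the space.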

We need to return numbers as string values.

More simple code, but even simpler…

public static Pair DoNumber(this Pair data) => data.NewPair(data.Key.ToString());

When we put a number in the result, we don’t care about its current content. It gets replaced in all cases…

We need a condition telling us when to Fizz.

We can’t Fizz all the time so we need to have a condition check.

public static bool IsFizz(this Pair data) => data.Key % 3 == 0;

I could have made it even more generic with an int as parameter but I use this solution to make it readable!

We need a condition telling us when to Buzz.

We also need to determine when to Buzz…

public static bool IsBuzz(this Pair data) => data.Key % 5 == 0;

We need a condition telling us when to use a number.

And we need to decide when we want to have a number as result!

public static bool IsNumber(this Pair data) => string.IsNullOrEmpty(data.Value);

But how to get those conditional statements inside my enumerations?

Well, we need the if-statement to become part of the method chain. So we create another extension method for it.

public static T If<T>(this T data, Func<T, bool> condition, Func<T, T> action) => condition(data) ? action(data) : data;

This shows some power behind extension methods, doesn’t it? Because I’ve declared both the condition and action as separate methods, I can use a shorthand method when calling them. And the shorthand I will use for the Fizz(), Buzz() and Number() methods I’m creating:

public static Pair Fizz(this Pair data) => data.If(IsFizz, DoFizz);
public static Pair Buzz(this Pair data) => data.If(IsBuzz, DoBuzz);
public static Pair Number(this Pair data) => data.If(IsNumber, DoNumber);

And as these methods work on just a single pair, I also need methods to use for the full enumeration. Fortunately, I can use the same name, with different types:

public static PairEnum Fizz(this PairEnum data) => data.Select(Fizz);
public static PairEnum Buzz(this PairEnum data) => data.Select(Buzz);
public static PairEnum Number(this PairEnum data) => data.Select(Number);

The PairEnum is just an IEnumerable<Pair> as declared by the “using” before. Rather than declaring a new class, I prefer to create an alias for this type.

The result is that we now have three methods which each work on an enumeration, changing the result of each pair where required.

We need to select just the string values.

This is another simple one, but here the data flow changes type. Selecting the results can be done by using:

public static StringEnum Values(this PairEnum data) => data.Select(p => p.Value);

But before selecting all results, we might want to put all items in the correct order:

public static IOrderedEnumerable<Pair> SortPairs(this PairEnum data) => data.OrderBy(p => p.Key);

Now, this sort is not required when you go through the enumerations normally. But because you can also enumerate in parallel, and thus use multiple threads to walk through everything, you might get the results in a random order. So, sorting fixes that…

How do we make it execute in parallel? Well, I would just need one adjustment to one method:

public static PairEnum Number(this PairEnum data) => data.AsParallel().Select(Number);

I only added AsParallel() in this method to change it from a sequential into a parallel enumeration. The result is that the numbers will be processed in parallel, across multiple threads. This means the order of my data becomes unpredictable. And let’s make it even better:

public static PairEnum Number(this PairEnum data, bool parallel = false) => parallel ? data.AsParallel().Select(Number) : data.Select(Number);

Now we can specify whether we want the numbering executed in parallel or not. The default is sequential, so I would not need to sort afterwards. But if I indicate that I want it executed in parallel, then the order of the data becomes unpredictable and I would need to sort afterwards.

If I also apply this change to Fizz() and Buzz(), then it becomes extremely interesting when everything is done in parallel.

We have to join the string values into a comma-separated string.

Well, joining string values in a list is simply done by calling the string.Join() method but as I’m making a lot of methods already, I can just add one more:

public static string Combine(this StringEnum data) => string.Join(", ", data);

And all these methods together allow me to control the data flow in the finest details. Generally too fine for solving the FizzBuzz challenge, but it does make an interesting example…

So, how to FizzBuzz?

Okay, one more method:

public static string FizzBuzz(this int count) =>
count.MakeArray(1)
.MakePairs()
.Fizz()
.Buzz()
.Number()
.Values()
.Combine();

Notice how all methods used so far are basically one-liners. And this method too is just a single statement. It’s just written over multiple lines to make it readable.

And it shows the data flow in reasonably clear names. We take a number and make an array of numbers. These get converted to pairs, the pairs then get Fizzed, Buzzed and numbered before we convert them to strings and join them into a single string result. No sorting, as I’m not doing anything in parallel here. And the result will be: 1, 2, Fizz, 4, Buzz, Fizz, 7, 8, Fizz, Buzz, 11, Fizz, 13, 14, Fizz Buzz, 16, 17, Fizz, 19, Buzz, Fizz, 22, 23, Fizz, Buzz, 26, Fizz, 28, 29, Fizz Buzz, 31, 32, Fizz, 34, Buzz, Fizz, 37, 38, Fizz, Buzz, 41, Fizz, 43, 44, Fizz Buzz, 46, 47, Fizz, 49, Buzz, Fizz, 52, 53, Fizz, Buzz, 56, Fizz, 58, 59, Fizz Buzz, 61, 62, Fizz, 64, Buzz, Fizz, 67, 68, Fizz, Buzz, 71, Fizz, 73, 74, Fizz Buzz, 76, 77, Fizz, 79, Buzz, Fizz, 82, 83, Fizz, Buzz, 86, Fizz, 88, 89, Fizz Buzz, 91, 92, Fizz, 94, Buzz, Fizz, 97, 98, Fizz, Buzz

Which reminds me: I had to use MakeArray(1) to make sure I start with 1, not 0. Otherwise, I would get 100 numbers starting with 0 and ending at 99, not 100…

Now, you’re probably wondering how this works. And it’s likely that you assume it creates a list of integers, then a list of pairs, then a list of Fizzed pairs, et cetera. And you’re wrong! But let’s modify the code a bit. Let’s display technical information using my Do() method, and a variant of Do()…

Adding Do…

public static T Do<T>(this T data, Action<T> action) { action(data); return data; }
public static T Do<T>(this T data, Action action) { action(); return data; }
public static IEnumerable<T> Do<T>(this IEnumerable<T> data, Action<T> action) { foreach (var item in data) { action(item); yield return item; } }

These three Do() methods are very generic and as I’ve mentioned in my previous post, it allows using methods returning void within a method chain.

The first Do() performs an action on the data and then passes the data on. In the chain below, it is only used at the very end, on the combined string.

The second Do() can perform actions that don’t depend on the data. This is ideal to “step out” of the method chain to do something else before returning to the chain.

The third Do() will do an action for each item inside an enumeration without breaking the enumeration! And why we want to do this will be shown when we call this monster:

public static string FizzBuzzNormalStatus(this int count) =>
count.Do(() => Console.WriteLine("Linear: Initializing…"))
.MakeArray()
.Do(() => Console.WriteLine("Linear: Array created…"))
.MakePairs()
.Do(pair => Console.WriteLine($"Pair: {pair.Key} -> {pair.Value}"))
.Do(() => Console.WriteLine("Linear: Pairs made…"))
.Fizz()
.Do(pair => Console.WriteLine($"Fizz: {pair.Key} -> {pair.Value}"))
.Do(() => Console.WriteLine("Linear: Fizzed…"))
.Buzz()
.Do(pair => Console.WriteLine($"Buzz: {pair.Key} -> {pair.Value}"))
.Do(() => Console.WriteLine("Linear: Buzzed…"))
.Number()
.Do(pair => Console.WriteLine($"Number: {pair.Key} -> {pair.Value}"))
.Do(() => Console.WriteLine("Linear: Numbered…"))
.SortPairs()
.Do(pair => Console.WriteLine($"Sort: {pair.Key} -> {pair.Value}"))
.Do(() => Console.WriteLine("Linear: Sorted…"))
.Values()
.Do(() => Console.WriteLine("Linear: Extracted…"))
.Combine()
.Do((s) => Console.WriteLine($"Linear: Finalizing: {s}…"));

And yes, this is still a single statement. It’s huge because after each method I write a line indicating that this method has just finished, after writing all the data it has processed. So, you would expect to see “Linear: Array created…” being written, followed by a bunch of pairs and then “Linear: Pairs made…”.

Wrong!

But I’ll show this by FizzBuzzing just the numbers 0 to 4:

Linear: Initializing…
Linear: Array created…
Linear: Pairs made…
Linear: Fizzed…
Linear: Buzzed…
Linear: Numbered…
Linear: Sorted…
Linear: Extracted…
Pair: 0 ->
Fizz: 0 -> Fizz
Buzz: 0 -> Fizz Buzz
Number: 0 -> Fizz Buzz
Pair: 1 ->
Fizz: 1 ->
Buzz: 1 ->
Number: 1 -> 1
Pair: 2 ->
Fizz: 2 ->
Buzz: 2 ->
Number: 2 -> 2
Pair: 3 ->
Fizz: 3 -> Fizz
Buzz: 3 -> Fizz
Number: 3 -> Fizz
Pair: 4 ->
Fizz: 4 ->
Buzz: 4 ->
Number: 4 -> 4
Sort: 0 -> Fizz Buzz
Sort: 1 -> 1
Sort: 2 -> 2
Sort: 3 -> Fizz
Sort: 4 -> 4
Linear: Finalizing: Fizz Buzz, 1, 2, Fizz, 4…

And yes, that’s the output of this code! It tells you the array was created, the pairs prepared, Fizzed, Buzzed and numbered, sorted and extracted, and only then does it actually do all those things!

This is what makes enumerations so challenging to understand! They’re not arrays or lists. They’re something many developers are less familiar with. This is why “yield return” is so useful when enumerating data. I don’t need a lot of memory to store lists of data; instead, each item travels through the whole chain up to the point where the data gets aggregated.
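This deferred behaviour is easy to demonstrate outside of FizzBuzz. In this little sketch, nothing is executed until the enumeration is actually consumed:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyDemo
{
    // Yields numbers one at a time, logging each step.
    public static IEnumerable<int> Numbers(List<string> log)
    {
        for (var i = 1; i <= 3; i++)
        {
            log.Add($"yield {i}");
            yield return i;
        }
    }

    public static List<string> Run()
    {
        var log = new List<string>();
        var query = Numbers(log).Select(n =>
        {
            log.Add($"select {n}");
            return n * 2;
        });
        log.Add("query built");      // nothing has executed yet
        var result = query.ToList(); // only now does the pipeline run
        log.Add($"result: {string.Join(", ", result)}");
        return log;
    }
}
```

The log starts with “query built”, followed by alternating “yield”/“select” entries: each item flows through the whole chain before the next one is produced.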

Which in this case happens when I sort the data, as I need all data before I can sort it. If I don’t sort in-between, the data will only be aggregated when I call the Combine() method.

This is important to remember during data processing, as an exception in one of these methods can have unexpected consequences. All items before the faulty one will already have been processed by the aggregating method. For FizzBuzz, that wouldn’t be a problem. But if one of the methods in the chain writes data to a database or file, then part of the data will already have been written by the time the exception occurs.

So, when using enumerators, you might end up with garbage if you’re unaware of how data passes from one method to another. It feels unnatural for inexperienced developers yet it makes this FizzBuzz example so much more interesting…

Avoid the void, introducing Do!

I’ve been programming for a very long time. I wasn’t even 10 years old when I got access to computers, and that was back around 1975! My first programming experiences were on a programmable calculator and later with BASIC on a ZX-81. Around 1985 I started learning other programming languages and around 1988 I even used Turbo Pascal 5.5, which had support for object-oriented programming. Well, a bit limited.

When I started using Turbo Pascal and later Delphi, I also learned the value of methods that would return objects. But it would still take a while before I realized the full power of this. Still, in Turbo Pascal 6 there was a feature called Turbo Vision that could be used to create complete menu structures and dialog screens on a text console! It made extensive use of methods returning objects to make method chaining possible.

Method chaining is a simple principle. You have a class and the class has methods. And each method can return an object of a specific class. That gives access to a second method. And a third, a fourth, a fifth, until you get to a method that returns void.
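A toy example of the principle (nothing to do with Turbo Vision, just the mechanics):

```csharp
using System;

public class Greeting
{
    private string _text = "";

    // Each method returns the object itself, so calls can be chained.
    public Greeting Hello() { _text += "Hello"; return this; }
    public Greeting To(string name) { _text += ", " + name; return this; }
    public override string ToString() => _text;

    // A void method: the chain comes to a dead stop here.
    public void Print() => Console.WriteLine(_text);
}
```

So new Greeting().Hello().To("world").Print(); works fine, but nothing can be chained after Print().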

That’s a dead stop there!

I recently worked on a project where I used a long method chain to keep a clear workflow visible in my code. My code was selecting data, ordering it, manipulating it and then saving it to a comma-separated values (CSV) file. Something like this:

MyData
.ToList()
.OrderBy(SortOrder)
.Take(50)
.Select(NewDataFormat)
.SaveToCSV(Filename);

But SaveToCSV was a method that returned void, so the chain breaks there! But I wanted to continue the workflow as I needed to do more with this data. I also wanted to display it on screen, save it to a database or even filter it a bit more. So, to resolve this I needed a better method that would allow me to continue the chain.

Of course, I could also have written my code like this:

var myList = MyData.ToList();
var mySortedList = myList.OrderBy(SortOrder);
var myTop50 = mySortedList.Take(50);
var myNewList = myTop50.Select(NewDataFormat);
myNewList.SaveToCSV(Filename);

This code would also allow me to continue the workflow and to use myNewList for further processing. But it introduces a bunch of variables, and it allows unrelated code to be included between these lines, obscuring the workflow! That would not work! The chain of work would be broken by irrelevant tasks.

So, now I have two options. Either I modify the SaveToCSV method so it will return a value (preferably the object itself) or I make an extension method inside a static class that would allow me to put a void method within my chain. And I came up with this beautiful, yet simple method:

public static T Do<T>(this T data, Action<T> action) { action(data); return data; }

And this simple, yet beautiful construction is a generic method that can be used for any class, any object! And it can change my code into this:

MyData
.ToList()
.OrderBy(SortOrder)
.Take(50)
.Select(NewDataFormat)
.Do(d => d.SaveToCSV(Filename))
.Do(d => Console.WriteLine(d));

Now my data flow is still intact and the flow of the code is still very readable. It is easy to see that this is just one block of code. Most people tend to forget that programming isn’t about writing code. It’s about how to process data.

Using a method chain is a perfect way to visualize data flow inside your code. But to allow proper method chaining, each method will have to return some kind of object for further processing. Otherwise, you will need a Do<> extension method in your project to make a void method part of the chain.

But to keep it simple, when using a void method, you’ll basically put a stop to the data flow within your application. You would then have to start a new data flow for further processing. This is okay, but generally not the best design in programming. While not everyone might be a fan of method chaining, it still is a very powerful way to write code as it forces you to keep irrelevant code outside of the flow.

A very generic datamodel.

I’ve come up with several projects in the past and a few have been mentioned here before. For example, the GarageSale project, which was based on a system I called “CART”. Or the WordChain project, which was a bit similar in structure. And because of those similarities, I’ve been thinking about a very generic datamodel that could be applied to almost any project.

The advantage of a generic database is that you can focus on the business layer while you don’t need to change much in the database itself. The datamodel would still need development, but by using the existing model, mapping to existing entities, you could keep it all very simple. And it resulted in the datamodel in this class diagram. (Click the image to see a bigger version.)

The top class is ‘Identifier’, which is just an ID of type GUID to find the records. That will work fine in derived classes too. Since I’m using Entity Framework 6, I can just use POCO classes to keep it all very simple. All I have to do is define a DbContext that tells me which tables (classes) I want. If I don’t create an entry for ‘Identifier’, the table won’t be created either.
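As a POCO sketch, the top of the hierarchy could be as small as this (the property names beyond the class names mentioned here are my own assumptions):

```csharp
using System;

// Sketch of the top of the datamodel; only the class names come from
// the diagram, the properties are illustrative assumptions.
public class Identifier
{
    public Guid Id { get; set; } = Guid.NewGuid();
}

public class DataTemplate : Identifier
{
    public string Schema { get; set; }      // XML schema used to validate content
}

public class DataContent : Identifier
{
    public string Xml { get; set; }         // any XML payload
    public DataTemplate Template { get; set; }
}

public class BaseItem : Identifier
{
    public string Name { get; set; }
}

public class BaseLink : Identifier          // a transaction between two items
{
    public BaseItem From { get; set; }
    public BaseItem To { get; set; }
}
```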

The next class is the ‘DataContent’ class, which can hold any XML. That way, this class can contain all information that I define in code without the need to create new tables. I also linked it to a ‘DataTemplate’ class, which can be used to validate the XML content with an XML schema or a special style sheet. (I still need to work out how, exactly.)

The ‘BaseItem’ and ‘BaseLink’ classes are the most important ones here. ‘BaseItem’ contains all fixed data within my system. In the CART system, this would be the catalog. And ‘BaseLink’ defines transactions of a specific item from one item to another. And that’s basically three-fourths of the CART system. (The template is already defined in the ‘DataTemplate’ class.)

I also created two separate link types. One deals with whole numbers and is called ‘CountLink’, which you generally use for items. (One cup, two girls, etc.) The other is for fractional numbers like weights or money and is called ‘AmountLink’. These two will be the most used transaction types, although ‘BaseLink’ can be used to transfer unique items. Derived links could be created to support more special situations, but I can’t think of any.

The ‘BaseItem’ class will be used to derive more special items. These special items define the relations with other items in the system. The simplest of them is the ‘ChildItem’ class, which defines additional information related to a specific item. Child items are strongly linked to their parent item, like wheels on a car or keys on a keyboard.

The ‘Relation’ class is used to group multiple items together. For example, we can have ‘Books’ defined as a relation with multiple book items linked to it. A second group called ‘Possessions’ could also be created to contain all the things I own. Items that are in both groups would make up my personal library.

A special relation type is ‘Property’ which indicates that all items in the relation are owned by a specific owner. No matter what happens with those items, their owner stays the same. Such a property could e.g. be a bank account with a bank as owner. Even though customers use such accounts, the account itself could not be transferred to some other bank.

But the ‘Asset’ class is more interesting since assets are the only items that we can transfer. Any transaction will be about an asset moving from one item to another. Assets can still be anything and this class doesn’t differ much from the ‘BaseItem’ class.

A special asset is a contract. Contracts have a special purpose in transactions. Transactions are always between an item and a contract. Either you put an asset into a contract or extract it from a contract. And contracts themselves can be part of bigger contracts. By checking how much has been sent or received to a contract you can check if all transactions combined are valid. Transactions will have to specify if they’re sending items to the contract or receiving them from the contract.

The ‘BaseContract’ class is the more generic contract type and manages a list of transactions. When it has several transactions, it is important that no ‘phantom items’ are left. (A phantom item is something that’s sent to the contract but not received by another item, or vice versa.) These contracts need to be balanced as a check to see if they can be closed or not. They should be temporary and last from the first transaction to the last.
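Balancing such a contract boils down to a simple check: everything sent into the contract must also have been taken out. A sketch of that check (the type and property names are assumptions, not part of the datamodel above):

```csharp
using System.Collections.Generic;
using System.Linq;

// A transaction sketch: positive amounts go into the contract,
// negative amounts come out of it.
public record Transaction(string Item, decimal Amount);

public static class Contracts
{
    // A BaseContract is balanced when nothing is left "phantom":
    // everything sent in has also been taken out again.
    public static bool IsBalanced(IEnumerable<Transaction> transactions) =>
        transactions.Sum(t => t.Amount) == 0m;
}
```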

The ‘Contract’ type derived from ‘BaseContract’ contains an extra owner. This owner will be the one who owns any phantom items in the contract. This reduces the amount of transactions and makes the contract everlasting. (Although it can still be closed.) Balancing these contracts is not required, making them ideal as e.g. bank accounts.

Yes, it’s a bit more advanced than my earlier CART system, but I’ve considered how I could use this for various projects I have in mind. Not just the GarageSale project, but also a simple banking application, a chess notation application, a project to keep track of sugar measurements for people with diabetes, and my WordChain application.

The banking application would be interesting. It would start with two ‘Relation’ records: “Banks” and “Clients”. The Banks relation would contain Bank records with information of multiple banks. The Clients relation would contain the client records for those banks. And because of the datamodel, clients can have multiple banks.

Banks would be owners of bank accounts, and those accounts would be contracts. All the bank needs to do is keep track of all money going in or out of the account. (Making money just another item; all transactions would be of type ‘AmountLink’.) But to link those accounts to the persons who are authorized to withdraw money from the account, each account would need to own a Property record. The property record then has a list of clients authorized to manage the account.

And we would need six different methods to create transactions. Authorized clients can add or withdraw money from the account. Other clients can send payments to, or receive payments from, the account, where any money received from the contract needs to be authorized. Finally, the bank charges interest, or pays interest. (Or not.) These interest transactions don’t need authorization from the client.

The chess notation project would also be interesting. It would start with a Board item and 64 Square items, plus a bunch of piece assets. The game itself would be a basic contract without an owner. The Game contract would contain a collection of transactions transferring all pieces to their starting locations. A collection of ‘Move’ contracts, owned by the Game contract, would also be needed. Each Move would show which move it is (including branches of the game) and the transactions that take place on the board. (White rook gone from A1, white rook added to A4 and black pawn removed from A4, which translates into rook takes pawn at A4.)

It would be a very complex way to store a chess game, but it can be done in the same datamodel as my banking application.

With the diabetes project, each transaction would be a measurement. The contract would be owned by the person who is measuring his or her blood and we don’t need to send or receive these measurements, just link them to the contract.

The WordChain project would be a bit more complex. It would be a bunch of items with relations, properties and children. Contracts and assets would be used to support updates to the texts with every edit of a WordChain item kicking the old item out of the contract and adding a new item into the contract. That would result in a contract per word in the database.

A lot of work is still required to make sure it works as well as I expect. It would not be the most ideal datamodel for all these projects but it helps me to focus more on the business layer and the GUI without worrying about any database changes. Once the business model becomes more advanced, I could create a second data layer with a better datamodel to improve the performance of the data management.


Multithreading, multi-troubling.

Recently, I worked on a small project that needed to make a catalog of image files and folders on my hard disk and save this catalog in a database. Since my CGI and photography hobbies generate a lot of images, it would be practical to have something easy to support it all. Plenty of software already does something like this, but none that I liked. Especially since I want to connect images to derived images, group them, tag them, share them, assign licenses to them and publish them. And I want to keep track of where I’ve shared them already. Are they on Flickr? CafePress? DeviantArt? Plus, I wanted to know if they should be rated as adult. Some of my CGI artwork is naughty by nature (because nude models are easier to work with) and thus unsuitable for a broad audience.

But for this simple catalog I just wanted to store the image folder, the image filename, an image name that would be the filename without extension and without diacritics, plus the width and height of the image so I could calculate the image ratio. To make it slightly more complex, the folder name would be a relative folder name based on a root folder that’s set in the configuration. This would allow me to move the images to a different folder or use the same database on a different machine without the need to adjust all records.
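That diacritics-free name can be derived with Unicode normalization: Normalize(NormalizationForm.FormD) splits accented characters into a base character plus combining marks, which can then be filtered out. A sketch of the idea (not my actual project code):

```csharp
using System.Globalization;
using System.IO;
using System.Linq;
using System.Text;

public static class ImageNames
{
    // Filename without extension and without diacritics, as described above.
    public static string ToImageName(string fileName)
    {
        var bare = Path.GetFileNameWithoutExtension(fileName);
        // FormD decomposes "é" into "e" + a combining acute accent...
        var decomposed = bare.Normalize(NormalizationForm.FormD);
        // ...so dropping the NonSpacingMark characters removes the accents.
        var stripped = new string(decomposed
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray());
        return stripped.Normalize(NormalizationForm.FormC);
    }
}
```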

So, the database structure is simple. One table that has the folders, one table containing image ratios and one for the image names and sizes. The ratio table will help me to group images based on the ratio between width and height. The folder table would do the same for grouping by folder. The Entity Framework would help to connect to this database and take away a lot of my troubles. All I have to do now is write a simple library that would fill and keep up this catalog plus a console application to call those methods. Sounds simple enough.

Within 30 minutes, the first version was ready. I would first enumerate all folders below the source folder, then for each folder in that list I would collect all image files of type PNG, JPG and BMP. The folder would be written to the folder table and the file would be put in the Image table. Just one minor challenge, though…

I want to add the width and height of the image to the image table too, and based on the ratio between width and height, I would have to either add a new ratio record, or change an existing one. And this meant that I had to read every file into memory to find its size and then look if there’s already a ratio record related to it. If not, I would need to add the new ratio record and make sure the next request for ratio records would now include the new ratio record. Plus, I needed to check if the image and folder records also exist in the database, because this tool needs to update only for new images.

The performance was horrible, as could easily be predicted. Especially since I make images and photos at high resolutions, so reading those files takes dozens of milliseconds. No matter that my six cores at 3.5 GHz and 32 GB of RAM turn my system into a speed demon, these read actions are just slow. And I did it inefficiently, since I have six cores but my code was just single-threaded. So, redo from start, and this time do it multithreaded.

But multithreading and the Entity Framework don’t go well together. The database connection isn’t thread-safe, so you cannot access the database methods from multiple threads. Besides, the ratio table could generate collisions when two images with the same, new ratio are processed. Both threads would notice the ratio doesn’t exist, thus both would add it. But one of them would then fail because the other added it first. So I needed to change my approach.

So I used ‘Parallel.ForEach’ to walk through the folder list, and then again for all files within each folder. I would collect the data in internal lists and, when the file loop was done, loop through all images and add those that didn’t exist. And yes, that improved performance a lot and kept the conflicts with the ratio table away. Too bad I was still reading all images, but that was not a big issue. Performance went from hours to slightly over one hour. Still slow.
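The shape of that second version can be sketched like this (names are illustrative; the real code reads image sizes and talks to Entity Framework):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class CatalogScan
{
    // Collect (folder, file) pairs in parallel into a thread-safe bag.
    public static ConcurrentBag<(string Folder, string File)> CollectFiles(
        string[] folders, Func<string, string[]> listFiles)
    {
        var found = new ConcurrentBag<(string Folder, string File)>();
        // Threads only touch the thread-safe bag here; no database access.
        Parallel.ForEach(folders, folder =>
        {
            Parallel.ForEach(listFiles(folder), file => found.Add((folder, file)));
        });
        // The caller then loops over 'found' single-threaded to update the
        // database, avoiding the Entity Framework threading problems.
        return found;
    }
}
```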

So, one more addition. I would first read all existing folders and images from the database, and if a file existed in this list, I would not read its size anymore, since that wasn’t needed: I could simply skip the image. As a result, it still took an hour the first time I imported all images, but a second run would finish within a minute, since there wasn’t anything left to read or add. The speed was now limited to just reading the files and folders from the database and from the disk.
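A minimal sketch of that skip-list idea, with made-up file names (the real version would compare full paths loaded from the database):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Importer
{
    // Only files that are not yet in the database need the expensive
    // read-the-image-to-find-its-size step; the rest can be skipped.
    public static List<string> FilterNew(IEnumerable<string> onDisk, HashSet<string> inDatabase)
        => onDisk.Where(f => !inDatabase.Contains(f)).ToList();
}

class Program
{
    static void Main()
    {
        // Imagine this set was filled by one up-front database query.
        var known = new HashSet<string>(new[] { "a.jpg", "b.jpg" },
                                        StringComparer.OrdinalIgnoreCase);

        var disk = new[] { "a.jpg", "B.JPG", "c.jpg", "d.jpg" };

        Console.WriteLine(string.Join(",", Importer.FilterNew(disk, known)));
        // c.jpg,d.jpg  (B.JPG matches b.jpg case-insensitively)
    }
}
```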

When you’re operating these kinds of projects in an Agile team and you’re scrumming around, things will slow down considerably if you haven’t thought about these challenges before you started the sprint to create the code. Since the first version looks quite simple, you might have planned it as a very short task and thus end up with extremely slow code. In the next sprint you would have to consider options to speed things up and thus you will realize that making it multithreaded is a bigger task. And while you are working on the multithreaded version, you might discover the conflicts with the Entity Framework plus the possible collisions within the tables. So the second sprint might end with a buggy but faster solution with lots of exception handling to catch all possible problems. The third sprint would then fix these, if you manage to find a better solution. Else, this problem might haunt you to the deadline of the project…

And this is where teams have to be really careful. The task sounds very simple, but it’s not. These things are easily underestimated by a team and should be well planned before you start writing code. Experienced developers will detect these problems before they start, knowing that they should take their time and plan carefully instead of writing code immediately. (I only did it so I could write this post.) The task seems extremely simple and I managed to describe it in the second paragraph of this post in just three lines. But a high-performance solution requires me to think before I start writing code.

My last approach is the most promising, though. It can be done using multithreading, but it’s far more complex than you’d assume at first. And it will be memory-hungry, because you need to keep several lists in memory.

You would have to start with two threads. One thread reads the database and generates lists of files, folders and ratios. These lists must be completely in-memory, because if you keep them as queryable lists, the system would re-query the database every time you touch them. Besides, once you’re done generating these lists, you will want to close the database connection. This all tells you what you already have. The second thread reads all folders and, using parallel threads, all image files within those folders. But you would not read the image sizes yet, nor calculate any ratios.
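Why the lists must be materialized can be shown without a real database. In this sketch, the iterator method stands in for an Entity Framework table and counts how often it is (re-)enumerated; `ToList()` pins the data in memory, so the “connection” is needed only once:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class FolderSource
{
    public static int Queries;    // counts how often the "database" is hit

    // Stand-in for an EF table: every enumeration costs one round-trip.
    public static IEnumerable<string> Folders()
    {
        Queries++;
        yield return "2023/holiday";
        yield return "2024/macro";
    }
}

class Program
{
    static void Main()
    {
        // Deferred: every use re-runs the query.
        var lazy = FolderSource.Folders();
        _ = lazy.Count();
        _ = lazy.Contains("2024/macro");

        // Materialized once with ToList(); after this point the
        // connection could be closed for good.
        var snapshot = FolderSource.Folders().ToList();
        _ = snapshot.Count;
        _ = snapshot.Contains("2024/macro");

        Console.WriteLine(FolderSource.Queries); // 3: two lazy hits, one ToList
    }
}
```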

When you’re done collecting the data, you will have to compare it all. You would start by comparing the lists of folders. Folders that exist in both lists can be ignored (but not their files.) Folders that exist in the database list but not the disk list should be deleted, including all files within those folders! Folders that are on disk but not in the database need to be added. Thus you can now start two threads, each with their own database connection. One will delete all folders plus their related images from the database that have been deleted while the other adds all new folders that are found on the disk. And by using two database connections, you can speed things up. You will have to wait for both threads to finish, though. But it shouldn’t be slow.

The next step is comparing the images. Here you do something similar as with the folders. You split the lists into three: one with all images that are unchanged, one with all images that need to be deleted, and one with all images that need to be added. And you would create a separate thread with its own database connection to delete the images, so your main process can start working on the ratios table.

Because we now know which images need to be added, we can go through those files using parallel processing, read the image width and height, and add this information to the image file records. Once we have enriched the list with these sizes, we can use a LINQ query to generate a list of all ratios of those images and remove all duplicates. This gives us the list of ratios that we need to check.
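That LINQ step could look roughly like this (class and method names are my own): reducing each width/height pair by its greatest common divisor gives the smallest ratio, and `Distinct()` drops the duplicates.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Ratios
{
    static int Gcd(int a, int b) => b == 0 ? a : Gcd(b, a % b);

    // Reduces each size to its smallest ratio and drops the duplicates.
    public static List<(int W, int H)> DistinctRatios(IEnumerable<(int W, int H)> sizes)
        => sizes.Select(s =>
               {
                   int g = Gcd(s.W, s.H);
                   return (s.W / g, s.H / g);
               })
               .Distinct()
               .ToList();
}

class Program
{
    static void Main()
    {
        var sizes = new[] { (1920, 1080), (3840, 2160), (1200, 1600), (800, 600) };

        var ratios = Ratios.DistinctRatios(sizes);
        Console.WriteLine(string.Join(" ", ratios.Select(r => $"{r.W}:{r.H}")));
        // 16:9 3:4 4:3
    }
}
```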

Before we add the new images, we will have to check the ratios table. As with the folders table, we check for all differences. However, we cannot delete ratios that we haven’t found among the images, because we skipped the images that already exist. We will do this later, though. We will first start adding the new ratios to the database. This too can be done in a separate thread, but it’s pretty fast anyway, so why bother? A performance gain of two seconds isn’t worth the extra effort if a process takes minutes to finish. So just add the new ratios.

Once all ratios are added, we can add all images. We could do this using parallel threads, with each thread creating a new database connection and processing all images from one specific folder or with one specific ratio. But if you want to add them multithreaded, I would just recommend dividing the images into groups of similar size. Keep the number of groups relative to the number of cores (e.g. 24 for my six cores) and let the system do its work. By evenly dividing the images over multiple threads, they should all take about the same amount of time.
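A simple way to get groups of similar size is a round-robin split, sketched here with a hypothetical `Batcher` helper:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Batcher
{
    // Round-robin split: group sizes never differ by more than one item.
    public static List<List<string>> Partition(IReadOnlyList<string> items, int groups)
        => items.Select((item, index) => (item, index))
                .GroupBy(x => x.index % groups, x => x.item)
                .Select(g => g.ToList())
                .ToList();
}

class Program
{
    static void Main()
    {
        var images = Enumerable.Range(1, 100).Select(i => $"img{i}.jpg").ToList();

        // e.g. four groups per core on a six-core machine.
        var buckets = Batcher.Partition(images, 24);

        // Each bucket would then get its own thread and database connection.
        Console.WriteLine($"{buckets.Count} {buckets.Max(b => b.Count)} {buckets.Min(b => b.Count)}");
        // 24 5 4
    }
}
```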

When adding the new images, you will have to look up the related folder and ratio in the database again. This makes adding images slower than adding folders or ratios, because you need the extra lookup. This would be faster if we had kept the folder and ratio lists as queryable lists, but then we could not open and close the connections, nor could we use multiple connections to add those images. And we want multiple connections to speed things up. So we accept a slightly worse performance at this point, although we could probably speed it up a bit by using a stored procedure to add the images. The stored procedure would have parameters for the image name, the image filename, the width and height, the folder name and the ratio width and height. I’m not too fond of procedures with many parameters, and I haven’t tested whether this would increase the performance, but in theory it should be faster, especially if the database is on a different machine than the application.

And thus a simple task of adding images to a database turns out to be complex, simply because we need better performance. It will still take hours when there are a lot of new images to add, but once the database is mostly filled, it will do quite well.

But you will have to ask yourself and your team whether you are capable of detecting these problems before you start a new sprint. Designs look simple, because designers don’t always keep performance in mind. These things are easily asked for because they appear very simple, but they have a lot of consequences. Similar problems arise when you work on projects that need to be secure. The design might ask for a login screen with username and password, and optionally a few OpenID providers as alternative logins, but the amount of code needed to manage all this data and keep it secure is quite complex. These are the moments when you need to write some technical documentation first, which is something people often forget when working on an Agile project.

Still, you cannot blame the developer if the designer just writes a few lines and the developer chooses the first, slow solution. The result would be the requested task. It is the designer who needs to be aware of these possible performance pitfalls. And with Agile, you have a team. All team members should be able to point out that this simple description has these pitfalls, making it a long and complex task. They should all realise that they have to discuss possible solutions, and preferably they do so as a team with just one computer. (The computer would be used to find information, not to write code!) Only when they agree on the proper solution should one or two of them start writing code. And they would know how long the task will take. Thus, the task would finish within two sprints. In the first sprint, all team members would have a small task: meet and discuss the options. In the second sprint, one or more members would have a big task: implementing the code.

Or, to keep it simple: think before you start writing code!

The challenge for the CART system.

In The CART datamodel I showed the datamodel that would be required for the CART system. Basically, the data model stores the items, transactions and contracts, while the templates are stored in code, as XML structures that are serialized to objects and back again. As a result, I split the relations from the data, allowing me to avoid regular updates to the database structure. All that I might want to update are the templates, but even that might not be required as long as the software supports “older versions” of those objects.

But serializing those objects to and from XML isn’t as easy as it seems. Since I’ve separated data from relations, the data itself doesn’t know its own ID. Why? Because the ID is part of the relation, so I would not need to store it within the XML. (It would be redundant.) But if I want to use these objects through a web service, being able to send this ID to the client is very practical, so the client can send back changes through the same web service. I would need the ID to tell the server what to do with which object.

Thus I’ll end up with two methods of serializations. One is just to serialize the data itself. The other is the data plus its ID. And now I will have to decide a way to support both. Preferably in a simple way that would not require me to generate lots and lots of code.

In the data layer, I would split up every object into a relation part and a data part. The data would be stored as XML within the relation object. To make sure the relation object can expose the object, I would give it a DataObject property that represents the data object. Its get/set methods should connect to the XML inside, preferably using an object as a buffer so I don’t have to serialize it over and over again.
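Such a buffered DataObject property might look like this sketch (with an invented `ItemRelation` class, and a counter added just to show that the XML is parsed only once):

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical data object stored inside the relation record.
public class ItemData
{
    public string Name { get; set; }
}

// Hypothetical relation record: it stores the data as XML, but exposes it
// as an object, with a buffer so the XML is deserialized only once.
public class ItemRelation
{
    public string DataXml { get; set; }
    public static int Deserializations;   // instrumentation for the demo

    private ItemData _buffer;

    public ItemData DataObject
    {
        get
        {
            if (_buffer == null)          // only hit the XML the first time
            {
                Deserializations++;
                _buffer = (ItemData)new XmlSerializer(typeof(ItemData))
                    .Deserialize(new StringReader(DataXml));
            }
            return _buffer;
        }
        set
        {
            _buffer = value;
            var writer = new StringWriter();
            new XmlSerializer(typeof(ItemData)).Serialize(writer, value);
            DataXml = writer.ToString();  // keep the stored XML in sync
        }
    }
}

class Program
{
    static void Main()
    {
        var source = new ItemRelation();
        source.DataObject = new ItemData { Name = "Garden gnome" };

        // Simulate loading the record fresh from the database: only XML.
        var loaded = new ItemRelation { DataXml = source.DataXml };

        for (int i = 0; i < 3; i++)
            _ = loaded.DataObject.Name;   // three reads, one deserialization

        Console.WriteLine(ItemRelation.Deserializations); // 1
    }
}
```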

In the business layer, I should not have an XML property, nor a DataObject property. The data fields should be there, together with the ID. And basically, I would need a mapping to convert between the data layer and the business layer. The trouble with this approach is that I would define all data twice: once in the data template and once in the business layer. That’s not very smart. I need to re-use things…

I’m considering to add my serialization method for the data templates. This means that I will include the ID within the template, so it becomes part of the object. All properties would be defined as data members, including the ID. That way, the ID is sent from the business layer to the client. But to store the template in the relation object, I would need to create my solution.

One solution would be implementing methods to convert the data to XML, plus a constructor that accepts XML to create it. It would also mean that I need a way to recognize each object type so I can call the proper constructor, and probably inherit a bunch of code or write interfaces to make the objects more practical to call upon. It would be complex…

Another solution would be defining my own attributes. One would be for the class name, allowing me to find classes based on this custom attribute. The other would be for the properties, and my code would simply use all of those to generate the XML or to read it back again. This too would allow custom field names. It would be a cleaner solution, since I would define the template as just a bunch of properties. Which, basically, they are.
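A rough sketch of that attribute-based idea (all names invented; a real version would also need the reverse path from XML back to objects, plus proper escaping):

```csharp
using System;
using System.Reflection;
using System.Text;

// Hypothetical attributes: one names the class, one names each property
// that should end up in the XML.
[AttributeUsage(AttributeTargets.Class)]
public class TemplateAttribute : Attribute
{
    public string Name { get; }
    public TemplateAttribute(string name) => Name = name;
}

[AttributeUsage(AttributeTargets.Property)]
public class TemplateFieldAttribute : Attribute
{
    public string Name { get; }
    public TemplateFieldAttribute(string name) => Name = name;
}

[Template("item")]
public class GnomeTemplate
{
    [TemplateField("name")] public string Name { get; set; }
    [TemplateField("price")] public int Price { get; set; }
    public string Scratchpad { get; set; }   // no attribute: stays out of the XML
}

public static class TemplateXml
{
    // Walks the attributes by reflection and emits XML with the custom names.
    public static string ToXml(object obj)
    {
        var cls = obj.GetType().GetCustomAttribute<TemplateAttribute>();
        var sb = new StringBuilder($"<{cls.Name}>");
        foreach (var p in obj.GetType().GetProperties())
        {
            var field = p.GetCustomAttribute<TemplateFieldAttribute>();
            if (field != null)
                sb.Append($"<{field.Name}>{p.GetValue(obj)}</{field.Name}>");
        }
        return sb.Append($"</{cls.Name}>").ToString();
    }
}

class Program
{
    static void Main()
    {
        var g = new GnomeTemplate { Name = "gnome", Price = 7, Scratchpad = "notes" };
        Console.WriteLine(TemplateXml.ToXml(g));
        // e.g. <item><name>gnome</name><price>7</price></item>
    }
}
```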

But this second solution is a bit complex, since I still need a way to call the constructor of these specific classes. So I’ve opened a question on StackOverflow, hoping I will get an interesting answer that solves this easily. Why? Because part of being a good developer is asking other experts for possible solutions when you don’t have a good answer yourself! 🙂

The CART datamodel

Well, my back problems made me think a lot about the CART system that I’ve mentioned before. And they made me consider how to handle the difference between plain data and the relationships between the objects. The most troubling thing I had to deal with was that I did not want to change my datamodel for every new item that I want to store. So it made me think…

The CART system is mostly created to handle relationships between items, transactions and contracts. It’s not about handling the data itself. Actually, the system doesn’t even care about the data. All that matters are the relationships. Data is just something to display to the user, something to store, but normally not something that you’ll need to process very often at the data layer. So, considering the fact that you can serialize .NET objects to XML, I’ve decided to support a basic structure in my Garage Sale project for all the items, transactions and contracts. Each of them will contain a Data property that holds the serialized data, which I can convert to data objects and back again.

This idea also makes it clearer where the templates are within my system. My templates are these object definitions! I will have a very generic database with a very simple layout, and I can build a very complex business layer around it all, which I can change as often as I like without the need to change my database model. As a result, I might never have to change the data model again!

Of course it has a drawback, since the database will contain serialized objects. If I change those objects, I will also need to keep track of the changes in those stored structures and either update them or maintain multiple versions of those objects. Updating those structures would mean creating update apps that know both the old and the new structures, and that convert each old structure to a new one. Maintaining multiple versions might be more practical, since that leaves all old data intact in the system. Anyway, it’s some added complexity that I’ll have to consider.

But now, my datamodel as created by Visual Studio 2012 using the Entity Framework:

[EF-CART datamodel diagram]

So, what do you see?

  • The DataObject is the base class for any CART object. It has a unique identifier, a name that can be used to show the related object and a Data property that will contain an object as XML.
  • DataItem is a generic item class, containing an additional description for practical reasons: when a user wants to select an existing item, having a description makes it possible to avoid reading the object within the data.
  • The Collection table is just an item with an extra bonus. It can contain multiple child items without the need for any transactions. Actually, this is just a shortcut solution to allow more complex structures within your items. For example, you might create a complete Household collection containing husband, wife, four children and a dog. And although you could link them together by using transactions, having them in a collection just saves the need to create those transactions.
  • DataTransactions is the base class for the transactions, connecting a sender, a receiver and a subject item together. It also has a link to a rule and a timestamp, indicating when the transaction took place. (Or will take place, for future transactions.)
  • IntegerTransaction is just a basic transaction with a multiplier. This way, you don’t have to add a lot of transactions when your customer buys ten bags of flour.
  • DecimalTransaction is also a basic transaction, handling items that can be traded in all kinds of different numbers, including fractional amounts. For example, the price of a product, or its weight, length or light intensity.
  • DataRule is the basic contract type. It’s a collection of transactions that are all related to one another. For example, the sale of a product would result in a sale rule.
  • The Contract class is more than just a rule. It’s a rule that can contain child rules, thus allowing structured contracts that are made up of several subcontracts. These are used when you have to deal with complex situations, like mortgages. A mortgage would include a rule for purchasing a house, a rule for lending money and paying it back, plus other rules for all kinds of insurances.

Now, as I’ve said before, this datamodel should offer me more than enough structural parts to use for my Garage Sale project. All I need to do is compile it and then just leave it alone. There should not be a need to include anything else.

Well, okay… That’s not completely true, since I might want to handle user accounts and store large images of products. But these things would actually require new database structures and should preferably be part of separate databases.

Looking back at this design, it surprises even me how little data it actually holds. But the trick here is that I’ve separated the relationships between objects from the actual data itself. Each object can still contain a huge amount of data; it’s just not important for the model itself.

Code quality with NDepend

In the ICT sector, quality tends to be overlooked. Plenty of products need regular updates and patches to fix problems in the existing code. The problem is that the software market has many developers who don’t have a good grasp of the concept of quality, because they’re afraid that writing quality code will slow them down. Quality tends to be seen as difficult and hard to understand, while writing programs seems quite easy, since all you do is tell a computer what to do in a language the computer understands.

One of the things that will always annoy me is seeing a project created by someone else that works great, yet whose code is an unreadable mess of jibber-jabber. Often that’s because multiple people have added code to it, each in their own style and based on their own knowledge. Some of them are very good at their work, while others are bad. And even when a project starts with some great code, all the changes and updates tend to make it too complex. Often it’s deadlines or inexperience that make code go bad. Or the lack of courage among developers to just throw away bad code and completely rewrite it using a better style and technique.

When I was developing in Delphi (which is still a hobby of mine) I had Peganza’s Pascal Analyzer available to tell me about the status of my code. And although the Delphi compiler would tell me about errors and warnings, those weren’t enough to judge the code quality, because badly written code can still compile without warnings or errors. So I used the analyzer to tell me where I needed to refactor the code to improve its quality.

So I wanted something similar to use with Visual Studio and C#. And although VS2012 offers its own analysis tools, they didn’t show me much information. I discovered that I had to select a set of rules first, so I did. I chose the “Extended Design Guidelines” rule set and it gave me a few more warnings about my code: I have to avoid namespaces with just a few types, I need to validate the arguments of public methods, and I needed to mark a member as static. Too bad that the code I used started as a project generated by Visual Studio itself. All I did was add a bit of meat to its bones, but the meat wasn’t the problem. The bones were…

So Microsoft’s own analysis complains about Microsoft’s own templates. I see. Yeah. Hum.

Next in line is ReSharper. ReSharper doesn’t just analyse code; it also makes suggestions while you type and can even correct things with a simple mouse-click. ReSharper is great when you want to refactor your code, and it even analyses JavaScript and other languages. It’s a great tool for fixing things, but if you need to fix things, it means you’ve broken something first. And generally, once something is fixed, you would prefer an immediate warning when your code starts to “smell” again, so you can do a rollback and tell the person responsible to do a better job. You don’t want to fix things at the end of a sprint when things start to smell bad. You want warnings sooner, when the project is built by one of the developers in your team. And basically, you want a report telling you about the current quality of the project, which you could show to clients and use to compliment your team on their good work. (Or to bash some heads when they do it badly.)

So, NDepend… It’s a bit expensive, but for those who use a dedicated build server to make new daily builds from the latest code, it is also a practical solution. It allows you to define your own set of rules in a query language (CQLinq) that’s based on LINQ. And it gave me plenty of warnings on my little project that are actually a bit more helpful than those VS2012 or ReSharper gave me. It advised me to refactor one complex method, which was indeed a bit too complex. It told me to make some classes ‘sealed’ since I’m not inheriting from them. Practical, since I wasn’t planning to inherit from them anyway. There was a funny warning about a namespace without types, which isn’t very practical; unfortunately, this default rule didn’t tell me which namespace it was or where it was located, so I had to search a bit. And a warning that I should avoid defining multiple types in a single source file. And indeed, I had declared three types in a single file, which isn’t good practice. Then again, that file was auto-generated too…
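For illustration, a custom NDepend rule is written in CQLinq, a C#-flavoured query language. A rough sketch from memory (the exact metric names and rule syntax may differ between NDepend versions):

```
// <Name>Methods too complex (custom rule)</Name>
warnif count > 0
from m in JustMyCode.Methods
where m.CyclomaticComplexity > 20
orderby m.CyclomaticComplexity descending
select new { m, m.CyclomaticComplexity }
```

Any method matching the `where` clause shows up as a warning in the analysis report, ordered by how badly it violates the rule.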

As it turns out, this simple project of mine didn’t have a lot of issues. But some large solutions that I’ve worked on have plenty to report, and the analyzer makes it very clear that maintaining solutions for over 7 years without regular quality check-ups does collect a lot of stink. I already knew some of those solutions were bad, but the analyzer made them look even more horrible, almost unable to report all the smelly parts. It would take months just to fix those smelly parts, and thus cleaning it all up becomes a bit too expensive. You wouldn’t need a tool that tells you it smells; you’d need a tool that cleans it up. So for those big solutions, ReSharper and common sense would be better.

I also decided to use it on a solution that I’ve worked on myself, curious about the amount of code smell in that solution compared to some of the other large solutions I’ve seen. I wasn’t surprised that it would have plenty of warnings, and it did. My small project only violated 13 rules, while a large solution that I had analysed broke 86 rules. My own solution broke 71 rules… So, almost as bad as the big ones, until I started to check the number of times each rule was violated. In my solution, the counts ranged from perhaps a dozen violations of things that I consider important, up to 3,302 methods that could have had a lower visibility. Lowering the visibility of those methods isn’t that important to me, so those I won’t even fix. But the other big solutions I’ve analysed had more than 100 violations per rule, with only a few exceptions. Several rules had over a thousand violations, with one rule having more than 12,000 violations. I would almost feel sorry for those who have to support that!

So, NDepend tells me my code stinks. Does it tell me even more? Yes, it does! It provides practical statistics about my code, displaying all namespaces and types, but also the dependencies between the assemblies within a solution. And it shows the dead code within my solution. I had one dead method, which was actually part of a third-party component.

NDepend gets its power from the ability to create new rules, to group rules together, and from the fact that it generates an HTML report that you can share among all developers, your managers and your CEO, and perhaps even send to customers who want to check whether you’re making quality software. It can help you to improve your code, but its main purpose is to tell you the quality of your code. It gives you a challenge: rewrite your solution so it won’t have anything to warn you about. Once you’re at that point, NDepend can become part of your build process to make sure your code maintains that high quality.

So, would you need NDepend in your tool collection? Is it worth the price you’d have to pay for it, or is it too expensive?

Well, for me as a single developer it’s practical, since it challenges me to write code that generates no warnings. And this can be a real tough challenge when you’re working on complex solutions. But my personal budget is reasonably big, allowing me to make a few large expenses and probably even to overpay a bit. It is an expensive product, and I would recommend buying a ReSharper license first before considering NDepend. As long as ReSharper tells you about code smells, you’re still in need of learning better design practices.

For a team of professional developers it is basically the same as for single developers. However, teams of developers will need reports telling them whether the quality of the code is improving or not. You need to depend on the others in your team, and often you will lack the time to do a code review on the work of each team member. For teams of developers I would recommend NDepend if you, as a team, want to improve the code quality and are willing to check reports on the code quality with every new build. A license for your build server would be most practical in that case, especially when the build server does regular (daily) builds automatically.

When your job is doing code reviews, you might want to use the built-in analyzer of VS2012 or ReSharper, but those are meant to let developers fix code immediately. Code reviewing isn’t just meant to improve code; it is meant to improve developers. When you’re doing regular code reviews, NDepend would be very useful, doing a lot of the work for you. You would have to write your own rules and change several of the existing rules that NDepend provides, but it will help you to review the work of others.

If you’re a manager in an ICT department then reviewing code will probably be a bit complex. However, you would still need to know about the code quality and the quality of the team of developers. Here, too, a license for the build server would be practical to keep you informed about the health of your product and to take steps when your developers start adding smelly code. Especially when you need to outsource the code the use of code analysis becomes very important since you need to know if that far-away team is actually doing a good job. Too many outsourced projects have become extremely smelly, even though they seem to work just great. It would be a shame is a great start-up becomes a rotten egg simply because you forgot to check the code quality.

When you look at NDepend more closely, you will see how powerful this tool actually is. You just have to keep in mind that this tool isn’t really meant to improve code, like ReSharper and the VS2012 analyzer are. It’s meant to improve the developer, because it tells him whether he’s improving or not. It tells teams of developers where their weaknesses are and will help to get rid of those weaknesses. And it will help companies to keep the quality of their code equal to the quality of their products.

Just don’t use it when your projects are already extremely smelly. You would need to clean up your code before you can clean up your developers…