Fizz Buzz gets chained…

There’s a simple children’s game called Fizz Buzz that is also used as a test for developers to show basic coding skills. Jeff Atwood even mentioned on his blog the use of this game as a simple test to weed out bad developers before hiring new ones. And there are still developers being hired these days who would fail the simple FizzBuzz test. And that’s a shame.

So, what is Fizz Buzz? Well, basically you count from 1 to 100, but when you encounter a multiple of three, you say “Fizz” instead of the number. And if it’s a multiple of five, you say “Buzz” instead. So if you encounter a multiple of both three and five, you’ll have to say “Fizz Buzz”. It’s a simple math challenge for children and a coding challenge for developers. A good developer should be able to write this code within 10 minutes or so, which makes it an excellent test during the hiring process…

Now, the website “Rosetta Code” has an excellent collection of solutions to this problem, and many others, in a lot of different programming languages. These include several solutions for C#. (With C# being my most popular choice these days.) But I want to show how to FizzBuzz in a complex way, with lots of data analysis, and solve it through method chaining. In my previous post I already mentioned why methods that return void will break a method chain. This post will show some interesting code aspects of how to split even a simple problem into many smaller parts, so each piece can be solved and managed but also reused in other projects.

So, let’s analyse the data we need and the data we will generate. Basically, we start with an array of numbers that gets translated to a single string. So, here we go:

  • We need to make a list of numbers.
  • Create pairs so we can store each call.
  • We need to return “Fizz”.
  • We need to return “Buzz”.
  • We need to return numbers as string values.
  • We need a condition telling us when to Fizz.
  • We need a condition telling us when to Buzz.
  • We need a condition telling us when to use a number.
  • We need to select just the string values.
  • We have to join the string values into a comma-separated string.

So, I generally start with creating a static class as I’m going to make a bunch of extension methods for C#. Why? Because extension methods allow me to add functionality to objects and classes without modifying them! But first let’s make a few friendlier names for data types that we will be “using”… (Pun intended!)

using IntEnum = IEnumerable<int>;
using Pair = KeyValuePair<int, string>;
using PairEnum = IEnumerable<KeyValuePair<int, string>>;
using StringEnum = IEnumerable<string>;

That’s the data we will need. We will need an enumeration of integers. Those will be converted to pairs so we can maintain the proper order and know which number belongs to which result. These pairs are also enumerated and in the end, they will be converted to strings before we make a single string out of them all…

So now we’re going through the list of functions:

Make a list of numbers.

This is an easy one. But while many might just create a for-loop to walk through, I prefer to create an enumeration instead, as this allows me to “flow” all the data. So my first method:

public static IntEnum MakeArray(this int count, int start = 0) => Enumerable.Range(start, count);

This is a simple trick that not many developers are familiar with. Instead of using a for-loop, I use Enumerable.Range() to create an enumeration to start my method chain. I can now use 50.MakeArray(10) to make an enumeration of 50 numbers that starts at 10 (and thus ends at 59).

Create pairs so we can store each call.

Here we will create a few methods. First, we’ll start with:

public static Pair NewPair(this int key) => new Pair(key, string.Empty);

This code takes an integer and returns a pair with an empty string as value. Then we’ll create:

public static Pair NewPair(this Pair pair, string data) => new Pair(pair.Key, data.Trim());

Here we create a new pair based on the old pair. Thus we won’t change existing pairs, as those might still be used for other purposes. We also trim the string value to remove excess spaces.
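The chain we build later also calls a MakePairs() method on the whole enumeration, which otherwise never gets shown. A minimal sketch of how it could look, simply reusing the NewPair() method above:

```csharp
// Turn an enumeration of integers into an enumeration of pairs,
// each starting out with an empty string value.
public static PairEnum MakePairs(this IntEnum data) => data.Select(NewPair);
```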

We need to return “Fizz”.

Seriously? Well, yeah.

public static Pair DoFizz(this Pair data) => data.NewPair("Fizz " + data.Value);

The reasoning behind it is that we want to add the word “Fizz” in front of the string that we already have. As we start with empty strings and trim the value when creating a new pair, we should just get “Fizz” as result. But if the result already contains “Buzz”, we will get “Fizz Buzz” as result. Including the space!

We need to return “Buzz”.

This is oh, so simple. 🙂

public static Pair DoBuzz(this Pair data) => data.NewPair(data.Value + " Buzz");

The difference between this method and the DoFizz() method is that this one adds “Buzz” at the end of the result, not at the start. So it doesn’t matter if we first Fizz or Buzz, the result is always “Fizz Buzz”. Including the space…

We need to return numbers as string values.

More simple code, but even simpler…

public static Pair DoNumber(this Pair data) => data.NewPair(data.Key.ToString());

When we put a number in the result, we don’t care about its content. It gets replaced in all cases…

We need a condition telling us when to Fizz.

We can’t Fizz all the time so we need to have a condition check.

public static bool IsFizz(this Pair data) => data.Key % 3 == 0;

I could have made it even more generic with an int as parameter but I use this solution to make it readable!

We need a condition telling us when to Buzz.

We also need to determine when to Buzz…

public static bool IsBuzz(this Pair data) => data.Key % 5 == 0;

We need a condition telling us when to use a number.

And we need to decide when we want to have a number as result!

public static bool IsNumber(this Pair data) => string.IsNullOrEmpty(data.Value);

But how to get those conditional statements inside my enumerations?

Well, we need the if-statement to become part of the method chain. So we create another extension method for it.

public static T If<T>(this T data, Func<T, bool> condition, Func<T, T> action) => condition(data) ? action(data) : data;

This shows some power behind extension methods, doesn’t it? Because I’ve declared both the condition and action as separate methods, I can use a shorthand method when calling them. And the shorthand I will use for the Fizz(), Buzz() and Number() methods I’m creating:

public static Pair Fizz(this Pair data) => data.If(IsFizz, DoFizz);
public static Pair Buzz(this Pair data) => data.If(IsBuzz, DoBuzz);
public static Pair Number(this Pair data) => data.If(IsNumber, DoNumber);

And as these methods work on just a single pair, I also need methods to use for the full enumeration. Fortunately, I can use the same name, with different types:

public static PairEnum Fizz(this PairEnum data) => data.Select(Fizz);
public static PairEnum Buzz(this PairEnum data) => data.Select(Buzz);
public static PairEnum Number(this PairEnum data) => data.Select(Number);

The PairEnum is just an IEnumerable<Pair> as declared by the “using” before. Rather than declaring a new class, I prefer to create an alias for this type.

The result is that we now have three methods, each of which works on an enumeration, changing the result of each pair where required.

We need to select just the string values.

This is another simple one, but here the data flow changes type. Selecting the results can be done by using:

public static StringEnum Values(this PairEnum data) => data.Select(p => p.Value);

But before selecting all results, we might want to put all items in the correct order:

public static IOrderedEnumerable<Pair> SortPairs(this PairEnum data) => data.OrderBy(p => p.Key);

Now, this sort is not required when you go through the enumerations normally. But because you can also do parallel enumerating and thus use multiple threads to walk through everything, you might also get all results in a random order. So, sorting it will fix this…

How do we make it execute parallel? Well, I would just need one adjustment to one method:

public static PairEnum Number(this PairEnum data) => data.AsParallel().Select(Number);

I only added AsParallel() in this method to change it from a sequential to a parallel enumeration. The result is that all numbers can be processed on multiple threads at once, which means the order of my data becomes unpredictable. And let’s make it even better:

public static PairEnum Number(this PairEnum data, bool parallel = false) => parallel ? data.AsParallel().Select(Number) : data.Select(Number);

Now we can specify whether we want the numbering executed in parallel. The default is sequential, so I would not need to sort afterwards. But if I indicate that I want it executed in parallel, then the order of the data becomes more randomized and I would need to sort afterwards.

If I also apply this change to Fizz() and Buzz() then it can become extremely interesting when everything is done in parallel.

We have to join the string values into a comma-separated string.

Well, joining string values in a list is simply done by calling the string.Join() method but as I’m making a lot of methods already, I can just add one more:

public static string Combine(this StringEnum data) => string.Join(", ", data);

And all these methods together allow me to control the data flow in the finest details. Generally too fine for solving the FizzBuzz challenge, but it does make an interesting example…

So, how to FizzBuzz?

Okay, one more method:

public static string FizzBuzz(this int count) =>
count.MakeArray(1)
.MakePairs()
.Fizz()
.Buzz()
.Number()
.Values()
.Combine();

Notice how all methods used so far are basically one-liners. And this method too is just a single statement. It’s just written over multiple lines to make it readable.

And it shows the data flow in reasonably clear names. We take a number and make an array of numbers. These get converted to pairs, the pairs then get Fizzed, Buzzed and numbered before we convert them to strings and join them into a single string result. No sorting, as I’m not doing anything in parallel here. And the result will be: 1, 2, Fizz, 4, Buzz, Fizz, 7, 8, Fizz, Buzz, 11, Fizz, 13, 14, Fizz Buzz, 16, 17, Fizz, 19, Buzz, Fizz, 22, 23, Fizz, Buzz, 26, Fizz, 28, 29, Fizz Buzz, 31, 32, Fizz, 34, Buzz, Fizz, 37, 38, Fizz, Buzz, 41, Fizz, 43, 44, Fizz Buzz, 46, 47, Fizz, 49, Buzz, Fizz, 52, 53, Fizz, Buzz, 56, Fizz, 58, 59, Fizz Buzz, 61, 62, Fizz, 64, Buzz, Fizz, 67, 68, Fizz, Buzz, 71, Fizz, 73, 74, Fizz Buzz, 76, 77, Fizz, 79, Buzz, Fizz, 82, 83, Fizz, Buzz, 86, Fizz, 88, 89, Fizz Buzz, 91, 92, Fizz, 94, Buzz, Fizz, 97, 98, Fizz, Buzz

Which reminds me: I had to use MakeArray(1) to make sure I start with 1, not 0. Otherwise, I would get 100 numbers starting with 0 and ending at 99, not 100…

Now, you’re probably thinking about how this works. And it’s likely that you assume it creates a list of integers, then a list of pairs, then a list of Fizzed pairs, etcetera. And you’re wrong! But let’s modify the code a bit. Let’s display technical information using my Do() method, and a variant of Do()…

Adding Do…

public static T Do<T>(this T data, Action<T> action) { action(data); return data; }
public static T Do<T>(this T data, Action action) { action(); return data; }
public static IEnumerable<T> Do<T>(this IEnumerable<T> data, Action<T> action) { foreach (var item in data) { action(item); yield return item; } }

These three Do() methods are very generic and as I’ve mentioned in my previous post, it allows using methods returning void within a method chain.

The first Do() will perform an action and pass the data back to that method. As we just want to do simple WriteLine statements, we won’t use that.

The second Do() can perform actions that don’t depend on the data. This is ideal to “step out” of the method chain to do something else before returning to the chain.

The third Do() will do an action for each item inside an enumeration without breaking the enumeration! And why we want to do this will be shown when we call this monster:

public static string FizzBuzzNormalStatus(this int count) =>
count.Do(() => Console.WriteLine("Linear: Initializing…"))
.MakeArray()
.Do(() => Console.WriteLine("Linear: Array created…"))
.MakePairs()
.Do(pair => Console.WriteLine($"Pair: {pair.Key} -> {pair.Value}"))
.Do(() => Console.WriteLine("Linear: Pairs made…"))
.Fizz()
.Do(pair => Console.WriteLine($"Fizz: {pair.Key} -> {pair.Value}"))
.Do(() => Console.WriteLine("Linear: Fizzed…"))
.Buzz()
.Do(pair => Console.WriteLine($"Buzz: {pair.Key} -> {pair.Value}"))
.Do(() => Console.WriteLine("Linear: Buzzed…"))
.Number()
.Do(pair => Console.WriteLine($"Number: {pair.Key} -> {pair.Value}"))
.Do(() => Console.WriteLine("Linear: Numbered…"))
.SortPairs()
.Do(pair => Console.WriteLine($"Sort: {pair.Key} -> {pair.Value}"))
.Do(() => Console.WriteLine("Linear: Sorted…"))
.Values()
.Do(() => Console.WriteLine("Linear: Extracted…"))
.Combine()
.Do((s) => Console.WriteLine($"Linear: Finalizing: {s}…"));

And yes, this is still a single statement. It’s huge because I’m writing a line after each method indicating that this method has just fired after writing all the data it has processed. So, you would expect to see “Linear: Array created…” being written, followed by a bunch of pairs and then “Linear: Pairs made…”.

Wrong!

But I’ll show this by FizzBuzzing just the numbers 0 to 4:

Linear: Initializing…
Linear: Array created…
Linear: Pairs made…
Linear: Fizzed…
Linear: Buzzed…
Linear: Numbered…
Linear: Sorted…
Linear: Extracted…
Pair: 0 ->
Fizz: 0 -> Fizz
Buzz: 0 -> Fizz Buzz
Number: 0 -> Fizz Buzz
Pair: 1 ->
Fizz: 1 ->
Buzz: 1 ->
Number: 1 -> 1
Pair: 2 ->
Fizz: 2 ->
Buzz: 2 ->
Number: 2 -> 2
Pair: 3 ->
Fizz: 3 -> Fizz
Buzz: 3 -> Fizz
Number: 3 -> Fizz
Pair: 4 ->
Fizz: 4 ->
Buzz: 4 ->
Number: 4 -> 4
Sort: 0 -> Fizz Buzz
Sort: 1 -> 1
Sort: 2 -> 2
Sort: 3 -> Fizz
Sort: 4 -> 4
Linear: Finalizing: Fizz Buzz, 1, 2, Fizz, 4…

And yes, that’s the output of this code! It tells you the array was created, the pairs prepared, Fizzed, Buzzed, Numbered before sorting and extracting and only then will it do all those things!

This is what makes enumerations so challenging to understand! They’re not arrays or lists. They’re something many developers are less familiar with. This is why “yield return” is so useful when enumerating data. I don’t need a lot of memory to store lists of data but instead will process each item in the enumeration up to the point where the data gets aggregated.

Which in this case happens when I sort the data, as I need all data before I can sort it. If I don’t sort in-between, the data will only be aggregated when I call the Combine() method.

This is important to remember during data processing, as an exception in one of these methods can have unexpected consequences. That’s because all data before the faulty data will have been processed by the aggregator function. For FizzBuzz, that wouldn’t be a problem. But if one of the methods in the chain would write data to a database or file then it will have partially written some data up until it gets the exception.

So, when using enumerators, you might end up with garbage if you’re unaware of how data passes from one method to another. It feels unnatural for inexperienced developers yet it makes this FizzBuzz example so much more interesting…

Avoid the void, introducing Do!

I’ve been programming for a very long time. I wasn’t even 10 years old when I got access to computers and that was back around 1975! My first programming experiences were on a programmable calculator and later with BASIC on a ZX-81. Around 1985 I started learning other programming languages and around 1988 I even used Turbo Pascal 5.5, which had support for object-oriented programming. Well, a bit limited.

When I started using Turbo Pascal and later Delphi, I also learned the value of methods that would return objects. But it would still take a while before I realized the full power of this. Still, in Turbo Pascal 6 there was a feature called Turbo Vision that could be used to create complete menu structures and dialog screens on a text console! It made extensive use of methods returning objects to make method chaining possible.

Method chaining is a simple principle. You have a class and the class has methods. And each method can return an object of a specific class. So that gives access to a second method. And a third, a fourth, a fifth until you get to a method that returns void.

That’s a dead stop there!

I recently worked on a project where I used a long method chain to keep a clear workflow visible in my code. My code was selecting data, ordering it, manipulating it and then saving it to a comma-separated values (CSV) file. Something like this:

MyData
.ToList()
.OrderBy(SortOrder)
.Take(50)
.Select(NewDataFormat)
.SaveToCSV(Filename);

But SaveToCSV was a method that returned void, so the chain breaks there! But I wanted to continue the workflow as I needed to do more with this data. I also wanted to display it on screen, save it to a database or even filter it a bit more. So, to resolve this I needed a better method that would allow me to continue the chain.

Of course, I could also have written my code like this:

var myList = MyData.ToList();
var mySortedList = myList.OrderBy(SortOrder);
var myTop50 = mySortedList.Take(50);
var myNewList = myTop50.Select(NewDataFormat);
myNewList.SaveToCSV(Filename);

This code would also allow me to continue the workflow and to use myNewList for further processing. But it introduces a bunch of variables, and it allows unrelated code to be included between these lines, obscuring the workflow! That would not work! The chain of work would be broken by irrelevant tasks.

So, now I have two options. Either I modify the SaveToCSV method so it will return a value (preferably the object itself) or I make an extension method inside a static class that would allow me to put a void method within my chain. And I came up with this beautiful, yet simple method:

public static T Do<T>(this T data, Action<T> action){ action(data); return data; }

And this simple, yet beautiful construction is a generic method that can be used for any class, any object! And it can change my code into this:

MyData
.ToList()
.OrderBy(SortOrder)
.Take(50)
.Select(NewDataFormat)
.Do(d => d.SaveToCSV(Filename))
.Do(d => Console.WriteLine(d));

Now my data flow is still intact and the flow of the code is still very readable. It is easy to see that this is just one block of code. Most people tend to forget that programming isn’t about writing code. It’s about how to process data.

Using a method chain is a perfect way to visualize data flow inside your code. But to allow proper method chaining, each method will have to return some kind of object for further processing. Otherwise, you will need a Do<> extension method in your project to make a void method part of the chain.

But to keep it simple, when using a void method, you’ll basically put a stop to the data flow within your application. You would then have to start a new data flow for further processing. This is okay, but generally not the best design in programming. While not everyone might be a fan of method chaining, it still is a very powerful way to write code as it forces you to keep irrelevant code outside of the flow.

Seriously?

I noticed a very interesting post today on LinkedIn and I seriously have to rant about how dumb it really is. But look for yourself first and see if you’re good enough as a developer to see the flaws in it.

This is a preview image for a free course on C++ programming. Now, I’m no expert at C++ as I prefer C# and Delphi but I can see several flaws in this little preview. Yes, even though it’s a very tiny piece of code, I see things that trouble me…

And the reason for ranting is not because the code won’t work. But because the code is badly written. And a course is useless when it encourages writing code badly… So, here it is:

But keep in mind that code can be bad yet still work just fine, and that is what counts in the end. Good developers will go for great products, not great code. It’s just that bad code makes software development a bit harder… In general, good coding practices include the use of proper design patterns and making sure code is reusable.

So, not counting the lines with just curly brackets, I see five lines of code and two serious problems. And that’s bad. But the flaws are related because the person who wrote this C++ code seems to ignore the fact that C++ supports objects and classes. But it probably works so these aren’t really errors. They’re flaws in the design…

So, the first flaw should be obvious. I see various variable names that all start with a single letter and an underscore. The “m_” prefix seems to suggest that these variables are all related. That means they should be part of an object instead of just some loose variables. The “m_” probably stands for “man” or whatever, so you would need a class called “Man” with properties like ‘IsJumping’, ‘IsFalling’, ‘TimeThisJump’ and ‘JustJumped’. It suggests that Man is in a specific state like falling or jumping. And as it is a state, you would expect a second class for the “Jump” state. And probably others for “Falling”, “Walking” and other states.

The action triggered is a jump, so the Man needs to be in a state that allows a jump. The Jumping and Falling states would not allow this, so if Man is in one of those states, he can’t jump. Otherwise, his state is changed to Jumping and that would start the timer. And if the timer is zero then ‘JustJumped’ would be true, right? So, no need to keep that property around.

So, all this code should have been classes and objects instead. One object for the thing that is doing something, and various “State” classes that indicate what it’s doing at that moment. The fact that the author did not use classes and objects for this example shows a lack of OO knowledge. Yet C++ requires a lot of OO experience to use it properly and to allow code reuse. After all, other items might jump and fall too, so they would use the same states.

The next flaw is less obvious but relates to “Keyboard::IsPressed”. This is an event that gets triggered when the user presses a key. It’s incomplete and there’s a parenthesis in front of it, so I don’t see the complete context. But we have an event handler that changes data directly inside itself, rather than calling a method on some class and letting that method alter the data. That’s bad, as it makes code reuse more difficult.

One thing to avoid when writing code is doing copy and paste of code from somewhere else. If a piece of code is needed in two or more places then it should be encapsulated in a single method that can be called from multiple locations. So this event should call a method “DoJump”, for example, so you can also have other calls to this method from other locations. For example, if you also want to support jumping on a mouse click.

Too many developers write events and fill these events with dozens of lines of code. Which is okay if this code is only called at one moment. But it’s better to keep in mind that these event actions might also get triggered by other events so you need to move them into a separate method which you can then call.

That would allow you to reuse the same code over and over again without the need for copy and paste programming.

So, in conclusion, I see a preview for a C++ course that looks totally flawed to me. No errors, just flawed. This is not the proper way to write maintainable code!

A fifth pillar for programming?

For many years I’ve been comfortable knowing that programming is built on just four pillars. These are all you need to develop anything you want, and they would be:

  • Statements, or basically all the actions that need to be performed.
  • Conditions, which determine in which direction the code will flow.
  • Loops, for situations where actions need to be repeated.
  • Structures, to organize both code and data into logical units like records and classes, databases with indices and even protocols so two applications can communicate with one another.

But as I’m developing more and more asynchronous code, where you just trigger an action and later wait for the response, I started wondering: should there be a fifth one?

  • Parallel tasks, where two or more flows run concurrently.

Parallel execution?

It’s a tricky one and if you learned to use flow charts and Nassi–Shneiderman diagrams to design applications then you’ll discover that these diagrams don’t have any popular option to represent parallel tasks in any way. Well, except in the way I just said, by triggering an external event and ignore the result until you need it.

Well, Nassi–Shneiderman does have a diagram type for concurrent execution:

[Image: Nassi–Shneiderman parallel-processing block — NassiShneiderman-Parallel.svg]

Image by Homer Landskirty, CC-BY-SA 4.0 International.

And flowcharts also have a fork-join model to represent concurrent execution. So these diagrams do provide some way to indicate parallel execution. But the representation is one that suggests several different tasks are executed at the same time. But what if you have an array of records and need to execute the same task on every record in a parallel way instead?

Parallel execution is quite old yet I never found a satisfying way to display concurrency in these kinds of diagrams. Which makes sense as they are meant to define a single flow while parallel execution results in multiple, simultaneous flows. And parallel execution can make things quite complex as you might just want to cancel all these flows if one of them results in an error.

So it has been possible to represent parallel execution in diagrams for a long time already, even though its usage was rare in the past. But in modern times, where computers have numerous cores and multithreading has become a mainstream development technique, it has become clear that parallel or concurrent execution is becoming an important standard in programming.

So it’s time to recognize that there’s a fifth pillar to programming now. But mainly because it has become extremely common nowadays to split the flow of execution only to join things at a later moment.

But this will bring many challenges. Especially because forking can result in similar tasks being executed, while they don’t have to join all at once either. It basically adds a complete new dimension to programming, and while it’s an old technique, it isn’t until recently that mainstream developers have started using it. The diagrams are still limited in what they can express, while developers can be extremely creative.

Of course, there’s also a thing like thread pools as computers still have limits to how many threads can be executed simultaneously. You might trigger 50 threads at once yet add limitations so only 20 will run at the same time while the rest will wait for one of those threads to finish.
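Since the FizzBuzz chain already uses AsParallel(), the same kind of limit can be expressed in PLINQ. A hedged sketch (DoHeavyWork is a hypothetical worker method, and the numbers match the example above):

```csharp
// Hypothetical: process 50 items in parallel, but on at most 20 threads at once.
var results = Enumerable.Range(1, 50)
    .AsParallel()
    .WithDegreeOfParallelism(20)
    .Select(DoHeavyWork)
    .ToList();
```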

It becomes even more complex when two or more threads start sending signals to one another while they continue running. This is basically the start of event-driven programming, which doesn’t really fit inside the old diagram styles. You could use UML for several of these scenarios, but it won’t fit in a single diagram anymore. A sequence diagram could be used to display parallel execution with interprocess communication, but this focuses only on the communication between processes, not the flow of a single process. Then again, UML doesn’t really have those kinds of single-flow diagrams.

So, while programming might have a fifth pillar, it also seems it breaks the whole system due to the added complexity. It also makes programming extremely difficult, especially for beginners. When you start to learn programming, the four pillars should teach you what you need to know. But once you start on the fifth pillar, everything will turn upside down again…

So maybe I should even forget about my four pillars theory? Or add the fifth one and ignore the added complexity?

So you want to start programming?


It is interesting to see how many people want to start programming because they’ve played with a computer and now want to make their own games, their own apps and their own websites. So they read about the more interesting programming languages and think they should learn PHP for web development, Java for Android or Swift for iOS. Or some other language like Python, Go, Rust, C# or any of the other programming languages mentioned in the TIOBE index. Why? Because these languages are very popular. But few notice that the standard C language is also extremely popular, yet everyone seems to skip it. The reason is often that the language is considered complex and difficult: it has no object-oriented paradigms, no overloading and no cool graphical libraries or even standard database functionality. Even worse, the book The C Programming Language has fewer than 300 pages explaining everything you need to know, yet whole operating systems are written in it, so it must be complex. Yet it isn’t, as it fits within 300 pages…

But if you want to learn programming, you should start with standard C, simply because it teaches you the basics of programming. Sure, objects and classes are fun, but that’s not code! That’s structure! Code is organised in classes and objects to add structure, so you can do more with less code. But to use those structures properly, you must first know what code is.

  • Remember this line, which I will refer to as Label 1.

And code is simple. You have statements and statements are simple actions. You add two numbers, you show something on the screen or you’ll wait for input. Those are simple actions. They’re tasks and they could be very simple or very complex. But from the code perspective, they’re all tasks.

Next, you will have conditions. Conditions will determine if certain statements (tasks) will be performed or not. If the user pressed a key, close the window. If a file is not found, show an error. And even in real life we see lots of conditions because if the cookie jar is empty, we will have to fill it with cookies again.

And last, you have loops. Basically, the repetition of code. You have done an action but you need to do it again. And again. And again. And rather than repeating the same statement over and over again, you just put it in a loop which will basically run forever. Unless you add a condition to the loop allowing the control flow to jump out of the loop.

Like filling the empty cookie jar with cookies. It is empty so you add a cookie. It’s still not full so you add another cookie. And another one. And more. All the way until it is full as you can stop when it’s full.

So basically programming is all about statements, conditions and loops and how control of the actions flows through all of it. And if this is too difficult to understand, then read this part again by scrolling up to the line that I referred to as Label 1. If you understand this principle, continue reading.

So, you’ve just learned about statements, conditions and loops, so now you know programming! The rest has nothing to do with programming but is all about the structure of code. And that part covers functions and procedures, data structures, objects, classes, interfaces and whatever more. And as you should start practicing, the best thing to start with is a simple programming language that focuses mostly on the programming part and less on structure. And that language is C.

In C the only structural parts you’ll deal with are functions and data structures. And while C is perfect for making operating systems it is also ideal to learn the basics. You can use it to make simple console applications to maintain an address book, shopping list or even a calendar. But the lack of a GUI, database support and classes actually make C harder to learn for people who are already used to those! And it’s time to forget about those things and return to the basics.

One thing to keep in mind is that standard C has some very good built-in functionality for handling dates and times, which makes it perfect for a simple agenda. But the lack of database support means you’d have to do your own file management if you want to store your data for safekeeping and sharing. A modern computer will have gigabytes of RAM, and even a busy calendar is unlikely to have more than a few thousand records of about 100 bytes each, so reading all data into memory should not be a big problem. But file input and output will be required to load and save your data to be sure it’s all safe.

So, how to start? Start by determining what kind of data you’ll be handling. Do this with pen and paper, sitting on your couch with some milk and cookies or whatever else helps you think and stay away from your computer! You start by thinking about what you’re going to make so that computer isn’t needed yet! Design first, structure next and then you can code.

So, the agenda… Say, you want an agenda to keep track of the whole family and where they are and what they’re doing. The agenda part suggests dates and times, the family means persons and the locations means addresses. But what they’re doing is undefined so we just keep it as free text. So, that’s the data that matters.

Next, what kind of control would you like? Well, it will be a console application so the user will have to type in commands. A nice GUI would be nice, but a GUI is not part of the C standard. So command line instructions it will be. Statements like “load” and “save” would be the first important thing. Then, because we’re dealing with data, we need statements to manage it, so we will have “Add”, “Modify” and “Delete” followed by “Calendar”, “Person” or “Location”. These statements would almost be SQL-like! And we need to be able to show data, so that would be either “Show” or “Select” or “Print” or whatever you think sounds nice. Again with an indicator for the right table.

But now things become more challenging as you need indicators for the data records. Say, you want to add a new event so your statement would have a syntax like “Add Agenda <date> <time> <duration> <person_id> <location_id> <free text>” and your code would have to parse it, check if it’s all okay and then add it to the list. And by defining this simple statement you will already know more about your data structure as the agenda record will have a date, time, duration, link to a person, link to a location and some additional text while both the person and location records will have unique IDs. But that’s structure and we’re still designing here.

So we continue defining the actions we want to perform on our data. A “Delete agenda <agenda_id>” suggests that agenda records need a unique ID too. Fine! We have at least three tables and four actions per table (Create, Read, Update and Delete, or CRUD) so we have at least 12 functions to define. We don’t care yet what they are called but we know how we will manage the data. And perhaps we have some alternate versions of these functions, as a location might not always be required, so we would have two “Add” functions. Or maybe we want to delete all agenda entries for a person or for a specific day. We’re almost making our own SQL syntax here!

But we won’t go full SQL on our data as we only focus on the things we consider important. We’re keeping things simple and don’t want users to work out all kinds of complex queries as this application needs to be simple.

The saving and loading of the data will be interesting, but my advice would be to just write it all to a simple text file so you can edit it afterwards in Notepad or another editor to fix errors. It means parsing text to data and converting data to text, preferably in some easy way. It’s a good exercise to come up with a good file format.

So once we have the actions we want to perform on our data, we will have to consider what data we actually want. We have three simple tables and we already know the “Agenda” table will link to the other two. For persons, we would like to have a first name, last name, their function and maybe a phone number or birth date. For locations we want the name for the location, address and maybe even the phone extension number for the phone in that room. And while it is tempting to add more tables and stuff into this application, we have to focus on making something first and all we have by now are still notes on paper.

So, now that we know what we want to make, how we will control it and what the data will look like, we can start working on the actual code. And the code will be simple as it’s a console application waiting for user commands. The user types a command, presses ‘Enter’ and the application starts parsing it and tries to execute any valid command while reporting any errors.

This is a simple programming exercise which also offers plenty of distractions, as you will likely want to make it prettier. Don’t fall for that trap as you need something to show first! Once you have something to show, that would be version 1 and you can start version 2, while version 1 is safe in your source control system.

You do use source control, don’t you?

Anyway, this is a good project for learning programming as it mostly focuses on the programming itself, not the structure. When you start expanding the whole project, you can consider rewriting it in Java or C# or whatever language you prefer that supports OOP. Probably something with a nice graphical shell so the user doesn’t need to type commands but can open dialog windows and use drop-down options to select whatever they want. But all of that isn’t really programming. It’s adding structure. Call it structuring for all I care!

When you want to create applications then you should understand both programming and structuring. But keep in mind that the structure depends on the programming, not the other way around. Too many developers get lost in those structures and make things far more difficult than they need to be. Then again, too many developers start behind the computer, entering code instead of reading documentation, making notes and thinking ahead. Or they did think, but are trying to do way too much all at once.

How SSL is failing…

Every web developer should know about the Secure Sockets Layer and its successor, the Transport Layer Security (SSL and TLS). These techniques are nowadays a requirement to keep the Internet secure and to keep private matters private. But while the technique is great, the implementation of it has some flaws.

Flaws that aren’t easy to resolve, though. But to understand, you will need to understand how this technique works when you use it to protect the communication between a visitor and a website.

This is a very long post as it is a complex topic and I just don’t want to split it in multiple posts. So I add several headers to separate parts…

What is SSL?

When you communicate with a website, you basically have a line with two endpoints: you on one side and the web server on the other. But this isn’t a direct line; it goes from your computer to your router/modem, to your provider, over some other nodes on the Internet, to the provider of the host, to the router of the host and finally to the computer where the host runs the website. And this is actually a very simplified explanation!

But it shows a problem. The communication between the two endpoints goes over various nodes, and at every node someone might be monitoring the communication, listening for sensitive information like usernames, passwords, social security numbers, credit card numbers and a whole lot more. And to make it even harder, you don’t know if your information will travel over nodes that you can trust! So, you need to make sure that the information you send between both endpoints is secure. And that’s where SSL is used.

SSL is an encryption technique with asymmetric keys. That means that the host has a private key and a public key. It keeps the private key hidden while giving everyone the public key. You can use one key to encrypt a message but you would need the other key to decrypt it again. And that’s basically how the security works! You have the public key and you encrypt your data with it before sending it to me. No one on those nodes has the private key, except me, so I’m the only one who can decrypt it. I can then encrypt a response with the private key and send that back to you, and you can decrypt it again with the public key.

Unfortunately, everyone on those nodes can too, as they would all know this public key as I had just sent it to you. So things need to be a little more secure here. Which fortunately is the case. But in general, you should never encrypt sensitive data with a private key as anyone with the public key would be able to read it!

Session keys…

But the Internet uses a better trick. Once you receive the public key from the web server, your browser will generate a session key, which is a symmetric encryption key. This session key can be used by anyone who knows it and, right now, you would be the only one. But as you have my public key, you use it to encrypt the session key and send it to me. Now I know it too and we can have a communication session using this session key for security. And once the session is done, the session key can be discarded as you would make a new one for every session.

Most people think the encryption is done using the SSL certificates but that’s not the case! SSL is only used to send a session key to the server so both can communicate safely. That is, as long as no one else knows that session key. Fortunately, session keys are short-lived so there isn’t much time for them to fall into the wrong hands…

The main weakness…

So, this seems pretty secure, right? My site sends you a public key, you create a session key and then we communicate using this session key and no one who is listening in on us can know what we are talking about! And for me it would be real easy to make the SSL certificate that would be used as most web servers can do this without any costs, and generally within just a few minutes. So, what can go wrong?

Well, the problem is called the “Man in the Middle” attack and this means that one of the nodes on the line will listen in on the communication and will intercept the request for secure communications! It will notice that you ask for a public key, so it will give you its own public key instead of that of the host. It also asks for the public key of the host so it can relay all communications. You would then basically set up a secure line with the node and the node does the same with the host, and it will be able to listen in on anything that moves between you, as it has to decrypt and then re-encrypt each and every message. So, it can listen to sensitive data without you realizing that this is happening!

Authorities…

So the problem is that you need to know that the public key I gave you is my public key, and not the key of this node. How do you know for sure that this is my key? Well, this is where the Certificate Authorities (CA) have a role.

The CA has a simple role of validating the certificates that I use for my website. I want to secure my host so I make a certificate. I then ask the CA to sign this certificate for me. The CA then checks if I really am the owner of the domain at that specific moment or have at least some control over the domain. And when they believe that I’m the owner, they will sign my certificate.

Then, when you receive my public key then you can check the credentials of this certificate. It should tell you if it is for the specific domain that you’re visiting and it should be signed by the CA who issued the certificate. If it is signed correctly then the CA will have confirmed that this certificate is linked to my host and not that of some node between you and me.

Trust…

But the problem is that when your connection isn’t secure because some node is trying to listen in, then checking if my certificate is properly signed by the CA won’t work, as you would be requesting the CA to validate it over the same unsafe connection. The node will just claim it is valid, so that option won’t work. No, you already need to know the public key of the CA on your system so you can verify my signature. And you need to have received the CA’s certificate from some secure location. Otherwise, you can’t trust that my public key is the real thing.

So most web browsers have a list of public keys as part of their setup. When they are installed, they also include several trustworthy public keys from the more popular CA’s. Basically, the CA’s they deem reliable. So your browser will validate any public key from my site against the public keys it knows and trusts, and if everything is okay, you will be notified that the connection is secure and the session key can be generated for further secure communications.

Otherwise, your browser will give you a warning telling you what’s wrong with the certificate. For example, it might be outdated or not meant for the specific domain. In general, you should not continue as the connection with the host has security problems!

Distrust!…

But here’s the thing… The list of trusted CA’s in your browser can be modified and to be honest, it sometimes gets modified for various reasons. Some are legitimate, others are not.

For example, this list is modified when a specific CA is deemed unreliable. This happens regularly with smaller CA’s but once in a while, some major scandal happens. For example, in 2011 it was discovered that the company DigiNotar had a security breach which had resulted in several certificates being falsified. Most likely, the Iranian government was behind this in an attempt to check all emails that their citizens were sending through Gmail. The fake certificates allowed them to listen in on all emails using the Man in the Middle technique. DigiNotar went bankrupt shortly afterwards, as all the certificates they had issued had become worthless.

Similar problems occurred at StartCom, a CA that actually gave away free certificates. The Israeli company was purchased by a Chinese company and some suspicious behavior happened soon afterwards. The fear was that this Chinese company (and perhaps even the Chinese government) would use this trust that StartCom had to make fake certificates to listen in on all communications in China. Both Mozilla and Google started to raise questions about this and didn’t get satisfying answers so they decided to drop the StartCom certificates. This CA had become controversial.

And then there’s Symantec. Symantec is an American company that has been making security-related software for decades and has been trustworthy for a long time. In 2010 Symantec acquired Verisign’s authentication business unit, which included issuing SSL certificates for websites. But in 2015 it was discovered by Google that Symantec had issued several test certificates impersonating Google and Opera. Further research led Google to believe that Symantec had been issuing questionable certificates for over 7 years, and thus they announced that they would distrust Symantec SSL certificates in the near future. In April 2018, all Symantec certificates will become useless in Google Chrome, and other browsers might follow soon.

Also interesting is that Symantec is selling their SSL business to DigiCert. This could solve the problem as DigiCert is still trusted. Or it makes things worse when browser manufacturers decide to distrust DigiCert from now on also.

But also: Telecom!

But there are more risks! Many people nowadays have mobile devices like tablets and phones. These devices are often included with a subscription to services of some mobile phone company. (T-Mobile and Vodafone, for example.) These companies also sell mobile devices to their customers and have even provided “free phones” to new subscriptions.

However, these companies will either provide you with a phone that has some of their software pre-installed or will encourage you to install their software to make better use of their services. The manufacturers of these mobile devices will generally do similar things if given a chance. And part of the additions they make to your Android or iOS device is to include their own root certificates with the others. This means that those certificates are considered trustworthy by the browser on your device.

Is that bad? Actually, it is, as it allows these companies to also perform a Man in the Middle attack on you. Your telecom provider and the manufacturer of your phone would be able to listen to all the data you’re sending and receiving! This is worse, as a local government might require these companies to listen in on your connection. It is basically a backdoor to your device and you should wonder why you would need to trust your provider directly. After all, your provider is just another node in your connection to the host.

Did you check the certificate?

So the problem with SSL is that it’s as reliable as the Certificate Authorities who made those certificates. It’s even worse if your device has been in the hands of someone who wants to listen in on your secure connections as they could install a custom trusted certificate. Then again, even some malware could install extra public keys in your trusted certificates list without you noticing. So while it is difficult to listen to your secure conversations, it is not impossible.

You should make a habit of checking any new SSL certificate that you see pop up in your browser and it would be a good idea if browsers would first ask you to check a certificate whenever they detect that a site has a new one. It would then be up to the user to decide to trust the certificate or not. And those certificates would then be stored so the browser doesn’t need to ask again.

Unfortunately, that would mean that you get this question to trust a certificate very often when you’re browsing various different sites and users will tend to just click ‘Ok’ to get rid of the question. So, that’s not a very good idea. Most users aren’t even able to know if a certificate is trustworthy or not!

For example, while you’re reading this blog entry, you might not have noticed that this page is also a secured page! But did you check the certificate? Did you notice that it is signed by Automattic and not by me? That it’s a certificate issued by “Let’s Encrypt”? Well, if the certificate is showing this then it should be the right one. (But when my host Automattic changes to another CA, this statement becomes invalid.)

ACME protocol…

And here things become interesting again. “Let’s Encrypt” gives away free certificates but they are only valid for a short time. This makes sense as domains can be transferred to other owners and you want to invalidate the old owner’s certificates as soon as possible. It uses a protocol called ACME, which stands for “Automatic Certificate Management Environment”. It’s basically a toolkit that automates the generation of certificates for domains, so even though the certificates are only valid for a short time, they will be replaced regularly. This is a pretty secure setup, although you’d still have to trust this CA.

Problem is that “Let’s Encrypt” seems to prefer Linux over Windows as there is almost no good information available on how to use ACME on Windows in IIS. But another problem is that this protocol is still under development and thus still has some possible vulnerabilities. Besides, it is very complex, making it useless for less technical developers. The whole usage of certificates is already complex and ACME doesn’t make things easier to understand.

Also troublesome is that I tried to download the ACME client “Certify the Web” for Windows but my virus scanner blocked the download. So, now I have to ask myself if I still trust this download. I decided that it was better not to trust them, especially as I am trying to be secure. Too bad as it seemed to have a complete GUI which would have made things quite easy.

Don’t ignore security warnings! Not even when a site tells you to ignore them…

Additional problems?

Another problem with SSL is that it is an expensive solution so it is disliked by many companies who are hosting websites. It’s not the cost for the certificates, though. It’s the costs for hiring an expert on this matter and making sure they stay with the company! A minor issue is that these security specialists do have access to very sensitive material for companies so you need to be sure you can trust the employee.

Of course, for very small companies and developers who also host websites as a hobby, using SSL makes things a bit more expensive as the costs are generally per domain or subdomain. So if you have three domains and five subdomains then you need to purchase eight certificates! That can easily cost hundreds of euros per year. (You could use a Multi-Domain (SAN) certificate but that will cost about €200 or more per year.)

Plus, there’s the risk that your CA does something stupid and becomes distrusted. That generally means they will have to leave the business and that the certificates you own are now worthless. Good luck trying to get a refund…

But another problem is that the whole Internet is slowly moving away from insecure connections (HTTP) to secure (HTTPS) connections, forcing everyone to start using SSL. Which is a problem as it starts to become a very profitable business and more and more malicious people are trying to fool people into buying fake or useless certificates or keep copies of the private key so they can keep listening to any communications done with their keys. This security business has become highly profitable!

So, alternatives?

I don’t know if there are better solutions. The problem is very simple: Man in the Middle. And the biggest problem with MITM is that he can intercept all communications so you need something on your local system that you already can trust. Several CA’s have already been proven untrustworthy so who do you trust? How do you make sure that you can communicate with my server without any problems?

There is the Domain Name Service but again, as the MITM is intercepting all your transactions, they can also listen in on DNS requests and provide false information. So if a public key would be stored within the DNS system, a node can just fake this when you request for it. The MITM would again succeed as you would not be able to detect the difference.

Blockchain?

So maybe some kind of blockchain technology? Blockchains have proven reliable with the Bitcoin technology as the only reason why people have lost bitcoins is because they were careless with the storage of their coins. Not because the technique itself was hacked. And as a peer-to-peer system you would not need a central authority. You just need to keep the blocks on your system updated at all times.

As we want the Internet to be decentralized, this technology would be the best option to do so. But if we want to move security into blockchains then we might have to move the whole DNS system into a blockchain. This would be interesting as all transactions in blockchains are basically permanent so you don’t have to pay a yearly fee to keep your domain registered.

But this is likely to fail as Bitcoin has also shown. Some people have lost their bitcoins because their disks crashed and they forgot to make a copy of their coins. Also, the file size of the Bitcoin blockchain has grown to 100 GB of data in 2017 which is quite huge. The whole DNS system is much bigger than Bitcoin is so it would quickly have various problems. Your browser would need to synchronize their data with the network which would take longer and longer as the amount of data grows every day.

So, no. That’s not really a good option. Even though a decentralized system sounds good.

Conclusion?

So, SSL has flaws. Then again, every security system will have flaws. The main flaw in SSL is that you need to trust others. And that has already proven to be a problem. You have to be ultra-paranoid to want to avoid all risks and a few people are this paranoid.

Richard Stallman, for example, is a great expert on software yet he doesn’t use a mobile phone and avoids using the Internet directly. Mobile phones are “portable surveillance and tracking devices” so he won’t use them. He also avoids key cards and other items that allow people to track wherever he goes. And he generally doesn’t access the Web directly, as this too would allow people to track what he’s doing. (He does use Tor, though.) And maybe he’s on to something. Maybe we are putting ourselves in danger with all this online stuff and various devices that we have on our wrists, in our pockets and at our homes.

Thing is that there is no alternative for SSL at this moment so being paranoid is useful to protect yourself. Everyone should be aware of the risks they take when they visit the Internet.  This also depends on how important or wealthy you are, as poor, boring people are generally not interesting for malicious people. There isn’t much to gain from people with no money.

Still, people are still too careless when they’re online. And SSL isn’t as secure as most people think, as events from the past have already proven…

The Binary Search problem

Many developers will have to learn all kinds of algorithms in their lives so they can write highly optimized code. Many of these algorithms have long histories and are well-tested. And one of them is the binary search method.

The binary search is a fast algorithm to find a record in a sorted list of records. For most people, this is a very familiar algorithm if you have ever had to guess a value between 1 and 10, or 1 and 100. The principle is quite simple. You have an x amount of records and you pick the record in the middle of the list. For guessing between 1 and 100, you would pick 50. (100/2) If it is the correct value, you’ve won. If it is too high, you now know the value must be between 1 and 50 so you guess again with value 25. Too low and you pick the value between 50 and 100, which would be 75. You should be able to guess any value between 1 and 100 in at most seven tries, since each guess halves the remaining range and 2^7 = 128 is already more than 100.

Actually, the binary search is most easily explained as bit-wise checking of values. A single byte goes from 00000000 to 11111111, so basically all you do is a bitwise comparison from the highest bit to the lowest. You start with value 10000000 (128) and if the value you search for is higher, you know that the first bit is 1, else it needs to be 0.

Your second guess would either be 11000000 (192) or 01000000 (64) and you would continue testing bits until all bits have been tested. However, your last test could also indicate that you guessed wrong, so the maximum number of guesses would be equal to the number of bits plus one.

And that’s basically what a binary search is. But it tends to be slightly more complicated. You’re not comparing numbers from 0 to some maximum value; those numbers are generally a kind of index for an array, and you compare the value at that position in the array. You have a value X, which could be any data type and even a multi-field record, and you have an array of records which has all data sorted on some specific index. And these arrays can be reasonably large. Still, the binary search allows for some very quick searches.

Now, the biggest problem with the binary search is how people will calculate the index value for the comparison. I already said that you could basically check the bits from high to low but most developers will use a formula like (floor+ceiling)/2 where floor would be the lowest index value and ceiling the highest index value. This can cause an interesting problem with several programming languages because there’s a risk of overflows when you do it like this!

So, overflow? Yes! If the index is an unsigned byte then it can only hold a value of 11111111 (255) as a maximum value. So as soon as you have a floor value of 10000000 (128) and a ceiling of at least 10000001 (129), the sum requires 9 bits. But bytes can’t contain 9 bits so an overflow occurs. And what happens next is difficult to predict.

For a signed byte it would be worse, since value 10000000 would be -128, so you would effectively have 7 bits to use. If the 8th bit is set, your index value becomes negative! This means that with a signed byte, your array could never be longer than 64 records, else this math will generate an overflow. (64+65 would be 129, which translates to -127 for signed bytes.)

Fortunately, most developers use integers as index, not bytes. They generally have arrays larger than 256 records anyway. So that reduces the risk of overflows. Still, integers use one bit for the sign and the other bits for the number. A 16-bit integer thus has 15 bits for the value. So an overflow can happen once the number of records has the highest of those bits set, meaning any value of 16384 and over. If your array has more than 16384 records then the calculation (floor+ceiling)/2 will sometimes generate an overflow.

So, people solved this by changing the formula to floor+((ceiling-floor)/2), because ceiling-floor cannot cause an overflow. It does make the math slightly more complex, but this is the formula most people are familiar with when doing a binary search!

Yet this formula makes no sense if you want high performance! If you want a binary search, you could just toggle each bit of the index until you find the value. To do so, you need to know how many bits you need for the highest value. And that will also tell you how many guesses you will need, at most, to find the value. But this kind of bitwise math tends to be too complex for most people.

So, there is another solution. You can promote the index value to a bigger type. You could use a 32-bit value if the index is a 16-bit value. Thus, you could use (int16_t)(((int32_t)floor + (int32_t)ceiling) / 2) and the overflow is gone again. And for a 32-bit index you could promote the math to a 64-bit integer type and again avoid any overflows.

It is still less optimal than just toggling bits but the math still looks easy and you can explain why you’re promoting the values.

But what if the index is a 64-bit value? There are almost no 128-bit values in most programming languages. So how to avoid overflows in those languages?

Well, here’s another thing. As I said, the index value is part of an array. And this array is sorted and should not have any duplicate values. So if you have 200 records, you would also need 200 unique values, with each value being at least 1 byte in size. If the index is a 15-bit signed integer then the values in the array must also be at least 15-bits and would generally be longer. Most likely, it would contain pointers to records elsewhere in memory and pointers are generally 32-bits. (In the old MS-DOS era, pointers were 20 bits, so these systems could manage up to 1.048.576 bytes or 1 megabyte of memory.)

So, let’s do math! For an overflow to occur with an index as a signed 16-bit integer you would need to have at least 16384 records. Each record would then be at least 2 bytes in size, thus you would have at least 32 kilobytes of data to search through. Most likely even more, since the array is probably made up of pointers pointing to string values or whatever. But 32 KB would be the minimum for an overflow to occur when using a 16-bit signed index.

So, a signed 32-bit index would at least have bit 30 set to 1 before an overflow can occur. It would also need to contain 32-bit values to make sure every value is unique so you would have 4 GB of data to search through. And yes, that is the minimum amount of data required before an overflow would occur. You would also need up to 31 comparisons to find the value you’re searching for, which is becoming a bit high already.

So, a signed 64-bit index would need records of at least 8 bytes in size! This requires 36,893,488,147,419,103,232 bytes of data! That's 33,554,432 terabytes! 32,768 petabytes! 32 exabytes! That's a huge amount of data, twice the amount of data stored by Google! And you need more data than this before an overflow can occur. And basically, this assumes you're just storing 64-bit integer values in the array; in general, the data stored will be more complex.

So, overflows with a 32-bit index are rare, and with a 64-bit index they are very unlikely. The amount of data required would be huge. And once you're dealing with that much data, you should be considering alternative solutions anyway.

The alternative solution would be a hash table. By using a hash function you can reduce any value to, for example, a 16-bit value. That value becomes the index into an array of pointers; with a 16-bit index and 32-bit pointers, the whole array takes 256 KB. Each entry in this array points to a second, sorted array of records, so you have 65,536 separate sorted arrays, and within each of them you can use a binary search. This is ideal for huge amounts of data, although you could optimize things further by calculating an even bigger hash value. (E.g. 20 bits.)
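A compact sketch of this scheme, with some assumptions of my own: a multiplicative hash down to 16 bits (the constant is just for illustration), a small fixed bucket capacity, and the standard bsearch() for the per-bucket lookup.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BUCKETS 65536   /* 2^16 buckets, as in the text */
#define CAP     8       /* demo-sized bucket capacity    */

typedef struct { int64_t keys[CAP]; int count; } Bucket;

/* Reduce any 64-bit key to a 16-bit bucket index. */
static uint16_t hash64(int64_t key)
{
    return (uint16_t)(((uint64_t)key * 0x9E3779B97F4A7C15ULL) >> 48);
}

static int cmp64(const void *a, const void *b)
{
    int64_t x = *(const int64_t *)a, y = *(const int64_t *)b;
    return (x > y) - (x < y);
}

static void insert(Bucket *table, int64_t key)
{
    Bucket *b = &table[hash64(key)];
    b->keys[b->count++] = key;   /* demo: no capacity check */
    /* keep the bucket sorted so binary search works on it */
    qsort(b->keys, (size_t)b->count, sizeof(int64_t), cmp64);
}

static int contains(const Bucket *table, int64_t key)
{
    const Bucket *b = &table[hash64(key)];
    /* binary search, but only within this one short bucket */
    return bsearch(&key, b->keys, (size_t)b->count,
                   sizeof(int64_t), cmp64) != NULL;
}
```

Allocate the table with calloc(BUCKETS, sizeof(Bucket)) so all buckets start empty.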

Using a hash table is quite easy. You calculate the hash over the value you're searching for and then check the list at that specific slot in your hash table. If it is empty, your value isn't in the system. Otherwise, you search the (much shorter) list at that location. Especially when the hash function distributes all possible values evenly, a hash table is extremely effective.

Which brings me to a clear point: binary search isn't really suitable for large amounts of data! First of all, your data needs to be sorted! And you need to maintain this sort order every time you add or delete records, or change the key value of a record! Hash tables are generally unsorted and perform better, especially with large amounts of data.

So, people who use a 32-bit index for a binary search are just getting themselves into trouble if they fear overflows. When they start using floor+((ceiling-floor)/2) for their math, they're clearly showing that they don't understand the algorithm that well. The extra math slows the algorithm down slightly, while the risk of overflow should not exist in the first place. If it does exist with a 32-bit index, then you're already using the wrong algorithm to search your data: you're maintaining an index of at least 4 GB in size, making it really difficult to insert new records. That is, if overflows can occur at all. The time needed to sort that much data is also considerable and, again, far from optimal.

Thing is, developers often tend to use the wrong algorithms and often have the wrong fears. Whenever you use a specific algorithm, you have to consider all your options. Are you using the right algorithm for the problem? Are you at risk of overflows and underflows? How much data do you expect to handle? And what are the alternative options?

Finally, as I said, doing a binary search basically means toggling bits in the index. Instead of doing math to calculate the halfway point, you can just toggle bits from high to low. That way, there is never even a chance of an overflow.
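A sketch of that bit-toggling search: the candidate index is built one bit at a time, from the highest bit down, by probing whether setting each bit still lands on an element not greater than the key. No two indices are ever added together, so no overflow is possible. (Function and variable names are mine; a sorted array without duplicates is assumed, as in the text.)

```c
#include <assert.h>
#include <stdint.h>

/* Returns the index of key in the sorted array a[0..count-1],
 * or -1 when it is not present. */
static int64_t bitsearch(const int64_t *a, int64_t count, int64_t key)
{
    if (count <= 0) return -1;

    /* highest power of two not above count; count/2 as the loop
     * bound keeps even this shift free of overflow */
    int64_t bit = 1;
    while (bit <= count / 2) bit <<= 1;

    /* build a 1-based position by toggling bits from high to low */
    int64_t pos = 0;
    for (; bit != 0; bit >>= 1) {
        int64_t probe = pos | bit;   /* same as pos + bit here */
        if (probe <= count && a[probe - 1] <= key)
            pos = probe;
    }
    /* pos is now the 1-based rank of the largest element <= key */
    if (pos > 0 && a[pos - 1] == key) return pos - 1;
    return -1;
}
```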