Multithreading, multi-troubling.

Recently, I worked on a small project that needed to make a catalog of the image files and folders on my hard disk and save this catalog in a database. Since my CGI hobby and my photography generate a lot of images, it would be practical to have something simple that keeps track of it all. There’s plenty of software that already does something like this, but none that I liked, especially since I want to connect images to derived images, group them, tag them, share them, assign licenses to them and publish them. And I want to keep track of where I’ve shared them already. Are they on Flickr? CafePress? DeviantArt? Plus, I wanted to know if they should be rated as adult. Some of my CGI artwork is naughty by nature (because nude models are easier to work with) and thus unsuitable for a broad audience.

But for this simple catalog I just wanted to store the image folder, the image filename, an image name that would be the filename without extension and without diacritics, plus the width and height of the image so I could calculate the image ratio. To make it slightly more complex, the folder name would be a relative folder name based on a root folder that’s set in the configuration. This would allow me to move the images to a different folder or use the same database on a different machine without the need to adjust all records.

So, the database structure is simple. One table for the folders, one for the image ratios and one for the image names and sizes. The ratio table will help me group images based on the ratio between width and height; the folder table does the same for grouping by folder. The Entity Framework would connect to this database and take away a lot of my troubles. All I had to do was write a simple library that would fill and maintain this catalog, plus a console application to call those methods. Sounds simple enough.
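For illustration, a minimal code-first model could look something like this; the entity and context names are my own invention for this sketch, not the actual project’s:

```csharp
// Hypothetical EF code-first model for the catalog; all names are
// placeholders, not the original project's.
using System.Data.Entity;

public class ImageFolder
{
    public int Id { get; set; }
    public string RelativePath { get; set; }    // relative to the configured root
}

public class ImageRatio
{
    public int Id { get; set; }
    public int Width { get; set; }              // smallest integer ratio, e.g. 16:9
    public int Height { get; set; }
}

public class ImageFile
{
    public int Id { get; set; }
    public string FileName { get; set; }
    public string Name { get; set; }            // file name without extension and diacritics
    public int Width { get; set; }
    public int Height { get; set; }
    public int FolderId { get; set; }
    public int RatioId { get; set; }
    public virtual ImageFolder Folder { get; set; }
    public virtual ImageRatio Ratio { get; set; }
}

public class CatalogContext : DbContext
{
    public DbSet<ImageFolder> Folders { get; set; }
    public DbSet<ImageRatio> Ratios { get; set; }
    public DbSet<ImageFile> Images { get; set; }
}
```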

Within 30 minutes, the first version was ready. I would first enumerate all folders below the source folder, then for each folder in that list I would collect all image files of type PNG, JPG and BMP. Each folder would be written to the folder table and each file would be put in the image table. Just one minor challenge, though…
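In sketch form, that first version was little more than this (rootFolder stands for the configured root):

```csharp
// Single-threaded first version; requires using System.IO; using System.Linq;
var extensions = new[] { ".png", ".jpg", ".bmp" };
foreach (var folder in Directory.EnumerateDirectories(rootFolder, "*", SearchOption.AllDirectories))
{
    var files = Directory.EnumerateFiles(folder)
        .Where(f => extensions.Contains(Path.GetExtension(f).ToLowerInvariant()));
    foreach (var file in files)
    {
        // add the folder record (if new), then the image record...
    }
}
```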

I wanted to add the width and height of each image to the image table too, and based on the ratio between width and height, I would have to either add a new ratio record or reuse an existing one. And this meant that I had to read every file into memory to find its size and then check whether a matching ratio record already existed. If not, I would need to add the new ratio record and make sure the next request for ratio records would include it. Plus, I needed to check if the image and folder records already existed in the database, because this tool should only add new images.
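Per image, the naive check-then-add looks roughly like this, using the hypothetical types from the model sketch above and a reference to System.Drawing:

```csharp
// Read the file just to get its dimensions -- the expensive part.
int width, height;
using (var img = System.Drawing.Image.FromFile(file))
{
    width = img.Width;
    height = img.Height;
}

// Reduce to the smallest integer ratio, e.g. 1920x1080 -> 16:9 (GCD).
int a = width, b = height;
while (b != 0) { var t = b; b = a % b; a = t; }
int ratioW = width / a, ratioH = height / a;

var ratio = context.Ratios.FirstOrDefault(r => r.Width == ratioW && r.Height == ratioH);
if (ratio == null)
{
    // With multiple threads, two of them can both reach this point for
    // the same new ratio -- the collision described below.
    ratio = new ImageRatio { Width = ratioW, Height = ratioH };
    context.Ratios.Add(ratio);
    context.SaveChanges();
}
```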

The performance was horrible, as could easily have been predicted, especially since I make images and photos at high resolutions, so reading each of those files takes dozens of milliseconds. No matter that my six cores at 3.5 GHz and 32 GB of RAM turn my system into a speed demon, these read actions are just slow. And I did it inefficiently: I have six cores, but my code was single-threaded. So, redo from start, and this time do it multithreaded.

But multithreading and the Entity Framework don’t go well together. The database connection isn’t thread-safe, so you cannot access the database methods from multiple threads. Besides, the ratio table could generate collisions when two images with the same new ratio are processed at the same time: both threads would notice the ratio doesn’t exist, so both would add it, and one of them would then fail because the other added it first. So I needed to change my approach.

So I used ‘Parallel.ForEach’ to walk through the folder list and then again for all files within each folder. I would collect the data in internal lists and, when the file loop was done, loop through all images and add those that didn’t exist yet. And yes, that improved performance a lot and kept the conflicts with the ratio table away. Too bad I was still reading all images, but that was not a big issue. Performance went up from hours to slightly over one hour. Still slow.
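A sketch of that version: parallel scanning into a thread-safe collection, followed by a single-threaded database pass (since the EF connection is not thread-safe):

```csharp
// Parallel disk scan; the database is only touched afterwards.
// requires: using System.Collections.Concurrent; using System.Threading.Tasks;
var foundImages = new ConcurrentBag<ImageFile>();

Parallel.ForEach(Directory.EnumerateDirectories(rootFolder, "*", SearchOption.AllDirectories), folder =>
{
    Parallel.ForEach(Directory.EnumerateFiles(folder), file =>
    {
        if (!extensions.Contains(Path.GetExtension(file).ToLowerInvariant())) return;
        foundImages.Add(new ImageFile { FileName = Path.GetFileName(file) /* size read here too */ });
    });
});

// Single-threaded database pass: add only what doesn't exist yet.
using (var context = new CatalogContext())
{
    foreach (var image in foundImages)
    {
        // look up or add the folder, ratio and image records...
    }
    context.SaveChanges();
}
```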

So, one more addition. I would first read all existing folders and images from the database, and if a file existed in this list, I would not read its size anymore since that wasn’t needed; I could skip the image entirely. As a result, it still took an hour the first time I imported all images, but the second run would finish within a minute, since there wasn’t anything left to read or add. The speed was now limited by just reading the files and folders from the database and from the disk.
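In sketch form, with a hypothetical folder/filename key:

```csharp
// Snapshot what the database already knows...
HashSet<string> knownFiles;
using (var context = new CatalogContext())
{
    knownFiles = new HashSet<string>(
        context.Images.Select(i => i.Folder.RelativePath + "/" + i.FileName));
}

// ...and inside the file loop, skip anything already catalogued:
if (knownFiles.Contains(relativeKey))    // relativeKey: hypothetical folder/filename key
    return;                              // no need to open the file at all
```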

When you’re working on these kinds of projects in an Agile team and you’re scrumming around, things will slow down considerably if you haven’t thought about these challenges before you start the sprint to create the code. Since the first version looks quite simple, you might have planned it as a very short task and thus end up with extremely slow code. In the next sprint you would have to consider options to speed things up, and you would realize that making it multithreaded is a bigger task. And while you are working on the multithreaded version, you might discover the conflicts with the Entity Framework plus the possible collisions within the tables. So the second sprint might end with a faster but buggy solution with lots of exception handling to catch all possible problems. The third sprint would then fix these, if you manage to find a better solution. Else, this problem might haunt you until the deadline of the project…

And this is where teams have to be really careful. The task sounds very simple, but it isn’t. These things are easily underestimated by a team and should be well planned before you start writing code. Experienced developers will detect these problems before they start, knowing that they should take their time and plan carefully instead of writing code immediately. (I only did it so I could write this post.) The task seems extremely simple, and I managed to describe it in the second paragraph of this post in just three lines. But a high-performance solution requires me to think before I start writing code.

My last approach is the most promising, though. It can be done with multithreading, but it’s far more complex than you’d assume at first. And it will be memory-hungry, because you need to keep several lists in memory.

You would have to start with two threads. One thread reads the database and generates lists of files, folders and ratios. These lists must be completely in-memory, because if you keep them as queryable lists, the system would continuously re-read the database. Besides, once you’re done generating these lists, you will want to close the database connection. This all tells you what you already have. The second thread reads all folders and, using parallel threads, all image files within those folders. But you would not read the image sizes yet, nor calculate any ratios.
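A sketch of those two startup threads, again with the hypothetical types from earlier; ToList() is what forces the queries to materialize, so the connection can be closed afterwards:

```csharp
List<ImageFolder> dbFolders = null;
List<ImageFile>  dbImages  = null;
List<ImageRatio> dbRatios  = null;
var diskFiles = new ConcurrentBag<string>();

var dbTask = Task.Run(() =>
{
    using (var context = new CatalogContext())
    {
        dbFolders = context.Folders.ToList();                       // plain in-memory lists,
        dbImages  = context.Images.Include(i => i.Folder).ToList(); // no lazy re-reads
        dbRatios  = context.Ratios.ToList();
    }   // connection closed here
});

var diskTask = Task.Run(() =>
    Parallel.ForEach(Directory.EnumerateDirectories(rootFolder, "*", SearchOption.AllDirectories),
        folder =>
        {
            foreach (var file in Directory.EnumerateFiles(folder))
                diskFiles.Add(file);    // sizes are *not* read yet
        }));

Task.WaitAll(dbTask, diskTask);
```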

When you’re done collecting the data, you will have to compare it all. You start by comparing the lists of folders. Folders that exist in both lists can be ignored (but not their files). Folders that exist in the database list but not in the disk list should be deleted, including all files within those folders! Folders that are on disk but not in the database need to be added. Thus you can now start two threads, each with its own database connection: one deletes all removed folders plus their related images from the database, while the other adds all new folders found on disk. By using two database connections, you speed things up. You will have to wait for both threads to finish, though. But it shouldn’t be slow.
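Roughly like this, assuming diskFolders holds the relative folder paths found by the disk scan:

```csharp
var dbPaths   = new HashSet<string>(dbFolders.Select(f => f.RelativePath));
var diskPaths = new HashSet<string>(diskFolders);   // relative paths from the disk scan

var foldersToDelete = dbFolders.Where(f => !diskPaths.Contains(f.RelativePath)).ToList();
var foldersToAdd    = diskPaths.Where(p => !dbPaths.Contains(p)).ToList();

// Two writers, each with its own connection.
var deleteTask = Task.Run(() =>
{
    using (var context = new CatalogContext())
    {
        foreach (var folder in foldersToDelete)
        {
            context.Images.RemoveRange(context.Images.Where(i => i.FolderId == folder.Id));
            context.Folders.Remove(context.Folders.Find(folder.Id));
        }
        context.SaveChanges();
    }
});

var addTask = Task.Run(() =>
{
    using (var context = new CatalogContext())
    {
        foreach (var path in foldersToAdd)
            context.Folders.Add(new ImageFolder { RelativePath = path });
        context.SaveChanges();
    }
});

Task.WaitAll(deleteTask, addTask);
```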

The next step is the comparison of images. Here you do something similar to the folders: you split the lists into three. One with all images that are unchanged, one with all images that need to be deleted, and one with all images that need to be added. And you create a separate thread with its own database connection to delete the images, so your main process can start working on the ratios table.
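The same diff trick in sketch form; ToRelativeKey is a hypothetical helper that turns a full path into the folder/filename key used above:

```csharp
var dbKeys   = new HashSet<string>(dbImages.Select(i => i.Folder.RelativePath + "/" + i.FileName));
var diskKeys = new HashSet<string>(diskFiles.Select(ToRelativeKey));

var imagesToDelete = dbImages.Where(i => !diskKeys.Contains(i.Folder.RelativePath + "/" + i.FileName)).ToList();
var imagesToAdd    = diskFiles.Where(f => !dbKeys.Contains(ToRelativeKey(f))).ToList();
// Everything in both sets is unchanged and can be ignored.

// Deletions run on their own connection while the main thread moves on.
var deleteImagesTask = Task.Run(() =>
{
    using (var context = new CatalogContext())
    {
        foreach (var image in imagesToDelete)
            context.Images.Remove(context.Images.Find(image.Id));
        context.SaveChanges();
    }
});
```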

Because we now know which images need to be added, we can go through those files using parallel processing, read the image width and height, and add this information to the image file records. Once we have enriched this list with the sizes, we can use a LINQ query to generate a list of all ratios of those images, removing the duplicates. This gives us the list of ratios that we need to check.
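For example, with a Reduce helper applying the same GCD reduction as before (the value tuples assume a recent C#):

```csharp
// Reduce a size to its smallest integer ratio: 1920x1080 -> (16, 9).
static (int W, int H) Reduce(int width, int height)
{
    int a = width, b = height;
    while (b != 0) { var t = b; b = a % b; a = t; }
    return (width / a, height / a);
}

var newImages = new ConcurrentBag<ImageFile>();

Parallel.ForEach(imagesToAdd, path =>
{
    using (var img = System.Drawing.Image.FromFile(path))   // only the *new* files are opened
    {
        newImages.Add(new ImageFile
        {
            FileName = Path.GetFileName(path),   // in practice, keep the relative folder too
            Width    = img.Width,
            Height   = img.Height
        });
    }
});

// One entry per distinct ratio among the new images.
var neededRatios = newImages
    .Select(i => Reduce(i.Width, i.Height))
    .Distinct()
    .ToList();
```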

Before we add the new images, we have to check the ratios table. As with the folders table, we check for all differences. However, we cannot delete ratios that we haven’t found among the new images, because we skipped the images that already exist. We will do that later, though. We first add the new ratios to the database. This too could be done in a separate thread, but it’s pretty fast anyway, so why bother? A performance gain of two seconds isn’t worth the extra effort if a process takes minutes to finish. So just add the new ratios.

Once all ratios are added, we can add all images. We could do this using parallel threads, with each thread creating a new database connection and processing all images from one specific folder or with one specific ratio. But if you want to add them multithreaded, I would just recommend dividing the images into groups of similar sizes. Keep the number of groups proportional to the number of cores (e.g. 24 for my six cores) and let the system do its work. By dividing the images evenly over multiple threads, they should all take about the same amount of time.
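A round-robin split keeps the groups roughly equal in size; the group count is a tunable, and the four-per-core factor below merely mirrors the 24-for-six-cores suggestion:

```csharp
int groupCount = Environment.ProcessorCount * 4;    // e.g. 24 groups on six cores

var groups = newImages
    .Select((image, index) => new { image, index })
    .GroupBy(x => x.index % groupCount, x => x.image);  // round-robin: similar group sizes

Parallel.ForEach(groups, group =>
{
    using (var context = new CatalogContext())      // one connection per thread
    {
        foreach (var image in group)
        {
            // resolve FolderId and RatioId first (see the lookup sketch below)...
            context.Images.Add(image);
        }
        context.SaveChanges();                      // one commit per group
    }
});
```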

When adding the new images, you will have to find the related folder and ratio in the database again. This makes adding images slower than adding folders or ratios, because you need the extra lookup. Performance would improve if we had kept the folders and ratios lists as queryable lists, but then we could not open and close the connections, nor could we use multiple connections to add those images. And we want multiple connections to speed things up. So we accept slightly worse performance at this point, although we could probably speed it up a bit by using a stored procedure to add the images. The stored procedure would have parameters for the image name, the image filename, the width and height, the folder name and the ratio width and height. I’m not too fond of procedures with many parameters and I haven’t tested whether this would increase performance, but in theory it should be faster, especially if the database is on a different machine than the application.
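The per-image lookup inside the insert loop might look like this (folderRelativePath is a hypothetical variable carrying the image’s relative folder, and Reduce is the helper from earlier):

```csharp
var folder = context.Folders.First(f => f.RelativePath == folderRelativePath);
var (rw, rh) = Reduce(image.Width, image.Height);
var ratio  = context.Ratios.First(r => r.Width == rw && r.Height == rh);

image.FolderId = folder.Id;     // the extra per-image lookups described above
image.RatioId  = ratio.Id;
context.Images.Add(image);
```

Since the folder and ratio lists are already in memory, an alternative would be to build dictionaries from path and ratio to Id and skip these round-trips entirely; after SaveChanges, EF writes the generated keys back into the added entities, so those lists can be refreshed cheaply.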

And thus a simple task of adding images to a database turns out to be complex, simply because we need better performance. It would still take hours when there are a lot of new images to add, but once the catalog is mostly filled, it will do quite well.

But you will have to ask yourself and your team whether you are capable of detecting these problems before you start a new sprint. Designs look simple, because designers don’t always keep performance in mind. These things are easily asked for because they appear very simple, but they have a lot of consequences. Similar problems arise when you work on projects that need to be secure. The design might ask for a login screen with username and password, and optionally a few OpenID providers as alternative logins, but the code needed to manage all this data and keep it secure is quite complex. These are the moments when you need to write some technical documentation first, which is something people often forget when working on an Agile project.

Still, you cannot blame the developer if the designer just writes a few lines and the developer chooses the first, slow solution. The result would be the requested task. It is the designer who needs to be aware of these possible performance pitfalls. And with Agile, you have a team. All team members should be able to point out that this simple description hides these pitfalls, making it a long and complex task. They should all realise that they will have to discuss possible solutions, and preferably they do so as a team with just one computer. (The computer would be used to find information, not to write code!) Only when they agree on the proper solution should one or two of them start writing code. And then they would know how long the task will take. Thus, the task would finish within two sprints: in the first sprint, all team members have the small task of meeting to discuss the options; in the second sprint, one or more members have the big task of implementing the code.

Or, to keep it simple: think before you start writing code!