Productive Rage

Hosting a DigitalOcean App Platform app on a custom subdomain (with CORS)

Sun, 06 Apr 2025 19:27:00 GMT

TL;DR

I host my blog using GitHub Pages (repo here), and have the domain registered through GoDaddy. I wanted to experiment with adding some additional functionality to my static content, using DigitalOcean App Platform (where I can essentially throw a Docker container and have it appear on the internet).

I wanted this DigitalOcean-hosted app to be available through a productiverage.com subdomain, and I wanted it to be accessible as an API from JavaScript on the page. SSL* has long been a given, and I hoped that I would hit few (if any) snags with that.

There are instructions out there for doing what I wanted, but I still encountered so many confusions and gotchas, that I figured I'd try to document the process (along with a few ways to reassure yourself when things look bleak).. even if it's only for future-me!

* (Insert pedantic comment about how TLS has replaced SSL, and so we shouldn't refer to "SSL" or "SSL certificates" - for the rest of the post, I'll be saying "SSL" and hopefully that doesn't upset anyone too much despite it not being technically correct!)

DigitalOcean App Platform

So you have something deployed using DigitalOcean's App Platform solution. It will have an automatically generated unique URL that you can access it on, which is a subdomain of "ondigitalocean.app" (something like https://productiverage-search-58yr4.ondigitalocean.app). This will not change (unless you delete your app), and you can always use it to test your application.

You want to host the application on a subdomain of a domain that you own (hosted by GoDaddy, in my case).

To start the process, go into the application's details in DigitalOcean (the initial tab you should see is called "Overview") and click into the "Settings" tab.

Note: Do not click into the "Networking" section through the link in the left hand navigation bar (under "Manage"), and then into "Domains" (some guides that I found online suggested this, and it only resulted in me getting lost and confused - see the section below as to why).

This tab has the heading "App Settings" and the second section should be "Domains". Click "Edit" and then the "+Add Domain" button.

Here, enter the subdomain that you want to use for your application. Again, the auto-assigned ondigitalocean.app subdomain will never go away, and you can add multiple custom domains if you want (though I only needed a single one).

You don't actually have to own the domain at this point; DigitalOcean won't do any checks other than ensuring that you don't enter a domain that is registered by something else within DigitalOcean (either one of your own resources, or a resource owned by another DigitalOcean customer). If you really wanted to, you could enter a subdomain of a domain that you know that you can't own, like "myawesomeexperiment.google.com" - but it wouldn't make a lot of sense to do this, since you would never be able to connect that subdomain to your app!

In my case, I wanted to use "search.productiverage.com".

Note: It's only the domain or subdomain that you have to enter here, not the protocol ("http" or "https") because (thankfully) it's not really an option to operate without https these days. Back in the dim and distant past, SSL certificates were frustrating to purchase, and register, and renew - and they weren't free! Today, life is a lot easier, and DigitalOcean handles it for you automatically when you use a custom subdomain on your application; they register the certificate, and automatically renew it. When you have everything working, you can look up the SSL certificate of the subdomain to confirm this - eg. when I use sslshopper.com to look up productiverage.com then I see that the details include "Server Type: GitHub.com" (same if I look up "www.productiverage.com") because I have my domain configured to point at GitHub Pages, and they look after that SSL certificate. But if I use sslshopper.com to look up search.productiverage.com then I see "Server Type: cloudflare" (although it doesn't mention DigitalOcean, it's clearly a different certificate).

With your sub/domain entered (and with DigitalOcean having checked that it's of a valid form, and not already in use by another resource), you will be asked to select some DNS management options. Click "You manage your domain" and then the "Add Domain" button at the bottom of the page.

This will redeploy your app. After that, you should see the new domain listed in the table that opened after you clicked "Edit" alongside "Domains" in the "Settings" tab of your app. It will probably show the status as "Pending", though it might already show "Configuring" at this point - if it doesn't, then refreshing the page and clicking "Edit" again alongside the "Domains" section should result in it now showing "Configuring".

There will be a "?" icon alongside the "Configuring" status - if you hover over it you will see the message "Your domain is not yet active because the CNAME record was not found". Once we do some work on the domain registrar side (eg. GoDaddy), this status will change!

DigitalOcean App Platform - Avoiding "Networking" / "Domains"

I read some explanations of this process that said that you should configure your custom domain not by starting with the app settings, but by clicking the "Networking" link in the left hand nav (under "Manage") and then clicking into "Domains". I spent an embarrassing amount of time going down this route, and getting frustrated when I reached a step that would say something like "using the dropdown in the 'Directs to' column, select where the custom domain should be used" - I never had a dropdown, and couldn't find an explanation why!

When you configure a custom sub/domain this way, it can only be connected to (iirc) Load Balancers (which "let you distribute traffic between multiple Droplets either regionally or globally") or, I think, Reserved IPs (which you can associate with any individual Droplet, or with DigitalOcean's managed Kubernetes service - referred to as "DOKS"). You cannot select an App Platform instance in a 'Directs To' dropdown in the "Networking" / "Domains" section, and that is what was causing me to stumble since I only have my single App Platform instance right now (I don't have a load balancer or any other, more complicated infrastructure).

Final note on this; if you configure a custom domain as I'm describing, you won't see that custom domain shown in the "Networking" / "Domains" list. That is nothing to worry about - everything will still work!

My use of GoDaddy (in short; I configure DNS to serve GitHub Pages content)

Long ago, I registered my domain with GoDaddy and hosted my blog with them as an ASP.NET site. I wasn't happy with the performance of it - it was fast much of the time, but would intermittently serve requests very slowly. I had a friend who had purchased a load of hosting capacity somewhere, so I shifted my site over to that (where it was still hosted as an ASP.NET site) and configured GoDaddy to send requests that way.

Back in 2016, I shifted over to serving the blog through GitHub Pages as static content. The biggest stumbling block to this would have been the site search functionality, which I had written for my ASP.NET app in C# - but I had put together a way to push that all to JS in the client in 2013 when I got excited about Neocities being released (I'm of an age where I remember the often-hideous, but easy-to-build-and-experiment-with, Geocities pages.. back before the default approaches to publishing content seemed to be within walled gardens or on pay-to-access platforms).

As my blog is on GitHub Pages, I have A records configured in the DNS settings for my domain within GoDaddy that point to GitHub servers, and a CNAME record that points "www" to my GitHub subdomain "productiverage.github.io".

The GitHub documentation page "Managing a custom domain for your GitHub Pages site" describes the steps that I followed to end up in this position - see the section "Configuring an apex domain and the www subdomain variant". The redirect from "productiverage.com" to "www.productiverage.com" is managed by GitHub, as is the SSL certificate, and the redirection from "http" to "https".

Until I created my DigitalOcean app, GoDaddy's only role was to ensure that, when someone tried to visit my blog, the DNS lookup resulted in them going to GitHub, who would pick up the request and serve my content.

Configuring the subdomain for DigitalOcean in GoDaddy

Within the GoDaddy "cPanel" (ie. their control panel), click into your domain, then into the "DNS" tab, and then click the "Add New Record" button. Select CNAME in the "Type" dropdown, type the subdomain segment into the "Name" textbox (in my case, I want DigitalOcean to use the subdomain "search.productiverage.com" so I entered "search" into that textbox, since I was managing my domain "productiverage.com"), paste the DigitalOcean-generated domain into the "Value" textbox ("productiverage-search-58yr4.ondigitalocean.app" for my app), and click "Save".

You should see a message informing you that DNS changes may take up to 48 hours to propagate, but that it usually all happens in less than an hour.

In my experience, it often only takes a few minutes for everything to work.

If you want to get an idea about how things are progressing, there are a few things you can do -

If you open a command prompt and ping the DigitalOcean-generated subdomain (eg. "productiverage-search-58yr4.ondigitalocean.app") and then ping your new subdomain (eg. "search.productiverage.com") they should resolve to the same IP address
With the IP address resolving correctly, you can try visiting the subdomain in a browser - if you get an error message like "Can't Establish a Secure Connection" then DigitalOcean hasn't finished configuring the SSL certificate, but this error is still an indicator that the DNS change has been applied (which is good news!)
If you go back to your app in the DigitalOcean control panel, and refresh the "Settings" tab, and click "Edit" alongside the "Domains" section, the status will have changed from "Configuring" to "Active" when it's ready (you may have to refresh a couple of times, depending upon how patient you're being, how slow the internet is being, and whether DigitalOcean's UI automatically updates itself or not)
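The first of those checks can also be scripted. What follows is a minimal sketch for Node.js (an assumption; the checks above only need ping and a browser) that looks up both hostnames through the operating system, much as ping does, and reports whether they share an address. The function names are purely illustrative, and the addresses returned may vary between lookups, so treat a mismatch as a reason to try again in a few minutes rather than as proof that something is broken.

```javascript
// Sketch only: these are the hostnames used in this post, so swap in your own.
const GENERATED_HOST = "productiverage-search-58yr4.ondigitalocean.app";
const CUSTOM_HOST = "search.productiverage.com";

// Pure helper: true if the two address lists have at least one entry in common.
function sharesAddress(first, second) {
  const known = new Set(first);
  return second.some((address) => known.has(address));
}

// Resolves a hostname via the operating system's resolver and returns
// every address that it reports.
async function lookupAddresses(hostname) {
  const { lookup } = await import("node:dns/promises");
  const results = await lookup(hostname, { all: true });
  return results.map((result) => result.address);
}

// Looks up both hostnames and reports whether the custom one has caught up.
async function checkCustomDomain() {
  const [generated, custom] = await Promise.all([
    lookupAddresses(GENERATED_HOST),
    lookupAddresses(CUSTOM_HOST),
  ]);
  console.log(GENERATED_HOST, generated);
  console.log(CUSTOM_HOST, custom);
  console.log(sharesAddress(generated, custom) ? "DNS looks ready" : "Not matching yet");
}
```

Calling checkCustomDomain() from a script run with node prints both sets of addresses. If the lookup of the custom subdomain is rejected with an error instead, the most likely explanation is that the CNAME record has not propagated yet.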

If you don't want to mess about with these steps, you are free to go and make a cup of tea, and everything should sort itself out on its own!

I had gone round and round so many times trying to make it work that I was desperate to have some additional insight into whether it was working or not, but now that I'm confident in the process I would probably just wait five minutes if I did this again, and jump straight to the final step..

At this point, you should be able to hit your DigitalOcean app in the browser! Hurrah!

If it fails, then it's worth checking that the app is still running and working when you access it via the DigitalOcean-generated address
If the app works at the DigitalOcean-generated address but still doesn't work on your custom subdomain, hopefully running again through those three steps above will help you identify where the blocker is, or maybe you'll find clues in the app logs in DigitalOcean

Bonus material: Enabling CORS access for the app (in DigitalOcean)

Depending upon your needs, you may be done by this point.

After I'd finished whooping triumphantly, however, I realised that I wasn't done..

My app exposes a html form that will perform a semantic search across my blog content (it's essentially my blog's Semantic Search Demo project, except that - depending upon when you read this post and when I update that code - it uses a smaller embedding model and it adds a call to a Cohere Reranker to better remove poor matches from the result set). That html form works fine in isolation.

However, the app also supports application/json requests, because I wanted to improve my blog's search by incorporating semantic search results into my existing lexical search. This meant that I would be calling the app from JS on my blog. And that would be a problem, because by default the browser's same-origin policy stops JS code executing within the context of https://www.productiverage.com from reading responses from https://search.productiverage.com. This restriction exists for security purposes - essentially, to ensure that potentially-malicious JS on one site can't freely exchange data with another origin (even if the sites are on subdomains of the same domain) - and the "Cross-Origin Resource Sharing" (CORS) mechanism is how a server opts in to allowing such requests.

To make a request through JS within the context of one domain (eg. "www.productiverage.com") to another (eg. "search.productiverage.com"), the second domain must be explicitly configured to allow access from the first. This configuration is done against the DigitalOcean app -

In the DigitalOcean control panel, navigate back to the "Settings" tab for your app
The first line (under the tab navigation and above the title "App Settings") should display "App" on the left and "Components" on the right - you need to click into the component (I only have a single component in my case)

Click "Edit" in the "HTTP Request Routes" section and click "Configure CORS" by the route that you will need to request from another domain (again, I only have a single route, which is for the root of my application)
I want to provide access to my app only from my blog, so I set a value for the Access-Control-Allow-Origins header, that has a "Match Type" of "Exact" and an "Origin" of "https://www.productiverage.com"
Click "Apply CORS" - and you should be done!

Now, you should be able to access your DigitalOcean app on the custom subdomain from another domain through JS code, without the browser giving you an error about CORS restrictions denying your attempt!
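As an illustration of the kind of cross-origin call that this unlocks, here is a minimal JavaScript sketch. The ?q= query string and the application/json support are both described in this post, but the function names are hypothetical, it is an assumption that sending an Accept header is how the JSON variant is selected, and the shape of the JSON that comes back is not covered here (so the result is returned as-is rather than interpreted).

```javascript
// Assumption: the app is served from the root of the custom subdomain.
const SEARCH_ORIGIN = "https://search.productiverage.com";

// Builds the request URL, escaping the search term for use in a query string.
function buildSearchUrl(term) {
  return `${SEARCH_ORIGIN}/?q=${encodeURIComponent(term)}`;
}

// Requests JSON from the app. When called from a page on an origin that the
// app's CORS configuration does not allow, the browser rejects the fetch
// promise (typically with a TypeError) instead of exposing the response.
async function semanticSearch(term) {
  const response = await fetch(buildSearchUrl(term), {
    headers: { Accept: "application/json" },
  });
  if (!response.ok) {
    throw new Error(`Search request failed with status ${response.status}`);
  }
  return response.json();
}
```

With the CORS rule applied, running something like semanticSearch("perspective correction").then(console.log) from the dev tools console on https://www.productiverage.com should log a result, while the same call from a page on a different origin should be blocked by the browser.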

To see an example of this in action, you can go to www.productiverage.com, open the dev tools in your browser, go to the "Network" tab and filter requests to "Fetch/XHR", then type something into the "Site Search" text box on the site and click "Search". You should see requests for content named SearchIndex-{something}.lz.txt (which is used for lexical searching) and a single request that looks like ?q={what you searched for} which, if you view its Headers, you should see is sent to search.productiverage.com. Woo, success!!

(Approximately) correcting perspective with C# (fixing a blurry presentation video - part two)

Tue, 29 Mar 2022 19:02:00 GMT

TL;DR

I have a video of a presentation where the camera keeps losing focus such that the slides are unreadable. I have the original slide deck and I want to fix this.

Step one was identifying the area in each frame that seemed likely to be where the slides were being projected; step two is to correct the perspective of the projection back into a rectangle, to make it easier to perform comparisons against the original slide deck images and try to determine which slide was being projected.

(An experimental TL;DR approach: See this small scale .NET Fiddle demonstration of what I'll be discussing)

The basic approach

An overview of the processing to do this looks as follows:

Load the original slide image into a Bitmap
Using the projected-slide-area region calculated in step one..
1. Take the line from the top left of the region to the top right
2. Take the line from the bottom left of the region to the bottom right (note that this line may be a little longer or shorter than the first line)
3. Create vertical slices of the image by stepping through the first line (the one across the top), connecting each pixel to a pixel on the bottom line
These vertical slices will not all be the same height and so they'll need to be adjusted to a consistent size (the further from the camera that a vertical slice of the projection is, the smaller it will be)
The height-adjusted vertical slices are then combined into a single rectangle, which will result in an approximation of a perspective-corrected version of the projection of the slide

Note: The reason that this process is only going to be an approximation is due to the way that the height of the output image will be determined -

For my purposes, it will be fine to use the largest of the top-left-to-bottom-left length (ie. the left-hand edge of the projection) and the top-right-to-bottom-right length (the right-hand edge of the projection) but this will always result in an output whose aspect ratio is stretched vertically slightly because the largest of those two lengths will be "magnified" somewhat due to the perspective effect
What might seem like an obvious improvement would be to take an average of the left-hand-edge-height and the right-hand-edge-height but I decided not to do this because I would be losing some fidelity from the vertical slices that would be shrunken down to match this average and because this would still be an approximation as..
The correct way to determine the appropriate aspect ratio for the perspective-corrected image is to use some clever maths to try to determine the angle of the wall that the projection is on (look up perspective correction and vanishing points if you're really curious!) and to use that to decide what ratio of the left-hand-edge-height and the right-hand-edge-height to use
- (The reason that the take-an-average approach is still an approximation is that perspective makes the larger edge grow more quickly than the smaller edge shrinks, so this calculation would still skew towards a vertically-stretched image)

Slice & dice!

So if we follow the plan above then we'll generate a list of vertical slices a bit like this:

.. which, when combined would look like this:

This is very similar to the original projection except that:

The top edge is now across the top of the rectangular area
The bottom left corner is aligned with the left-hand side of the image
The bottom right corner is aligned with the right-hand side of the image

We're not done yet but this has brought things much closer!

In fact, all that is needed is to stretch those vertical slices so that they are all the same length and; ta-da!

Implementation for slicing and stretching

So, from previous analysis, I know that the bounding area for the projection of the slide in the frames of my video is:

topLeft: (1224, 197)
topRight: (1915, 72)

bottomLeft: (1229, 638)
bottomRight: (1915, 662)

Since I'm going to walk along the top edge and create vertical slices from that, I'm going to need the length of that edge - which is easy enough with some Pythagoras:

private static int LengthOfLine((PointF From, PointF To) line)
{
    var deltaX = line.To.X - line.From.X;
    var deltaY = line.To.Y - line.From.Y;
    return (int)Math.Round(Math.Sqrt((deltaX * deltaX) + (deltaY * deltaY)));
}

So although it's only 691px horizontally from the top left to the top right (1915 - 1224), the actual length of that edge is 702px (because it's a line that angles up slightly rather than being a flat horizontal one).

This edge length determines how many vertical slices we'll take, and we'll get them by looping across this top edge, working out where the corresponding point on the bottom edge should be and joining them together into a line; one vertical slice. Each time that the loop increments, the current point on the top edge is going to move slightly to the right and even more slightly upwards, while each corresponding point on the bottom edge will also move slightly to the right but it will move slightly down as the projection on the wall gets closer and closer to the camera.

One way to get all of these vertical slice lines is a method such as the following (the GetPointAlongLine helper that it calls appears in the complete listing further down):

private sealed record ProjectionDetails(
    Size ProjectionSize,
    IEnumerable<((PointF From, PointF To) Line, int Index)> VerticalSlices
);

private static ProjectionDetails GetProjectionDetails(
    Point topLeft,
    Point topRight,
    Point bottomRight,
    Point bottomLeft)
{
    var topEdge = (From: topLeft, To: topRight);
    var bottomEdge = (From: bottomLeft, To: bottomRight);
    var lengthOfEdgeToStartFrom = LengthOfLine(topEdge);
    var dimensions = new Size(
        width: lengthOfEdgeToStartFrom,
        height: Math.Max(
            LengthOfLine((topLeft, bottomLeft)),
            LengthOfLine((topRight, bottomRight))
        )
    );
    return new ProjectionDetails(dimensions, GetVerticalSlices());

    IEnumerable<((PointF From, PointF To) Line, int Index)> GetVerticalSlices() =>
        Enumerable
            .Range(0, lengthOfEdgeToStartFrom)
            .Select(i =>
            {
                var fractionOfProgressAlongPrimaryEdge = (float)i / lengthOfEdgeToStartFrom;
                return (
                    Line: (
                        GetPointAlongLine(topEdge, fractionOfProgressAlongPrimaryEdge),
                        GetPointAlongLine(bottomEdge, fractionOfProgressAlongPrimaryEdge)
                    ),
                    Index: i
                );
            });
}

This returns the dimensions of the final perspective-corrected projection (which is as wide as the top edge is long and which is as high as the greater of the left-hand edge's length and the right-hand edge's length) as well as an IEnumerable of the start and end points for each slice that we'll be taking.

The dimensions are going to allow us to create a bitmap that we'll paste the slices into when we're ready - but, before that, we need to determine pixel values for every point on every vertical slice. As the horizontal distance across the top edge is 691px and the vertical distance is 125px but its actual length is 702px, each time we move one along in that 702px loop the starting point for the vertical slice will move (691 / 702) = 0.98px across and (125 / 702) = 0.18px up. So almost all of these vertical slices are going to have start and end points that are not whole pixel values - and the same applies to each point on that vertical slice. This means that we're going to have to take average colour values when we're dealing with fractional pixel locations.
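Those figures can be double-checked with a few lines of arithmetic. The sketch below is in JavaScript rather than C# purely so that it can be pasted into a browser console; it mirrors LengthOfLine and GetPointAlongLine using the corner points listed earlier, and the lower-case function names and { x, y } objects are just for this illustration.

```javascript
// Corner points of the projected slide, as listed earlier in the post.
const topLeft = { x: 1224, y: 197 };
const topRight = { x: 1915, y: 72 };
const bottomLeft = { x: 1229, y: 638 };
const bottomRight = { x: 1915, y: 662 };

// Pythagoras, rounded to a whole number of pixels (mirrors LengthOfLine).
function lengthOfLine(from, to) {
  return Math.round(Math.hypot(to.x - from.x, to.y - from.y));
}

// Point at a given fraction (0 to 1) of the way along a line
// (mirrors GetPointAlongLine).
function getPointAlongLine(from, to, fraction) {
  return {
    x: from.x + (to.x - from.x) * fraction,
    y: from.y + (to.y - from.y) * fraction,
  };
}

const width = lengthOfLine(topLeft, topRight); // 702
const height = Math.max(
  lengthOfLine(topLeft, bottomLeft), // 441
  lengthOfLine(topRight, bottomRight) // 590
);

// How far the start of a slice moves for each step along the top edge.
const first = getPointAlongLine(topLeft, topRight, 0 / width);
const second = getPointAlongLine(topLeft, topRight, 1 / width);
const stepAcross = second.x - first.x; // roughly 0.98
const stepUp = first.y - second.y; // roughly 0.18

console.log({ width, height, stepAcross, stepUp });
```

For these corner points, the output bitmap works out at 702px wide by 590px high, with the right-hand edge (the one closest to the camera) determining the height.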

For example, if we're at the point (1309.5, 381.5) and the colours at (1309, 381), (1310, 381), (1309, 382), (1310, 382) are all white then the averaging is really easy - the "averaged" colour is white! If we're at the point (1446.5, 431.5) and the colours at (1446, 431), (1447, 431), (1446, 432), (1447, 432) are #BCA6A9, #B1989C, #BCA6A9, #B1989C then it's also not too complicated - because (1446.5, 431.5) is at the precise midpoint between all four points then we can take a really simple average by adding all four R values together, all four G values together, all four B values together and dividing them by 4 to get a combined result. It gets a little more complicated where it's not 0.5 of a pixel and it's slightly more to the left or to the right and/or to the top or to the bottom - eg. (1446.1, 431.9) would get more of its averaged colour from the pixels on the left than on the right (as 1446.1 is only just past 1446) while it would get more of its averaged colour from the pixels on the bottom than the top (as 431.9 is practically at 432). On the other hand, on the rare occasion where it is a precise location (with no fractional pixel values), such as (1826, 258), then it's the absolute simplest case because no averaging is required!

private static Color GetAverageColour(Bitmap image, PointF point)
{
    var (integralX, fractionalX) = GetIntegralAndFractional(point.X);
    var x0 = integralX;
    var x1 = Math.Min(integralX + 1, image.Width - 1); // Clamp to the last valid column

    var (integralY, fractionalY) = GetIntegralAndFractional(point.Y);
    var y0 = integralY;
    var y1 = Math.Min(integralY + 1, image.Height - 1); // Clamp to the last valid row

    var (topColour0, topColour1) = GetColours(new Point(x0, y0), new Point(x1, y0));
    var (bottomColour0, bottomColour1) = GetColours(new Point(x0, y1), new Point(x1, y1));

    return CombineColours(
        CombineColours(topColour0, topColour1, fractionalX),
        CombineColours(bottomColour0, bottomColour1, fractionalX),
        fractionalY
    );

    (Color c0, Color c1) GetColours(Point p0, Point p1)
    {
        var c0 = image.GetPixel(p0.X, p0.Y);
        var c1 = (p0 == p1) ? c0 : image.GetPixel(p1.X, p1.Y);
        return (c0, c1);
    }

    static (int Integral, float Fractional) GetIntegralAndFractional(float value)
    {
        var integral = (int)Math.Truncate(value);
        var fractional = value - integral;
        return (integral, fractional);
    }

    static Color CombineColours(Color x, Color y, float proportionOfSecondColour)
    {
        if ((proportionOfSecondColour == 0) || (x == y))
            return x;

        if (proportionOfSecondColour == 1)
            return y;

        return Color.FromArgb(
            red: CombineComponent(x.R, y.R),
            green: CombineComponent(x.G, y.G),
            blue: CombineComponent(x.B, y.B),
            alpha: CombineComponent(x.A, y.A)
        );

        int CombineComponent(int x, int y) =>
            Math.Min(
                (int)Math.Round((x * (1 - proportionOfSecondColour)) + (y * proportionOfSecondColour)),
                255
            );
    }
}
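To make the weighting concrete, the same kind of blend can be worked through for the example colours mentioned above. This is a simplified JavaScript sketch rather than a port of the C# method: it interpolates each channel in floating point and rounds once at the end, so a result that lands exactly on .5 may differ by one from the C# code (which rounds after each combination, and whose Math.Round rounds midpoints to the nearest even number by default). The bilinearBlend name and the [r, g, b] array representation are just for this illustration.

```javascript
// Linear interpolation between two numbers: proportion 0 gives a, 1 gives b.
function lerp(a, b, proportion) {
  return a + (b - a) * proportion;
}

// Blends four neighbouring pixels, each an [r, g, b] array, according to how
// far the sample point sits across (fractionX) and down (fractionY) within
// that 2x2 block of pixels.
function bilinearBlend(topLeft, topRight, bottomLeft, bottomRight, fractionX, fractionY) {
  return topLeft.map((_, channel) => {
    const top = lerp(topLeft[channel], topRight[channel], fractionX);
    const bottom = lerp(bottomLeft[channel], bottomRight[channel], fractionX);
    return Math.round(lerp(top, bottom, fractionY));
  });
}

// The colours from the example: #BCA6A9 in the left column, #B1989C in the right.
const left = [0xbc, 0xa6, 0xa9];
const right = [0xb1, 0x98, 0x9c];

// (1446.1, 431.9) is 0.1 of the way across and 0.9 of the way down its 2x2
// block, so the left-hand colour dominates the result.
const blended = bilinearBlend(left, right, left, right, 0.1, 0.9);
console.log(blended); // [187, 165, 168]
```

Because the top and bottom rows hold the same colours in this example, only the horizontal fraction affects the outcome: the result sits nine tenths of the way towards the left-hand colour.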

This gives us the capability to split the wonky projection into vertical slices, to loop over each slice and to walk down each slice and get a list of pixel values for each point down that slice. The final piece of the puzzle is that we then need to resize each vertical slice so that they all match the projection height returned from the GetProjectionDetails method earlier. Handily, the .NET Bitmap drawing code has DrawImage functionality that can resize content, so we can:

Create a Bitmap whose dimensions are those returned from GetProjectionDetails
Loop over each vertical slice (which is an IEnumerable also returned from GetProjectionDetails)
Create a bitmap just for that slice - that is 1px wide and only as tall as the current vertical slice is long
Use DrawImage to paste that slice's bitmap onto the full-size projection Bitmap

In code:

private static void RenderSlice(
    Bitmap projectionBitmap,
    IEnumerable<Color> pixelsOnLine,
    int index)
{
    var pixelsOnLineArray = pixelsOnLine.ToArray();

    using var slice = new Bitmap(1, pixelsOnLineArray.Length);
    for (var j = 0; j < pixelsOnLineArray.Length; j++)
        slice.SetPixel(0, j, pixelsOnLineArray[j]);

    using var g = Graphics.FromImage(projectionBitmap);
    g.DrawImage(
        slice,
        srcRect: new Rectangle(0, 0, slice.Width, slice.Height),
        destRect: new Rectangle(index, 0, 1, projectionBitmap.Height),
        srcUnit: GraphicsUnit.Pixel
    );
}

Pulling it all together

If we combine all of this logic together then we end up with a fairly straightforward static class that does all the work - takes a Bitmap that is a frame from a video where there is a section that should be extracted and then "perspective-corrected", takes the four points that describe that region and then returns a new Bitmap that is the extracted content in a lovely rectangle!

/// <summary>
/// This uses a simple algorithm to try to undo the distortion of a rectangle in an image
/// due to perspective - it takes the content of the rectangle and stretches it into a
/// rectangle. This is only a simple approximation and does not guarantee accuracy (in
/// fact, it will result in an image that is slightly vertically stretched such that its
/// aspect ratio will not match the original content and a more thorough approach would
/// be necessary if this is too great an approximation)
/// </summary>
internal static class SimplePerspectiveCorrection
{
    public static Bitmap ExtractAndPerspectiveCorrect(
        Bitmap image,
        Point topLeft,
        Point topRight,
        Point bottomRight,
        Point bottomLeft)
    {
        var (projectionSize, verticalSlices) =
            GetProjectionDetails(topLeft, topRight, bottomRight, bottomLeft);

        var projection = new Bitmap(projectionSize.Width, projectionSize.Height);
        foreach (var (lineToTrace, index) in verticalSlices)
        {
            var lengthOfLineToTrace = LengthOfLine(lineToTrace);

            var pixelsOnLine = Enumerable
                .Range(0, lengthOfLineToTrace)
                .Select(j =>
                {
                    var fractionOfProgressAlongLineToTrace = (float)j / lengthOfLineToTrace;
                    var point = GetPointAlongLine(lineToTrace, fractionOfProgressAlongLineToTrace);
                    return GetAverageColour(image, point);
                });

            RenderSlice(projection, pixelsOnLine, index);
        }
        return projection;

        static Color GetAverageColour(Bitmap image, PointF point)
        {
            var (integralX, fractionalX) = GetIntegralAndFractional(point.X);
            var x0 = integralX;
            var x1 = Math.Min(integralX + 1, image.Width - 1); // Clamp to the last valid column

            var (integralY, fractionalY) = GetIntegralAndFractional(point.Y);
            var y0 = integralY;
            var y1 = Math.Min(integralY + 1, image.Height - 1); // Clamp to the last valid row

            var (topColour0, topColour1) = GetColours(new Point(x0, y0), new Point(x1, y0));
            var (bottomColour0, bottomColour1) = GetColours(new Point(x0, y1), new Point(x1, y1));

            return CombineColours(
                CombineColours(topColour0, topColour1, fractionalX),
                CombineColours(bottomColour0, bottomColour1, fractionalX),
                fractionalY
            );

            (Color c0, Color c1) GetColours(Point p0, Point p1)
            {
                var c0 = image.GetPixel(p0.X, p0.Y);
                var c1 = (p0 == p1) ? c0 : image.GetPixel(p1.X, p1.Y);
                return (c0, c1);
            }

            static (int Integral, float Fractional) GetIntegralAndFractional(float value)
            {
                var integral = (int)Math.Truncate(value);
                var fractional = value - integral;
                return (integral, fractional);
            }

            static Color CombineColours(Color x, Color y, float proportionOfSecondColour)
            {
                if ((proportionOfSecondColour == 0) || (x == y))
                    return x;

                if (proportionOfSecondColour == 1)
                    return y;

                return Color.FromArgb(
                    red: CombineComponent(x.R, y.R),
                    green: CombineComponent(x.G, y.G),
                    blue: CombineComponent(x.B, y.B),
                    alpha: CombineComponent(x.A, y.A)
                );

                int CombineComponent(int x, int y) =>
                    Math.Min(
                        (int)Math.Round(
                            (x * (1 - proportionOfSecondColour)) +
                            (y * proportionOfSecondColour)
                        ),
                        255
                    );
            }
        }
    }

    private sealed record ProjectionDetails(
        Size ProjectionSize,
        IEnumerable<((PointF From, PointF To) Line, int Index)> VerticalSlices
    );

    private static ProjectionDetails GetProjectionDetails(
        Point topLeft,
        Point topRight,
        Point bottomRight,
        Point bottomLeft)
    {
        var topEdge = (From: topLeft, To: topRight);
        var bottomEdge = (From: bottomLeft, To: bottomRight);
        var lengthOfEdgeToStartFrom = LengthOfLine(topEdge);
        var dimensions = new Size(
            width: lengthOfEdgeToStartFrom,
            height: Math.Max(
                LengthOfLine((topLeft, bottomLeft)),
                LengthOfLine((topRight, bottomRight))
            )
        );
        return new ProjectionDetails(dimensions, GetVerticalSlices());

        IEnumerable<((PointF From, PointF To) Line, int Index)> GetVerticalSlices() =>
            Enumerable
                .Range(0, lengthOfEdgeToStartFrom)
                .Select(i =>
                {
                    var fractionOfProgressAlongPrimaryEdge = (float)i / lengthOfEdgeToStartFrom;
                    return (
                        Line: (
                            GetPointAlongLine(topEdge, fractionOfProgressAlongPrimaryEdge),
                            GetPointAlongLine(bottomEdge, fractionOfProgressAlongPrimaryEdge)
                        ),
                        Index: i
                    );
                });
    }

    private static PointF GetPointAlongLine((PointF From, PointF To) line, float fraction)
    {
        var deltaX = line.To.X - line.From.X;
        var deltaY = line.To.Y - line.From.Y;
        return new PointF(
            (deltaX * fraction) + line.From.X,
            (deltaY * fraction) + line.From.Y
        );
    }

    private static int LengthOfLine((PointF From, PointF To) line)
    {
        var deltaX = line.To.X - line.From.X;
        var deltaY = line.To.Y - line.From.Y;
        return (int)Math.Round(Math.Sqrt((deltaX * deltaX) + (deltaY * deltaY)));
    }

    private static void RenderSlice(
        Bitmap projectionBitmap,
        IEnumerable<Color> pixelsOnLine,
        int index)
    {
        var pixelsOnLineArray = pixelsOnLine.ToArray();

        using var slice = new Bitmap(1, pixelsOnLineArray.Length);
        for (var j = 0; j < pixelsOnLineArray.Length; j++)
            slice.SetPixel(0, j, pixelsOnLineArray[j]);

        using var g = Graphics.FromImage(projectionBitmap);
        g.DrawImage(
            slice,
            srcRect: new Rectangle(0, 0, slice.Width, slice.Height),
            destRect: new Rectangle(index, 0, 1, projectionBitmap.Height),
            srcUnit: GraphicsUnit.Pixel
        );
    }
}

Coming next

So step one was to take frames from a video and to work out what the bounds were of the area where slides were being projected (and to filter out any intro and outro frames), step two has been to be able to take the bounded area from any slide and project it back into a rectangle to make it easier to match against the original slide images.. step three will be to use these projections to try to guess what slide is being displayed on what frame!

The frame that I've been using as an example throughout this post probably looks like a fairly easy case - big blocks of white or black and not actually too out of focus.. but some of the frames look like this and that's a whole other kettle of fish!

Finding the brightest area in an image with C# (fixing a blurry presentation video - part one)

Tue, 15 Mar 2022 21:06:00 GMT

TL;DR

I have a video of a presentation where the camera keeps losing focus such that the slides are unreadable. I have the original slide deck and I want to fix this.

The first step is analysing the individual frames of the video to find a common "most illuminated area" so that I can work out where the slide content was being projected, and that is what is described in this post.

(An experimental TL;DR approach: See this small scale .NET Fiddle demonstration of what I'll be discussing)

The basic approach

An overview of the processing to do this looks as follows:

Load the image into a Bitmap
Convert the image to greyscale
Identify the lightest and darkest values in the greyscale range
Calculate a 2/3 threshold from that range and create a mask of the image where anything below that value is zero and anything equal to or greater is one
- eg. If the darkest value was 10 and the lightest was 220 then the difference is 220 - 10 = 210 and the cutoff point would be 2/3 of this range on top of the minimum, so the threshold value would equal ((2/3) * range) + minimum = ((2/3) * 210) + 10 = 140 + 10 = 150
Find the largest bounded area within this mask (if there is one) and presume that that's the projection of the slide in the darkened room!

Before looking at code to do that, I'm going to toss in a few other complications that arise from having to process a lot of frames from throughout the video, rather than just one..

Firstly, the camera loses focus at different points in the video and to different extents and so some frames are blurrier than others. Following the steps above, the blurrier frames are likely to report a larger projection area for the slides. I would really like to identify a common projection area that is reasonable to use across all frames because this will make later processing (where I try to work out what slide is currently being shown in the frame) easier.

Secondly, this video has intro and outro animations and it would be nice if I was able to write code that worked out when they stopped and started.

The implementation for a single image

To do this work, I'm going to introduce a variation of my old friend the DataRectangle (from "How are barcodes read?" and "Face or no face") -

public static class DataRectangle
{
    public static DataRectangle<T> For<T>(T[,] values) => new DataRectangle<T>(values);
}

public sealed class DataRectangle<T>
{
    private readonly T[,] _protectedValues;
    public DataRectangle(T[,] values) : this(values, isolationCopyMayBeBypassed: false) { }
    private DataRectangle(T[,] values, bool isolationCopyMayBeBypassed)
    {
        if ((values.GetLowerBound(0) != 0) || (values.GetLowerBound(1) != 0))
            throw new ArgumentException("Both dimensions must have lower bound zero");
        var arrayWidth = values.GetUpperBound(0) + 1;
        var arrayHeight = values.GetUpperBound(1) + 1;
        if ((arrayWidth == 0) || (arrayHeight == 0))
            throw new ArgumentException("zero element arrays are not supported");

        Width = arrayWidth;
        Height = arrayHeight;

        if (isolationCopyMayBeBypassed)
            _protectedValues = values;
        else
        {
            _protectedValues = new T[Width, Height];
            Array.Copy(values, _protectedValues, Width * Height);
        }
    }

    public int Width { get; }

    public int Height { get; }

    public T this[int x, int y]
    {
        get
        {
            if ((x < 0) || (x >= Width))
                throw new ArgumentOutOfRangeException(nameof(x));
            if ((y < 0) || (y >= Height))
                throw new ArgumentOutOfRangeException(nameof(y));
            return _protectedValues[x, y];
        }
    }

    public IEnumerable<(Point Point, T Value)> Enumerate()
    {
        for (var x = 0; x < Width; x++)
        {
            for (var y = 0; y < Height; y++)
            {
                var value = _protectedValues[x, y];
                var point = new Point(x, y);
                yield return (point, value);
            }
        }
    }

    public DataRectangle<TResult> Transform<TResult>(Func<T, TResult> transformer)
    {
        var transformed = new TResult[Width, Height];
        for (var x = 0; x < Width; x++)
        {
            for (var y = 0; y < Height; y++)
                transformed[x, y] = transformer(_protectedValues[x, y]);
        }
        return new DataRectangle<TResult>(transformed, isolationCopyMayBeBypassed: true);
    }
}

For working with DataRectangle instances that contain double values (as we will be here), I've got a couple of convenient extension methods:

public static class DataRectangleOfDoubleExtensions
{
    public static (double Min, double Max) GetMinAndMax(this DataRectangle<double> source) =>
        source
            .Enumerate()
            .Select(pointAndValue => pointAndValue.Value)
            .Aggregate(
                seed: (Min: double.MaxValue, Max: double.MinValue),
                func: (acc, value) => (Math.Min(value, acc.Min), Math.Max(value, acc.Max))
            );

    public static DataRectangle<bool> Mask(this DataRectangle<double> values, double threshold) =>
        values.Transform(value => value >= threshold);
}

And for working with Bitmap instances, I've got some extension methods for those as well:

public static class BitmapExtensions
{
    public static Bitmap CopyAndResize(this Bitmap image, int resizeLargestSideTo)
    {
        var (width, height) = (image.Width > image.Height)
            ? (resizeLargestSideTo, (int)((double)image.Height / image.Width * resizeLargestSideTo))
            : ((int)((double)image.Width / image.Height * resizeLargestSideTo), resizeLargestSideTo);

        return new Bitmap(image, width, height);
    }

    /// <summary>
    /// This will return values in the range 0-255 (inclusive)
    /// </summary>
    // Based on http://stackoverflow.com/a/4748383/3813189
    public static DataRectangle<double> GetGreyscale(this Bitmap image) =>
        image
            .GetAllPixels()
            .Transform(c => (0.2989 * c.R) + (0.5870 * c.G) + (0.1140 * c.B));

    public static DataRectangle<Color> GetAllPixels(this Bitmap image)
    {
        var values = new Color[image.Width, image.Height];
        var data = image.LockBits(
            new Rectangle(0, 0, image.Width, image.Height),
            ImageLockMode.ReadOnly,
            PixelFormat.Format24bppRgb
        );
        try
        {
            var pixelData = new byte[data.Stride];
            for (var lineIndex = 0; lineIndex < data.Height; lineIndex++)
            {
                Marshal.Copy(
                    source: data.Scan0 + (lineIndex * data.Stride),
                    destination: pixelData,
                    startIndex: 0,
                    length: data.Stride
                );
                for (var pixelOffset = 0; pixelOffset < data.Width; pixelOffset++)
                {
                    // Note: PixelFormat.Format24bppRgb means the data is stored in memory as BGR
                    const int PixelWidth = 3;
                    values[pixelOffset, lineIndex] = Color.FromArgb(
                        red: pixelData[pixelOffset * PixelWidth + 2],
                        green: pixelData[pixelOffset * PixelWidth + 1],
                        blue: pixelData[pixelOffset * PixelWidth]
                    );
                }
            }
        }
        finally
        {
            image.UnlockBits(data);
        }
        return DataRectangle.For(values);
    }
}

With this code, we can already perform those first steps that I've described in the find-projection-area-in-image process.

Note that I'm going to throw in an extra step of shrinking the input images if they're larger than 400px because we don't need pixel-perfect accuracy when the whole point of this process is that a lot of the frames are too blurry to read (as a plus, shrinking the images means that there's less data to process and the whole thing should finish more quickly).

using var image = new Bitmap("frame_338.jpg");
using var resizedImage = image.CopyAndResize(resizeLargestSideTo: 400);
var greyScaleImageData = resizedImage.GetGreyscale();
var (min, max) = greyScaleImageData.GetMinAndMax();
var range = max - min;
const double thresholdOfRange = 2 / 3d;
var thresholdForMasking = min + (range * thresholdOfRange);
var mask = greyScaleImageData.Mask(thresholdForMasking);

This gives us a DataRectangle of boolean values that represent the brighter points as true and the less bright points as false.
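One detail of the resize step is worth a note: as written, CopyAndResize always scales the largest side to the requested size, so an image that is already smaller than 400px would be enlarged rather than left alone. That doesn't matter for my frames (they're all larger) but, as a minimal sketch, the dimension calculation could be guarded like this (GetShrunkSize is a hypothetical helper that works on plain ints; it is not part of the code above):

```csharp
using System;

var (w1, h1) = GetShrunkSize(1920, 1080, 400);
var (w2, h2) = GetShrunkSize(320, 240, 400);
Console.WriteLine($"{w1}x{h1} and {w2}x{h2}");

// Hypothetical helper: same aspect-ratio maths as CopyAndResize but it never enlarges
static (int Width, int Height) GetShrunkSize(int width, int height, int resizeLargestSideTo)
{
    // Leave the dimensions alone if the image already fits
    if (Math.Max(width, height) <= resizeLargestSideTo)
        return (width, height);

    return (width > height)
        ? (resizeLargestSideTo, (int)((double)height / width * resizeLargestSideTo))
        : ((int)((double)width / height * resizeLargestSideTo), resizeLargestSideTo);
}
```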

In the image below, you can see the original frame on the left. In the middle is the content that would be masked out by hiding all but the brightest pixels. On the right is the "binary mask" (where we discard the original colour of the pixel and make them all either black or white) -

Now we need to identify the largest "object" within this mask - wherever bright pixels are adjacent to other bright pixels, they will be considered part of the same object and we would expect there to be several such objects within the mask that has been generated.

To do so, I'll be reusing some more code from "How are barcodes read?" -

private static IEnumerable<IEnumerable<Point>> GetDistinctObjects(DataRectangle<bool> mask)
{
    // Flood fill areas in the mask to create distinct areas
    var allPoints = mask
        .Enumerate()
        .Where(pointAndIsMasked => pointAndIsMasked.Value)
        .Select(pointAndIsMasked => pointAndIsMasked.Point).ToHashSet();
    while (allPoints.Any())
    {
        var currentPoint = allPoints.First();
        var pointsInObject = GetPointsInObject(currentPoint).ToArray();
        foreach (var point in pointsInObject)
            allPoints.Remove(point);
        yield return pointsInObject;
    }

    // Inspired by code at
    // https://simpledevcode.wordpress.com/2015/12/29/flood-fill-algorithm-using-c-net/
    IEnumerable<Point> GetPointsInObject(Point startAt)
    {
        var pixels = new Stack<Point>();
        pixels.Push(startAt);

        var valueAtOriginPoint = mask[startAt.X, startAt.Y];
        var filledPixels = new HashSet<Point>();
        while (pixels.Count > 0)
        {
            var currentPoint = pixels.Pop();
            if ((currentPoint.X < 0) || (currentPoint.X >= mask.Width)
            || (currentPoint.Y < 0) || (currentPoint.Y >= mask.Height))
                continue;

            if ((mask[currentPoint.X, currentPoint.Y] == valueAtOriginPoint)
            && !filledPixels.Contains(currentPoint))
            {
                filledPixels.Add(new Point(currentPoint.X, currentPoint.Y));
                pixels.Push(new Point(currentPoint.X - 1, currentPoint.Y));
                pixels.Push(new Point(currentPoint.X + 1, currentPoint.Y));
                pixels.Push(new Point(currentPoint.X, currentPoint.Y - 1));
                pixels.Push(new Point(currentPoint.X, currentPoint.Y + 1));
            }
        }
        return filledPixels;
    }
}

As the code mentions, this is based on an article "Flood Fill algorithm (using C#.NET)" and its output is a list of objects, where each object is a list of points within that object. So the way to determine which object is largest is to take the one that contains the most points!

var pointsInLargestHighlightedArea = GetDistinctObjects(mask)
    .OrderByDescending(points => points.Count())
    .FirstOrDefault();

(Note: If pointsInLargestHighlightedArea is null then we need to escape out of the method that we're in because the source image didn't produce a mask with any highlighted objects - this could happen if every single pixel in the image is the same colour, for example; an edge case, surely, but one that we should handle)
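To see the idea in miniature, here is a self-contained sketch that finds the connected bright regions in a tiny hand-written mask and picks the largest. It uses a plain bool[,] and (X, Y) tuples rather than the DataRectangle and Point types above, and the mask contents and the FindObjects name are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical mask (indexed as [row, column] here so that it reads like a picture)
var mask = new[,]
{
    { true,  true,  false, false, false },
    { true,  true,  false, true,  false },
    { false, false, false, true,  false },
    { false, true,  false, false, false }
};

var largest = FindObjects(mask)
    .OrderByDescending(points => points.Count)
    .FirstOrDefault();

Console.WriteLine(largest is null
    ? "No bright objects found"
    : $"Largest object has {largest.Count} point(s)");

static IEnumerable<HashSet<(int X, int Y)>> FindObjects(bool[,] grid)
{
    var height = grid.GetLength(0);
    var width = grid.GetLength(1);
    var visited = new HashSet<(int X, int Y)>();
    for (var y = 0; y < height; y++)
    {
        for (var x = 0; x < width; x++)
        {
            if (!grid[y, x] || visited.Contains((x, y)))
                continue;

            // Stack-based flood fill starting from this unvisited bright point
            var pointsInObject = new HashSet<(int X, int Y)>();
            var toVisit = new Stack<(int X, int Y)>();
            toVisit.Push((x, y));
            while (toVisit.Count > 0)
            {
                var (cx, cy) = toVisit.Pop();
                if ((cx < 0) || (cx >= width) || (cy < 0) || (cy >= height))
                    continue;
                if (!grid[cy, cx] || !pointsInObject.Add((cx, cy)))
                    continue;
                toVisit.Push((cx - 1, cy));
                toVisit.Push((cx + 1, cy));
                toVisit.Push((cx, cy - 1));
                toVisit.Push((cx, cy + 1));
            }
            visited.UnionWith(pointsInObject);
            yield return pointsInObject;
        }
    }
}
```

The mask here contains three separate objects (a 2x2 block, a vertical pair and a lone pixel), so the 2x2 block is the one that gets picked.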

From this largest object, we want to find a bounding quadrilateral, which we do by looking at every point and finding the one closest to the top left of the image (because this will be the top left of the bounding area), the point closest to the top right of the image (for the top right of the bounding area) and the same for the points closest to the bottom left and bottom right.

This can be achieved by calculating, for each point in the object, the distances from each of the corners to the points and then determining which points have the shortest distances - eg.

var distancesOfPointsFromImageCorners = pointsInLargestHighlightedArea
    .Select(p =>
    {
        // To work out distance from the top left, you would use Pythagoras to take the
        // squared horizontal distance of the point from the left of the image and add
        // that to the squared vertical distance of the point from the top of the image,
        // then you would square root that sum. In this case, we only want to be able to
        // determine which distances are smaller or larger and we don't actually
        // care about the precise distances themselves and so we can save ourselves from
        // performing that final square root calculation.
        var distanceFromRight = greyScaleImageData.Width - p.X;
        var distanceFromBottom = greyScaleImageData.Height - p.Y;
        var fromLeftScore = p.X * p.X;
        var fromTopScore = p.Y * p.Y;
        var fromRightScore = distanceFromRight * distanceFromRight;
        var fromBottomScore = distanceFromBottom * distanceFromBottom;
        return new
        {
            Point = p,
            FromTopLeft = fromLeftScore + fromTopScore,
            FromTopRight = fromRightScore + fromTopScore,
            FromBottomLeft = fromLeftScore + fromBottomScore,
            FromBottomRight = fromRightScore + fromBottomScore
        };
    })
    .ToArray(); // Call ToArray to avoid repeating this enumeration four times below
    
var topLeft = distancesOfPointsFromImageCorners.OrderBy(p => p.FromTopLeft).First().Point;
var topRight = distancesOfPointsFromImageCorners.OrderBy(p => p.FromTopRight).First().Point;
var bottomLeft = distancesOfPointsFromImageCorners.OrderBy(p => p.FromBottomLeft).First().Point;
var bottomRight = distancesOfPointsFromImageCorners.OrderBy(p => p.FromBottomRight).First().Point;
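As a worked example of the corner-picking, here is a small sketch that uses a handful of made-up points (the four corners of a slightly skewed quadrilateral plus one interior point) inside an imagined 10x10 image. It uses (X, Y) tuples instead of Point, and the SquaredDistance helper is hypothetical:

```csharp
using System;
using System.Linq;

// Hypothetical object points within a 10x10 image
var width = 10;
var height = 10;
var points = new (int X, int Y)[] { (2, 1), (8, 2), (7, 8), (1, 7), (5, 5) };

var topLeft = points.OrderBy(p => SquaredDistance(p, (0, 0))).First();
var topRight = points.OrderBy(p => SquaredDistance(p, (width, 0))).First();
var bottomRight = points.OrderBy(p => SquaredDistance(p, (width, height))).First();
var bottomLeft = points.OrderBy(p => SquaredDistance(p, (0, height))).First();

Console.WriteLine($"{topLeft} {topRight} {bottomRight} {bottomLeft}");

// Squared distances sort in the same order as real distances, so the square root is skipped
static int SquaredDistance((int X, int Y) p, (int X, int Y) corner)
{
    var dx = p.X - corner.X;
    var dy = p.Y - corner.Y;
    return (dx * dx) + (dy * dy);
}
```

The interior point (5, 5) is never the closest to any corner, which is what we want - only the extremities of the object end up in the bounding quadrilateral.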

Finally, because we want to find the bounding area of the largest object in the original image, we may need to multiply up the bounds that we just found because we shrank the image down if either dimension was larger than 400px and we were performing calculations on that smaller version.

We can tell how much we reduced the data by looking at the width of the original image and comparing it to the width of the greyScaleImageData DataRectangle that was generated from the shrunken version of the image:

var reducedImageSideBy = (double)image.Width / greyScaleImageData.Width;

Now we only need a function that will multiply the bounding area that we've got according to this value, while ensuring that none of the point values are multiplied such that they exceed the bounds of the original image:

private static (Point TopLeft, Point TopRight, Point BottomRight, Point BottomLeft) Resize(
    Point topLeft,
    Point topRight,
    Point bottomRight,
    Point bottomLeft,
    double resizeBy,
    int minX,
    int maxX,
    int minY,
    int maxY)
{
    if (resizeBy <= 0)
        throw new ArgumentOutOfRangeException(nameof(resizeBy), "must be a positive value");

    return (
        Constrain(Multiply(topLeft)),
        Constrain(Multiply(topRight)),
        Constrain(Multiply(bottomRight)),
        Constrain(Multiply(bottomLeft))
    );

    Point Multiply(Point p) =>
        new Point((int)Math.Round(p.X * resizeBy), (int)Math.Round(p.Y * resizeBy));

    Point Constrain(Point p) =>
        new Point(Math.Min(Math.Max(p.X, minX), maxX), Math.Min(Math.Max(p.Y, minY), maxY));
}

The final bounding area for the largest bright area of an image is now retrieved like this:

var bounds = Resize(
    topLeft,
    topRight,
    bottomRight,
    bottomLeft,
    reducedImageSideBy,
    minX: 0,
    maxX: image.Width - 1,
    minY: 0,
    maxY: image.Height - 1
);

For the example image that we're looking at, this area is outlined like this:

Applying the process to multiple images

Say that we put all of the above functionality into a method called GetMostHighlightedArea that took a Bitmap to process and returned a tuple of the four points that represented the bounds of the brightest area, we could then easily prepare a LINQ statement that ran that code and found the most common brightest-area-bounds across all of the source images that I have. (As I said before, the largest-bounded-area will vary from image to image in my example as the camera recording the session gained and lost focus)

var files = new DirectoryInfo("Frames").EnumerateFiles("*.jpg");
var (topLeft, topRight, bottomRight, bottomLeft) = files
    .Select(file =>
    {
        using var image = new Bitmap(file.FullName);
        return IlluminatedAreaLocator.GetMostHighlightedArea(image);
    })
    .GroupBy(area => area)
    .OrderByDescending(group => group.Count())
    .Select(group => group.Key)
    .FirstOrDefault();

Presuming that there is a folder called "Frames" in the output folder of the project*, this will read all of the images in it, look for the largest bright area on each of them individually, then return the area that appears most often across all of the images. (Note: If there are no images to read then the FirstOrDefault call at the bottom will return a default tuple-of-four-Points, which will be 4x (0,0) values)

* (Since you probably don't happen to have a bunch of images from a video of my presentation lying around, see the next section for some code that will download some in case you want to try this all out!)
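The GroupBy call works here because tuples are compared by value, so two frames that report exactly the same four corner points end up in the same group. A minimal sketch of that behaviour, using made-up bounds expressed as nested (X, Y) tuples instead of Point values:

```csharp
using System;
using System.Linq;

// Hypothetical per-frame results: (TopLeft, TopRight, BottomRight, BottomLeft)
var areas = new[]
{
    ((10, 12), (200, 15), (198, 150), (11, 148)),
    ((10, 12), (200, 15), (198, 150), (11, 148)),
    ((8, 10), (204, 13), (201, 153), (9, 151)),
    ((10, 12), (200, 15), (198, 150), (11, 148))
};

// Tuples compare by value, so identical bounds land in the same group
var mostCommon = areas
    .GroupBy(area => area)
    .OrderByDescending(group => group.Count())
    .Select(group => group.Key)
    .FirstOrDefault();

Console.WriteLine(mostCommon);
```

The real version runs exactly this grouping, only over the bounds that were found for every frame image.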

This ties in nicely with my recent post "Parallelising (LINQ) work in C#" because the processing required for each image is..

Completely independent from the processing of the other images (important for parallelising work)
Expensive enough that the overhead from splitting the work into multiple threads and then combining their results back together would be overshadowed by the work performed (which is also important for parallelising work - if individual tasks are too small and the computer spends more time scheduling the work on threads and then pulling all the results back together than it does on actually performing that work then using multiple threads can be slower than using a single one!)

All that we would have to change in order to use multiple threads to process multiple images is the addition of a single line:

var files = new DirectoryInfo("Frames").EnumerateFiles("*.jpg");
var (topLeft, topRight, bottomRight, bottomLeft) = files
    .AsParallel() // <- WOO!! This is all that we needed to add!
    .Select(file =>
    {
        using var image = new Bitmap(file.FullName);
        return IlluminatedAreaLocator.GetMostHighlightedArea(image);
    })
    .GroupBy(area => area)
    .OrderByDescending(group => group.Count())
    .Select(group => group.Key)
    .FirstOrDefault();

(Parallelisation sidebar: When we split up the work like this, if the processing for each image was solely in memory then it would be a no-brainer that using more threads would make sense - however, the processing for each image involves LOADING the image from disk and THEN processing it in memory and if you had a spinning rust hard disk then you may fear that trying to ask it to read multiple files simultaneously would be slower than asking it to read them one at a time because its poor little read heads have to physically move around the plates.. it turns out that this is not necessarily the case and that you can find more information in this article that I found interesting; "Performance Impact of Parallel Disk Access")

Testing the code on your own computer

I haven't quite finished yet but I figured that there may be some wild people out there that would like to try running this code locally themselves - maybe just to see it work or maybe even to get it working and then chop and change it for some new and exciting purpose!

To this end, I have some sample frames available from this video that I'm trying to fix - with varying levels of fuzziness present. To download them, use the following method:

private static async Task EnsureSamplesAvailable(DirectoryInfo framesfolder)
{
    // Note: The GitHub API is rate limited quite severely for non-authenticated apps, so we
    // only call it if the framesfolder doesn't exist or is empty - if there are already files
    // in there then we presume that we downloaded them on a previous run (if the API is hit too
    // often then it will return a 403 "rate limited" response)
    if (framesfolder.Exists && framesfolder.EnumerateFiles().Any())
    {
        Console.WriteLine("Sample images have already been downloaded and are ready for use");
        return;
    }

    Console.WriteLine("Downloading sample images..");
    if (!framesfolder.Exists)
        framesfolder.Create();

    string namesAndUrlsJson;
    using (var client = new WebClient())
    {
        // The API refuses requests without a User Agent, so set one before calling (see
        // https://docs.github.com/en/rest/overview/resources-in-the-rest-api#user-agent-required)
        client.Headers.Add(HttpRequestHeader.UserAgent, "ProductiveRage Blog Post Example");
        namesAndUrlsJson = await client.DownloadStringTaskAsync(new Uri(
            "https://api.github.com/repos/" +
            "ProductiveRage/NaivePerspectiveCorrection/contents/Samples/Frames"
        ));
    }

    // Deserialise the response into an array of entries that have Name and Download_Url properties
    var namesAndUrls = JsonConvert.DeserializeAnonymousType(
        namesAndUrlsJson,
        new[] { new { Name = "", Download_Url = (Uri?)null } }
    );
    if (namesAndUrls is null)
    {
        Console.WriteLine("GitHub reported zero sample images to download");
        return;
    }

    await Task.WhenAll(namesAndUrls
        .Select(async entry =>
        {
            using var client = new WebClient();
            await client.DownloadFileTaskAsync(
                entry.Download_Url,
                Path.Combine(framesfolder.FullName, entry.Name)
            );
        })
    );

    Console.WriteLine($"Downloaded {namesAndUrls.Length} sample image(s)");
}

.. and call it with the following argument, presuming you're trying to read images from the "Frames" folder as the code earlier illustrated:

await EnsureSamplesAvailable(new DirectoryInfo("Frames"));

Filtering out intro/outro slides

So I said earlier that it would also be nice if I could programmatically identify which frames were part of the intro/outro animations of the video that I'm looking at.

It feels logical that any frame that is of the actual presentation will have a fairly similarly-sized-and-located bright area (where a slide is being projected onto a wall in a darkened room) while any frame that is part of an intro/outro animation won't. So we should be able to take the most-common-largest-brightest-area and then look at every frame and see if its largest bright area is approximately the same - if it's similar enough then it's probably a frame that is part of the projection but if it's too dissimilar then it's probably not.

Rather than waste time going too far down a rabbit hole that I've found won't immediately result in success, I'm going to use a slightly altered version of that plan (I'll explain why in a moment). I'm still going to take that common largest brightest area and compare the largest bright area on each frame to it but, instead of saying "largest-bright-area-is-close-enough-to-the-most-common = presentation frame / largest-bright-area-not-close-enough = intro or outro", I'm going to find the first frame whose largest bright area is close enough and the last frame that is and declare that that range is probably where the frames for the presentation are.

The reason that I'm going to do this is that I found that there are some slides with more variance that can skew the results if the first approach was taken - if a frame in the middle of the presentation is so blurry that the range in intensity from darkest pixel to brightest pixel is squashed down too far then it can result in it identifying a largest bright area that isn't an accurate representation of the image. It's quite possible that I could still have made the first approach work by tweaking some other parameters in the image processing - such as considering changing that arbitrary "create a mask where the intensity threshold is 2/3 of the range of the brightness of all pixels" (maybe 3/4 would have worked better?), for example - but I know that this second approach works for my data and so I didn't pursue the first one too hard.

To do this, though, we are going to need to know what order the frames are supposed to appear in - it's no longer sufficient for there to simply be a list of images that are frames out of the video, we now need to know where they appeared relative to each other. This is simple enough with my data because they all have names like "frame_1052.jpg" where 1052 is the frame index from the original video.

So I'm going to change the frame-image-loading code to look like this:

// Get all filenames, parse the frame index from them and discard any that don't
// match the filename pattern that is expected (eg. "frame_1052.jpg")
var frameIndexMatcher = new Regex(@"frame_(\d+)\.jpg", RegexOptions.IgnoreCase);
var files = new DirectoryInfo("Frames")
    .EnumerateFiles()
    .Select(file =>
    {
        var frameIndexMatch = frameIndexMatcher.Match(file.Name);
        return frameIndexMatch.Success
            ? (file.FullName, FrameIndex: int.Parse(frameIndexMatch.Groups[1].Value))
            : default;
    })
    .Where(entry => entry != default);

// Get the largest bright area for each file
var allFrameHighlightedAreas = files
    .AsParallel()
    .Select(file =>
    {
        using var image = new Bitmap(file.FullName);
        return (
            file.FrameIndex,
            HighlightedArea: IlluminatedAreaLocator.GetMostHighlightedArea(image)
        );
    })
    .ToArray();

// Get the most common largest bright area across all of the images
var (topLeft, topRight, bottomRight, bottomLeft) = allFrameHighlightedAreas
    .GroupBy(entry => entry.HighlightedArea)
    .OrderByDescending(group => group.Count())
    .Select(group => group.Key)
    .FirstOrDefault();

(Note that I'm calling ToArray() when declaring allFrameHighlightedAreas - that's to store the results now because I know that I'm going to need every result in the list that is generated and because I'm going to enumerate it twice in the work outlined here, so there's no point leaving allFrameHighlightedAreas to be a lazily-evaluated IEnumerable that would be recalculated each time it was looped over; then it would be doing all of the IlluminatedAreaLocator.GetMostHighlightedArea calculations for each image twice if we enumerated the list twice, which would just be wasteful!)
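If the cost of lazy evaluation isn't obvious, this small sketch counts how many times a stand-in for the expensive per-frame work gets run, with and without the ToArray call (ExpensiveCalculation and the frame indexes are made up for illustration):

```csharp
using System;
using System.Linq;

var timesCalculated = 0;

// Stand-in for the expensive per-frame work
int ExpensiveCalculation(int frameIndex)
{
    timesCalculated++;
    return frameIndex * 2;
}

var frameIndexes = new[] { 1, 2, 3 };

// Lazily evaluated: the Select lambda runs again on every enumeration
var lazy = frameIndexes.Select(i => ExpensiveCalculation(i));
_ = lazy.Min();
_ = lazy.Max();
var callsWhenLazy = timesCalculated;

// Materialised once: later enumerations reuse the stored results
timesCalculated = 0;
var stored = frameIndexes.Select(i => ExpensiveCalculation(i)).ToArray();
_ = stored.Min();
_ = stored.Max();
var callsWhenStored = timesCalculated;

Console.WriteLine($"Lazy: {callsWhenLazy} calls, stored: {callsWhenStored} calls");
```

With three items enumerated twice, the lazy version does the work six times while the stored version only does it three times.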

Now to look at the allFrameHighlightedAreas list and try to decide if each HighlightedArea value is close enough to the most common area that we found. I'm going to use a very simple algorithm for this - I'm going to:

Take all four points from the HighlightedArea on each entry in allFrameHighlightedAreas
Take all four points from the most common area (which are the topLeft, topRight, bottomRight, bottomLeft values that we already have in the code above)
Take the differences in X value between all four points in these two areas and add them up
Compare this difference to the width of the most common highlighted area - if it's too big of a proportion (say if the sum of the X differences is greater than 20% of the width of the entire area) then we'll say it's not a match and drop out of this list
If the X values aren't too bad then we'll take the differences in Y value between all four points in these two areas and add those up
That total will be compared to the height of the most common highlighted area - if it's more than the 20% threshold then we'll say that it's not a match
If we got to here then we'll say that the highlighted area in the current frame is close enough to the most common highlighted area and so the current frame probably is part of the presentation - yay!

In code:

var highlightedAreaWidth = Math.Max(topRight.X, bottomRight.X) - Math.Min(topLeft.X, bottomLeft.X);
var highlightedAreaHeight = Math.Max(bottomLeft.Y, bottomRight.Y) - Math.Min(topLeft.Y, topRight.Y);
const double thresholdForPointVarianceComparedToAreaSize = 0.2;
var frameIndexesThatHaveTheMostCommonHighlightedArea = allFrameHighlightedAreas
    .Where(entry =>
    {
        var (entryTL, entryTR, entryBR, entryBL) = entry.HighlightedArea;
        var xVariance =
            new[]
            {
                entryBL.X - bottomLeft.X,
                entryBR.X - bottomRight.X,
                entryTL.X - topLeft.X,
                entryTR.X - topRight.X
            }
            .Sum(Math.Abs);
        var yVariance =
            new[]
            {
                entryBL.Y - bottomLeft.Y,
                entryBR.Y - bottomRight.Y,
                entryTL.Y - topLeft.Y,
                entryTR.Y - topRight.Y
            }
            .Sum(Math.Abs);
        return
            (xVariance <= highlightedAreaWidth * thresholdForPointVarianceComparedToAreaSize) &&
            (yVariance <= highlightedAreaHeight * thresholdForPointVarianceComparedToAreaSize);
    })
    .Select(entry => entry.FrameIndex)
    .ToArray();

This gives us a frameIndexesThatHaveTheMostCommonHighlightedArea array of frame indexes that have a largest brightest area that is fairly close to the most common one. So to decide which frames are probably the start of the presentation and the end, we simply need to say:

var firstFrameIndex = frameIndexesThatHaveTheMostCommonHighlightedArea.Min();
var lastFrameIndex = frameIndexesThatHaveTheMostCommonHighlightedArea.Max();

Any frames whose index is less than firstFrameIndex or greater than lastFrameIndex are probably part of the intro or outro sequence.

Any frames whose index is within the firstFrameIndex / lastFrameIndex range are probably part of the presentation.

Coming soon

As the title of this post strongly suggests, this is only the first step in my desire to fix up my blurry presentation video. What I'm going to have to cover in the future is to:

Extract the content from the most-common-brightest-area in each frame of the video that is part of the presentation and contort it back into a rectangle - undoing the distortion that is introduced by perspective due to the position of the camera and where the slides were projected in the room (I'll be tackling this in a slightly approximate-but-good-enough manner because to do it super accurately requires lots of complicated maths and I've managed to forget nearly all of the maths degree that I got twenty years ago!)
Find a way to compare the perspective-corrected projections from each frame against a clean image of the original slide deck and work out which slide each frame is most similar to (this should be possible with some surprisingly rudimentary calculations inspired by some of the image preprocessing that I've mentioned in a couple of my posts that touch on machine learning but without requiring any machine learning itself)
Some tweaks that were required to get the best results with my particular images (for example, when I described the GetMostHighlightedArea function earlier, I picked 400px as an arbitrary value to resize images to before greyscaling them, masking them and looking for their largest bright area; maybe it will turn out that smaller or larger values for that process result in improved or worsened results - we'll find out!)

Once this is all done, I will take the original frame images and, for each one, overlay a clean version of the slide that appeared blurrily in the frame (again, I'll have clean versions of each slide from the original slide deck that I produced, so that should be an easy part) - then I'll mash them all back together into a new video, combined with the original audio. To do this (the video work), I'll likely use the same tool that I used to extract the individual frame files from the video in the first place - the famous FFmpeg!

I doubt that I'll have a post on this last section as it would only be a small amount of C# code that combines two images for each frame, writes the results to disk, followed by me making a command line call to FFmpeg to produce the video - and I don't think that there's anything particularly exciting there! If I get this all completed, though, I will - of course - link to the fixed-up presentation video.. because why not shamelessly plug myself given any opportunity!

So.. what is machine learning? (#NoCodeIntro)

Mon, 28 Feb 2022 23:44:00 GMT

TL;DR

Strap in, this is a long one. If the title of the post isn't enough of a summary for you but you think that this "TL;DR" is too long then this probably isn't the article for you!

A previous job I had was, in a nutshell, working on improving searching for files and documents by incorporating machine learning algorithms - eg. if I've found a PowerPoint presentation that I produced five years ago on my computer and I want to find the document that I made with loads of notes and research relating to it, how can I find it if I've forgotten the filename or where I stored it? This product could pull in data from many data sources (such as Google Docs, as well as files on my computer) and it could, amongst other things, use clever similarity algorithms to suggest which documents may be related to that presentation. This is just one example but even this is a bit of a mouthful! So when people outside of the industry asked me what I did, it was often hard to answer them in a way that satisfied us both.

This came to a head recently when I tried to explain to someone semi-technical what the difference actually was between machine learning and.. er, not machine learning. The "classic approach", I suppose you might call it. I tried hard but I made a real meal of the explanation and doing lots of hand waving did not make up for a lack of whiteboard, drawing apparatus or a generally clear manner to explain. So, for my own peace of mind (and so that I can share this with them!), I want to try to describe the difference at a high level and then talk about how machine learning can work (in this case, I'll mostly be talking about "supervised classification" - I'll explain what that means shortly and list some of the other types) in a way that is hopefully understandable without requiring any coding or mathematical knowledge.

The short version

The "classic approach" involves the programmer writing very specific code for every single step in a given process
"Supervised classification" involves the programmer writing some quite general (ie. not specific to the precise task) code and then giving it lots of information along with a brief summary of each piece of information (also known as a label) so that it can create a "trained model" that can guess how to label new information that it hasn't seen before

What this means in practice

An example of the first ("classic") approach might be to calculate the total for a list of purchases:

The code will look through each item and lookup in a database what rate of tax should be applied to it (for example, books are exempt from VAT in the UK)
If there is a tax to apply then the tax for the item will be calculated and this will be added to the initial item's cost
All of these costs will be added up to produce a total

This sort of code is easy to understand and if there are any problems encountered in the process then it's easy to diagnose them. For example, if an item appeared on the list that wasn't in the database - and so it wasn't possible to determine whether it should be taxed or not - then the program could easily stop with an "item not found in database" error. There are a lot of advantages to code being simple to comprehend and having it easy to understand how and why bad things have happened.
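To make the "classic approach" concrete, here is a minimal sketch of those steps in C#. The taxRates dictionary is a hypothetical stand-in for the database, and the item names, prices and rates are all made up purely for illustration:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the database of tax rates (the items and rates are invented for this sketch)
var taxRates = new Dictionary<string, decimal>
{
    ["Book"] = 0.00m,   // books are exempt from VAT in the UK
    ["Kettle"] = 0.20m  // an assumed 20% rate for everything else in this example
};

decimal CalculateTotal(IEnumerable<(string Name, decimal Price)> purchases)
{
    var runningTotal = 0m;
    foreach (var (name, price) in purchases)
    {
        // Every step is spelt out explicitly, including what should happen when something goes wrong
        if (!taxRates.TryGetValue(name, out var taxRate))
            throw new KeyNotFoundException($"Item not found in database: {name}");

        runningTotal += price + (price * taxRate);
    }
    return runningTotal;
}

var total = CalculateTotal(new[] { ("Book", 10.00m), ("Kettle", 25.00m) });
Console.WriteLine(total);
```

Nothing here is learnt from data - the programmer has decided precisely what happens at every step, which is why it's so easy to reason about when something goes wrong.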

(Anyone involved in coding knows that dealing with "the happy path" of everything going to plan is only a small part of the job and it's often when things go wrong that life gets hard - and the easier it is to understand precisely what happened when something does go wrong, the better!)

An example of the second ("machine learning") approach might be to determine whether a given photo is of a cat or a dog:

There will be non-specific (or "generic") code that is written that can take a list of "labelled" items (eg. this is a picture of a cat, this is a picture of a dog) and use it to predict an unseen and unlabelled item (eg. here is a picture - is it a cat or a dog?) - this code is considered to be generic because nothing in the way it is written relates to cats or dogs, all it is intended to do is be able to receive lots of labelled data and produce a trained model that can make predictions on future items
The code will be given a lot of labelled data (maybe there are 10,000 pictures of cats and 10,000 pictures of dogs) and it will perform some sort of clever mathematics that allows it to build a model trained to differentiate between cats and dogs - generally, the more labelled data that is provided, the better the final trained model will be at making predictions.. but the more data that there is, the longer that it will take to train
When it is finished "training" (ie. producing this "trained model"), it will then be able to be given a picture of a cat or a dog and say how likely it is that it thinks it is a cat vs a dog

This sounds like quite a silly example but there are many applications of this sort of approach that are really useful - for example, the same non-specific/generic code could be given inputs that are scans of hospital patients where it is suspected that there is a cancerous growth in the image. It would be trained by being given 1,000s of images that doctors have already said "this looks like a malignant growth" or "this looks like nothing to worry about" and the trained model that would be produced from that information would then be able to take images of patient scans that it's never seen before and predict whether it shows something to worry about.

(This sort of thing would almost certainly never replace doctors but it could be used to streamline some medical processes - maybe the trained model is good enough that if it predicts with more than 90% certainty that the scan is clear then a doctor wouldn't need to look at it but if there was even a 10% chance that it could be a dangerous growth then a doctor should look at it with higher priority)

Other examples could be taken from self-driving cars; from the images coming from the cameras on the car, does it look like any of them indicate pedestrians nearby? Does it look like there are speed limit signs that affect how quickly the car may travel?

The results of the trained model need not be binary (only two options), either - ie. "is this a picture of a cat or is it a picture of a dog?". It could be trained to predict a wide range of different animals, if we're continuing on the animal-recognition example. In fact, an application that I'm going to look at in more depth later is using machine learning to recognise hand-written digits (ie. numbers 0 through 9) because, while this is a very common introductory task into the world of machine learning, it's a sufficiently complicated task that it would be difficult to imagine how you might solve it using the "classic" approach to coding.

Back to the definition of the type of machine learning that I want to concentrate on.. the reason it's referred to as "supervised classification" is two-fold:

The trained model that it produces has the sole task of taking inputs (such as pictures in the examples above, although there are other forms of inputs that I'll mention soon) and predicting a "classification" for them. Generally, it will offer a "confidence score" for each of the classifications that it's aware of - to continue the cat/dog example, if the trained model was given a picture of a cat then it would hopefully give a high prediction score that it was a cat (generally presented as a percentage) and the less confident it was that it was a cat, the more confident it would be that the picture was of a dog.
The model is trained by the "labelled data" - it can't guess which of the initial pictures are cats and which are dogs if it's just given a load of unlabelled pictures and no other information to work from. The fact that this data is labelled means that someone has had to go through the process of manually applying these labels. This is the "supervised" aspect.

There are machine learning algorithms (where an "algorithm" is just a set of steps and calculations performed to produce some result) that are described as "unsupervised classification" but the most common example of this would be to train a model on a load of inputs and ask it to split them into groups based upon which it thinks seem most similar. It won't be able to give a name to each group because all it has access to is the raw data of each item and no "label" for what each one represents.

This sort of approach is a little similar to how the "find related documents" technology that I described at the top of this post works - the algorithm looks for "features"* that it thinks make it quite likely that two documents contain the same sort of content and uses this to produce a confidence score that they may be related. I'll talk about other types of machine learning briefly near the end of this post but, in an effort to give this any semblance of focus, I'm going to stick with talking about "supervised classification" for the large part.

* ("Features" has a specific meaning in terms of machine learning algorithms but I won't go into detail on it right now, though I will later - for now, in the case of similar documents, you can imagine "features" as being uncommon words or phrases that are more likely to crop up in documents that are similar in some manner than in documents that are talking about entirely different subject matters)

Supervised classification with neural networks

Right, now we're sounding all fancy and technical! A "neural network" is a model commonly used for supervised classification and I'm going to go through the steps of explaining how it is constructed and how it works. But first I'm going to try to explain what one is.

The concept of a neural net was inspired by the human brain and how it has neurons that connect to each other with varying strengths. The strengths of the connections are developed based upon patterns that we've come to recognise. The human brain is amazing at recognising patterns and that's why two Chicago researchers in 1944 were inspired to wonder if a similar structure could be used for some form of automated pattern recognition. I'm being intentionally vague here because the details aren't too important and the way that connections are made in the human brain is much more complicated than those in the neural networks that I'll be talking about here, so I only really mention it for a little historical context and to explain some of the names of things.

A neural net has a set of "input neurons" and a set of "output neurons", where the input neurons are connected to the output neurons.

Each time that the network is given a single "input" (such as an image of a cat), that input needs to be broken down into values to feed to the input neurons; these input neurons accept numeric values between 0 and 1 (inclusive of those two values) and it may not immediately be apparent how a picture can be somehow represented by a list of 0-to-1 values but I'll get to that later.

There are broadly two types of classifier and this determines how many output neurons there will be - there are "binary classifiers" (is the answer yes or no; eg. "does this look like a malignant growth or not?") and there are "multi-class classifiers" (such as a classifier that tries to guess what kind of fruit an image is of; a banana, an apple, a mango, an orange, etc..). A binary classifier will have one output neuron whose output is a confidence score for the yes/no classification (eg. it is 10% certain that it is not an indication of cancer) while a multi-class classifier will have as many output neurons as there are known outputs (so an image of a mango will hopefully produce a high confidence score for the mango output neuron and a lower confidence score for the output neurons relating to the other types of fruit that it is trained to recognise).

Each connection from the input neurons to the output neurons has a "weight" - a number that represents how strong the connection is. When an "input pattern" (which is the name for the list of 0-to-1 input neuron values that a single input, such as a picture of a cat, may be represented by) is applied to the input neurons, the output neurons are set to a value that is the sum of every input neuron's value that is connected to it multiplied by the weight of the connection.

I know that this is sounding very abstract, so let's visualise some extremely simple possible neural nets.

Examples that are silly to use machine learning for but which are informative for illustrating the principles

The image below depicts a binary classifier (because there is only one output neuron) where there are only two input neurons. The connections between the two input neurons and the single output neuron each have a weight of 0.3.

This could be considered to be a trained model for performing a boolean "AND" operation.. if you'll allow me a few liberties that I will take back and address properly shortly.

An "AND" operation could be described as a light bulb that is connected to two switches and the light bulb only illuminates if both of the switches are set to on. If both switches are off then the light bulb is off, if only one of the switches is on (and the other is off) then the light bulb is off, if both switches are on then the light bulb turns on.

Since the neuron inputs have to accept values between 0 and 1 then we could consider an "off" switch as being a 0 input and an "on" switch as being a 1 input.

If both switches are off then the value at the output is (0 x 0.3) + (0 x 0.3) = 0 because we take the input values, multiply them by their connection weights to the output and add these values up.

If one switch is on and the other is off then the output is either (1 x 0.3) + (0 x 0.3) or (0 x 0.3) + (1 x 0.3), both of which equal 0.3. We'll consider 0.5 to be the cut-off point at which we consider the output to be "on" and so an output of 0.3 means that the light bulb stays off.

If both switches are on then the output is (1 x 0.3) + (1 x 0.3) = 0.6, which is greater than 0.5 and so we consider the output to be on, which is the result that we wanted!
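The arithmetic for those three cases can be sketched in a few lines of C#. The function names CalculateOutput and IsOn are just ones that I've made up for this illustration, and it uses the simple "multiply and add" calculation exactly as described so far:

```csharp
using System;
using System.Linq;

// The value at the output neuron: each input value multiplied by its connection weight, all added together
double CalculateOutput(double[] inputs, double[] weights) =>
    inputs.Zip(weights, (input, weight) => input * weight).Sum();

// 0.5 is the cut-off point at which the output is considered to be "on"
bool IsOn(double output) => output >= 0.5;

var andWeights = new[] { 0.3, 0.3 };
var switchPositions = new[] { new[] { 0d, 0d }, new[] { 0d, 1d }, new[] { 1d, 0d }, new[] { 1d, 1d } };
foreach (var inputs in switchPositions)
{
    var output = CalculateOutput(inputs, andWeights);
    Console.WriteLine($"({inputs[0]}, {inputs[1]}) => {output} => {(IsOn(output) ? "on" : "off")}");
}
```

Note that nothing in CalculateOutput knows anything about "AND" - all of the behaviour comes from the two 0.3 weights that are passed in.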

Just in case it's not obvious, this is not a good use case for machine learning - this is an extremely simple process that would be written much more obviously in the "classic approach" to programming.

Not only would it be simpler to use the classic approach, but this is also not suited for machine learning because we know all of the possible input states and what their outputs should be - we know what happens when both switches are off and when precisely one switch is on and when both switches are on. The amazing thing with machine learning is that we can produce a trained model that can then make predictions about data that we've never seen before! Unlike this two-switches situation, we can't possibly examine every single picture of either a cat or a dog in the entire world but we can train a model to learn from one big set of pictures and then perform the crucial task of cat/dog identification in the future for photos that haven't even been taken yet!

For a little longer, though, I'm going to stick with some super-simple boolean operation examples because we can learn some important concepts.

Where the "AND" operation requires both inputs to be "on" for the output to be "on", there is an "OR" operation where the output will be "on" if either or both of the inputs are on. The weights on the network shown above will not work for this.

Now this second network would work to imitate an OR operation - if both switches are off then the output is (0 x 0.5) + (0 x 0.5) = 0, if precisely one switch is on then the output is (1 x 0.5) + (0 x 0.5) or (0 x 0.5) + (1 x 0.5) = 0.5, if both switches are on then the output is (1 x 0.5) + (1 x 0.5) = 1. So if both switches are off then the output is 0, which means the light bulb should be off, but if one or both of the switches are on then the output is at least 0.5, which means that the light bulb should be on.

This highlights that the input neurons and output neurons determine the form of the data that the model can receive and what sort of output/prediction it can make - but it is the weights of the connections that control what processing occurs and how the input values contribute to producing the output value.

*(Note that there can be more layers of neurons in between the input and output layer, which we'll see an example of shortly)

In both of the two examples above, it was as if we were looking at already-trained models for the AND and the OR operations but how did they get in that state? Obviously, for the purposes of this post, I made up the values to make the examples work - but that's not going to work in the real world with more complex problems where I can't just pull the numbers out of my head; what we want is for the computer to determine these connection weight values and it does this by a process of trial and improvement and it is this act that is the actual "machine learning"!

The way that it often works is that, as someone who wants to train a model, I decide on the number of inputs and outputs that are appropriate to my task and then the computer has a representation of a neural network of that shape in its memory where it initially sets all of the connection weights to random values. I then give it my labelled data (which, again, is a list of inputs and the expected output - where each individual "input" is really a list of input values that are all 0-1) and it tries running each of those inputs through its neural network and compares the calculated outputs to the outputs that I've told it to expect. Since this first attempt will be using random connection weights, the chances are that a lot of its calculated output will not match the outputs that I've told it to expect. It will then try to adjust the connection weights so that hopefully things get a bit closer and then it will try running all of the inputs through the neural network with the new weights and see if the calculated outputs are closer to the expected output. It will do this over and over again, making small adjustments to the connection weights each time until it produces a network with connection weights that calculate the expected output for every input that I gave it to learn with.

The reason that the weights that it uses initially are random values (generally between 0 and 1) is that the connection weight values can actually be any number that makes the network operate properly. While the input values are all 0-1 and the output value should end up being 0-1, the connection weights could be larger than one or they could be negative; they could be anything! So your first instinct might be "why set them all to random values instead of setting them all to 0.5 initially" and the answer is that while 0.5 is the mid-point in the allowable ranges for the input values and the output values, there is no set mid-point for the connection weight values. You may then wonder why not set them all to zero because that sounds like it's in the middle of "all possible numbers" (since the weights could be positive or they could be negative) and the machine learning could then either change them from zero to positive or negative as seems appropriate.. well, at the risk of skimming over details, numbers often behave a little strangely in some kinds of maths when zeroes are involved and so you generally get better results starting with random connection weight values, rather than starting with them all at zero.

Let's imagine, then, that we decided that we wanted to train a neural network to perform the "OR" operation. We know that there are two inputs required and one output. And we then let the computer represent this model in memory and have it give the connections random weight values. Let's say that it picks weights 0.9 and 0.1 for the connections from Input1-to-Output and Input2-to-Output, respectively.

We know that the labelled data that we're training with looks like this:

Input 1	Input 2	Output
0	0	0
0	1	1
1	0	1
1	1	1

.. and the first time that we tried running these inputs through our 0.9 / 0.1 connection weight neural network, we'd get these results:

Input 1	Input 2	Calculated Output	Expected Output	Is Correct
0	0	(0x0.9) + (0x0.1) = 0.0	0	Yes (0.0 < 0.5 so consider this 0)
0	1	(0x0.9) + (1x0.1) = 0.1	1	No (0.1 < 0.5 so consider this 0 but we wanted 1)
1	0	(1x0.9) + (0x0.1) = 0.9	1	Yes (0.9 >= 0.5 so consider this 1)
1	1	(1x0.9) + (1x0.1) = 1.0	1	Yes (1.0 >= 0.5 so consider this 1)

Unsurprisingly (since completely random connection weights were selected), the results are not correct and so some work is required to adjust the weight values to try to improve things.

I'm going to grossly simplify what really happens at this point but it should be close enough to illustrate the point. The process tries to improve the model by repeatedly running every input pattern through the model (the input patterns in this case are (0, 0), (0, 1), (1, 0), (1, 1)) and comparing the output of the model to the output that is expected for each input pattern (as we know (0, 0) should have an output of < 0.5, while any of the input patterns (0, 1), (1, 0) and (1, 1) should have an output of >= 0.5). When there is a discrepancy in the model's output and the expected output, it will adjust one or more of the connection weights up and down.. then it will do it again and hopefully find that the calculated outputs are closer to the expected outputs, then again and again until the model's calculated outputs for every input pattern match the expected outputs.

So it will first try the pattern (0, 0) and the output will be 0 and so no change is needed there.

Then it will try (0, 1) and find that the output is too low and so it will increase the weight of the connections slightly, so now maybe they go from 0.9 / 0.1 to 0.91 / 0.11.

Then it will try (1, 0) with the new 0.91 / 0.11 weights and find that it gets the correct output (more than 0.5) and so make no change.

Then it will try (1, 1) with the same increased 0.91 / 0.11 weights and find that it still gets the correct output there and so make no more changes.

After this adjustment, the input pattern (0, 1) will still be too low, since (0 x 0.91) + (1 x 0.11) is only 0.11, and so it will have to go round again.

It might continue doing this multiple times until the weights end up something like 1.4 / 0.5 and now it will have a model that gets all of the right values!

Input 1	Input 2	Calculated Output	Expected Output	Is Correct
0	0	(0x1.4) + (0x0.5) = 0.0	0	Yes (0.0 < 0.5 so consider this 0)
0	1	(0x1.4) + (1x0.5) = 0.5	1	Yes (0.5 >= 0.5 so consider this 1)
1	0	(1x1.4) + (0x0.5) = 1.4	1	Yes (1.4 >= 0.5 so consider this 1)
1	1	(1x1.4) + (1x0.5) = 1.9	1	Yes (1.9 >= 0.5 so consider this 1)

That's the very high-level gist, that it goes round and round in trying each input pattern and comparing the computed output to the expected output until the computed and expected outputs match. Great success!

(I'm not going to go into any more detail about how this weight-adjusting process works because I'm trying to avoid digging into any code in this post - just be aware that this process of calculating the output for each known input and then adjusting the connection weights and retrying until the output values are what we expect for each set of inputs is the actual training of the model, which I'll be referring to multiple times throughout the explanations here)
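For anyone who doesn't mind a small amount of code after all, the loop below is a rough sketch of the trial-and-improvement idea for the "OR" example. It is not what a real machine learning library does - it uses a simple perceptron-style rule that I've picked for illustration, where a weight is only nudged if its input was "on" (so, unlike the simplified walkthrough above, only the second weight ends up changing). The adjustment size of 0.01 and the maxRounds safety limit are assumptions made for this sketch.

```csharp
using System;
using System.Linq;

// The labelled data for "OR": each input pattern along with the output that is expected for it
var trainingData = new (double[] Inputs, int Expected)[]
{
    (new[] { 0d, 0d }, 0),
    (new[] { 0d, 1d }, 1),
    (new[] { 1d, 0d }, 1),
    (new[] { 1d, 1d }, 1)
};

var weights = new[] { 0.9, 0.1 }; // the "random" starting weights from the example above
const double adjustment = 0.01;   // how much a weight is nudged at a time
const int maxRounds = 1000;       // hypothetical safety limit so that the loop always ends

// Multiply each input by its connection weight, add them up and apply the 0.5 cut-off
int Predict(double[] inputs) =>
    inputs.Zip(weights, (input, weight) => input * weight).Sum() >= 0.5 ? 1 : 0;

var rounds = 0;
while ((rounds < maxRounds) && trainingData.Any(entry => Predict(entry.Inputs) != entry.Expected))
{
    foreach (var (inputs, expected) in trainingData)
    {
        var error = expected - Predict(inputs); // +1 = output too low, -1 = too high, 0 = correct
        for (var i = 0; i < weights.Length; i++)
            weights[i] += adjustment * error * inputs[i]; // inputs that are "off" leave their weight alone
    }
    rounds++;
}
Console.WriteLine($"Finished after {rounds} round(s) with weights {weights[0]:0.00} / {weights[1]:0.00}");
```

The final weights here won't be the same as the ones in the table above, which is fine - there are many different pairs of weights that produce the correct outputs for all four input patterns and the training only needs to find one of them.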

Now, there are a few things that may seem wrong based upon what I've said previously and how exactly it adjusts those weights through trial-and-improvement:

The calculations here show that the four outputs are 0.0, 0.5, 1.4 and 1.9 but I said earlier that the input values should all be in the range 0-1 and the output values should be in the same range of 0-1
Why does it adjust the weights so slowly when it needs to alter them; why would it only add 0.01 to the weight that connects Input1-to-Output each time?

Because I'm contrary, I'll address the second point first. If the weights were increased too quickly then the outputs may then "overshoot" the target output values that we're looking for and the next time that it tries to improve the values, it may find that it has to reduce them. Now, in this simple case where we're trying to model an "OR" operation, that's not going to be a problem because the input pattern (0, 0) will always get an output of 0 since it is calculated as (0 x Input1-to-Output-connection-weight) + (0 x Input2-to-Output-connection-weight) and that will always be 0, while the other three input patterns should all end up with an output of 0.5 or greater. However, for more complicated models, there will be times when weights need to be reduced in some cases as well as increased. If the changes made to the weights are too large then they might bounce back and forth on each attempt and never settle into the correct values, so smaller adjustments are more likely to result in training a model that matches the requirements but at the cost of having to go round and round on the trial-and-improvement attempts more often.
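As a toy illustration of that "bouncing" (and it really is a toy - a single made-up weight being nudged towards a made-up target, rather than a real neural network), the sketch below counts how many attempts it takes for a weight to get close enough to a target value when it's moved by a fixed step each time. All of the names and numbers are hypothetical.

```csharp
using System;

// Nudge a weight towards a target by a fixed step each attempt and return how many attempts it took
// to get within the tolerance (or null if it still hasn't settled after maxAttempts)
int? AttemptsToSettle(double startingWeight, double targetWeight, double step, double tolerance, int maxAttempts)
{
    var weight = startingWeight;
    for (var attempt = 0; attempt <= maxAttempts; attempt++)
    {
        if (Math.Abs(weight - targetWeight) <= tolerance)
            return attempt;

        weight += (weight < targetWeight) ? step : -step; // too low: increase it, too high: reduce it
    }
    return null;
}

var withSmallSteps = AttemptsToSettle(0.1, 0.5, step: 0.01, tolerance: 0.05, maxAttempts: 1000);
var withLargeSteps = AttemptsToSettle(0.1, 0.5, step: 0.3, tolerance: 0.05, maxAttempts: 1000);
Console.WriteLine($"Small steps: {withSmallSteps?.ToString() ?? "never settled"}");
Console.WriteLine($"Large steps: {withLargeSteps?.ToString() ?? "never settled"}");
```

With the small step, the weight creeps up until it's close enough. With the large step, it jumps from roughly 0.4 to 0.7 and back again forever, always leaping straight over the range that it's trying to land in.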

This means that it will take longer to come to the final result and this is one of the issues with machine learning - for more complicated models, there can be a huge number of these trial-and-improvement attempts and each attempt has to run every input pattern through the model. When I was talking about training a model with 10,000 pictures of cats and 10,000 pictures of dogs and all these inputs have to be fed through a neural network until the outputs are correct then it can take a long time. That's not the case here (where there are only 4 input patterns and it's a very simple network) but for larger cases, there can be a point where you allow the model to train for a certain period and then accept that it won't be perfect but hope that it's good enough for your purposes, as a compromise against how long it takes to train - it can take days to train some really complex models with lots and lots of labelled data! Likewise, another challenge/compromise is trying to decide how quickly the weights should be adjusted - the larger the changes that it makes to the weight values of connections between neurons, the more quickly it can get close to a good result but it might actually make it impossible to get the best possible result if it keeps bouncing some of the weights back and forth, as I just explained!

Now to address the first point. There's a modicum of maths involved here but you don't have to understand it in any great depth. I've been pretending that the way to calculate the output value on our network is to take (Input1's value x the Input1-to-Output's connection weight) + (Input2's value x the Input2-to-Output's connection weight) but, as we've just seen, the result of this can be greater than 1 and input values and output values are all supposed to be within the 0-1 range. In fact, using this calculation, it would be possible to get a negative output value because neuron connection weights can be negative (I'll explain why in some more examples of machine learning a little later on) and that would also mean that the output value would fall outside of the 0-1 range that we require.

To fix this, we take the simple calculation that I've been using so far and pass the value through a formula that can take any number and squash it into the 0-1 range. While there are different formula options for neural networks, a common one is the "sigmoid function" and it would look like this if it was drawn on a graph (picture courtesy of Wikipedia) -

Although this graph only shows values from -8 to +8, you can see that its "S shape" means that the lines get very flat the larger that the number is. So if the formula is given a value of 0 then the result will be 0.5, if it's given a value of 2 then the result will be about 0.88, if it's given a value of 4 then the result is about 0.98, if it's given a value of 8 then it's over 0.999 and the larger the value that the function is given the closer that the result will be to 1. It has the same effect for negative numbers - negative numbers that are -8 or lower (-12, -100, -1000) will all return a value very close to 0.

The actual formula for this graph is shown on the top left of the image ("sig(t) = 1 / (1 + e^-t)") but that's really not important to us right now, what is important is the shape of the graph and how it constrains all possible values to the range 0-1.
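If you'd rather see the numbers than the graph, the formula is a one-liner in C#. This is just a direct transcription of "sig(t) = 1 / (1 + e^-t)" and the Sigmoid name is my own:

```csharp
using System;

// The sigmoid function: squashes any number, no matter how large or how negative, into the range 0-1
double Sigmoid(double value) => 1 / (1 + Math.Exp(-value));

foreach (var value in new[] { -8d, 0d, 2d, 4d, 8d })
    Console.WriteLine($"sig({value}) = {Sigmoid(value):0.0000}");
```

Running it for a handful of values either side of zero shows the same flattening-out that the graph does - the results crowd ever closer to 0 or 1 without ever leaving that range.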

If we took the network that we talked about above (that trains a model to perform an "OR" operation and where we ended up with connection weights of 1.4 and 0.5) and then applied the sigmoid function to the calculated output values then we'd find that those weights wouldn't actually work and the machine learning process would have to produce slightly different weights to get the correct results. But I'm not going to worry about that now since the point of that example was simply to offer a very approximate overview of how the trial-and-improvement process works. Besides, we've got a more pressing issue to talk about..

The limits of such a simple network and the concept of "linearly separable" data

The two examples of models that we've trained so far are extremely simple in one important way - if you drew a graph with the four input values on them and were asked to draw a straight line that separated the inputs that should relate to an "off" state from the inputs that should relate to an "on" state then you could do it very easily, like this:

But not all sets of data can be segregated so simply and, unfortunately, it is a limitation to the very simple network shape that we've seen so far (where the input layer is directly connected to the output layer) that it can only work if the data can be split with all positive results on one side of a straight line and all negative results on the other side. Cases where this is possible (such as the "AND" and "OR" examples) are referred to as being "linearly separable" (quite literally, the results in either category can be separated by a single straight line and the model training is, in effect, to work out where that line should lie). Interestingly, there are actually quite a lot of types of data analysis that have binary outcomes that are linearly separable - but I don't want to go too far into talking about that and listing examples because I can't cover everything about machine learning and automated data analysis in this post!

A really simple example of data that is not linearly separable is an "XOR" operation. While I imagine that the "AND" and "OR" operations are named so simply that you could intuit their definitions without a grounding in boolean logic, this may require slightly more explanation. "XOR" is an abbreviation of "eXclusive OR" and, to return to our light bulb and two switches example, the light should be off if both switches are off, it should be on if one of the switches is on but it should be off if both of the switches are on. On the surface, this sounds like a bizarre situation but it's actually encountered in nearly every two-storey residence in the modern world - when you have a light on your upstairs landing, there will be a switch for it downstairs and one upstairs. When both switches are off, the light is off. If you are downstairs and switch the downstairs switch on then the light comes on. If you then go upstairs and turn on your bedroom light, you may then flip the upstairs switch for the landing light and the light will go off. At this point, both switches are on but the light is off. So the landing light is only illuminated if only one of the switches is on - when they're both on, the light goes off.

If we illustrated this with a graph like the "AND" and "OR" graphs above then you can see that there is no way to draw a single straight line on that graph where every state on one side of the line represents the light being on while every state on the other side of the line represents the light being off.

This is a case where the data points (where "data" means "all of the input patterns and their corresponding output values") are not linearly separable. And this means that the simple neural network arrangement that we've seen so far can not produce a trained model that can represent the data. If we tried to train a model in the same way as for AND and OR, the neuron connection weights would go back and forth as the training process kept finding that its "trial-and-improvement" approach continuously came up with at least one wrong result.
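To make that a bit more tangible, here's a rough Python sketch that simply tries a grid of weight pairs for the two-input / one-output network and reports the first pair that gets every row right. It assumes the simplified calculation from the earlier examples (weighted sum, then treat 0.5 or more as "on"), and the grid range of -3.0 to 3.0 in steps of 0.1 is an arbitrary choice of mine. A grid search isn't a proof, of course, but it illustrates the point.

```python
from itertools import product

# Truth tables as {(input1, input2): expected_output}
OR_DATA = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
XOR_DATA = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def classify(w1, w2, inputs):
    """Two inputs wired straight to one output: weighted sum, then a 0.5 cut-off."""
    total = inputs[0] * w1 + inputs[1] * w2
    return 1 if total >= 0.5 else 0

def find_weights(data):
    """Try every weight pair on the grid; return the first that fits all rows, else None."""
    candidates = [n / 10 for n in range(-30, 31)]  # -3.0, -2.9, .. 3.0
    for w1, w2 in product(candidates, repeat=2):
        if all(classify(w1, w2, inputs) == expected for inputs, expected in data.items()):
            return (w1, w2)
    return None

print("OR :", find_weights(OR_DATA))
print("XOR:", find_weights(XOR_DATA))
```

For OR the search finds a working pair, while for XOR it comes back empty-handed: the (0, 1) and (1, 0) rows each need a weight of at least 0.5, which makes it impossible for the two weights together to stay below 0.5 for the (1, 1) row.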

There is a solution to this, and that is to introduce another layer of neurons into the graph. In our simple network, there are two "layers" of neurons - the "input" neurons on the left and the "output" neuron on the right. What we would need to do here is add a layer in between, which is referred to as a "hidden layer". A neural network to do this would look something like the following:

To calculate the output value for any pair of input values, more calculations are required now that we have a hidden layer. Whereas before, we only had to multiply Input 1 by the weight that joined it to the Output and add that value to Input 2 multiplied by its connection weight, now we have two hidden layer neurons and we have to:

Multiply Input 1's value by the weight of its connection to Hidden Input 1 and then add that to Input 2's value multiplied by its connection weight to Hidden Input 1 to find Hidden Input 1's "initial value"
Do the same for Input 1 and Input 2 as they connect to Hidden Input 2
Apply the sigmoid function for each of the Hidden Input values to ensure that they are between 0 and 1
Take Hidden Input 1's value multiplied by its connection weight to the output and add that to Hidden Input 2's value multiplied by its connection weight to the Output to find the Output's "initial value"
Apply the sigmoid function to the Output value

The principle is just the same as when there were only two layers (the Input and Output), except now there are three and we have to take the Input layer and calculate values for the second layer (the Hidden layer) and then use the values there to calculate the value for third and final layer (the Output layer).

(Note that with this extra layer in the model, it is necessary to apply the sigmoid function after each calculation - we could get away with pretending that it didn't exist on the earlier examples but things would fall apart here if we kept trying to ignore it)

The learning process described earlier can be applied here to determine what connection weights to use; start with all connection weights set to random values, calculate the final output for every set of inputs, then adjust the connection weights to try to get closer and repeat until the desired results are achieved.

For example, the learning process may result in the following weights being determined as appropriate:

.. which would result in the following calculations occurring for the four sets of inputs (0, 0), (1, 0), (0, 1) and (1, 1) -

Input 1	Input 2	Hidden 1 Initial	Hidden 2 Initial	Hidden 1 Sigmoid	Hidden 2 Sigmoid	Output Initial	Output Sigmoid
0	0	(0x0.2) + (0x0.2) = 0.0	(0x1) + (0x1) = 0.0	0.50	0.50	(-3.9x0.50) + (3.1x0.50) = -0.40	0.17
0	1	(0x0.2) + (1x0.2) = 0.2	(0x1) + (1x1) = 1.0	0.69	0.98	(-3.9x0.69) + (3.1x0.98) = 0.35	0.80
1	0	(1x0.2) + (0x0.2) = 0.2	(1x1) + (0x1) = 1.0	0.69	0.98	(-3.9x0.69) + (3.1x0.98) = 0.35	0.80
1	1	(1x0.2) + (1x0.2) = 0.4	(1x1) + (1x1) = 2.0	0.83	1.00	(-3.9x0.83) + (3.1x1.00) = -0.14	0.36

Since we're considering an output greater than or equal to 0.5 to be equivalent to 1 and an output less than 0.5 to be equivalent to 0, we can see that these weights have given us the outputs that we want:

Input 1	Input 2	Calculated Output	Expected Output	Is Correct
0	0	0.17	0	Yes (0.17 < 0.5 so consider this 0)
0	1	0.80	1	Yes (0.80 >= 0.5 so consider this 1)
1	0	0.80	1	Yes (0.80 >= 0.5 so consider this 1)
1	1	0.36	0	Yes (0.36 < 0.5 so consider this 0)
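The calculations in the two tables above can be sketched in a few lines of Python. The weights (0.2 and 0.2 into Hidden 1, 1 and 1 into Hidden 2, -3.9 and 3.1 into the Output) are read straight off the table. One caveat: the rounded "Sigmoid" columns are not what the plain 1 / (1 + e^-t) formula gives for those initial values, but they do match if the value is multiplied by 4 before being squashed. That `STEEPNESS` factor of 4 is an assumption inferred from the figures, not something stated in the text, and the names used here are only for illustration.

```python
import math

STEEPNESS = 4.0  # assumption: inferred from the table's rounded figures, not stated in the text

def sigmoid(t: float) -> float:
    """Sigmoid with a steepness factor; STEEPNESS = 1 would be the plain formula."""
    return 1.0 / (1.0 + math.exp(-STEEPNESS * t))

# Connection weights read off the table
HIDDEN_WEIGHTS = [(0.2, 0.2), (1.0, 1.0)]  # (from Input 1, from Input 2) for Hidden 1 and Hidden 2
OUTPUT_WEIGHTS = (-3.9, 3.1)               # from Hidden 1 and from Hidden 2

def forward(input1: float, input2: float) -> float:
    """Input layer -> hidden layer -> output layer, applying the sigmoid at each step."""
    hidden = [sigmoid(input1 * w1 + input2 * w2) for w1, w2 in HIDDEN_WEIGHTS]
    output_initial = sum(h * w for h, w in zip(hidden, OUTPUT_WEIGHTS))
    return sigmoid(output_initial)

for inputs in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    output = forward(*inputs)
    print(inputs, f"{output:.2f}", "-> 1" if output >= 0.5 else "-> 0")
```

Under that assumption the four outputs round to the 0.17, 0.80, 0.80 and 0.36 shown in the table, which gives the off / on / on / off pattern that XOR requires.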

How do you know if your data is linearly separable?

Or, to put the question another way, how many layers should your model have??

A somewhat flippant response would be that if you try to specify a model that doesn't have a hidden layer and the training never stops calculating because it can't find weights that perfectly match the data then it's not linearly separable. While it was easy to see with the AND and OR examples above that a training approach of fiddling with the connection weights between the two input nodes and the output should result in values for the model that calculate the outputs correctly, if we tried to train a model of the same shape (two inputs, one output, no hidden layers) for the XOR case then it would be impossible for the computer to find a combination of weights that would correctly calculate outputs for all of the possible inputs. You could claim that because the training process for the XOR case could never finish that it must not be linearly separable - and this is, sort of, technically, correct. But it's not very useful.

One reason that it's not very useful is that the AND, OR, XOR examples only exist to illustrate how neural networks can be arranged, how they can be trained and how outputs are calculated from the inputs. In the real world, it would be crazy to use a neural network for a tiny amount of fixed data for which all of the outputs are known - where a neural network becomes useful is when you use past data to predict future results. An example that I've used before is a fictitious history of a manager's decisions for feature requests that a team receives:

The premise is that every time this manager decides whether to give the green light or not to a feature that has been requested, they consider what strategic importance it has to the company and how much of the work the customer that is requesting it is willing to pay for. If it's of high strategic importance and the customer expects to receive such value from it that they are willing to pay 100% of the costs of implementation then surely this manager will be delighted to schedule it! If the customer's budget is less than what it will cost to implement but the feature has sufficiently high strategic value to the company (maybe it will be a feature that could then be sold to many other customers at almost zero cost to the company or maybe it is an opportunity to address an enormous chunk of technical debt) then it still may get the go-ahead! But if the strategic value is low and the customer doesn't have the budget to cover the entire cost of development then the chances are that it will be rejected.

This graph only shows a relatively small number of points and it is linearly separable. As such, the data points on the graph could be used to train a simple two-input (strategic value on a scale of 0-1 and percentage payable by the customer on a scale of 0-1) and single-output binary classifier (the output is whether the feature gets agreed or rejected) neural network. The hope would be that the data used to train it is indicative of how that manager reacts to incoming requests and so it should be possible to take future requests and predict whether that manager is likely to take them on.

However, neural networks are often intended to be used with huge data sets and whenever there is a large amount of data then there are almost always bound to be some outliers present - results that just seem a little out of keeping with those around them. If you were going to train a model like this with 100,000 previous decisions then you might be satisfied with a model that is 99.9% accurate, which would mean that out of every 100,000 decisions it might get 100 of them wrong. If you were trying to train a neural network model using 100,000 sets of historical inputs and outputs then you might decide that the computer can stop its training process when the neuron connection weights that it calculates result in outputs being calculated that are correct in 99.9% of cases, rather than hoping that a success rate of 100% can be achieved. This will have the advantage of finishing slightly more quickly but there's always a chance that a couple of those historical input/output entries were written down wrong and, with them included, the data isn't linearly separable - but with them excluded, the data is linearly separable. And so there is a distinction that can be made between whether the entirety of the data is, strictly speaking, linearly separable and whether a simple model can be trained (without any hidden layers) that is a close enough approximation.

The next problem with that simple approach ("if you can't use your data to train a model without hidden layers then it's not linearly separable") is that it suggests that adding in hidden layers will automatically mean that a neural network can be trained with the provided data - which is definitely not correct. Say, for example, that someone believed that this manager would accept or reject feature requests based upon what day of the week it was and what colour tie they were wearing that day. This person could provide historical data for sets of inputs and output - they know that decisions are only made Monday-to-Friday, so that works itself easily into a 0-1 scale for one input (0 = Monday, 0.2 = Tuesday, etc..) and they have noticed that there are only three colours of tie worn (so a similar numeric value can be associated with each colour). The issue is that there is no correlation between these two inputs and the output, so it's extremely unlikely that a computer would be able to train a simple two-input / one-output model but that does not mean that adding in hidden layers would fix the problem!

This conveniently brings us to the subject of "feature selection". As I touched on earlier, features are measurable aspects of whatever we are trying to make predictions for. The "strategic importance" and "percentage that the customer will pay" were features on the example data before. Feature selection is an important part of machine learning - if you don't capture the right information then it's unlikely that you'll be able to produce something that makes good predictions. When I said before that there could be results in the managerial decision history that don't fit a linearly separable model for these two features, maybe it's not because some of the data points were written down wrong; maybe it's because other factors were at play. Maybe this manager is taking into account other factors such as an agreement with a customer that they can't foot the bill for the entirety of the current feature but they will contribute significantly to another feature that is of high strategic importance to the company but only if this first feature is also completed.

Capturing another feature in the model (perhaps something that reflects how likely the current feature request is to bring in future valuable revenue) is a case of adding another input neuron. So far, we've only seen networks that have two inputs but that has only been the case because they're very easy to talk about and to illustrate as diagrams and to describe calculations for and to draw graphs for! If a third feature was added then all of the calculation processes are essentially the same (if there are three inputs and one output then the output value is calculated by adding together each of the three input values multiplied by their connection weights) and it would still be possible to visualise, it's just that it would be in 3D rather than being a 2D graph. Adding a fourth dimension would mean that it couldn't be easily visualised but the maths and the training process would be the same - and this holds for adding a fifth, sixth or hundredth dimension!
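As a small illustration of how little changes when another input is added, here's a Python sketch of the "multiply each input by its connection weight and add them up" step written so that it doesn't care how many inputs there are. The third feature value and the three-input weights are made-up numbers for the sake of the example, not figures from a trained model.

```python
def weighted_sum(inputs, weights):
    """Add together each input value multiplied by its connection weight."""
    if len(inputs) != len(weights):
        raise ValueError("need exactly one connection weight per input")
    return sum(value * weight for value, weight in zip(inputs, weights))

# Two inputs, as in the earlier examples
two_inputs = weighted_sum([1, 0], [1.4, 0.5])

# Three inputs: strategic value, percentage payable by the customer and a
# hypothetical "likelihood of future revenue" feature (all on a 0-1 scale)
three_inputs = weighted_sum([0.9, 0.4, 0.7], [0.8, 0.6, 0.3])

print(two_inputs, three_inputs)
```

The same function would work unchanged for a hundred inputs. In a real network, the result would still be passed through the sigmoid function afterwards.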

(Other examples of features for the manager decision data might include "team capacity" and "opportunity cost" - would we be sacrificing other, more valuable work if we agree to do this task - and "required timescale for the feature" - is the customer only able to pay for it if it's delivered in a certain time frame, after which they would not be willing to contribute? I'm sure that you wouldn't have to think very hard to conjure up features like this that could explain the results that appear to be "outliers" when only the original two proposed features were considered)

It may well be that adding further relevant features is what makes a data set linearly separable, whereas before - in absence of sufficient information - it wasn't. And, while it can be possible to train a neural network using hidden layers such that it can take all of your historical input/output data and calculate neuron connection weights that make the network appear to operate correctly, it may not actually be useful for making future predictions - where it takes inputs that it hasn't seen before and determines an output. If it can't do this, then it's not actually very useful! A model that is trained to match its historical data but that is poor at making future predictions is said to have been a victim of "overfitting". A way to avoid that is to split up the historical data into "training data" and "test data" as I'll explain in the next section*.

* (I'll also finish answering the question about how many layers you should use and how large they should be!)

A final note on feature selection: In general, gathering more features in your data is going to be better than having fewer. The weights between neurons in a network represent how important the input is that the weight is associated with - this means that inputs/features that make a larger difference to the final output are going to end up with a higher weight in the trained model than inputs/features that are of lower importance. As such, irrelevant features tend to get ignored and so there is little downside to the final trained model from including them - in fact, there will be many circumstances where you don't know beforehand which features are going to be the really important ones and so you may be preventing yourself from training a good model if you exclude data! The downside is that the more inputs that there are, the more calculations need to be performed for each iteration of the "see how good the network is with the current weights and then adjust them accordingly", which can add up quickly if the amount of historical data being used to train is large. But you may well be happier with a model that took a long time to train into a useful state than you would be with a model that was quick to train but which is terrible at making predictions!

A classic example: Reading handwritten numeric digits

An extremely well-known data set that is often used in introductions to machine learning is MNIST; the Modified National Institute of Standards and Technology database, which consists of a large number of 28x28 pixel images of handwritten digits and the number that each image is a picture of (making it a collection of "labelled data" as each entry contains input data, in the form of the image, and an output value, which is the number that the image is known to be of).

It may seem counterintuitive but those 28x28 pixels can be turned into a flat list of 784 numbers (28 x 28 = 784) and they may be used as inputs for a neural network. The main reason that I think that it may be counterintuitive is that by going from a square of pixels to a one-dimensional list of numbers, you might think that valuable information is being discarded in terms of the structure of the image; surely it's important which pixels are above which other pixels or which pixels are to the left of which other pixels? Well, it turns out that discarding this "spatial information" doesn't have a large impact on the results!

Each 28x28 pixel image is in greyscale and every pixel has a brightness value in the range 0-255, which can easily be scaled down to the range 0-1 by dividing each value by 255.
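That flatten-and-scale step is only a couple of lines of Python. This sketch assumes that an image arrives as 28 rows of 28 integer brightness values. The function name and the sample image (all black apart from one white pixel) are made up for illustration and aren't taken from the MNIST data.

```python
def image_to_inputs(image):
    """Flatten rows of 0-255 brightness values into a single list scaled to 0-1."""
    return [pixel / 255 for row in image for pixel in row]

# A made-up 28x28 image: black everywhere apart from one white pixel
sample_image = [[0] * 28 for _ in range(28)]
sample_image[3][5] = 255

inputs = image_to_inputs(sample_image)
print(len(inputs), min(inputs), max(inputs))
```

The pixel at row 3, column 5 ends up at position 3 x 28 + 5 = 89 in the flat list, so the position information isn't lost, it's just no longer arranged as a grid.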

For the outputs, this is not a binary classifier (which is what we've looked at mostly so far); this is a multi-class classifier that has ten outputs because, for any given input image, it should predict whether the digit is 0, 1, etc.. up to 9.

The MNIST data (which is readily available for downloading from the internet) has two sets of data - the training data and the test data. Both sets of data are in the exact same format but the training data contains 60,000 labelled images while the test data contains 10,000 labelled images. The idea behind this split is that we can use the training data to train a model that gives the correct output for each of its labelled images and then we confirm that the model is good by running the test data through it. The test data is the equivalent of "future data" that trained neural networks are supposed to be able to make good predictions for - again, if a neural net is great at giving the right answer for data that it has already seen (ie. the data that it was trained with) but it's rubbish at making predictions for data that it hasn't seen before then it's not a very useful model! Having labelled training data and test data should help us ensure that we don't construct a model that suffers from "overfitting".

Whereas we have so far mostly been looking at binary classifiers that have a single output (where a value of greater than or equal to 0.5 indicates a "yes" and a value less than 0.5 indicates a "no"), here we want a model with ten output nodes - one for each of the possible digits. And the 0-1 value that each output will receive will be a "confidence" value for that output. For example, if our trained network processes an image and the outputs for 0, 1, .. 7 are low (say, 0.1) and the outputs for 8 and 9 are both similarly high then it indicates that the model is fairly sure that the image is either an 8 or a 9 but it is not very sure which. On the other hand, if 0..7 are low (0.1-ish) and 8 is high (say, 0.9) and 9 is in between (0.6 or so) then the model still thinks that 8 or 9 are the most likely results but, in this case, it is much more confident that 8 is the correct output.

What we know about the model at this point, for sure, is that there are 784 input neurons (for each pixel in the source data) and 10 output neurons (for each possible label for each image). What we don't know is whether we need a hidden layer or not. What we do know in this case, though, is how we can measure the effectiveness of any model that we train - because we train it using the training data and then see how well that trained model predicts the results of the test data. If our trained model gets 99% of the correct answers when we try running the test data through it then we know that we've done a great job and if we only get 10% of the correct answers for our test data then we've not got a good model!

I said earlier that one way to try to train a model is to take the approach of repeating "start with random weights and see how well it predicts outputs for the training data then adjust weights to improve it and then run through the training data again and then adjust weights to improve it.." iterations until the model either calculates all of the outputs correctly or gets to within an acceptable range. Well, another way to approach it is to decide how many iterations you're willing to do and to just stop the training at that point. Each iteration is referred to as an "epoch" and you might decide that you will run this training process for 500 epochs and then test the model that results against your test data to see how accurate it is. Depending upon your training data, this may have the disadvantage that the resulting model will not have an acceptably low error rate but one advantage that it definitely does have is that the training process will end - whereas if you were trying to train a model that doesn't have any hidden layer of neurons and the training data is not linearly separable then the training process would never end if you were going to let it run until it was sufficiently accurate because it's just not possible for a model of that form to be accurate for that sort of data.
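To show the difference that an epoch limit makes, here's a minimal Python sketch that trains the two-input / one-output network with a cap on the number of epochs. It makes several assumptions that are mine rather than anything prescribed above: the output is the plain weighted sum with a 0.5 cut-off, the weights start at zero rather than at random values (so that runs are repeatable), and each wrong answer nudges the weights by a fixed `learning_rate` in the direction that would have helped (a simple perceptron-style update).

```python
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def predict(weights, inputs):
    """Weighted sum of the inputs, treated as 'on' if it reaches 0.5."""
    total = sum(value * weight for value, weight in zip(inputs, weights))
    return 1 if total >= 0.5 else 0

def train(data, max_epochs, learning_rate=0.1):
    """Return (weights, epochs_used, mistakes_in_last_epoch)."""
    weights = [0.0, 0.0]  # assumption: fixed start instead of random, for repeatability
    mistakes = len(data)
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for inputs, expected in data:
            error = expected - predict(weights, inputs)
            if error != 0:
                mistakes += 1
                # Nudge each weight in the direction that reduces this error
                weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
        if mistakes == 0:
            return weights, epoch, 0  # every row was right, so stop early
    return weights, max_epochs, mistakes

print("OR :", train(OR_DATA, max_epochs=500))
print("XOR:", train(XOR_DATA, max_epochs=500))
```

The OR run stops after a handful of epochs because it reaches a pass with no mistakes, while the XOR run uses all 500 epochs and still has mistakes at the end. Without the cap, that second loop would never finish.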

This gives us a better way to answer the question "how many layers should the model have?" because you could start with just an input layer and an output layer, with randomly generated connection weights (as we always start with), have the iterative run-through-the-input-data-and-check-the-outputs-against-the-expected-values-then-try-to-improve-the-weights-slightly-to-get-better-results process run for 500 epochs and then see how well the resulting model does at handling the test data (for each of the 10k labelled images, run them through the model and count it as a pass if the output that matches the correct digit for the image has the highest value out of all ten of the outputs and count it as a fail if that is not the case).
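The pass / fail counting described there can be sketched in Python as follows. The three sets of output values are made up for illustration (they aren't from a trained model) and the function names are my own.

```python
def predicted_digit(outputs):
    """Index (0-9) of the output neuron with the highest confidence value."""
    return max(range(len(outputs)), key=lambda digit: outputs[digit])

def accuracy(all_outputs, labels):
    """Fraction of images where the highest output matches the labelled digit."""
    passes = sum(
        1 for outputs, label in zip(all_outputs, labels)
        if predicted_digit(outputs) == label
    )
    return passes / len(labels)

# Made-up outputs for three images (hypothetical values)
sample_outputs = [
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.9, 0.6],  # most confident in 8
    [0.1, 0.8, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.1, 0.1],  # most confident in 1
    [0.1, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1, 0.1, 0.6, 0.1],  # leans towards 3
]
sample_labels = [8, 1, 8]  # the third image is really an 8, so that one is a fail

print(f"{accuracy(sample_outputs, sample_labels):.0%}")
```

With real data, `all_outputs` would hold the ten output values for each of the 10,000 test images and `labels` would hold the digit that each image is known to be.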

Since we're not looking at code here to perform all this work, you'll have to take my word for what would happen - this simple model shape would not do very well! Yes, it might predict the correct digit for some of the inputs but even a broken clock is right twice a day!

This indicates that we need a different model shape and the only other shape that we've seen so far has an additional "hidden layer" between the input layer and the output layer. However, even this introduces some questions - first, if we add a hidden layer then how many neurons should it have in it? The XOR example has two neurons in the input layer, two neurons in the hidden layer and one neuron in the output layer - does this tell us anything about how many we should use here? Secondly, is there any reason why we have to limit ourselves to a single hidden layer? Could there be any advantages to having multiple hidden layers (say, the input layer, a first hidden layer, a second hidden layer, the output layer)? If so, should each hidden layer have the same number of neurons as the other hidden layers or should they be different sizes?

Well, again, on the one hand, we now have a mechanism to try to answer these questions - we could guess that we want a single hidden layer that has 100 neurons (a number that I've completely made up for the sake of an example) and try training that model for 500 epochs* using the training data, then see how accurate the resulting model is at predicting the results of the test data. If the accuracy seems acceptable then you could say that you've found a good model! But if you want to see if the accuracy could be improved then you might try repeating the process but with 200 neurons in the hidden layer and trying again - or maybe even reducing it down to 50 neurons in the hidden layer to see how that impacts the results! If you're not happy with any of these results then maybe you could try adding a second hidden layer and then playing around with how many neurons are in the first layer and the second layer!

* (This 500 value is also one that I've just made up for sake of example right now - deciding how many iterations to attempt when training the model may come down to how much training data that you have because the more data that there is to train with, the longer each iteration will take.. so if 500 epochs can be completed in a reasonable amount of time then it could be a good value to use but if it takes an entire day to perform those 500 iterations then maybe a smaller number would be better!)

One thing to be aware of when adding hidden layers is that the more layers that there are, the more calculations that must be performed as the model is trained. Similarly, the more neurons that there are in the hidden layers, the more calculations there are that must be performed for each training iteration/epoch. If you have a lot of training data then this could be a concern - it's tedious to fiddle around with different shapes of models if you have to wait hours (or even days!) each time that you want to train a model in a new configuration. And so erring on the side of fewer layers and fewer neurons in those layers is a good starting point - if you can get good results from that then you will get over the finish line sooner!

To finally offer some concrete advice, I'm going to quote a Stack Overflow answer that repeats a rule of thumb that I've read in various other places:

(i) number of hidden layers equals one;

and (ii) the number of neurons in that layer is the mean of the neurons in the input and output layers.

This advice recommends that we start with a single hidden layer and that it have 397 neurons (which is the average of the number of neurons in the input layer = 28 * 28 pixels = 784 inputs and the number of neurons in the output layer = 10).

Using the MNIST training data to train a neural network of this shape (784 inputs, 397 neurons in hidden layer, 10 outputs) across 10 epochs will result in a model that has 95.26%* accuracy when the test data is run through it. Considering what some of these handwritten digits in the test data look like, I think that that is pretty good!

* (I know this because I went through the process of writing the code to do this a few years ago - I was contemplating using it as part of a series of posts about picking up F# if you're a C# developer but I ran out of steam.. maybe one day I'll pick it back up!)

To try to put this 95.26% accuracy into perspective, the PDF "Formal Derivation of Mesh Neural Networks with Their Forward-Only Gradient Propagation" claims that the MNIST data has an "average human performance of 98.29%"* (though there are people on Reddit who find this hard to believe and think that it is too low) while the state of the art error rate for MNIST (where, presumably, the greatest machine learning minds of our time compete) is 0.21%, which indicates an accuracy of 99.79% (if I've interpreted the leaderboard's information correctly!).

* (Citing P. Simard, Y. LeCun, and J. Denker, "Efficient pattern recognition using a new transformation distance" in Advances in Neural Information Processing Systems (S. Hanson, J. Cowan, and C. Giles, eds.), vol. 5, pp. 50–58, Morgan-Kaufmann, 1993)

Note: If I had trained a model that had TWO hidden layers where, instead of there being a single hidden layer whose neuron count was 1/2 x number-of-inputs-plus-number-of-outputs, the two layers had 2/3 and 1/3 x number-of-inputs-plus-number-of-outputs then the accuracy could be increased to 97.13% - and so the advice above is not something set in stone, it's only a guideline or a starting point. But I don't want to get too bogged down in this right now as the section just below talks more about multiple hidden layers and about other options, such as pre-processing of data; with machine learning, there can be a lot of experimentation required to get a "good enough" result and you should never expect perfection!
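For reference, here's a small Python sketch of the arithmetic behind those layer sizes: the rule-of-thumb single hidden layer and the two-layer variation from the note above. Rounding the 2/3 and 1/3 figures to the nearest whole neuron is an assumption on my part (the counts obviously have to be whole numbers but the note doesn't say how they were rounded) and the function names are only for illustration.

```python
def single_hidden_layer_size(input_count: int, output_count: int) -> int:
    """Rule of thumb: the mean of the input and output layer sizes."""
    return (input_count + output_count) // 2

def two_hidden_layer_sizes(input_count: int, output_count: int) -> tuple:
    """Variation: 2/3 and 1/3 of (inputs + outputs), rounded to the nearest neuron."""
    total = input_count + output_count
    return round(total * 2 / 3), round(total / 3)

mnist_inputs, mnist_outputs = 28 * 28, 10
print(single_hidden_layer_size(mnist_inputs, mnist_outputs))
print(two_hidden_layer_sizes(mnist_inputs, mnist_outputs))
```

For MNIST that is 397 neurons for the single hidden layer, or roughly 529 and 265 for the two-layer shape.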

Other shapes of network and other forms of processing

The forms of neural network that I've shown above are the most simple (regardless of how many hidden layers there are - or aren't) but there is a whole range of variations that offer tools to potentially improve accuracy and/or reduce the amount of time required to train them. This is a large topic and so I won't go deeply into any individual possibility but here are some common variations..

Firstly, let's go back to thinking about those hidden layers and how many of them that you might want. To quote part of a Quora answer:

There is a well-known problem of facial recognition, where computer learns to detect human faces. Human face is a complex object, it must have eyes, a nose, a mouth, and to be in a round shape, for computer it means that there are a lot of pixels of different colours that are comprised in different shapes. And in order to decide whether there is a human face on a picture, computer has to detect all those objects.

Basically, the first hidden layer detects pixels of light and dark, they are not very useful for face recognition, but they are extremely useful to identify edges and simple shapes on the second hidden layer. The third hidden layer knows how to comprise more complex objects from edges and simple shapes. Finally, at the end, the output layer will be able to recognize a human face with some confidence.

Basically, each layer in the neural network gets you farther from the input which is raw pixels and closer to your goal to recognize a human face.

I've seen various explanations that describe hidden layers as being like feature inputs for increasingly specific concepts (with the manager decision example, the two input neurons represented very specific features that we had chosen for the model whereas the suggestion here is that the neurons in each hidden layer represent features extracted from the layer before it - though these features are extracted as a result of the training process and they may not be simple human-comprehendible features, such as "strategic value").. but I get the impression that this is something of an approximation of what's going on. In the face detection example above, we don't really know for sure that the third hidden layer really consists of "complex objects from edges and simple shapes" - that is just (so far as I understand it) an approximation to give us a feeling of intuition about what is going on during the training process.

It's important to note that ALL that is happening during the training is fiddling of neuron connection weights such that the known inputs of the training data get closer to producing the expected outputs in the training data corresponding to those inputs! While we might be able to understand and describe this as a mathematical process (and despite these neural network structures having been inspired by the human brain), we shouldn't fool ourselves into believing that this process is "thinking" and analysing information in the same way that we do! I'll talk about this a little more in the section "The dark side of machine learning predictions", a little further down.

Another thing that can be varied is which "activation function" is used for the neurons in each layer. When I described the sigmoid function earlier, which took the sums of the neurons in the previous layer multiplied by their connection weights and then applied a formula to squash that value into the 0-1 range; that was an activation function. In the XOR example earlier, the sigmoid function was used as the activation function for each neuron in the hidden layer and each neuron in the output layer. But after I did this, I had to look at the final output values and say "if it's 0.5 or more then consider it to be a 1 and if it's less than 0.5 then consider it to be a 0".. instead of doing that, I could have used a "step function" for the output layer that would be similar to the sigmoid function but which would have a sharp cut-off point (for either 0 or 1) instead of a nice smooth curve*.

* (The downside to using a step function for the output of a binary classifier is that you lose any information about how confident the result is if you only get 0 or 1; for example, an output of 0.99 indicates a very confident >= 0.5 result while an output of 0.55 is still >= 0.5 and so indicates a 1 result but it is a less confident result - if the information about the confidence is not important, though, then a step function could have made a lot of sense in the XOR example)

Another common activation function that appears in the literature about neural networks is the "Rectified Linear Unit (ReLU)" function - it's way out of the scope of this post to explain why but if you have many hidden layers then you can encounter difficulties if you use the sigmoid function in each layer and the ReLU can ease those woes. If you're feeling brave enough to dig in further right now then I would recommend starting with "A Gentle Introduction to the Rectified Linear Unit (ReLU)".
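To make the difference between these activation functions a little more concrete, here is a minimal Python sketch of the three that have been mentioned. The function names are my own, chosen for illustration, and the step function cuts off at an input of zero because that is the point at which the sigmoid curve crosses 0.5.

```python
import math

def sigmoid(x):
    # Squashes any value smoothly into the 0-1 range
    return 1 / (1 + math.exp(-x))

def step(x):
    # Sharp cut-off: sigmoid(x) >= 0.5 exactly when x >= 0, so this gives the
    # same 0 / 1 answer as thresholding the sigmoid output at 0.5
    return 1 if x >= 0 else 0

def relu(x):
    # Rectified Linear Unit: negative values become zero, positive values pass through
    return max(0.0, x)

for value in (-2.0, -0.1, 0.0, 0.1, 2.0):
    print(value, round(sigmoid(value), 3), step(value), relu(value))
```

Printing the three side by side shows the trade-off described above: the sigmoid output says how far from the cut-off a value is, while the step function only reports which side of it the value landed on.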

Finally, there are times when changing your model isn't the most efficient way to improve its accuracy. Sometimes, cleaning up the data can have a more profound effect. For example, there is a paper Spatial Transformer Networks (PDF) that I saw mentioned in a StackOverflow answer that will try to improve the quality of input images before using them to train a model or to make a prediction on test data or not-seen-before data.

In the case of the MNIST images, it can be seen to locate the area of the image that looks to contain the numeral and to then rotate it and stretch it such that it will hopefully reduce the variation between the many different ways that people write numbers. The PDF describes the improvements in prediction accuracy and also talks about using the same approach to improve recognition of other images such as street view house numbers and even the classification of bird species from images. (Unfortunately, while the StackOverflow answer links to a Google doc with further information about the performance improvements, it's a private document that you would have to request access to).

The approach to image data processing can also be changed by no longer considering the raw pixels but, instead, deriving some information from them. One example would be to look at every pixel and then see how much lighter or darker it is than its surrounding pixels - this results in a form of edge detection and it can be effective at reducing the effect of light levels in the source image (in the case of a photo) by looking at the changes in brightness, rather than considering the brightness on a pixel-by-pixel basis. This changes the source data to concentrate more on shapes within the images and less on factors such as colours - which, depending upon the task at hand, may be appropriate (in the case of recognising handwritten digits, for example, whether the number was written in red or blue or green or black shouldn't have any impact on the classification process used to predict what number an image contains).
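As a rough illustration of the "how much lighter or darker than its surroundings" idea, the following Python sketch compares each pixel in a tiny greyscale grid to the average of its immediate neighbours. The function name and the sample image are made up for this example, and real edge detection code would normally be more sophisticated than this.

```python
def brightness_differences(image):
    # image is a list of rows, each a list of brightness values (0-255)
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Gather the (up to eight) pixels that surround this one
            neighbours = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            row.append(image[y][x] - sum(neighbours) / len(neighbours))
        result.append(row)
    return result

dark_left_bright_right = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
# The same picture but with every pixel made brighter by the same amount
brighter = [[value + 50 for value in row] for row in dark_left_bright_right]

print(brightness_differences(dark_left_bright_right))
print(brightness_differences(brighter))
```

Both calls produce the same differences, with non-zero values only around the boundary between the dark and bright halves, which is the sense in which this sort of pre-processing reduces the effect of overall light levels.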

(In my post from a couple of years ago, "Face or no face", I was using a different technique than a neural network to train a model to differentiate between images that were faces and ones that weren't but I used a similar method of calculating "intensity gradients" from the source data and using that to determine "histograms of gradients (HoGs)" - I won't repeat the details here but that approach resulted in a more accurate model AND the HoGs data for each image was smaller than the raw pixel data and so the training process was quicker; double win!)

The next thing to introduce is a "convolutional neural network (CNN)", which is a variation on the neural network model that adds in "convolution layers" that can perform transformations on the data (a little like the change from raw colour image data to changes-in-brightness data, as shown in the edge detection picture above) though they will actually be capable of all sorts of types of alteration, all with multiple configuration options to tweak how they may be applied. But..

a CNN learns the values of these filters on its own during the training process

(From the article "An Intuitive Explanation of Convolutional Neural Networks")

.. and so the training process for this sort of model will not just adjust the weights between neurons to try to improve accuracy, it will also adjust the values within the filters of the convolutional layers, over and over again, to see if altering them can produce better results.
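To give a feel for what one of these filters does, here is a small Python sketch that slides a 3x3 filter over a tiny image. The filter values here are hand-picked by me purely for illustration (they respond to dark-to-bright vertical edges) whereas, in a CNN, values like these would be arrived at by the training process.

```python
def apply_filter(image, kernel):
    # Slide the kernel over every position where it fits entirely inside the image,
    # multiplying overlapping values and summing them into one output value
    k = len(kernel)
    out_height = len(image) - k + 1
    out_width = len(image[0]) - k + 1
    return [
        [
            sum(
                image[y + ky][x + kx] * kernel[ky][kx]
                for ky in range(k)
                for kx in range(k)
            )
            for x in range(out_width)
        ]
        for y in range(out_height)
    ]

# Hand-picked (not learned) filter that responds to dark-to-bright vertical edges
vertical_edge_filter = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

image = [
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
]

print(apply_filter(image, vertical_edge_filter))  # [[0, 27, 27]]
```

The output is zero where the image is uniformly dark and large where the filter straddles the change from dark to bright.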

To throw another complication into the mix - as I mentioned earlier, the more hidden layers (and the more neurons that each layer has), the more calculations that are required by the training process and so the slower that training a model will be. This is because every input neuron is connected to every neuron in the first hidden layer, then every neuron in the first hidden layer is connected to every neuron in the second hidden layer, etc.. until every neuron in the last hidden layer is connected to every neuron in the output layer - which is why the number of calculations expands massively with each additional layer. These are described as "fully-connected layers". But there is an alternative; the imaginatively-named approach "sparsely connected layers". By having fewer connections, the necessary calculations are fewer and the training time should be shorter. In a neural net, there are commonly a proportion of connections that have a great impact on the accuracy of the trained model and a proportion that have a much lower (possibly even zero) effect. Removing these "lower value" connections is what allows us to avoid a lot of calculations/processing time but identifying these connections is a complex subject. I'm not even going to attempt to go into any detail in this post about how this may be achieved but if you want to know more about the process of intelligently selecting what connections to use, I'll happily direct you to the article "The Sparse Future of Deep Learning"!
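The growth in connections for fully-connected layers is easy to put numbers on. In this Python sketch, the 784 inputs and 10 outputs match the MNIST example (28x28 pixels in, ten digits out) but the hidden layer sizes of 100 are figures that I've made up for illustration.

```python
def fully_connected_weights(layer_sizes):
    # Every neuron in one layer connects to every neuron in the next layer
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

print(fully_connected_weights([784, 10]))            # 7840
print(fully_connected_weights([784, 100, 10]))       # 79400
print(fully_connected_weights([784, 100, 100, 10]))  # 89400
```

Each of those connections has a weight that must be adjusted repeatedly during training, which is why pruning the ones that contribute little can save so much work.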

One final final note for this section, though: in most cases, more data will get you superior results in comparison to trying to eke out better results from a "cleverer" model that is trained with less data. If you have a model that seems decent and you want to improve the accuracy, if you have the choice between spending time obtaining more quality data (where "quality" is an important word because "more data" isn't actually useful if that data is rubbish!) or spending time fiddling with the model to try to get a few more percentage points of accuracy, generally you will be better to get more data. As Peter Norvig, Google's Director of Research, was quoted as saying (in the article "Every Buzzword Was Born of Data"):

We don't have better algorithms. We just have more data.

The dark side of machine learning predictions

The genius of machine learning is that it can take historical data (what inputs lead to what output) and produce a model that can use that information to make an output prediction for a set of inputs that it's never seen before.

A big downfall of machine learning is that all it is doing is taking historical data and producing a model that uses that information to predict an output for a set of inputs that it's never seen before.

In an ideal world, this wouldn't be a downfall because all decisions would have been made fairly and without bias. However.. that is very rarely the case and when a model is trained using historical data, you aren't directly imbuing it with any moral values but it will, in effect, exhibit any biases in the data that was used to train it.

One of the earliest examples of this that sticks in my mind was of a handheld camera that had blink detection to try to help you get a shot where everyone has their eyes open. However, the data used to train the model used photos of Caucasian people, resulting in an Asian American writing a post "Racist Camera! No, I did not blink... I'm just Asian!" (as reported on PetaPixel) as the camera "detected" that she was blinking when she wasn't.

And more horrifying is the article from the same site "Google Apologizes After Photos App Autotags Black People as 'Gorillas'" which arose from Google Photos adding an auto-tagging facility that would make suggestions as to what it recognised in your photos. Again, this comes down to the source data that was used to train the models - and it's not to suggest that the people sourcing and using this data (which could well be two groups of people; one that collects and tags sets of images and a second group that presumes that that labelled data is sufficiently extensive and representative) are unaware of the biases that it contains. As with life, there are always unconscious biases and the only way to tackle them is to be aware of them and try as best you can to eradicate them!

Another example is that, a few years ago, Amazon toyed with introducing a resume-screening process using machine learning - where features were extracted from CVs (the features would be occurrences of a long, known list of words or phrases in this case, as opposed to the numeric values in the manager decision example or the pixel brightnesses in the MNIST example) and the recruitment outcomes (hired / not-hired) from historical data to train a model. However, all did not go to plan. Thankfully, they didn't just jump into the deep end and accept the results of the model when they received new CVs; instead, they ran them through the model and performed the manual checks, to try to get an idea of whether the model was effective or not. I'm going to take some highlights from the article "Amazon built an AI tool to hire people but had to shut it down because it was discriminating against women", so feel free to read that if you want more information. The upshot is that historically there had been many more CVs submitted by men than women, which resulted in there being many more "features" present on male CVs that resulted in a "hire them!" result. As I described before, when a neural network (presuming that Amazon was using such an approach) is presented with many features, it will naturally work out which have a greater impact on the final outcome and the connection weights for these input neurons will be higher. What I didn't describe at that point is that the opposite also happens - features that are found to have a negative impact on the final outcome will not just be given a smaller weight, they will be given a negative weight.
And since this model was effectively learning that men are more likely to be hired and women are less likely, the model that it ended up with gave greater positive weight to features that indicated that the CV was for a male and negative weight to features that indicated that the CV was for a female, such as a mention of them being a "women's chess club captain" or even if they had attended one of two all-women's colleges (the name of which had presumably been in the list of known words and phrases that would have been used as features - and which would not have appeared on any man's CV). The developers at Amazon working on this project made changes to try to avoid this issue but they couldn't be confident that other biases were not having an effect and so the project was axed.

There are, alas, almost certainly always going to be issues with bias when models are trained in this manner - hopefully, it can be reduced as training data sets become more representative of the population (for cases of photographs of people, for example) but it is something that we must always be aware of. I thought that some industries were explicitly banned from applying judgements to their customers using a non-accountable system such as the neural networks that we've been talking about but I'm struggling to find definitive information. I had it in my head that the UK car insurance market was not allowed to produce prices from a system that isn't transparent and accountable (eg. it would not be acceptable to say "we will offer you this price because the computer says so" as opposed to a "decision-tree-based process" where it's essentially like a big flow chart that could be explained in clear English, where the impacts on price for each decision are based on statistics from previous claims) but I'm also unable to find any articles stating that. In fact, sadly, I find articles such as "Insurers 'risk breaking racism laws'" which describe how requesting quotes from some companies, where the only detail that varies between them is the name (a traditionally-sounding white English name compared to another traditional English - but not traditionally white - name, such as Muhammad Khan), results in wildly different prices being offered.

Talking of training neural nets with textual data..

In all of my descriptions before the previous section, I've been using examples where the inputs to the model are simple numbers -

Zeros and ones for the AND, OR, XOR cases
Numeric 0..1 value ranges for the features of the manager decision example (which were, if we're being honest, oversimplifications - can you really reliably and repeatedly rate the strategic importance to the company of a single feature in isolation? But I digress..)
Pixel brightness values for the MNIST example, which are in the range 0-255 and so can easily be reduced down to the 0-1 range
I mentioned brightness gradients (rather than looking at the intensity of individual pixels, looking at how much brighter or darker they are compared to surrounding pixels) and this also results in values that are easy to squeeze into the 0-1 range

However, there are all sorts of data that don't immediately look like they could be represented as numeric values in the 0-1 range. For example, above I was talking about analysis of CVs and that is purely textual content (ok, there might be the odd image and there might be text content in tables or other layouts but you can imagine how simple text content could be derived from that). There are many ways that this could be done but one easy way to imagine it would be to:

Take a bunch of documents that you want to train a classifier on (to try to avoid the contentiousness of the CV example, let's imagine that it's a load of emails and you want to automatically classify them as "spam", "company newsletter", "family updates" or one of a few other categories)
Identify every single unique word across all of the documents and record them in one big "master list"
Go through each document individually and..
- Split it into individual words again
- Go through each word in the master list and calculate a score by counting how many times it appears in the current document divided by how many words there are in the document (the smaller that this number is, the less common that it is and, potentially, the more interesting it is in differentiating one document from another)
For each document, you now have a long list of numbers in the range 0-1 and you could potentially use this list to represent the features of the document
- Each list of numbers is the same length for each document because the same master list of words was used (this is vitally important, as we will see shortly)
- The list of numbers that is now used to describe a given document is called a "vector"
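The steps above could be sketched in Python along the following lines. This is a minimal illustration rather than production code; the function names and the three sample emails are invented for the example, and the word-splitting is deliberately naive.

```python
def tokenise(text):
    # Very simple splitting: lower-case and break on whitespace
    return text.lower().split()

def build_master_list(documents):
    # Every unique word across all documents, in a fixed (sorted) order
    return sorted({word for doc in documents for word in tokenise(doc)})

def to_vector(document, master_list):
    # One score per master list entry: occurrences / total words in this document
    words = tokenise(document)
    return [words.count(word) / len(words) for word in master_list]

emails = [
    "win a free prize now",
    "the company newsletter is out now",
    "the cat sat on the mat",
]
master_list = build_master_list(emails)
vectors = [to_vector(email, master_list) for email in emails]

for vector in vectors:
    print(len(vector), [round(score, 2) for score in vector])
```

Every vector has the same length as the master list, regardless of how long or short the email that it was generated from is.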

One problem here is that the vocabulary used throughout these emails could be very large, meaning that the master list would be very large and the list of numbers for each document equally large. The input layer of a neural network that we would train using this data would have to have as many neurons as there are entries in these lists. This might not necessarily be a problem because it is possible to train large neural networks - but the larger that they are, the longer that it takes.

There are almost certainly going to be lots of words that appear that will have no effect on the classification of an email - words like "a" and "the", for example. As discussed earlier, this is also not strictly a problem because features that have little effect on the classification of data will naturally end up with low connection weights in the final trained model. However, it feels wasteful to throw data into the mix that we know isn't going to be useful but is certainly going to slow down the training process by forcing it to perform many more calculations. One approach is to ignore "stop words" when generating the master list of all words in the source data -

Stopwords are the words in any language which does not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence.

(From "Stop Words in NLP")

This is a common approach and lists of stop words are freely available - but it's important to be aware that they are language-dependent; English stop words are not the same as French stop words, for example, and you will need to consider this if you're not always working with English language data!

Another approach is to take advantage of the fact that the calculation above generates smaller values for words that are potentially more relevant (because the word "the" is likely to appear many times in a document, if you divide the number of times that it appears by the total number of words in the document then that value will normally be much higher than if you divided the number of times that a less common word - like "bonus" - appeared in a document by the total number of words in that document). You might decide to discard, say, 20% of the words by looking at which of them appear in the documents with a high score and removing those words from the master list (and removing the corresponding entries from each of the lists of numbers that represents each document). By doing so, you would likely end up removing the stop words without having had to start with a known list of stop words to get rid of! (Again, this sort of process is language-dependent because words that are common in one language are different in another language and this sort of approach will work best if all of the documents used as training data are written in the same language).
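One way that this culling could be sketched in Python is shown below. Ranking words by their average score across all documents is my own assumption about how "a high score" might be measured, and the master list and vectors are tiny made-up values.

```python
def drop_most_common(master_list, vectors, fraction=0.2):
    # Average each word's score across all of the documents
    averages = [
        sum(vector[i] for vector in vectors) / len(vectors)
        for i in range(len(master_list))
    ]
    number_to_drop = int(len(master_list) * fraction)
    # Indexes of the highest-scoring (ie. most common) words
    ranked = sorted(range(len(master_list)), key=lambda i: averages[i], reverse=True)
    to_drop = set(ranked[:number_to_drop])
    keep = [i for i in range(len(master_list)) if i not in to_drop]
    # Remove the words from the master list AND the matching entry from every vector
    return (
        [master_list[i] for i in keep],
        [[vector[i] for i in keep] for vector in vectors],
    )

master_list = ["bonus", "cat", "meeting", "the", "update"]
vectors = [
    [0.1, 0.0, 0.2, 0.4, 0.3],
    [0.0, 0.2, 0.0, 0.5, 0.3],
]
trimmed_list, trimmed_vectors = drop_most_common(master_list, vectors)
print(trimmed_list)  # ['bonus', 'cat', 'meeting', 'update']
```

With five words and a 20% cut, only the single highest-scoring word ("the") is removed, and each document vector shrinks by one entry to match.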

This still leaves us with many more distinct words than is probably optimal. For example, if the document text is split into words in a very simple manner (such as breaking on whitespace) then you may have an email that is a thrilling update about Mark's cat, while another email may be talking about Henry's cats - in the context of trying to work out what an email is talking about, is there any meaningful distinction between "cat" (singular) and "cats" (plural)? Probably not. And so adding in a pre-processing step to the act of splitting the text into individual words could help reduce the number of distinct words further by grouping words that are probably equivalent. An example of this is the Snowball Stemmer which performs simple transformations such as lower-casing content, removing punctuation and removing common endings from words so that different forms of verbs are combined into the same word; eg.

Dan's cats like going to the park

.. becomes

dan cat like go to the park

(There is unlikely to be any benefit to considering the words "Dan's" and "Dan" as different features, just as there is little benefit to considering "cat" and "cats" as unique features).

Note that the Snowball Stemmer is also language-dependent and so you need to know what language the text that you're working with is in - again, if you're lucky enough to be dealing with an entirely English set of training documents then that's fine.. though the nltk.stem.snowball Python package used in the link above has support for different languages, so long as you tell it which language you want it to use.
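To show the general shape of this kind of normalisation without requiring any libraries, here is a deliberately crude Python stand-in. It is NOT the Snowball algorithm (which has many more rules and handles many more cases); it only knows about possessives, punctuation and a couple of English word endings, and the function name is my own. For real work you would use an actual stemmer, such as the one in the nltk.stem.snowball package.

```python
import string

def crude_normalise(text):
    # A toy normaliser, not a real stemmer: it will mangle plenty of real words
    tokens = []
    for word in text.lower().split():
        # Remove surrounding punctuation and then any possessive ending
        word = word.strip(string.punctuation)
        if word.endswith("'s"):
            word = word[:-2]
        # Strip a couple of common endings
        if word.endswith("ing") and len(word) > 4:
            word = word[:-3]
        elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
            word = word[:-1]
        if word:
            tokens.append(word)
    return tokens

print(" ".join(crude_normalise("Dan's cats like going to the park")))
# dan cat like go to the park
```

The point is only that "Dan's" and "Dan", or "cat" and "cats", end up as the same token and so as the same feature.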

(Shameless plug: Some years ago, I wrote my own interpretation of a Full Text Indexer - which actually powers the site search on this blog! - and core functionality for that includes "tokenising" and "token normalising", which are the splitting of a string of text into individual words and the transformation of those tokens such that equivalent tokens like "Cat" and "cat" and "cats" are all reduced down to a single string; if you wanted to see one way to do this sort of work in C# then you could poke around my FullTextIndexer GitHub repo or read some of the articles listed in my "Full Text Indexer Post Round-up".. I would be remiss if I didn't mention that there are plenty of other libraries for .NET that have been written in the last ten years that are quite likely more featureful and optimised, but this is my blog and so if I want to link to my own code instead then I will! :))

In general, it would make sense to apply the stemming logic before culling the most common words from the list of features, so that you have an accurate picture of what words are truly the most "interesting" (ie. uncommon).

When I described the way that you might approach an MNIST classifier earlier, I explained that you could get a surprisingly accurate model by flattening out the 28x28 pixel image into one long list of 784 values - seemingly discarding structural information about where the pixels appear in relation to each other in two dimensions. What I've described just now, regarding the words-as-features, is similar in some ways as it is also discarding information about how words appear in relation to each other; the features that are produced are all solely based on individual words in isolation.

However, with natural language, the words that appear around a given word can make a big difference - the word "new" represents an entirely different concept when it talks about releasing a "new feature" for some software to when it is used in the city name "New York". As such, depending upon the data and the task in hand, it will often result in a more accurate model if the features extracted from textual content are not only individual words but also phrases, such as adjacent pairs of words (like "New York") or consecutive runs of three words or even more. Doing so will mean that the number of possible features grows again (it's not just a master list of individual words any longer, it's now a list of words and phrases) - and one of the decisions that you will have to make is how long you allow the extracted phrases to be. The longer that they can be, the more possible features that you will produce (which will make model-training slower). The shorter that they can be, the fewer features there will be (and so model-training will be faster) but you risk missing out on vital context if the limit is too low (and so you might end up training models relatively quickly but finding that their accuracy is low for the task at hand).
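Extracting phrases as well as individual words could be sketched like this in Python, where max_length is the "how long can the phrases be" limit. The function name and the sample sentence are my own, for illustration.

```python
def extract_features(tokens, max_length=2):
    # Individual words plus every run of adjacent words up to max_length long
    features = []
    for n in range(1, max_length + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features

tokens = "a new feature launched in new york".split()

pairs = extract_features(tokens, max_length=2)
triples = extract_features(tokens, max_length=3)
print(len(pairs), len(triples))  # 13 18
print("new york" in pairs, "new feature" in pairs)  # True True
```

Even for this seven-word sentence, allowing pairs nearly doubles the number of features and allowing triples increases it again, which is the trade-off between context and training time.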

With this knowledge, you could now picture a neural network model because you can take the documents in your training data and convert them into vectors (which are, as a reminder, simply lists of numeric values) and each vector will be the same length (ie. it will have the same number of values in it). The length of this vector will determine how large the input layer is in the neural network because there must be as many neurons in the input layer as the document vectors are long (if, say, the master list of words used to generate the document vectors had 10,000 unique entries then each document will have been translated into a vector with 10,000 values in it and the input layer for the neural network will have to have 10,000 nodes).

This brings us to an important decision.. we will have to decide how large the document vectors should be - before, I said that you might discard the least interesting 20% of the words but that was just an example figure that I made up on the spot. I started this example by saying that you may wish to classify emails into a particular set of categories but you will have had to decide on what exactly is in that "what type of email is this" list before you can create a model and you'll need to have manually classified every document in your training data before you can start training a neural net to do the same work. The idea behind training a classifier is that it can predict a category for data (eg. an email) that it's never seen before but it needs historical labelled data to do that, which is what this pre-categorised training data will be.

All of the variables that have been mentioned - the percentage of "non-interesting" words (or "tokens", as they are often referred to) that are discarded, the maximum number of tokens that may be combined into one to produce larger features (like the "New York" example), the number of outputs (which is the number of categories that emails may be assigned to - which will dictate how many neurons there are in the output layer; one per category), how many hidden layers that your model should have (and how large those layers should be).. these will all impact the training process and are known as "hyperparameters". Changing some of them may result in large changes (good or bad) in the accuracy of the trained model while changing others may not have much impact at all. As I said before, part of working with machine learning is in experimenting with different models and different hyperparameters to see what works well for your data and what doesn't!

Side note: The process that I've described above (about producing a "master list" of words and phrases) is a sort of simplified/bastardised version of TF-IDF ("term frequency-inverse document frequency" - ie. how often do individual words appear in a document relative to the total number of words in that document, with the full version also taking into account how many of the documents each word appears in), for which there is a nice intro at "Understanding TF-ID: A Simple Introduction". And when I spoke about the process of combining multiple individual words together to produce new tokens (such as "New York"), those new tokens are referred to as "n-grams" - so if you want to find out more about all this then there is plenty of great information out there!

Other types of machine learning

Almost the entirety of this lengthy post has been about neural network classifiers but it's definitely worth mentioning that these are not the only type of machine learning technique.

The first example that I want to talk about is "document similarity" - say you have 100,000 text documents and you are fairly sure that some of them are very nearly duplicates of each other (maybe one is a technical spec for an audio amplifier from 2019 and another is a revision of that document from 2020 that includes some corrections but is basically the same), how would you find these similar documents? Well, a good starting point would be by using the textual content feature extraction described in the previous section - ie. converting each document into a vector. When that is done, it's actually quite simple to come up with an approximate "distance measurement". If you imagine a load of 2D points on a graph, you can easily measure the distance between any two points by using the Pythagorean theorem; take the horizontal difference between the two points and square it, add that to the vertical difference between them squared, then square root the result. This same approach works for points in a 3D world. And the same principle works for points with any number of dimensions - and this is how we could imagine our vectors that have been generated from each document, as points in some crazy world with many, many dimensions. We can measure the distances between pairs of these points and pairs that are relatively close to each other are likely to correspond to documents that are fairly similar in content, while pairs of points that are further away will correspond to documents that are less similar to each other.
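The distance calculation is short enough to show in full. This Python sketch (the function name is my own) works for vectors of any length, so long as both have the same number of values:

```python
import math

def distance(a, b):
    # Pythagoras generalised: square each per-dimension difference, sum them, square root
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(distance([0, 0], [3, 4]))              # 5.0
print(distance([1, 2, 3], [1, 2, 3]))        # 0.0
print(distance([0, 0, 0, 0], [1, 1, 1, 1]))  # 2.0
```

The first call is the familiar 3-4-5 triangle in 2D, and the others show that identical vectors are zero distance apart and that nothing changes in the calculation when more dimensions are added.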

Now, this doesn't actually involve any machine learning - it's just extracting features from documents and then measuring the distances between every single pair of points. If you only have 100 documents then these calculations can probably be performed so quickly that it wouldn't pose any sort of problem but if we go back to imagining 100,000 documents (or even imagining millions of documents, if not more!) then calculating the distances between every single pair of documents in that list could become a gargantuan task (as in, performing the calculations in a "brute force" approach - where every single distance is individually calculated - could potentially take a computer longer to compute than you have left to live, which would be very sad). There are clever machine learning algorithms that work out how to approximate these calculations such that the brute force approach is not necessary while still ensuring that document similarity measures for all of your data remain attainable (one such algorithm is "Hierarchical Navigable Small World (HNSW)", which is way too technical for me to do anything about but mention here in passing). Something interesting to note about this technique is that it is an example of "unsupervised" learning - when we looked at the MNIST (handwritten digit recognition) example earlier, that required that the input data was all labelled so that the machine could learn how the images compared or differed to each other - in this case, though, there is no such labelling required; we just tell the computer to go off and work out on its own what documents look like other documents!
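To get a feel for how quickly the brute force approach gets out of hand, this small Python calculation counts how many distinct pairs of documents there are to compare (the function name is my own):

```python
def number_of_pairs(document_count):
    # Each document is compared with every other document exactly once
    return document_count * (document_count - 1) // 2

print(number_of_pairs(100))        # 4950
print(number_of_pairs(100_000))    # 4999950000
print(number_of_pairs(1_000_000))  # 499999500000
```

Going from 100 documents to 100,000 multiplies the number of documents by a thousand but the number of distance calculations by roughly a million, and each of those calculations has to work through every value in two long vectors.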

(Last year, I wrote a post about how I was using a C# machine learning library that a company that I used to work for published to automatically generate "You may also be interested in" links for each of my blog posts and that used some of the same techniques described here; the FastText algorithm automates the extraction of features from textual data and HNSW calculates distances between the document vectors - I don't actually have anywhere near enough posts on my blog for this to be necessary, I just wanted to try it for fun! If you want to find out more, see "Automating suggested / related posts links for my blog")

The next example is another form of binary classifier. I mentioned earlier my post "Face or no face", where I wanted to write code that could look at a photo and identify areas in it that appeared to be people's faces. There was a bunch of pre-processing to take a colour picture, then a very rough algorithm was run to find areas that might potentially be faces - this would return many false positives (ie. sections of a photo that were not faces) and I trained a Support Vector Machine (SVM) to be able to predict with much greater accuracy whether these image subsections were indeed faces or not. An SVM is trained by giving it a large list of labelled points (where a point doesn't have to be 2D or 3D, it can have many dimensions - so it's vectors again, in other words) and leaving it to try to work out a way to split those points so that all of the points on one side of the line are of one category and all of the points on the other side are of another category. We could train an SVM with the data from the manager decision example from earlier - its training data would be the historical list of 2D points (where the dimensions represented "strategic value to the company" and "percentage that customer will pay for feature development cost") and whether the manager gave a yes or no answer and the SVM would try to find a line that splits (or "delineates") those historical points. In a somewhat comparable way to the training of a neural network, it will pick a line at random, see how well or poorly that line delineates the results, adjust the line to try to improve the situation and then repeat and repeat until it manages to split the data effectively. The vectors for the face-or-no-face code were much larger than two dimensions and so it's no longer a case of finding a 1D line that splits points on a 2D plane; instead, it is (cue impressive-sounding technical terms!) a case of finding an {n-1} dimensional "hyperplane" that splits the {n} dimensional space that the vector points exist in. The training data for the face-or-no-face SVM was derived from the publicly accessible "Caltech 10,000 Web Faces".
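To make the "pick a line, check it, adjust it, repeat" idea a little more concrete, here is a minimal sketch in C#. Some loud caveats: it uses a perceptron-style update rule and is not a real SVM (an SVM also tries to keep the line as far as possible from the nearest points on either side), it starts from an all-zero line instead of a random one so that it behaves the same on every run, and the "history" data is entirely made up for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical past decisions: strategic value and share of the cost paid (both scaled
// to 0..1), plus whether the manager said yes
var history = new (double Value, double Paid, bool Approved)[]
{
    (0.9, 0.8, true), (0.7, 0.9, true), (0.8, 0.6, true), (0.6, 0.7, true),
    (0.2, 0.3, false), (0.4, 0.1, false), (0.1, 0.5, false), (0.3, 0.4, false)
};

// The line is described by two weights and a bias; points on the positive side are a "yes"
double w1 = 0, w2 = 0, bias = 0;
bool Predict(double value, double paid) => (w1 * value) + (w2 * paid) + bias > 0;

for (var epoch = 0; epoch < 1000; epoch++)
{
    var mistakes = 0;
    foreach (var (value, paid, approved) in history)
    {
        if (Predict(value, paid) == approved)
            continue;

        // Nudge the line towards classifying this point correctly
        var direction = approved ? 1 : -1;
        w1 += direction * value;
        w2 += direction * paid;
        bias += direction;
        mistakes++;
    }
    if (mistakes == 0)
        break; // The line now splits every historical point correctly
}

Console.WriteLine($"Line: ({w1:0.00} * value) + ({w2:0.00} * paid) + {bias:0.00} > 0");
```

Because the made-up points can be cleanly separated by a straight line, the loop stops once a pass over the data makes no mistakes; real data is rarely so tidy.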

I first mentioned unsupervised classification right near the start of this post, where training data will consist of points that should be arranged into groups of other points that they are most similar to. The problem is that the computer has no way of coming up with a description for what each group represents and so it is less directly useful for classifying into categories as those categories don't come with titles! However, it could feasibly be used by someone like Netflix when they want to come up with creative new categories - they could have the computer extract features from TV shows and films (where the features may actually be extracted from metadata about the programs, such as description or even public reviews) and then have it arrange them into groups, which a Netflix employee could manually poke around in to see if a theme presents itself that could be used as a new niche category. But this still feels quite nebulous and so maybe a concrete example may help. One of the hyperparameters that you will need to set is how many groups you want to be generated, so let's go back to the MNIST data and imagine that we want to give a machine-learning algorithm all of the source images and for it to split the data into ten groups (for the digits 0..9) but without giving it any labels for that source data (in contrast to the supervised learning approach that was described before). Well, one such algorithm is "t-distributed stochastic neighbour embedding (t-SNE)" and that can produce results such as the following:

(Reproduced under license terms on Kyle McDonald's Flickr album page)

The t-SNE takes vectors with 784 dimensions (since the MNIST data is a set of 28x28 pixel images) as input and returns a vector for each of them that has only two dimensions, which is how the results can be plotted on the graph above. This is known as "dimension reduction" (since the 2D vectors are essentially approximations of the original 784-dimension vectors). On that image, the labels for each of the MNIST images (ie. whether that image is the digit 0 or 1 or .. 9) are used to determine the colour to draw the point as - but this is only to illustrate the results of the t-SNE algorithm, those labels were not used as part of the training. And so it's quite amazing just how effective it is at grouping similar digits together! (If there were big groups that were full of intermingled colours then that would indicate a poor job but the fact that the groups are so distinct, with only a few outliers here and there, suggests that it's done a fantastic job!) Of course, one part of the reason that it does such a good job of separating the images for each digit is that the hyperparameter specified for the algorithm about the number of groups that it should try to identify is set to 10 - if it had been set to 3 or to 12 (or to anything else) then the groupings wouldn't have been so obviously correct.

Another unsupervised algorithm that is similar to t-SNE is "Uniform Manifold Approximation and Projection (UMAP)", which you can find available as a C# implementation (also published by the company that I worked for that released the machine learning library; Curiosity!) in case you are a .NET developer and want to try it out. The "Tester" project in there includes a binary file containing the MNIST image data and it will use this to train a model that groups together the images that it thinks are similar and then generates a bitmap of the results that looks similar to the image shown above. Nothing is forcing this algorithm to reduce the vectors to two dimensions, in case you were wondering - that also is a hyperparameter that is set to train the model. It could be set to 3 and it would generate 3-dimensional vectors that could be plotted in 3D space, rather than the 2D graph above.

So machine learning is always the most amazing-est thing, right (except when it's evil and discriminatory)?

I know that I've just written thousands of words espousing the power of machine learning but I did also start the post by giving an example of coding in a "classic approach" - summing up a list of purchases that someone has made, ensuring that the correct tax is included for each of them. This is a simple and predictable process and there would not be any benefit to trying to replace this with a machine learning system. For adding together costs and calculating tax, absolute precision is expected and known predictable rules are in place; machine-trained models will almost always have some level of error (much as we may try to tweak the model to minimise it) and they are very often difficult (if not impossible) to definitively reason about - so if someone disputed a bill that had been generated from a machine-trained model, it would be very difficult to justify why it was correct or not!

However, there are some middle grounds where you may be tempted to go one way or the other. The first example that comes to mind is that where I used to work, the CTO took the opportunity to learn how to write a Roslyn analyser by implementing a "Stop commenting out code (delete it if you don't need it - we have source control, you know!)" analyser. Roslyn (well, the ".NET Compiler Platform SDK (Roslyn APIs)" if you're being pernickety) makes it easy to locate comments and to show warnings in the Error List relating to them if you want to, but how to decide whether the comment text is C# code or whether it's a useful explanatory message? He could have:

Performed a one-off analysis of the code base and extracted all of the comments
Taken a random subset of those comments (say 5% - it was a large codebase, so even that might have been too high!)
Manually classified each of the comments as "code" or "not code"
Decided on a text-based feature extraction process to translate each piece of text into a vector, resulting in a list of labelled (as either 0 for "not code" or 1 for "code") vectors
Trained a binary classifier using those labelled vectors (perhaps splitting it up to use 70% of them as training data and 30% as test data)
Potentially spent time fiddling with the hyperparameters used in the training until the accuracy of the model was sufficiently high, based upon the results of running the test data through it
Exported that classifier model as C# code that the analyser could execute so that it could record warnings against comments that looked like they were commented-out code

Alternatively, he could have come up with a system that tried to guess whether a comment was commented-out code or something useful by splitting it into tokens (ie. individual words and symbols), assigning a score (either positive or negative) to a set of known tokens and then adding up the total for the tokens in the comment. For example, a curly brace symbol may have a positive score since they are much more common in C# code than in English phrases. If the score is greater than a particular threshold then the analyser will decide that it's probably commented-out code and record a warning about it.

This may not be quite as easy as you might expect because you can't just assign a positive score to every keyword in the C# language otherwise comments like this:

// this should be a private class

.. might be identified as commented-out code because it contains the keywords "private" and "class", but that would be wrong! And so there would be a fine line to walk to try to get it right for at least enough of the time.

(In case you were wondering, he went with the second option and it ended up working pretty well!)
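The token-scoring approach can be sketched in a few lines of C#. To be clear, this is not the real analyser - the tokens, their scores and the threshold are all hypothetical values picked for illustration, and a real implementation would need a much larger table that had been tuned against an actual code base.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical scores: positive for tokens that are more common in C# than in English,
// negative for tokens that are more common in English prose
var tokenScores = new Dictionary<string, int>
{
    ["{"] = 3, ["}"] = 3, [";"] = 3, ["="] = 2, ["("] = 1, [")"] = 1,
    ["var"] = 2, ["return"] = 2, ["private"] = 1, ["class"] = 1,
    ["should"] = -3, ["be"] = -2, ["a"] = -2, ["the"] = -2
};
const int threshold = 4; // Hypothetical cut-off

bool LooksLikeCode(string commentText)
{
    // Split into tokens: runs of word characters, or individual symbols
    var tokens = Regex.Matches(commentText, @"\w+|[^\w\s]").Cast<Match>().Select(m => m.Value);

    // Tokens that aren't in the table contribute nothing to the total
    var score = tokens.Sum(token => tokenScores.TryGetValue(token, out var value) ? value : 0);
    return score > threshold;
}

Console.WriteLine(LooksLikeCode("this should be a private class"));      // False
Console.WriteLine(LooksLikeCode("var total = items.Sum(x => x.Cost);")); // True
```

The first comment contains two C# keywords but the English words around them drag its total below the threshold, which is the sort of balancing act described above.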

A final example is a project that I only heard about very recently and just thought that it was ingenious! PawSense will "catproof your computer" because:

When cats walk or climb on your keyboard, they can enter random commands and data, damage your files, and even crash your computer. This can happen whether you are near the computer or have suddenly been called away from it.

PawSense is a software utility that helps protect your computer from cats. It quickly detects and blocks cat typing, and also helps train your cat to stay off the computer keyboard.

This might sound like the sort of thing that you would somehow try to train a model for because surely the difference between me hammering the keys and a cat jumping on the keyboard and smashing some down could be quite subtle?? Having said that, I don't personally have any great intuition for this because the closest that my cats get to this is when they lie down too close to the keyboard and end up leaning down some weight on one of the keys near the edge - they don't actually walk across it. However, the author of this software used some simple observations that are explained in the FAQ on the site, that:

If you carefully measure cat paws, you will find that practically all cat paws are significantly larger than a typical keyboard key. When a cat first places its paw down, the cat's weight plus the momentum of the cat's movement exerts pounds of force on the keyboard, primarily through the cat's paw pads.

The cat's paw angles and toe positions also undergo complex changes while the paw lands on the keyboard. This forces keys and often key combinations down in a distinctive style of typing which includes unusual timing patterns.

Cats' patterns of overall movement in walking or lying down also help make their typing more recognizable.

So simple! And yet, when you read the briefer description from the home page:

PawSense constantly monitors keyboard activity. PawSense analyzes keypress timings and combinations to distinguish cat typing from human typing.

.. then you could be forgiven for imagining that something much more convoluted would be required and that somehow the author of this software acquired lots of sample data of him (and other humans) typing in various manners (because keyboard interaction varies hugely depending upon whether you're scrolling through a web site or if you're typing code or if you're writing prose) and compared it to lots of samples of cats interacting with a keyboard. But, thinking about it, trying to record lots of keyboard interactions by cats sounds extremely difficult to me - they like to do what they want to do and are not likely to be coerced into keyboard walking if they don't already feel in the mood to do so!

You can see, then, that it's not only programs that have an extremely simple-looking set of interactions (such as the summing of item costs and taxes) that are better done by writing a traditional algorithm, there are also lots of programs that look like they exhibit "fuzzy" or "intelligent" behaviour where there is no clever machine learning involved, it is simply a result of clever observations and experimentation by whoever designed and wrote the code. However, in cases where machine learning can shine it can be incredibly powerful - whether that is YouTube keeping you on the site longer by recommending videos to watch, whether it's an automated categorisation of incoming help desk tickets in a large company to try to get customer problems directed to the relevant departments more quickly, whether it's keeping spam emails out of your inbox or whether it's used to program a cat flap to not let a cat in if it's bringing a "gift" with it (ie. a dead bird or rodent), it makes possible things that traditional algorithms would make very difficult to do well if they could even be done at all. And while it's certainly not in all software, you might be astonished to know just how much directly contains some or interacts with services that do!

Parallelising (LINQ) work in C#

Tue, 10 Aug 2021 08:01:00 GMT

TL;DR

For computationally-expensive work that can be split up into tasks for LINQ "Select" calls, .NET provides a convenient way to execute this code on multiple threads. This "parallelism" should not be confused with "concurrency", which is what async / await is for.

A "parallelism vs concurrency" summary

Before getting started, I want to nip in the bud any confusion with the differences between code that runs "in parallel" and code that runs "concurrently".

In short, recruiting a parallelisation strategy for code allows you to:

use multiple cores simultaneously to work on the same task

..while concurrency allows you to:

handle multiple tasks on the same core.

A common example that I like to use is to refer to Node.js because it is a single-threaded environment that supports concurrent execution of multiple requests; each request will call out to external resources such as disk, out-of-process cache, a database, etc.. and it will be non-blocking when it does so, meaning that another request can be processed while it waits for that external resource to reply. So there is only a single thread but multiple overlapping requests can be handled because each time one pauses while it waits, another one can proceed until it calls an external resource. One thread / multiple requests.

Parallelising a calculation is kind of the opposite - instead of one thread for multiple requests it tackles one request using multiple threads. This only makes sense when the work to be done is some sort of computation that consists of crunching away on data and not just waiting for an external resource to reply.

When talking about concurrency, it's worth noting that in ASP.NET, if there is a lot of load then there might be multiple threads used to process work concurrently - each of the threads will be handling requests that spend most of their time waiting for some async work to complete. This is just like "one thread / multiple requests" but multiplied out to be "{x} threads / {y} requests" where {x} < {y}.

For a web server, it is possible that it never makes sense to do work that benefits from being parallelised because that work, by its very nature, is very computationally-expensive and you wouldn't want multiple requests to get bogged down in repeating the same costly work. You might require complicated synchronisation mechanisms (to avoid multiple requests doing the same work; instead, having one request do the work while other requests queue up and wait for the result to become available) and maybe you would be better moving that computationally-heavy work off into another service entirely (in which case your web server is back to making async requests as it asks a different server to do the work and send back the result).

A "parallelism vs concurrency" example

This is what concurrent (aka "async") work looks like - if we use Task.Delay to imitate the delay that would be incurred by waiting on an external resource then we can create 50 requests and await the completion of them all like this:

var items = await Task.WhenAll(
    Enumerable
        .Range(0, 50)
        .Select(async i =>
        {
            LogWithTime($"Starting {i}");

            // Pause for 1, 2, 3, 4, 5 or 6 seconds depending upon the value of i
            await Task.Delay(TimeSpan.FromSeconds((i % 6) + 1));

            LogWithTime($"Finished {i}");
            return i;
        })
);

foreach (var item in items)
{
    LogWithTime($"Received item {item}");
}

static void LogWithTime(string message) =>
    Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} {message}");

This work will all complete within about 6s because all it does is create 50 tasks (which it can do near-instantly) where the longest of those has a Task.Delay call of 6s. Whenever one task is waiting, other work is free to continue. This means that all 50 of the tasks may be started using a single thread and that single thread may also be used to jump around receiving each of the results of those tasks.

In this example, the Task.WhenAll call creates a 50-element array where each element returns the value of "i" where i is 0-49. These 50 elements will be the 50 tasks' results, appearing in the array in the same order as they were created. This means that enumerating over the array - when Task.WhenAll says that all of the tasks have completed - will reveal the task results to be in the same order in which they were specified.
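To see the ordering behaviour in isolation, here is a smaller sketch (the delays are values that I've made up so that it runs quickly) where the tasks are deliberately arranged to complete in the reverse of the order in which they were created. The array from Task.WhenAll is still in creation order:

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

var results = await Task.WhenAll(
    Enumerable
        .Range(0, 5)
        .Select(async i =>
        {
            // The earliest tasks get the longest delays, so task 4 completes first and task 0 last
            await Task.Delay(TimeSpan.FromMilliseconds((5 - i) * 50));
            return i;
        })
);

Console.WriteLine(string.Join(", ", results)); // 0, 1, 2, 3, 4
```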

The 50 results, when the work is coordinated by Task.WhenAll, will be:

In order
Not available for enumeration until all of them have completed (due to the "Task.WhenAll" call) - all of the "Starting {i}" and "Finished {i}" messages will be displayed before any of the "Received item {item}" messages
Almost certainly handled by a single thread, across all 50 tasks (this isn't guaranteed but it's extremely likely to be true)
The total running time will be about 6s since there is almost no work involved in starting the tasks, nor receiving the results of the tasks - all that we have to wait for is the time it takes for the longest tasks to complete (which is 6s)

Now, if this code is changed such that Thread.Sleep is used instead of Task.Delay then the thread will be blocked as each loop is iterated over. Whereas Task.Delay was used to imitate a call to an external service that would do the work, Thread.Sleep is used to imitate an expensive computation performed by the current thread.

var items = Enumerable
    .Range(0, 50)
    .Select(i =>
    {
        LogWithTime($"Starting {i}");

        // Pause for 1, 2, 3, 4, 5 or 6 seconds depending upon the value of i
        Thread.Sleep(TimeSpan.FromSeconds((i % 6) + 1));

        LogWithTime($"Finished {i}");
        return i;
    });

foreach (var item in items)
{
    LogWithTime($"Received item {item}");
}

static void LogWithTime(string message) =>
    Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} {message}");

Because there is no Task.WhenAll call that requires that every iteration complete before enumeration can begin, the foreach loop will write out a line as soon as each iteration finishes. The results will still be written to the console in the order in which they were defined.

Note that this code is neither concurrent nor parallelised.

Its behaviour, in comparison to the async example above, is that the results are returned:

In order
Available for enumeration as soon as each iteration completes - so the console messages will always appear as "Starting 0", "Finished 0", "Received item 0", "Starting 1", "Finished 1", "Received item 1", etc..
Handled by a single thread as there is merely the one thread that is processing the loop and blocking on each Thread.Sleep call
The total running time is the sum of every Thread.Sleep delay, which is 171s (50 iterations where each sleep call is between 1 and 6s)

With one simple change, we can alter this code such that the work is parallelised -

var items = Enumerable
    .Range(0, 50)
    .AsParallel() // <- Parallelisation enabled here
    .Select(i =>
    {
        LogWithTime($"Starting {i}");
                
        Thread.Sleep(TimeSpan.FromSeconds((i % 6) + 1));

        LogWithTime($"Finished {i}");
        return i;
    });

foreach (var item in items)
{
    LogWithTime($"Received item {item}");
}

static void LogWithTime(string message) =>
    Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} {message}");

This changes the behaviour considerably (unless you happen to be running this code on a single core machine, which is pretty unusual these days!) because the AsParallel() call allows the 50 iterations to be distributed over multiple cores.

My computer has 24 cores and so that means that up to 24 iterations can be run simultaneously - there will be up to 24 threads running and while each of those will be blocked as the Thread.Sleep calls are hit (which, again, are intended to mimic an expensive computation that would tie up a thread), the work will be done much more quickly than when a single thread had to do all the waiting.

When this code is running, there will be many "Starting {i}" messages written out at once and then some "Finished {i}" messages will be written as soon as the first threads complete their current iterations and are ready to move onto another (until all 50 have been processed). It also means that "Received item {item}" messages will be interspersed throughout because enumeration of the list can commence as soon as any of the loops complete.

It's important to note that the scheduling of the threads should be considered undefined in this configuration and there is no guarantee that you will first see "Starting 1", followed by "Starting 2", followed by "Starting 3". In fact, when I run it, the first messages are as follows:

15:15:10.423 Starting 3
15:15:10.423 Starting 9
15:15:10.423 Starting 15
15:15:10.423 Starting 16
15:15:10.423 Starting 11
15:15:10.423 Starting 20
15:15:10.423 Starting 5
15:15:10.423 Starting 19
15:15:10.423 Starting 6
15:15:10.423 Starting 0
15:15:10.423 Starting 12
15:15:10.423 Starting 17
15:15:10.423 Starting 23
15:15:10.423 Starting 1
15:15:10.423 Starting 14
15:15:10.423 Starting 2
15:15:10.423 Starting 10
15:15:10.423 Starting 22
15:15:10.423 Starting 18
15:15:10.423 Starting 4
15:15:10.423 Starting 13
15:15:10.423 Starting 21
15:15:10.423 Starting 7
15:15:10.423 Starting 8
15:15:11.437 Finished 18
15:15:11.437 Finished 0
15:15:11.437 Finished 12
15:15:11.437 Finished 6
15:15:11.437 Starting 24
15:15:11.437 Starting 25

While the starting order is not predictable, the iteration-completion order is somewhat more predictable in this example code as loops 0, 6, 12, etc.. (ie. every multiple of 6) complete in 1s while every other value of i takes longer.

As such, the first "Finished {i}" messages are 18, 0, 12, 6 in the output shown above.

The "Received item {item}" messages will be interspersed between "Starting {i}" and "Finished {i}" messages because enumeration of the results can commence as soon as some of the loops have completed.. however, again, it's important to note that the ordering of the results should not be considered to be defined as the scheduling of the threads depends upon how .NET decides to use its ThreadPool to handle the work and how it will "join" the separate threads used for the loop iteration back to the primary thread that the program is running as.

That may sound a little confusing, so if we change the code a little bit then maybe it can become clearer:

var items = Enumerable
    .Range(0, 50)
    .AsParallel() // <- Parallelisation enabled here
    .Select(i =>
    {
        LogWithTime($"Starting {i}");
                
        Thread.Sleep(TimeSpan.FromSeconds((i % 6) + 1));

        LogWithTime($"Finished {i}");
        return i;
    });

foreach (var item in items)
{
    LogWithTime($"Received item {item}");
}

static void LogWithTime(string message) =>
    Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} {message} " + 
                      $"(Thread {Thread.CurrentThread.ManagedThreadId})");

Running this now has those first progress messages look like this:

15:32:56.683 Starting 8 (Thread 19)
15:32:56.688 Starting 14 (Thread 16)
15:32:56.699 Starting 19 (Thread 18)
15:32:56.692 Starting 17 (Thread 11)
15:32:56.690 Starting 15 (Thread 14)
15:32:56.687 Starting 11 (Thread 17)
15:32:56.685 Starting 9 (Thread 12)
15:32:56.695 Starting 18 (Thread 21)
15:32:56.688 Starting 13 (Thread 26)
15:32:56.703 Starting 21 (Thread 27)
15:32:56.692 Starting 16 (Thread 25)
15:32:56.700 Starting 20 (Thread 24)
15:32:56.683 Starting 6 (Thread 4)
15:32:56.683 Starting 0 (Thread 7)
15:32:56.683 Starting 5 (Thread 13)
15:32:56.683 Starting 1 (Thread 5)
15:32:56.687 Starting 12 (Thread 10)
15:32:56.683 Starting 2 (Thread 6)
15:32:56.683 Starting 4 (Thread 9)
15:32:56.685 Starting 10 (Thread 22)
15:32:56.683 Starting 7 (Thread 15)
15:32:56.706 Starting 22 (Thread 20)
15:32:56.683 Starting 3 (Thread 8)
15:32:56.706 Starting 23 (Thread 23)
15:32:57.722 Finished 18 (Thread 21)
15:32:57.722 Finished 6 (Thread 4)
15:32:57.722 Finished 0 (Thread 7)
15:32:57.722 Finished 12 (Thread 10)
15:32:57.723 Starting 24 (Thread 21)
15:32:57.723 Starting 25 (Thread 4)
15:32:57.723 Starting 26 (Thread 7)
15:32:57.723 Starting 27 (Thread 10)
15:32:58.711 Finished 1 (Thread 5)
15:32:58.711 Finished 7 (Thread 15)
15:32:58.711 Finished 19 (Thread 18)

Firstly, note that the "Starting {i}" and "Finished {i}" messages are in a different order again - as I said, the order in which the tasks will be delegated to threads from the ThreadPool should be considered undefined and so you can't rely on having each loop started in the same order.

Secondly, note that each of those first "Starting {i}" messages is being written from a different thread (19, 16, 18, 11, etc..). But when one of the loops is completed, the thread that processed it becomes free to work on a different iteration and so shortly after we see "Finished 18 (Thread 21)" we see "Starting 24 (Thread 21)" - meaning that one thread (the one with ManagedThreadId 21) finished with loop 18 and then became free to be assigned to start working on loop 24.

Scrolling further down the output when I run it on my computer, I can see the first "Received item {item}" messages:

15:33:01.732 Received item 9 (Thread 1)
15:33:01.732 Received item 42 (Thread 1)
15:33:01.734 Finished 32 (Thread 21)
15:33:01.734 Received item 18 (Thread 1)
15:33:01.742 Received item 24 (Thread 1)
15:33:01.742 Received item 32 (Thread 1)
15:33:01.734 Finished 37 (Thread 24)
15:33:01.734 Finished 27 (Thread 10)
15:33:01.744 Received item 20 (Thread 1)

Note that all of the "Received item {item}" messages are being logged by thread 1, which is the thread that the "Main" method of my program started on.

Having "AsParallel()" join up its enumeration results such that the enumeration itself can happen on the "primary" thread can be useful because there are some environments that get unhappy if you try to do particular types of work on separate threads - for example, if you wrote an old-school WinForms app and had a separate thread do some work and then try to update a control on your form then you would get an error:

Cross-thread operation not valid. Control accessed from a thread other than the thread it was created on.

(You may be wondering why the "Received item {i}" messages appeared a couple of seconds after the corresponding "Finished {i}" messages, rather than immediately after each loop completed - this is due to buffering of the results and I'll touch on this later in this post)

When "AsParallel()" is used in this way, the characteristics (as compared to the Task.WhenAll async work and to the single-thread work) are that:

The results are not returned in order
Enumeration starts before all of the processing has completed
Multiple threads are used (by default, one thread per core in your computer - but, again, there are options for this that I'll discuss further down)
The total running time depends upon the number of cores you have - if you had 50 cores then every loop iteration would be running simultaneously and it would take about 6s for everything to complete, as the longest iterations take 6s each (but they would be getting processed simultaneously). If you only had 1 core then you would see the same behaviour as the non-parallelised version above and it would take 171s. On my computer, with 24 cores, it takes around 11s because there are threads that get through the quick iterations until they hit the longer Thread.Sleep calls but there will still be multiple of these slower iterations being processed at the same time.

If ordering of the results is important then the code can easily be changed like this:

var items = Enumerable
    .Range(0, 50)
    .AsParallel() // <- Parallelisation enabled here
    .Select(i =>
    {
        LogWithTime($"Starting {i}");
                
        Thread.Sleep(TimeSpan.FromSeconds((i % 6) + 1));

        LogWithTime($"Finished {i}");
        return i;
    })
    .OrderBy(i => i); // <- Ordering enforced here

foreach (var item in items)
{
    LogWithTime($"Received item {item}");
}

static void LogWithTime(string message) =>
    Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} {message} " + 
                      $"(Thread {Thread.CurrentThread.ManagedThreadId})");

Now the work will still be performed on multiple threads at once but enumeration will not be able to start until all of the iterations have completed.

This means that the console messages will consist entirely of "Starting {i}" and "Finished {i}" messages until all 50 iterations are completed, then all of the "Received item {item}" messages will be written out. This will still have the same running time (eg. 11s on my computer) because the work is being performed in the same way - the only difference is that the results are all buffered up until the work is completed, otherwise the OrderBy call wouldn't be able to do its job because it couldn't know all of the values that were going to be produced.
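As an aside, "OrderBy" works here because the results happen to be the same numbers that the loop started with. PLINQ also has an "AsOrdered()" method that asks for the results to come back in the order of the source sequence, whatever the values are. The following is only a sketch (with much shorter pauses than the examples above, so that it runs quickly) and I'm not making any claims here about how its buffering compares to the "OrderBy" version:

```csharp
using System;
using System.Linq;
using System.Threading;

var items = Enumerable
    .Range(0, 50)
    .AsParallel()
    .AsOrdered() // <- Ask for results to be returned in source order
    .Select(i =>
    {
        // Pause for 10, 20, 30, 40, 50 or 60 milliseconds depending upon the value of i
        Thread.Sleep(TimeSpan.FromMilliseconds(((i % 6) + 1) * 10));
        return i;
    })
    .ToArray();

Console.WriteLine(string.Join(", ", items));
```

The work is still spread over multiple threads but the array contains 0 to 49 in order.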

Implementation details

There are a lot of options and intricacies that you can find if you dig deep enough into how this works in the .NET library. I have no intention of trying to cover all of them but there are a few options and observations that I think are worth including in this post.

The first thing to be aware of is that parallelisation of the work will not be enabled until after the "AsParallel()" call is made - for example, the following code will not spread the Thread.Sleep calls across multiple cores:

var items = Enumerable
    .Range(0, 50)
    .Select(i =>
    {
        LogWithTime($"Starting {i}");
                
        Thread.Sleep(TimeSpan.FromSeconds((i % 6) + 1));

        LogWithTime($"Finished {i}");
        return i;
    })
    .AsParallel(); // <- Too late!

foreach (var item in items)
{
    LogWithTime($"Received item {item}");
}

static void LogWithTime(string message) =>
    Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} {message} " + 
                      $"(Thread {Thread.CurrentThread.ManagedThreadId})");

This may seem counterintuitive as the IEnumerable returned from "Select" may be lazily evaluated and so you may expect the runtime to be able to distribute its work over multiple cores due to the "AsParallel()" call after it but this is not the case.

To get an idea where parallelisation may occur, there are hints in the method return types - eg. where "Enumerable.Range" returns an IEnumerable<int> and a "Select" call following it will also return an IEnumerable<int>, when there is an "AsParallel" call after "Enumerable.Range" then the type is now a ParallelQuery<int> and there is a "Select" overload for that type, which means that when "Select" is called on a ParallelQuery then that too returns a ParallelQuery.
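Writing the types out explicitly, instead of using "var", makes this easier to see. In this sketch the variable names and the "IsParallel" helper are my own inventions for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper that reports whether a query is a ParallelQuery at runtime
static bool IsParallel(object query) => query is ParallelQuery<int>;

// AsParallel() first: the "Select" that follows is the overload for ParallelQuery
ParallelQuery<int> parallelSource = Enumerable.Range(0, 50).AsParallel();
ParallelQuery<int> parallelSelect = parallelSource.Select(i => i * 2);

// AsParallel() last: this "Select" is the ordinary one for IEnumerable, so the work inside
// it is not parallelised even though the final type is a ParallelQuery
IEnumerable<int> sequentialSelect = Enumerable.Range(0, 50).Select(i => i * 2);
ParallelQuery<int> tooLate = sequentialSelect.AsParallel();

Console.WriteLine(IsParallel(parallelSelect));   // True
Console.WriteLine(IsParallel(sequentialSelect)); // False
Console.WriteLine(IsParallel(tooLate));          // True
```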

Limiting how many cores may be used

The default behaviour of "AsParallel()" is to spread the work over as many cores as your computer has available (obviously if there are only 10 work items to distribute and there are 24 cores then it won't be able to use all of your cores but if there are at least as many things to do as there are cores then it will use them all until it starts running out of things).

Depending upon your scenario, this may or may not be a good thing. For example, in my previous post (Automating "suggested / related posts" links for my blog posts - Part 2), I spoke about how I've started using the C# machine learning library Catalyst (produced by a startup that I used to work at) to suggest "you may also be interested in" links for the bottom of my posts - in this case, it's a one-off task performed before I push an update to my blog live and so I want the computer to spend all of its resources calculating this as fast as possible.

One of the applicable lines in the library is in the TFIDF implementation and looks like this:

documents.AsParallel().ForAll(doc => UpdateVocabulary(ExtractTokenHashes(doc)));

(As you can see in the source file TF-IDF.cs; along with the rest of the implementation, if you're curious)
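If you haven't come across "ForAll" before, it runs an action for each item on the worker threads, rather than bringing the results back to a single thread to be enumerated. That means that anything the action updates has to be safe to use from multiple threads at once. The following sketch is not the Catalyst code - the documents and the counting logic are made up, and a ConcurrentDictionary is assumed as the shared state:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;

// Hypothetical documents, just to have something to count
var documents = new[] { "the cat sat", "the dog sat", "the cat ran" };

// Maps each token to the number of documents that it appears in
var vocabulary = new ConcurrentDictionary<string, int>();

documents.AsParallel().ForAll(doc =>
{
    foreach (var token in doc.Split(' ').Distinct())
    {
        // AddOrUpdate is safe to call from several threads at the same time
        vocabulary.AddOrUpdate(token, 1, (_, count) => count + 1);
    }
});

Console.WriteLine($"the: {vocabulary["the"]}, cat: {vocabulary["cat"]}, dog: {vocabulary["dog"]}");
// the: 3, cat: 2, dog: 1
```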

However, I could also imagine that there might be a web server that is serving requests from many people each day but occasionally there is a request that requires some more intense computation and it might take too long to calculate this while feeling responsive to the User if it tried to do the work on a single thread - but if it used every core available on the server then it would impact all of the other requests being handled. In this case it may be appropriate to say "parallelise this work but don't allow more than four cores to be utilised". There is a method "WithDegreeOfParallelism" available for just this purpose!

var items = Enumerable
    .Range(0, 50)
    .AsParallel().WithDegreeOfParallelism(4)
    .Select(i =>
    {
        LogWithTime($"Starting {i}");
                
        Thread.Sleep(TimeSpan.FromSeconds((i % 6) + 1));

        LogWithTime($"Finished {i}");
        return i;
    });

foreach (var item in items)
{
    LogWithTime($"Received item {item}");
}

static void LogWithTime(string message) =>
    Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} {message} " + 
                      $"(Thread {Thread.CurrentThread.ManagedThreadId})");

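The value passed to "WithDegreeOfParallelism" is the maximum number of work items that may be processed concurrently, so if it is less than the number of cores then it will constrain that parallelised work such that it will not use more than that number of cores at any time (setting it higher than the number of cores won't make any more cores appear, of course).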
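If you want to reassure yourself that the limit is being respected, one option is to count how many work items are in flight at any moment. This is only a sketch - the counter names are my own and the sleep is a stand-in for real work:

```csharp
using System;
using System.Linq;
using System.Threading;

var gate = new object();
var currentlyRunning = 0;
var maxObserved = 0;

var results = Enumerable
    .Range(0, 20)
    .AsParallel().WithDegreeOfParallelism(4)
    .Select(i =>
    {
        // Record how many work items are running right now (including this one)
        lock (gate)
        {
            currentlyRunning++;
            maxObserved = Math.Max(maxObserved, currentlyRunning);
        }

        Thread.Sleep(50); // Pretend to do something expensive

        lock (gate)
        {
            currentlyRunning--;
        }
        return i;
    })
    .ToArray();

Console.WriteLine($"Processed {results.Length} items, never more than {maxObserved} at once");
```

The highest count seen should never be more than the value given to "WithDegreeOfParallelism" (though it could be lower, if the machine has fewer cores free).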

Buffering options

I mentioned earlier that when work is spread over multiple cores using "AsParallel()" and then later enumerated that some buffering of the results occurs. There are three options for the buffering behaviour:

AutoBuffering
FullyBuffered
NotBuffered

The default is "AutoBuffering" and the behaviour of this is that results are not available for enumeration as soon as the work items are completed - instead, the runtime determines a batch size that it thinks makes sense to buffer the results up for before making them available for looping through.

To be completely honest, I don't know enough about how it decides on this number or the full extent of the benefits of doing so (though I will hint at a way to find out more in the "Partitioner" section further down); I presume that there are some performance benefits to reducing how often execution jumps from one thread to another - because, as we saw earlier, as soon as enumeration commences, execution returns to the "primary thread" and hopping between threads can be a relatively expensive operation.

The second option ("FullyBuffered") is simple to understand - enumeration will not commence until all of the work items are completed; they will all be added to a buffer first. This not only has the disadvantage that enumeration can't start until the final item is completed but it also means that all of those results must be held in memory, which could be avoided (if it's a concern) by having the results "stream" out as they become ready in the other buffering scenarios. Full buffering does have the advantage of minimising "thread hops" but, even though the results are all buffered, it does not preserve the order of the work items when it comes to enumeration - despite what I've read elsewhere (you can see this yourself by running the code a little further down).

The final option is "NotBuffered" and that, as you can probably tell from the name, doesn't buffer results at all and makes them available for enumeration as soon as they have been processed (the disadvantage being the additional cost of changing thread context more frequently - ie. more "thread hops").

To override the default ("AutoBuffering") behaviour, you may use the "WithMergeOptions" function like this -

var items = Enumerable
    .Range(0, 50)
    .AsParallel().WithMergeOptions(ParallelMergeOptions.FullyBuffered)
    .Select(i =>
    {
        LogWithTime($"Starting {i}");
                
        Thread.Sleep(TimeSpan.FromSeconds((i % 6) + 1));

        LogWithTime($"Finished {i}");
        return i;
    });

foreach (var item in items)
{
    LogWithTime($"Received item {item}");
}

static void LogWithTime(string message) =>
    Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} {message} " + 
                      $"(Thread {Thread.CurrentThread.ManagedThreadId})");
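Rather than reading through timestamps, another way to see the "FullyBuffered" behaviour is to count how many work items have finished at the moment that the first result arrives in the loop. This is a rough sketch (the variable names are mine and the sleeps are shortened so that it runs quickly):

```csharp
using System;
using System.Linq;
using System.Threading;

const int total = 12;
var finished = 0;

var items = Enumerable
    .Range(0, total)
    .AsParallel().WithMergeOptions(ParallelMergeOptions.FullyBuffered)
    .Select(i =>
    {
        Thread.Sleep(20 * ((i % 3) + 1));
        Interlocked.Increment(ref finished);
        return i;
    });

var finishedWhenFirstReceived = -1;
foreach (var item in items)
{
    // Only take the reading for the very first item that reaches the loop
    if (finishedWhenFirstReceived == -1)
        finishedWhenFirstReceived = Volatile.Read(ref finished);
}

Console.WriteLine(
    $"{finishedWhenFirstReceived} of {total} work items had finished when enumeration began");
```

With "FullyBuffered", every work item should have finished before the first one is received. If you swap in "NotBuffered" then you would expect the number to be lower, though exactly how much lower will depend upon your machine.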

Cancellation

Say you have many work items distributed over multiple cores in order to calculate something very expensive and parallelisable. Part way through, you might decide that actually you don't want the result any more - maybe some of the data that it relies on has changed and a "stale" result will not be of any use. In this case, you will want to cancel the parallelised work.

To enable this, there is a "WithCancellation" method that takes a CancellationToken and will stop allocating work items to threads if the token is marked as cancelled - instead, it will throw an OperationCanceledException. To demonstrate this, the code below has a token that will be set to be cancelled after 3s and the exception will be thrown during the enumeration:

var cts = new CancellationTokenSource();
cts.CancelAfter(TimeSpan.FromSeconds(3));

var items = Enumerable
    .Range(0, 50)
    .AsParallel().WithCancellation(cts.Token)
    .Select(i =>
    {
        LogWithTime($"Starting {i}");
                
        Thread.Sleep(TimeSpan.FromSeconds((i % 6) + 1));

        LogWithTime($"Finished {i}");
        return i;
    });

foreach (var item in items)
{
    LogWithTime($"Received item {item}");
}

static void LogWithTime(string message) =>
    Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} {message} " + 
                      $"(Thread {Thread.CurrentThread.ManagedThreadId})");

It's worth noting that "WithCancellation" can only cancel the "AsParallel" work of allocating items to threads, it doesn't have any ability to cancel the individual work items themselves. If you want to do this - such that all work is halted as soon as the token is set to cancelled - then you would have to add cancellation-checking code to the work performed in each step - ie.

var cts = new CancellationTokenSource();
cts.CancelAfter(TimeSpan.FromSeconds(3));

var items = Enumerable
    .Range(0, 50)
    .AsParallel().WithCancellation(cts.Token)
    .Select(i =>
    {
        LogWithTime($"Starting {i}");
                
        cts.Token.ThrowIfCancellationRequested();
        Thread.Sleep(TimeSpan.FromSeconds((i % 6) + 1));

        LogWithTime($"Finished {i}");
        return i;
    });

foreach (var item in items)
{
    LogWithTime($"Received item {item}");
}

static void LogWithTime(string message) =>
    Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} {message} " + 
                      $"(Thread {Thread.CurrentThread.ManagedThreadId})");

(Granted, using "cts.Token.ThrowIfCancellationRequested" alongside "Thread.Sleep" isn't a perfect example of how to deal with cancellation because you can't cancel the "Thread.Sleep" call itself - but hopefully it demonstrates that if you want immediate cancellation of every work item then you need to incorporate cancellation support into each work item as well as calling "WithCancellation" on the ParallelQuery)
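The examples above let the OperationCanceledException escape, which is fine for a demonstration but in real code you will probably want to catch it around the enumeration. A minimal sketch of that, with shortened timings and a "WithDegreeOfParallelism" of 2 so that the work can't possibly finish before the token is cancelled (the "received" and "wasCancelled" variables are only there for illustration):

```csharp
using System;
using System.Linq;
using System.Threading;

var cts = new CancellationTokenSource();
cts.CancelAfter(TimeSpan.FromMilliseconds(200));

var received = 0;
var wasCancelled = false;
try
{
    var items = Enumerable
        .Range(0, 200)
        .AsParallel().WithCancellation(cts.Token).WithDegreeOfParallelism(2)
        .Select(i =>
        {
            Thread.Sleep(50); // 200 items at 50ms each over 2 threads would take ~5s in total
            return i;
        });

    // The exception surfaces here, while the results are being enumerated
    foreach (var item in items)
        received++;
}
catch (OperationCanceledException)
{
    wasCancelled = true;
}

Console.WriteLine($"Cancelled: {wasCancelled}, items received before cancellation: {received}");
```

How many items are received before the cancellation kicks in will vary, because of the buffering behaviour described earlier.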

For more detailed information on "PLINQ (parallel LINQ) cancellation", there is a great article by Reed Copsey Jr entitled Parallelism in .NET – Part 10, Cancellation in PLINQ and the Parallel class.

Partitioner<TSource>

When an "AsParallel" call decides how to split up the work, it uses something called a "Partitioner". This determines how big the buffer will be when "AutoBuffering" is used and it may even perform other optimisations. Up to this point, I've said that "AsParallel" will always spread the work over multiple cores (so long as you have multiple cores at your disposal and "WithDegreeOfParallelism" doesn't specify a value of 1) but, actually, the partitioner could look at the work load and decide that parallelising the work would probably incur more overhead than performing it one step at a time on a single thread and so it won't actually use multiple cores.

The .NET library will use its own default Partitioner unless it is told to use a custom one. This is a complex subject matter that:

I don't have a lot of knowledge about
I don't want to try to add to this article, lest it end up ginormous!

If you want to find out more, I recommend starting at the Microsoft documentation about it here: Custom Partitioners for PLINQ and TPL and also checking out Parallel LINQ in Depth (2) Partitioning from Dixin's Blog (whose blog I also referenced under the "Further reading" section of my I didn't understand why people struggled with (.NET's) async post).

When to parallelise work (and when to not)

Much of the time, there is no need for you to try to spread individual tasks over multiple threads. A very common model in this day and age for processing is a web server that is dealing with requests from many Users and most of the time is spent waiting for external caches, file system accesses, database retrievals, etc.. This is not the sort of heavy computation that would lead you to want to try to utilise multiple cores on that web server for any single request.

Also, some computational work, even if it's expensive, doesn't lend itself to parallelisation - if you can't split the work into clearly delineated and independent work items then it's going to be awkward (if not impossible) to make the work parallelisable. For example, the Fibonacci Sequence starts with the numbers 0 and 1 and each subsequent number is the sum of the previous two; so the third number is (0 + 1) = 1, the fourth number is (1 + 1) = 2, the fifth number is (1 + 2) = 3, etc.. In case you're not familiar with it and that description is a little confusing, maybe it will help to know that the first ten numbers in the sequence are:

0, 1, 1 (=0+1), 2 (=1+1), 3 (=1+2), 5 (=2+3), 8 (=3+5), 13 (=5+8), 21 (=8+13), 34 (=13+21)

If you calculate the nth number like this (based on the previous two) then it's near impossible to split the work into big distinct chunks that you could run on different threads and so it wouldn't be a good candidate for parallelisation*.

* (If you search Google then you will find that there are people proposing ways to calculate Fibonacci numbers using multiple threads but it's much more complicated than working them out the simple way described above, so let's forget about that for now so that the Fibonacci sequence works as an easily-understood example of when not to parallelise!)
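To make the dependency between steps concrete, here is a minimal sketch of the simple approach (the "Fibonacci" function name is my own). Each pass around the loop needs the two values from the pass before, so there is no way to hand "pass 7" to another thread before "pass 6" has finished:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static IEnumerable<long> Fibonacci()
{
    long previous = 0, current = 1;
    while (true)
    {
        yield return previous;

        // The next pair can only be calculated from the current pair
        (previous, current) = (current, previous + current);
    }
}

var firstTen = Fibonacci().Take(10).ToArray();
Console.WriteLine(string.Join(", ", firstTen)); // 0, 1, 1, 2, 3, 5, 8, 13, 21, 34
```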

Another thing to bear in mind is that there is some cost to having the runtime jump around multiple threads, to coordinate what work is done on which and to then join the results all back up on the original thread. For this reason, the ideal use cases are when the main task can be split into fairly large chunks so that the amount of time that each thread spends doing work makes the thread coordination time negligible in comparison.

One example is the TF-IDF class that I mentioned earlier where there is a list of documents (blog posts, in my use case) and there is analysis required on each one to look for "interesting" words:

documents.AsParallel().ForAll(doc => UpdateVocabulary(ExtractTokenHashes(doc)));

Another example is something that I was tinkering with some months ago and which I'm hoping to write some blog posts about when I can motivate myself! A few years ago, I gave a tech talk to a local group that was recorded but the camera was out of focus for most of the video and so the slides are illegible. I've still got the slide deck that I prepared for the talk and so I can produce images of those in full resolution - which gave me the idea of analysing the frames of the original video and trying to determine which slide should be shown on which frame and then superimposing a clear version of the slide onto the blurry images (then creating a new version of the video with the original audio, the original blurry view of me but super-clear slide contents). Some of the steps involved in this are:

Load all of the original slide images and translate their pixel data into a form that will make comparisons easier for the code later on
Look at every frame of the video and look for the brightest area on the image and hope that that is the projection of the slide (it will be a quadrilateral but not a rectangle, due to perspective of the wall onto which the slides were projected)
Load every frame of the video, extract the content that is in the "brightest area" that appears most commonly throughout the slides (it varies a little from slide to slide, depending upon how out of focus the camera was at the time), stretch the area back into a simple rectangle (reversing the effect of perspective), translate the pixel data into the same format as the original slides were converted into earlier and then try to find the closest match

Each of these steps lends itself to parallelisation because the work performed on each frame may be done in isolation and the work itself is sufficiently computationally expensive that the task of coordinating the work between threads can basically be considered to be zero in comparison.

(If you're just absolutely desperate to know more about this still-slightly-rough-around-the-edges project, you can find it on my GitHub account under NaivePerspectiveCorrection - like I said, I hope to write some more posts about it in the coming months but, until then, you can see some sensible uses of "AsParallel()" in Program.cs)

Automating "suggested / related posts" links for my blog posts - Part 2

Wed, 28 Apr 2021 21:56:00 GMT

TL;DR

By training another type of model from the open source .NET library that I've been using and combining its results with the similarity model from last time (see Automating "suggested / related posts" links for my blog posts), I'm going to improve the automatically-generated "you may be interested in" links that I'm adding to my blog.

Sufficient improvement, in fact, that I'll start displaying the machine-suggested links at the bottom of each post.

Where I left off last time

In my last post, I had trained a fastText model (as part of the Catalyst .NET library) by having it read all of my blog posts so that it could predict which posts were most likely to be similar to which other posts.

This came back with some excellent suggestions, like this:

Learning F# via some Machine Learning: The Single Layer Perceptron
How are barcodes read?? (Library-less image processing in C#)
Writing F# to implement 'The Single Layer Perceptron'
Face or no face (finding faces in photos using C# and Accord.NET)

.. but it also produced some less good selections, like this:

Simple TypeScript type definitions for AMD modules
STA ApartmentState with ASP.Net MVC
WCF with JSON (and nullable types)
The joys of AutoMapper

I'm still not discounting the idea that I might be able to improve the results by tweaking hyperparameters on the training model (such as epoch, negative sampling rate and dimensions) or maybe even changing how it processes the blog posts - eg. it's tackling the content as English language documents but there are large code segments in many of the posts and maybe that's confusing it; maybe removing the code samples before processing would give better results?

However, fiddling with those options and rebuilding over and over is a time-consuming process and there is no easy way to evaluate the "goodness" of the results - so I need to flick through them all myself and try to get a rough feel for whether I think the last run was an improvement or not.

Introducing a new model

The premise that I will be experimenting with is to determine what words in my post titles are "interesting" and to then order the suggested-similar posts first by a score based upon how many interesting words they share and then by the similarity score that I already have.

The model that I'll be training for this is called "TF-IDF" or "Term Frequency - Inverse Document Frequency" and it looks at every word in every blog post and considers how many times that word appears in the document (the more often, the more likely that the document relates to the word) and how many times it appears across multiple documents (the more often, the more common and less "specific" it's likely to be).
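If that description is a little abstract, here is a toy sketch of the textbook calculation (term frequency multiplied by the log of "total documents / documents containing the word"). To be clear, this is the classic formula and not necessarily the exact weighting that Catalyst uses, and the three "documents" are made up for the example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var documents = new[]
{
    "typescript type definitions for modules",
    "writing react components for typescript",
    "a note for my future self"
};

// Very naive tokenisation - split on spaces (a real tokeniser does much more than this)
var tokenised = documents.Select(document => document.Split(' ')).ToArray();

// Document frequency: how many of the documents does each word appear in?
var documentFrequency = tokenised
    .SelectMany(words => words.Distinct())
    .GroupBy(word => word)
    .ToDictionary(group => group.Key, group => group.Count());

double TfIdf(string word, string[] wordsInDocument)
{
    var termFrequency = wordsInDocument.Count(w => w == word);
    var inverseDocumentFrequency =
        Math.Log((double)tokenised.Length / documentFrequency[word]);
    return termFrequency * inverseDocumentFrequency;
}

foreach (var word in tokenised[0])
    Console.WriteLine($"{word}\t{TfIdf(word, tokenised[0]):0.000}");
```

The word "for" appears in every document and so it scores zero, while "type" (which appears in only one) scores higher than "typescript" (which appears in two).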

For each blog post that I'm looking for similar posts to, I'll:

take the words from its title
take the words from another post's title
add together all of the TF-IDF scores for words that appear in both titles (the higher the score for each word, the greater the relevance)
repeat until all other post titles have been compared

Taking the example from above that didn't have particularly good similar-post recommendations, the words in its title will have the following scores:

Word	Score
Simple	0.6618375
TypeScript	4.39835453
type	0.7873714
definitions	2.60178781
for	0
AMD	3.81998682
modules	3.96386051

.. so it should be clear that any other titles that contain the word "TypeScript" will be given a boost.
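As a worked example using the scores from that table, the sketch below adds up the scores of the words that another title shares with this one. The "WordsIn" and "Proximity" helpers are hypothetical and split titles very crudely - the real code tokenises titles with Catalyst and normalises some terms, so its numbers need not match these exactly:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Scores copied from the table for "Simple TypeScript type definitions for AMD modules"
var scores = new Dictionary<string, float>(StringComparer.OrdinalIgnoreCase)
{
    ["Simple"] = 0.6618375f,
    ["TypeScript"] = 4.39835453f,
    ["type"] = 0.7873714f,
    ["definitions"] = 2.60178781f,
    ["for"] = 0f,
    ["AMD"] = 3.81998682f,
    ["modules"] = 3.96386051f
};

// Hypothetical helper: crude word splitting, standing in for a proper tokeniser
static HashSet<string> WordsIn(string title) =>
    title
        .Split(new[] { ' ', '(', ')', '.', ',', '/', '!', '?' }, StringSplitOptions.RemoveEmptyEntries)
        .ToHashSet(StringComparer.OrdinalIgnoreCase);

var initialTitleWords = WordsIn("Simple TypeScript type definitions for AMD modules");

// Hypothetical helper: sum the scores of every word that appears in both titles
float Proximity(string otherTitle) =>
    WordsIn(otherTitle)
        .Where(word => initialTitleWords.Contains(word))
        .Sum(word => scores.TryGetValue(word, out var score) ? score : 0f);

foreach (var title in new[]
{
    "Parsing TypeScript definitions (functional-ly.. ish)",
    "TypeScript State Machines",
    "The joys of AutoMapper"
})
{
    Console.WriteLine($"{Proximity(title):0.000} {title}");
}
```

The first title shares both "TypeScript" and "definitions" and so comes out at around 7.0, the second shares only "TypeScript" (around 4.4) and the third shares nothing at all.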

This is by no means a perfect system as there will often be posts whose main topics are similar but whose titles are not. The example from earlier that fastText generated really good similar-post suggestions for is a great illustration of this:

Learning F# via some Machine Learning: The Single Layer Perceptron
How are barcodes read?? (Library-less image processing in C#)
Writing F# to implement 'The Single Layer Perceptron'
Face or no face (finding faces in photos using C# and Accord.NET)

All of them are investigations into some form of machine learning or computer vision but the titles share very little in common. It's likely that the prediction quality of this one will actually suffer a little with the change I'm introducing but I'm looking for an overall improvement, across the entire blog. I'm also not looking for a perfect general solution, I'm trying to find something that works well for my data (again, bearing in mind that there is a relatively small quantity of it as there are only around 120 posts, which doesn't give the computer a huge amount of data to work from).

(It's also worth noting that the way I implement this in my blog is that I maintain two lists - the manually-curated list that I had before that had links for about a dozen posts and a machine-generated list; if there are manual links present then they will be displayed and the auto-generated ones will be hidden - so if I find that I have a particularly awkward post where the machine can't find nice matches then I can always tidy it up myself by manually creating the related-post links for that post)

Implementation

Last time, I had code that was reading and parsing my blog posts into a "postsWithDocuments" list.

After training the fastText model, I'll train a TF-IDF model on all of the documents. I'll then go back round each document again, have this new model "Process" them and retrieve Frequency values for each word. These values allow for a score to be generated - since the scores depend upon how often a word appears in a given document, the scores will vary from one blog post to another and so I'm taking an average score for each distinct word.

(Confession: I'm not 100% sure that this averaging is the ideal approach here but it seems to be doing a good enough job and I'm only fiddling around with things, so good enough should be all that I need)

Console.WriteLine("Training TF-IDF model..");
var tfidf = new TFIDF(pipeline.Language, version: 0, tag: "");
await tfidf.Train(postsWithDocuments.Select(postWithDocument => postWithDocument.Document));

Console.WriteLine("Getting average TF-IDF weights per word..");
var tokenValueTFIDF = new Dictionary<string, List<float>>(StringComparer.OrdinalIgnoreCase);
foreach (var doc in postsWithDocuments.Select(postWithDocument => postWithDocument.Document))
{
    // Calling "Process" on the document updates data on the tokens within the document
    // (specifically, the token.Frequency value)
    tfidf.Process(doc);
    foreach (var sentence in doc)
    {
        foreach (var token in sentence)
        {
            if (!tokenValueTFIDF.TryGetValue(token.Value, out var freqs))
            {
                freqs = new();
                tokenValueTFIDF.Add(token.Value, freqs);
            }
            freqs.Add(token.Frequency);
        }
    }
}
var averagedTokenValueTFIDF = tokenValueTFIDF.ToDictionary(
    entry => entry.Key,
    entry => entry.Value.Average(), StringComparer.OrdinalIgnoreCase
);

Now, with a couple of helper methods:

private static float GetProximityByTitleTFIDF(
    string similarPostTitle,
    HashSet<string> tokenValuesInInitialPostTitle,
    Dictionary<string, float> averagedTokenValueTFIDF,
    Pipeline pipeline)
{
    return GetAllTokensForText(similarPostTitle, pipeline)
        .Where(token => tokenValuesInInitialPostTitle.Contains(token.Value))
        .Sum(token =>
        {
            var tfidfValue = averagedTokenValueTFIDF.TryGetValue(token.Value, out var score)
                ? score
                : 0;
            if (tfidfValue <= 0)
            {
                // Ignore any tokens that report a negative impact (eg. punctuation or
                // really common words like "in")
                return 0;
            }
            return tfidfValue;
        });
}

private static IEnumerable<IToken> GetAllTokensForText(string text, Pipeline pipeline)
{
    var doc = new Document(text, pipeline.Language);
    pipeline.ProcessSingle(doc);
    return doc.SelectMany(sentence => sentence);
}

.. it's possible, for any given post, to sort the titles of the other posts according to how many "interesting" words (and how "interesting" they are) they have in common like this:

// Post 82 on my blog is "Simple TypeScript type definitions for AMD modules"
var post82 = postsWithDocuments.Select(p => p.Post).FirstOrDefault(p => p.ID == 82);
var title = post82.Title;

var tokenValuesInTitle =
    GetAllTokensForText(NormaliseSomeCommonTerms(title), pipeline)
        .Select(token => token.Value)
        .ToHashSet(StringComparer.OrdinalIgnoreCase);
		
var others = postsWithDocuments
    .Select(p => p.Post)
    .Where(p => p.ID != post82.ID)
    .Select(p => new
    {
        Post = p,
        ProximityByTitleTFIDF = GetProximityByTitleTFIDF(
            NormaliseSomeCommonTerms(p.Title),
            tokenValuesInTitle,
            averagedTokenValueTFIDF,
            pipeline
        )
    })
    .OrderByDescending(similarResult => similarResult.ProximityByTitleTFIDF);
	
foreach (var result in others)
    Console.WriteLine($"{result.ProximityByTitleTFIDF:0.000} {result.Post.Title}");

The top 11 scores (after which, everything has a TF-IDF proximity score of zero) are these:

7.183 Parsing TypeScript definitions (functional-ly.. ish)
4.544 TypeScript State Machines
4.544 Writing React components in TypeScript
4.544 TypeScript classes for (React) Flux actions
4.544 TypeScript / ES6 classes for React components - without the hacks!
4.544 Writing a Brackets extension in TypeScript, in Brackets
0.796 A static type system is a wonderful message to the present and future
0.796 A static type system is a wonderful message to the present and future - Supplementary
0.796 Type aliases in Bridge.NET (C#)
0.796 Hassle-free immutable type updates in C#
0.000 I love Immutable Data

So the idea is to then use the fastText similarity score when deciding which of these matches is best.

There are all sorts of ways that these two scoring mechanisms could be combined - eg. I could take the 20 titles with the greatest TF-IDF proximity scores and then order them by similarity (ie. which results the fastText model thinks are best) or I could reverse it and take the 20 titles that fastText thought were best and then take the three with the greatest TF-IDF proximity scores from within those. For now, I'm using the simplest approach and ordering by the TF-IDF scores first and then by the fastText similarity model. So, from the above list, the 7.183-scoring post will be taken first and then 2 out of the 5 posts that have a TF-IDF score of 4.544 will be taken, according to which ones the fastText model thought were more similar.
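That "TF-IDF first, then fastText" ordering is a two-level sort, which might look something like the sketch below. The titles and TF-IDF scores are taken from the list above but the "Similarity" values are made-up placeholders (I'm not reproducing the real fastText output here), so only the shape of the query should be taken seriously:

```csharp
using System;
using System.Linq;

// NOTE: The Similarity values are invented purely to illustrate the tie-breaking
var candidates = new[]
{
    (Title: "TypeScript State Machines", TitleTFIDF: 4.544f, Similarity: 0.8f),
    (Title: "Writing React components in TypeScript", TitleTFIDF: 4.544f, Similarity: 0.5f),
    (Title: "Parsing TypeScript definitions (functional-ly.. ish)", TitleTFIDF: 7.183f, Similarity: 0.6f),
    (Title: "Writing a Brackets extension in TypeScript, in Brackets", TitleTFIDF: 4.544f, Similarity: 0.7f),
    (Title: "Type aliases in Bridge.NET (C#)", TitleTFIDF: 0.796f, Similarity: 0.9f)
};

var topThree = candidates
    .OrderByDescending(candidate => candidate.TitleTFIDF)  // Shared "interesting" words first..
    .ThenByDescending(candidate => candidate.Similarity)   // .. then similarity breaks any ties
    .Take(3)
    .Select(candidate => candidate.Title)
    .ToArray();

foreach (var title in topThree)
    Console.WriteLine(title);
```

Note that the last candidate has the highest (placeholder) similarity score of all but it still misses out, because the TF-IDF score always takes precedence with this approach.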

Again, there are lots of things that could be tweaked and fiddled with - and I imagine that I will experiment with them at some point. The main problem is that I have enough data across my posts that it's tedious looking through the output to try to decide if I've improved things each time I make a change but there isn't enough data that the algorithms have a huge pile of information to work on. Couple that with the fact that training takes a few minutes to run and I have a recipe for frustration if I obsess too much about it. Right now, I'm happy enough with the suggestions and any that I want to manually override, I can do so easily.

Trying the code yourself

If you want to try out the code, you can find a complete sample in the "SimilarityWithTitleTFIDF" project in the solution of this repo: BlogPostSimilarity.

Has it helped?

Let's return to those examples that I started with.

Good suggestions from last time:

Learning F# via some Machine Learning: The Single Layer Perceptron
How are barcodes read?? (Library-less image processing in C#)
Writing F# to implement 'The Single Layer Perceptron'
Face or no face (finding faces in photos using C# and Accord.NET)

Less good suggestions:

Simple TypeScript type definitions for AMD modules
STA ApartmentState with ASP.Net MVC
WCF with JSON (and nullable types)
The joys of AutoMapper

Now, the not-very-good one has improved and has these offered:

Simple TypeScript type definitions for AMD modules
Parsing TypeScript definitions (functional-ly.. ish)
TypeScript State Machines
Writing a Brackets extension in TypeScript, in Brackets

.. but, as I said before, the good suggestions are now not as good as they were:

How are barcodes read?? (Library-less image processing in C#)
Face or no face (finding faces in photos using C# and Accord.NET)
Implementing F#-inspired "with" updates for immutable classes in C#
A follow-up to "Implementing F#-inspired 'with' updates in C#"

There are lots of suggestions that are still very good - eg.

Creating a C# ("Roslyn") Analyser - For beginners by a beginner
Using Roslyn to identify unused and undeclared variables in VBScript WSC components
Locating TODO comments with Roslyn
Using Roslyn code fixes to make the "Friction-less immutable objects in Bridge" even easier

Migrating my Full Text Indexer to .NET Core (supporting multi-target NuGet packages)
Revisiting .NET Core tooling (Visual Studio 2017)
The Full Text Indexer Post Round-up
The NeoCities Challenge! aka The Full Text Indexer goes client-side!

Dependency Injection with a WCF Service
Ramping up WCF Web Service Request Handling.. on IIS 6 with .Net 4.0
Consuming a WCF Web Service from PHP
WCF with JSON (and nullable types)

Translating VBScript into C#
VBScript is DIM
Using Roslyn to identify unused and undeclared variables in VBScript WSC components
If you can keep your head when all about you are losing theirs and blaming it on VBScript

.. but still some less-good suggestions, like:

Auto-releasing Event Listeners
Writing React apps using Bridge.NET - The Dan Way (Part Three)
Persistent Immutable Lists - Extended
Extendable LINQ-compilable Mappers

Problems in Immutability-land
Language detection and words-in-sentence classification in C#
Using Roslyn to identify unused and undeclared variables in VBScript WSC components
Writing a Brackets extension in TypeScript, in Brackets

However, having just looked through the matches to try to find any really awful suggestions, there aren't many that jump out at me. And, informal as that may be as a measure of success, I'm fairly happy with that!

Automating "suggested / related posts" links for my blog posts

Wed, 07 Apr 2021 22:21:00 GMT

TL;DR

Using the same open source .NET library as I did in my last post (Language detection and words-in-sentence classification in C#), I use some of its other machine learning capabilities to automatically generate "you may also be interested in" links to similar posts for any given post on this blog.

The current "You may also be interested in" functionality

This site has always had a way for me to link related posts together - for example, if you scroll to the bottom of "Learning F# via some Machine Learning: The Single Layer Perceptron" then it suggests a link to "Face or no face (finding faces in photos using C# and Accord.NET)" on the basis that you might be super-excited by my fiddlings with computers being trained how to make decisions on their own. But there aren't many of these links because they're something that I have to maintain manually. Firstly, that means that I have to remember / consider every previous post and decide whether it might be worth linking to the new post that I've just finished writing and, secondly, I often just forget.

There are models in the Catalyst library* that make this possible and so I thought that I would see whether I could train it with my blog post data and then incorporate the suggestions into the final content.

* (Again, see my last post for more details on this library and a little blurb about my previous employers who are doing exciting things in the Enterprise Search space)

Specifically, I'll be using the fastText model that was published by Facebook's AI Research lab in 2015 and then rewritten in C# as part of the Catalyst library.

Getting my blog post articles

When I first launched my blog (just over a decade ago), I initially hosted it somewhere as an ASP.NET MVC application. Largely because I wanted to try my hand at writing an MVC app from scratch and fiddling with various settings, I think.. and partly because it felt like the "natural" thing to do, seeing as I was employed as a .NET Developer at the time!

To keep things simple, I had a single text file for each blog post and the filenames were of a particular format containing a unique post ID, date and time of publishing, whether it should appear in the "Highlights" column and any tags that should be associated with it. Like this:

1,2011,3,14,20,14,2,0,Immutability.txt

That's the very first post (it has ID 1), it was published on 2011-03-14 at 20:14:02 and it is not shown in the Highlights column (hence the final zero). It has a single tag of "Immutability". Although it has a ".txt" extension, it's actually markdown content, so ".md" would have been more logical (the reason why I chose ".txt" over ".md" will likely remain forever lost in the mists of time!)
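For illustration, pulling those values back out of a filename could look something like the sketch below. The "ParsePostFilename" function is hypothetical (it's not lifted from the blog's code) and it assumes that everything after the eighth comma-separated segment is a tag:

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Linq;

static (int Id, DateTime Published, bool IsHighlight, string[] Tags) ParsePostFilename(string filename)
{
    var segments = Path.GetFileNameWithoutExtension(filename).Split(',');
    if (segments.Length < 8)
        throw new FormatException($"Unexpected filename format: {filename}");

    // The first eight segments are all numeric: id, year, month, day, hour, minute, second, highlight
    var numbers = segments
        .Take(8)
        .Select(segment => int.Parse(segment, CultureInfo.InvariantCulture))
        .ToArray();

    return (
        Id: numbers[0],
        Published: new DateTime(numbers[1], numbers[2], numbers[3], numbers[4], numbers[5], numbers[6]),
        IsHighlight: numbers[7] != 0,
        Tags: segments.Skip(8).ToArray() // Assumption: any remaining segments are tags
    );
}

var post = ParsePostFilename("1,2011,3,14,20,14,2,0,Immutability.txt");
Console.WriteLine($"Post {post.Id}, published {post.Published:yyyy-MM-dd HH:mm:ss}");
Console.WriteLine($"Highlight: {post.IsHighlight}, Tags: {string.Join(", ", post.Tags)}");
```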

A couple of years later, I came across the project neocities.org and thought that it was a cool idea and did some (perhaps slightly hacky) work to make things work as a static site (including pushing the search logic entirely to the client) as described in The NeoCities Challenge!.

Some more years later, GitHub Pages started supporting custom domains over HTTPS (in May 2018 according to this) and so, having already moved web hosts once due to wildly inconsistent performance from the first provider, I decided to use this to-static-site logic and start publishing via GitHub Pages.

This is a long-winded way of saying that, although I publish my content these days as a static site, I write new content by running the original blog app locally and then turning it into static content later. Meaning that the original individual post files are available in the ASP.NET MVC Blog GitHub repo here:

github.com/ProductiveRage/Blog/tree/master/Blog/App_Data/Posts

Therefore, if you were sufficiently curious and wanted to play along at home, you can also access the original markdown files for my blog posts and see if you can reproduce my results.

Following shortly is some code to do just that. GitHub has an API that allows you to query folder contents and so we can get a list of blog post files without having to do anything arduous like clone the entire repo or trying to scrape the information from the site or even creating an authenticated API access application because GitHub allows us rate-limited non-authenticated access for free! Once we have the list of files, each will have a "download_url" that we can retrieve the raw content from.

To get the list of blog post files, you would call:

api.github.com/repos/ProductiveRage/Blog/contents/Blog/App_Data/Posts?ref=master

.. and get results that look like this:

[
  {
    "name": "1,2011,3,14,20,14,2,0,Immutability.txt",
    "path": "Blog/App_Data/Posts/1,2011,3,14,20,14,2,0,Immutability.txt",
    "sha": "b243ea15c891f73550485af27fa06dd1ccb8bf45",
    "size": 18965,
    "url": "https://api.github.com/repos/ProductiveRage/Blog/contents/Blog/App_Data/Posts/1,2011,3,14,20,14,2,0,Immutability.txt?ref=master",
    "html_url": "https://github.com/ProductiveRage/Blog/blob/master/Blog/App_Data/Posts/1,2011,3,14,20,14,2,0,Immutability.txt",
    "git_url": "https://api.github.com/repos/ProductiveRage/Blog/git/blobs/b243ea15c891f73550485af27fa06dd1ccb8bf45",
    "download_url": "https://raw.githubusercontent.com/ProductiveRage/Blog/master/Blog/App_Data/Posts/1%2C2011%2C3%2C14%2C20%2C14%2C2%2C0%2CImmutability.txt",
    "type": "file",
    "_links": {
      "self": "https://api.github.com/repos/ProductiveRage/Blog/contents/Blog/App_Data/Posts/1,2011,3,14,20,14,2,0,Immutability.txt?ref=master",
      "git": "https://api.github.com/repos/ProductiveRage/Blog/git/blobs/b243ea15c891f73550485af27fa06dd1ccb8bf45",
      "html": "https://github.com/ProductiveRage/Blog/blob/master/Blog/App_Data/Posts/1,2011,3,14,20,14,2,0,Immutability.txt"
    }
  },
  {
    "name": "10,2011,8,30,19,06,0,0,Mercurial.txt",
    "path": "Blog/App_Data/Posts/10,2011,8,30,19,06,0,0,Mercurial.txt",
    "sha": "ab6cf2fc360948212e29c64d9c886b3dbfe0d6fc",
    "size": 3600,
    "url": "https://api.github.com/repos/ProductiveRage/Blog/contents/Blog/App_Data/Posts/10,2011,8,30,19,06,0,0,Mercurial.txt?ref=master",
    "html_url": "https://github.com/ProductiveRage/Blog/blob/master/Blog/App_Data/Posts/10,2011,8,30,19,06,0,0,Mercurial.txt",
    "git_url": "https://api.github.com/repos/ProductiveRage/Blog/git/blobs/ab6cf2fc360948212e29c64d9c886b3dbfe0d6fc",
    "download_url": "https://raw.githubusercontent.com/ProductiveRage/Blog/master/Blog/App_Data/Posts/10%2C2011%2C8%2C30%2C19%2C06%2C0%2C0%2CMercurial.txt",
    "type": "file",
    "_links": {
      "self": "https://api.github.com/repos/ProductiveRage/Blog/contents/Blog/App_Data/Posts/10,2011,8,30,19,06,0,0,Mercurial.txt?ref=master",
      "git": "https://api.github.com/repos/ProductiveRage/Blog/git/blobs/ab6cf2fc360948212e29c64d9c886b3dbfe0d6fc",
      "html": "https://github.com/ProductiveRage/Blog/blob/master/Blog/App_Data/Posts/10,2011,8,30,19,06,0,0,Mercurial.txt"
    }
  },
  ..

While the API is rate-limited, retrieving content via the "download_url" locations is not - so we can make a single API call for the list and then download all of the individual files that we want.

Note that there are a couple of files in that folder that are NOT blog posts (such as the "RelatedPosts.txt" file, which is the way that I manually associate "You may also be interested in" posts) and so each filename will have to be checked to ensure that it matches the format shown above.

The title of the blog post is not in the file name; it is always the first line of the content in the file (to obtain it, we'll need to process the file as markdown content, convert it to plain text and then look at that first line).

private static async Task<IEnumerable<BlogPost>> GetBlogPosts()
{
    // Note: The GitHub API is rate limited quite severely for non-authenticated apps, so we just
    // call it once for the list of files and then retrieve them all further down via the Download
    // URLs (which don't count as API calls). Still, if you run this code repeatedly and start
    // getting 403 "rate limited" responses then you might have to hold off for a while.
    string namesAndUrlsJson;
    using (var client = new WebClient())
    {
        // The API refuses requests without a User Agent, so set one before calling (see
        // https://docs.github.com/en/rest/overview/resources-in-the-rest-api#user-agent-required)
        client.Headers.Add(HttpRequestHeader.UserAgent, "ProductiveRage Blog Post Example");
        namesAndUrlsJson = await client.DownloadStringTaskAsync(new Uri(
            "https://api.github.com/repos/ProductiveRage/Blog/contents/Blog/App_Data/Posts?ref=master"
        ));
    }

    // Deserialise the response into an array of entries that have Name and Download_Url properties
    var namesAndUrls = JsonConvert.DeserializeAnonymousType(
        namesAndUrlsJson,
        new[] { new { Name = "", Download_Url = (Uri)null } }
    );

    return await Task.WhenAll(namesAndUrls
        .Select(entry =>
        {
            var fileNameSegments = Path.GetFileNameWithoutExtension(entry.Name).Split(",");
            if (fileNameSegments.Length < 8)
                return default;
            if (!int.TryParse(fileNameSegments[0], out var id))
                return default;
            var dateContent = string.Join(",", fileNameSegments.Skip(1).Take(6));
            if (!DateTime.TryParseExact(dateContent, "yyyy,M,d,H,m,s", default, default, out var date))
                return default;
            return (PostID: id, PublishedAt: date, entry.Download_Url);
        })
        .Where(entry => entry != default)
        .Select(async entry =>
        {
            // Read the file content as markdown and parse into plain text (the first line of which
            // will be the title of the post)
            string markdown;
            using (var client = new WebClient())
            {
                markdown = await client.DownloadStringTaskAsync(entry.Download_Url);
            }
            var plainText = Markdown.ToPlainText(markdown);
            var title = plainText.Replace("\r\n", "\n").Replace('\r', '\n').Split('\n').First();
            return new BlogPost(entry.PostID, title, plainText, entry.PublishedAt);
        })
    );
}

private sealed class BlogPost
{
    public BlogPost(int id, string title, string plainTextContent, DateTime publishedAt)
    {
        ID = id;
        Title = !string.IsNullOrWhiteSpace(title)
            ? title
            : throw new ArgumentException("may not be null, blank or whitespace-only");
        PlainTextContent = !string.IsNullOrWhiteSpace(plainTextContent)
            ? plainTextContent
            : throw new ArgumentException("may not be null, blank or whitespace-only");
        PublishedAt = publishedAt;
    }

    public int ID { get; }
    public string Title { get; }
    public string PlainTextContent { get; }
    public DateTime PublishedAt { get; }
}

(Note: I use the Markdig library to process markdown)

Training a FastText model

This raw blog post content needs to be transformed into Catalyst "documents", then tokenised (split into individual sentences and words), then fed into a FastText model trainer.

Before getting to the code, I want to discuss a couple of oddities coming up. Firstly, Catalyst documents are required to train the FastText model and each document instance must be uniquely identified by a UID128 value, which is fine because we can generate them from the Title text of each blog post using the "Hash128()" extension method in Catalyst. However (as we'll see a bit further down), when you ask for vectors* from the FastText model for the processed documents, each vector comes with a "Token" string that is the ID of the source document - so that has to be parsed back into a UID128. I'm not quite sure why the "Token" value isn't also a UID128 but it's no massive deal.

* (Vectors are just 1D arrays of floating point values - the FastText algorithm does magic to produce vectors that represent the text of the documents such that the distance between them can be compared; the length of these arrays is determined by the "Dimensions" option shown below and shorter distances between vectors suggest more similar content)
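
To make "distance between vectors" a little more concrete, here is a minimal sketch of cosine distance, which is the measure that the comparisons later in this post rely on. The "VectorMaths" class and its method name are made up for illustration; this is not a claim about how the FastText or HNSW libraries calculate it internally.

```csharp
using System;

public static class VectorMaths
{
    // Cosine distance is 1 minus the cosine of the angle between the vectors: 0 when they point the
    // same way, 1 when they are at right angles and 2 when they point in opposite directions.
    // Note: A vector of all zeroes has no direction and so the result would be NaN.
    public static float CosineDistance(float[] x, float[] y)
    {
        if (x.Length != y.Length)
            throw new ArgumentException("Vectors must have the same number of dimensions");

        double dotProduct = 0, sumOfSquaresX = 0, sumOfSquaresY = 0;
        for (var i = 0; i < x.Length; i++)
        {
            dotProduct += x[i] * y[i];
            sumOfSquaresX += x[i] * x[i];
            sumOfSquaresY += y[i] * y[i];
        }
        var cosineSimilarity = dotProduct / (Math.Sqrt(sumOfSquaresX) * Math.Sqrt(sumOfSquaresY));
        return (float)(1 - cosineSimilarity);
    }
}
```

Only the direction of each vector matters here, not its length, which is why two documents of very different sizes can still come out as close neighbours.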

Next, there are the FastText settings that I've used. The Catalyst README has some code near the bottom for training a FastText embedding model but I didn't have much luck with the default options. Firstly, when I used the "FastText.ModelType.CBow" option then I didn't get any vectors generated and so I tried changing it to "FastText.ModelType.PVDM" and things started looking promising. Then I fiddled with some of the other settings. Some of which I have a rough idea what they mean and some, erm.. not so much.

The settings that I ended up using are these:

var fastText = new FastText(language, version: 0, tag: "");
fastText.Data.Type = FastText.ModelType.PVDM;
fastText.Data.Loss = FastText.LossType.NegativeSampling;
fastText.Data.IgnoreCase = true;
fastText.Data.Epoch = 50;
fastText.Data.Dimensions = 512;
fastText.Data.MinimumCount = 1;
fastText.Data.ContextWindow = 10;
fastText.Data.NegativeSamplingCount = 20;

I already mentioned changing the Data.Type / ModelType and the LossType ("NegativeSampling") is the value shown in the README. Then I felt like an obvious one to change was IgnoreCase, since that defaults to false and I think that I want it to be true - I don't care about the casing in any words when it's parsing my posts' content.

Now the others.. well, this library is built to work with systems with 10s or 100s of 1,000s of documents and that is a LOT more data than I have (currently around 120 blog posts) and so I made a few tweaks based on that. The "Epoch" count is the number of iterations that the training process will go through when constructing its model - by default, this is only 5 but I have limited data (meaning there's less for it to learn from but also that it's faster to complete each iteration) and so I bumped that up to 50. Then "Dimensions" is the size of the vectors generated - again, I figured that with limited data I would want a higher value and so I picked 512 (a nice round number if you're geeky enough) over the default 200. The "MinimumCount", I believe, relates to how often a word may appear and it defaults to 5 so I pulled it down to 1. The "ContextWindow" is (again, I think) how far to either side of any word that the process will look at in order to determine context - the larger the value, the more expensive the calculation; I bumped this from the default 5 up to 10. Then there's the "NegativeSamplingCount" value.. I have to just put my hands up and say that I have no idea what that actually does, only that I seemed to be getting better results with a value of 20 than I was with the default of 10.

With machine learning, there is almost always going to be some value to tweaking options (the "hyperparameters", if we're being all fancy) like this when building a model. Depending upon the model and the library, the defaults can be good for the general case but my tiny data set is not really what this library was intended for. Of course, machine learning experts have more idea what they're tweaking and (sometimes, at least) hopefully what results they'll get.. but I'm happy enough with where I've ended up with these.

This talk about what those machine learning experts do brings me on to the final thing that I wanted to talk about before showing the code; a little pre-processing / data-massaging. The better the data is that goes in, generally the better the results that come out will be. So another less glamorous part of the life of a Data Scientist is cleaning up data for training models.

In my case, that only extended to noticing that a few terms didn't seem to be getting recognised as essentially being the same thing and so I wanted to give it a little hand - for example, a fair number of my posts are about my "Full Text Indexer" project and so it probably makes sense to replace any instances of that string with a single concatenated word "FullTextIndexer". And I have a range of posts about React but I didn't want it to get confused with the verb "react" and so I replaced any "React" occurrence with "ReactJS" (now, this probably means that some "React" verb occurrences were incorrectly changed but I made the replacements of this word in a case-sensitive manner and felt like I would have likely used it as the noun more often than a verb with a capital letter due to the nature of my posts).

So I have a method to tidy up the plain text content a little:

private static string NormaliseSomeCommonTerms(string text) => text
    .Replace(".NET", "NET", StringComparison.OrdinalIgnoreCase)
    .Replace("Full Text Indexer", "FullTextIndexer", StringComparison.OrdinalIgnoreCase)
    .Replace("Bridge.net", "BridgeNET", StringComparison.OrdinalIgnoreCase)
    .Replace("React", "ReactJS");
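
One limitation of a plain string replace is that it also matches inside longer words, so "Reactive" would become "ReactJSive" and any existing "ReactJS" would become "ReactJSJS". If that mattered, a whole-word regular expression would be one way around it. The sketch below is an alternative rather than what the method above does, and the "TermNormaliser" class is a hypothetical name:

```csharp
using System.Text.RegularExpressions;

public static class TermNormaliser
{
    // "\b" matches a word boundary, so "React" is only replaced when it appears as a whole word.
    // No RegexOptions.IgnoreCase here, which keeps the match case-sensitive (so the lower case
    // verb "react" is left alone).
    private static readonly Regex _wholeWordReact = new Regex(@"\bReact\b", RegexOptions.Compiled);

    public static string NormaliseReact(string text) => _wholeWordReact.Replace(text, "ReactJS");
}
```

With this, "React and Reactive" becomes "ReactJS and Reactive", while text that already says "ReactJS" is left untouched.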

Now let's get training!

Console.WriteLine("Reading posts from GitHub repo..");
var posts = await GetBlogPosts();

Console.WriteLine("Parsing documents..");
Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));
var language = Language.English;
var pipeline = Pipeline.For(language);
var postsWithDocuments = posts
    .Select(post =>
    {
        var document = new Document(NormaliseSomeCommonTerms(post.PlainTextContent), language)
        {
            UID = post.Title.Hash128()
        };
        pipeline.ProcessSingle(document);
        return (Post: post, Document: document);
    })
    .ToArray(); // Call ToArray to force evaluation of the document processing now

Console.WriteLine("Training FastText model..");
var fastText = new FastText(language, version: 0, tag: "");
fastText.Data.Type = FastText.ModelType.PVDM;
fastText.Data.Loss = FastText.LossType.NegativeSampling;
fastText.Data.IgnoreCase = true;
fastText.Data.Epoch = 50;
fastText.Data.Dimensions = 512;
fastText.Data.MinimumCount = 1;
fastText.Data.ContextWindow = 10;
fastText.Data.NegativeSamplingCount = 20;
fastText.Train(
    postsWithDocuments.Select(postWithDocument => postWithDocument.Document),
    trainingStatus: update => Console.WriteLine($" Progress: {update.Progress}, Epoch: {update.Epoch}")
);

Identifying similar documents using the model

Now that a model has been built that can represent all of my blog posts as vectors, we need to go through those post / vector combinations and, for each one, identify the others that are most similar to it.

This will be achieved by using the HNSW.NET NuGet package that enables K-Nearest Neighbour (k-NN) searches over "high-dimensional space"*.

* (This just means that the vectors are relatively large; 512 in this case - two dimensions would be a point on a flat plane, three dimensions would be a physical point in space, anything with more dimensions than that is in "higher-dimensional space".. though that's not to say that any more than three dimensions is definitely a bad fit for a regular k-NN search but 512 dimensions IS going to be a bad fit and the HNSW approach will be much more efficient)
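
For comparison, a "regular" brute-force k-NN search could look something like the sketch below: measure the distance from the query to every single vector, sort, and take the closest k. The "BruteForceSearch" class is hypothetical (it is not part of HNSW.NET) and the distance function is supplied by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BruteForceSearch
{
    // Returns the indexes of the k vectors that are closest to the query, nearest first. Every
    // vector is scored on every call, so the work grows with both the number of vectors and the
    // number of dimensions in each one.
    public static int[] NearestNeighbours(
        IReadOnlyList<float[]> vectors,
        float[] query,
        int k,
        Func<float[], float[], float> distance)
    {
        return Enumerable.Range(0, vectors.Count)
            .OrderBy(index => distance(vectors[index], query))
            .Take(k)
            .ToArray();
    }
}
```

For a hundred or so blog posts this would be perfectly quick; the graph-based approach earns its keep as the number of items grows.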

There are useful examples on the README about "How to build a graph?" and "How to run k-NN search?" and tweaking those for the data that I have so far leads to this:

Console.WriteLine("Building recommendations..");

// Combine the blog post data with the FastText-generated vectors
var results = fastText
    .GetDocumentVectors()
    .Select(result =>
    {
        // Each document vector instance will include a "token" string that may be mapped back to the
        // UID of the document for each blog post. If there were a large number of posts to deal with
        // then a dictionary to match UIDs to blog posts would be sensible for performance but I only
        // have 100+ and so a LINQ "First" scan over the list will suffice.
        var uid = UID128.Parse(result.Token);
        var postForResult = postsWithDocuments.First(
            postWithDocument => postWithDocument.Document.UID == uid
        );
        return (UID: uid, result.Vector, postForResult.Post);
    })
    .ToArray(); // ToArray since we enumerate multiple times below

// Construct a graph to search over, as described at
// https://github.com/curiosity-ai/hnsw-sharp#how-to-build-a-graph
var graph = new SmallWorld<(UID128 UID, float[] Vector, BlogPost Post), float>(
    distance: (to, from) => CosineDistance.NonOptimized(from.Vector, to.Vector),
    DefaultRandomGenerator.Instance,
    new() { M = 15, LevelLambda = 1 / Math.Log(15) }
);
graph.AddItems(results);

// For every post, use the "KNNSearch" method on the graph to find the three most similar posts
const int maximumNumberOfResultsToReturn = 3;
var postsWithSimilarResults = results
    .Select(result =>
    {
        // Request one result too many from the KNNSearch call because it's expected that the original
        // post will come back as the best match and we'll want to exclude that
        var similarResults = graph
            .KNNSearch(result, maximumNumberOfResultsToReturn + 1)
            .Where(similarResult => similarResult.Item.UID != result.UID)
            .Take(maximumNumberOfResultsToReturn); // Just in case the original post wasn't included

        return new
        {
            result.Post,
            Similar = similarResults
                .Select(similarResult => new
                {
                    similarResult.Id,
                    similarResult.Item.Post,
                    similarResult.Distance
                })
                .ToArray()
        };
    })
    .OrderBy(result => result.Post.Title, StringComparer.OrdinalIgnoreCase)
    .ToArray();

And with that, there is a list of every post from my blog and a list of the three blog posts most similar to it!

Well, "most similar" according to the model that we trained and the hyperparameters that we used to do so. As with many machine learning algorithms, it will have started from a random state and tweaked and tweaked until it's time for it to stop (based upon the "Epoch" value in this FastText case) and so the results each time may be a little different.

However, if we inspect the results like this:

foreach (var postWithSimilarResults in postsWithSimilarResults)
{
    Console.WriteLine();
    Console.WriteLine(postWithSimilarResults.Post.Title);
    foreach (var similarResult in postWithSimilarResults.Similar.OrderBy(other => other.Distance))
        Console.WriteLine($"{similarResult.Distance:0.000} {similarResult.Post.Title}");
}

.. then there are some good results to be found! Like these:

Learning F# via some Machine Learning: The Single Layer Perceptron
0.229 How are barcodes read?? (Library-less image processing in C#)
0.236 Writing F# to implement 'The Single Layer Perceptron'
0.299 Face or no face (finding faces in photos using C# and AccordNET)

Translating VBScript into C#
0.257 VBScript is DIM
0.371 If you can keep your head when all about you are losing theirs and blaming it on VBScript
0.384 Using Roslyn to identify unused and undeclared variables in VBScript WSC components

Writing React components in TypeScript
0.376 TypeScript classes for (React) Flux actions
0.378 React and Flux with DuoCode
0.410 React (and Flux) with Bridge.net

However, there are also some less good ones - like these:

A static type system is a wonderful message to the present and future
0.271 STA ApartmentState with ASP.Net MVC
0.291 CSS Minification Regular Expressions
0.303 Publishing RSS

Simple TypeScript type definitions for AMD modules
0.162 STA ApartmentState with ASP.Net MVC
0.189 WCF with JSON (and nullable types)
0.191 The joys of AutoMapper

Supporting IDispatch through the COMInteraction wrapper
0.394 A static type system is a wonderful message to the present and future
0.411 TypeScript State Machines
0.414 Simple TypeScript type definitions for AMD modules

Improving the results

I'd like to get this good enough that I can include auto-generated recommendations on my blog and I don't feel like the consistency in quality is there yet. If they were all like the good examples then I'd be ploughing ahead right now with enabling it! But there are mediocre examples as well as those poorer ones above.

It's quite possible that I could get closer by experimenting with the hyperparameters more but that does tend to get tedious when you have to analyse the output of each run manually - looking through all the 120-ish post titles and deciding whether the supposed best matches are good or not. It would be lovely if I could concoct some sort of metric of "goodness" and then have the computer try lots of variations of parameters but one of the downsides of having relatively little data is that that is difficult*.

* (On the flip side, if I had 1,000s of blog posts as source data then the difficult part would be manually labelling enough of them as "quite similar" in numbers sufficient for the computer to know if it's done better or done worse with each experiment)
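
To give a rough idea of what such a "goodness" metric might look like if some labels did exist, here is a sketch that scores a set of recommendations against hand-labelled "these posts are similar" data. Everything in it is an assumption for illustration: the "RecommendationScoring" class, the use of post IDs as keys and the choice of metric (the average fraction of each post's recommendations that were labelled as similar).

```csharp
using System.Collections.Generic;
using System.Linq;

public static class RecommendationScoring
{
    // For each post that has both recommendations and labels, calculate the fraction of its
    // recommendations that appear in the labelled set, then average those fractions. The result
    // ranges from 0 (no recommendation was labelled as similar) to 1 (every one of them was).
    public static double MeanPrecision(
        IReadOnlyDictionary<int, int[]> recommendedPostIds,
        IReadOnlyDictionary<int, HashSet<int>> labelledSimilarPostIds)
    {
        var scores = recommendedPostIds
            .Where(entry => entry.Value.Length > 0 && labelledSimilarPostIds.ContainsKey(entry.Key))
            .Select(entry =>
            {
                var labelled = labelledSimilarPostIds[entry.Key];
                var numberOfHits = entry.Value.Count(id => labelled.Contains(id));
                return (double)numberOfHits / entry.Value.Length;
            })
            .ToArray();
        return scores.Length == 0 ? 0 : scores.Average();
    }
}
```

A number like this could then be compared between training runs that use different hyperparameters, though it would only ever be as trustworthy as the labels behind it.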

Fortunately, I have another trick up my sleeve - but I'm going to leave that for next time! This post is already more than long enough, I think. The plan is to combine results from another model in the Catalyst library with the FastText results and see if I can encourage things to look a bit neater.

Trying the code if you're lazy

If you want to try fiddling with this code but don't want to copy-paste the sections above into a new project, you can find the complete sample in the "Similarity" project in the solution of this repo: BlogPostSimilarity.

Language detection and words-in-sentence classification in C#

Tue, 09 Mar 2021 19:52:00 GMT

TL;(BG)DR

Using an open source .NET library, it's easy to determine what language a sentence / paragraph / document is written in and to then classify the words in each sentence into verbs, nouns, etc..

What library?

I recently parted ways on very good terms with my last employers (and friends!) at Curiosity AI but that doesn't mean that I'm not still excited by their technology, some really useful aspects of which they have released as open source*.

* (For the full service, ask yourself if your team or your company have ever struggled to find some information that you know exists somewhere but that might be in one of your network drives containing 10s of 1,000s of files or in your emails or in Sharepoint or GDrive somewhere - with Curiosity, you can set up a system that will index all that data so that it's searchable in one place, as well as learning synonyms and abbreviations in case you can't conjure up the precise terms to search for. It can even find similar documents for those cases where you have one document to hand and just know that there's another related to it but are struggling to find it - plus it has an ingrained permissions model so that your team could all index their emails and GDrive files and be secure in the knowledge that only they and people that they've shared the files with can see them; they don't get pulled in in such a way that your private, intimate, confidential emails are now visible to everyone!)

I have a little time off between jobs and so I wanted to write a little bit about some of the open-sourced projects that they released that I think are cool.

This first one is a really simple example but I think that it demonstrates how easily you can access capabilities that are pretty impressive.

This is my cat Cass:

She looks so cute that you'd think butter wouldn't melt. But, of my three cats, she is the prime suspect for the pigeon carcass that was recently dragged through the cat flap one night, up a flight of stairs and deposited outside my home office - and, perhaps not coincidentally, a mere six feet away from where she'd recently made herself a cosy bed in a duvet cover that I'd left out to remind myself to wash.

I think that it's a fair conclusion to draw that:

My cat Cass is a lovely fluffy little pigeon-killer!

Now you and I can easily see that that is a sentence written in English. But if you wanted a computer to work it out, how would you go about it?

Well, one way would be to install the Catalyst NuGet package and write the following code:

using System;
using System.IO;
using System.Threading.Tasks;
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;
using Version = Mosaik.Core.Version;

namespace CatalystExamples
{
    internal static class Program
    {
        private static async Task Main()
        {
            const string text = "My cat Cass is a lovely fluffy little pigeon-killer!";

            Console.WriteLine("Downloading/reading language detection models..");
            const string modelFolderName = "catalyst-models";
            if (!new DirectoryInfo(modelFolderName).Exists)
                Console.WriteLine("- Downloading for the first time, so this may take a little while");
            
            Storage.Current = new OnlineRepositoryStorage(new DiskStorage(modelFolderName));
            var languageDetector = await FastTextLanguageDetector.FromStoreAsync(
                Language.Any,
                Version.Latest,
                ""
            );
            Console.WriteLine();

            var doc = new Document(text);
            languageDetector.Process(doc);

            Console.WriteLine(text);
            Console.WriteLine($"Detected language: {doc.Language}");
        }
    }
}

Running this code will print the following to the console:

Downloading/reading language detection models..
- Downloading for the first time, so this may take a little while

My cat Cass is a lovely fluffy little pigeon-killer!
Detected language: English

Just to prove that it doesn't only detect English, I ran the sentence through Google Translate to get a German version (unfortunately, the languages I'm fluent in are only English and a few computer languages and so Google Translate was very much needed!) - thus changing the "text" definition to:

const string text = "Meine Katze Cass ist eine schöne flauschige kleine Taubenmörderin!";

Running the altered program results in the following console output:

Downloading/reading language detection models..

Meine Katze Cass ist eine schöne flauschige kleine Taubenmörderin!
Detected language: German

Great success!

The next thing that we can do is analyse the grammatical constructs of the sentence. I'm going to return to the English version for this because it will be easier for me to be confident that the word classifications are correct.

Add the following code immediately after the Console.WriteLine calls in the Main method from earlier:

Console.WriteLine();
Console.WriteLine($"Downloading/reading part-of-speech model for {doc.Language}..");
var pipeline = await Pipeline.ForAsync(doc.Language);
pipeline.ProcessSingle(doc);
foreach (var sentence in doc)
{
    foreach (var token in sentence)
        Console.WriteLine($"{token.Value.PadRight(20)}{token.POS}"); // PadRight copes with 20+ character tokens
}

The program will now write the following to the console:

Downloading/reading language detection models..

My cat Cass is a lovely fluffy little pigeon-killer!
Detected language: English

Downloading/reading part-of-speech model for English..
My                  PRON
cat                 NOUN
Cass                PROPN
is                  AUX
a                   DET
lovely              ADJ
fluffy              ADJ
little              ADJ
pigeon-killer       NOUN
!                   PUNCT

The "Part of Speech" (PoS) categories shown above are (as quoted from universaldependencies.org/u/pos/all.html) -

Word(s)	Code	Name	Description
My	PRON	Pronoun	words that substitute for nouns or noun phrases, whose meaning is recoverable from the linguistic or extralinguistic context
cat, pigeon-killer	NOUN	Noun	a part of speech typically denoting a person, place, thing, animal or idea
Cass	PROPN	Proper Noun	a noun (or nominal content word) that is the name (or part of the name) of a specific individual, place, or object
is	AUX	Auxiliary Verb	a function word that accompanies the lexical verb of a verb phrase and expresses grammatical distinctions not carried by the lexical verb, such as person, number, tense, mood, aspect, voice or evidentiality
a	DET	Determiner	words that modify nouns or noun phrases and express the reference of the noun phrase in context
lovely, fluffy, little	ADJ	Adjective	words that typically modify nouns and specify their properties or attributes
!	PUNCT	Punctuation	non-alphabetical characters and character groups used in many languages to delimit linguistic units in printed text

How easy was that?! There are a myriad of uses for this sort of analysis (one of the things that the full Curiosity system uses it for is identifying nouns throughout documents and creating tags that any documents sharing a given noun are linked via; so if you found one document about "Flux Capacitors" then you could easily identify all of the other documents / emails / memos that mentioned it - though that really is just the tip of the iceberg).
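
As a very rough illustration of that noun-tagging idea (and certainly not how Curiosity actually implements it), the sketch below builds a lookup from each noun to the set of documents that mention it. To keep it free of any library dependency, it takes already-classified words as plain (word, tag) pairs rather than Catalyst tokens; the "NounIndex" class and the shape of its input are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NounIndex
{
    // Maps each noun (compared case-insensitively) to the names of the documents that contain it.
    // Only words tagged NOUN or PROPN are indexed; everything else is ignored.
    public static Dictionary<string, HashSet<string>> Build(
        IEnumerable<(string DocumentName, IEnumerable<(string Word, string Tag)> Tokens)> documents)
    {
        var index = new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase);
        foreach (var (documentName, tokens) in documents)
        {
            foreach (var (word, tag) in tokens.Where(token => token.Tag == "NOUN" || token.Tag == "PROPN"))
            {
                if (!index.TryGetValue(word, out var documentNames))
                    index[word] = documentNames = new HashSet<string>();
                documentNames.Add(documentName);
            }
        }
        return index;
    }
}
```

Looking up a noun in the resulting dictionary then returns every document that shares it, which is the "linked via a tag" behaviour described above in miniature.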

Very minor caveats

I have only a couple of warnings before signing off this post. I've seen the language detector get confused if it has very little data to work with (a tiny sentence fragment, for example) or if there is a document that has different sections written in multiple languages - but I don't think that either case is unreasonable; the library is very clever but it can't perform magic!

Coming soon

I've got another post relating to their open-sourced libraries in the pipeline and, hopefully, I'll get that out this week! Let's just say that I'm hoping that my days of having to manually maintain the "you may also be interested" links between my posts will soon be behind me!

Monitoring my garden's limited sunlight time period with an Arduino (and some tupperware)

Sat, 22 Aug 2020 21:34:00 GMT

My house has a lovely little garden out front. The house and garden itself are elevated one storey above the street (and so my basement is really more of a bizarre ground floor because it has natural light windows but is full of dust and my life-accumulated rubbish is in one room of it while my covid-times "trying to stay fit, not fat" home gym is in the other) and there was no fence around it when I moved in. Meaning that the interesting characters that amble past (suffice to say that I went for a nicer house in a slightly on-the-cusp between classy and rougher neighbourhoods as opposed to a less nice house in a posh place) could see in and converse between sips on their 9am double-strength lager. Once fenced off, kitted out with a cute little table and chairs that my friendly neighbours found at a tip and with some lovely raised flower beds installed, it is a delight in Summer.. only problem is that my house faces the wrong way and so only gets direct sunlight at certain hours of the day. And this time period varies greatly depending upon the time of year - in March, it might not get the light until almost 5pm whilst in July and August it's getting warm and light and beautiful (well, on the days that English weather allows) more in time for a late lunch.

The problem is that, even after four years here, I still don't really have any idea when it's going to be sunny there for a given time of year and I want to be able to plan opportunities around it - late evening drinks outside with friends, lunch time warm weather meals for myself, just any general chance to top up my vitamin D!

I guess that one way to sort this out would be to just keep an eye out on sunny days and take the opportunity whenever it strikes. A more organised plan would be to start a little diary and mark down every fortnight or so through the year when the sun hits the garden and when it leaves.

But I work in technology, dammit, and so I expect to be able to solve this using electronics and magic! (Cue comments about everything looking like a nail when you're holding a hammer).

To be really honest, maybe I'm describing this situation back to front. My friend gave me an Arduino UNO r3 because he had a kit spare from the coding club that he runs for kids locally and I'd been looking for a use for it.. and this seemed like it!

What I needed

Being a total Arduino noob (and, since my Electronics GCSE was over 20 years ago now, I'm basically a total hardware noob.. you should have seen the trouble that I had trying to build a custom PC a few years ago; I swear it was easier when I was 14!) I wanted something nice and simple to begin with.

So I had the starter kit, which included the Arduino board and some jumper cables, a prototyping breadboard and some common components (including, essentially, a photoresistor) and so I figured that all I'd then need is a way to record the light levels periodically, a power source and some sort of container for when it rains (again; England).

I considered having some sort of fancy wifi server in it that would record the values somehow and let me either poll it from somewhere else or have it push those results to the cloud somewhere but eventually decided to go for what seemed like a simpler, more robust and (presumably) more power efficient mechanism of storing the light values throughout the day - using an SD card. Because I'd got the kit for free (on the agreement that I would try to do something useful with it), I was looking for something cheap to write to an SD card that I'd had lying around since.. well, I guess since whenever SD cards were useful. Could it have been a digital camera? The very concept seems absurd these days, with the quality of camera that even phones from three or four generations ago have.

I came across something called a "Deek Robot SD/RTC datalogging shield" that would not only write to an SD card but would also keep time due to a small battery mounted on it.

These are cheap (mine was less than £5 delivered, new from eBay) but documentation is somewhat.. spotty. There is a lot of documentation for the "Adafruit Assembled Data Logging shield" but they cost more like £13+ and I was looking for the cheap option. Considering how much time I spent trying to make it work and find good information, it probably would have made more sense to buy a better supported shield than a knock-off from somewhere.. but I did get it working eventually, so I'll share all the code throughout this post!

Note: I found a warning that when using this particular shield, "If you have a UNO with a USB type B connector this shield may NOT WORK because the male pins are NOT LONG ENOUGH" on a forum page - my UNO r3 does have the USB B connector but I've not had this problem.. though if you do encounter this problem then maybe some sort of pin extenders or raisers would fix it.

Step 1: Writing to the SD card

After reading around, I settled on a library called SdFat that should handle the disk access for me. I downloaded it from the GitHub repo and followed the "Importing a .zip Library" instructions on the Installing Additional Arduino Libraries page.

This allowed me to stack the data logging shield on top of the UNO, put an SD card into the shield, connect the UNO to my PC via a USB lead and upload the following code -

#include <SdFat.h> // https://github.com/greiman/SdFat

// chipSelect = 10 according to "Item description" section of
// https://www.play-zone.ch/de/dk-data-logging-shield-v1-0.html
#define SD_CHIP_SELECT 10

void setup() {
  Serial.begin(9600);

  // See "Note 1" further down about SPI_HALF_SPEED
  SdFat sd;
  if (!sd.begin(SD_CHIP_SELECT, SPI_HALF_SPEED)) {
    Serial.println("ERROR: sd.begin() failed");
  }
  else {
    SdFile file;
    if (!file.open("TestData.txt", O_WRITE | O_APPEND | O_CREAT)) {
      Serial.println("ERROR: file.open() failed - unable to write");
    }
    else {
      file.println("Hi!");
      file.close();
      Serial.println("Successfully wrote to file!");
    }
  }
}

void loop() { }

The Arduino IDE has an option to view the serial output (the messages written to "Serial.println") by going to Tools / Serial Monitor. Ensure that the baud rate shown near the bottom right of the window is set to 9600 to match the setting in the code above.

This happily showed

Successfully wrote to file!

in the Serial Monitor's output and when I yanked the card out and put it into my laptop to see if it had worked, it did indeed have a file on it called "TestData.txt" with a single line saying "Hi!" - an excellent start!

Note 1: In the "sd.begin" call, I specify SPI_HALF_SPEED primarily because that's what most of the examples that I've found use. There is an option SPI_FULL_SPEED but I read in an Arduino forum thread that: "You should be able to use SPI_FULL_SPEED instead, but if that produces communication errors you can use SD_SCK_HZ(4 * MHZ) instead of SPI_HALF_SPEED". I'm not sure what might be the limiting factor with said communication errors (whether it's the card or the shield or something else) and, since I'm only going to be writing small amounts of data at relatively infrequent intervals, I thought that I would err on the safe side and stick with SPI_HALF_SPEED.

Note 2: In a lot of code samples, in the "setup" method you will see code after the "Serial.begin(..)" call that looks like this:

while (!Serial) {
  // wait for serial port to connect - needed for native USB
}

^ This is only needed for particular variants of the Arduino - the "Leonardo", I believe - and is not required for the UNO and so I haven't included it in my code.

Gotcha One: Initially, I had formatted my SD card (branded as "Elgetec", who I can't remember ever hearing of other than on this card) on my Windows laptop - doing a full format, to make absolutely sure that it was as ready for action as possible. However, not only did that full format take a long time, I found that when I left my Arduino shield writing files over a period of a few hours then it would often get reported as being corrupted when I tried to read it. I've found that if the SdFormatter.ino (from the examples folder of the SdFat GitHub repo) is used then these corruption problems have stopped occurring (and the formatting is much faster!).

Gotcha Two: While I was fiddling around with writing to the SD card, particularly when connected to a battery instead of the USB port (where I could use the Serial Monitor to see what was happening), I tried setting the LED_BUILTIN to be on while writing and then go off again when the file was closed. This didn't work. And it can't work, though it took me a lot of reading to find out why. It turns out that the SPI (the Serial Peripheral Interface) connection from the Arduino to the Deek Robot shield will use IO pins 10, 11, 12 and 13 for its own communications. 13 happens to be the output used to set the LED_BUILTIN state and so you lose access to setting that built-in LED while this shield is connected. Specifically, "pin 13 is the SPI clock. Pin 13 is also the built-in led and hence you have a conflict".

Step 2: Keeping time

Since I want to record light levels throughout the day, it's important to know at what time the recording is being made. The shield that I'm using also includes an "RTC" (a real-time clock) and so I needed to work out how to set that once and then read from it each time I took a light level reading.

The UNO board itself can do some basic form of time keeping, such as telling you how long it's been since the board started / was last reset (via the millis() function) but there are a few limitations with this. You can bake into the compiled code the time at which it was compiled and you could then use that, in combination with "millis()", to work out the current time but you will hit problems if power is temporarily lost or if the board is reset (because "millis()" will start from zero again and timing will start again from that baked-in "compiled at" time).

Gotcha Three: I didn't realise when I was first fiddling with this that any time you connected the USB lead, it would reset the board and the program (the "sketch", in Arduino-speak) would start all over again. (This will only make a difference if you're using an external power source because otherwise the program would stop whenever you disconnected the USB lead and there would be nothing running to reset when plugging the USB lead back in! I'll be talking about external power supplies further down).

So the next step was using the clock on the shield that I had bought, instead of relying on the clock on the Arduino board itself. To do this, I'd inserted a CR1220 battery and then tested with the following code:

#include <Wire.h>
#include <RTClib.h> // https://github.com/adafruit/RTClib

RTC_DS1307 rtc;

void setup() {
  // The clock won't work without this (thanks https://arduino.stackexchange.com/a/44305!)
  Wire.begin();

  bool rtcWasAlreadyConfigured;
  if (rtc.isrunning()) {
    rtcWasAlreadyConfigured = true;
  }
  else {
    rtc.adjust(DateTime(__DATE__, __TIME__));
    rtcWasAlreadyConfigured = false;
  }

  Serial.begin(9600);

  if (rtcWasAlreadyConfigured) {
    Serial.println("setup: RTC is already running");
  }
  else {
    Serial.println("setup: RTC was not running, so it was set to the time of compilation");
  }
}

void loop() {
  DateTime now = rtc.now();
  Serial.print("Year: ");
  Serial.print(now.year());
  Serial.print(" Month: ");
  Serial.print(now.month());
  Serial.print(" Day: ");
  Serial.print(now.day());
  Serial.print(" Hour: ");
  Serial.print(now.hour());
  Serial.print(" Minutes: ");
  Serial.print(now.minute());
  Serial.print(" Seconds: ");
  Serial.print(now.second());
  Serial.println();

  delay(1000);
}

The first time you run this, you'll see the first line say:

setup: RTC was not running, so it was set to the time of compilation

.. and then you'll see the date and time shown every second.

If you remove the USB cable and then re-insert it then you'll see the message:

setup: RTC is already running

.. and then the date and time will continue to show every second and it will be the correct date and time (it won't have reset each time that the USB cable is connected and the "setup" function is run again).

Gotcha Four: When disconnecting and reconnecting the USB lead, sometimes (if not always) I need to close the Serial Monitor and then re-open it otherwise it won't update and it will say that the COM port is busy if I try to upload a sketch to the board.

Gotcha Five: I've seen a lot of examples use "RTC_Millis" instead of "RTC_DS1307" in timing code samples. This is not what we want! That is a timer that is simulated by the board and it just uses the "millis()" function to track time which, as I explained earlier, is no good for persisting time across resets. We want to use "RTC_DS1307" because that uses the RTC on the shield, which will maintain the time between power cycles due to the battery on the board.

Gotcha Six: If you don't include "Wire.h" and call "Wire.begin();" at the start of setup then the RTC won't work properly and you will always get the same weird date displayed when you read it:

Year: 2165 Month: 165 Day: 165 Hour: 165 Minutes: 165 Seconds: 85
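My best guess at where those odd numbers come from (this is an assumption on my part, based on how the DS1307's binary-coded-decimal registers get decoded, rather than something that I've traced through on the hardware): if the read from the clock fails then every register comes back as 0xFF, and pushing 0xFF through a BCD-to-binary conversion gives 165. The seconds register has its top bit masked off first, so it becomes 0x7F and decodes to 85, and the year has 2000 added to it to give 2165. The helpers below are a standalone plain C++ sketch of that arithmetic, with names of my own choosing:

```cpp
#include <cassert>
#include <cstdint>

// BCD-to-binary: the high nibble is the tens digit and the low nibble is the
// units digit, so 0x59 decodes to 59
// (ASSUMPTION: this mirrors the conversion applied to each DS1307 register)
uint8_t bcd2bin(uint8_t value) {
  return static_cast<uint8_t>(value - 6 * (value >> 4));
}

// Hypothetical helper: decode a raw seconds register, ignoring the top bit
// (which the DS1307 uses as its "clock halt" flag rather than as part of the value)
uint8_t decodeSeconds(uint8_t rawRegister) {
  return bcd2bin(rawRegister & 0x7F);
}

// Sanity checks that reproduce the numbers from the Serial Monitor output above
void checkWeirdDateValues() {
  assert(bcd2bin(0xFF) == 165);        // month, day, hour and minutes
  assert(decodeSeconds(0xFF) == 85);   // seconds
  assert(2000 + bcd2bin(0xFF) == 2165); // year
}
```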

Step 3: An external power source

So far, the board has only been powered up when connected to the USB lead but this is not the only option. There are a few approaches that you can take; a regulated 5V input, the barrel-shaped power jack and the option of applying power to the vin and gnd pins on the board.

The power jack makes most sense when you are connecting to some sort of wall wart but I want a "disconnected" power supply for outside. I did a bunch of reading on this and some people just connect a simple 9V battery to the vin/gnd pins but apparently that's not very efficient. The amount of power stored in a standard MN1604 9V battery (the common kind that you might use in a smoke alarm) is comparatively low. The vin/gnd pins will be happy with something in the 6V-12V range and there is said to be more loss in regulating 9V down to the internal 5V than there would be from a 6V supply.

So I settled on a rechargeable 6V sealed lead acid battery, which I believe is often used in big torches or in remote control cars. I got one for £8 delivered from eBay that is stated to have 4.5Ah (which is a measure, essentially, of how much energy it stores) - for reference, a 9V battery will commonly have about 0.5Ah and so will run out much more quickly. Whatever battery you select, there are ways to eke out more life from them, which I'll cover shortly.
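As a rough back-of-an-envelope illustration of why the capacity matters, the running time is approximately the capacity divided by the average current draw. The sketch below is plain C++ (not an Arduino sketch) and the 50mA figure is purely an assumed, illustrative draw rather than anything that I've measured; it also ignores regulator losses and the fact that batteries don't deliver their full rated capacity:

```cpp
#include <cassert>

// Rough estimate only: hours = capacity (mAh) / average current draw (mA)
// - Ignores regulator losses, self-discharge and voltage sag as the battery empties
double estimatedRuntimeHours(double capacityAmpHours, double averageDrawMilliamps) {
  assert(averageDrawMilliamps > 0);
  return (capacityAmpHours * 1000.0) / averageDrawMilliamps;
}

// ASSUMPTION: 50mA is an illustrative average draw, not a measured value
const double ASSUMED_AVERAGE_DRAW_MILLIAMPS = 50.0;
```

With that assumed draw, the 4.5Ah battery comes out at 90 hours while a 0.5Ah 9V battery comes out at 10 hours; the absolute numbers are only as good as the guess at the current draw but the 9x ratio between them holds regardless.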

It's completely safe to connect the battery to the vin/gnd ports at the same time as the USB lead is inserted, so you don't have to worry about only providing power by the battery or the USB lead and you can safely connect and disconnect the USB lead while the battery is connected as often as you like.

Step 4: Capturing light levels

The starter kit that I've got conveniently included an LDR (a "Light Dependent Resistor" aka a "photo-resistor") and so I just had to work out how to connect that. I knew that the Arduino has a range of digital input/output pins and that it has some analog input pins but I had to remind myself of some basic electronics to put it all together.

What you can't do is just put 5V into one pin of the LDR and connect the other end of the LDR straight into an analog pin. I'm going to try to make a stab at a simple explanation here and then refer you to someone who can explain it better!

The analog pin will read a voltage value from between 0 and 5V and allow this to be read in code as a numeric value between 0 and 1023 (inclusive). When we talk about the 5V output pin, this only makes sense in the context of the ground of the board - so the concept of a 5V output with no gnd pin connection makes no sense, there is nothing for that 5V to be measured relative to. So what we need to do is use the varying resistance of the LDR and somehow translate that into a varying voltage to provide to an analog pin (I chose A0 in my build).

The way to do this is with a "voltage divider", which is essentially a circuit that looks a bit like this:

gnd <--> resistor <--> connection-to-analog-input <--> LDR <--> 5V

If the resistance of the LDR happens to precisely match the resistance of the fixed resistor then precisely 2.5V will be delivered to the analog input. But if the LDR resistance is higher or lower than the fixed resistor's value then a lower or higher voltage (respectively) will be delivered to the analog pin.
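To put some numbers on that, the voltage at the analog pin is the supply voltage multiplied by fixed / (LDR + fixed). The following is a plain C++ sketch of that calculation (not something to upload to the board); the function names are my own and the 10k resistor value mentioned afterwards is only an assumed example:

```cpp
#include <cassert>

// Voltage at the junction between the LDR (connected to the supply) and the fixed
// resistor (connected to gnd), which is what the analog pin gets to see
double dividerOutputVolts(double supplyVolts, double ldrOhms, double fixedOhms) {
  assert(ldrOhms >= 0 && fixedOhms > 0);
  return supplyVolts * fixedOhms / (ldrOhms + fixedOhms);
}

// The UNO's 10-bit ADC maps 0V-5V onto 0-1023, so estimate what analogRead would report
// (an approximation; real readings will wobble a little due to noise)
int expectedAnalogReading(double pinVolts, double referenceVolts = 5.0) {
  int reading = static_cast<int>(pinVolts / referenceVolts * 1024.0);
  if (reading < 0) return 0;
  if (reading > 1023) return 1023;
  return reading;
}
```

With an assumed 10k fixed resistor, an LDR that is also at 10k gives 2.5V and a reading of 512. As more light hits the LDR, its resistance drops and so the voltage (and the reading) goes up.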

There is a tutorial on learn.adafruit.com that does a much better job of explaining it! It also suggests which fixed resistor values you might use for different environments (eg. are you more interested in granular light level readings at low light levels but don't mind saturation at high levels or are you more interested in more granular readings at high levels and less granular at lower?) - at the moment, I'm still experimenting with a few different fixed resistor values to see which ones work for my particular climate.

The shield that I'm using has solder pads for mounting components onto but I wasn't brave enough for that, so I've been using the pass-through pins and connecting them to the breadboard that came with my starter kit.

When it's not connected to a power supply, it looks a bit like this:

The code to read the light level value looks like this (while running this code, try slowly moving your hand closer and further from covering the sensor to see the value change when it's read each second) -

void setup() {
  Serial.begin(9600);
}

void loop() {
  Serial.print("Light level reading: ");
  Serial.print(analogRead(0));
  Serial.println();

  delay(1000);
}

In an effort to start putting all of this together into a more robust package, I picked up a pack of self-adhesive felt pads from the supermarket and stuck them to appropriate points under the breadboard -

.. and then I secured it all together with an elastic band:

Step 5: Sleeping when not busy

In my ideal dream world, I would be able to leave my light level monitoring box outside for a few months. As I explained earlier, due to the direction that my garden faces, the hours at which the sun hits it fully vary by several hours depending upon the time of year. However, NO battery is going to last forever and even with this 4.5Ah battery that is at a 6V output (which is only a small jump down to regulate to 5V), the time that it can keep things running is limited.

Note: Recharging via a solar panel sounds interesting but it's definitely a future iteration possibility at this point!

There are, however, some things that can be done to eke out the duration of the battery by reducing the power usage of the board. There are ways to put the board into a "power down" state where it will do less - its timers will stop and its CPU can have a rest. There are tutorials out there about how to put it into this mode and have it only wake up on an "interrupt", which can be an external circuit setting an input pin (maybe somehow using the RTC on the shield I'm using) or using something called the "Watchdog Timer" that stays running on the Arduino even when it's in power down mode.

I read a lot of posts and tutorials on this and I really struggled to get it to work. Until, finally, I came across this one: Arduino Sleep Modes and How to use them to Save the Power. It explains in a clear table the difference between the different power-reduced modes (idle, power-save, power-down, etc..) and it recommends a library called "Low-Power" that takes all of the hard work out of it.

Whereas other tutorials talked about calling "sleep_enable()" and "set_sleep_mode(..)" functions and then using "attachInterrupt(..)" and adding some magic method to then undo all of those things, this library allows you to write a one-liner as follows:

LowPower.powerDown(SLEEP_8S, ADC_OFF, BOD_OFF);

This will cause the board to go into its most power-saving mode for eight seconds (which is the longest that's possible when relying upon its internal Watchdog Timer to wake it up).

No muss, no fuss.

I haven't measured yet how long my complete device can sit outside in its waterproof box on a single charge of a battery but I'm confident that it's definitely measured in days, not hours - and that was before introducing this "LowPower.powerDown(..)" call.

Since I only want a reading every 30-60s, I call "LowPower.powerDown(..)" in a loop so that there are several 8s power down delays. While I haven't confirmed this yet, I would be astonished if it didn't last at least a week out there on one charge. And if I have to bring it in some nights (when it's dark and I don't care about light measurements) to charge it, then that's fine by me (though I'd like that to be as infrequently as possible).
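Working out how many of those 8s naps are needed for a given gap between readings is just a rounded-up division. Here is a small plain C++ sketch of it (the helper name is hypothetical and this isn't part of the Low-Power library):

```cpp
#include <cassert>

// The Watchdog Timer can wake the board after at most 8s, so longer gaps between
// readings are built up from several back-to-back power-downs
const int MAX_WATCHDOG_SLEEP_SECONDS = 8;

// Hypothetical helper: how many 8s sleeps are needed to wait AT LEAST the target interval
int sleepCyclesFor(int targetIntervalSeconds) {
  assert(targetIntervalSeconds > 0);
  return (targetIntervalSeconds + MAX_WATCHDOG_SLEEP_SECONDS - 1) / MAX_WATCHDOG_SLEEP_SECONDS;
}
```

On the board, the result would be the number of times to call "LowPower.powerDown(SLEEP_8S, ADC_OFF, BOD_OFF)" in a loop: a 30s target needs 4 sleeps (32s) and a 60s target needs 8 (64s). I wouldn't expect the real intervals to be any more than approximate, which is fine for my purposes.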

Gotcha Seven: When entering power-down mode, if you are connected to the USB port in order to use the Serial Monitor to watch what's going on, ensure that you call "Serial.flush()" before entering power-down, otherwise the message might get buffered up and not fully sent through the serial connection before the board takes a nap.

Step 6: Preparing for the outdoors (in the British weather!)

I always associate the brand "Tupperware" as being a very British thing - it's what we get packed lunches put into and what we get takeaway curries in. At least, I think that it is - maybe it's like "hoover", where everyone uses the phrase "hoover" when they mean their generic vacuum cleaner. Regardless of the origin, this seemed like the simplest way to make my device waterproof. The containers are not completely transparent but they shouldn't make a significant impact on the light levels being recorded by the photo-resistor because they're also far from opaque. And these containers are sealable, waterproof and come in all shapes and sizes!

I took my elastic-band-wrapped "stack" of Arduino-plus-shield-plus-breadboard and connected it to the battery -

.. and put it in a plastic box. By turning the battery so that it was length-side-up, it was quite a snug fit and meant that the battery wouldn't slide around inside the box. There wasn't a lot of space for the stack to move around and so it seemed like quite a secure arrangement:

Step 7: The final code

So far, each code sample has demonstrated aspects of what I want to do but now it's time to bring it all fully together.

In trying to write the following code, I was reminded how much I've taken for granted in C# (and other higher level languages) with their string handling! I tried a little C and C++ maaaaany years ago and so writing Arduino code was a bit of a throwback for me - at first, I was trying to make a char array for a filename and I set the length of the array to be the number of characters that were required for the filename.. silly me, I had forgotten that C strings need to be null-terminated and so you need an extra zero character at the end in order for things to work properly! Failing to do so would not result in a compile or run time error, it would just mean that the files weren't written properly. Oh, how I've been spoilt! But, on the other hand, it also feels kinda good being this close to the "bare metal" :)
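One way to guard against that sort of silent truncation is to check the return value of "snprintf", which reports how many characters it wanted to write (not counting the null terminator). This is a plain C++ sketch with a helper name of my own invention, rather than code from the sketch that follows:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Formats "yyyyMMdd.txt" into the buffer and reports whether it fitted, INCLUDING
// the null terminator (snprintf returns the length that it wanted to write, not
// counting the terminator, so anything >= bufferSize means the text was cut short)
bool tryFormatFilename(char* buffer, std::size_t bufferSize, unsigned year, unsigned month, unsigned day) {
  int wanted = std::snprintf(buffer, bufferSize, "%04u%02u%02u.txt", year, month, day);
  return (wanted >= 0) && (static_cast<std::size_t>(wanted) < bufferSize);
}
```

With a 13-character buffer this returns true and the buffer holds the full twelve-character name; with a 12-character buffer it returns false and the buffer is left holding a truncated (but still null-terminated) name.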

The following sketch will record the light level about twice a minute to a file on the SD card where the filename is based upon the current date (as maintained by the RTC module and its CR1220 battery) -

#include <Wire.h>
#include <SdFat.h>    // https://github.com/greiman/SdFat
#include <RTClib.h>   // https://github.com/adafruit/RTClib
#include <LowPower.h> // https://github.com/rocketscream/Low-Power

// chipSelect = 10 according to "Item description" section of
// https://www.play-zone.ch/de/dk-data-logging-shield-v1-0.html
#define SD_CHIP_SELECT 10

RTC_DS1307 rtc;

void setup() {
  // The clock won't work without this (thanks https://arduino.stackexchange.com/a/44305!)
  Wire.begin();

  bool rtcWasAlreadyConfigured;
  if (rtc.isrunning()) {
    rtcWasAlreadyConfigured = true;
  }
  else {
    rtc.adjust(DateTime(__DATE__, __TIME__));
    rtcWasAlreadyConfigured = false;
  }

  Serial.begin(9600);

  if (rtcWasAlreadyConfigured) {
    Serial.println("setup: RTC is already running");
  }
  else {
    Serial.println("setup: RTC was not running, so it was set to the time of compilation");
  }
}

void loop() {
  // Character arrays need to be long enough to store the number of "real" characters plus the
  // null terminator
  char filename[13]; // yyyyMMdd.txt = 12 chars + 1 null
  char timestamp[9]; // 00:00:00     =  8 chars + 1 null
  DateTime now = rtc.now();
  snprintf(filename, sizeof(filename), "%04u%02u%02u.txt", now.year(), now.month(), now.day());
  snprintf(timestamp, sizeof(timestamp), "%02u:%02u:%02u", now.hour(), now.minute(), now.second());

  int sensorValue = analogRead(0);

  Serial.print(filename);
  Serial.print(" ");
  Serial.print(timestamp);
  Serial.print(" ");
  Serial.println(sensorValue);

  SdFat sd;
  if (!sd.begin(SD_CHIP_SELECT, SPI_HALF_SPEED)) {
    Serial.println("ERROR: sd.begin() failed");
  }
  else {
    SdFile file;
    if (!file.open(filename, O_WRITE | O_APPEND | O_CREAT)) {
      Serial.println("ERROR: file.open() failed - unable to write");
    }
    else {
      file.print(timestamp);
      file.print(" Sensor value: ");
      file.println(sensorValue);
      file.close();
    }
  }

  Serial.flush(); // Ensure we finish sending serial messages before going to sleep

  // 4x 8s is close enough to a reading every 30s, which gives me plenty of data
  // - Using this instead of "delay" should mean that the battery will power the device for longer
  for (int i = 0; i < 4; i++) {
    LowPower.powerDown(SLEEP_8S, ADC_OFF, BOD_OFF);
  }
}

At the moment, I'm bringing the box inside each night and then disconnecting the battery, pulling out the card and looking at the values recorded in the file to see if it's clear when the sun was fully hitting the table that I had placed the box on.

I've only started doing this in the last couple of days and each day has been rather grey and so there haven't been any sunny periods so that I can confirm that the readings clearly distinguish between "regular daylight" and "sun directly on the table". Once I get some sun again, I'll be able to get a better idea - and if I can't distinguish well enough then I'll adjust the pull-down resistor that splits the voltage with the LDR and keep experimenting!

When I'm happy with the configuration, then I'll start experimenting with leaving the box outside for longer to see how long this battery can last in conjunction with the "LowPower.powerDown(..)" calls. One obvious optimisation for my use case would be to continue keeping it in power-down mode between the hours of 10pm and 8am - partly because I know that it will definitely be dark after 10pm and partly because I am not a morning person and so would not want to be outside before 8am, even if it was streaming with light (which it wouldn't be due to when my yard actually gets direct sunlight).
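The check for that overnight window is tiny but it's easy to get the logic backwards because the window wraps around midnight. A plain C++ sketch of it, using a hypothetical helper name:

```cpp
#include <cassert>

// Hypothetical helper: true between 10pm (inclusive) and 8am (exclusive), when there
// is no point in waking up to take a reading. The window wraps around midnight, which
// is why the two comparisons are OR'd rather than AND'd.
bool isQuietHour(int hourOfDay) {
  assert(hourOfDay >= 0 && hourOfDay <= 23);
  return (hourOfDay >= 22) || (hourOfDay < 8);
}
```

In the sketch's "loop" function, the idea would be to pass it the value of "rtc.now().hour()" and, if it returns true, to go straight back into power-down without reading the sensor or touching the SD card.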

Gotcha Eight: The RTC has no awareness of daylight savings time and so I'll need to take this into account when the clocks change in the UK. I'll worry about this another day!
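For when that day comes, the UK rule is simple enough to compute: British Summer Time starts at 01:00 UTC on the last Sunday of March and ends at 01:00 UTC on the last Sunday of October. The following plain C++ sketch works out whether a given moment falls inside that window. It assumes that the RTC has been set to GMT/UTC (unlike the sketch above, which sets the clock from the local time at compilation) and the function names are my own:

```cpp
#include <cassert>

// Day of the week for a Gregorian date (0 = Sunday .. 6 = Saturday), using
// Sakamoto's method
int dayOfWeek(int year, int month, int day) {
  static const int monthOffsets[] = { 0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4 };
  if (month < 3) year -= 1;
  return (year + year / 4 - year / 100 + year / 400 + monthOffsets[month - 1] + day) % 7;
}

// Only valid for 31-day months, which is all that is needed here (March and October)
int lastSundayOf31DayMonth(int year, int month) {
  return 31 - dayOfWeek(year, month, 31);
}

// ASSUMPTION: the date and hour passed in are GMT/UTC, not local time
bool isBritishSummerTime(int year, int month, int day, int hourUtc) {
  if (month < 3 || month > 10) return false;
  if (month > 3 && month < 10) return true;
  int changeDay = lastSundayOf31DayMonth(year, month);
  bool changeHasHappened = (day > changeDay) || (day == changeDay && hourUtc >= 1);
  return (month == 3) ? changeHasHappened : !changeHasHappened;
}
```

When this returns true, an hour would need to be added to the RTC's time to get the time that is showing on the kitchen clock.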

Step 8: Draw some graphs (one day)

As you can tell from the above, I'm still very much in the early phases of gathering data. But, at some point, I'm going to have to use this data to predict when the yard will get sun for future dates - once I've got a few months of data for different times of year, hopefully I'll be able to do so!

I foresee a little bit of data-reading and Excel-graph-drawing in my future! There's just something about seeing results on a graph that makes everything feel so much more real. As much as I'd like to be able to stare at 1000s of numbers and read them like the Matrix, seeing trends and curves plotted out just feels so much more satisfying and definitive. Maybe there will be a follow-up post with the results, though I feel that they would be much more personal and less useful to the general populace than even my standard level of esoteric and niche blog posts! Maybe there are some graphs in my Twitter stream's future!

On the other hand.. if I learn any more power-saving techniques or have any follow-up information about how long these rechargeable torch-or-remote-control batteries last then maybe that will be grounds for a follow-up!

In the meantime, I hope you've enjoyed this little journey - and if you've tried to do anything similar with these cheap Deek Robot boards, then maybe the code samples here have been of use to you. I hope so! (Because, goodness knows, feeling like a beginner again and getting onto those new forums has been quite an experience!)

How are barcodes read?? (Library-less image processing in C#)

Fri, 07 Aug 2020 23:24:00 GMT

I've been using MyFitnessPal and it has the facility to load nutrition information by scanning the barcode on the product. I can guess how the retrieval works once the barcode number is obtained (a big database somewhere) but it struck me that I had no idea how the reading of the barcode itself worked and.. well, I'm curious and enjoy the opportunity to learn something new to me by writing the code to do it. I do enjoy being able to look up (almost) anything on the internet to find out how it works!

(For anyone who wants to either play along but not copy-paste the code themselves or for anyone who wants to jump to the end result, I've put the code - along with the example image I've used in this post - up on a GitHub repo)

The plan of attack

There are two steps required here:

Read image and try to identify areas that look like barcodes
Try to extract numbers from the looks-like-a-barcode regions

As with anything, these steps may be broken down into smaller tasks. The first step can be done like this:

Barcodes are black and white regions that have content that has steep "gradients" in image intensity horizontally (where there is a change from a black bar to a white space) and little change in intensity vertically (as each bar is a vertical line), so first we greyscale the image and then generate horizontal and vertical intensity gradients values for each point in the image and combine the values by subtracting vertical gradient from horizontal gradient
These values are normalised so that they are all on the scale zero to one - this data could be portrayed as another greyscale image where the brightest parts are most likely to be within barcodes
These values are then "spread out" or "blurred" and then a threshold value is applied where every value above it is changed into a 1 and every value below it a 0
This "mask" (where every value is a 0 or 1) should have identified many of the pixels within the barcodes and we want to group these pixels into distinct objects
There is a chance, though, that there could be gaps between bars that mean that a single barcode is spread across multiple masked-out objects and we need to try to piece them back together into one area (since the bars are tall and narrow, this may be done by considering a square area over every object and then combining objects whose squared areas overlap into one)
This process will result in a list of areas that may be barcodes - any that are taller than they are wide are ignored (because barcode regions are always wider than they are tall)
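The first few items in that list can be sketched roughly as follows. This is in C++ rather than the C# that the rest of this post uses, it skips the blurring step entirely, and the function names, the use of central differences for the gradients and the zero-scoring of edge pixels are all my own assumptions for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<double>>; // indexed as grid[y][x]

// Horizontal gradient minus vertical gradient at each point, clamped at zero
// (ASSUMPTION: edge pixels are simply given a score of zero)
Grid barcodeLikelihood(const Grid& grey) {
  const std::size_t height = grey.size();
  const std::size_t width = height ? grey[0].size() : 0;
  Grid result(height, std::vector<double>(width, 0.0));
  for (std::size_t y = 1; y + 1 < height; y++) {
    for (std::size_t x = 1; x + 1 < width; x++) {
      const double horizontal = std::fabs(grey[y][x + 1] - grey[y][x - 1]);
      const double vertical = std::fabs(grey[y + 1][x] - grey[y - 1][x]);
      result[y][x] = std::max(0.0, horizontal - vertical);
    }
  }
  return result;
}

// Rescale so that the smallest value becomes 0 and the largest becomes 1, then
// turn anything at or above the threshold into a 1 and everything else into a 0
std::vector<std::vector<int>> toMask(const Grid& values, double threshold) {
  double lowest = 0, highest = 0;
  bool first = true;
  for (const auto& row : values) {
    for (double value : row) {
      if (first) { lowest = highest = value; first = false; }
      lowest = std::min(lowest, value);
      highest = std::max(highest, value);
    }
  }
  const double range = highest - lowest;
  std::vector<std::vector<int>> mask;
  for (const auto& row : values) {
    std::vector<int> maskRow;
    for (double value : row) {
      // If every value is identical then there is nothing to pick out, so mask it all off
      const double normalised = (range > 0) ? (value - lowest) / range : 0.0;
      maskRow.push_back(normalised >= threshold ? 1 : 0);
    }
    mask.push_back(maskRow);
  }
  return mask;
}
```

A vertical edge (like the side of a bar) scores highly and survives the threshold, while a horizontal edge has a vertical gradient that cancels it out and so it gets masked off.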

The second step can be broken down into:

Take the maybe-barcode region of the image, greyscale it and then turn into a mask by setting any pixel with an intensity less than a particular threshold to zero and otherwise to one
Take a horizontal slice across the image region - all of the pixels on the first row of the image - and change the zero-or-one raw data into a list of line lengths where a new line starts at any transition from zero-to-one or one-to-zero (so "01001000110" becomes "1,1,2,1,3,2,1" because there is 1x zero and then 1x one and then 2x zero and then 1x one, etc..)
These line lengths should correspond to bar sizes (and space-between-bar sizes) if we've found a barcode - so run the values through the magic barcode bar-size-reading algorithm (see section 2.1 in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2859730/) and if we get a number (and the checksum is correct) then we're done, hurrah!
If we couldn't get a number from this horizontal slice then move one pixel down and go back around
If it was not possible to extract a number from any of the slices through the image region then it's either not a barcode or it's somehow so distorted in the image that we can't read it
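The zero-or-one-to-line-lengths conversion from the second item in that list is the easiest part to show in isolation. This is a minimal sketch in C++ (not the C# that this post otherwise uses), where the function name is my own and the row of mask values is passed in as a string of '0' and '1' characters purely for convenience:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Collapse a row of mask values into the lengths of each run of identical values,
// starting a new run at every zero-to-one or one-to-zero transition
std::vector<int> toRunLengths(const std::string& maskRow) {
  std::vector<int> lengths;
  for (std::size_t i = 0; i < maskRow.size(); i++) {
    if (i > 0 && maskRow[i] == maskRow[i - 1])
      lengths.back()++;     // Same value as the previous pixel: extend the current run
    else
      lengths.push_back(1); // First pixel or a transition: start a new run
  }
  return lengths;
}
```

Passing it "01001000110" gives back 1, 1, 2, 1, 3, 2, 1, matching the example above.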

This approach is fairly resilient to changes in lighting and orientation because the barcode regions are still likely to have the highest horizontal intensity gradient whether the image is dark or light (and even if part of the image is light and part of it is dark) and the barcode-reading algorithm works on ratios of bar/space-between-bar widths and these remain constant if the image is rotated.

(Some of the techniques are similar to things that I did in my Face or no face (finding faces in photos using C# and Accord.NET) and so I'll be using some of the same code shortly that I described then)

Identifying maybe-barcode images

Let's use this image as an example to work with (peanut butter.. I do love peanut butter) -

Before looking at any code, let's visualise the process.

We're going to consider horizontal and vertical gradient intensity maps - at every point in the image we either look to the pixels to the left and to the right (for the horizontal gradient) or we look at the pixels above and below (for the vertical gradient) and the larger the change, the brighter the pixel in the gradient intensity map

And when they're combined by subtracting the vertical gradient at each point from the horizontal gradient, it looks like this:

If this image is blurred then we get this:

.. and if we create a binary mask by saying "normalise the intensity values so that their range goes from zero (for the darkest pixel) to one (for the brightest) and then set any pixels that are in the bottom third in terms of intensity to 0 and set the rest to 1" then we get this:

If each distinct area (where an "area" means "a group of pixels that are connected") is identified and squares overlaid and centered around the areas then we see this:

.. and if the areas whose bounding squares overlap are combined and then cropped around the white pixels then we end up with this:

This has identified the area around the barcode and also two tiny other areas - when we come to trying to read barcode numbers out of these, the tiny regions will result in no value while the area around the genuine barcode content should result in a number successfully being read. But I'm getting ahead of myself.. let's look at the code required to perform the above transformations.

I'm going to start with a DataRectangle for performing transformations -

public static class DataRectangle
{
    public static DataRectangle<T> For<T>(T[,] values) => new DataRectangle<T>(values);
}

public sealed class DataRectangle<T>
{
    private readonly T[,] _protectedValues;
    public DataRectangle(T[,] values) : this(values, isolationCopyMayBeBypassed: false) { }
    private DataRectangle(T[,] values, bool isolationCopyMayBeBypassed)
    {
        if ((values.GetLowerBound(0) != 0) || (values.GetLowerBound(1) != 0))
            throw new ArgumentException("Both dimensions must have lower bound zero");
        var arrayWidth = values.GetUpperBound(0) + 1;
        var arrayHeight = values.GetUpperBound(1) + 1;
        if ((arrayWidth == 0) || (arrayHeight == 0))
            throw new ArgumentException("zero element arrays are not supported");

        Width = arrayWidth;
        Height = arrayHeight;

        if (isolationCopyMayBeBypassed)
            _protectedValues = values;
        else
        {
            _protectedValues = new T[Width, Height];
            Array.Copy(values, _protectedValues, Width * Height);
        }
    }

    /// <summary>
    /// This will always be greater than zero
    /// </summary>
    public int Width { get; }

    /// <summary>
    /// This will always be greater than zero
    /// </summary>
    public int Height { get; }

    public T this[int x, int y]
    {
        get
        {
            if ((x < 0) || (x >= Width))
                throw new ArgumentOutOfRangeException(nameof(x));
            if ((y < 0) || (y >= Height))
                throw new ArgumentOutOfRangeException(nameof(y));
            return _protectedValues[x, y];
        }
    }

    public IEnumerable<Tuple<Point, T>> Enumerate(Func<Point, T, bool>? optionalFilter = null)
    {
        for (var x = 0; x < Width; x++)
        {
            for (var y = 0; y < Height; y++)
            {
                var value = _protectedValues[x, y];
                var point = new Point(x, y);
                if (optionalFilter?.Invoke(point, value) ?? true)
                    yield return Tuple.Create(point, value);
            }
        }
    }

    public DataRectangle<TResult> Transform<TResult>(Func<T, TResult> transformer)
    {
        return Transform((value, coordinates) => transformer(value));
    }

    public DataRectangle<TResult> Transform<TResult>(Func<T, Point, TResult> transformer)
    {
        var transformed = new TResult[Width, Height];
        for (var x = 0; x < Width; x++)
        {
            for (var y = 0; y < Height; y++)
                transformed[x, y] = transformer(_protectedValues[x, y], new Point(x, y));
        }
        return new DataRectangle<TResult>(transformed, isolationCopyMayBeBypassed: true);
    }
}

And then I'm going to add a way to load image data into this structure -

public static class BitmapExtensions
{
    /// <summary>
    /// This will return values in the range 0-255 (inclusive)
    /// </summary>
    // Based on http://stackoverflow.com/a/4748383/3813189
    public static DataRectangle<double> GetGreyscale(this Bitmap image)
    {
        var values = new double[image.Width, image.Height];
        var data = image.LockBits(
            new Rectangle(0, 0, image.Width, image.Height),
            ImageLockMode.ReadOnly,
            PixelFormat.Format24bppRgb
        );
        try
        {
            var pixelData = new Byte[data.Stride];
            for (var lineIndex = 0; lineIndex < data.Height; lineIndex++)
            {
                Marshal.Copy(
                    source: data.Scan0 + (lineIndex * data.Stride),
                    destination: pixelData,
                    startIndex: 0,
                    length: data.Stride
                );
                for (var pixelOffset = 0; pixelOffset < data.Width; pixelOffset++)
                {
                    // Note: PixelFormat.Format24bppRgb means the data is stored in memory as BGR
                    const int PixelWidth = 3;
                    var r = pixelData[pixelOffset * PixelWidth + 2];
                    var g = pixelData[pixelOffset * PixelWidth + 1];
                    var b = pixelData[pixelOffset * PixelWidth];
                    values[pixelOffset, lineIndex] = (0.2989 * r) + (0.5870 * g) + (0.1140 * b);
                }
            }
        }
        finally
        {
            image.UnlockBits(data);
        }
        return DataRectangle.For(values);
    }
}

With these classes, we can load an image and calculate the combined horizontal-gradient-minus-vertical-gradient value like this:

private static IEnumerable<Rectangle> GetPossibleBarcodeAreasForBitmap(Bitmap image)
{
    var greyScaleImageData = image.GetGreyscale();
    var combinedGradients = greyScaleImageData.Transform((intensity, pos) =>
    {
        // Consider gradients to be zero at the edges of the image because there aren't pixels
        // both left/right or above/below and so it's not possible to calculate a real value
        var horizontalChange = (pos.X == 0) || (pos.X == greyScaleImageData.Width - 1)
            ? 0
            : greyScaleImageData[pos.X + 1, pos.Y] - greyScaleImageData[pos.X - 1, pos.Y];
        var verticalChange = (pos.Y == 0) || (pos.Y == greyScaleImageData.Height - 1)
            ? 0
            : greyScaleImageData[pos.X, pos.Y + 1] - greyScaleImageData[pos.X, pos.Y - 1];
        return Math.Max(0, Math.Abs(horizontalChange) - Math.Abs(verticalChange));
    });

    // .. more will go here soon
}
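To see why subtracting the vertical change from the horizontal change favours barcode-like content, here is a minimal, self-contained sketch that applies the same calculation to plain arrays. The "GradientSketch" class and its "Stripes" test pattern are made up purely for this illustration (they are not part of the barcode reader code) and the stripes are two pixels wide so that the pixel-either-side comparison has something to detect.

```csharp
using System;

public static class GradientSketch
{
    // Same calculation as in the Transform call above but on a plain [x, y] array:
    // horizontal change minus vertical change, never below zero, with edges treated as zero
    public static double[,] CombineGradients(double[,] intensities)
    {
        var width = intensities.GetLength(0);
        var height = intensities.GetLength(1);
        var combined = new double[width, height];
        for (var x = 0; x < width; x++)
        {
            for (var y = 0; y < height; y++)
            {
                var horizontalChange = (x == 0) || (x == width - 1)
                    ? 0
                    : intensities[x + 1, y] - intensities[x - 1, y];
                var verticalChange = (y == 0) || (y == height - 1)
                    ? 0
                    : intensities[x, y + 1] - intensities[x, y - 1];
                combined[x, y] = Math.Max(0, Math.Abs(horizontalChange) - Math.Abs(verticalChange));
            }
        }
        return combined;
    }

    // Hypothetical 6x6 test pattern of stripes that are two pixels wide - vertical stripes
    // change as x changes (like barcode bars) while horizontal stripes change as y changes
    public static double[,] Stripes(bool vertical)
    {
        var values = new double[6, 6];
        for (var x = 0; x < 6; x++)
        {
            for (var y = 0; y < 6; y++)
                values[x, y] = (((vertical ? x : y) / 2) % 2 == 0) ? 0 : 255;
        }
        return values;
    }
}
```

With the vertical stripes, the value at [2, 2] is 255 (a strong horizontal change and no vertical change) whereas, with the horizontal stripes, the same point comes out as 0 because the vertical change cancels it out.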

Before jumping straight into the image analysis, though, it's worth resizing the source image if it's large. Since this stage of the processing is looking for areas that look approximately like barcodes, we don't require a lot of granularity - I'm envisaging (as with the MyFitnessPal use case) source images where the barcode takes up a significant space in the image and is roughly aligned with the viewport* and so resizing the image such that the largest side is 300px should work well. If you wanted to scan an image where there were many barcodes to process (or even where there was only one but it was very small) then you might want to allow larger inputs than this - the more data that there is, though, the more work that must be done and the slower that the processing will be!

* (The barcode has to be roughly aligned with the viewport because, otherwise, the approach of looking for areas with large horizontal variance in intensity and minor vertical variance would not work - as we'll see later, though, there is considerable margin for error in this approach and perfect alignment is not required)

A naive approach to this would be to force the image so that its largest side is 300px, regardless of what it was originally. However, this is unnecessary if the largest side is already less than 300px (scaling it up will actually give us more work to do) and if the largest side is not much more than 300px then it's probably not worth doing either - scaling it down may make any barcode areas fuzzy and risk reducing the effectiveness of the processing without reducing the required work by much. So I'm going to say that if the largest side of the image is larger than 450px then resize it so that its largest side is 300px and do nothing otherwise. To achieve that, we need a method like this:

private static DataRectangle<double> GetGreyscaleData(
    Bitmap image,
    int resizeIfLargestSideGreaterThan,
    int resizeTo)
{
    var largestSide = Math.Max(image.Width, image.Height);
    if (largestSide <= resizeIfLargestSideGreaterThan)
        return image.GetGreyscale();

    int width, height;
    if (image.Width > image.Height)
    {
        width = resizeTo;
        height = (int)(((double)image.Height / image.Width) * width);
    }
    else
    {
        height = resizeTo;
        width = (int)(((double)image.Width / image.Height) * height);
    }
    using var resizedImage = new Bitmap(image, width, height);
    return resizedImage.GetGreyscale();
}
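As a quick check of the arithmetic, here is a small sketch that mirrors only the size calculation from the method above, so that it can be tried without a Bitmap. The "ResizeSketch" class and the "GetTargetSize" name are hypothetical and exist only for this example.

```csharp
using System;

public static class ResizeSketch
{
    // Returns the dimensions that the greyscale data would have: unchanged if the largest side
    // is within the limit, otherwise scaled so that the largest side equals resizeTo
    public static (int Width, int Height) GetTargetSize(
        int width,
        int height,
        int resizeIfLargestSideGreaterThan,
        int resizeTo)
    {
        var largestSide = Math.Max(width, height);
        if (largestSide <= resizeIfLargestSideGreaterThan)
            return (width, height);

        return (width > height)
            ? (resizeTo, (int)(((double)height / width) * resizeTo))
            : ((int)(((double)width / height) * resizeTo), resizeTo);
    }
}
```

With the 450 / 300 values used in this post, a 4000x3000 photo would be processed at 300x225 and a 3000x4000 one at 225x300, while a 400x300 image (or one whose largest side is exactly 450px) would be left alone.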

The next steps are to "normalise" the combined intensity variance values so that they fit the range zero-to-one, to "blur" this data and to then create a binary mask where the brighter pixels get set to one and the darker pixels get set to zero. In other words, to extend the code earlier (that calculated the intensity variance values) like this:

private static IEnumerable<Rectangle> GetPossibleBarcodeAreasForBitmap(Bitmap image)
{
    var greyScaleImageData = GetGreyscaleData(
        image,
        resizeIfLargestSideGreaterThan: 450,
        resizeTo: 300
    );
    var combinedGradients = greyScaleImageData.Transform((intensity, pos) =>
    {
        // Consider gradients to be zero at the edges of the image because there aren't pixels
        // both left/right or above/below and so it's not possible to calculate a real value
        var horizontalChange = (pos.X == 0) || (pos.X == greyScaleImageData.Width - 1)
            ? 0
            : greyScaleImageData[pos.X + 1, pos.Y] - greyScaleImageData[pos.X - 1, pos.Y];
        var verticalChange = (pos.Y == 0) || (pos.Y == greyScaleImageData.Height - 1)
            ? 0
            : greyScaleImageData[pos.X, pos.Y + 1] - greyScaleImageData[pos.X, pos.Y - 1];
        return Math.Max(0, Math.Abs(horizontalChange) - Math.Abs(verticalChange));
    });

    const int maxRadiusForGradientBlurring = 2;
    const double thresholdForMaskingGradients = 1d / 3;

    var mask = Blur(Normalise(combinedGradients), maxRadiusForGradientBlurring)
        .Transform(value => (value >= thresholdForMaskingGradients));

    // .. more will go here soon
}

To do that, we need a "Normalise" method - which is simple:

private static DataRectangle<double> Normalise(DataRectangle<double> values)
{
    var max = values.Enumerate().Max(pointAndValue => pointAndValue.Item2);
    return (max == 0)
        ? values
        : values.Transform(value => (value / max));
}

.. and a "Blur" method - which is a little less simple but hopefully still easy enough to follow (for every point, look at the points around it and take an average of all of them; it just looks for a square area, which is fine for small "maxRadius" values but which might be better implemented as a circular area if large "maxRadius" values might be needed, which they aren't in this code):

private static DataRectangle<double> Blur(DataRectangle<double> values, int maxRadius)
{
    return values.Transform((value, point) =>
    {
        var valuesInArea = new List<double>();
        for (var x = -maxRadius; x <= maxRadius; x++)
        {
            for (var y = -maxRadius; y <= maxRadius; y++)
            {
                var newPoint = new Point(point.X + x, point.Y + y);
                if ((newPoint.X < 0) || (newPoint.Y < 0)
                || (newPoint.X >= values.Width) || (newPoint.Y >= values.Height))
                    continue;
                valuesInArea.Add(values[newPoint.X, newPoint.Y]);
            }
        }
        return valuesInArea.Average();
    });
}

This gets us to this point:

.. which feels like good progress!
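To get a feel for what the normalise, blur and threshold steps do together, here is a sketch that works on a single row of values instead of a full image. This is a simplification that I'm assuming is acceptable for illustration (the real blur looks at a square area around each point) and the "MaskSketch" class is a made-up name.

```csharp
using System;
using System.Linq;

public static class MaskSketch
{
    // Scale values so that the largest becomes 1 (all-zero input is returned unchanged)
    public static double[] Normalise(double[] values)
    {
        var max = values.Max();
        return (max == 0) ? values : values.Select(value => value / max).ToArray();
    }

    // One-dimensional stand-in for the square blur: average each value with its neighbours
    // up to maxRadius away, ignoring any positions that fall outside of the data
    public static double[] Blur(double[] values, int maxRadius)
    {
        return values
            .Select((_, index) => Enumerable.Range(index - maxRadius, (maxRadius * 2) + 1)
                .Where(i => (i >= 0) && (i < values.Length))
                .Average(i => values[i]))
            .ToArray();
    }

    // Normalise, blur and then mark everything at or above the threshold as true
    public static bool[] ToMask(double[] gradients, int maxRadius, double threshold)
    {
        return Blur(Normalise(gradients), maxRadius)
            .Select(value => value >= threshold)
            .ToArray();
    }
}
```

Given the row 0, 0, 0, 255, 0, 255, 0, 0, 0, 0 with a radius of 2 and a threshold of one third, the blurred values at indexes 3, 4 and 5 are all 0.4 and everything else is 0.25 or less - so the mask is true for indexes 3 to 5 only. The gap at index 4 has been bridged, which is the reason for blurring before creating the mask.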

Now we need to try to identify distinct "islands" of pixels where each "island" or "object" is a set of points that are within a single connected area. A straightforward way to do that is to look at every point in the mask that is set to 1 and either:

Perform a pixel-style "flood fill" starting at this point in order to find other points in an object
If this pixel has already been included in such a fill operation, do nothing (because it's already been accounted for)

This was made easier for me by reading the article Flood Fill algorithm (using C#.Net)..

private static IEnumerable<IEnumerable<Point>> GetDistinctObjects(DataRectangle<bool> mask)
{
    // Flood fill areas in the looks-like-bar-code mask to create distinct areas
    var allPoints = new HashSet<Point>(
        mask.Enumerate(optionalFilter: (point, isMasked) => isMasked).Select(point => point.Item1)
    );
    while (allPoints.Any())
    {
        var currentPoint = allPoints.First();
        var pointsInObject = GetPointsInObject(currentPoint).ToArray();
        foreach (var point in pointsInObject)
            allPoints.Remove(point);
        yield return pointsInObject;
    }

    // Inspired by code at
    // https://simpledevcode.wordpress.com/2015/12/29/flood-fill-algorithm-using-c-net/
    IEnumerable<Point> GetPointsInObject(Point startAt)
    {
        var pixels = new Stack<Point>();
        pixels.Push(startAt);

        var valueAtOriginPoint = mask[startAt.X, startAt.Y];
        var filledPixels = new HashSet<Point>();
        while (pixels.Count > 0)
        {
            var currentPoint = pixels.Pop();
            if ((currentPoint.X < 0) || (currentPoint.X >= mask.Width)
            || (currentPoint.Y < 0) || (currentPoint.Y >= mask.Height))
                continue;

            if ((mask[currentPoint.X, currentPoint.Y] == valueAtOriginPoint)
            && !filledPixels.Contains(currentPoint))
            {
                filledPixels.Add(new Point(currentPoint.X, currentPoint.Y));
                pixels.Push(new Point(currentPoint.X - 1, currentPoint.Y));
                pixels.Push(new Point(currentPoint.X + 1, currentPoint.Y));
                pixels.Push(new Point(currentPoint.X, currentPoint.Y - 1));
                pixels.Push(new Point(currentPoint.X, currentPoint.Y + 1));
            }
        }
        return filledPixels;
    }
}
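If it helps to see the idea in isolation, below is a cut-down sketch of the same stack-based fill that works on a plain bool array and only reports how many points are in each island. The "IslandSketch" class is hypothetical and it uses a "visited" array rather than the HashSet approach above, which is a swap that I've made only to keep the example short.

```csharp
using System.Collections.Generic;

public static class IslandSketch
{
    // Returns the size of each connected group of true cells - cells are connected if they
    // touch horizontally or vertically (diagonal neighbours do not count)
    public static List<int> GetIslandSizes(bool[,] mask)
    {
        var width = mask.GetLength(0);
        var height = mask.GetLength(1);
        var visited = new bool[width, height];
        var sizes = new List<int>();
        for (var x = 0; x < width; x++)
        {
            for (var y = 0; y < height; y++)
            {
                if (!mask[x, y] || visited[x, y])
                    continue;

                // An explicit stack is used instead of recursion so that large areas
                // can't cause a stack overflow
                var size = 0;
                var pending = new Stack<(int X, int Y)>();
                pending.Push((x, y));
                while (pending.Count > 0)
                {
                    var (px, py) = pending.Pop();
                    if ((px < 0) || (px >= width) || (py < 0) || (py >= height))
                        continue;
                    if (!mask[px, py] || visited[px, py])
                        continue;

                    visited[px, py] = true;
                    size++;
                    pending.Push((px - 1, py));
                    pending.Push((px + 1, py));
                    pending.Push((px, py - 1));
                    pending.Push((px, py + 1));
                }
                sizes.Add(size);
            }
        }
        return sizes;
    }
}
```

For a mask where (0,0) and (0,1) are set and, separately, (2,1), (3,1) and (3,2) are set, this reports two islands with sizes 2 and 3.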

The problem is that, even with the blurring we performed, there will likely be some groups of distinct objects that are actually part of a single barcode. These areas need to be joined together. It's quite possible for there to be relatively large gaps in the middle of barcodes (there are in the example that we've been looking at) and so we might not easily be able to just take the distinct objects that we've got and join together areas that seem "close enough".

On the basis that individual bars in a barcode are tall compared to the largest possible width that any of them can be (which I'll go into more detail about later on), it seems like a reasonable idea to take any areas that are taller than they are wide and expand their width until they become square. That would give us this:

We'd then work out which of these "squared off" rectangles overlap (if any) and replace overlapping rectangles with rectangles that cover their combined areas, which would look like this:

The only problem with this is that the combined rectangles extend too far to the left and right of the areas, so we need to trim them down. This will be fairly straightforward because we have the information about what distinct objects there are and each object is just a list of points - so we work out which objects have points within each of the combined bounding areas and then, across all of the objects for each combined area, we find the smallest and largest "x" values and the smallest and largest "y" values. That way, we can change the combined bounding areas to only cover actual barcode pixels. Which would leave us with this:

That might sound like a lot of complicated work but if we take a bit of a brute force* approach to it then it can be expressed like this:

private static IEnumerable<Rectangle> GetOverlappingObjectBounds(
    IEnumerable<IEnumerable<Point>> objects)
{
    // Translate each "object" (a list of connected points) into a bounding box (squared off if
    // it was taller than it was wide)
    var squaredOffBoundedObjects = new HashSet<Rectangle>(
        objects.Select((points, index) =>
        {
            var bounds = Rectangle.FromLTRB(
                points.Min(p => p.X),
                points.Min(p => p.Y),
                points.Max(p => p.X) + 1,
                points.Max(p => p.Y) + 1
            );
            if (bounds.Height > bounds.Width)
                bounds.Inflate((bounds.Height - bounds.Width) / 2, 0);
            return bounds;
        })
    );

    // Loop over the boundedObjects and reduce the collection by merging any two rectangles
    // that overlap and then starting again until there are no more bounds merges to perform
    while (true)
    {
        var combinedOverlappingAreas = false;
        foreach (var bounds in squaredOffBoundedObjects)
        {
            foreach (var otherBounds in squaredOffBoundedObjects)
            {
                if (otherBounds == bounds)
                    continue;

                if (bounds.IntersectsWith(otherBounds))
                {
                    squaredOffBoundedObjects.Remove(bounds);
                    squaredOffBoundedObjects.Remove(otherBounds);
                    squaredOffBoundedObjects.Add(Rectangle.FromLTRB(
                        Math.Min(bounds.Left, otherBounds.Left),
                        Math.Min(bounds.Top, otherBounds.Top),
                        Math.Max(bounds.Right, otherBounds.Right),
                        Math.Max(bounds.Bottom, otherBounds.Bottom)
                    ));
                    combinedOverlappingAreas = true;
                    break;
                }
            }
            if (combinedOverlappingAreas)
                break;
        }
        if (!combinedOverlappingAreas)
            break;
    }

    return squaredOffBoundedObjects.Select(bounds =>
    {
        var allPointsWithinBounds = objects
            .Where(points => points.Any(point => bounds.Contains(point)))
            .SelectMany(points => points)
            .ToArray(); // Don't re-evaluate in the four accesses below
        return Rectangle.FromLTRB(
            left: allPointsWithinBounds.Min(p => p.X),
            right: allPointsWithinBounds.Max(p => p.X) + 1,
            top: allPointsWithinBounds.Min(p => p.Y),
            bottom: allPointsWithinBounds.Max(p => p.Y) + 1
        );
    });
}

* (There are definitely more efficient ways that this could be done but since we're only looking at images whose largest side is 450px at most, we're not likely to end up with huge amounts of data to deal with)
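To make the squaring-off and merging a little more concrete, here is a small sketch using two made-up bar-shaped areas. The "BoundsSketch" class and its method names are hypothetical; the Rectangle members that it calls (FromLTRB, Inflate) are the same ones used in the code above.

```csharp
using System;
using System.Drawing;

public static class BoundsSketch
{
    // Widen a taller-than-wide rectangle equally on both sides until it is roughly square
    // ("bounds" is a struct, so the caller's copy is not affected)
    public static Rectangle SquareOff(Rectangle bounds)
    {
        if (bounds.Height > bounds.Width)
            bounds.Inflate((bounds.Height - bounds.Width) / 2, 0);
        return bounds;
    }

    // The smallest rectangle that covers both of the inputs
    public static Rectangle Merge(Rectangle a, Rectangle b)
    {
        return Rectangle.FromLTRB(
            Math.Min(a.Left, b.Left),
            Math.Min(a.Top, b.Top),
            Math.Max(a.Right, b.Right),
            Math.Max(a.Bottom, b.Bottom)
        );
    }
}
```

Take two areas that are each 4px wide and 20px tall, one spanning x = 10 to 14 and the other x = 24 to 28. They don't overlap as they are but, once squared off, the first spans x = 2 to 22 and the second x = 16 to 36 and so they do. Merging them gives a single rectangle spanning x = 2 to 36, which is wider than the x = 10 to 28 that the real pixels cover - hence the trimming step at the end.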

To complete the process, we need to do three more things:

Since barcodes are wider than they are tall, we can discard any regions that don't fit this shape (of which there are two in the example image)
The remaining regions are expanded a little across so that they more clearly surround the barcode region, rather than being butted right up to it (this will make the barcode reading process a little easier)
As the regions that have been identified may well be on a resized version of the source image, they may need to be scaled up so that they correctly apply to the source

To do that, we'll start from this code that we saw earlier:

var mask = Blur(Normalise(combinedGradients), maxRadiusForGradientBlurring)
    .Transform(value => (value >= thresholdForMaskingGradients));

.. and expand it like so (removing the "// .. more will go here soon" comment), using the methods above:

// Determine how much the image was scaled down (if it had to be scaled down at all)
// by comparing the width of the potentially-scaled-down data to the source image
var reducedImageSideBy = (double)image.Width / greyScaleImageData.Width;

var mask = Blur(Normalise(combinedGradients), maxRadiusForGradientBlurring)
    .Transform(value => (value >= thresholdForMaskingGradients));

return GetOverlappingObjectBounds(GetDistinctObjects(mask))
    .Where(boundedObject => boundedObject.Width > boundedObject.Height)
    .Select(boundedObject =>
    {
        var expandedBounds = boundedObject;
        expandedBounds.Inflate(width: expandedBounds.Width / 10, height: 0);
        expandedBounds.Intersect(
            Rectangle.FromLTRB(0, 0, greyScaleImageData.Width, greyScaleImageData.Height)
        );
        return new Rectangle(
            x: (int)(expandedBounds.X * reducedImageSideBy),
            y: (int)(expandedBounds.Y * reducedImageSideBy),
            width: (int)(expandedBounds.Width * reducedImageSideBy),
            height: (int)(expandedBounds.Height * reducedImageSideBy)
        );
    });
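As a worked example of the expand-clip-scale part of that, here is a sketch that performs the same three operations on a single rectangle. "ScaleSketch" and "ExpandAndScale" are names that I've made up for the illustration and the numbers used afterwards are invented rather than taken from the peanut butter image.

```csharp
using System.Drawing;

public static class ScaleSketch
{
    // bounds is in scaled-down coordinates, scaledDownSize is the size of the greyscale data
    // and sourceImageWidth is the width of the original image
    public static Rectangle ExpandAndScale(
        Rectangle bounds,
        Size scaledDownSize,
        int sourceImageWidth)
    {
        var reducedImageSideBy = (double)sourceImageWidth / scaledDownSize.Width;

        // Widen by 10% on each side and then clip to the scaled-down image area
        bounds.Inflate(bounds.Width / 10, 0);
        bounds.Intersect(new Rectangle(Point.Empty, scaledDownSize));

        return new Rectangle(
            (int)(bounds.X * reducedImageSideBy),
            (int)(bounds.Y * reducedImageSideBy),
            (int)(bounds.Width * reducedImageSideBy),
            (int)(bounds.Height * reducedImageSideBy)
        );
    }
}
```

If the source image was 1200px wide and was processed at 300x225 then the scale factor is 4. A region at x = 5, y = 100 that is 100px wide and 40px tall gets widened to span x = -5 to 115, is clipped to x = 0 to 115 and then scales up to a 460x160 rectangle at x = 0, y = 400 on the source image.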

The final result is that the barcode has been successfully located on the image - hurrah!

With this information, we should be able to extract regions or "sub images" from the source image and attempt to decipher the barcode value in it (presuming that there IS a bar code in it and we haven't got a false positive match).

As we'll see in a moment, the barcode doesn't have to be perfectly lined up - some rotation is acceptable (depending upon the image, up to around 20 or 30 degrees should be fine). The MyFitnessPal app has a couple of fallbacks that I've noticed, such as being able to read barcodes that are upside down or even back to front (which can happen if a barcode is scanned from the wrong side of a transparent wrapper). While I won't be writing code here for either of those approaches, I'm sure that you could envisage how it could be done - the source image data could be processed as described here and then, if no barcode is read, rotated 180 degrees and re-processed and reversed and re-processed, etc..

How to read a bar code

A barcode is comprised of both black and white bars - so it's not just the black parts that are significant, it is the spaces between them as well.

The format of a barcode is as follows:

Three single-width bars (a black one, a white one and another black one) that are used to gauge what is considered to be a "single width"
Information for six numbers then appears, where each number is encoded by a sequence of four bars (white, black, white, black) - particular combinations of bar widths relate to particular digits (see below)
Another guard section appears with five single width bars (white, black, white, black, white)
Six more numbers appear (using the same bar-width-combinations encoding as before but the groups of four bars are now black, white, black, white)
A final guard section of three single-width bars (black, white, black)

The numbers are encoded using the following system:

 Digit      Bar widths

   0        3, 2, 1, 1
   1        2, 2, 2, 1
   2        2, 1, 2, 2
   3        1, 4, 1, 1
   4        1, 1, 3, 2
   5        1, 2, 3, 1
   6        1, 1, 1, 4
   7        1, 3, 1, 2
   8        1, 2, 1, 3
   9        3, 1, 1, 2

(Note that every combination of values totals 7 when they are added up - this is very helpful later!)
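One way to picture an entry in that table is to draw it out one single-width unit at a time. The sketch below does that for a set of bar widths, writing "1" for each black unit and "0" for each white unit; the "ModuleSketch" class is a hypothetical helper that is only here for illustration.

```csharp
using System.Text;

public static class ModuleSketch
{
    // Expands bar widths into one character per single-width unit, alternating colour for
    // each bar ('1' for black and '0' for white)
    public static string ToModules(int[] widths, bool startsWithBlack)
    {
        var modules = new StringBuilder();
        var isBlack = startsWithBlack;
        foreach (var width in widths)
        {
            modules.Append(isBlack ? '1' : '0', width);
            isBlack = !isBlack;
        }
        return modules.ToString();
    }
}
```

For the digit 0 (widths 3, 2, 1, 1), a group in the first half of the barcode starts with a white bar and so comes out as 0001101, while a group in the second half starts with a black bar and comes out as 1110010. Either way, the digit occupies exactly seven units.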

To see what that looks like in the real world, here's a slice of that barcode from the jar of peanut butter with each section and each numeric value identified:

(I should point out that the article How does a barcode work? was extremely helpful in the research I did for this post and I'm very grateful to the author for having written it in such an approachable manner!)

Any combination of bar widths that is not found in the table is considered to be invalid. On the one hand, you might think that this is a potential loss; the format could support more combinations of bar widths to encode more values and then more data could be packed into the same space. There is an advantage, however, to having relatively few valid combinations of bar widths - it makes it easier to tell whether the information being read appears to be correct. If a combination is encountered that seems incorrect then the read attempt should be aborted and retried. The format has existed for decades and it would make sense, bearing that in mind, to prioritise making it easier for the hardware to read rather than prioritising trying to cram as much data in there as possible. There is also a checksum included in the numerical data to try to catch any "misreads" but when working with low resolutions or hardware with little computing power, the easier that it is to bail out of a scan and to retry the better.

The way to tackle the reading is to:

Convert the sub image to greyscale
Create a binary mask so that the darker pixels become 1 and the lighter ones become 0
Take a single line across the area
Change the individual 1s and 0s into lengths of continuous "runs" of values
- eg. 0001100 would become 3, 2, 2 because there are three 0s then two 1s and then two 0s
These runs of values will represent the different sized (black and white) bars that were encountered
- For a larger image, each run length will be longer than for a small image but that won't matter because when we encounter runs of four bar length values that we think should be interpreted as a single digit, we'll do some dividing to try to guess the average size of a single width bar
Take these runs of values, skip through the expected guard regions and try to interpret each set of four bars that is thought to represent a digit of the bar code as that digit
If successful then perform a checksum calculation on the output and return the value as a success if it meets expectations
If the line couldn't be interpreted as a barcode or the checksum calculation fails then take the next line down and go back to step 4 (changing the 1s and 0s into run lengths)
If there are no more lines to attempt then a barcode could not be identified in the image
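The "runs of values" step from the list above can be sketched in isolation like this. It works on a string of 1s and 0s purely to keep the example easy to read and "RunLengthSketch" is a made-up name.

```csharp
using System.Collections.Generic;

public static class RunLengthSketch
{
    // Collapses a sequence such as "0001100" into the lengths of its runs of repeated values
    public static List<int> GetRunLengths(string bits)
    {
        var lengths = new List<int>();
        for (var i = 0; i < bits.Length; i++)
        {
            if ((i > 0) && (bits[i] == bits[i - 1]))
                lengths[lengths.Count - 1]++; // Same value as before, so extend the current run
            else
                lengths.Add(1); // The value changed (or this is the first), so start a new run
        }
        return lengths;
    }
}
```

Calling GetRunLengths("0001100") returns 3, 2, 2, as in the example in the list. Note that the output doesn't record whether each run was dark or light, only how long it was.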

This processing is fairly light computationally and so there is no need to resize the "may be a barcode" image region before attempting the work. In fact, it's beneficial to not shrink it as shrinking it will likely make the barcode section fuzzier and that makes the above steps less likely to work - the ideal case for creating a binary mask is where there is no significant "seepage" of pixel intensity between the black bar areas and the white bar areas. That's not to say that the images have to be crystal clear or perfectly aligned with the camera because the redundancy built into the format works in our favour here - if one line across the image can't be read because it's fuzzy then there's a good chance that one of the other lines will be legible.

We expect to find precisely 60 length values - there should be some blank space before the barcode starts (1), then a guard section of three single-width bars that we use to gauge bar width (3), then six numbers that are encoded in four bars each (6x4=24), then a guard section of five single-width bars (5), then six more numbers (6x4=24) and then a final guard region of three single-width bars (3), giving 1+3+24+5+24+3=60.

There will likely be another section of blank content after the barcode that we ignore.

If we don't want to validate the final guard region then we can work with a barcode image where some of the end is cut off, so long as the data for the 12 digits is there; in this case, 57 lengths is the minimum number that we can accept.

Reading the numeric value with code

I'm going to try to present the code in approximately the same order as the steps presented above. So, firstly we need to convert the sub image to greyscale and create a binary mask from it. Then we'll go line by line down the image data and try to read a value. So we'll take this:

public static string? TryToReadBarcodeValue(Bitmap subImage)
{
    const double threshold = 0.5;

    // Black bars are considered 1, so the mask value is true for a dark pixel (and false for a
    // light one)
    var mask = subImage.GetGreyscale().Transform(intensity => intensity < (256 * threshold));
    for (var y = 0; y < mask.Height; y++)
    {
        var value = TryToReadBarcodeValueFromSingleLine(mask, y);
        if (value is object)
            return value;
    }
    return null;
}

.. and the read-each-slice-of-the-image code looks like this:

private static string? TryToReadBarcodeValueFromSingleLine(
    DataRectangle<bool> barcodeDetails,
    int sliceY)
{
    if ((sliceY < 0) || (sliceY >= barcodeDetails.Height))
        throw new ArgumentOutOfRangeException(nameof(sliceY));

    var lengths = GetBarLengthsFromBarcodeSlice(barcodeDetails, sliceY).ToArray();
    if (lengths.Length < 57)
    {
        // As explained, we'd like 60 bars (which would include the final guard region) but we
        // can still make an attempt with 57 (but no fewer)
        // - There will often be another section of blank content after the barcode that we ignore
        // - If we don't want to validate the final guard region then we can work with a barcode
        //   image where some of the end is cut off, so long as the data for the 12 digits is
        //   there (this will be the case where there are only 57 lengths)
        return null;
    }

    var offset = 0;
    var extractedNumericValues = new List<int>();
    for (var i = 0; i < 14; i++)
    {
        if (i == 0)
        {
            // This should be the first guard region and it should be a pattern of three single-
            // width bars
            offset += 3;
        }
        else if (i == 7)
        {
            // This should be the guard region in the middle of the barcode and it should be a
            // pattern of five single-width bars
            offset += 5;
        }
        else
        {
            var value = TryToGetValueForLengths(
                lengths[offset],
                lengths[offset + 1],
                lengths[offset + 2],
                lengths[offset + 3]
            );
            if (value is null)
                return null;
            extractedNumericValues.Add(value.Value);
            offset += 4;
        }
    }

    // Calculate what the checksum should be based upon the first 11 numbers and ensure that
    // the 12th matches it
    if (extractedNumericValues.Last() != CalculateChecksum(extractedNumericValues.Take(11)))
        return null;

    return string.Join("", extractedNumericValues);
}

With the code below, we find the runs of continuous 0 or 1 lengths that will represent bars and return that list (again, for larger images each run will be longer and for smaller images each run will be shorter but this will be taken care of later) -

private static IEnumerable<int> GetBarLengthsFromBarcodeSlice(
    DataRectangle<bool> barcodeDetails,
    int sliceY)
{
    if ((sliceY < 0) || (sliceY >= barcodeDetails.Height))
        throw new ArgumentOutOfRangeException(nameof(sliceY));

    // Take the horizontal slice of the data
    var values = new List<bool>();
    for (var x = 0; x < barcodeDetails.Width; x++)
        values.Add(barcodeDetails[x, sliceY]);

    // Split the slice into bars - we only care about how long each segment is when they
    // alternate, not whether they're dark bars or light bars
    var segments = new List<Tuple<bool, int>>();
    foreach (var value in values)
    {
        if ((segments.Count == 0) || (segments[^1].Item1 != value))
            segments.Add(Tuple.Create(value, 1));
        else
            segments[^1] = Tuple.Create(value, segments[^1].Item2 + 1);
    }
    if ((segments.Count > 0) && !segments[0].Item1)
    {
        // Remove the white space before the first bar
        segments.RemoveAt(0);
    }
    return segments.Select(segment => segment.Item2);
}

Now we need to implement the "TryToGetValueForLengths" method that "TryToReadBarcodeValueFromSingleLine" calls. This takes four bar lengths that are thought to represent a single digit in the bar code value (they are not part of a guard region or anything like that). It takes those four bar lengths and guesses how many pixels across a single bar would be - which is made much simpler by the fact that all of the possible combinations of bar lengths in the lookup chart that we saw earlier add up to 7.

There's a little flexibility introduced here to try to account for a low quality image or for the threshold being a bit strong in the creation of the binary mask; we'll take that calculated expected width of a single bar and tweak it up or down a little if applying that division to the bar lengths means that we made some of the bars so small that they disappeared or so large that the total width seemed to be more than seven single estimated-width bars. There's only a little flexibility here because if we fail then we can always try another line of the image! (Or maybe it will turn out that this sub image was a false positive match and there isn't a bar code in it at all).

private static int? TryToGetValueForLengths(int l0, int l1, int l2, int l3)
{
    if (l0 <= 0)
        throw new ArgumentOutOfRangeException(nameof(l0));
    if (l1 <= 0)
        throw new ArgumentOutOfRangeException(nameof(l1));
    if (l2 <= 0)
        throw new ArgumentOutOfRangeException(nameof(l2));
    if (l3 <= 0)
        throw new ArgumentOutOfRangeException(nameof(l3));

    // Take a guess at what the width of a single bar is based upon these four values
    // (the four bars that encode a number should add up to a width of seven)
    var raw = new[] { l0, l1, l2, l3 };
    var singleWidth = raw.Sum() / 7d;
    var adjustment = singleWidth / 10;
    var attemptedSingleWidths = new HashSet<double>();
    while (true)
    {
        var normalised = raw.Select(x => Math.Max(1, (int)Math.Round(x / singleWidth))).ToArray();
        var sum = normalised.Sum();
        if (sum == 7)
            return TryToGetNumericValue(normalised[0], normalised[1], normalised[2], normalised[3]);

        attemptedSingleWidths.Add(singleWidth);
        if (sum > 7)
            singleWidth += adjustment;
        else
            singleWidth -= adjustment;
        if (attemptedSingleWidths.Contains(singleWidth))
        {
            // If we've already tried this width-of-a-single-bar value then give up -
            // it doesn't seem like we can make the input values make sense
            return null;
        }
    }

    static int? TryToGetNumericValue(int i0, int i1, int i2, int i3)
    {
        var lookFor = string.Join("", new[] { i0, i1, i2, i3 });
        var lookup = new[]
        {
            // These values correspond to the lookup chart shown earlier
            "3211", "2221", "2122", "1411", "1132", "1231", "1114", "1312", "1213", "3112"
        };
        for (var i = 0; i < lookup.Length; i++)
        {
            if (lookFor == lookup[i])
                return i;
        }
        return null;
    }
}
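
To make the normalisation step a little more concrete, here is a minimal sketch of a single pass of it (without the retry loop). The "NormaliseOnce" name and the example pixel lengths are made up for illustration and are not part of the project code. Four bars measuring 9, 6, 3 and 3 pixels add up to 21, which suggests a single bar width of 3 pixels and so the normalised values are 3, 2, 1, 1 - which is the "3211" entry in the lookup and so represents the digit zero.

```csharp
using System;
using System.Linq;

public static class NormalisationSketch
{
    // Hypothetical helper (not in the original code): performs one normalisation pass over
    // four measured bar lengths, using the same arithmetic as TryToGetValueForLengths
    public static int[] NormaliseOnce(int[] barLengths)
    {
        if (barLengths == null)
            throw new ArgumentNullException(nameof(barLengths));
        if (barLengths.Length != 4)
            throw new ArgumentException("Should be provided with precisely 4 values");

        // The four bars that encode a digit should span seven single-bar widths in total
        var singleWidth = barLengths.Sum() / 7d;
        return barLengths
            .Select(length => Math.Max(1, (int)Math.Round(length / singleWidth)))
            .ToArray();
    }
}
```

Calling NormalisationSketch.NormaliseOnce(new[] { 9, 6, 3, 3 }) gives 3, 2, 1, 1 and slightly noisy measurements such as 10, 5, 3, 3 normalise to the same values, which is the sort of tolerance that the real method relies upon before it resorts to tweaking the estimated width.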

Finally we need the CalculateChecksum method (as noted in the code, there's a great explanation of how to do this on Wikipedia) -

private static int CalculateChecksum(IEnumerable<int> values)
{
    if (values == null)
        throw new ArgumentNullException(nameof(values));
    if (values.Count() != 11)
        throw new ArgumentException("Should be provided with precisely 11 values");

    // See https://en.wikipedia.org/wiki/Check_digit#UPC
    var checksumTotal = values
        .Select((value, index) => (index % 2 == 0) ? (value * 3) : value)
        .Sum();
    var checksumModulo = checksumTotal % 10;
    if (checksumModulo != 0)
        checksumModulo = 10 - checksumModulo;
    return checksumModulo;
}
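
As a quick sanity check of that arithmetic, here is a small standalone sketch that validates a complete 12 digit value. The "ChecksumSketch" class and its method names are hypothetical (they are not in the project code) and the weighting logic is restated from the method above only so that the block compiles on its own. For a value such as 036000291452, the digits in the odd positions (0, 6, 0, 2, 1, 5) add up to 14, which is multiplied by 3 to give 42, the other five of the first eleven digits (3, 0, 0, 9, 4) add up to 16, the total of 58 leaves a remainder of 8 when divided by 10 and so the check digit is 10 - 8 = 2, which matches the final digit.

```csharp
using System;
using System.Linq;

public static class ChecksumSketch
{
    // Same weighting as CalculateChecksum: every other digit (starting with the first)
    // is multiplied by three before everything is added together
    public static int CalculateCheckDigit(int[] firstElevenDigits)
    {
        if (firstElevenDigits == null)
            throw new ArgumentNullException(nameof(firstElevenDigits));
        if (firstElevenDigits.Length != 11)
            throw new ArgumentException("Should be provided with precisely 11 values");

        var total = firstElevenDigits
            .Select((value, index) => (index % 2 == 0) ? (value * 3) : value)
            .Sum();
        return (10 - (total % 10)) % 10;
    }

    // Hypothetical convenience method: true if the twelfth digit is the check digit
    // that would be calculated from the first eleven
    public static bool HasValidCheckDigit(string twelveDigits)
    {
        if (twelveDigits == null || twelveDigits.Length != 12)
            return false;
        if (!twelveDigits.All(c => c >= '0' && c <= '9'))
            return false;

        var digits = twelveDigits.Select(c => c - '0').ToArray();
        return CalculateCheckDigit(digits.Take(11).ToArray()) == digits[11];
    }
}
```

With this, ChecksumSketch.HasValidCheckDigit("036000291452") is true while changing the final digit to anything else makes it false.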

With this code, we have executed all of the planned steps outlined before.

It should be noted that, even with the small amount of flexibility in the "TryToGetValueForLengths" method, in the peanut butter bar code example it requires 15 calls to "GetBarLengthsFromBarcodeSlice" until a bar code is successfully matched! Presumably, this is because there is a little more distortion further up the bar code due to the curve of the jar.

That's not to say, however, that this approach to bar code reading is particularly fussy. The redundancy and simplicity, not to mention the size of the average bar code, means that there is plenty of opportunity to try reading a sub image in multiple slices until one of them does match. In fact, I mentioned earlier that the barcode doesn't have to be perfectly at 90 degrees in order to be interpretable and that some rotation is acceptable. This hopefully makes some intuitive sense based upon the logic above and how it doesn't matter how long each individual bar code line is because they are averaged out - if a bar code was rotated a little and then a read was attempted of it line by line then the ratios between each line should remain consistent and the same data should be readable.

To illustrate, here's a zoomed-in section of the middle of the peanut butter bar code in the orientation shown so far:

If we then rotate it like this:

.. then the code above will still read the value correctly because a strip across the rotated bar code looks like this:

Hopefully it's clear enough that, for each given line, the ratios are essentially the same as for the non-rotated strip:

To get a reading from an image that is rotated more than this requires a very clear source image and will still be limited by the first stage of processing - which tried to find sections where the horizontal image intensity changed with steep gradients but the vertical intensity did not. If the image is rotated too much then there will be more vertical image intensity differences encountered and it is less likely to identify it as a "maybe a bar code" region.
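
The point about the ratios surviving a small rotation can be checked numerically with a rough sketch like the one below. It assumes idealised bars (perfectly sharp and tall enough that the horizontal line never runs off the top or bottom of them), in which case a horizontal line crossing a bar of width w that has been rotated by some angle travels a distance of w / cos(angle). The "RotationSketch" class and its method names are made up for this illustration and the real code works with whole pixel counts rather than doubles.

```csharp
using System;
using System.Linq;

public static class RotationSketch
{
    // Hypothetical helper: the lengths that a horizontal line would measure when crossing
    // idealised bars of the given widths after they have been rotated by the given angle
    public static double[] HorizontalRunLengths(double[] barWidths, double rotationDegrees)
    {
        if (barWidths == null)
            throw new ArgumentNullException(nameof(barWidths));
        if (Math.Abs(rotationDegrees) >= 90)
            throw new ArgumentOutOfRangeException(nameof(rotationDegrees));

        var cos = Math.Cos(rotationDegrees * Math.PI / 180);
        return barWidths.Select(width => width / cos).ToArray();
    }

    // Scales the measured lengths so that they add up to seven single-bar widths
    public static double[] ScaleToSevenUnits(double[] runLengths)
    {
        if (runLengths == null)
            throw new ArgumentNullException(nameof(runLengths));

        var singleWidth = runLengths.Sum() / 7;
        return runLengths.Select(length => length / singleWidth).ToArray();
    }
}
```

If bars of 12, 8, 4 and 4 pixels are rotated by 15 degrees then every measured length gets about 3.5% longer but, because they all grow by the same factor, scaling them back to seven units still gives 3, 2, 1, 1 (give or take floating point rounding).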

(Note: I experimented with rotated images that were produced by an online barcode generator and had more success - meaning that I could rotate them more than I could with real photographs - but that's because those images are generated with stark black and white and the horizontal / vertical intensity gradients are maintained for longer when the image is rotated if they start with such a high level of clarity.. I'm more interested in reading values from real photographs and so I would suggest that only fairly moderate rotation will work - though it would still be plenty for a MyFitnessPal-type app that expects the User to hold the bar code in roughly the right orientation!)

Tying it all together

We've looked at the separate steps involved in the whole reading process and all that is left is to combine them. The "GetPossibleBarcodeAreasForBitmap" and "TryToReadBarcodeValue" methods can be put together into a fully functioning program like this:

static void Main()
{
    using var image = new Bitmap("Source.jpg");

    var barcodeValues = new List<string>();
    foreach (var area in GetPossibleBarcodeAreasForBitmap(image))
    {
        using var areaBitmap = new Bitmap(area.Width, area.Height);
        using (var g = Graphics.FromImage(areaBitmap))
        {
            g.DrawImage(
                image,
                destRect: new Rectangle(0, 0, areaBitmap.Width, areaBitmap.Height),
                srcRect: area,
                srcUnit: GraphicsUnit.Pixel
            );
        }
        var valueFromBarcode = TryToReadBarcodeValue(areaBitmap);
        if (valueFromBarcode is object)
            barcodeValues.Add(valueFromBarcode);
    }

    if (!barcodeValues.Any())
        Console.WriteLine("Couldn't read any bar codes from the source image :(");
    else
    {
        Console.WriteLine("Read the following bar code(s) from the image:");
        foreach (var barcodeValue in barcodeValues)
            Console.WriteLine("- " + barcodeValue);
    }

    Console.WriteLine();
    Console.WriteLine("Press [Enter] to terminate..");
    Console.ReadLine();
}

Finito!

And with that, we're finally done! I must admit that I started writing this post about three years ago and it's been in my TODO list for a loooooong time now. But I've taken a week off work and been able to catch up with a few things and have finally been able to cross it off the list. And I'm quite relieved that I didn't give up on it entirely because it was a fun little project and coming back to it now allowed me to tidy it up a bit with the newer C# 8 syntax and even enable the nullable reference types option on the project (I sure do hate unintentional nulls being allowed to sneak in!)

A quick reminder if you want to see it in action or play about with it yourself, the GitHub repo is here.

Thanks to anyone that read this far!
