Productive Rage

Dan's techie ramblings

Migrating my Full Text Indexer to .NET Core (supporting multi-target NuGet packages)

So it looks increasingly like .NET Core is going to be an important technology in the near future, in part because Microsoft is developing much of it in the open (in a significant break from their past approach to software), in part because some popular projects support it (Dapper, AutoMapper, Json.NET) and in part because of excitement from blog posts such as ASP.NET Core – 2300% More Requests Served Per Second.

All I really knew about it was that it was a cut-down version of the .NET framework which should be able to run on platforms other than Windows, which might be faster in some cases and which may still undergo some important changes in the near future (such as moving away from the new "project.json" project files and back to something more traditional in terms of Visual Studio projects - see The Future of project.json in ASP.NET Core).

To try to find out more, I've taken a codebase that I wrote years ago and have migrated it to .NET Core. It's not enormous but it spans multiple projects, has a (small-but-better-than-nothing) test suite and supports serialising search indexes to and from disk for caching purposes. My hope was that I would be able to probe some of the limitations of .NET Core with this non-trivial project but that it wouldn't be such a large task that it would take more than a few sessions spaced over a few days to complete.

Spoilers

Would I be able to migrate one project at a time within the solution to .NET Core and still have the whole project building successfully (while some of the other projects were still targeting the "full fat" .NET framework)? Yes, but some hacks are required.

Would it be easy (or even possible) to create a NuGet package that would work on both .NET Core and .NET 4.5? Yes.

Would the functionality that is no longer present in .NET Core cause me problems? Largely, no. The restricted reflection capabilities, no - if I pull in an extra dependency. The restricted serialisation facilities, yes (but I'm fairly happy with the solution and compromises that I ended up with).

What, really, is .NET Core (and what is the Full Text Indexer)?

Essentially,

.NET Core is a new cross-platform .NET Product .. [and] is composed of the following parts: A .NET runtime .. A set of framework libraries .. A set of SDK tools and language compilers .. The 'dotnet' app host, which is used to launch .NET Core apps

(from Scott Hanselman's ".NET Core 1.0 is now released!" post)

What .NET Core means in the context of this migration is that there are new project types in Visual Studio to use that target new .NET frameworks. Instead of .NET 4.6.1, for example, there is "netstandard1.6" for class libraries and "netcoreapp1.0" for console applications.

The new Visual Studio project types become available after you install the Visual Studio Tooling - alternatively, the "dotnet" command line tool makes things very easy, so you could create projects using nothing more than notepad and "dotnet" if you wanted to! Since I was just getting started, I chose to stick in my Visual Studio comfort zone.

The Full Text Indexer code that I'm migrating was something that I wrote a few years ago while I was working with a Lucene integration ("this full text indexing lark.. how hard could it really be!"). It's a set of class libraries; "Common" (which has no dependencies other than the .NET framework), "Core" (which depends upon Common), "Helpers" (which depends upon both Common and Core), and "Querier" (which also depends upon Common and Core). Then there is a "UnitTests" project and a "Tester" console application, which loads some data from a Sqlite database file, constructs an index and then performs a search or two (just to demonstrate how it works end-to-end).

My plan was to try migrating one project at a time over to .NET Core, to move in baby steps so that I could be confident that everything would remain in a working state for most of the time.

Creating the first .NET Core project

The first thing I did was delete the "Common" project entirely (deleted it from Visual Studio and then manually deleted all of the files) and then created a brand new .NET Core class library called "Common". I then used my source control client to revert the deletions of the class files so that they appeared within the new project's folder structure. I expected to then have to "Show All Files" and explicitly include these files in the project but it turns out that .NET Core project files don't specify files to include, it's presumed that all files in the folder will be included. Makes sense!

It wouldn't compile, though, because some of the classes have the [Serializable] attribute on them and this doesn't exist in .NET Core. As I understand it, that's because the framework's serialisation mechanisms have been stripped right back with the intention of the framework being able to specialise at framework-y core competencies and for there to be an increased reliance on external libraries for other functionality.

This attribute is used throughout my library because there is an IndexDataSerialiser that allows an index to be persisted to disk for caching purposes. It uses the BinaryFormatter to do this, which requires that the types to be serialised are decorated with the [Serializable] attribute or that they implement the ISerializable interface. Neither the BinaryFormatter nor the ISerializable interface are available within .NET Core. I will need to decide what to do about this later - ideally, I'd like to continue to be able to support reading and writing to the same format as I have done before (if only to see if it's possible when migrating to Core). For now, though, I'll just remove the [Serializable] attributes and worry about it later.
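For context, the full framework approach that the disk-caching relies upon looks something like this minimal sketch (CachedIndexData is a stand-in class for illustration - the real index types are more complicated):

```csharp
using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

// Stand-in type for illustration - the real index types are more complicated
[Serializable]
public class CachedIndexData
{
    public string Token { get; set; }
    public float Weight { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var data = new CachedIndexData { Token = "cats", Weight = 1.5f };
        var formatter = new BinaryFormatter();

        // Persist to disk..
        using (var stream = File.Create("index.bin"))
            formatter.Serialize(stream, data);

        // ..and read it back - the cast works because the full type information
        // was written into the binary stream
        using (var stream = File.OpenRead("index.bin"))
        {
            var loaded = (CachedIndexData)formatter.Deserialize(stream);
            Console.WriteLine(loaded.Token); // "cats"
        }
    }
}
```

On .NET 4.5.2 this round trip works so long as every type in the object graph is [Serializable] (or implements ISerializable) - on netstandard1.6 neither the BinaryFormatter nor the attribute exist, which is precisely the problem being described here.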

So, with very little work, the Common project was compiling for the "netstandard1.6" target framework.

Unfortunately, the projects that rely on Common weren't compiling because their references to it were removed when I removed the project from the VS solution. And, if I try to add references to the new Common project I'm greeted with this:

A reference to 'Common' could not be added. An assembly must have a 'dll' or 'exe' extension in order to be referenced.

The problem is that Common is being built for "netstandard1.6" and only that framework. I also want it to support a "full fat" .NET framework, such as 4.5.2 - in order to do this, I need to edit the project.json file so that the build process creates multiple versions of the project; one for .NET 4.5.2 as well as the one for netstandard. That means changing it from this:

{
  "version": "1.0.0-*",

  "dependencies": {
    "NETStandard.Library": "1.6.0"
  },

  "frameworks": {
    "netstandard1.6": {
      "imports": "dnxcore50"
    }
  }
}

to this:

{
  "version": "1.0.0-*",

  "dependencies": {},

  "frameworks": {
    "netstandard1.6": {
      "imports": "dnxcore50",
      "dependencies": {
        "NETStandard.Library": "1.6.0"
      }
    },
    "net452": {}
  }
}

Two things have happened - an additional entry has been added to the "frameworks" section ("net452" joins "netstandard1.6") and the "NETStandard.Library" dependency has moved from being something that is always required by the project to something that is only required by the project when it's being built for netstandard.

Now, Common may be added as a reference to the other projects.

However.. they won't compile. Visual Studio will be full of errors stating that required classes do not exist in the current context.

Adding a reference to a .NET Core project from a .NET 4.5.2 project in the same solution

Although the project.json configuration does mean that two versions of the Common library are being produced (looking in the bin/Debug folder, there are two sub-folders - "net452" and "netstandard1.6" - and each has its own binaries in it), it seems that the "Add Reference" functionality in Visual Studio doesn't (currently) support adding references to them. There is an issue on GitHub about this; Allow "Add Reference" to .NET Core class library that uses .NET Framework from a traditional class library - but it seems like the conclusion is that this will be fixed in the future, when the changes have been completed that move away from .NET Core projects having a project.json file and towards a new kind of ".csproj" file.

There is a workaround, though. Instead of selecting the project from the Add Reference dialog, you click "Browse" and then select the dll from the "Common/bin/Debug/net452" folder. Then the project will build. This isn't a perfect solution, though, since it will always reference the Debug build. When building in Release configuration, you also want the referenced binaries from other projects to be built in Release configuration. To do that, I had to open each ".csproj" file in notepad and change

<Reference Include="Common">
  <HintPath>..\Common\bin\Debug\net452\Common.dll</HintPath>
</Reference>

to

<Reference Include="Common">
  <HintPath>..\Common\bin\$(Configuration)\net452\Common.dll</HintPath>
</Reference>

A little bit annoying but not the end of the world (credit for this fix, btw, goes to this Stack Overflow answer to Attach unit tests to ASP.NET Core project).

What makes it even more annoying is that the link from the referencing project (say, the Core project) to the referenced project (the Common project) is not as tightly integrated as when a project reference is added through Visual Studio in the normal way. For example, while you can set breakpoints in the Common project and they will be hit when the Core project calls into that code, using "Go To Definition" to navigate from code in the Core project into code in the referenced Common project doesn't work (it takes you to a "from metadata" view rather than to the actual file). On top of this, the referencing project doesn't know that it needs to be rebuilt if the referenced project is rebuilt - so, if the Common library is changed and rebuilt then the Core library may continue to work against an old version of the Common binary unless you explicitly rebuild Core as well.

These are frustrations that I would not want to live with long term. However, the plan here is to migrate all of the projects over to .NET Core and so I think that I can put up with these limitations so long as they only affect me as I migrate the projects over one-by-one.

The second project (additional dependencies required)

I repeated the procedure for the second project; "Core". This also contained files with types marked as [Serializable] (which, again, I just removed for now) and there was the IndexDataSerialiser class, which used the BinaryFormatter to allow data to be persisted to disk - this also had to go, since there is no support for it in .NET Core (I'll talk about what I did with serialisation later on). I needed to add a reference to the Common project - thankfully, adding a reference to a .NET Core project from a .NET Core project works perfectly, so the workaround that I had to apply before (when the Core project was still .NET 4.5.2) wasn't necessary.

However, it still didn't compile.

In the "Core" project lives the EnglishPluralityStringNormaliser class, which is used to adjust tokens (ie. words) so that the singular and plural versions of the same word are considered equivalent (eg. "cat" and "cats", "category" and "categories"). Internally, it generates a compiled LINQ expression to try to perform its work as efficiently as possible and it requires reflection to do that. Calling "GetMethod" and "GetProperty" on a Type is not supported in netstandard, though, and an additional dependency is required. So the Core project.json file needed to be changed to look like this:

{
  "version": "1.0.0-*",

  "dependencies": {
    "Common": "1.0.0-*"
  },

  "frameworks": {
    "netstandard1.6": {
      "imports": "dnxcore50",
      "dependencies": {
        "NETStandard.Library": "1.6.0"
        "NETStandard.Library": "1.6.0",
        "System.Reflection.TypeExtensions": "4.1.0"
      }
    },
    "net452": {}
  }
}

The Common project is a dependency regardless of the target framework during the build process but the "System.Reflection.TypeExtensions" package is also required when building for netstandard (though not for .NET 4.5.2), as this includes extension methods for Type such as "GetMethod" and "GetProperty".

Note: Since these are extension methods in netstandard, a "using System.Reflection;" statement is required at the top of the class - this is not required when building for .NET 4.5.2 because "GetMethod" and "GetProperty" are instance methods on Type.
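To make it concrete, here's a much simplified sketch of the sort of reflection-plus-compiled-expression work being described (the real EnglishPluralityStringNormaliser deals with plurality suffixes rather than lower-casing - this is just to illustrate the "GetMethod" call that needs the extra package):

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection; // needed for the netstandard "GetMethod" extension method

public static class NormaliserExample
{
    // Builds a compiled delegate that calls string.ToLower() - a much simplified
    // version of the expression work that the real normaliser performs internally
    public static Func<string, string> GetToLowerCaller()
    {
        var toLowerMethod = typeof(string).GetMethod("ToLower", Type.EmptyTypes);
        var value = Expression.Parameter(typeof(string), "value");
        var callToLower = Expression.Call(value, toLowerMethod);
        return Expression.Lambda<Func<string, string>>(callToLower, value).Compile();
    }
}
```

Calling NormaliserExample.GetToLowerCaller()("CATS") returns "cats". On .NET 4.5.2, the "GetMethod" call binds to the instance method on Type while, on netstandard1.6, it resolves to the extension method from the "System.Reflection.TypeExtensions" package - the calling code is identical either way.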

There was one other dependency that was required for Core to build - "System.Globalization.Extensions". This was because the DefaultStringNormaliser class includes the line

var normalisedValue = value.Normalize(NormalizationForm.FormKD);

which resulted in the error

'string' does not contain a definition for 'Normalize' and no extension method 'Normalize' accepting a first argument of type 'string' could be found (are you missing a using directive or an assembly reference?)

This is another case of functionality that is in .NET 4.5.2 but that is an optional package for .NET Core. Thankfully, it's easy to find out what additional package needs to be included - the "lightbulb" code fix options will try to look for a package to resolve the problem and it correctly identifies that "System.Globalization.Extensions" contains a relevant extension method (as illustrated below).

(Screenshot: the lightbulb code fix offering to "Add package System.Globalization.Extensions 4.0.1")

Note: Selecting the "Add package System.Globalization.Extensions 4.0.1" option will add the package as a dependency for netstandard in the project.json file and it will add the required "using System.Globalization;" statement to the class - which is very helpful of it!

All that remained now was to use the workaround from before to add the .NET Core version of the "Core" project as a reference to the projects that required it.

The third and fourth projects (both class libraries)

The process for the "Helpers" and "Querier" class libraries was simple. Neither required anything that wasn't included in netstandard1.6 and so it was just a case of going through the motions.

The "Tester" Console Application

At this point, all of the projects that constituted the actual "Full Text Indexer" were building for both the netstandard1.6 framework and .NET 4.5.2 - so I could have stopped here, really (aside from the serialisation issues I had been putting off). But I thought I might as well go all the way and see if there were any interesting differences in migrating Console Applications and xUnit test suite projects.

For the Tester project; no, not much was different. It has an end-to-end integration example where it loads data from a Sqlite database file of blog posts using Dapper and then creates a search index. The posts contain markdown content and so three NuGet packages were required - Dapper, System.Data.SQLite and MarkdownSharp.

Dapper supports .NET Core and so that was no problem but the other two did not. Thankfully, though, there were alternatives that did support netstandard - Microsoft.Data.Sqlite and Markdown. Using Microsoft.Data.Sqlite required some (very minor) code changes while Markdown exposed exactly the same interface as MarkdownSharp.
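To give an idea of the scale of those changes, the data access ends up looking something like this (the Post class and the table/column names here are illustrative, not the Tester project's actual schema):

```csharp
using System.Collections.Generic;
using Dapper;
using Microsoft.Data.Sqlite; // SqliteConnection replaces System.Data.SQLite's SQLiteConnection

// Hypothetical shape of a loaded record - not the Tester project's actual classes
public class Post
{
    public int Id { get; set; }
    public string Title { get; set; }
    public string MarkdownContent { get; set; }
}

public static class PostLoader
{
    public static IEnumerable<Post> LoadAll(string databaseFilePath)
    {
        // Dapper's Query extension method works against any DbConnection, so the
        // swap from SQLiteConnection to SqliteConnection is almost invisible here
        using (var connection = new SqliteConnection("Data Source=" + databaseFilePath))
        {
            return connection.Query<Post>("SELECT Id, Title, MarkdownContent FROM Posts");
        }
    }
}
```
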

The xUnit Test Suite Project

The "UnitTests" project didn't require anything very different but there are a few gotchas to watch out for..

The first is that you need to create a "Console Application (.NET Core)" project since xUnit works with the "netcoreapp1.0" framework (which console applications target) and not "netstandard1.6" (which is what class libraries target).

The second is that, presuming you want the Visual Studio test runner integration (which, surely, you do!) you need to not only add the "xunit" NuGet package but also the "dotnet-test-xunit" package. Thirdly, you need to enable the "Include prerelease" option in the NuGet Package Manager to locate versions of these packages that work with .NET Core (this will, of course, change with time - but as of November 2016 these packages are only available as "prereleases").

Fourthly, you need to add a line

"testRunner": "xunit",

to the project.json file.

Having done all of this, the project should compile and the tests should appear in the Test Explorer window.
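At that point, a test class looks just as it would in a full framework xUnit project - for example (an illustrative test, not one from the real suite):

```csharp
using Xunit;

// An illustrative test, not one from the real suite
public class StringNormalisationTests
{
    [Fact]
    public void UpperCasingIsCultureInvariant()
    {
        Assert.Equal("CATS", "cats".ToUpperInvariant());
    }
}
```
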

Note: I wanted to fully understand each step required to create an xUnit test project but you could also just follow the instructions at Getting started with xUnit.net (.NET Core / ASP.NET Core), which provides you a complete project.json to paste in - one of the nice things about .NET Core projects is that changing (and saving) the project.json is all it takes to change from being a class library (and targeting netstandard) to being a console application (and targeting netcoreapp). Similarly, references to other projects and to NuGet packages are all specified there and saving changes to that project file results in those references immediately being resolved and any specified packages being downloaded.

In the class library projects, I made them all target both netstandard and net452. With the test suite project, if the project.json file is changed to target both .NET Core ("netcoreapp1.0", since it's a console app) and full fat .NET ("net452") then two different versions of the suite will be built. The clever thing about this is that if you use the command line to run the tests -

dotnet test

.. then it will run the tests in both versions. Since there are going to be some differences between the frameworks (and, quite feasibly, between different versions of dependencies), it's very handy to be able to run the tests against all of the versions of .NET that your libraries target.

There is a "but" here, though. While the command line test process will target both frameworks, the Visual Studio Test Explorer doesn't. I think that it only targets the first framework that is specified in the project.json file but I'm not completely sure. I just know that it doesn't run them both. Which is a pity. On the bright side, I do like that .NET Core is putting the command line first - not only because I'm a command line junkie but also because it makes it very easy to integrate into build servers and continuous integration processes. I do hope that one day (soon) the VS integration will be as thorough as the command line tester, though.

Building NuGet packages

So, now, there are no errors and everything is building for .NET Core and for "classic"* .NET.

* I'm still not sure what the accepted terminology is for non-.NET-Core projects, I don't really think that "full fat framework" is the official designation :)

There are no nasty workarounds required for the references (like when the not-yet-migrated .NET 4.5.2 projects were referencing the .NET Core projects). It's worth mentioning that that workaround was only required when the .NET 4.5.2 project wanted to reference a .NET Core project from within the same solution - if the project that targeted both "netstandard1.6" and "net452" was built into a NuGet package then that package could be added to a .NET Core project or to a .NET 4.5.2 project without any workarounds. Which makes me think that now is a good time to talk about building NuGet packages from .NET Core projects..

The project.json file has enough information that the "dotnet" command line can create a NuGet package from it. So, if you run the following command (you need to be in the root of the project that you're interested in to do this) -

dotnet pack

.. then you will get a NuGet package built, ready to distribute! This is very handy, it makes things very simple. And if the project.json targets both netstandard1.6 and net452 then you will get a NuGet package that may be added to either a .NET Core project or a .NET 4.5.2 (or later) project.

I hadn't created the Full Text Indexer as a NuGet package before now, so this seemed like a good time to think about how exactly I wanted to do it.

There were a few things that I wanted to change with what "dotnet pack" gave me at this point -

  1. The name and ID of the package come from the project name, so the Core project resulted in a package named "Core", which is too vague
  2. I wanted to include additional metadata in the packages such as a description, project link and icon url
  3. If each project would be built into a separate package then it might not be clear to someone what packages are required and how they work together, so it probably makes sense to have a combined package that pulls in everything

For points one and two, the "project.json reference" documentation has a lot of useful information. It describes the "name" attribute -

The name of the project, used for the assembly name as well as the name of the package. The top level folder name is used if this property is not specified.

So, it sounds like I could add a line to the Common project -

"name": "FullTextIndexer.Common",

.. which would result in the NuGet package for "Common" having the ID "FullTextIndexer.Common". And it does!

However, there is a problem with doing this.

The "Common" project is going to be built into a NuGet package called "FullTextIndexer.Common" so the projects that depend upon it will need updating - their project.json files need to change the dependency from "Common" to "FullTextIndexer.Common". If the Core project, for example, wasn't updated to state "FullTextIndexer.Common" as a dependency then the "Core" NuGet package would have a dependency on a package called "Common", which wouldn't exist (because I want to publish it as "FullTextIndexer.Common"). The issue is that if Core's project.json is updated to specify "FullTextIndexer.Common" as a dependency then the following errors are reported:

NuGet Package Restore failed for one or more packages. See details in the Output window.

The dependency FullTextIndexer.Common >= 1.0.0-* could not be resolved.

The given key was not present in the dictionary.

To cut a long story short, after some trial and error experimenting (and having been unable to find any documentation about this or reports of anyone having the same problem) it seems that the problem is that .NET Core dependencies within a solution depend upon the project folders having the same name as the package name - so my problem was that I had a project folder called "Common" that was building a NuGet package called "FullTextIndexer.Common". Renaming the "Common" folder to "FullTextIndexer.Common" fixed it. It probably makes sense to keep the project name, package name and folder name consistent in general, I just wish that the error messages had been more helpful because the process of discovering this was very frustrating!

Note: Since I renamed the project folder to "FullTextIndexer.Common", I didn't need the "name" option in the project.json file and so I removed it (the default behaviour of using the top level folder name is fine).

The project.json reference made the second task simple, though, by documenting the "packOptions" section. To cut to the chase, I changed Common's project.json to the following:

{
  "version": "1.0.0-*",

  "packOptions": {
    "iconUrl": "https://secure.gravatar.com/avatar/6a1f781d4d5e2d50dcff04f8f049767a?s=200",
    "projectUrl": "https://bitbucket.org/DanRoberts/full-text-indexer",
    "tags": [ "C#", "full text index", "search" ]
  },
  "authors": [ "ProductiveRage" ],
  "copyright": "Copyright 2016 ProductiveRage",

  "dependencies": {},

  "frameworks": {
    "netstandard1.6": {
      "imports": "dnxcore50",
      "dependencies": {
        "NETStandard.Library": "1.6.0"
      }
    },
    "net452": {}
  }
}

I updated the other class library projects similarly and updated the dependency names on all of the projects in the solution so that everything was consistent and compiling.

Finally, I created an additional project named simply "FullTextIndexer" whose only role in life is to generate a NuGet package that includes all of the others (it doesn't have any code of its own). Its project.json file looks like this:

{
  "version": "1.0.0-*",

  "packOptions": {
    "summary": "A project to try implementing a full text index service from scratch in C# and .NET Core",
    "iconUrl": "https://secure.gravatar.com/avatar/6a1f781d4d5e2d50dcff04f8f049767a?s=200",
    "projectUrl": "https://bitbucket.org/DanRoberts/full-text-indexer",
    "tags": [ "C#", "full text index", "search" ]
  },
  "authors": [ "ProductiveRage" ],
  "copyright": "Copyright 2016 ProductiveRage",

  "dependencies": {
    "FullTextIndexer.Common": "1.0.0-*",
    "FullTextIndexer.Core": "1.0.0-*",
    "FullTextIndexer.Helpers": "1.0.0-*",
    "FullTextIndexer.Querier": "1.0.0-*"
  },

  "frameworks": {
    "netstandard1.6": {
      "imports": "dnxcore50",
      "dependencies": {
        "NETStandard.Library": "1.6.0"
      }
    },
    "net452": {}
  }
}

One final note about NuGet packages before I move on - the default behaviour of "dotnet pack" is to build the project in Debug configuration. If you want to build in release mode then you can use the following:

dotnet pack --configuration Release

"Fixing" the serialisation problem

Serialisation in .NET Core seems to be a bone of contention - the Microsoft Team are sticking to their guns in terms of not supporting it and, instead, promoting other solutions:

Binary Serialization

Justification. After a decade of servicing we've learned that serialization is incredibly complicated and a huge compatibility burden for the types supporting it. Thus, we made the decision that serialization should be a protocol implemented on top of the available public APIs. However, binary serialization requires intimate knowledge of the types as it allows to serialize object graphs including private state.

Replacement. Choose the serialization technology that fits your goals for formatting and footprint. Popular choices include data contract serialization, XML serialization, JSON.NET, and protobuf-net.

(from "Porting to .NET Core")

Meanwhile, people have voiced their disagreement in GitHub issues such as "Question: Serialization support going forward from .Net Core 1.0".

The problem with recommendations such as Json.NET and protobuf-net is that they require changes to code that previously worked with the BinaryFormatter - there is no simple switchover. Another consideration is that I wanted to see if it was possible to migrate my code over to supporting .NET Core while still making it compatible with any existing installation, such that it could still deserialise any disk-cached data that had been persisted in the past (the chances of this being a realistic use case are exceedingly slim - I doubt that anyone but me has used the Full Text Indexer - I just wanted to see if it seemed feasible).

For the sake of this post, I'm going to cheat a little. While I have come up with a way to serialise index data that works with netstandard, it would probably best be covered another day (and it isn't compatible with historical data, unfortunately). A good-enough-for-now approach was to use "conditional directives", which are basically a way to say "if you're building in this configuration then include this code (and if you're not, then don't)". This allowed me to restore all of the [Serializable] attributes that I removed earlier - but only when building for .NET 4.5.2 (and not for .NET Core). For example:

#if NET452
    [Serializable]
#endif
    public class Whatever
    {
        // ..
    }

The [Serializable] attribute will be included in the binaries for .NET 4.5.2 and not for .NET Core.

You need to be careful with precisely what conditions you specify, though. When I first tried this, I used the line "#if net452" (where the string "net452" is consistent with the framework target string in the project.json files) but the attribute wasn't being included in the .NET 4.5.2 builds. There was no error reported, it just wasn't getting included. I had to look up the supported values to see if I'd made a silly mistake and it was the casing - it needs to be "NET452" and not "net452".

I used the same approach to restore the ISerializable implementations that some classes had, and to conditionally compile the entirety of the IndexDataSerialiser (which I got back out of my source control history, having deleted it earlier).
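The ISerializable cases follow the same pattern, just with more inside the conditional regions - something along these lines (a simplified illustration with made-up names, not one of the real classes, which have more state to record):

```csharp
using System;
#if NET452
using System.Runtime.Serialization;
#endif

// A simplified illustration - the real classes have more state to record
#if NET452
[Serializable]
#endif
public class WeightedEntry
#if NET452
    : ISerializable
#endif
{
    public WeightedEntry(float weight)
    {
        Weight = weight;
    }

    public float Weight { get; }

#if NET452
    // Only compiled into the .NET 4.5.2 build - the deserialisation constructor..
    protected WeightedEntry(SerializationInfo info, StreamingContext context)
    {
        Weight = info.GetSingle("Weight");
    }

    // ..and the custom serialisation support that writes the state back out
    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        info.AddValue("Weight", Weight);
    }
#endif
}
```
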

This meant that if the "FullTextIndexer" package is added to a project building for the "classic" .NET framework then all of the serialisation options that were previously available will still be present - any disk-cached data may be read back using the IndexDataSerialiser. This isn't possible if the package is added to a .NET Core project but the compromise felt much better than nothing.

Final tweaks and parting thoughts

The migration is almost complete at this point. There's one minor thing I've noticed while experimenting with .NET Core projects; if a new solution is created whose first project is a .NET Core class library or console application, the project files aren't put into the root of the solution - instead, they are in a "src" folder. Also, there is a "global.json" file in the solution root that enables.. magic special things. If I'm being honest, I haven't quite wrapped my head around all of the potential benefits of global.json (though there is an explanation of one of the benefits here; The power of the global.json). What I'm getting around to saying is that I want my now-.NET-Core solution to look like a "native" .NET Core solution, so I tweaked the folder structure and the .sln file to be consistent with a solution that had been .NET Core from the start. I'm a fan of consistency and I think it makes sense to have my .NET Core solution follow the same arrangement as everyone else's .NET Core solutions.

Having gone through this whole process, I think that there's an important question to answer: Will I now switch to defaulting to supporting .NET Core for all new projects?

.. and the answer is, today, if I'm being honest.. no.

There are just a few too many rough edges and question marks. The biggest one is the change that's going to happen away from "project.json" and to a variation of the ".csproj" format. I'm sure that there will be some sort of simple migration tool but I'd rather know for sure what the implications are going to be around this change before I commit too much to .NET Core.

I'm also a bit annoyed that the Productivity Power Tools remove-and-sort-usings-on-save doesn't work with .NET Core projects (there's an issue on GitHub about this but it hasn't been responded to since August 2016, so I'm not sure if it will get fixed).

Finally, I'm sure I read an issue about analysers included in NuGet packages for .NET Core - that they weren't getting loaded correctly. I can't find the issue now, so I've done some tests to try to confirm or deny the rumour.. I've got a very simple project that includes an analyser and whose package targets both .NET 4.5 and netstandard1.6, and the analyser does seem to install correctly and be included in the build process (see ProductiveRage.SealedClassVerification) but I still have a few concerns; in .csproj files, analysers are all explicitly referenced (and may be enabled or disabled in the Solution Explorer by going into References/Analyzers) but I can't see how they're referenced in .NET Core projects (and they don't appear in the Solution Explorer). Another (minor) thing is that, while the analyser does get executed and any warnings are displayed in the Error List in Visual Studio, there are no squigglies underlining the offending code. I don't know why that is and it makes me worry that the integration is perhaps a bit flakey. I'm a big fan of analysers and so I want to be sure that they are fully supported*. Maybe this will get tidied up when the new project format comes about.. whenever that will be.

* (Update: Having since added a code fix to the "SealedClassVerification" analyser, I've realised that the no-squigglies-in-editor problem is worse than I first thought - it means that the lightbulb for the code fix does not appear in the editor and so the code fix cannot be used. I also found the GitHub issue that I mentioned: "Analyzers fail on .NET Core projects" - it says that improvements are on the way "in .NET Core 1.1", which should be released sometime this year.. maybe that will improve things)

I think that things are close (and I like that Microsoft is making this all available early on and accepting feedback on it) but I don't think that it's quite ready enough for me to switch to it full time yet.

Finally, should you be curious at all about the Full Text Indexer project that I've been talking about, the source code is available here: bitbucket.org/DanRoberts/full-text-indexer and there are a range of old posts that I wrote about how it works (see "The Full Text Indexer Post Round-up").

Posted at 13:38

Comments

JavaScript Compression (Putting my JSON Search Indexes on a diet)

When I wrote a couple of weeks ago about writing a way to consume Full Text Indexer data entirely on the client (so that I could recreate my blog's search functionality at productiverage.neocities.org - see The Full Text Indexer goes client-side!), I said that

Yes, the search index files are bigger than I would have liked..

This may have been somewhat of an understatement.

As a little refresher; there is one JSON file responsible for matching posts to search terms ("oh, I see you searched for 'cats'! The posts that you want are, in descending order of match quality: 26, 28, 58, 36, 27, 53, 24"). Then there is a JSON file per-post for all of the source location mappings. These describe precisely where in the posts the tokens may be matched. There's quite a lot of data to consider and it's in a fairly naive JSON structure (I used single letter property names but that was the only real attempt at size-constraining*). On top of this, there's a plain-text representation of each post so that the source location mappings can be used to generate a best match "content excerpt" for each result (à la Google's results, which highlight matched terms). And finally a JSON file for all of the titles of the posts so that the titles can be displayed when the search logic knows which posts match the query but the content excerpts generation is still being worked on in the background.

This results in 5.8MB of JSON and plain text data.

Now, in fairness, this is all gzip'd when it comes down the wire and this sort of data compresses really well. And the detail files for the posts are only requested when the corresponding posts are identified as being results for the current search. So in terms of what the user has to download, it's no big deal.

However, the hosting at neocities.org only offers 10MB at the moment so 5.8MB solely for searching seems a little excessive. To put it into perspective, this is several times more than the html required to render all of the actual posts!

I hadn't actually realised just how much of a problem it was until I published my last post. When I came to re-generate the flattened content to upload to NeoCities, that last post (being something of a beast) pushed the storage requirements past the magic 10MB mark!

* (Very minor diversion: There was actually one potential optimisation that I did consider when first serialising the index data for use on the client. Instead of a simple associative array, I wondered if using a Ternary Search Tree would reduce the space requirements if a lot of the keys had similarities. I used a TST internally in the C# project for performance reasons and talked about it in The .Net Dictionary is FAST!. Alas, when I tried it as a structure for my data, it resulted in slightly larger files).

So. Compression, yeah? In JavaScript??

The biggest contributors to size are the per-post search index files that contain the source locations data. Each is JSON data describing an associative array matching a "normalised" search term (to take into account plurality, case-insensitivity, etc) to a post key, weight and array of source locations. Each source location has the fields FieldIndex, TokenIndex, ContentIndex and ContentLength (read all about it at The Full Text Indexer: Source Locations).
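To make that concrete, here's a hypothetical sketch (with full, illustrative property names - remember that the real files use single-letter names to try to keep the size down) of the shape of that per-post data and how it gets looked up:

```javascript
// Hypothetical illustration of the per-post index shape described above -
// a normalised search term mapping onto a post key, a match weight and an
// array of source locations (property names are illustrative only)
var perPostIndex = {
  "cat~": {
    Key: 26,      // the post's id
    Weight: 3.5,  // the match weight
    SourceLocations: [
      // FieldIndex: which indexed field (eg. title vs content) the match is in
      // TokenIndex: which token within that field matched
      // ContentIndex / ContentLength: the matched substring of the source content
      { FieldIndex: 1, TokenIndex: 12, ContentIndex: 70, ContentLength: 4 }
    ]
  }
};

// Retrieving match data is then just an associative array lookup
function getMatch(index, normalisedTerm) {
  return index[normalisedTerm] || null;
}
```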

I know that this data can be compressed since, not only does textual or JSON data frequently lend itself to being effectively compressed, but I can see the compression that gzip achieves when the data is delivered to the browser.

My first thought when it comes to compression is that we're dealing with binary data. Certainly, when I see gzip'd responses in Fiddler (before using the "Response is encoded.. Click here to transform" link) it looks like gobbledygook, not like any text I know!

This reminded me of something I read some time ago about a guy creating a PNG file where the pixels are generated from bytes extracted from textual content. This PNG can be read by javascript and the pixels extracted again, and from the pixel values the textual source content can be recreated. The real benefit with PNG here is that it incorporates a lossless compression scheme. Lossless compression is really important here - it means that the decompressed content will be identical to the source content. JPEG is a lossy scheme and can achieve higher compression rates if the quality is reduced sufficiently. But it loses information when it does this. The idea is that the resulting image is either acceptably close visually to the source image or that the discrepancy is evident but thought to be a worthwhile trade-off considering the file size benefits. If we're trying to extract text data that represents javascript then "close-ish, maybe" is not going to cut it!

The original article that I'd remembered was this: Compression using Canvas and PNG-embedded data. The results sound impressive, he got an older version of jQuery (1.2.3) compressed down from 53kb to 17kb. Bear in mind that this is already the minified version of the code! It's quite a short article and interesting, so give it a visit (and while you're there notice that the Mario background is interactive and can be played using the cursor keys! :)

The summary of the article, though, is that it's not suitable for mainstream use. Browser support is restricted (all modern browsers now would work, I'm sure, but I don't know how many versions of IE it would work in). And it concludes with "this is meant only as thing of interest and is not something you should use in most any real life applications, where something like gzip will outperform this".

Ok.

Glad I looked it up again, though, since it was interesting stuff.

On the same blog, there's also an article Read EXIF data with Javascript, which talks about retrieving binary data from a file and extracting the data from it. In this case, the content would have to be compressed when written and then decompressed by the javascript loading it. Unlike the PNG approach, compression doesn't come for free. From the article, he's written the file binaryajax.js. Unfortunately, browser support apparently is still incomplete. The original plan outlined works for Chrome and Firefox but then some dirty hacks (including rendering VBScript script tags and executing those functions) are required for IE and, at the time at least, Opera wouldn't work at all.

Again, interesting stuff! But not quite what I want.

Help, Google!

So I had to fall back to asking google about javascript compression and trying not to end up in article after article about how to minify scripts.

In fairness, it didn't take too long at all until a pattern was emerging where a lot of people were talking about LZ compression (info available at the Wikipedia page). And I finally ended up at lz-string: JavaScript compression, fast!

From that page -

lz-string was designed to fulfill the need of storing large amounts of data in localStorage, specifically on mobile devices. localStorage being usually limited to 5MB, all you can compress is that much more data you can store.

What about other libraries?

All I could find was:

- some LZW implementations which gives you back arrays of numbers (terribly inefficient to store as tokens take 64bits) and don't support any character above 255.

- some other LZW implementations which gives you back a string (less terribly inefficient to store but still, all tokens take 16 bits) and don't support any character above 255.

- an LZMA implementation that is asynchronous and very slow - but hey, it's LZMA, not the implementation that is slow.

- a GZip implementation not really meant for browsers but meant for node.js, which weighted 70kb (with deflate.js and crc32.js on which it depends).

This is sounding really good (and his view of the state of the other libraries available reflects my own experiences when Googling around). And he goes on to say

localStorage can only contain JavaScript strings. Strings in JavaScript are stored internally in UTF-16, meaning every character weight 16 bits. I modified the implementation to work with a 16bit-wide token space.

Now, I'm not going to be using it with localStorage but it's gratifying to see that this guy has really understood the environment in which it's going to be used and how best to use that to his advantage.

Preliminary tests went well; I was compressing this, decompressing that, testing this, testing that. It was all going swimmingly! The only problem now was that this was a clearly a custom (and clever) implementation of the algorithm so I wouldn't be able to use anything standard on the C# side to compress the data if I wanted the javascript to be able to decompress it again. And the whole point of all of this is to "flatten" my primary blog and serialise the search index in one process, such that it can be hosted on NeoCities.

The javascript code is fairly concise, so I translated it into C#. When I'd translated C# classes from my Full Text Indexer into javascript equivalents, it had gone surprisingly painlessly. I'd basically just copied the C# code into an empty file and then removed types and tweaked things to work as javascript. So I thought I'd take a punt and try the opposite - just copy the javascript into an empty C# class and then try to fix it up. Adding appropriate types and replacing javascript methods with C# equivalents. This too seemed to go well, I was compressing some strings to text files, pulling them with ajax requests in javascript, decompressing them and viewing them. Success!

Until..

I gave it a string that didn't work. The decompress method returned null. Fail.

Troubleshooting

I figured that there's only so much that could be going wrong. If I compressed the same string with the javascript code then the javascript code could decompress it just fine. The data compressed with the C# version refused to be decompressed by the javascript, though. Chances are I made a mistake in the translation.

I got the shortest reproduce string I could (it's from the titles-of-all-posts JSON that the search facility uses) -

{"1":"I lo

and got both the C# and javascript code to print out a list of character codes that were generated when that string was compressed.
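On the javascript side, dumping those character codes is only a few lines - something along these lines:

```javascript
// Dump the (UTF-16) character codes of a string so that the javascript
// output can be compared side by side with the C# version's output
function getCharacterCodes(str) {
  var codes = [];
  for (var i = 0; i < str.length; i++) {
    codes.push(str.charCodeAt(i));
  }
  return codes;
}
```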

These were identical. So maybe my translation wasn't at fault.

Well something must be getting lost somewhere!

This led me on to wondering if it was somehow the encoding. The compressed content is being served up as UTF8 (basically the standard on the web) but the compression algorithm is intended to compress to UTF16. Now, surely this won't make a difference? It means that the bits sent over the wire (in UTF8) will not be the exact same bits as the javascript string (UTF16) is represented by when it's received, but these encoded bits should be describing the same character codes.

So the next step was to intercept the ajax request that the javascript client was making for the data (compressed by C# code, delivered over the wire with UTF8 encoding) and to write out the character codes at that point.

And there was a discrepancy! The character codes received were not the same that I'd generated by the C# code and that I thought I was transmitting!

Thinking still that this must somehow be encoding-related, I started playing around with the encoding options when writing the data to disk. And noticed, innocently hidden away in an alternate constructor signature, the System.Text.UTF8Encoding class has an option to "throwOnInvalidBytes". What is this?

I knew how UTF8 worked, that it uses a variable number of bytes and uses the most-significant bits to describe how many bytes are required for the current character (the Wikipedia article explains it nicely) and thought that that was pretty much all there was to it. So how could a byte be invalid?? Well, with this constructor argument set to true, I was getting the error

Unable to translate Unicode character \uD900 at index 6 to specified code page.

so clearly particular bytes can be invalid somehow..
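As a refresher, that leading-byte scheme can be sketched in a few lines of javascript (this only inspects the leading byte's most-significant bits, it's not a full validator):

```javascript
// How many bytes a UTF-8 sequence occupies, judging purely by its leading
// byte - a sketch of the scheme described above, not a full validator
function utf8SequenceLength(leadingByte) {
  if ((leadingByte & 0x80) === 0x00) return 1; // 0xxxxxxx - plain ASCII
  if ((leadingByte & 0xE0) === 0xC0) return 2; // 110xxxxx
  if ((leadingByte & 0xF0) === 0xE0) return 3; // 1110xxxx
  if ((leadingByte & 0xF8) === 0xF0) return 4; // 11110xxx
  return null; // a continuation byte (10xxxxxx) or an invalid leading byte
}
```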

UTF Limitations

With this error, it didn't actually take much searching. There's a link on www.unicode.com; Are there any 16-bit values that are invalid? that states that

Unpaired surrogates are invalid in UTFs. These include any value in the range D800₁₆ to DBFF₁₆ not followed by a value in the range DC00₁₆ to DFFF₁₆, or any value in the range DC00₁₆ to DFFF₁₆ not preceded by a value in the range D800₁₆ to DBFF₁₆
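That rule is easy enough to express in javascript - this sketch scans a string's UTF-16 character codes looking for exactly the unpaired surrogates described in the quote (which is precisely what tripped up my "\uD900" data):

```javascript
// Scan a javascript (UTF-16) string for unpaired surrogates - values in the
// high surrogate range (D800-DBFF) not followed by a low surrogate (DC00-DFFF),
// or low surrogates not preceded by a high surrogate
function hasUnpairedSurrogate(str) {
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    if (code >= 0xD800 && code <= 0xDBFF) {
      // High surrogate: only valid if immediately followed by a low surrogate
      var next = (i + 1 < str.length) ? str.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // a valid pair, skip over the low surrogate
      } else {
        return true;
      }
    } else if (code >= 0xDC00 && code <= 0xDFFF) {
      return true; // a low surrogate with no preceding high surrogate
    }
  }
  return false;
}
```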

I spent a little while wandering through various searches on the internet trying to decide what the best way would be to try to address this. I didn't want to have to try to find another compressor for all of the reasons that the author of the one I'm using outlined! Which made me think, maybe there's more information about this on his site.

Lo and behold, in the changelog (at the bottom of that page), there's mention that there's a v1.3.0 available that has additional methods compressToUTF16 and decompressToUTF16 ("version 1.3.0 now stable") that "allow lz-string to produce something that you can store in localStorage on IE and Firefox".

These new methods wrap the methods "compress" and "decompress". But the "compress" and "decompress" methods in this new version of the code look different to those in the code that I had been using (and had translated). But it's no big deal to translate the newer version (and the new methods).

And now it works! I wish that this version had been the file you see when you go to the main lz-string GitHub page rather than being hidden in the "libs" folder. But considering how happy I am that the solution to my problem has been provided to me with little-to-no-effort, I'm really not going to complain! :)

Incorporating it into the solution

Step 1 was to alter my Blog-to-NeoCities Transformer to write compressed versions of the per-post source location mappings data, along with compressed versions of the plain text post data and the JSON that has the titles for each post.

The C# translation of the LZString code can be seen at: LZStringCompress.cs.

Step 2 was to alter the javascript search code to handle the compressed content. Having included the 1.3.0 version of LZString.js, I needed to replace some of the $.ajax requests with calls to one of

function LoadCompressedData(strUrl, fncSuccess, fncFailure) {
  // Note: I've seen this fail when requesting files with extension ".js" but work when the exact
  // same file is renamed to ".txt", I'm not sure if this is in IIS that it's failing or if jQuery
  // is requesting the data in a slightly different manner (despite the explicit dataType option),
  // so best just ensuring that all LZ-compressed data is stored in a file with a ".txt" extension.
  $.ajax({
    url: strUrl,
    dataType: "text",
    success: function(strCompressedContent) {
      var strContent;
      try {
        strContent = LZString.DecompressFromUTF16(strCompressedContent);
      }
      catch(e) {
        if (fncFailure) {
          fncFailure(e);
        }
        return;
      }
      fncSuccess(strContent);
    }
  });
}

function LoadCompressedJsonData(strUrl, fncSuccess, fncFailure) {
  LoadCompressedData(
    strUrl,
    function(strContent) {
      var objData;
      try {
        eval("objData = " + strContent);
      }
      catch(e) {
        if (fncFailure) {
          fncFailure(e);
        }
        return;
      }
      fncSuccess(objData);
    },
    function() {
      if (fncFailure) {
        fncFailure(arguments);
      }
    }
  );
}

Step 3 is.. there is no step 3! Everything was now working but taking up less space on the hosting.

When compressed, the detailed source location mappings data is reduced from a combined 4.7MB to 1.5MB. The plain text content was reduced from 664kb to 490kb (not as much of a saving as I'd expected, to be honest). The titles JSON shrank marginally from 2.58kb to 2.36kb. The summary data JSON wasn't compressed so that the first stage of the search could be performed as quickly as possible and one non-compressed file on the server was no problem (it's still gzip'd when passed to the client, though, so there's no bandwidth cost). In total, 5.3MB of data was condensed into requiring less than 2MB on the server. Which I am happily marking down as a win :)

So here's to me hopefully fitting many more posts (along with all of the related javascript searching data) into the NeoCities hosting limitations! I'm sure that if I ever start pushing that 10MB point again, by then the 10MB limit will have been raised - 20MB is already on the way!

Posted at 23:22

Comments

The NeoCities Challenge! aka The Full Text Indexer goes client-side!

When I heard about NeoCities last week, I thought it was a cool idea - offering some back-to-basics hosting for an outlay of absolutely zero. Yeah, the first thing that came to mind was those geocities sites that seemed to want to combine pink text, lime green backgrounds and javascript star effects that chase the cursor.. but that's just nostalgia! :)

The more I thought about it, the more I sort of wondered whether I couldn't host this blog there. I'm happy with my hosting, it's cheap and fast, but I just liked the idea of creating something simple with the raw ingredients of html, css and javascript alone. I mean, it should be easy enough to flatten the site so that all of the pages are single html files, with a different file per month, per Post, per tag.. the biggest stumbling block was the site search which is powered by my Full Text Indexer project. Obviously that won't work when everything is on the client, when there is no server backend to power it. Unless..

.. a client-side Full Text Indexer Search Service.. ??

Challenge accepted!! :D

NeoCities Hosting

Before I get carried away, I'll breeze through the NeoCities basics and how I got most of my site running at NeoCities. I initially thought about changing the code that runs my site server-side to emit a flattened version for upload. But that started to seem more and more complicated the more I thought about it, and seemed like maintaining it could be a headache if I changed the blog over time.

So instead, I figured why not treat the published blog (published at www.productiverage.com) as a generic site and crawl it, grab content, generate new urls for a flattened structure, replace urls in the existing content with the new urls and publish it like that. How hard could it be??

Since my site is almost entirely html and css (well, LESS with some rules and structure, but it compiles down to css so why be fussy) with only a smattering of javascript, this should be easy. I can use the Html Agility Pack to parse and alter html and I very conveniently have a CSS Parser that can be used to update urls (like backgrounds) in the stylesheets.

And, in a nutshell, that's what I've done. The first version of this NeoCities-hosted blog hid the site search functionality and ran from pure html and css. I had all of the files ready to go in a local IIS site as a proof of concept. Then I went to upload them.. With the Posts and the Archive and Tags pages, there were 120-something files. I'd only used the uploader on the NeoCities page to upload a single test file up to this point and so hadn't realised that that's all it would let you do; upload a single file at a time. Even if I was willing to upload all of the files individually now (which, I must admit, I did; I was feeling overexcitable about getting the first version public! :) this wouldn't scale, I couldn't (wouldn't) do it every time I changed something - a single new Post could invalidate many Pages, considering the links between Posts, the Monthly Archive pages, the Tags pages, etc..

I spent a few minutes googling to see if anyone else could offer a sensible solution and came up empty.

So Plan B was to have a look with Fiddler to see what the upload traffic looked like. It seemed fairly straight-forward, an authorisation cookie (to identify who I was logged in as) and a "csrf_token" form value, along with the uploaded file's content. I was familiar with the phrase "Cross Site Request Forgery" (from "csrf_token") but didn't really know what it meant in this context. I thought I'd take a punt and try manipulating the request to see if that token had to vary between uploads and it didn't seem to (Fiddler lets you take a request, edit it and broadcast it - so uploading a text file provided an easy to mess-with request, I could change one character of the file and leave everything else, content-length included, the same and refresh the browser to see if the new content had arrived).

This was enough to use the .net WebRequest class and upload files individually. I wrote something to loop over the files in my local version of the site and upload them all.. and it worked! There were some stumbling blocks with getting the cookie sent and specifying the form value and the file to upload but StackOverflow came to the rescue each time. There is one outstanding issue that each upload request received a 500 Server Error response even though it was successful, but I chose to ignore that for now - yes, this approach is rough around the edges but it's functional for now!

In case this is useful to anyone else, I've made the code available at Bitbucket: The BlogToNeocitiesTransformer.

If you plug in the auth cookie and csrf token values into the (C#) console application (obtaining those values by looking at a manual upload request in Fiddler still seems like the only way right now) then you can use it yourself, should you have need to. That app actually does the whole thing; downloads a site's content, generates a flattened version (rewriting the html and css, ensuring the urls follow NeoCities' filename restrictions) and then uploads it all to a NeoCities account.

Temporary Measures

Thankfully, this file upload situation looks to only be temporary. Kyle Drake (NeoCities' proud father) has updated the blog today with NeoCities can now handle two million web sites. Here's what we're working on next. In there he says that the file upload process is going to be improved with "things like drag-and-drop file uploading, and then with an API to allow developers to write tools to upload files", which is excellent news! He also addresses the limits on file types that may be uploaded. At the moment "ico" is not in the white list of allowed extensions and so I can't have a favicon for my blog at NeoCities. But this is soon to be replaced with a "black list" (to block executables and the like) so my favicon should soon be possible* :)

* (I half-contemplated, before this announcement, a favicon in another format - such as gif or png - but it seems that this counts out IE and I wanted a solution for all browsers).

Hopefully this black listing approach will allow me to have an RSS feed here as well - essentially that's just an xml file with an xslt to transform the content into a nice-to-view format. And since xslt is just xml I thought that it might work to have an xslt reference in the xml file that has an xml extension itself. But alas, I couldn't get it working, I just kept getting a blank screen in Chrome although "view source" showed the content was present. I'll revisit this when the file restrictions have been changed.

Update (25th July 2013): The NeoCities interface has been updated so that multiple files can be uploaded by dragging-and-dropping! I still can't upload my favicon yet though..

Update (12th August 2014): I'm not sure when, exactly, but favicons are now supported (any .ico file is, in fact). It's a little thing but I do think it adds something to the identity of a site so I'm glad they changed this policy.

Site Search!

This was all well and good, but at this point the site search was missing from the site. This functionality is enabled by server-side code that takes a search string, tries to find matching Posts and then shows the results with Post titles and content excerpts with the matching term(s) highlighted. The matching is done against an index of tokens (possible words) so that the results retrieval can be very fast. The index records where in the source content it matches the token, which enables the excerpt-highlighting. It has support for plurality matching (so it knows that "cat" and "cats" can be considered to be the same word) and has some other flexibility with ignoring letter case, ignoring accents on characters, ignoring certain characters (so "book's" is considered the same as "books") and search term splitting (so "cat posts" matches Posts with "cat" and "posts" in, rather than requiring that a Post contain the exact phrase "cat posts").

But the index is essentially a string-key dictionary onto the match data (the C# code stores it as a ternary search tree for performance but it's still basically just a dictionary). Javascript loves itself some associative arrays (all objects are associative arrays, basically bags of string-named properties) and associative arrays are basically string-key dictionaries. It seemed like the start of an idea!
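In other words, the summary index translates very naturally into a javascript object used as a string-key dictionary - a tiny, hypothetical example of the shape (the real files use single-letter property names):

```javascript
// A hypothetical summary index: normalised token onto an array of post Keys
// and match Weights (property names are illustrative, the real serialised
// data uses single-letter names to keep the size down)
var summaryIndex = {
  "cat~":  [ { Key: 26, Weight: 4.1 }, { Key: 28, Weight: 2.0 } ],
  "post~": [ { Key: 26, Weight: 1.5 } ]
};

// Lookups are just property access, which is what makes this approach fast
function getMatchesForToken(index, normalisedToken) {
  return index.hasOwnProperty(normalisedToken) ? index[normalisedToken] : [];
}
```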

If the index was still generated server-side (as it has to be for my "primary" blog hosting) then I should be able to represent this data as something that javascript could interpret and then perform the searching itself on the client..

The plan:

  1. Get the generated IIndexData<int> from my server-side blog (it's an int since the Posts all have a unique int "Id" property)
  2. Use the GetAllTokens and GetMatches methods to loop through each token in the data and generate JSON to represent the index, storing at this point only the Post Keys and their match Weight so that searching is possible
  3. Do something similar to get tokens, Keys, Weights and Source Locations (where in the source content the matches were identified) for each Post, resulting in detailed match data for each individual Post
  4. Get plain text content for each Post from the server-side data (the Posts are stored as Markdown in my source and translated into html for rendering and plain text for searching)
  5. Write javascript classes that can use this data to perform a search and then render the results with matched terms highlighted, just like the server-side code can do
  6. Profit!

(This is actually a slightly amended plan from the original, I tried first to generate a single JSON file with the detailed content for all Posts but it was over 4 meg uncompressed and still bigger than I wanted when delivered from NeoCities gzip'd. So I went for the file-with-summary-data-for-all-Posts and then separate files for detailed data for individual Posts).

I used JSON.Net for the serialisation (it's just the go-to for this sort of thing!) and used intermediary C# classes where each property was only a single character long to try to keep the size of the serialised data down. (There's nothing complicated about this, if more detail is of interest then the code can be found in the JsonSearchIndexDataRecorder class available on Bitbucket).

C# -> Javascript

So now I had a single JSON file for performing a search across all Posts, multiple JSON files for term-highlighting individual Posts and text files containing the content that the source locations mapped onto (also for term-highlighting). I needed to write a javascript ITokenBreaker (to use the parlance of the Full Text Indexer project) to reduce a search term into individual words (eg. "Cat posts" into "Cat" and "posts"), then an IStringNormaliser that will deal with letter casing, pluralisation and all that guff (eg. "Cat" and "posts" into "cat~" and "post~"). Then a way to take each of these words and identify Posts which contain all of the words. And finally a way to use ajax to pull in the detailed match data and plain text content to display the Post excerpts with highlighted search term matches.
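A drastically simplified sketch of those first two pieces - the real normaliser also deals with accents, ignored characters and much more, and the trailing "~" here just stands in for the plurality marker from the example above:

```javascript
// Sketch of an ITokenBreaker equivalent: reduce a search term into words
function breakIntoTokens(searchTerm) {
  return searchTerm.split(/\s+/).filter(function (token) {
    return token.length > 0;
  });
}

// Sketch of an IStringNormaliser equivalent: lower-case the token and trim a
// trailing "s" so that singular and plural forms normalise to the same value
// (illustrative only - the real normaliser is far more thorough)
function normaliseToken(token) {
  var lower = token.toLowerCase();
  if (lower.length > 1 && lower.charAt(lower.length - 1) === "s") {
    lower = lower.substring(0, lower.length - 1);
  }
  return lower + "~";
}
```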

To make things as snappy-feeling as possible, I wanted to display the title of matched Posts first while waiting for the ajax requests to deliver the match highlighting information - and for the content excerpts to be added in later.

The file IndexSearchGenerator.js takes a JSON index file and a search term, breaks the search term into words, "normalises" those words, identifies Posts that contain all of the normalised words and returns an array of Key, Weight and (if it was present in the data) Source Locations. It's only 264 lines of non-minified javascript and a lot of that is the mapping of accented characters to non-accented representations. (The Source Locations will not be present in the all-Posts summary JSON but will be in the per-Post detail JSON).
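The all-words-must-match step at the heart of that can be sketched as an intersection over the per-word match sets (the weight-combining here is an assumption for illustration, summing the weights - the real code may combine them differently):

```javascript
// Given one array of { Key, Weight } matches per normalised word, keep only
// the post Keys that appear in every array and (as an assumed combination
// rule) sum their weights
function intersectMatches(matchSets) {
  if (matchSets.length === 0) return [];
  var combined = {};
  matchSets[0].forEach(function (match) {
    combined[match.Key] = { Key: match.Key, Weight: match.Weight };
  });
  for (var i = 1; i < matchSets.length; i++) {
    var survivors = {};
    matchSets[i].forEach(function (match) {
      if (combined.hasOwnProperty(match.Key)) {
        survivors[match.Key] = {
          Key: match.Key,
          Weight: combined[match.Key].Weight + match.Weight
        };
      }
    });
    combined = survivors; // posts missing from this word's set are dropped
  }
  return Object.keys(combined).map(function (key) { return combined[key]; });
}
```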

SearchTermHighlighter.js takes the plain text content of a Post, a maximum length for a content-excerpt to show and a set of Source Locations for matched terms and returns a string of html that shows the excerpt that best matches the terms, with those terms highlighted. And that's only 232 lines of non-minified, commented code. What I found particularly interesting with this file was that I was largely able to copy-and-paste the C# code and fudge it into javascript. There were some issues with LINQ's absence. At the start of the GetPlainTextContentAsHighlightedHtml method, for example, I need to get the max "EndIndex" of any of the highlighted terms - I did this by sorting the Source Locations data by the EndIndex property and then taking the EndIndex value of the last element of the array. Easy! The algorithm for highlighting (determining the best excerpt to take and then highlighting any terms, taking care that any overlapped terms don't cause problems) wasn't particularly complicated but it was fairly pleasant to translate it "by rote" and have it work at the end.
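That LINQ-less max-"EndIndex" trick is as simple as it sounds - sort a copy of the array on EndIndex and take the last element:

```javascript
// Get the maximum "EndIndex" of any matched term by sorting the Source
// Locations on EndIndex and taking the last element (a copy is sorted so
// that the caller's array is left untouched)
function getMaxEndIndex(sourceLocations) {
  var sorted = sourceLocations.slice().sort(function (a, b) {
    return a.EndIndex - b.EndIndex;
  });
  return sorted[sorted.length - 1].EndIndex;
}
```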

Finally, SearchPage.js fills in the gaps by determining whether you're on the search page and extracting (and url-decoding) the terms from the Query String if so, performing the initial search against the summary data (displaying the matched titles) and then making the ajax requests for the detailed data and rendering the match excerpts as they become available. Looping through the results, making ajax requests for detail data and handling the results for each Post when it is delivered is a bit like using a complicated asynchronous model in .net but in javascript this sort of async callback madness is par for the course :) If this script decides that you're not on the search page then it makes an ajax request for the summary regardless so that it can be browser-cached and improve the performance of the first search you make (the entirety of the summary data is only 77k gzip'd so it's no big deal if you don't end up actually performing a search).

This last file is only 165 lines of commented, white-spaced javascript so the entire client-side implementation of the search facility is still fairly approachable and maintainable. It's effective (so long as you have javascript enabled!) and - now, bear with me if I'm just being overly impressed with my own creation - it looks cool performing the complicated search mechanics so quickly and fading in the matched excerpts! :)

Signing off

I'm really proud of this and had a lot of fun within the "NeoCities boundaries"! Yes, the source-site-grabbing-and-rewriting could be tidied up a bit. Yes, the file upload is currently a bit of a dodgy workaround. Yes, I still have a handful of TODOs about handling ajax failures in SearchPage.js. Yes, the search index files are bigger than I would have liked (significantly larger than not only the plain text content but also their full page html representations), which I may address by trying out a more complicated format rather than naive JSON (which was very easy). But it all works! And it works well. And the bits that are a bit untidy are only a bit untidy, on the whole they're robust and I'm sufficiently unashamed of them all that the code is all public!

Speaking of which, my blog is primarily hosted at www.productiverage.com* and I write about various projects which are hosted on my Bitbucket account. Among them, the source code for the Blog itself, for the Full Text Indexer which powers the server-hosted blog and which generates the source index data which I've JSON-ified, the CSSParser which I use in the rewriting / site-flattening process, the BlogToNeocitiesTransformer which performs the site-flattening and a few other things that I've blogged about at various times. Ok, self-promotion over! :)

* (If you're reading this at my primary blog address, the NeoCities version with the client-side search can be found at productiverage.neocities.org).

Posted at 20:21

Comments

The Full Text Indexer: Search Term Highlighting with Source Locations

I made reference last time (in The Full Text Indexer: Source Locations) to using Source Location data to implement search term highlighting and since adding this functionality to the Full Text Indexer, I've rewritten the search term highlighting mechanism on my blog to illustrate it.

The idea is that when a search is performed, the results page tries to show a snippet of each matching post with instances of the search term(s) highlighted. The source locations will describe where all of the matches are but it might not be possible to show all of the matches if the length of the summary snippet is restricted to, say, 300 characters. So a mechanism is required that can choose the best snippet to show; this might be the segment of content that contains the greatest number of matches, or the greatest cumulative weight of matches. It also needs to bear in mind that it's possible that none of the source locations will be in the post content itself; it's feasible that a search term could be present in a post's title but not its content at all!

Getting straight to it

The previous system that I had on my blog to perform this task was a bit complicated - it tried to determine all of the places in the content where matches might have occurred (this was before source locations information was included in the Full Text Indexer data) and then generated all permutations of combinations of some or all of these matches in order to decide what was the best segment of the content to extract and display as the summary. The logic wasn't precisely the same as the Full Text Indexer's searching since I didn't want to go crazy on processing the content - so it wouldn't handle matches of plurals, for example. And the LINQ statement that I carefully crafted over a few iterations to generate the permutations of possible matches seemed cryptic when I came back to look at it a few months later.

The new approach is much simpler:

  1. Sort the source location data by index (where the match appears in the content) and then by length of matched token
  2. Loop through the source data and build up chains of adjacent / overlapping matches where all of the matches can fit inside a segment that is no longer than maxLengthForHighlightedContent
    1. Construct each chain by starting with the current source location
    2. Then look ahead to the next source location (if any) and determine whether both it and the current source location fit within the maxLengthForHighlightedContent constraint
    3. Continue for the subsequent source locations (as soon as including one would exceed the maxLengthForHighlightedContent, stop looking - this works since they're sorted by index and length)
  3. Decide which of these source location chains will result in the best summary (if no chains could be constructed then return an empty set instead)
  4. Return a set of segments (index / length pairs) to highlight - no overlapping segments will be returned, any overlapping segments will be combined (this can make the actual highlighting of search terms much simpler)

The "best summary" is determined by an IComparer that considers different sets of source locations. The implementation I use runs through three metrics:

  1. The combined MatchWeightContribution of the source location sets (the greater the better)
  2. If there are chains that the first metric can't differentiate between then consider the number of source locations in the chain (the lower the better, this must mean that the average weight of each match is greater)
  3. If a decision still hasn't been reached for a given comparison then consider the minimum SourceIndex (the lower the better, meaning the matching starts earlier in the content)

I will only be showing a search summary extracted from the post Content field although the search functionality also considers the Title as well as any Tags for the post. The first Content Retriever extracts content from a plain text version of the post's content so all source locations that relate to the Content field will have a SourceFieldIndex value of zero (I touched briefly on this last time but I'll explain in more detail further down in this post too).

So let's see the code! This is one of those pleasant cases where the code flows nicely from the outlined approach. I didn't go into detail above about how the overlap-prevention was handled but the code (hopefully!) illustrates more than adequately -

using System;
using System.Collections.Generic;
using System.Linq;
using FullTextIndexer.Common.Lists;
using FullTextIndexer.Core.Indexes;

namespace BlogBackEnd.FullTextIndexing
{
  public static class SearchTermHighlighter
  {
    public static NonNullImmutableList<StringSegment> IdentifySearchTermsToHighlight(
      string content,
      int maxLengthForHighlightedContent,
      NonNullImmutableList<SourceFieldLocation> sourceLocations,
      IComparer<NonNullImmutableList<SourceFieldLocation>> bestMatchDeterminer)
    {
      if (content == null)
        throw new ArgumentNullException("content");
      if (maxLengthForHighlightedContent <= 0)
      {
        throw new ArgumentOutOfRangeException(
          "maxLengthForHighlightedContent",
          "must be greater than zero"
        );
      }
      if (sourceLocations == null)
        throw new ArgumentNullException("sourceLocations");
      if (sourceLocations.Select(s => s.SourceFieldIndex).Distinct().Count() > 1)
        throw new ArgumentException("All sourceLocations must have the same SourceFieldIndex");
      if (bestMatchDeterminer == null)
        throw new ArgumentNullException("bestMatchDeterminer");

      // If there are no source locations then there is nothing to highlight
      if (!sourceLocations.Any())
        return new NonNullImmutableList<StringSegment>();

      // Sort sourceLocations by index and then length
      sourceLocations = sourceLocations.Sort((x, y) =>
      {
        if (x.SourceIndex < y.SourceIndex)
          return -1;
        else if (y.SourceIndex < x.SourceIndex)
          return 1;

        if (x.SourceTokenLength < y.SourceTokenLength)
          return -1;
        else if (y.SourceTokenLength < x.SourceTokenLength)
          return 1;

        return 0;
      });

      // Identify all combinations of source locations that can be shown at once without exceeding the
      // maxLengthForHighlightedContent constraint
      var sourceLocationChains = new NonNullImmutableList<NonNullImmutableList<SourceFieldLocation>>();
      for (var indexOfFirstSourceLocationInChain = 0;
               indexOfFirstSourceLocationInChain < sourceLocations.Count;
               indexOfFirstSourceLocationInChain++)
      {
        var sourceLocationChain = new NonNullImmutableList<SourceFieldLocation>();
        for (var indexOfLastSourceLocationInChain = indexOfFirstSourceLocationInChain;
                 indexOfLastSourceLocationInChain < sourceLocations.Count;
                 indexOfLastSourceLocationInChain++)
        {
          var startPoint = sourceLocations[indexOfFirstSourceLocationInChain].SourceIndex;
          var endPoint =
            sourceLocations[indexOfLastSourceLocationInChain].SourceIndex +
            sourceLocations[indexOfLastSourceLocationInChain].SourceTokenLength;
          if ((endPoint - startPoint) > maxLengthForHighlightedContent)
            break;

          sourceLocationChain = sourceLocationChain.Add(sourceLocations[indexOfLastSourceLocationInChain]);
          sourceLocationChains = sourceLocationChains.Add(sourceLocationChain);
        }
      }

      // Get the best source location chain, if any (if not, return an empty set) and translate into a
      // StringSegment set
      if (!sourceLocationChains.Any())
        return new NonNullImmutableList<StringSegment>();

      return ToStringSegments(
        sourceLocationChains.Sort(bestMatchDeterminer).First()
      );
    }

    private static NonNullImmutableList<StringSegment> ToStringSegments(
      NonNullImmutableList<SourceFieldLocation> sourceLocations)
    {
      if (sourceLocations == null)
        throw new ArgumentNullException("sourceLocations");
      if (!sourceLocations.Any())
        throw new ArgumentException("must not be empty", "sourceLocations");

      var stringSegments = new NonNullImmutableList<StringSegment>();
      var sourceLocationsToCombine = new NonNullImmutableList<SourceFieldLocation>();
      foreach (var sourceLocation in sourceLocations.Sort((x, y) => x.SourceIndex.CompareTo(y.SourceIndex)))
      {
        // If the current sourceLocation overlaps with the previous one (or adjoins it) then they should
        // be combined together (if there isn't a previous sourceLocation then start a new queue)
        if (!sourceLocationsToCombine.Any()
        || (sourceLocation.SourceIndex
          <= sourceLocationsToCombine.Max(s => (s.SourceIndex + s.SourceTokenLength)))
        )
        {
          sourceLocationsToCombine = sourceLocationsToCombine.Add(sourceLocation);
          continue;
        }

        // If the current sourceLocation marks the start of a new to-highlight segment then add any
        // queued-up sourceLocationsToCombine content to the stringSegments set..
        if (sourceLocationsToCombine.Any())
          stringSegments = stringSegments.Add(new StringSegment(sourceLocationsToCombine));

        // .. and start a new sourceLocationsToCombine list
        sourceLocationsToCombine = new NonNullImmutableList<SourceFieldLocation>(new[] { sourceLocation });
      }
      if (sourceLocationsToCombine.Any())
        stringSegments = stringSegments.Add(new StringSegment(sourceLocationsToCombine));
      return stringSegments;
    }

    public class StringSegment
    {
      public StringSegment(NonNullImmutableList<SourceFieldLocation> sourceLocations)
      {
        if (sourceLocations == null)
          throw new ArgumentNullException("sourceLocations");
        if (!sourceLocations.Any())
          throw new ArgumentException("must not be empty", "sourceLocations");
        if (sourceLocations.Select(s => s.SourceFieldIndex).Distinct().Count() > 1)
          throw new ArgumentException("All sourceLocations must have the same SourceFieldIndex");

        Index = sourceLocations.Min(s => s.SourceIndex);
        Length = sourceLocations.Max(s => (s.SourceIndex + s.SourceTokenLength) - Index);
        SourceLocations = sourceLocations;
      }

      public int Index { get; private set; }
      public int Length { get; private set; }
      public NonNullImmutableList<SourceFieldLocation> SourceLocations { get; private set; }
    }
  }
}

The overlap-prevention is important for my application since I want to be able to take arbitrary segments of the content and wrap them in <strong> tags so that they can appear highlighted - if there are segments that overlap then this isn't going to result in valid html!
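Since the returned StringSegments are guaranteed not to overlap, applying the highlighting is straightforward. The sketch below is my own illustration (the Highlight method and its tuple-based segments are invented for the example, they're not part of the library) of wrapping each segment in <strong> tags - working backwards through the segments means that inserting tags never shifts the indexes of segments still to be processed:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class HighlightExample
{
  // Wrap each (index, length) segment of the content in <strong> tags. The
  // segments are assumed to be non-overlapping, which is precisely what
  // IdentifySearchTermsToHighlight guarantees - overlapping segments would
  // produce invalid html here
  public static string Highlight(string content, IEnumerable<Tuple<int, int>> segments)
  {
    if (content == null)
      throw new ArgumentNullException("content");
    if (segments == null)
      throw new ArgumentNullException("segments");

    var builder = new StringBuilder(content);
    // Process segments from last to first so that inserting tags doesn't
    // shift the indexes of the segments that haven't been processed yet
    foreach (var segment in segments.OrderByDescending(s => s.Item1))
    {
      builder.Insert(segment.Item1 + segment.Item2, "</strong>");
      builder.Insert(segment.Item1, "<strong>");
    }
    return builder.ToString();
  }
}
```

Calling Highlight("penguins are the best", new[] { Tuple.Create(0, 8) }) returns "<strong>penguins</strong> are the best".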

The other part of the puzzle is the "best match determiner". This also follows very closely the approach outlined:

public class BlogSearchTermBestMatchComparer : IComparer<NonNullImmutableList<SourceFieldLocation>>
{
  public int Compare(
    NonNullImmutableList<SourceFieldLocation> x,
    NonNullImmutableList<SourceFieldLocation> y)
  {
    if (x == null)
      throw new ArgumentNullException("x");
    if (y == null)
      throw new ArgumentNullException("y");

    var combinedWeightComparisonResult =
      y.Sum(s => s.MatchWeightContribution)
      .CompareTo(x.Sum(s => s.MatchWeightContribution));
    if (combinedWeightComparisonResult != 0)
      return combinedWeightComparisonResult;

    var numberOfTokensComparisonResult = x.Count.CompareTo(y.Count);
    if (numberOfTokensComparisonResult != 0)
      return numberOfTokensComparisonResult;

    return x.Min(s => s.SourceIndex).CompareTo(y.Min(s => s.SourceIndex));
  }
}

Ok, there's actually one more thing. Since I currently use the GetPartialMatches method to deal with multi-word searches on my blog, I have a NonNullImmutableList<SourceFieldLocationWithTerm> rather than a NonNullImmutableList<SourceFieldLocation> so I have this alternate method signature:

public static NonNullImmutableList<StringSegment> IdentifySearchTermsToHighlight(
  string content,
  int maxLengthForHighlightedContent,
  NonNullImmutableList<IndexData_Extensions_PartialMatches.SourceFieldLocationWithTerm> sourceLocations,
  IComparer<NonNullImmutableList<SourceFieldLocation>> bestMatchDeterminer)
{
  if (sourceLocations == null)
    throw new ArgumentNullException("sourceLocations");

  return IdentifySearchTermsToHighlight(
    content,
    maxLengthForHighlightedContent,
    sourceLocations
      .Select(s => new SourceFieldLocation(
        s.SourceFieldIndex,
        s.TokenIndex,
        s.SourceIndex,
        s.SourceTokenLength,
        s.MatchWeightContribution
      ))
      .ToNonNullImmutableList(),
    bestMatchDeterminer
  );
}

It would be nice if covariance were supported for classes in C#, rather than for interfaces only, as then this method signature and the extra wrapping would not be required. I've contemplated changing my code such that the NonNullImmutableList implements an INonNullImmutableList and supporting covariance on that interface, but I'm a bit uncomfortable that implementations of INonNullImmutableList could then be provided that actually aren't immutable. Having the interface specify NonNullImmutableList (which inherits from ImmutableList) means that the list provided absolutely is immutable. Unfortunately this leaves us without covariance support. (This reminds me of this post I read: Immutability and ReadOnlyCollection<T>).
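To illustrate the language rule that's getting in the way: C# allows covariant type parameters (declared with the "out" keyword) on interfaces and delegates only, never on classes. The types below are invented purely for this illustration, they're not part of the Full Text Indexer:

```csharp
using System.Collections.Generic;

// Interfaces may declare covariant type parameters with the "out" keyword..
public interface IReadOnlyValues<out T>
{
  IEnumerable<T> Values { get; }
}

// .. but classes may not - "public class ReadOnlyValues<out T>" would not
// compile, which is why the extra wrapping in the alternate method signature
// above is required when working directly with a concrete list class
public class ReadOnlyValues<T> : IReadOnlyValues<T>
{
  private readonly List<T> _values;
  public ReadOnlyValues(IEnumerable<T> values)
  {
    _values = new List<T>(values);
  }
  public IEnumerable<T> Values { get { return _values; } }
}
```

With the interface, an IReadOnlyValues<string> may be assigned to an IReadOnlyValues<object> reference - but nothing stops someone providing a mutable implementation of the interface, which is the trade-off described above.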

SourceFieldIndex Values

When the index is generated from the source data, Content Retrievers are specified which are responsible for returning strings of content. This was covered in the first post I wrote for this project: The Full Text Indexer. Originally, each Content Retriever would return zero or one strings but since then the functionality has been expanded to return a NonNullOrEmptyStringList and so zero, one or multiple strings of content may be extracted by a single retriever.

To return to my blog as an example, each post has a Title, Content and Tags (there may be zero, one or multiple Tags). The first Content Retriever extracts a single string from the Content property (it has to do some manipulation since internally it is Markdown which is converted into html, then I extract plain text content from that). The second Content Retriever takes the Title property for a post and the third Content Retriever takes a string for each Tag. This means that any given post will generate at least two content strings, depending upon how many Tags there are.

Each source location associated with a matched search term has a SourceFieldIndex value. For results from searching my blog posts, I know that any source location with SourceFieldIndex zero comes from the post's Content, any source location with SourceFieldIndex one comes from the Title and any with a SourceFieldIndex greater than one must relate to a Tag. So when I want to consider source locations to highlight matches within a segment of the post Content, I consider only those with a SourceFieldIndex of zero.
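As a trivial sketch of that mapping (the DescribeField helper is my own illustration, not part of the library - it just encodes the retriever registration order described above):

```csharp
using System;

public static class SourceFieldExample
{
  // For my blog posts, the Content Retrievers are registered in the order
  // Content, then Title, then one string per Tag - so the SourceFieldIndex
  // of a match tells us which field it came from
  public static string DescribeField(int sourceFieldIndex)
  {
    if (sourceFieldIndex < 0)
      throw new ArgumentOutOfRangeException("sourceFieldIndex");
    if (sourceFieldIndex == 0)
      return "Content";
    if (sourceFieldIndex == 1)
      return "Title";
    return "Tag";
  }
}
```

Only source locations whose field is "Content" get passed to the search term highlighter.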

If you wish to use the AutomatedIndexGeneratorFactoryBuilder (see The Full Text Indexer - Automating Index Generation) to configure an index generator (since that makes it really easy!) there is a method SetPropertyForFirstContentRetriever which enables a particular property to be specified as that which the first Content Retriever will extract content from. This allows this sort of functionality to be layered on top.

For more information on this project, see the Full Text Indexer Round-up.

Posted at 20:40

Comments

The Full Text Indexer: Source Locations

After adding the Structured Queries functionality to my Full Text Indexer project I've been looking back at the mechanism for matching runs of tokens - eg. matching

"penguins are the best"

not just to results that contain the words "penguins", "are", "the", "best" but to results that contain them together, as a run of consecutive tokens.

I'd previously addressed this functionality with the ConsecutiveTokenCombiningTokenBreaker - this can wrap another token breaker so that during Index generation the Index will be populated with tokens that are not just individual words but also runs of words strung back together. (There's more details in the Token Breaker and String Normaliser variations post).

There are some issues that I've encountered with this when I've used it with real data, however. Firstly, the Index generation time expands greatly since so much more work is done in terms of generating the tokens and also building the Index with all of this additional token data. Secondly, all of this additional data takes up a lot more space (whether persisting the Index to disk or just maintaining it in memory). An Index generated with the use of a ConsecutiveTokenCombiningTokenBreaker will likely be several times larger, feasibly ten times as large. And finally, the token breaker takes a constructor argument "maxNumberOfTokens" which caps how many tokens will be strung together in any given run. This puts a limit on the length of input search strings, based on the number of tokens it would be broken down into ("penguins are the best" would be a run of four words. If a maxNumberOfTokens value of three was specified, then the string couldn't be matched in any content).

Source Locations

Something I've been thinking about adding is "Source Location" information to the match data. I believe that Lucene can be configured to record where in the source content a particular token was extracted from, which can be used for search term highlighting. I've implemented search term highlighting on my blog but that tries to match search terms to content after the Index has identified which posts match the search. And it doesn't use the same string normaliser as the Index so it doesn't realise that "cat" and "cats" will be considered the same by the Index.

So in the back of my mind I've thought about adding this source location data to token matches so that I could use it to implement more consistent search term highlighting (consistent in that the same token matches identified by the Index will be considered by the search term highlighter).

But it struck me that I should be able to use the same data to search for consecutive runs of token matches after the Index has been generated, rather than requiring additional processing to generate the Index in the first place.

If all of the string data for a source data entry was extracted out into one long string then each "Source Location" instance would need a start index and a length for the segment of that string that was extracted for a particular token. However, this isn't how the string data is extracted for data types that have multiple properties to extract from; each property is considered a separate field. So the source location would require a field index as well as the content start index and length. (If the source data type represents articles, for example, then different fields may be Title, Description, Author, etc..).

If, in addition to this, we record the "token index" for each source location then we would have the data required to identify consecutive runs. If a source data instance had a single text property with the content

"penguins are the best, penguins!"

this could be extracted into source locations with

{ 0, 0, 0,  8 }, // FieldIndex, TokenIndex, ContentIndex, ContentLength
{ 0, 1, 9,  3 }, // FieldIndex, TokenIndex, ContentIndex, ContentLength
{ 0, 2, 13, 3 }, // FieldIndex, TokenIndex, ContentIndex, ContentLength
{ 0, 3, 17, 4 }, // FieldIndex, TokenIndex, ContentIndex, ContentLength
{ 0, 4, 23, 8 }  // FieldIndex, TokenIndex, ContentIndex, ContentLength

(They would all have FieldIndex zero since there is only a single field to extract from).

The search for "penguins are the best" could be performed by searching for each of the four words and then analysing the match data and its source locations to only consider token matches that are arranged in the content as part of a consecutive run. The second instance of "penguins" could be ignored as there is no match for the word "are" that has the same FieldIndex but a TokenIndex one greater.
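The FieldIndex / TokenIndex check described above can be sketched as follows. This is a simplified illustration of the idea, not the real GetConsecutiveMatches implementation - the Location class and GetRunStartLocations method are invented for the example:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ConsecutiveRunExample
{
  // A simplified stand-in for the real source location data - just the
  // FieldIndex and TokenIndex values that the run-detection logic relies upon
  public sealed class Location
  {
    public Location(int fieldIndex, int tokenIndex)
    {
      FieldIndex = fieldIndex;
      TokenIndex = tokenIndex;
    }
    public int FieldIndex { get; private set; }
    public int TokenIndex { get; private set; }
  }

  // matchesPerSearchWord[i] contains the locations at which the i-th word of
  // the search term was matched. A location of the first word starts a valid
  // run only if every following word has a match in the same field with a
  // TokenIndex exactly one greater than the previous word's
  public static IEnumerable<Location> GetRunStartLocations(
    IList<IEnumerable<Location>> matchesPerSearchWord)
  {
    if ((matchesPerSearchWord == null) || (matchesPerSearchWord.Count == 0))
      throw new ArgumentException("must not be null or empty", "matchesPerSearchWord");

    foreach (var start in matchesPerSearchWord[0])
    {
      var isRun = true;
      for (var offset = 1; offset < matchesPerSearchWord.Count; offset++)
      {
        var requiredTokenIndex = start.TokenIndex + offset;
        if (!matchesPerSearchWord[offset].Any(
          l => (l.FieldIndex == start.FieldIndex) && (l.TokenIndex == requiredTokenIndex)))
        {
          isRun = false;
          break;
        }
      }
      if (isRun)
        yield return start;
    }
  }
}
```

Feeding in the match data from the "penguins are the best, penguins!" example returns only the first "penguins" location (TokenIndex zero) - the second instance is rejected because there is no "are" match at TokenIndex five.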

This logic is incorporated into the new "GetConsecutiveMatches" extension method. Its signature is similar to "GetPartialMatches" - it takes a search term which is expected to be multiple tokens according to the token breaker which must also be provided. It then requires two weight combiners where GetPartialMatches only requires one.

// There are alternate signatures that take fewer arguments in favour of sensible defaults
public static NonNullImmutableList<WeightedEntry<TKey>> GetConsecutiveMatches<TKey>(
    this IIndexData<TKey> index,
    string source,
    ITokenBreaker tokenBreaker,
    IndexGenerator.WeightedEntryCombiner weightCombinerForConsecutiveRuns,
    IndexGenerator.WeightedEntryCombiner weightCombinerForFinalMatches
)

GetPartialMatches will combine matches for each of the individual words in the search term, regardless of where they appear in the source content. There is only one combination of match data for any given result. GetConsecutiveMatches has to break down the match data back into individual occurrences in the source data because some occurrences of a word may be valid for the returned data (if they are part of a consecutive run of search terms) while other occurrences may not be valid (if they aren't part of a consecutive run). In the above example, the word "penguins" appears as a match with two source locations but only the first source location is valid as that is the only one that is part of a consecutive run of tokens that match "penguins are the best".

GetConsecutiveMatches will identify distinct runs of tokens represented by WeightedEntry instances with a single SourceLocation each. The first weight combiner will be called with these sets of tokens (where each set represents a single run that matches the entire search term) and must return a weight that represents the entire run. This run of tokens will be reduced to a single WeightedEntry instance with a single SourceLocation that spans from the start of the first token in the run to the end of the last one. A reasonable implementation of a weight combiner for this purpose would be one that sums together the weights of each token in the run and then applies a multiplier based on the length of the run (how many tokens are in it); this way, longer token runs are awarded a greater match weight.

The second weight combiner is responsible for determining the final match weight for a result where the run of tokens is identified multiple times. If the source data in the earlier example had other data in which the phrase "penguins are the best" appeared then a single WeightedEntry for that result for the string "penguins are the best" is required, its weight being an aggregate of the weights of the individual matches. This process is exactly the same as that which takes place as part of the Index generation; when a token is found multiple times for the same result a combined weight for that token must be determined. The exact same delegate (the IndexGenerator.WeightedEntryCombiner) is used by the IndexGenerator's constructor and for the weight combiners for GetConsecutiveMatches.

Hurrah for defaults

That's the detail about the source locations data that enabled the GetConsecutiveMatches extension method to be written, and the detail about how to call it where you need to specify all of its behaviour. But following the convenience of the AutomatedIndexGeneratorFactory (see Automating Index Generation) I've included some method signatures which provide defaults for the weight combiners and the token breaker. So you can get results with the much simpler

var results = index.GetConsecutiveMatches("penguins are the best");

The default token breaker is a WhiteSpaceExtendingTokenBreaker that treats common punctuation characters as whitespace (such as square, round, curly or triangular brackets, commas, full stops, colons and some others). This is the same token breaker that the AutomatedIndexGeneratorFactory will use unless a token break override is specified.

The default weight-combiner-for-consecutive-runs will sum the weights of tokens in the consecutive run and then multiply by two to the power number-of-tokens-minus-one (so x2 if there are two tokens that make up the run, x4 if there are three, x8 if there are four, etc..). The default weight-combiner-for-all-of-a-results-consecutive-runs will sum the weights of the tokens (which is the default weight combiner used by the AutomatedIndexGeneratorFactoryBuilder).
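The arithmetic of that default run combiner can be sketched as below (the real delegate is an IndexGenerator.WeightedEntryCombiner operating on WeightedEntry data; this standalone method is just an illustration of the calculation):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DefaultRunWeightExample
{
  // The default consecutive-run combiner described above: sum the individual
  // token weights and then multiply by two to the power of
  // (number of tokens - 1), so longer runs score disproportionately higher
  public static float CombineRunWeights(IEnumerable<float> tokenWeights)
  {
    if (tokenWeights == null)
      throw new ArgumentNullException("tokenWeights");
    var weights = tokenWeights.ToList();
    if (!weights.Any())
      throw new ArgumentException("must not be empty", "tokenWeights");
    return weights.Sum() * (float)Math.Pow(2, weights.Count - 1);
  }
}
```

A four-token run where each token has weight one gets a combined weight of 4 x 8 = 32.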

While I was doing this, I added similar alternate method signatures to GetPartialMatches as well, so now the bare minimum it needs is

var results = index.GetPartialMatches("penguins are the best");

The default token breaker is the same as described above and the default weight combiner is one that sums the weights so long as all of the search terms are present for the result somewhere in its content. Any result that contains the words "penguins", "are" and "the" but not "best" would not be included in the results.

More data but reduced disk space requirements

For my blog, I persist the search index data to disk so that it doesn't need to be rebuilt if the application is reset (it stores a last-modified date alongside the index data which can be compared to the last-modified date of any post, so it's rebuilt when the source data changes rather than when a memory cache entry arbitrarily expires).

I was concerned that this additional source location data would make a significant difference to the size of this stored data, which could be inconvenient because I tend to build it before uploading changes to the web server (so smaller is better). And, to be honest, I had already been somewhat surprised that the data I persist to disk was several megabytes. (Even though that also contains all of the raw Post contents, along with the AutoComplete content extracted from analysing the Posts, it was still larger than my gut instinct suspected it would be). So I didn't want to make it any worse!

I've used the bog standard BinaryFormatter to serialise the data and GZipStream to compress it. To see how much overhead was added by this approach compared to writing a custom serialisation method for the IndexData, I wrote the IndexDataSerialiser. This only works with IndexData (the specific implementation of IIndexData rather than any IIndexData implementation) which means that there are assumptions that can be made (eg. that all of the source locations will be instances of the SourceFieldLocation class and not another class derived from it). And it's reduced the size of the data for the Index that my blog content generates to about 10% of what it was before. Win!

The IndexDataSerialiser is a static class with two methods:

void IndexDataSerialiser.Serialise(IndexData<TKey> source, Stream stream);

IndexData<TKey> IndexDataSerialiser.Deserialise(Stream stream);

It doesn't compress the data at all, so there will be advantages to using a GZipStream. It uses the BinaryWriter to write out the bare minimum content required to describe the data when serialising and then the BinaryReader to read the data back out and instantiate a new IndexData from it. It has to rebuild the TernarySearchTreeDictionary that the IndexData takes as a constructor argument but my feeling is that the processing required to do this is less than deserialising an already-populated IndexData using the BinaryFormatter. (I've not compared them thoroughly but in preliminary testing it seemed to take longer to deserialise with the BinaryFormatter when the data was loaded into a MemoryStream than the IndexDataSerialiser deserialisation took when loading from disk).
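The general technique is easy to sketch. The example below is not the IndexDataSerialiser itself, just an illustration of the same approach applied to a trivial data structure: write the bare minimum values with a BinaryWriter, layering a GZipStream on top for compression -

```csharp
using System.IO;
using System.IO.Compression;

public static class CompactSerialisationExample
{
  // Write only a count followed by the raw values - no type metadata, no field
  // names - which is what makes this so much smaller than BinaryFormatter output
  public static byte[] Serialise(int[] values)
  {
    using (var memoryStream = new MemoryStream())
    {
      using (var gzip = new GZipStream(memoryStream, CompressionMode.Compress))
      using (var writer = new BinaryWriter(gzip))
      {
        writer.Write(values.Length);
        foreach (var value in values)
          writer.Write(value);
      }
      return memoryStream.ToArray();
    }
  }

  // Read the data back in exactly the same order that it was written out
  public static int[] Deserialise(byte[] data)
  {
    using (var gzip = new GZipStream(new MemoryStream(data), CompressionMode.Decompress))
    using (var reader = new BinaryReader(gzip))
    {
      var values = new int[reader.ReadInt32()];
      for (var index = 0; index < values.Length; index++)
        values[index] = reader.ReadInt32();
      return values;
    }
  }
}
```

The real serialiser does the same thing for the full IndexData structure - writing out each field of each SourceFieldLocation in a known order and reconstructing them on the way back in.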

I might write another day about how I implemented the search term highlighting on this blog but I think this post has already gone on long enough! Update (9th April): See Search Term Highlighting with Source Locations.

For more information on this project, see the Full Text Indexer Round-up.

Posted at 23:29

Comments

The Full Text Indexer - Structured Queries

I've considered in the past extending the way in which searches can be defined for use with the Full Text Indexer. The current approaches are:

  1. A simple one-word query
  2. A multi-word query using GetPartialMatches
  3. A multi-word query where all of the words must appear

The first doesn't need much explaining, most of the examples have been based around this. The use of GetPartialMatches was outlined in the first Full Text Indexer post; the method takes a search term, a Token Breaker (to break the multi-word term into single-word terms) and a "MatchCombiner" delegate which describes how to combine the weighted matches for each broken-down word (this delegate will include the logic that determines whether all of the words in the original term must be matched or if it's just that a greater combined weight should be given to results that do match them all). This is the method that the search facility on this blog uses.

The third approach makes use of the ConsecutiveTokenCombiningTokenBreaker and is a bit different; when the index is being generated, the content is not only broken down into individual words but also into runs of multiple words. This is explained in more detail in the Token Breaker and String Normaliser variations post, but that's the gist. In this scenario, the search term is not broken down; it is treated as a single token to search for. If you want to perform searches for multi-word terms where those words must appear in the order specified (rather than just appearing in any order, anywhere throughout the source content - possibly spanning multiple fields) then this is what you'd use.

Structured Querying

I wanted to introduce a consolidated query method but I'd been putting off writing a parser to take a search string and work out what to do with the various components. However, having recently written a CSS / LESS parser (CSSParser on Bitbucket) I was inspired to use the same recursive parsing technique and piece together something for the Full Text Indexer.

I went into it wanting something vaguely like GetPartialMatches but with.. more. The first assumption I wanted to make was that where multiple terms are specified then they should be considered an OR combination; so a match will be found if any one of the terms is found. If a particular term absolutely must be present then it can be prefixed with a "+". If a term must not be present then it can be prefixed with a "-". These ideas are directly influenced by the Google query format! :)

This would allow us straight away to specify

apples pears bananas +fruit +nuts -lunatics

so that we could match articles (or whatever the source content may be) that have "fruit" and "nuts" in the content (but not "lunatics", we don't want those kinds of nuts!) and apply a greater match weight to results that contain the words "apples", "pears" and / or "bananas". If an article doesn't contain the word "apples" then it may still be returned so long as it contains the word "fruit" (and not "lunatics").

The same logic about word matching would be applied as normal, so if an index is built with an EnglishPluralityStringNormaliser then the word "fruits" would be matched as if it were "fruit".

There are a few more refinements that I wanted to add, the first also straight from Google search's interface! I wanted to allow words or phrases to be quoted such that they should appear precisely as specified. So, if our example became

"apples" pears bananas +fruit +nuts -lunatics

then the word "apple" should not be considered a match for "apples". This is also applicable to phrases so

"apples and pears"

should only match articles that contain the string "apples and pears", not ones that have the words "apples" / "and" / "pears" present but in a different order.

These should be combinable such that we could specify

-"apples and pears" apples pears bananas +fruit

which would return articles that definitely contained "fruit" (or a word that is considered equivalent by the string normaliser), with additional weight given to articles that contained "apples" / "pears" / "bananas", so long as they don't contain the phrase "apples and pears". I think I've contorted this example a bit far now :)

The final aspect to throw in the mix is the ability to bracket terms. Let's stretch the example one step further:

+(apples pears bananas) +fruit +nut -lunatic

This will return articles that contain at least one of "apples" / "pears" / "bananas" and "fruit" and "nut" and not "lunatic".

The bracketing and compulsory / excluding (the "+" and "-") operators should be combinable and nestable in any manner. They can't be nested within quoted sections as they would be considered to be part of the content, but quoted sections can be nested with brackets or combined with the other operators, as already seen. (If a quote is required within a quoted section then it may be escaped with a backslash).

Show me the code!

In case you're not that interested in stepping through the internals, there's a complete working example at the end of this post that demonstrates how to use this! Just change the string passed to the querier.GetMatches method to play around with it.

Content Analysers

The first step is to break down a search term into the various IQuerySegment types in the Querier project (in the Full Text Indexer Bitbucket repository): the StandardMatchQuerySegment, PreciseMatchQuerySegment, CompulsoryQuerySegment, ExcludingQuerySegment, CombiningQuerySegment and NoMatchContentQuerySegment (used, for example, when brackets surround empty content).

To illustrate, the example

+(apples pears bananas) +fruit +nut -lunatic

would be translated into

CombiningQuerySegment
{
  CompulsoryQuerySegment
  {
    CombiningQuerySegment
    {
      StandardMatchQuerySegment: apples
      StandardMatchQuerySegment: pears
      StandardMatchQuerySegment: bananas
    }
  },
  CompulsoryQuerySegment
  {
    StandardMatchQuerySegment: fruit
  },
  CompulsoryQuerySegment
  {
    StandardMatchQuerySegment: nut
  },
  ExcludingQuerySegment
  {
    StandardMatchQuerySegment: lunatic
  }
}

The outermost CombiningQuerySegment is required since a Content Analyser should only return a single query segment, and since there were multiple in the search term they have to be wrapped up in the CombiningQuerySegment.

To translate an arbitrary search term into an IQuerySegment, we use

var querySegment = (new BreakPointCharacterAnalyser()).Process(new StringNavigator(searchTerm));

That's quite a mouthful, but if you read on you'll see that the Querier class means that you should never need to call that directly.

It breaks tokens on whitespace unless inside a quoted section, so the only way to specify particular multi-word phrases is to quote them (as with "apples and pears" above).
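
To give a feel for the kind of processing involved, below is a much-simplified sketch of quote-aware token breaking. To be clear, this is illustrative only - it is not the BreakPointCharacterAnalyser code and it ignores brackets, the "+" / "-" operators and escaped quotes:

```csharp
using System.Collections.Generic;

public static class QueryTokeniserSketch
{
  // Simplified sketch: split a search term on whitespace unless we're inside
  // a quoted section (real parsing also handles brackets, "+" / "-" prefixes
  // and escaped quotes, all of which are ignored here for brevity)
  public static List<string> Break(string searchTerm)
  {
    var tokens = new List<string>();
    var current = "";
    var inQuotes = false;
    foreach (var c in searchTerm)
    {
      if (c == '"')
        inQuotes = !inQuotes;
      else if (char.IsWhiteSpace(c) && !inQuotes)
      {
        if (current != "")
          tokens.Add(current);
        current = "";
      }
      else
        current += c;
    }
    if (current != "")
      tokens.Add(current);
    return tokens;
  }
}
```

With this, breaking the string "apples and pears" bananas would yield two tokens; "apples and pears" and "bananas".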

Two Indexes

One thing I haven't addressed so far is how quoted sections can be processed differently to non-quoted sections. Unfortunately, there's no clever facility for this; the bad news is that, to deal with it, two indexes will have to be generated for the source content. The first index, the "default", uses the most common construction parameters and will be more forgiving on matches. It would be appropriate to use the EnglishPluralityStringNormaliser for this index, for example (assuming that it is English language content!). It will only need to deal with single word matches (as only quoted sections in the content are parsed into query segments with multiple words).

The second index, the "precise match" index, should be less forgiving (using a DefaultStringNormaliser, perhaps, which will normalise casing and ignore punctuation but not consider singular and plural versions of words to be equivalent). It will also need to make use of the ConsecutiveTokenCombiningTokenBreaker if quoted phrases are to be matchable (as opposed to only supporting quoting individual words).

Query Translator

The two indexes (and a MatchCombiner, see below) are used to instantiate a QueryTranslator whose method GetMatches will take an IQuerySegment and return an immutable set of WeightedEntry results, just like the GetMatches method of the IIndexData class.

The MatchCombiner is used whenever multiple matches need to be combined together into one - this will happen if there are multiple words in the initial query and will happen any time multiple terms are bracketed together. For the search term

apples +(pears bananas +(pomegranate tomato))

there will be three match weight combinations:

  1. pomegranate / tomato
  2. pears / bananas / combined-pomegranate-tomato
  3. apples / combined-pears-bananas-combined-pomegranate-tomato

This could be a simple summing or averaging of the match weights. One variation is to sum the weights but then always divide by a particular value; this reduces the weight of nested terms, so if terms are several bracketing levels deep then they will impart a lower weight on the final weight of the result. Whether this seems appropriate or not is up to you!
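
As a rough sketch of that dividing variation (the 1.5 divisor here is an arbitrary value for illustration, not something prescribed by the library):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MatchCombinerSketch
{
  // Sum the weights but always divide by a fixed value - since nested
  // bracketed terms get combined (and so divided) repeatedly, terms that are
  // several levels deep contribute less to the final result weight
  public static readonly Func<IEnumerable<float>, float> DampenedSum =
    weights => weights.Sum() / 1.5f;
}
```

Combining the innermost "pomegranate tomato" weights (1 and 1) gives roughly 1.33, which then gets divided again when rolled up into the "pears bananas" combination - so the innermost terms count for less than the outer "apples".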

The Querier

The Querier class tidies up access to the Content Analysers and the Query Translator to try to make life easier. The Querier is instantiated with the two indexes and the MatchCombiner that the QueryTranslator requires and exposes a method GetMatches which takes a search term, translates it into an IQuerySegment, passes it through the QueryTranslator and returns the weighted results.

Example code

Below is a complete example that has a simple "Post" source type. I've used the AutomatedIndexGeneratorFactoryBuilder (see The Full Text Indexer - Automating Index Generation) to kick things off. I've taken the opening content from a few Posts on my blog as example content. The largest piece of setup code is the instantiation of the generator for the "precise match" index, and that's mostly due to the explanatory comments!

using System;
using System.Linq;
using FullTextIndexer.Common.Lists;
using FullTextIndexer.Core.Indexes.TernarySearchTree;
using FullTextIndexer.Core.TokenBreaking;
using FullTextIndexer.Helpers;
using FullTextIndexer.Querier;

namespace Tester
{
  class Program
  {
    static void Main(string[] args)
    {
      var posts = new NonNullImmutableList<Post>(new[]
      {
        new Post(30, "The Full Text Indexer", "I started out on a journey a few months ago being " +
          "frustrated by the Lucene.net integration we had with one of our products at work (I'm not " +
          "badmouthing the Lucene project, I'm wholeheartedly blaming the integration I inherited!)"),
        new Post(31, "The Full Text Indexer - Adding and Subtracting", "The Full Text Indexer that I " +
          "talked about last time took a definition for an Index Generator for a specific TSource type " +
          "and produced an IndexData instance, using that generator, for a TSource set."),
        new Post(32, "The Full Text Indexer - Going International!", "Pushing on with the Full Text " +
          "Indexer series I've been posting about (see Full Text Indexer and Full Text Indexer - Adding " +
          "and Subtracting) I want to demonstrate how it can work with multi-lingual content")
      });

      var defaultIndexGenerator = (new AutomatedIndexGeneratorFactoryBuilder<Post, int>()).Get().Get();
      var preciseMatchIndexGenerator = (new AutomatedIndexGeneratorFactoryBuilder<Post, int>())
        .SetTokenBreaker(
          new ConsecutiveTokenCombiningTokenBreaker(
            // The ConsecutiveTokenCombiningTokenBreaker wraps another token breaker and then creates new
            // tokens by stringing runs of broken tokens together
            new WhiteSpaceExtendingTokenBreaker(
              new ImmutableList<char>(new[] { '<', '>', '[', ']', '(', ')', '{', '}', '.', ',' }),
              new WhiteSpaceTokenBreaker()
            ),

            // This is the maximum number of words that are strung together, if quoted sections have more
            // words than this then they won't be matched. A way to work around this may be hashed out
            // one day (but not today :)
            12,

            // Tokens may be given an additional weight multiplier (between 0 and 1) when content
            // is broken down; when multiple tokens are combined, a multiplier for the combined
            // token must be provided. Commonly it is stop words that have a fractional multiplier,
            // but when words are combined into a phrase, it makes sense to remove any fractional
            // multiplier and give the combined token the full value of 1.
            weightMultipliersOfCombinedTokens => 1
          )
        )
        .SetStringNormaliser(new DefaultStringNormaliser())
        .Get()
        .Get();

      var querier = new Querier<Post, int>(
        defaultIndexGenerator.Generate(posts),
        preciseMatchIndexGenerator.Generate(posts),
        (matchWeights, sourceQuerySegments) => matchWeights.Sum()
      );

      var matches = querier.GetMatches("Generator");
    }
  }

  public class Post
  {
    public Post(int id, string title, string content)
    {
      if (string.IsNullOrWhiteSpace(title))
        throw new ArgumentException("Null/blank title specified");
      if (string.IsNullOrWhiteSpace(content))
        throw new ArgumentException("Null/blank content specified");

      Id = id;
      Title = title;
      Content = content;
    }

    public int Id { get; set; }
    public string Title { get; set; }
    public string Content { get; set; }
  }
}

To try different search terms, just replace the string "Generator" with something else.

Generator

will indicate one result, as only Post 31 is matched (it contains the word "generator").

Indexer Generators

will indicate that all three Posts match. With the configuration here, Posts 31 and 32 are found to have an identical match weight of 4 - as Post 31 matches "Indexer" twice and "Generators" twice while Post 32 matches "Indexer" four times. (Post 30 matches "Indexer" once and "Generator" zero times).

Indexer +"multi-lingual"

will only match Post 32, since that is the only one that contains "multi-lingual".

"Full Text Indexer" -adding

will only match Post 30 since, while they all contain the phrase "Full Text Indexer", both Posts 31 and 32 also contain the word "adding".

"Full Text Indexers"

matches zero Posts, since none of them contain that precise phrase. They all contain "Full Text Indexer" (with the singular "Indexer") but not the plural "Full Text Indexers".

I don't think any more examples are required, really, hopefully it's clear enough how to construct the queries and understand how they're applied :)

I wouldn't necessarily expect this structured querying to be exposed through a simple site search (I have no intention of enabling it on this blog at the moment*) but it could certainly have a place elsewhere in application logic for performing a variety of full text searches against data.

* (The site search configuration here makes it compulsory that every word in the search term is matched in order for a Post to be returned, for cases where multiple words are specified. Changing over to use the Querier would mean that Posts would come back that don't match all of the words unless the "+" compulsory operator precedes each of them which, for now, I don't want to do).

Posted at 22:36

Comments

The Full Text Indexer - Automating Index Generation

In the introductory Full Text Indexer post I showed how to build an Index Generator by defining "Content Retrievers" for each property of the source data type. I didn't think that, in itself, this was a huge amount of code to get started but it did have a generous spattering of potentially-cryptic class instantiations that implied a large assumed knowledge before you could use it.

With that in mind, I've added a project to the Full Text Indexer (Bitbucket) solution that can automate this step by applying a combination of reflection (to examine the source type) and default values for the various dependencies (eg. the string normaliser, token breaker, etc..).

This means that indexing data can now be as simple as:

var indexGenerator = (new AutomatedIndexGeneratorFactoryBuilder<Post, int>()).Get().Get();
var index = indexGenerator.Generate(posts.ToNonNullImmutableList());

where posts is a set of Post instances (the ToNonNullImmutableList call is not required if the set is already a NonNullImmutableList<Post>).

public class Post
{
  public int Id { get; set; }
  public string Title { get; set; }
  public string Content { get; set; }
  public IEnumerable<Comment> Comments { get; set; }
}

public class Comment
{
  public string Author { get; set; }
  public string Content { get; set; }
}

The two "Get" calls are because the example uses an AutomatedIndexGeneratorFactoryBuilder which is able to instantiate an AutomatedIndexGeneratorFactory using a handful of defaults (explained below). The AutomatedIndexGeneratorFactory is the class that processes the object model to determine how to extract text data. Essentially it runs through the object graph and looks for text properties, working down through nested types or sets of nested types (like the IEnumerable<Comment> in the Post class above).

So an AutomatedIndexGeneratorFactory is returned from the first "Get" call and this returns an IIndexGenerator<Post, int> from the second "Get".

// This means we can straight away query data like this!
var results = index.GetMatches("potato");

(Note: Ignore the fact that I'm using mutable types for the source data here when I'm always banging on about immutability - it's just for brevity of example source code :)

Tweaking the defaults

This may be enough to get going - because once you have an IIndexGenerator you can start calling GetMatches and retrieving search results straight away, and if your data changes then you can update the index reference with another call to

indexGenerator.Generate(posts.ToNonNullImmutableList());

But there are a few simple methods built in to adjust some of the common parameters - eg. to give greater weight to text matched in Post Titles I can specify:

var indexGenerator = (new AutomatedIndexGeneratorFactoryBuilder<Post, int>())
  .SetWeightMultiplier("DemoApp.Post", "Title", 5)
  .Get()
  .Get();

If, for some reason, I decide that the Author field of the Comment type shouldn't be included in the index I can specify:

var indexGenerator = (new AutomatedIndexGeneratorFactoryBuilder<Post, int>())
  .SetWeightMultiplier("DemoApp.Post.Title", 5)
  .Ignore("DemoApp.Comment.Author")
  .Get()
  .Get();

If I didn't want any comments content then I could ignore the Comments property of the Post object entirely:

var indexGenerator = (new AutomatedIndexGeneratorFactoryBuilder<Post, int>())
  .SetWeightMultiplier("DemoApp.Post.Title", 5)
  .Ignore("DemoApp.Post.Comments")
  .Get()
  .Get();

(There are overloads for SetWeightMultiplier and Ignore that take a PropertyInfo argument instead of the strings if that's more appropriate for the case in hand).

Explaining the defaults

The types that the AutomatedIndexGeneratorFactory requires are a Key Retriever, a Key Comparer, a String Normaliser, a Token Breaker, a Weighted Entry Combiner and a Token Weight Determiner.

The first is the most simple - it needs a way to extract a Key for each source data instance. In this example, that's the int "Id" field. We have to specify the type of the source data (Post) and type of Key (int) in the generic type parameters when instantiating the AutomatedIndexGeneratorFactoryBuilder. The default behaviour is to look for properties named "Key" or "Id" on the data type, whose property type is assignable to the type of the key. So in this example, it just grabs the "Id" field from each Post. If alternate behaviour was required then the SetKeyRetriever method may be called on the factory builder to explicitly define a Func<TSource, TKey> to do the job.
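
That default key-retrieval behaviour can be approximated with the following (a sketch of the idea only, not the library's actual code):

```csharp
using System;

public static class KeyRetrieverSketch
{
  // Look for a property named "Key" or "Id" whose type is assignable to TKey
  // and build a delegate that reads it from a source instance
  public static Func<TSource, TKey> BuildDefaultKeyRetriever<TSource, TKey>()
  {
    foreach (var propertyName in new[] { "Key", "Id" })
    {
      var property = typeof(TSource).GetProperty(propertyName);
      if ((property != null) && typeof(TKey).IsAssignableFrom(property.PropertyType))
        return source => (TKey)property.GetValue(source, null);
    }
    throw new ArgumentException("No \"Key\" or \"Id\" property found with an appropriate type");
  }
}
```

For the Post example above, BuildDefaultKeyRetriever<Post, int>() would return a delegate that reads the "Id" property.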

The default Key Comparer uses the DefaultEqualityComparer<TKey> class, which just checks for equality using the Equals method of TKey. If this needs overriding for any reason, then the SetKeyComparer method will take an IEqualityComparer<TKey> to do the job.

The String Normaliser used is the EnglishPluralityStringNormaliser, wrapping a DefaultStringNormaliser. I've written about these in detail before (see The Full Text Indexer - Token Breaker and String Normaliser variations). The gist is that punctuation, accented characters, character casing and pluralisation are all flattened so that common expected matches can be made. If this isn't desirable, there's a SetStringNormaliser method that takes an IStringNormaliser. There's a pattern developing here! :)

The Token Breaker dissects text content into individual tokens (normally individual words). The default will break on any whitespace, brackets (round, triangular, square or curly) and other punctuation that tends to define word breaks such as commas, colons, full stops, exclamation marks, etc.. (but not apostrophes, for example, which mightn't mark word breaks). There's a SetTokenBreaker method which takes an ITokenBreaker reference if you want it.

The Weighted Entry Combiner describes the calculation for combining match weight when multiple tokens for the same Key are found. If, for example, I have the word "article" once in the Title of a Post (with weight multiplier 5 for Title, as in the examples above) and the same word twice in the Content, then how should these be combined into the final match weight for that Post when "article" is searched for? Should it be the greatest value (5)? Should it be the sum of all of the weights (5 + 1 + 1 = 7)? The Weighted Entry Combiner takes a set of match weights and must return the final combined value. The default is to sum them together, but there's always the SetWeightedEntryCombiner method if you disagree!

Nearly there.. the Token Weight Determiner specifies what weight each token that is extracted from the text content should be given. By default, tokens are given a weight of 1 for each match unless they are from a property to ignore (in which case they are skipped) or they are from a property that was specified by the SetWeightMultiplier method, in which case they will take the value provided there. Any English stop words (common and generally irrelevant words such as "a", "an" and "the") have their weights divided by 100 (so they're not removed entirely, but matches against them count much less than matches for anything else). This entire process can be replaced by calling SetTokenWeightDeterminer with an alternate implementation (the property that the data has been extracted from will be provided so different behaviour per-source-property can be supported, if required).
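
As a sketch of that default weighting (with a token stop word set here for brevity - the real English list is much longer):

```csharp
using System;
using System.Collections.Generic;

public static class TokenWeightDeterminerSketch
{
  // A tiny illustrative stop word set - the real English list is much longer
  private static readonly HashSet<string> EnglishStopWords =
    new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "a", "an", "the" };

  // Every matched token gets a weight of 1 unless it's a stop word, in which
  // case its weight is divided by 100 (so it still counts, just barely)
  public static float GetTokenWeight(string normalisedToken)
  {
    return EnglishStopWords.Contains(normalisedToken) ? (1f / 100) : 1f;
  }
}
```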

Phew!

Well done if you got drawn in with the introductory this-will-make-it-really-easy promise and then actually got through the detail as well! :)

I probably went off on a deeper tangent into the details than I really needed to for this post. But if you're somehow desperate for more then I compiled my previous posts on this topic into a Full Text Indexer Round-up where there's plenty more to be found!

Posted at 00:01

Comments

The Full Text Indexer Post Round-up

This is a compilation of links to articles outlining some of the details of the Full Text Indexer project I put together, just so I could point a link to everything all in one place (like from the BitBucket ReadMe!)

I wrote about the basic building blocks of the Index Generator, went off on a few tangents about how using different key types could allow for searches over data with multi-lingual content (or support Product data that has different descriptions for different web sites, for example) and then came back round to illustrate how I've used the code for this blog's search functionality.

Along the journey, I got to learn a few new things, take advantage of others' research and have fun trying to improve the performance of some of the bottlenecks in the index generation process.

I also had a chance to revisit the basic immutable list structure that I used from the get-go in this project and improve its performance characteristics as well (again, taking a lot of inspiration from cleverer people who've tackled the same problems before me! :)

The code can be found in the Full Text Indexer BitBucket Repository. I've still got a few ideas I'm contemplating toying with - but I've also got other projects I want to investigate! So we'll just have to see what happens with this next..

Update (5th March 2013): I just can't seem to let this lie! :) I've added another post The Full Text Indexer - Automating Index Generation which demonstrates some new code that will examine your source data type and generate an index for you, all on its own! Easy! (Added to the list above).

Update (14th March 2013): And another! This time about support for structured querying, a way to combine terms with AND, OR, NOT operators. See The Full Text Indexer - Structured Queries. (Added to the list above).

Update (28th March 2013): Documenting an extension to the index data that allow for more performant consecutive term matching: The Full Text Indexer: Source Locations. Followed by a way to utilise this information for Search Term Highlighting with Source Locations. (Added to the list above).

Update (25th July 2013): Inspired by the "The 10 Megabyte Manifesto" and NeoCities, I've developed a way to consume search index data with JavaScript to enable a copy of this blog to be hosted where the searching is done entirely client-side. Read about it at The Full Text Indexer goes client-side! and see it in action live at productiverage.neocities.org! (Added to the list above).

Update (30th July 2013): A follow-up to the "The Full Text Indexer goes client-side" describing how the search index data can be compressed to take up less space on the host: JavaScript Compression (Putting my JSON Search Indexes on a diet). (Added to the list above).

Posted at 18:06

Comments

The Full Text Indexer - Token Breaker and String Normaliser variations

I've written a few Posts now about the investigative Full Text Indexer project I've been playing with (see also Updating the Index, MultiLingual Content and various tangents into plurality handling) but I think I've got at least one more to round it off.

Among the basic components of the Index Generator that determine the manner in which it processes content are (as outlined in that First Indexer Post):

Token Breakers

This example uses a "WhiteSpaceTokenBreaker" which will take the string content from the Content Retrievers and break it into individual words by splitting on whitespace characters. Straightforward!

String Normaliser

The String Normaliser is essentially an IEqualityComparer and will be used to generate a lookup of tokens and compare them against values passed into the GetMatches method. The DefaultStringNormaliser will remove all punctuation, exchange all non-latin characters for latin equivalents and lower-case them all. For the most basic lookups I had in mind, this does the hard work.

String Normalisers

There are not many String Normaliser implementations at present. The most basic would be to compare two strings to see if they match - very simple, but for the use cases I have in mind*, which are largely European language passages, not very useful. The next step up is a case-insensitive match. Marginally better, but there are all sorts of punctuation marks that I might want to ignore - eg. I probably want to consider "James" (the name) to be the same as "James'" (the possessive determiner; er, if my searching for the correct linguistic phrase has led me to the correct one :). I may want to try to ignore differences between accented characters (eg. consider "Jose" to match "José"). If we end up comparing strings that contain whitespace (since there's nothing forcing all Token Breakers to break on whitespace) then we probably want to normalise the whitespace such that a tab is equivalent to a space is equivalent to a run of multiple spaces.

The DefaultStringNormaliser will iron over all of these mentioned cracks by normalising whitespace, removing punctuation characters, replacing accented characters with non-accented equivalents, lower-casing the content and trimming it before yielding the final value for comparison.
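
Those steps can be roughly approximated as follows - to be clear, this is a simplification for illustration and not the DefaultStringNormaliser's actual implementation:

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class NormaliserSketch
{
  // Approximate the normalisation steps: strip accents (via Unicode
  // decomposition), remove punctuation, collapse whitespace runs into single
  // spaces, then lower-case and trim
  public static string Normalise(string value)
  {
    var withoutAccents = new string(
      value.Normalize(NormalizationForm.FormD)
        .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        .ToArray()
    );
    var withoutPunctuation = new string(
      withoutAccents.Where(c => !char.IsPunctuation(c)).ToArray()
    );
    var collapsedWhitespace = string.Join(
      " ",
      withoutPunctuation.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
    );
    return collapsedWhitespace.ToLowerInvariant().Trim();
  }
}
```

With this, "James'" and "James" both normalise to "james" and "José" matches "Jose".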

The EnglishPluralityStringNormaliser will (optionally) wrap another String Normaliser (very feasibly a DefaultStringNormaliser) and then add a further layer of processing to the output so that plurality of terms is ignored; so "cat" and "cats" are considered to match, as are "cactus" and "cacti", for example. The approach it takes isn't perfect but it gets the job done for the most common cases.
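
To picture the idea, here's a deliberately crude sketch that only trims a couple of common suffixes (the real class is considerably more sophisticated about which suffix groups it considers equivalent):

```csharp
using System;

public static class PluralitySketch
{
  // Apply a wrapped normaliser first and then trim a common plural suffix so
  // that (for example) "cat" and "cats" reduce to the same underlying value
  public static string Normalise(string value, Func<string, string> wrappedNormaliser)
  {
    var normalised = wrappedNormaliser(value);
    if (normalised.EndsWith("es") && (normalised.Length > 2))
      return normalised.Substring(0, normalised.Length - 2);
    if (normalised.EndsWith("s") && (normalised.Length > 1))
      return normalised.Substring(0, normalised.Length - 1);
    return normalised;
  }
}
```

So Normalise("Cats", v => v.ToLowerInvariant()) and Normalise("cat", v => v.ToLowerInvariant()) both return "cat".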

* (The fact that a String Normaliser has to be specified in order to instantiate an IndexGenerator should mean that it would be straight-forward to configure one for use cases that I didn't specifically have in mind when I wrote it).

Token Breakers

The WhiteSpaceTokenBreaker is probably the most obvious implementation, it breaks all content on whitespace and considers each resulting segment to be a token. However, there are a lot of other characters which may constitute a break between words - normally these have whitespace around them as well but that relies on the input content following particular formatting rules (like ensuring that commas are always followed by a space). So we also have the WhiteSpaceExtendingTokenBreaker. This will replace particular characters with a space before handing off processing to another Token Breaker. It may, for example, specify that all brackets (round, square, curly or triangular) be replaced with whitespace, along with full stops and commas. This is useful for a lot of common content. Note that single quotes would not be replaced since they generally do not indicate the end of a word - eg. "don't" is one word, it should not be split into "don" and "t". This would rely upon the use of a String Normaliser that ignores punctuation such as single quotes so that "O'Connor" and "OConnor" are considered equivalent.

More interesting variations on the theme are the PartialMatchingTokenBreaker and the ConsecutiveTokenCombiningTokenBreaker. The first will wrap another Token Breaker and then process each of the resulting tokens by generating all substrings from them that are at least as long as the "minLengthOfPartialMatches" and no longer than the "maxLengthOfPartialMatches" constructor arguments on the class. This provides a simple way to implement "partial word matching" and also illustrates the benefit of returning a "WeightAdjustingToken" set from the Token Breaker; these partial words can be given much less weight when stored in the Index, such that full word matches for content appear much higher in a result set (ordered by match weight aka match quality). A "partialMatchWeightDeterminer" delegate is passed to the constructor and used to calculate the weight of these partial matches.
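
The substring generation can be sketched like this (an illustration of the principle rather than the PartialMatchingTokenBreaker's actual code):

```csharp
using System.Collections.Generic;

public static class PartialMatchSketch
{
  // Generate every substring of a token whose length falls within the
  // specified bounds - these would then be indexed with reduced weights so
  // that full-word matches rank higher than partial ones
  public static IEnumerable<string> GetPartialMatches(string token, int minLength, int maxLength)
  {
    for (var start = 0; start < token.Length; start++)
    {
      for (var length = minLength; (length <= maxLength) && ((start + length) <= token.Length); length++)
        yield return token.Substring(start, length);
    }
  }
}
```

For example, GetPartialMatches("cats", 3, 3) yields "cat" and "ats".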

The ConsecutiveTokenCombiningTokenBreaker is essentially the opposite, it will apply a specified Token Breaker against the input content first and then generate additional tokens by combining runs of consecutive tokens. So if a passage contains the words "Gotos considered harmful" then instead of this being broken down into just the set "Gotos", "considered", "harmful" it may also result in (depending upon the maxNumberOfTokens constructor argument) "Gotos considered", "considered harmful" and "Gotos considered harmful". Again, greater weights may be assigned to these runs of tokens via the weightMultiplierDeterminer constructor argument (a delegate that returns a weight multiplier based upon the number of tokens combined to form the extended token). This would enable the article with the phrase "Gotos considered harmful" to be assigned a greater weight than one that has the separate words "Gotos", "considered" and "harmful" (but not arranged into that particular phrase). This would rely upon a search being performed using the GetPartialMatches method of the index, which breaks up the search term using a particular Token Breaker, rather than requiring the entire phrase be matched precisely (this is covered briefly towards the end of the first Full Text Indexer post).
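
The combining behaviour can be sketched as follows (again, an illustration of the principle rather than the actual implementation, which also has to deal with the weight multipliers):

```csharp
using System.Collections.Generic;

public static class ConsecutiveTokenSketch
{
  // Given the tokens produced by a wrapped Token Breaker, also emit every run
  // of consecutive tokens up to the specified maximum run length
  public static IEnumerable<string> CombineConsecutiveTokens(IList<string> tokens, int maxNumberOfTokens)
  {
    for (var start = 0; start < tokens.Count; start++)
    {
      var combined = tokens[start];
      yield return combined;
      for (var extra = 1; (extra < maxNumberOfTokens) && ((start + extra) < tokens.Count); extra++)
      {
        combined += " " + tokens[start + extra];
        yield return combined;
      }
    }
  }
}
```

So ["Gotos", "considered", "harmful"] with a maximum run length of three produces "Gotos", "Gotos considered", "Gotos considered harmful", "considered", "considered harmful" and "harmful".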

The use of these token breakers, whether individually or in combination, will result in more data being stored in the Index (as well as more processing of the input content required in order to generate the index) but offer the benefits that searches can also match content more loosely in some cases while prioritising the best matches in others.

Update (28th March 2013): The ConsecutiveTokenCombiningTokenBreaker is no longer the best way to deal with searches for consecutive terms, there is now a GetConsecutiveMatches extension method for IIndexData that doesn't require the additional (expensive) processing when the index is generated, see The Full Text Indexer: Source Locations.

Bonus Material: The Indexer in action (small scale, though it may be!)

All of the above is a bit dry and I wanted to include it largely to round out the introductory series to this code. So to make this post marginally more interesting, I thought I'd include the configuration in which I've used it on this blog to implement the site search and the autocomplete facility.

I have the following method which is used to generate Index content for both the site search and the auto-complete functionality. Posts have an integer Id and string Title and Content properties. They also have LastModified and Archive properties which enables the Indexes to be cached in memory and on disk, only rebuilding when content has changed (ie. a new Post has been published, an existing Post has been updated or a Post has been archived).

The bulk of the Index generation is illustrated below with comments around most of the decisions:

private IIndexData<int> GenerateIndexData(
    NonNullImmutableList<Post> posts,
    IStringNormaliser sourceStringComparer)
{
    if (posts == null)
        throw new ArgumentNullException("posts");
    if (sourceStringComparer == null)
        throw new ArgumentNullException("sourceStringComparer");

    // Define the manner in which the raw content is retrieved from Post title and body
    // - English stop words will only receive 1% the weight when match qualities are
    //   determined than other words will receive
    // - Words in the title will be given 5x the weight of words found in body content
    var englishStopWords = FullTextIndexer.Constants.GetStopWords("en");
    var contentRetrievers = new List<ContentRetriever<Post, int>>();
    contentRetrievers.Add(new ContentRetriever<Post, int>(
        p => new PreBrokenContent<int>(p.Id, p.Title),
        token => (englishStopWords.Contains(token, sourceStringComparer) ? 0.01f : 1f) * 5f
    ));
    contentRetrievers.Add(new ContentRetriever<Post, int>(
        p => new PreBrokenContent<int>(p.Id, p.Content),
        token => englishStopWords.Contains(token, sourceStringComparer) ? 0.01f : 1f
    ));

    // Specify the token breaker
    // - English content will generally break on "." and "," (unlike "'" or "-" which
    //   are commonly part of words). Also break on round brackets for written content
    //   but also the other bracket types and other common characters that might
    //   represent word breaks in code content found on the site
    var tokenBreaker = new WhiteSpaceExtendingTokenBreaker(
        new ImmutableList<char>(new[] {
            '<', '>', '[', ']', '(', ')', '{', '}',
            '.', ',', ':', ';', '"', '?', '!',
            '/', '\\',
            '@', '+', '|', '='
        }),
        new WhiteSpaceTokenBreaker()
    );

    // Generate an index using the specified StringNormaliser, 
    // - The Post class has an integer Id so a simple IntEqualityComparer (see below)
    //   will do the job fine for the dataKeyComparer
    // - If the search term is matched multiple times in a Post then combine the match
    //   weight in a simple additive manner (hence the weightedValues.Sum() call)
    var indexGenerator = new IndexGenerator<Post, int>(
        contentRetrievers.ToNonNullImmutableList(),
        new IntEqualityComparer(),
        sourceStringComparer,
        tokenBreaker,
        weightedValues => weightedValues.Sum(),
        new NullLogger()
    );
    return indexGenerator.Generate(posts.ToNonNullImmutableList());
}

[Serializable]
private class IntEqualityComparer : IEqualityComparer<int>
{
    public bool Equals(int x, int y)
    {
        return (x == y);
    }

    public int GetHashCode(int obj)
    {
        return obj;
    }
}
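The weighting logic above boils down to a per-token multiplier plus an additive combination of the individual match weights. As a rough, language-agnostic sketch (in Python here for brevity - this approximates the ContentRetriever / weight-combiner behaviour, it is not the library's actual implementation, and the stop word set is an illustrative subset):

```python
# Sketch of how per-token weights are assigned and then combined additively
# (an approximation of the logic described above, not the library's code)
ENGLISH_STOP_WORDS = {"the", "a", "an", "of", "to", "and"}  # illustrative subset

def token_weight(token, is_title):
    # Stop words receive only 1% of the weight that other words receive..
    weight = 0.01 if token.lower() in ENGLISH_STOP_WORDS else 1.0
    # ..and tokens found in the title receive 5x the weight of body tokens
    return weight * 5.0 if is_title else weight

def match_weight(token, title_tokens, body_tokens):
    # Multiple matches for the same token are combined in a simple
    # additive manner (the weightedValues.Sum() call above)
    occurrences = (
        [token_weight(t, True) for t in title_tokens if t.lower() == token.lower()]
        + [token_weight(t, False) for t in body_tokens if t.lower() == token.lower()]
    )
    return sum(occurrences)

# "index" appears once in the title (weight 5) and twice in the body (1 + 1)
print(match_weight("index", ["Full", "Text", "Index"], ["an", "index", "of", "index", "data"]))  # 7.0
```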

The generation of the search index is fairly straightforward. The content on my blog is English with code samples (mostly C#), so I use an EnglishPluralityStringNormaliser that wraps a DefaultStringNormaliser (the PreNormaliserWorkOptions flags specified in the constructor are optimisations described in Optimising the Plurality-Handling Normaliser).

var indexDataForSearching = GenerateIndexData(
    posts,
    new EnglishPluralityStringNormaliser(
        new DefaultStringNormaliser(),
        EnglishPluralityStringNormaliser.PreNormaliserWorkOptions.PreNormaliserLowerCases
        | EnglishPluralityStringNormaliser.PreNormaliserWorkOptions.PreNormaliserTrims
    )
);

The IIndexData<T> class has a GetAllTokens() method which is useful for the autocomplete functionality, but it's less useful with the above string normaliser since that applies various manipulations to the keys (not only does it normalise word endings for plurality handling but it replaces accented characters and removes punctuation). In order to generate an index that we can extract token data from for an autocomplete list, we want to avoid these manipulations. A normaliser that does so doesn't exist in the FullTextIndexer project, since it's not very useful for the intended search functionality, but as a convenient (and very simple!) example of how to vary the functionality we can use a NonAlteringStringNormaliser:

[Serializable]
public class NonAlteringStringNormaliser : StringNormaliser
{
    public override string GetNormalisedString(string value)
    {
        if (string.IsNullOrWhiteSpace(value))
            throw new ArgumentException("Null/blank value specified");

        return value;
    }
}

This inherits from the abstract StringNormaliser class which is in the FullTextIndexer project and implements the Equals and GetHashCode methods of the IEqualityComparer<string> interface, requiring the derived class to provide the GetNormalisedString(string value) method only.
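That pattern - deriving equality and hashing from a single normalisation method - is easy to express in any language. Here's a hedged Python sketch that mirrors the shape of the abstract StringNormaliser (this is an illustration of the pattern, not the library's code, and the CaseInsensitiveNormaliser subclass is invented for the example):

```python
from abc import ABC, abstractmethod

class StringNormaliser(ABC):
    """Equality and hashing are both derived from the normalised form, so a
    subclass only needs to supply get_normalised_string (the same contract
    as the C# abstract class described above)."""

    @abstractmethod
    def get_normalised_string(self, value: str) -> str: ...

    def equals(self, x: str, y: str) -> bool:
        return self.get_normalised_string(x) == self.get_normalised_string(y)

    def get_hash_code(self, value: str) -> int:
        return hash(self.get_normalised_string(value))

class CaseInsensitiveNormaliser(StringNormaliser):
    # A minimal illustrative subclass (not part of the real library)
    def get_normalised_string(self, value: str) -> str:
        if not value or not value.strip():
            raise ValueError("Null/blank value specified")
        return value.lower()

n = CaseInsensitiveNormaliser()
print(n.equals("Token", "TOKEN"))  # True - the two tokens are considered equivalent
```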

And from this we can generate an autocomplete word list with a little more work:

var indexDataForAutoCompleteExtended = GenerateIndexData(
    posts,
    new NonAlteringStringNormaliser()
);
var autoCompleteContent = new NonNullOrEmptyStringList(
    indexDataForAutoCompleteExtended.GetAllTokens()
    .Where(token =>
        (token.Length >= 3) &&
        !char.IsPunctuation(token[0])
        && !token.Any(c => char.IsNumber(c))
    )
    .Distinct(StringComparer.InvariantCultureIgnoreCase)
    .Where(token => indexDataForSearching.GetMatches(token).Any())
    .OrderBy(token => token.ToLower())
);

The filters applied here were determined by running this against the blog content at the time and making up some rules that made the resulting data look better (yes, as scientific as that! :) It ignores words fewer than three characters long as these are usually stop words (I considered ignoring all stop words but some of the words in the stop word list seemed like things people might search on). If there are multiple tokens that are variations of each other with differing case then only one of them will appear in the final list. Only tokens that actually result in matches in the "indexDataForSearching" content are included - this should always be the case for the string normaliser I'm currently using but if I tweak that in the future then I want to ensure that I don't end up with tokens in the autocomplete list that don't actually match any Posts!
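Spelled out step by step, those rules form a small filtering pipeline. A rough Python sketch of the same logic (the has_search_matches parameter stands in for the indexDataForSearching.GetMatches check, and the punctuation test is an approximation of .NET's char.IsPunctuation using Unicode categories):

```python
import unicodedata

def build_autocomplete_list(tokens, has_search_matches):
    # The same rules as the LINQ query above: at least three characters,
    # no leading punctuation and no digits anywhere in the token
    def acceptable(token):
        return (
            len(token) >= 3
            and not unicodedata.category(token[0]).startswith("P")
            and not any(c.isdigit() for c in token)
        )
    # Case-insensitive Distinct: keep the first casing encountered
    seen, distinct = set(), []
    for token in filter(acceptable, tokens):
        if token.lower() not in seen:
            seen.add(token.lower())
            distinct.append(token)
    # Only keep tokens that the search index actually has matches for,
    # then order case-insensitively
    return sorted(
        (t for t in distinct if has_search_matches(t)),
        key=str.lower,
    )

words = ["Indexer", "indexer", "it", ".NET", "v2", "Token", "missing"]
print(build_autocomplete_list(words, lambda t: t != "missing"))  # ['Indexer', 'Token']
```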

It's worth noting that the autocomplete list generated is really more like a "suggested" words list since it can't cover every single match. If, for example, the input data contained the word "cats" but not "cat" then the plurality-handling string normaliser used in the search data generation will match the search term "cat" to the word "cats" in the content, but the autocomplete word list would not contain the word "cat" since that wasn't in the source content (though the word "cats" would be, as it was in the content). In practice, I don't see this being a problem as the search box on this site allows you to enter anything - you aren't restricted to only words in the autocomplete list.

Finally, everything has been declared serialisable so that the index data could be cached on disk. In practice, this means that I build the index data locally when I update a Post and view it on my local computer - and then I upload the new content along with the on-disk index content so that searches performed on the site should be fast as soon as all this new data is uploaded.

Update (17th December 2012): This has been included as part of a later Full Text Indexer Round-up Post that brings together several Posts into one series, incorporating code and techniques from each of them.

Posted at 19:12

Comments

The Full Text Indexer - Going International!

Pushing on with the Full Text Indexer series I've been posting about (see Full Text Indexer and Full Text Indexer - Adding and Subtracting), I want to demonstrate how it can work with multi-lingual content (or other content variations - for example, with the data at my day job, Products have different delivery "channels" for which different descriptions may be recorded, as well as being in multiple languages).

A heads-up: This post is going to be largely code with a few explanatory paragraphs. There's nothing particularly complex going on and I think the code will - for the most part - speak for itself!

Setting the scene

In the previous examples, the TKey type of the IIndexData was an int representing an item's unique id. One way to extend this would be to specify as TKey the following:

public interface ILanguageScopedId : IEquatable<ILanguageScopedId>
{
    int Id { get; }
    bool IsApplicableForLanguage(int language);
}

Where two simple implementations might be:

public sealed class LanguageScopedId : ILanguageScopedId
{
    public LanguageScopedId(int id, int language)
    {
        Id = id;
        Language = language;
    }

    public int Id { get; private set; }

    public int Language { get; private set; }

    public bool IsApplicableForLanguage(int language)
    {
        return (language == Language);
    }

    public bool Equals(ILanguageScopedId obj)
    {
        var objLanguageScopedId = obj as LanguageScopedId;
        if (objLanguageScopedId == null)
            return false;

        return ((objLanguageScopedId.Id == Id) && (objLanguageScopedId.Language == Language));
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as ILanguageScopedId);
    }

    public override int GetHashCode()
    {
        // Since the overridden ToString method will consistently encapsulate all of the
        // information for this instance we use it to override the GetHashCode method,
        // consistent with the overridden Equals implementation
        return ToString().GetHashCode();
    }

    public override string ToString()
    {
        return String.Format("{0}:{1}-{2}", base.ToString(), Id, Language);
    }
}

public sealed class NonLanguageScopedId : ILanguageScopedId
{
    public NonLanguageScopedId(int id)
    {
        Id = id;
    }

    public int Id { get; private set; }

    public bool IsApplicableForLanguage(int language)
    {
        return true;
    }

    public bool Equals(ILanguageScopedId obj)
    {
        var objLanguageScopedId = obj as NonLanguageScopedId;
        if (objLanguageScopedId == null)
            return false;

        return (objLanguageScopedId.Id == Id);
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as ILanguageScopedId);
    }

    public override int GetHashCode()
    {
        // Since the overridden ToString method will consistently encapsulate all of the
        // information for this instance we use it to override the GetHashCode method,
        // consistent with the overridden Equals implementation
        return ToString().GetHashCode();
    }

    public override string ToString()
    {
        return String.Format("{0}:{1}", base.ToString(), Id);
    }
}

There are two implementations to account for the fact that not all content need be multi-lingual (see the Article class further down). I only really included IEquatable<ILanguageScopedId> in the ILanguageScopedId interface so that it would be easy to write the KeyComparer that the IndexGenerator requires (this was also the motivation for sealing the implementations; since they can't be inherited from, the type comparisons in the Equals methods are easier) -

/// <summary>
/// As the ILanguageScopedId interface implements IEquatable<ILanguageScopedId>, this
/// class has very little work to do
/// </summary>
public class LanguageScopedIdComparer : IEqualityComparer<ILanguageScopedId>
{
    public bool Equals(ILanguageScopedId x, ILanguageScopedId y)
    {
        if ((x == null) && (y == null))
            return true;
        else if ((x == null) || (y == null))
            return false;
        return x.Equals(y);
    }

    public int GetHashCode(ILanguageScopedId obj)
    {
        if (obj == null)
            throw new ArgumentNullException("obj");

        return obj.GetHashCode();
    }
}

The previous posts used an Article class as an illustration. Here I'll expand that class such that the Title and Content have content that may vary across different languages (represented by the MultiLingualContent class, also below) while Author will not (and so is just a string) -

public class Article
{
    public Article(
        int id,
        DateTime lastModified,
        MultiLingualContent title,
        string author,
        MultiLingualContent content)
    {
        if (title == null)
            throw new ArgumentNullException("title");
        if (string.IsNullOrWhiteSpace(author))
            throw new ArgumentException("Null/blank author specified");
        if (content == null)
            throw new ArgumentNullException("content");

        Id = id;
        LastModified = lastModified;
        Title = title;
        Author = author.Trim();
        Content = content;
    }

    public int Id { get; private set; }

    public DateTime LastModified { get; private set; }

    /// <summary>
    /// This will never be null
    /// </summary>
    public MultiLingualContent Title { get; private set; }

    /// <summary>
    /// This will never be null or blank
    /// </summary>
    public string Author { get; private set; }

    /// <summary>
    /// This will never be null
    /// </summary>
    public MultiLingualContent Content { get; private set; }
}

public class MultiLingualContent
{
    private string _defaultContent;
    private ImmutableDictionary<int, string> _languageOverrides;
    public MultiLingualContent(
        string defaultContent,
        ImmutableDictionary<int, string> languageOverrides)
    {
        if (string.IsNullOrWhiteSpace(defaultContent))
            throw new ArgumentException("Null/blank defaultContent specified");
        if (languageOverrides == null)
            throw new ArgumentNullException("languageOverrides");
        if (languageOverrides.Keys.Select(key => languageOverrides[key]).Any(
            value => string.IsNullOrWhiteSpace(value))
        )
            throw new ArgumentException("Null/blank encountered in languageOverrides data");

        _defaultContent = defaultContent.Trim();
        _languageOverrides = languageOverrides;
    }

    /// <summary>
    /// This will never return null or blank. If there is no language-specific content for
    /// the specified language then the default will be returned.
    /// </summary>
    public string GetContent(int language)
    {
        if (_languageOverrides.ContainsKey(language))
            return _languageOverrides[language].Trim();
        return _defaultContent;
    }
}

Note: The ImmutableDictionary (along with the NonNullImmutableList and the ToNonNullImmutableList extension method which are seen elsewhere in the code) can be found in the Full Text Indexer repo on Bitbucket.

Generating and querying the new Index format

For the purposes of this example, I'm going to assume that all of the possible languages are known upfront (if not, it's feasible that the source Article data could be analysed each time an Index is built to determine which languages are present, but for now I'm going to go with the easier case of knowing all of the options beforehand).

As we've seen before, we need to prepare an IndexGenerator (this time an IndexGenerator<Article, ILanguageScopedId> instead of an IndexGenerator<Article, int>, since the key type of the IIndexData that will be produced is no longer an int) with Content Retrievers, a Key Comparer, Token Breaker, Weighted Entry Combiner and Logger. There are more Content Retrievers here as the multi-lingual content must be requested for each supported language (though the non-multi-lingual content - the Author field on Article instances - only needs a single retriever).

var languages = new[] { 1, 2, 3 };

var contentRetrievers = 
    new[]
    {
        new ContentRetriever<Article, ILanguageScopedId>(
            article => new PreBrokenContent<ILanguageScopedId>(
                new NonLanguageScopedId(article.Id),
                article.Author
            ),
            token => 1f
        )
    }
    .Concat(
        languages.SelectMany(language => new[]
        {
            new ContentRetriever<Article, ILanguageScopedId>(
                article => new PreBrokenContent<ILanguageScopedId>(
                    new LanguageScopedId(article.Id, language),
                    article.Title.GetContent(language)
                ),
                token => 5f
            ),
            new ContentRetriever<Article, ILanguageScopedId>(
                article => new PreBrokenContent<ILanguageScopedId>(
                    new LanguageScopedId(article.Id, language),
                    article.Content.GetContent(language)
                ),
                token => 1f
            )
        }
    ));

var indexGenerator = new IndexGenerator<Article, ILanguageScopedId>(
    contentRetrievers.ToNonNullImmutableList(),
    new LanguageScopedIdComparer(),
    new DefaultStringNormaliser(),
    new WhiteSpaceTokenBreaker(),
    weightedValues => weightedValues.Sum(),
    new NullLogger()
);

var index = indexGenerator.Generate(new NonNullImmutableList<Article>(new[]
{
    new Article(
        1,
        new DateTime(2012, 7, 24),
        new ContentBuilder("One").AddLanguage(2, "Un").Get(),
        "Terrence",
        new ContentBuilder("First Entry").AddLanguage(2, "Première entrée").Get()
    ),
    new Article(
        2,
        new DateTime(2012, 8, 24),
        new ContentBuilder("Two").AddLanguage(2, "Deux").Get(),
        "Jeroshimo",
        new ContentBuilder("Second Entry").AddLanguage(2, "Deuxième entrée").Get()
    )
}));

Finally, there's a slight change to the querying mechanism. We have to perform a lookup for all keys that match a given token and then filter out any entries that we're not interested in. And since there are multiple key types which can relate to content in the same language (because a request for content in language 1 should combine keys of type LanguageScopedId which are marked as being for language 1 alongside keys of type NonLanguageScopedId), we may have to group and combine some of the results.

var language = 1;
var resultsEntryInLanguage1 = index.GetMatches("Entry")
    .Where(weightedMatch => weightedMatch.Key.IsApplicableForLanguage(language))
    .GroupBy(weightedMatch => weightedMatch.Key.Id)
    .Select(weightedMatchGroup => new WeightedEntry<int>(
        weightedMatchGroup.Key,
        weightedMatchGroup.Sum(weightedMatch => weightedMatch.Weight)
    ));
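The same filter / group / sum approach can be sketched outside of C#. In the Python illustration below, keys are modelled as (id, language) tuples where a language of None plays the NonLanguageScopedId role of "applicable for every language" (this is a sketch of the querying behaviour described above, not the library's API):

```python
from collections import defaultdict

def matches_for_language(weighted_matches, language):
    # weighted_matches: iterable of ((item_id, key_language_or_None), weight);
    # a key_language of None means the entry applies to every language
    combined = defaultdict(float)
    for (item_id, key_language), weight in weighted_matches:
        if key_language is None or key_language == language:
            combined[item_id] += weight  # group on the underlying id, sum the weights
    return dict(combined)

matches = [
    ((1, 1), 5.0),     # article 1, language 1 (eg. a title match)
    ((1, None), 1.0),  # article 1, language-independent (eg. an author match)
    ((1, 2), 5.0),     # article 1, language 2 - filtered out for language 1
    ((2, 1), 1.0),     # article 2, language 1
]
print(matches_for_language(matches, 1))  # {1: 6.0, 2: 1.0}
```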

The earlier code uses a "ContentBuilder" to prepare the MultiLingualContent instances, just because it removes some of the clutter from the code. For the sake of completeness, that can be seen below:

private class ContentBuilder
{
    private string _defaultContent;
    private ImmutableDictionary<int, string> _languageOverrides;

    public ContentBuilder(string defaultContent)
        : this(defaultContent, new ImmutableDictionary<int, string>()) { }

    private ContentBuilder(
        string defaultContent,
        ImmutableDictionary<int, string> languageOverrides)
    {
        if (string.IsNullOrWhiteSpace(defaultContent))
            throw new ArgumentException("Null/blank defaultContent specified");
        if (languageOverrides == null)
            throw new ArgumentNullException("languageOverrides");

        _defaultContent = defaultContent;
        _languageOverrides = languageOverrides;
    }

    public ContentBuilder AddLanguage(int language, string content)
    {
        if (string.IsNullOrWhiteSpace(content))
            throw new ArgumentException("Null/blank content specified");
        if (_languageOverrides.ContainsKey(language))
            throw new ArgumentException("Duplicate language: " + language);

        return new ContentBuilder(
            _defaultContent,
            _languageOverrides.AddOrUpdate(language, content)
        );
    }

    public MultiLingualContent Get()
    {
        return new MultiLingualContent(_defaultContent, _languageOverrides);
    }
}

Extended Key Types

This approach to supporting multi-lingual data is just one way of using the generic TKey type of the IndexGenerator / IndexData classes. I mentioned at the top that the data I deal with at my real job also varies descriptive data over multiple delivery channels; this could be implemented in a similar manner to the above by extending the ILanguageScopedId interface to:

public interface ILanguageScopedId : IEquatable<ILanguageScopedId>
{
    int Id { get; }
    bool IsApplicableFor(int language, int channel);
}

And, in the same way as the above code has both the LanguageScopedId and NonLanguageScopedId, there could be various implementations for content that does/doesn't vary by language and/or does/doesn't vary by delivery channel.

In fact, since there must be a Key Comparer passed in as a constructor argument to the IndexGenerator, any kind of key can be used with the index so long as an appropriate comparer is available!

Performance

The downside to this sort of approach is, predictably, increased resource requirements in the index generation. I say predictably because it should be clear that specifying more Content Retrievers (which we are; their number has to increase as the number of languages increases) means that more work will be done when an index is generated from input data.

Also, in the above example more storage space will be required to store the results as more content is being extracted and stored in the index - it's feasible that source data could be present which doesn't have any multi-lingual variations and so returns the default values from the MultiLingualContent.GetContent(language) call for every language. In that case, each token recorded for the data will have a key stored in the index for every language, each with duplicate weight data. It's possible that a more intelligent key structure could reduce the amount of space taken up in these cases but that's outside the scope of this post, I think (plus no solution springs immediately to mind at this time of night! :)

The good news is that retrieval time shouldn't be significantly increased; the additional work is only to filter the matched keys and group them together on the underlying id, so the lookup itself should still be very quick. The cost that the filtering and grouping incur will depend upon the structure of the key class.

Update (17th December 2012): This has been included as part of a later Full Text Indexer Round-up Post that brings together several Posts into one series, incorporating code and techniques from each of them.

Posted at 23:56

Comments

The Full Text Indexer - Adding and Subtracting

The Full Text Indexer that I talked about last time took a definition for an Index Generator for a specific TSource type and produced an IndexData instance, using that generator, for a TSource set.

In the example shown there, it created an IndexGenerator for IArticle and then generated an Index for an IArticle list. It returned an IIndexData<int> (TKey is an int in this case as the key on IArticle is its Id field, which is an int). This IIndexData<int> is an immutable data structure and so it may not be immediately obvious how to update it when the source data has changed.

Last time I mentioned that IIndexData<TKey> has this method:

public interface IIndexData<TKey>
{
    /// <summary>
    /// This will throw an exception for null or blank input. It will never return null.
    /// If there are no matches then an empty list will be returned.
    /// </summary>
    NonNullImmutableList<WeightedEntry<TKey>> GetMatches(string source);
}

but the full interface is:

public interface IIndexData<TKey>
{
    /// <summary>
    /// This will throw an exception for null or blank input. It will never return null.
    /// If there are no matches then an empty list will be returned.
    /// </summary>
    NonNullImmutableList<WeightedEntry<TKey>> GetMatches(string source);

    /// <summary>
    /// This will return a new instance that combines the source instance's data with the
    /// data other IndexData instances using the specified weight combiner. In a case where
    /// there are different TokenComparer implementations on this instance and on any of the
    /// indexesToAdd, the comparer from the current instance will be used. It is recommended
    /// that a consistent TokenComparer be used at all times. An exception will be thrown
    /// for null dataToAdd or weightCombiner references.
    /// </summary>
    IIndexData<TKey> Combine(
        NonNullImmutableList<IIndexData<TKey>> indexesToAdd,
        IndexGenerators.IndexGenerator.WeightedEntryCombiner weightCombiner
    );

    /// <summary>
    /// This will return a new instance without any WeightedEntry values whose Keys match
    /// the removeIf predicate. If tokens are left without any WeightedEntry values then
    /// the token will be excluded from the new data. This will never return null. It
    /// will throw an exception for a null removeIf.
    /// </summary>
    IIndexData<TKey> Remove(Predicate<TKey> removeIf);

    /// <summary>
    /// This will never return null, the returned dictionary will have this instance's
    /// KeyNormaliser as its comparer
    /// </summary>
    IDictionary<string, NonNullImmutableList<WeightedEntry<TKey>>> ToDictionary();

    /// <summary>
    /// This will never return null
    /// </summary>
    NonNullOrEmptyStringList GetAllTokens();

    /// <summary>
    /// This will never return null
    /// </summary>
    IEqualityComparer<string> TokenComparer { get; }

    /// <summary>
    /// This will never return null
    /// </summary>
    IEqualityComparer<TKey> KeyComparer { get; }
}

The TokenComparer and KeyComparer are the instances passed into the IndexGenerator's constructor (a DefaultStringNormaliser and an IntEqualityComparer in last time's example). The GetAllTokens method returns a set of tokens that have matches in the IndexData. Where multiple tokens present in the data are considered equivalent, only one will be in the set returned by GetAllTokens - the example used the DefaultStringNormaliser, which ignores case, so if data for the tokens "Token" and "TOKEN" is present (and encountered in that order) then only "Token" would be in the GetAllTokens set; "TOKEN" wouldn't be added as a distinct value as it is equivalent to "Token".

The interesting methods in this context are Combine and Remove.

Remove

Remove is the simpler of the two so I'll address that first: a predicate is passed to it which filters which key values should be allowed through; data which passes this filtering will be used to form a new IIndexData instance which will be returned from the method. The original IndexData instance remains unaltered while a filtered version is provided which meets the particular criteria.
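A hedged sketch of those Remove semantics, modelling the index as a simple token-to-weighted-entries dictionary (an illustration of the behaviour described above, not the real implementation):

```python
def remove(index, remove_if):
    # index: {token: [(key, weight), ...]}. Returns a new index with every
    # WeightedEntry whose key matches the predicate filtered out; tokens
    # left with no entries are dropped entirely. The original is unaltered.
    filtered = {
        token: [(key, weight) for (key, weight) in entries if not remove_if(key)]
        for token, entries in index.items()
    }
    return {token: entries for token, entries in filtered.items() if entries}

index = {"entry": [(1, 6.0), (2, 1.0)], "second": [(2, 5.0)]}
print(remove(index, lambda key: key == 2))  # {'entry': [(1, 6.0)]}
```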

Combine

The Combine method will take one or more additional IIndexData instances (for the same TKey type) and bring all of the content from these and the original index into a new instance describing aggregated data. Where data for the same keys appear in the indexes, the match weights will be combined using a specified "WeightedEntryCombiner" (which just takes a set of floats and returns a single value representing them all; the most common case is to sum the values but they could be averaged or the greatest value taken - whatever's most appropriate!).
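And a matching sketch of the Combine behaviour, again using a plain dictionary as a stand-in for the index (this approximates the semantics described above; it is not the library's implementation):

```python
def combine(index, indexes_to_add, weight_combiner):
    # Each index maps token -> {key: weight}. Entries for the same token
    # and key across the indexes have their weights merged by the supplied
    # combiner - commonly sum, but max or an average would work too
    all_weights = {}  # (token, key) -> [weight, weight, ...]
    for idx in [index, *indexes_to_add]:
        for token, entries in idx.items():
            for key, weight in entries.items():
                all_weights.setdefault((token, key), []).append(weight)
    combined = {}
    for (token, key), weights in all_weights.items():
        combined.setdefault(token, {})[key] = weight_combiner(weights)
    return combined

old_index = {"entry": {1: 6.0}}
new_index = {"entry": {1: 2.0, 2: 1.0}, "second": {2: 5.0}}
print(combine(old_index, [new_index], sum))  # {'entry': {1: 8.0, 2: 1.0}, 'second': {2: 5.0}}
```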

Pulling an example together

To show these methods in action I've extended the IArticle IndexGenerator concept from the previous post by wrapping it in another class that maintains an index over changing data, keeping a "source data summary" of which keys were used to generate the current data and what the last modified dates of the source data were. I'm aiming to come up with an "IndexBuilder" that will expose the following:

/// <summary>
/// This will never return null
/// </summary>
public IIndexData<TKey> Index { get; }

/// <summary>
/// This will never return null, it will throw an exception for null input
/// </summary>
public IIndexData<TKey> UpdateIndex(NonNullImmutableList<TSource> values);

All the same types will be required in the IndexBuilder constructor that the IndexGenerator required last time (the Content Retrievers, Key Comparer, Token Comparer, Weighted Entry Combiner and Logger) along with one additional dependency: a "Source Item Status Retriever". This is just a delegate that takes an instance of the generic type parameter TSource and returns a SourceDataSummary instance that reports its Key and LastModifiedDate (so hardly rocket science!). This will enable the IndexBuilder to maintain a summary of the input data that was used to build the current index and so determine what work (if any) is required when UpdateIndex is called.
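The "what work is required" decision can be sketched as a diff between two key-to-last-modified maps - a simplification of what the IndexBuilder shown further down actually does, with dates reduced to plain strings for the example:

```python
def diff_summaries(previous, current):
    # previous/current: {key: last_modified}. Returns (keys whose entries
    # must be removed from the index, keys that must be (re-)indexed);
    # updated items appear in both sets since their old entries are
    # removed before the regenerated entries are combined back in
    to_remove = {
        key for key, modified in previous.items()
        if key not in current or current[key] != modified
    }
    to_index = {
        key for key, modified in current.items()
        if key not in previous or previous[key] != modified
    }
    return to_remove, to_index

previous = {1: "2012-07-21", 2: "2012-08-21", 3: "2012-09-21"}
current = {1: "2012-07-21", 2: "2012-08-22", 4: "2012-09-22"}  # 2 updated, 3 deleted, 4 new
print(diff_summaries(previous, current))  # ({2, 3}, {2, 4})
```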

If the example code last time didn't look too scary, then neither should this:

// Instantiate an IndexBuilder that will index IArticles (which have ints as their Keys).
// - Content Retrievers describe how to extract data from each IArticle, there is a delegate
//   to retrieve Key and LastModifiedDate from IArticle (the "Source Item Status Retriever"),
//   there's a Token Breaker which breaks up the content, there's a String Normaliser which
//   compares the resulting Tokens to group them together, there's a "Weighted Entry
//   Combiner" which creates an aggregate weight for Tokens that are grouped,
//   there's an IntEqualityComparer that acts as a Key Comparer and there's
//   a Logger. See; nothing to it! :D

var indexBuilder = new IndexBuilder<IArticle, int>(
    new NonNullImmutableList<ContentRetriever<IArticle, int>>(new []
    {
        // Additional weight is given to words matched in the Title
        new ContentRetriever<IArticle, int>(
            article => new PreBrokenContent<int>(article.Id, article.Title),
            token => 5f
        ),
        new ContentRetriever<IArticle, int>(
            article => new PreBrokenContent<int>(article.Id, article.Content),
            token => 1f
        )
    }),
    article => new IndexBuilder<IArticle, int>.SourceDataSummary(
        article.Id,
        article.LastModified
    ),
    new IntEqualityComparer(),
    new WhiteSpaceTokenBreaker(),
    new DefaultStringNormaliser(),
    weightedValues => weightedValues.Sum(),
    new NullLogger()
);

Instead of instantiating an IndexGenerator directly we're going to use the IndexBuilder that I'm describing, and we'll pass data to it thusly:

var articles = new[]
{
    new Article(1, new DateTime(2012, 7, 21, 0, 0, 1), "One", "One Content"),
    new Article(2, new DateTime(2012, 8, 21, 0, 0, 1), "Two", "Two Content"),
    new Article(3, new DateTime(2012, 9, 21, 0, 0, 1), "Three", "Three Content")
};
var index = indexBuilder.UpdateIndex(new NonNullImmutableList<IArticle>(articles));

The source data types are not very interesting and are here only for completeness of the example, to be honest!

public class Article : IArticle
{
    public Article(int id, DateTime lastModified, string title, string content)
    {
        if (string.IsNullOrWhiteSpace(title))
            throw new ArgumentException("Null/blank title specified");
        if (string.IsNullOrWhiteSpace(content))
            throw new ArgumentException("Null/blank content specified");

        Id = id;
        LastModified = lastModified;
        Title = title.Trim();
        Content = content.Trim();
    }

    public int Id { get; private set; }

    public DateTime LastModified { get; private set; }

    /// <summary>
    /// This will never be null or blank
    /// </summary>
    public string Title { get; private set; }

    /// <summary>
    /// This will never be null or blank
    /// </summary>
    public string Content { get; private set; }
}

public interface IArticle
{
    int Id { get; }

    DateTime LastModified { get; }

    /// <summary>
    /// This will never be null or blank
    /// </summary>
    string Title { get; }

    /// <summary>
    /// This will never be null or blank
    /// </summary>
    string Content { get; }
}

And finally (finally!) the IndexBuilder itself. The constructor takes up a chunk of space, validating all of the input. Then there's a few lines taken up by the definition of the SourceItemStatusRetriever and SourceDataSummary class. At the end there's the UpdateIndex method which determines what work needs to be done to get its IndexData instance to match the new source data - and it uses the Remove and Combine methods to synchronise the index with the data:

using System;
using System.Collections.Generic;
using System.Linq;
using Common.Lists;
using Common.Logging;
using FullTextIndexer.Indexes;
using FullTextIndexer.Indexes.TernarySearchTree;
using FullTextIndexer.IndexGenerators;
using FullTextIndexer.TokenBreaking;

namespace FullTextIndexerDemo
{
    public class IndexBuilder<TSource, TKey> where TSource : class
    {
        private List<ContentRetriever<TSource, TKey>> _contentRetrievers;
        private SourceItemStatusRetriever _sourceItemStatusRetriever;
        private IEqualityComparer<TKey> _keyComparer;
        private ITokenBreaker _tokenBreaker;
        private IStringNormaliser _stringNormaliser;
        private IndexGenerator.WeightedEntryCombiner _weightedEntryCombiner;
        private IIndexGenerator<TSource, TKey> _indexGenerator;
        private IIndexData<TKey> _index;
        private Dictionary<TKey, DateTime> _sourceDataSummary;
        private object _writeLock;
        public IndexBuilder(
            NonNullImmutableList<ContentRetriever<TSource, TKey>> contentRetrievers,
            SourceItemStatusRetriever sourceItemStatusRetriever,
            IEqualityComparer<TKey> keyComparer,
            ITokenBreaker tokenBreaker,
            IStringNormaliser stringNormaliser,
            IndexGenerator.WeightedEntryCombiner weightedEntryCombiner,
            ILogger logger)
        {
            if (contentRetrievers == null)
                throw new ArgumentNullException("contentRetrievers");
            if (!contentRetrievers.Any())
                throw new ArgumentException("No contentRetrievers specified");
            if (sourceItemStatusRetriever == null)
                throw new ArgumentNullException("sourceItemStatusRetriever");
            if (keyComparer == null)
                throw new ArgumentNullException("keyComparer");
            if (tokenBreaker == null)
                throw new ArgumentNullException("tokenBreaker");
            if (stringNormaliser == null)
                throw new ArgumentNullException("stringNormaliser");
            if (weightedEntryCombiner == null)
                throw new ArgumentNullException("weightedEntryCombiner");
            if (logger == null)
                throw new ArgumentNullException("logger");

            var contentRetrieversTidied = new List<ContentRetriever<TSource, TKey>>();
            foreach (var contentRetriever in contentRetrievers)
            {
                if (contentRetriever == null)
                    throw new ArgumentException("Null encountered in contentRetrievers set");
                contentRetrieversTidied.Add(contentRetriever);
            }
            if (!contentRetrieversTidied.Any())
                throw new ArgumentException("No contentRetrievers specified");

            _contentRetrievers = contentRetrieversTidied;
            _sourceItemStatusRetriever = sourceItemStatusRetriever;
            _keyComparer = keyComparer;
            _tokenBreaker = tokenBreaker;
            _stringNormaliser = stringNormaliser;
            _weightedEntryCombiner = weightedEntryCombiner;
            _sourceDataSummary = new Dictionary<TKey, DateTime>(keyComparer);
            _writeLock = new object();

            _indexGenerator = new IndexGenerator<TSource, TKey>(
                contentRetrieversTidied.ToNonNullImmutableList(),
                keyComparer,
                stringNormaliser,
                tokenBreaker,
                weightedEntryCombiner,
                logger
            );
            _index = _indexGenerator.Generate(new NonNullImmutableList<TSource>());
        }

        /// <summary>
        /// This will never be called with a null source reference, it must never return null
        /// </summary>
        public delegate SourceDataSummary SourceItemStatusRetriever(TSource source);
        public class SourceDataSummary
        {
            public SourceDataSummary(TKey key, DateTime lastModified)
            {
                if (key == null)
                    throw new ArgumentNullException("key");

                Key = key;
                LastModified = lastModified;
            }
            public TKey Key { get; private set; }
            public DateTime LastModified { get; private set; }
        }

        /// <summary>
        /// This will never return null
        /// </summary>
        public IIndexData<TKey> Index
        {
            get
            {
                // Since the index is an immutable data type we don't need to worry about
                // locking it for read access
                return _index;
            }
        }

        /// <summary>
        /// This will never return null, it will throw an exception for null input
        /// </summary>
        public IIndexData<TKey> UpdateIndex(NonNullImmutableList<TSource> values)
        {
            if (values == null)
                throw new ArgumentNullException("values");

            var newIndexSummary = values
                .Select(value => _sourceItemStatusRetriever(value))
                .GroupBy(
                    summary => summary.Key,
                    _keyComparer
                )
                .ToDictionary(
                    group => group.Key,
                    group => group.Max(summary => summary.LastModified),
                    _keyComparer
                );

            lock (_writeLock)
            {
                // No source data, so just generate an empty index
                if (!newIndexSummary.Any())
                {
                    _sourceDataSummary = newIndexSummary;
                    _index = _indexGenerator.Generate(new NonNullImmutableList<TSource>());
                    return _index;
                }

                // There will (probably) be some keys to remove entirely, some that have to
                // be removed so that they can be replaced with updated content and some that
                // are not present in the existing data. First determine which keys fall into
                // which category (if any).
                var keysToRemove = new HashSet<TKey>(
                    _sourceDataSummary
                        .Select(summary => summary.Key)
                        .Except(newIndexSummary.Select(summary => summary.Key)),
                    _keyComparer
                );
                var keysToUpdate = new HashSet<TKey>(
                    _sourceDataSummary
                        .Where(summary =>
                        {
                            DateTime newSummaryLastModified;
                            if (!newIndexSummary.TryGetValue(
                                summary.Key, out newSummaryLastModified))
                            {
                                return false;
                            }
                            return newSummaryLastModified > summary.Value;
                        })
                        .Select(summary => summary.Key),
                    _keyComparer
                );
                var keysToAdd = new HashSet<TKey>(
                    newIndexSummary.Keys.Except(_sourceDataSummary.Keys),
                    _keyComparer
                );
                if (!keysToAdd.Any() && !keysToRemove.Any() && !keysToUpdate.Any())
                {
                    // If there are no actions to take then don't do any work!
                    return _index;
                }

                // Prepare the new data to insert
                var indexDataToUpdate = _indexGenerator.Generate(
                    values
                        .Where(value =>
                        {
                            var key = _sourceItemStatusRetriever(value).Key;
                            return keysToUpdate.Contains(key) || keysToAdd.Contains(key);
                        })
                        .ToNonNullImmutableList()
                );

                // Update the index content by removing keys and combining in the newly
                // generated content
                _index = _index
                    .Remove(key => keysToRemove.Contains(key) || keysToUpdate.Contains(key))
                    .Combine(
                        new[] { indexDataToUpdate }.ToNonNullImmutableList(),
                        _weightedEntryCombiner
                    );

                // All that's left is to update the source data summary and return the
                // new data!
                _sourceDataSummary = newIndexSummary;
                return _index;
            }
        }
    }
}

In case this class needs to be used in a multi-threaded environment, a write-lock is taken for any calls to UpdateIndex but reads of the Index property require no locks at all since the IndexData is an immutable structure! (This includes the case where index data may be cached in memory and shared between web requests, which is an implicit multi-threading scenario rather than an explicit situation where you may be dealing with Threads / ThreadPools / whatever yourself).

(If we were nitpicking then we could be concerned that there's no way to ensure that the TKey type is immutable and so the weighted entries described by the IndexData could feasibly change - but in this case they're ints, so there's nothing to worry about, and in other cases I would strongly suggest an immutable type be used for TKey at all times. Next time I'm going to cover more complex TKey possibilities in The Full Text Indexer - Going International!).

Update (17th December 2012): This has been included as part of a later Full Text Indexer Round-up Post that brings together several Posts into one series, incorporating code and techniques from each of them.

Posted at 21:37

The Full Text Indexer

I started out on a journey a few months ago being frustrated by the Lucene.net integration we had with one of our products at work (I'm not badmouthing the Lucene project, I'm wholeheartedly blaming the integration I inherited!) and wondering just how difficult it would be to write a Full Text Indexer which could analyse generic content and generate some form of structure which could look up strings and assign weights to the source material, offering the best matches.

And now I've got it to the point that I've tried the resulting solution out in a variety of configurations and am using it to handle searches on this blog, incorporating autocomplete functionality (that may or may not benefit from some more tweaking yet) to go with it. I'm quite proud of it!

Before I say any more, this was written to deal with the search tasks I needed and as such is not a direct replacement for Lucene, it's just an alternative I wanted to explore (for example I know that Lucene makes big claims about the number of documents it can maintain, I'm in no position right now to make any sorts of boasts on that scale!).

Here's a really basic example that would analyse data from:

public interface IArticle 
{
    int Id { get; }

    /// <summary>
    /// This will never be null or blank
    /// </summary>
    string Title { get; }

    /// <summary>
    /// This will never be null or blank
    /// </summary>
    string Content { get; }
}

and generate an IIndexData<int> instance which has this method (among others, but this is all we need for this first example):

public interface IIndexData<TKey>
{
    /// <summary>
    /// This will throw an exception for null or blank input. It will never return null.
    /// If there are no matches then an empty list will be returned.
    /// </summary>
    NonNullImmutableList<WeightedEntry<TKey>> GetMatches(string source);
}

by defining "Content Retrievers" (which extract sections of keyed content; meaning content that is associated with a Key that represents each source data item), a "Key Comparer" (which defines which keyed content instances belong to the same data item in order to group results together), a "Token Breaker" (which reduces content strings into individual words), a "String Normaliser" (which compares individual words in order to group them together but will also be used to compare values passed to the GetMatches method shown above) and "Weighted Entry Combiner" (which describes how tokens that appear multiple times for the same data item should have their weights combined):

using System;
using System.Collections.Generic;
using System.Linq;
using Common.Lists;
using Common.Logging;
using FullTextIndexer.Indexes;
using FullTextIndexer.Indexes.TernarySearchTree;
using FullTextIndexer.IndexGenerators;
using FullTextIndexer.TokenBreaking;

namespace FullTextIndexerDemo
{
    public class Example
    {
        public IIndexData<int> GetIndex(NonNullImmutableList<IArticle> articles)
        {
            if (articles == null)
                throw new ArgumentNullException("articles");

            return GetIndexGenerator().Generate(articles);
        }

        private IIndexGenerator<IArticle, int> GetIndexGenerator()
        {
            var contentRetrievers = new List<ContentRetriever<IArticle, int>>
            {
                new ContentRetriever<IArticle, int>(
                    article => new PreBrokenContent<int>(article.Id, article.Title),
                    token => 5f
                ),
                new ContentRetriever<IArticle, int>(
                    article => new PreBrokenContent<int>(article.Id, article.Content),
                    token => 1f
                )
            };

            return new IndexGenerator<IArticle, int>(
                contentRetrievers.ToNonNullImmutableList(),
                new IntEqualityComparer(),
                new DefaultStringNormaliser(),
                new WhiteSpaceTokenBreaker(),
                weightedValues => weightedValues.Sum(),
                new NullLogger()
            );
        }

        private class IntEqualityComparer : IEqualityComparer<int>
        {
            public bool Equals(int x, int y) { return (x == y); }
            public int GetHashCode(int obj) { return obj; }
        }
    }
}

That was a lot of jargon that took more work to write than the code example! :)

Content Retrievers

These describe two simple things: a method to extract a particular content string from an item (along with a Key for that item) and a method to assign a weight to each token that is extracted from that content after it's been passed through a Token Breaker. In this example, more weight is given to "tokens" (for the time being we can take this to refer to a single word) matched in the Article Title than in the Article Content. Each Content Retriever can return null for a given Article if there is no content to retrieve for that instance - eg. if IArticle had an optional CategoryName property then a Content Retriever might return null for an instance that has no Category assigned to it.

Key Comparer

Here, the Key uniquely representing each Article is the Id value so the Key Comparer for this example is just an IEqualityComparer<int> implementation that compares integers - easy.

Token Breakers

This example uses a "WhiteSpaceTokenBreaker" which will take the string content from the Content Retrievers and break it into individual words by splitting on whitespace characters. Straightforward!
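
To make that concrete, here's a minimal sketch of a whitespace-only breaker (this is illustrative, not the project's actual WhiteSpaceTokenBreaker source):

```csharp
using System;

public static class WhiteSpaceBreakingExample
{
    // Passing null as the separator array means "split on any whitespace
    // character" and RemoveEmptyEntries discards the blanks that consecutive
    // whitespace would otherwise produce
    public static string[] Break(string content)
    {
        if (content == null)
            throw new ArgumentNullException("content");
        return content.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

So "penguins  and giraffes" would come back as the three tokens "penguins", "and" and "giraffes".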

String Normaliser

The String Normaliser is essentially an IEqualityComparer<string> and will be used to generate a lookup of tokens and compare them against values passed into the GetMatches method. The DefaultStringNormaliser will remove all punctuation, exchange all non-latin characters for latin equivalents and lower-case them all. For the most basic lookups I had in mind, this does the hard work.
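
As a rough illustration of the sort of work that's involved (this cut-down sketch only trims, lower-cases and strips punctuation; the real DefaultStringNormaliser also swaps accented characters for latin equivalents):

```csharp
using System;
using System.Text;

public static class NormaliserExample
{
    public static string Normalise(string value)
    {
        if (value == null)
            throw new ArgumentNullException("value");
        var content = new StringBuilder();
        foreach (var character in value.Trim().ToLower())
        {
            // Drop punctuation so that "giraffe," and "giraffe" compare as equal
            if (!char.IsPunctuation(character))
                content.Append(character);
        }
        return content.ToString();
    }
}
```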

Weighted Entry

The Weighted Entry is a combination of a Key that identifies a data item and a numeric weight indicating the quality of the match; always positive and the higher the better.

Weighted Entry Combiner

This takes a set of match weights for a given token and must return a single value representing their combined total. In the example here I've just taken a sum of them, so if an Article has the word "giraffe" once in the Title and three times in the Content and "giraffe" was searched for, then match weights 5, 1, 1, 1 would be combined into 8 but it may be equally valid to take the maximum weight instead of considering Articles to be a better match if they have the same word more times (in which case "weightedValues => weightedValues.Max()" would be specified).
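
Expressed as delegates (a sketch using the IndexGenerator.WeightedEntryCombiner delegate type that appears in the IndexGenerator constructor shown earlier), the two approaches look like this:

```csharp
// Sum the weights: "giraffe" once in the Title (5) and three times in the
// Content (1 each) combines to 8
IndexGenerator.WeightedEntryCombiner sumWeights =
    weightedValues => weightedValues.Sum();

// Or take the best single match, so repetition doesn't increase the weight
IndexGenerator.WeightedEntryCombiner maxWeight =
    weightedValues => weightedValues.Max();
```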

The Logger

While the index is being generated, status messages may be logged such as "Work completed: x%". This example ignores all log messages by passing a NullLogger to the index generator.

Customisation / Variations

This is a very basic example but it can be extended easily to handle other requirements or data structures. In general the Content Retrievers and Key Comparer are altered to deal with different input data while the Token Breakers, String Normaliser and Weighted Entry Combiner are varied to process that extracted data in a different manner.

The "English Language Plurality String Normaliser (link to Bitbucket file)" (which I've gone on about at considerable length in previous posts) could replace the DefaultStringNormaliser (or wrap it, since it will take an "optionalPreNormaliser" as a constructor argument) so that the token matching is more flexible; searching for "giraffes" would now match an Article that included the word "giraffe" in the Title and/or Content even if it didn't also include "giraffes".

Likewise, the WhiteSpaceTokenBreaker could be replaced with an alternative implementation that also breaks on commas (for content that doesn't also follow commas with spaces) or on the various bracket characters (especially useful for breaking content that includes code samples; so that "List<string>" is broken down into "List" and "string"). This can be done with the "WhiteSpaceExtendingTokenBreaker" class which replaces a fixed (but customisable) set of characters with spaces and then passes off processing to another Token Breaker (usually a WhiteSpaceTokenBreaker) to handle the altered content.
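
The precise constructor signature isn't shown in this post, so treat the following as a sketch of the idea rather than exact code, but wiring it up would look something along these lines:

```csharp
// Sketch only - the argument types here are assumptions; the principle is that
// a set of characters to treat as whitespace is handed to the extending breaker
// along with an inner breaker that will process the altered content
var tokenBreaker = new WhiteSpaceExtendingTokenBreaker(
    new[] { '<', '>', '[', ']', '(', ')', '{', '}', ',' },
    new WhiteSpaceTokenBreaker()
);
```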

Multi-word Matching

With the above configuration, only single words would yield results when GetMatches is called on the index data, since all of the content is broken into single words and so any multi-word "source" strings would fail to be matched without additional processing. For cases where the order of the words in a multi-word term is not important, there is an IIndexData&lt;TKey&gt; extension method:

/// <summary>
/// This will break a given source string and return results based upon the combination of
/// partial matches (so results that only match part of the source string may be included
/// in the returned data). The token breaker and the match combiner must be specified by the
/// caller - if the match combiner returns zero then the result will not be included in the
/// final data. To require that all tokens in the source content be present for any returned
/// results, the following matchCombiner could be specified:
/// 
///  (tokenMatches, allTokens) =>
///    (tokenMatches.Count < allTokens.Count)
///      ? 0 : tokenMatches.SelectMany(m => m.Weights).Sum()
/// 
/// </summary>
public static NonNullImmutableList<WeightedEntry<TKey>> GetPartialMatches<TKey>(
    this IIndexData<TKey> index,
    string source,
    ITokenBreaker tokenBreaker,
    MatchCombiner matchCombiner
)

If this is called with the same Token Breaker as used by the index generator then the multi-word search term can be split up in the same manner and each resulting token searched for in the index. A combined weight must then be determined for the matched tokens; this calculation is handled by the specified MatchCombiner. I won't go into too much detail about it here - I may do another time, or you can look at the code for the nitty gritty (there's a Bitbucket link at the bottom of this post). I think the most common case is that illustrated in the summary comment in the code above: all tokens that result from breaking the search term must be matched in order for results to be considered valid, and the combined weight of valid matches is taken by summing the weights of all of the component matches.
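
So a call that requires every word of a two-word search to be present might look like this (the variable names are illustrative; the matchCombiner lambda is the one from the summary comment above):

```csharp
var matches = index.GetPartialMatches(
    "penguin giraffe",
    new WhiteSpaceTokenBreaker(),
    (tokenMatches, allTokens) =>
        (tokenMatches.Count < allTokens.Count)
            ? 0 : tokenMatches.SelectMany(m => m.Weights).Sum()
);
```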

Still to come..

This has still only touched on a simple use case. I'm hoping to cover in future posts how an index could handle multi-lingual content, how it could handle multi-word matching that increases the match weight when tokens that are adjacent in the search term appear adjacent in the source data, how the index can be updated or have items added and removed, how the AutoComplete on this blog is generated and how the term highlighting on the search page works! Who knows, I may even go right off the deep end and contemplate writing a search term parser that can perform searches on the index with quoted terms, boolean operators and who knows what! But that's all definitely for another day :)

The Code!

The code for this project is all available at Bitbucket: "The Full Text Indexer".

Update (17th December 2012): This has been included as part of a later Full Text Indexer Round-up Post that brings together several Posts into one series, incorporating code and techniques from each of them.

Posted at 20:33

Compiled LINQ Expressions don't serialise :(

In the last post (Optimising the Plurality-Handling Normaliser) I got all excited about improving the performance of what is essentially a string comparer for use in a Full Text Indexer I'm playing around with (which is now handling the search facilities on this blog, so it must be at least half way to working! :) but the use of compiled LINQ expressions brought problems of its own when I tried to write a fully-generated index away to a disk cache. The generated lambda expression is not serialisable!

There was something in me that thought that since it had been formed through simple LINQ Expressions tied together it would be easy to serialise and so not be a problem. But I suppose that once it becomes a generic lambda function then all bets are off, since they can have references to all sorts of things and so mightn't be easily serialisable anymore.

As usual there's a Stack Overflow post showing I'm hardly the first person to have encountered this issue, and this particular one even has the authority Eric Lippert getting involved! :) Interesting to see him make the point that this was work that was required with all of the "LINQ-to-whatever" integrations..

Stack Overflow: How can I pass a lambda expression to a WCF Service?
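
A minimal repro of the failure (a sketch; the point is just that the compiled lambda is a delegate over a dynamic method, which the BinaryFormatter refuses to touch):

```csharp
using System;
using System.IO;
using System.Linq.Expressions;
using System.Runtime.Serialization.Formatters.Binary;

class CompiledExpressionSerialisationExample
{
    static void Main()
    {
        Expression<Func<string, string>> expression = value => value.Trim().ToLower();
        var compiled = expression.Compile();
        using (var stream = new MemoryStream())
        {
            // Throws a SerializationException - compiled expressions are backed
            // by dynamic methods and so can't be serialised
            new BinaryFormatter().Serialize(stream, compiled);
        }
    }
}
```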

A solution.. for now

I essentially just want to write a custom string-keyed dictionary object (with a particular key comparer) to disk, to act as another level of caching for when it drops out of memory, so I didn't have anything as complicated as passing expressions to a query service to worry about. I went for a relatively simple approach: I record as class members all of the data that is required to generate the LINQ Expressions, so that I can implement ISerializable and write away just this data when an instance is serialised. Then, when it's de-serialised, I use this data to regenerate the lambdas.

// This de-serialising constructor takes the values that are stored in the GetObjectData
// method and passes them through to the standard public constructor
protected EnglishPluralityStringNormaliser(SerializationInfo info, StreamingContext context)
    : this(
        (IEnumerable<PluralEntry>)info.GetValue(
            "_plurals",
            typeof(IEnumerable<PluralEntry>)
        ),
        (IEnumerable<string>)info.GetValue(
            "_fallbackSuffixes",
            typeof(IEnumerable<string>)
        ),
        (IStringNormaliser)info.GetValue(
            "_optionalPreNormaliser",
            typeof(IStringNormaliser)
        ),
        (PreNormaliserWorkOptions)info.GetValue(
            "_preNormaliserWork",
            typeof(PreNormaliserWorkOptions)
        )
    ) { }

public void GetObjectData(SerializationInfo info, StreamingContext context)
{
    // Unfortunately we can't serialise the generated normaliser (we'll get a "Cannot
    // serialize delegates over unmanaged function pointers, dynamic methods or methods
    // outside the delegate creator's assembly" error) so if we have to serialise this
    // instance we'll store all of the data and then re-generate the normaliser on
    // de-serialisation. Not ideal from a performance point of view but at least
    // it will work.
    info.AddValue("_plurals", _plurals);
    info.AddValue("_fallbackSuffixes", _fallbackSuffixes);
    info.AddValue("_optionalPreNormaliser", _optionalPreNormaliser);
    info.AddValue("_preNormaliserWork", _preNormaliserWork);
}

However..

Then I tried integrating the project as a search facility into my blog which is running ASP.Net MVC 3 (.Net 4.0) and ran into another snag; "Inheritance security rules violated while overriding member: MyBusinessException.GetObjectData(System.Runtime.Serialization.SerializationInfo, System.Runtime.Serialization.StreamingContext)'. Security accessibility of the overriding method must match the security accessibility of the method being overriden."

Hmmm...

Stack Overflow to the rescue again! Inheritance security rules violated with overriding member. Reading the post led me to click "Use Definition" on ISerializable and observe that the "SecurityCritical" attribute was marked on the GetObjectData method - and from what I understand of what I read, I should be able to fix this by marking that attribute on my GetObjectData method. Sortio!

Not sortio.. :( And now I must admit to being a bit lazy in my eagerness to get the search functionality integrated on the site. One of the Stack Overflow answers was to specify "Full Trust" for the web application but I got the impression that this was cheating a bit and bypassing some of the new .Net 4.0 security mechanisms. However, for now I've gone with it by adding this to the web.config (as per one of the posted answers):

<system.web>
    <trust level="Full" />
</system.web>

and now it is working! Still, something to look into further another day I think.

For the curious

The project I've been talking about is publicly accessible at BitBucket but I'm yet to sort out a decent Readme for it and I'm hoping to write some posts about its development, its use so far and a range of examples - watch this space! Full Text Indexer BitBucket repo

Update (17th December 2012): This has been included as part of a later Full Text Indexer Round-up Post that brings together several Posts into one series, incorporating code and techniques from each of them.

Posted at 16:01

A Plurality-Handling Normaliser Correction

A slight amendment to a previous post (An English-language Plurality-handling String Normaliser); the original intention of IStringNormaliser implementations was to expose IEqualityComparer&lt;string&gt; and return a value from GetNormalisedString that could be maintained in a dictionary (specifically a Ternary Search Tree). And at this the English Language Plurality-handling String Normaliser has operated splendidly! Does the string "cat" match the string "cats" when searching? Why, yes it does! (And things like this are important to know since the internet is made of cats).

However.. since the normalised values are maintained as strings in a dictionary, they may be returned as normalised values and then compared against a string that would be normalised to that value; so "cat" may be transformed through GetNormalisedString and then compared again to "cat" - and this is where things go awry. "cat" and "cats" are found to match as they are both normalised to "cat|s|es|ses" (as covered in the earlier post I linked to), and the problem is that "cat|s|es|ses" does not get matched to "cat" (or "cats" for that matter) as it incorrectly gets transformed again to "cat|s|es|se|s|es|ses" - the final "s" gets matched as a potential suffix and the value gets altered.

Thankfully, the fix is none too difficult; before trying to perform transformations based upon value endings, we need to check whether a suffix group has already been appended to the value. So before checking whether a value ends with "s", "es" or "ses" we need to check whether it ends with "|s|es|ses" - and if so, return it as pre-transformed.

The method that requires changing is that below:

private static Func<string, string> GenerateNormaliser(
    IEnumerable<PluralEntry> plurals,
    IEnumerable<string> fallbackSuffixes)
{
    if (plurals == null)
        throw new ArgumentNullException("plurals");
    if (fallbackSuffixes == null)
        throw new ArgumentNullException("fallbackSuffixes");

    // Build up if statements for each suffix - if a match is found, return the input value
    // with the matched suffix replaced with a combination of all the other suffixes in
    // PluralEntry
    var result = Expression.Parameter(typeof(string), "result");
    var endLabel = Expression.Label(typeof(string));
    var valueTrimmed = Expression.Parameter(typeof(string), "valueTrimmed");
    var expressions = new List<Expression>();
    var pluralsTidied = new List<PluralEntry>();
    foreach (var plural in plurals)
    {
        if (plural == null)
            throw new ArgumentException("Null reference encountered in plurals set");
        pluralsTidied.Add(plural);
    }
    foreach (var plural in pluralsTidied)
    {
        // Before checking for suffix matches we need to check whether the input string
        // is a value that has already been through the normalisation process! eg. "category"
        // and "categories" will both be transformed into the value "categor|y|ies", but if
        // that value is passed in again it should leave as "categor|y|ies" and not have
        // any further attempts at normalisation applied to it.
        expressions.Add(
            Expression.IfThen(
                GeneratePredicate(
                    CreateSuffixExtension(plural.Values),
                    valueTrimmed,
                    plural.MatchType
                ),
                Expression.Block(
                    Expression.Assign(
                        result,
                        valueTrimmed
                    ),
                    Expression.Return(endLabel, result)
                )
            )
        );
    }
    foreach (var plural in pluralsTidied)
    {
        foreach (var suffix in plural.Values)
        {
            var assignNormalisedValueToResult = Expression.Assign(
                result,
                GenerateStringConcatExpression(
                    GenerateRemoveLastCharactersExpression(valueTrimmed, suffix.Length),
                    Expression.Constant(
                        CreateSuffixExtension(plural.Values),
                        typeof(string)
                    )
                )
            );
            expressions.Add(
                Expression.IfThen(
                    GeneratePredicate(suffix, valueTrimmed, plural.MatchType),
                    Expression.Block(
                        assignNormalisedValueToResult,
                        Expression.Return(endLabel, result)
                    )
                )
            );
        }
    }

    // If any fallback suffixes are specified, add a statement to append them if none of the
    // PluralEntry matches are made
    fallbackSuffixes = TidyStringList(fallbackSuffixes, v => v.Trim().ToLower());
    if (fallbackSuffixes.Any())
    {
        expressions.Add(
            Expression.Assign(
                result,
                GenerateStringConcatExpression(
                    valueTrimmed,
                    Expression.Constant(
                        CreateSuffixExtension(fallbackSuffixes),
                        typeof(string)
                    )
                )
            )
        );
    }
    else
        expressions.Add(Expression.Assign(result, valueTrimmed));

    // Add the return-point label, configured to return the string value in "result"
    expressions.Add(Expression.Label(endLabel, result));

    return Expression.Lambda<Func<string, string>>(
        Expression.Block(
            new[] { result },
            expressions
        ),
        valueTrimmed
    ).Compile();
}

In the places I'm using this, the plurality-handling normaliser wraps another normaliser that trims the string, lower-cases it, removes any punctuation and replaces any non-latin characters with latin equivalents. This is no problem. But if a normaliser was wrapped that removed any non-alpha characters completely then the method above wouldn't be able to match the transformed "|s|es|ses" ending as the pipe characters would have been removed. So long as this situation is avoided then everything will work lovely, but it's worth bearing in mind!

Update (17th December 2012): This has been included as part of a later Full Text Indexer Round-up Post that brings together several Posts into one series, incorporating code and techniques from each of them.

Posted at 17:53

Comments

Optimising the Plurality-Handling Normaliser

In the last post I took a first stab at an English-language Plurality-handling String Normaliser to work with a Full Text Indexer research project I'm playing with. I've been using it in conjunction with a normaliser that strips out punctuation from words, replaces accented characters with latin versions (eg. "é" replaced with "e") and lower-cases the content. This originally used a few regular expressions along with some character replacement but I wasn't delighted with the performance, so it was re-written to make a single pass through the characters and do it all at once, and is now running much better. Whether that ties back into my I don't like regular expressions ramble well or not I'm not sure :)

The point at which it does the most work is here:

public bool TryToTransform(string value, out string valueTransformed)
{
    if (value == null)
        throw new ArgumentNullException("value");

    if (_matchType == MatchTypeOptions.SuffixOnly)
    {
        var matchedSuffix = _values.FirstOrDefault(
            suffix => (value.Length > suffix.Length) && value.EndsWith(suffix)
        );
        if (matchedSuffix != null)
        {
            valueTransformed =
                value.Substring(0, value.Length - matchedSuffix.Length) + _combinedValues;
            return true;
        }
    }
    else
    {
        if (_values.Contains(value))
        {
            valueTransformed = _combinedValues;
            return true;
        }
    }

    valueTransformed = null;
    return false;
}

The first approach I took was to take out some of the LINQ - it makes it easy to read but I don't actually know for sure what it's doing! I thought it was worth checking whether being more explicit about what I want to do would reap any performance benefit. So the lengths are explicitly checked first (if matching WholeWord then the value length must match the suffix length, if matching SuffixOnly then the value length must be greater than the suffix length) and then I count back through the last characters of the input value, ensuring that each one matches the suffix. Not quite as readable but no big deal.

public bool TryToTransform(string value, out string valueTransformed)
{
    if (value == null)
        throw new ArgumentNullException("value");

    foreach (var suffix in _values)
    {
        if (_matchType == MatchTypeOptions.WholeWord)
        {
            if (value.Length != suffix.Length)
                continue;
        }
        else if (!(value.Length > suffix.Length))
            continue;

        var matchedSuffixLength = 0;
        for (var index = 0; index < suffix.Length; index++)
        {
            if (value[value.Length - (index + 1)] != suffix[suffix.Length - (index + 1)])
            {
                matchedSuffixLength = 0;
                break;
            }
            matchedSuffixLength++;
        }
        if (matchedSuffixLength == 0)
            continue;

        valueTransformed =
            value.Substring(0, value.Length - matchedSuffixLength) + _combinedValues;
        return true;
    }

    valueTransformed = null;
    return false;
}

Running a few loops of the All English Words data I mentioned in the last post saw a performance improvement (when run in release mode) of over 3x - success!
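For what it's worth, the timings here came from nothing fancier than a Stopwatch loop. A rough sketch of the sort of harness involved (not the exact code used for the numbers above - the word list file name and pass count are assumptions, and any IStringNormaliser implementation could be dropped in):

```csharp
// Rough benchmarking sketch (illustrative only) - the file name and the
// number of passes are assumptions, not taken from the original measurements
using System;
using System.Diagnostics;
using System.IO;

class Benchmark
{
    static void Main()
    {
        var words = File.ReadAllLines("AllEnglishWords.txt");
        IStringNormaliser normaliser = new EnglishPluralityStringNormaliser();

        // One warm-up pass so that JIT compilation doesn't skew the timings
        foreach (var word in words)
            normaliser.GetNormalisedString(word);

        var timer = Stopwatch.StartNew();
        for (var pass = 0; pass < 5; pass++)
        {
            foreach (var word in words)
                normaliser.GetNormalisedString(word);
        }
        timer.Stop();
        Console.WriteLine("5 passes: " + timer.ElapsedMilliseconds + "ms");
    }
}
```

Run once in release mode against each implementation and compare the elapsed times; nothing more scientific than that.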

But with what I've been learning about LINQ Expressions over the last year or so (culminating in The artist previously known as the AutoMapper-By-Constructor and The CompilableTypeConverter BitBucket repository) I couldn't help wondering if writing code that would generate expressions that unrolled the comparison loop and pre-generated the combined suffix extensions might not be faster. The only way to find out is to try!

The idea is that it would effectively generate code along the lines of:

if ((value.length > 1)
&& (value[value.length - 1] == 'y'))
    return value.substring(0, value.length - 1) + "[y][ies]";

if ((value.length > 3)
&& (value[value.length - 3] == 'i')
&& (value[value.length - 2] == 'e')
&& (value[value.length - 1] == 's'))
    return value.substring(0, value.length - 3) + "[y][ies]";

for all of the various plurality suffixes but while still maintaining the ability to easily define new suffix sets. And so, without further ado, I ended up with this:

/// <summary>
/// This will match common strings where one is the plural and the other the singular version
/// of the same word. It is not intended to be perfect and may match a few false positives,
/// but it should catch most of the most common cases.
/// </summary>
[Serializable]
public class EnglishPluralityStringNormaliser : IStringNormaliser
{
    private Func<string, string> _normaliser;
    private IStringNormaliser _optionalPreNormaliser;
    private PreNormaliserWorkOptions _preNormaliserWork;
    public EnglishPluralityStringNormaliser(
        IEnumerable<PluralEntry> plurals,
        IEnumerable<string> fallbackSuffixes,
        IStringNormaliser optionalPreNormaliser,
        PreNormaliserWorkOptions preNormaliserWork)
    {
        if (plurals == null)
            throw new ArgumentNullException("plurals");
        if (fallbackSuffixes == null)
            throw new ArgumentNullException("fallbackSuffixes");
        var allPreNormaliserOptions = (PreNormaliserWorkOptions)0;
        foreach (PreNormaliserWorkOptions option in
            Enum.GetValues(typeof(PreNormaliserWorkOptions)))
        {
            allPreNormaliserOptions = allPreNormaliserOptions | option;
        }
        if ((preNormaliserWork & allPreNormaliserOptions) != preNormaliserWork)
            throw new ArgumentOutOfRangeException("preNormaliserWork");

        _normaliser = GenerateNormaliser(plurals, fallbackSuffixes);
        _optionalPreNormaliser = optionalPreNormaliser;
        _preNormaliserWork = preNormaliserWork;
    }

    public EnglishPluralityStringNormaliser(
        IStringNormaliser optionalPreNormaliser,
        PreNormaliserWorkOptions preNormaliserWork
    ) : this(DefaultPlurals, DefaultFallback, optionalPreNormaliser, preNormaliserWork) { }

    public EnglishPluralityStringNormaliser()
        : this(null, PreNormaliserWorkOptions.PreNormaliserDoesNothing) { }

    public string GetNormalisedString(string value)
    {
        if (value == null)
            throw new ArgumentNullException("value");

        // If an additional normaliser was specified in the constructor then process the
        // string with that first (eg. a normaliser that removes punctuation from values
        // may be beneficial depending upon the content that may be passed in)
        if (_optionalPreNormaliser != null)
            value = _optionalPreNormaliser.GetNormalisedString(value);

        if ((_preNormaliserWork & PreNormaliserWorkOptions.PreNormaliserTrims)
        != PreNormaliserWorkOptions.PreNormaliserTrims)
            value = value.Trim();
        if (value == "")
            return "";

        // We have to lower case the trimmed value since the suffixes are all stored as
        // lower case values
        if ((_preNormaliserWork & PreNormaliserWorkOptions.PreNormaliserLowerCases)
        != PreNormaliserWorkOptions.PreNormaliserLowerCases)
            value = value.ToLower();
        return _normaliser(value);
    }

    public bool Equals(string x, string y)
    {
        if (x == null)
            throw new ArgumentNullException("x");
        if (y == null)
            throw new ArgumentNullException("y");

        return GetNormalisedString(x) == GetNormalisedString(y);
    }

    public int GetHashCode(string obj)
    {
        if (obj == null)
            throw new ArgumentNullException("obj");

        return GetNormalisedString(obj).GetHashCode();
    }

    private static Func<string, string> GenerateNormaliser(
        IEnumerable<PluralEntry> plurals,
        IEnumerable<string> fallbackSuffixes)
    {
        if (plurals == null)
            throw new ArgumentNullException("plurals");
        if (fallbackSuffixes == null)
            throw new ArgumentNullException("fallbackSuffixes");

        // Build up if statements for each suffix - if a match is found, return the input
        // value with the matched suffix replaced with a combination of all the other
        // suffixes in PluralEntry
        var result = Expression.Parameter(typeof(string), "result");
        var endLabel = Expression.Label(typeof(string));
        var valueTrimmed = Expression.Parameter(typeof(string), "valueTrimmed");
        var expressions = new List<Expression>();
        foreach (var plural in plurals)
        {
            if (plural == null)
                throw new ArgumentException("Null reference encountered in plurals set");

            foreach (var suffix in plural.Values)
            {
                var assignNormalisedValueToResult = Expression.Assign(
                    result,
                    GenerateStringConcatExpression(
                        GenerateRemoveLastCharactersExpression(valueTrimmed, suffix.Length),
                        Expression.Constant(
                            CreateSuffixExtension(plural.Values),
                            typeof(string)
                        )
                    )
                );
                expressions.Add(
                    Expression.IfThen(
                        GeneratePredicate(suffix, valueTrimmed, plural.MatchType),
                        Expression.Block(
                            assignNormalisedValueToResult,
                            Expression.Return(endLabel, result)
                        )
                    )
                );
            }
        }

        // If any fallback suffixes are specified, add a statement to append them if none
        // of the PluralEntry matches are made
        fallbackSuffixes = TidyStringList(fallbackSuffixes, v => v.Trim().ToLower());
        if (fallbackSuffixes.Any())
        {
            expressions.Add(
                Expression.Assign(
                    result,
                    GenerateStringConcatExpression(
                        valueTrimmed,
                        Expression.Constant(
                            CreateSuffixExtension(fallbackSuffixes),
                            typeof(string)
                        )
                    )
                )
            );
        }
        else
            expressions.Add(Expression.Assign(result, valueTrimmed));

        // Add the return-point label, configured to return the string value in "result"
        expressions.Add(Expression.Label(endLabel, result));

        return Expression.Lambda<Func<string, string>>(
            Expression.Block(
                new[] { result },
                expressions
            ),
            valueTrimmed
        ).Compile();
    }

    /// <summary>
    /// Generate an expression that determines whether a string parameter matches a specified
    /// suffix / matchType combination
    /// </summary>
    private static Expression GeneratePredicate(
        string suffix,
        ParameterExpression valueTrimmed,
        MatchTypeOptions matchType)
    {
        if (string.IsNullOrWhiteSpace(suffix))
            throw new ArgumentException("Null/blank suffix specified");
        if (valueTrimmed == null)
            throw new ArgumentNullException("valueTrimmed");
        if (!Enum.IsDefined(typeof(MatchTypeOptions), matchType))
            throw new ArgumentOutOfRangeException("matchType");

        suffix = suffix.Trim();

        var conditionElements = new List<Expression>();
        var lengthProperty = typeof(string).GetProperty("Length");
        var indexedProperty = typeof(string).GetProperties().First(
            p => (p.GetIndexParameters() ?? new ParameterInfo[0]).Any()
        );
        if (matchType == MatchTypeOptions.SuffixOnly)
        {
            conditionElements.Add(
                Expression.GreaterThan(
                    Expression.Property(valueTrimmed, lengthProperty),
                    Expression.Constant(suffix.Length, typeof(int))
                )
            );
        }
        else
        {
            conditionElements.Add(
                Expression.Equal(
                    Expression.Property(valueTrimmed, lengthProperty),
                    Expression.Constant(suffix.Length, typeof(int))
                )
            );
        }
        for (var index = 0; index < suffix.Length; index++)
        {
            conditionElements.Add(
                Expression.Equal(
                    Expression.Constant(suffix[index], typeof(char)),
                    Expression.Property(
                        valueTrimmed,
                        indexedProperty,
                        Expression.Subtract(
                            Expression.Property(valueTrimmed, lengthProperty),
                            Expression.Constant(suffix.Length - index, typeof(int))
                        )
                    )
                )
            );
        }
        return CombineExpressionsWithAndAlso(conditionElements);
    }

    private static Expression CombineExpressionsWithAndAlso(
        IEnumerable<Expression> expressions)
    {
        if (expressions == null)
            throw new ArgumentNullException("expressions");

        var expressionsTidied = new List<Expression>();
        foreach (var expression in expressions)
        {
            if (expression == null)
                throw new ArgumentException("Null reference encountered in expressions set");
            expressionsTidied.Add(expression);
        }
        if (!expressionsTidied.Any())
            throw new ArgumentException("No entries in expressions set");
        else if (expressionsTidied.Count == 1)
            return expressionsTidied[0];

        var reducedExpressions = new List<Expression>();
        for (var index = 0; index < expressionsTidied.Count; index += 2)
        {
            var expression = expressionsTidied[index];
            if (index < (expressionsTidied.Count - 1))
            {
                var expressionNext = expressionsTidied[index + 1];
                reducedExpressions.Add(Expression.AndAlso(expression, expressionNext));
            }
            else
                reducedExpressions.Add(expression);
        }

        return (reducedExpressions.Count == 1)
            ? reducedExpressions[0]
            : CombineExpressionsWithAndAlso(reducedExpressions);
    }

    /// <summary>
    /// The value Expression must represent a non-null string that is at least as long as
    /// the specified length or an exception will be thrown upon execution
    /// </summary>
    private static Expression GenerateRemoveLastCharactersExpression(
        Expression value,
        int length)
    {
        if (value == null)
            throw new ArgumentNullException("value");
        if (length < 0)
            throw new ArgumentOutOfRangeException("length");

        return Expression.Call(
            value,
            typeof(string).GetMethod("Substring", new[] { typeof(int), typeof(int) }),
            Expression.Constant(0),
            Expression.Subtract(
                Expression.Property(value, typeof(string).GetProperty("Length")),
                Expression.Constant(length, typeof(int))
            )
        );
    }

    /// <summary>
    /// The values Expressions must represent strings otherwise the expression will fail when
    /// executed
    /// </summary>
    private static Expression GenerateStringConcatExpression(params Expression[] values)
    {
        if (values == null)
            throw new ArgumentNullException("values");

        var valuesTidied = values.ToList();
        if (!valuesTidied.Any())
            throw new ArgumentException("No entries in values set");
        if (valuesTidied.Any(v => v == null))
            throw new ArgumentException("Null reference encountered in values set");

        return Expression.Call(
            typeof(string).GetMethod("Concat", new[] { typeof(string[]) }),
            Expression.NewArrayInit(
                typeof(string),
                valuesTidied
            )
        );
    }

    private static string CreateSuffixExtension(IEnumerable<string> suffixes)
    {
        if (suffixes == null)
            throw new ArgumentNullException("suffixes");

        var suffixesTidied = suffixes.ToList();
        if (!suffixesTidied.Any())
            throw new ArgumentException("No entries in suffixes set");
        if (suffixesTidied.Any(s => string.IsNullOrWhiteSpace(s)))
            throw new ArgumentException("Null/blank entry encountered in suffixes set");

        return "|" + string.Join("|", suffixesTidied.Select(s => s.Trim()));
    }

    /// <summary>
    /// Given a set of values, ensure that none are null and return them de-duplicated after
    /// having been pushed through a string manipulation. This will throw an exception for
    /// null arguments or if any null value is encountered in the values set.
    /// </summary>
    private static IEnumerable<string> TidyStringList(
        IEnumerable<string> values,
        Func<string, string> transformer)
    {
        if (values == null)
            throw new ArgumentNullException("values");
        if (transformer == null)
            throw new ArgumentNullException("transformer");

        var valuesTidied = new List<string>();
        foreach (var value in values)
        {
            if (value == null)
                throw new ArgumentException("Null entry encountered in values");

            var valueToStore = transformer(value);
            if (!valuesTidied.Contains(valueToStore))
                valuesTidied.Add(valueToStore);
        }
        return valuesTidied.Distinct();
    }

    public readonly static IEnumerable<string> DefaultFallback = new[] { "ses", "es", "s" };
    public readonly static PluralEntry[] DefaultPlurals = new[]
    {
        // eg. formula / formulae / formulas
        new PluralEntry(new[] { "ula", "ulae", "ulas" }, MatchTypeOptions.SuffixOnly),

        // eg. category / categories
        new PluralEntry(new[] { "y", "ies" }, MatchTypeOptions.SuffixOnly),

        // eg. cactus / cactii
        new PluralEntry(new[] { "us", "ii" }, MatchTypeOptions.SuffixOnly),

        // eg. child / children
        new PluralEntry(new[] { "ld", "ldren" }, MatchTypeOptions.SuffixOnly),

        // eg. medium / media
        new PluralEntry(new[] { "ium", "ia" }, MatchTypeOptions.SuffixOnly),

        // Common special cases that have to come before the "ses", "es", "s" form
        new PluralEntry(new[] { "index", "indexes", "indices" }, MatchTypeOptions.WholeWord),
        new PluralEntry(new[] { "matrix", "matrices" }, MatchTypeOptions.WholeWord),
        new PluralEntry(new[] { "vertex", "vertices" }, MatchTypeOptions.WholeWord),

        // eg. Abacuses, matching "s" here means we must use "ses", "es" AND "s" as fallbacks
        new PluralEntry(new[] { "ses", "es", "s" }, MatchTypeOptions.SuffixOnly),

        // Other common special cases
        new PluralEntry(new[] { "datum", "data" }, MatchTypeOptions.WholeWord),
        new PluralEntry(new[] { "man", "men" }, MatchTypeOptions.WholeWord),
        new PluralEntry(new[] { "woman", "women" }, MatchTypeOptions.WholeWord)
    };

    [Serializable]
    public class PluralEntry
    {
        public PluralEntry(IEnumerable<string> values, MatchTypeOptions matchType)
        {
            if (values == null)
                throw new ArgumentNullException("values");
            if (!Enum.IsDefined(typeof(MatchTypeOptions), matchType))
                throw new ArgumentOutOfRangeException("matchType");

            var valuesTidied = TidyStringList(values, v => v.Trim().ToLower());
            if (!valuesTidied.Any())
                throw new ArgumentException("No entries in values set");

            Values = valuesTidied.Distinct().ToList().AsReadOnly();
            MatchType = matchType;
        }

        /// <summary>
        /// This will never be null or an empty set, nor will it contain any null, empty or
        /// duplicate values (all values are lower-cased and trimmed)
        /// </summary>
        public ReadOnlyCollection<string> Values { get; private set; }

        public MatchTypeOptions MatchType { get; private set; }
    }

    [Serializable]
    public enum MatchTypeOptions
    {
        SuffixOnly,
        WholeWord
    }
}

It still takes me a fair while to craft the LINQ Expression generation code but I do think that, once written, the resulting code is actually fairly easy to follow. For each suffix in a "PluralEntry" (where the PluralEntry might describe the group "y", "ies" as a SuffixOnly extension - as clearly seen in the last post) a predicate is generated with LINQ Expressions that compares the input string's length and each of the characters that could correlate with the suffix string. Very similar to inside the loop of the first optimisation at the top of this post. An IfThen Expression will consider this predicate and - if matched - generate a result that removes the suffix from the input value and appends a combined string consisting of the suffix values in the group, before jumping to the end of the Expression block (effectively "returning" out of the block). Again, just like the setting of the valueTransformed string in the earlier code. If none of the suffix groups are found to match then it will append a default set of suffixes, so that "cat" is transformed to "cat|s|es|ses" in order to match "cats" which would also be transformed to "cat|s|es|ses", for example.
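By way of a quick sanity check, usage looks something like this (a hypothetical snippet; the parameterless constructor applies the DefaultPlurals and DefaultFallback sets shown above):

```csharp
// Hypothetical usage - "cats" has its "s" suffix matched and replaced with
// the combined suffix group while "cat" has the fallback suffixes appended;
// both produce the same normalised key, so the normaliser treats them as equal
var normaliser = new EnglishPluralityStringNormaliser();
Console.WriteLine(
    normaliser.GetNormalisedString("cat") == normaliser.GetNormalisedString("cats")
); // True
Console.WriteLine(normaliser.Equals("Cat", "cats")); // True ("Cat" is lower-cased first)
```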

There are a couple of oddities in the code. I struggled for a while to find a nice way to access characters by index in a string, since the ArrayAccess Expressions can't be used - a string isn't technically an array of characters. Instead, you have to first use reflection to get hold of the indexed property of the string type; there's only one, so that must be the property we want to access! When comparing the string length and the individual characters, the Expressions are combined with the AndAlso Expression as this ensures that short-circuiting of the conditions is utilised - as soon as one condition is not met, any further ones are ignored.
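The indexed-property trick can be seen in isolation in a little stand-alone sketch (not from the original code, but it uses the same approach to build a "last character" accessor):

```csharp
// Stand-alone sketch of the indexed-property trick - a string's indexer is
// the only property on the type that takes index parameters
using System;
using System.Linq;
using System.Linq.Expressions;

class IndexerSketch
{
    static void Main()
    {
        var indexedProperty = typeof(string).GetProperties().First(
            p => p.GetIndexParameters().Any()
        );

        // Build value => value[value.Length - 1]
        var value = Expression.Parameter(typeof(string), "value");
        var lastChar = Expression.Lambda<Func<string, char>>(
            Expression.Property(
                value,
                indexedProperty,
                Expression.Subtract(
                    Expression.Property(value, typeof(string).GetProperty("Length")),
                    Expression.Constant(1, typeof(int))
                )
            ),
            value
        ).Compile();

        Console.WriteLine(lastChar("cats")); // 's'
    }
}
```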

This brought on another performance improvement of over 3x - success again!

Additional tweaks

There are a couple more minor optimisations in the new code that were made with knowledge of how I intended to integrate it. I was intending to use this "DefaultStringNormaliser" mentioned earlier that would trim, lower-case, remove punctuation and replace non-latin characters. This can be passed in as the "optionalPreNormaliser" constructor parameter and will process the string before the plurality normalisation is handled. However, if this "pre-normaliser" is going to trim and lower-case the input then there's no need for the EnglishPluralityStringNormaliser to do it as well! So there's a PreNormaliserWorkOptions enum that allows the instantiator to pass in hints as to whether the optionalPreNormaliser (if any) is going to do any of this work.
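So, if the wrapped normaliser is known to trim and lower-case, construction would look something like this (a hypothetical snippet - it assumes the PreNormaliserWorkOptions enum is a [Flags] enum accessible at this scope, since its declaration isn't shown in the listing above):

```csharp
// Hypothetical construction - the DefaultStringNormaliser already trims and
// lower-cases, so we hint that the plurality normaliser needn't repeat that work
var normaliser = new EnglishPluralityStringNormaliser(
    new DefaultStringNormaliser(),
    PreNormaliserWorkOptions.PreNormaliserTrims
    | PreNormaliserWorkOptions.PreNormaliserLowerCases
);
```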

Sending a few passes of the All English Words data through an EnglishPluralityStringNormaliser that wraps the DefaultStringNormaliser (code below) with PreNormaliserLowerCases and PreNormaliserTrims specified, compared to running it with PreNormaliserDoesNothing (which forces the Trim and ToLower calls to be made by the EnglishPluralityStringNormaliser even though the DefaultStringNormaliser has already done this work), resulted in a performance boost of over 1.4x. Not as dramatic, but definitely not to be sniffed at!

There's one final tweak to note; I've switched from appending the suffix groups as "[s][es][ses]" to "|s|es|ses" since I intend to store the resulting normalised strings in a Ternary Search Tree (as discussed in The .Net Dictionary is FAST!) and if the string is shorter then fewer comparisons have to be made when matching a string in that structure!

The "Default String Normaliser"

Since I've made reference a few times to the DefaultStringNormaliser which I've made use of, here's the current code:

/// <summary>
/// This will perform string comparisons where the values have any accented characters
/// replaced with non-accented versions, all whitespace converted to spaces and runs of
/// whitespace replaced with a single space, all punctuation removed and the content
/// then lower-cased.
/// </summary>
[Serializable]
public sealed class DefaultStringNormaliser : IStringNormaliser
{
    private readonly static HashSet<Char> PunctuationCharacters = new HashSet<char>(
        Enumerable.Range(char.MinValue, char.MaxValue - char.MinValue + 1)
            .Select(c => (char)c)
            .Where(c => char.IsPunctuation(c))
    );

    public string GetNormalisedString(string value)
    {
        if (value == null)
            throw new ArgumentNullException("value");

        var normalisedValue = value.Normalize(NormalizationForm.FormKD);
        var content = new char[normalisedValue.Length];
        var contentIndex = 0;
        var contentIndexOfLastNonWhitespace = 0;
        var lastCharWasWhitespace = false;
        var gotContent = false;
        for (var index = 0; index < normalisedValue.Length; index++)
        {
            var currentChar = normalisedValue[index];
            if (PunctuationCharacters.Contains(currentChar))
                continue;
            if ((currentChar == '\r') || (currentChar == '\n') || (currentChar == '\t'))
                currentChar = ' ';
            else
            {
                var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(currentChar);
                if ((unicodeCategory == UnicodeCategory.EnclosingMark)
                || (unicodeCategory == UnicodeCategory.NonSpacingMark)
                || (unicodeCategory == UnicodeCategory.SpacingCombiningMark))
                    currentChar = ' ';
            }
            if (currentChar == ' ')
            {
                if (!lastCharWasWhitespace && gotContent)
                {
                    content[contentIndex] = currentChar;
                    contentIndex++;
                    lastCharWasWhitespace = true;
                }
                continue;
            }
            if (!char.IsLower(currentChar))
                currentChar = char.ToLower(currentChar);
            content[contentIndex] = currentChar;
            contentIndex++;
            contentIndexOfLastNonWhitespace = contentIndex;
            lastCharWasWhitespace = false;
            gotContent = true;
        }
        return new string(content, 0, contentIndexOfLastNonWhitespace);
    }

    public bool Equals(string x, string y)
    {
        if (x == null)
            throw new ArgumentNullException("x");
        if (y == null)
            throw new ArgumentNullException("y");

        return GetNormalisedString(x) == GetNormalisedString(y);
    }

    public int GetHashCode(string obj)
    {
        if (obj == null)
            throw new ArgumentNullException("obj");

        return GetNormalisedString(obj).GetHashCode();
    }
}
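
And a quick (hypothetical) example of it in action:

```csharp
// Hypothetical usage - punctuation is stripped and accented characters are
// decomposed (FormKD) with their combining marks dropped, so these match
var normaliser = new DefaultStringNormaliser();
Console.WriteLine(normaliser.GetNormalisedString("Cats,")); // "cats"
Console.WriteLine(normaliser.Equals("Café", "cafe"));       // True
```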

Conclusion

I'm still a firm believer in writing the code to work and be easily understood and maintained first. But when a section of code is measurably a bottleneck, and that bottleneck is worth removing, then little adventures like this can be fun and beneficial! And, to be honest, I don't think the resulting code is that difficult to understand. There are probably a few more tweaks that could be made to really eke out some more performance but I'm perfectly happy with it for now :)

Update (17th December 2012): This has been included as part of a later Full Text Indexer Round-up Post that brings together several Posts into one series, incorporating code and techniques from each of them.

Posted at 00:04

Comments

An English-language Plurality-handling String Normaliser

As part of my mental masturbation investigation into putting together a Full Text Indexer (which inspired The .Net Dictionary is FAST!) I wanted to implement a string normaliser that would handle retrieval of strings that stem from the same origin but which are the plural version of the singular, or enable matching of the singular version when searching on the plural version. In other words, if I search for "cat" then I want results that include the word "cats" too!

I had a fairly good idea how I wanted to do it; just look for particular endings on words and exchange them with all combinations of suffixes for singular or plural versions of the word. So I want "cat" and "cats" to be normalised to "cat[s]". But then some words have "es" as the plural, like "fox" / "foxes". Some words have multiple variations, like "index" / "indices" / "indexes" (I'm not sure if the last version is for specialised uses - I've seen it used when talking about the plural of a database index or a stock index??). Some words don't even have variations, like "sheep"!

A visit to Wikipedia's English Plural page seemed like a natural starting point..

It turns out that there are a lot of irregular plurals in the English language! Not a surprise as such, just something I need to be aware of. But I'm not planning on a normaliser that is absolutely perfect - I want to avoid any false negatives (so I want to make sure that "index" and "indices" are found to match) but I'm not concerned if some false positives are present (I won't care that it doesn't realise that "sheeps" isn't a real word and I'm not going to lose any sleep if it thinks that "as" is the plural of "a").

And, without further ado, this brings us to the first proper go I've made at it!

/// <summary>
/// This will match common strings where one is the plural and the other the singular version
/// of the same word. It is not intended to be perfect and may match a few false positives,
/// but it should catch most of the most common cases.
/// </summary>
[Serializable]
public class EnglishPluralityStringNormaliser : IStringNormaliser
{
    /// <summary>
    /// This will never return null. If it returns an empty string then the string should
    /// not be considered eligible as a key. It will throw an exception for a null value.
    /// </summary>
    public string GetNormalisedString(string value)
    {
        if (value == null)
            throw new ArgumentNullException("value");

        value = value.Trim();
        if (value == "")
            return "";

        // Need to lower case the value since the suffix comparisons are all to lower case
        // characters
        value = value.ToLower();
        foreach (var matcher in Matchers)
        {
            string valueTransformed;
            if (matcher.TryToTransform(value, out valueTransformed))
                return valueTransformed;
        }

        // If no irregular suffixes match then append all of "ses", "es" and "s" to catch
        // other common cases (and ensure that we match anything that ends in "s" due to
        // the suffix set "ses", "es", "s" above - we need to ensure that "cat" is
        // transformed to "cat[ses][es][s]" in order to match "cats" which will get that
        // form applied above).
        return value + "[ses][es][s]";
    }

    public bool Equals(string x, string y)
    {
        if (x == null)
            throw new ArgumentNullException("x");
        if (y == null)
            throw new ArgumentNullException("y");

        return GetNormalisedString(x) == GetNormalisedString(y);
    }

    public int GetHashCode(string obj)
    {
        if (obj == null)
            throw new ArgumentNullException("obj");

        return GetNormalisedString(obj).GetHashCode();
    }

    private readonly static PluralEntry[] Matchers = new[]
    {
        // eg. index / indexes / indices
        new PluralEntry(new[] { "ex", "exes", "ices" }, MatchTypeOptions.SuffixOnly),

        // eg. formula / formulae / formulas
        new PluralEntry(new[] { "ula", "ulae", "ulas" }, MatchTypeOptions.SuffixOnly),

        // eg. category / categories
        new PluralEntry(new[] { "y", "ies" }, MatchTypeOptions.SuffixOnly),

        // eg. cactus / cactii
        new PluralEntry(new[] { "us", "ii" }, MatchTypeOptions.SuffixOnly),

        // eg. child / children
        new PluralEntry(new[] { "ld", "ldren" }, MatchTypeOptions.SuffixOnly),

        // eg. medium / media
        new PluralEntry(new[] { "ium", "ia" }, MatchTypeOptions.SuffixOnly),

        // eg. abacuses, matching "s" here means we must use "ses", "es" AND "s" as fallbacks
        // below
        new PluralEntry(new[] { "ses", "es", "s" }, MatchTypeOptions.SuffixOnly),

        // Common special cases
        new PluralEntry(new[] { "datum", "data" }, MatchTypeOptions.WholeWord),
        new PluralEntry(new[] { "man", "men" }, MatchTypeOptions.WholeWord),
        new PluralEntry(new[] { "woman", "women" }, MatchTypeOptions.WholeWord)
    };

    [Serializable]
    private class PluralEntry
    {
        private HashSet<string> _values;
        private string _combinedValues;
        private MatchTypeOptions _matchType;
        public PluralEntry(IEnumerable<string> values, MatchTypeOptions matchType)
        {
            if (values == null)
                throw new ArgumentNullException("values");
            if (!Enum.IsDefined(typeof(MatchTypeOptions), matchType))
                throw new ArgumentOutOfRangeException("matchType");

            var valuesTidied = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase);
            foreach (var value in values)
            {
                var valueTrimmed = (value ?? "").Trim();
                if (valueTrimmed == "")
                    throw new ArgumentException("Null/blank entry encountered in values");

                if (!valuesTidied.Contains(valueTrimmed))
                    valuesTidied.Add(valueTrimmed);
            }

            _values = valuesTidied;
            _combinedValues = "[" + string.Join("][", valuesTidied) + "]";
            _matchType = matchType;
        }

        public bool TryToTransform(string value, out string valueTransformed)
        {
            if (value == null)
                throw new ArgumentNullException("value");

            if (_matchType == MatchTypeOptions.SuffixOnly)
            {
                var matchedSuffix = _values.FirstOrDefault(
                    suffix => (value.Length > suffix.Length) && value.EndsWith(suffix)
                );
                if (matchedSuffix != null)
                {
                    valueTransformed =
                        value.Substring(0, value.Length - matchedSuffix.Length)
                        + _combinedValues;
                    return true;
                }
            }
            else
            {
                if (_values.Contains(value))
                {
                    valueTransformed = _combinedValues;
                    return true;
                }
            }
            valueTransformed = null;
            return false;
        }
    }

    [Serializable]
    public enum MatchTypeOptions
    {
        SuffixOnly,
        WholeWord
    }
}

public interface IStringNormaliser : IEqualityComparer<string>
{
    /// <summary>
    /// This will never return null. If it returns an empty string then the string should
    /// not be considered eligible as a key. It will throw an exception for a null value.
    /// </summary>
    string GetNormalisedString(string value);
}

It's a simple implementation: the TryToTransform method is easy to follow and the "Matchers" set offers an obvious expansion point, whether for adding more edge cases for other irregular plural suffixes or for tightening up any current values that turn out to be too general and to be producing false negatives that I want to avoid.

If no match is found, then the word is assumed to be a singular term and the common suffix group "ses", "es" and "s" is appended. The "es" and "s" values are there to match words such as "fishes" and "cats" but the "ses" requirement is less obvious; in order to match "abacuses" and "abacus" we need a single form that matches them both, such that "abacus" isn't thought to be the plural of "abacu" (not a real word!). This means that "cats" will be transformed to "cat[ses][es][s]" by that suffix group and so we require all three suffixes in the fallback so that "cat" is also transformed to "cat[ses][es][s]". Looking through that English words list, other words that this is required for include "abuses", "abscesses" and "addresses" (and those are only the first few common words that I came across starting with A!).
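
To illustrate the fallback behaviour, here's a quick usage snippet (it assumes the EnglishPluarityStringNormaliser class shown above is in scope):

```csharp
// "cats" matches the "ses" / "es" / "s" suffix group while "cat" falls through
// to the fallback - both end up as "cat[ses][es][s]" and so are considered equal
var normaliser = new EnglishPluarityStringNormaliser();
Console.WriteLine(normaliser.GetNormalisedString("cat"));   // "cat[ses][es][s]"
Console.WriteLine(normaliser.GetNormalisedString("cats"));  // "cat[ses][es][s]"

// "category" and "categories" both match the "y" / "ies" group ("categor[y][ies]")
Console.WriteLine(normaliser.Equals("category", "categories")); // True
```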

Follow-up

I've been playing around more with this and identified a few words that it doesn't handle correctly, starting with data from this All English Words page. Warning: it's slow to load and doesn't look like the most professional site in the world so I'm not sure that the link will continue to work forever! :S

One example that is broken with the above is that "accomplices" is matched with the "ex" / "exes" / "ices" suffix group which was intended for "index" / "indexes" / "indices" and so "accomplice" and "accomplices" are not found to match.

Also "matrix" / "matrices" and "vertex" / "vertices" don't match and trying to handle them in the same manner as the "index" group would introduce problems with words such as "prices" (if a "rix" / "rices" suffix group was used) and "latex" (if a "tex" / "trices" group was used). So all of these troublesome words have been implemented as WholeWord matches instead of trying to deal with them with a general form.

So the entry

// eg. index / indexes / indices
new PluralEntry(new[] { "ex", "exes", "ices" }, MatchTypeOptions.SuffixOnly),

is removed and more specific replacements are used; the list now becomes:

// eg. formula / formulae / formulas
new PluralEntry(new[] { "ula", "ulae", "ulas" }, MatchTypeOptions.SuffixOnly),

// eg. category / categories
new PluralEntry(new[] { "y", "ies" }, MatchTypeOptions.SuffixOnly),

// eg. cactus / cactii
new PluralEntry(new[] { "us", "ii" }, MatchTypeOptions.SuffixOnly),

// eg. child / children
new PluralEntry(new[] { "ld", "ldren" }, MatchTypeOptions.SuffixOnly),

// eg. medium / media
new PluralEntry(new[] { "ium", "ia" }, MatchTypeOptions.SuffixOnly),

// Common special cases that have to come before the "ses", "es", "s" form
new PluralEntry(new[] { "index", "indexes", "indices" }, MatchTypeOptions.WholeWord),
new PluralEntry(new[] { "matrix", "matrices" }, MatchTypeOptions.WholeWord),
new PluralEntry(new[] { "vertex", "vertices" }, MatchTypeOptions.WholeWord),

// eg. Abacuses, matching "s" here means we must use "ses", "es" AND "s" as fallbacks below
new PluralEntry(new[] { "ses", "es", "s" }, MatchTypeOptions.SuffixOnly),

// Other common special cases
new PluralEntry(new[] { "datum", "data" }, MatchTypeOptions.WholeWord),
new PluralEntry(new[] { "man", "men" }, MatchTypeOptions.WholeWord),
new PluralEntry(new[] { "woman", "women" }, MatchTypeOptions.WholeWord)

The WholeWord matches are located at the bottom of the Matchers set since presumably they will match fewer cases than the general forms before them and so (theoretically) more comparisons will be able to exit the TryToTransform loop earlier than if the WholeWord matches were further up the list.

More plans

There are other tweaks I'd still like to add to this - passing in an optional "pre-processing" string normaliser which would operate before the TryToTransform calls could be beneficial - eg. something to remove punctuation from words so that "cat's" would be matched to "cat" and "cats" without any other changes. It might also be useful to be able to specify the PluralEntry values and the fallback suffix group so that additional special cases could be supported.
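
As a rough sketch of the first of those ideas (the class name is mine and this isn't code from the repository), a wrapper that strips punctuation before delegating to an inner normaliser - such as the plurality handler above - might look like:

```csharp
using System;
using System.Linq;

// Hypothetical sketch: a "pre-processing" normaliser that removes punctuation
// before handing off to an inner IStringNormaliser so that "cat's" normalises
// the same way as "cat" and "cats"
public class PunctuationStrippingNormaliser : IStringNormaliser
{
    private readonly IStringNormaliser _inner;
    public PunctuationStrippingNormaliser(IStringNormaliser inner)
    {
        if (inner == null)
            throw new ArgumentNullException("inner");
        _inner = inner;
    }

    public string GetNormalisedString(string value)
    {
        if (value == null)
            throw new ArgumentNullException("value");

        // Drop any punctuation characters, then let the inner normaliser deal
        // with plurality (or whatever else it handles)
        return _inner.GetNormalisedString(
            new string(value.Where(c => !char.IsPunctuation(c)).ToArray())
        );
    }

    public bool Equals(string x, string y)
    {
        if (x == null)
            throw new ArgumentNullException("x");
        if (y == null)
            throw new ArgumentNullException("y");
        return GetNormalisedString(x) == GetNormalisedString(y);
    }

    public int GetHashCode(string obj)
    {
        if (obj == null)
            throw new ArgumentNullException("obj");
        return GetNormalisedString(obj).GetHashCode();
    }
}
```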

Update (17th December 2012): This has been included as part of a later Full Text Indexer Round-up Post that brings together several Posts into one series, incorporating code and techniques from each of them.

Posted at 21:11


The .Net Dictionary is FAST!

I've been playing around with writing a Full Text Indexer - not for any practical reason, more because I was having an argument with an integration of Lucene (.Net) at work and that phrase crossed my mind; "how hard could it be?!" Yeah.. generally that should be a warning sign :)

I was working under no illusions - even if the basic concept was easy, our integration has all sorts of issues dealing with indexing across different languages, filtering shared content for use in different websites, all sorts.. but I thought it might be a way to learn something new so why not!

At its core, once all of the source content has been parsed and analysed, there's a mapping of string matches back to results with appropriate weightings for the mappings. I initially used a Dictionary<string, WeightedEntry<TKey>> where WeightedEntry<TKey> is a key / weight pair. It struck me that the built-in class is a very general purpose implementation and surely it should be possible to write something that would perform better for more specific circumstances. For instance, I'll consider dropping mutability and only accept strings for keys if these can somehow be traded for performance.

The naive approach

The first pass was very simple - if the built-in dictionary was really just a list of KeyValuePairs then key lookups would have to compare the requested key against every key already in the data. So if I had a lookup class that pre-hashed the keys then on a key lookup I could take the requested key's hash and then only compare it to other keys that I already know have the same hash value. Simples!
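
That first pass can be sketched roughly like this (a reconstruction of the idea, not the code from the repository; the class name is mine):

```csharp
using System.Collections.Generic;

// Rough reconstruction of the "pre-hashed keys" idea: the (relatively expensive)
// full string comparison is only performed where a stored hash matches the
// requested key's hash
public class PreHashedLookup<TValue>
{
    private readonly List<KeyValuePair<string, TValue>> _entries;
    private readonly List<int> _hashes;
    public PreHashedLookup(IEnumerable<KeyValuePair<string, TValue>> data)
    {
        _entries = new List<KeyValuePair<string, TValue>>(data);
        _hashes = new List<int>();
        foreach (var entry in _entries)
            _hashes.Add(entry.Key.GetHashCode());
    }

    public bool TryGetValue(string key, out TValue value)
    {
        var hash = key.GetHashCode();
        for (var i = 0; i < _entries.Count; i++)
        {
            // Comparing ints is cheap; only compare strings when hashes match
            if ((_hashes[i] == hash) && (_entries[i].Key == key))
            {
                value = _entries[i].Value;
                return true;
            }
        }
        value = default(TValue);
        return false;
    }
}
```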

Too simple of course.. the .Net Dictionary outperformed this easily. I suspected that, being such a commonly-used class, the framework developers would most likely have put some optimisation work into it - but it's always worth a go!

A still-pretty-naive approach

I remembered seeing some code somewhere deep in the past that had to deal with large numbers of entries in a lookup and the one optimisation it used was to "bucket" the data. So we could take the pre-hashed values and take the modulus of the integer value for some arbitrary number of buckets. Then there would be fewer comparisons again, for a fairly simple one-off up-front calculation. Awesome!
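
The bucketed variation might be sketched as follows (again, a reconstruction of the concept rather than the repository code):

```csharp
using System;
using System.Collections.Generic;

// Rough sketch of the bucketed approach: entries are grouped up-front by
// (hash % number-of-buckets) so that a lookup only has to scan one bucket
public class BucketedLookup<TValue>
{
    private readonly List<KeyValuePair<string, TValue>>[] _buckets;
    public BucketedLookup(IEnumerable<KeyValuePair<string, TValue>> data, int numberOfBuckets)
    {
        if (numberOfBuckets <= 0)
            throw new ArgumentOutOfRangeException("numberOfBuckets");

        _buckets = new List<KeyValuePair<string, TValue>>[numberOfBuckets];
        foreach (var entry in data)
        {
            var index = GetBucketIndex(entry.Key);
            if (_buckets[index] == null)
                _buckets[index] = new List<KeyValuePair<string, TValue>>();
            _buckets[index].Add(entry);
        }
    }

    private int GetBucketIndex(string key)
    {
        // Mask off the sign bit so that negative hash codes don't give a
        // negative modulus
        return (key.GetHashCode() & 0x7FFFFFFF) % _buckets.Length;
    }

    public bool TryGetValue(string key, out TValue value)
    {
        var bucket = _buckets[GetBucketIndex(key)];
        if (bucket != null)
        {
            foreach (var entry in bucket)
            {
                if (entry.Key == key)
                {
                    value = entry.Value;
                    return true;
                }
            }
        }
        value = default(TValue);
        return false;
    }
}
```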

It turns out, after running some tests, that the built-in Dictionary was still a way ahead of me. Sigh.

.Net Framework Source

At some point I remembered reading that Microsoft made available the source for the .Net Framework so why not have a bit of a peek - maybe I'll learn something! I've also poked around in source with dotPeek in the past and I believe that will hook into source servers these days.

After a quick shufty it seems like it's using a bucketed approach and basing the number of buckets on a prime number related to the capacity of the dictionary.... er, what?! I presume this is based on a well-researched and performant algorithm thought up by clever people and implemented in .Net as a speedy generic implementation. Looks like improving on this will be more difficult than I first imagined (granted, I didn't think my original naive thoughts were particularly realistic).

One more try

The night I came to this point I went to bed and was bothered by what I'd found and ended up keeping myself half awake with ideas about data splitting and bucketing and lots of things that probably made no sense. It was like the opposite of the clarity of a "shower moment" when a solution pops unbidden into your mind, I was just thinking round and round in circles.

But I came out of it wondering - if bucketing the values to reduce the number of comparisons (even if these comparisons are only of hashed values) yields such improved performance, and since computers love simple binary operations the best, then wouldn't having nested buckets which we fall through by repeatedly dividing the hash value by two until we get to the value (if present) be faster again??

All I'd be doing would be bit-shifting the hash and AND'ing the last bit to determine whether to go one way or the other as I go through the layers. I could even represent the set of "buckets" with structs and push them all into one array - that should make accessing the data that much faster as the memory access is more predictable? Right??

And maybe I could experiment with optimising for cases where keys are not present in the data - perhaps it would be quicker to have a "key-not-found" bucket which loops back on itself instead of having to check at each division whether the next bucket exists, surely it would be quicker for the cases where the key is found since there would be less work performed each step!
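
The bit-shifting idea can be illustrated conceptually like this (my reconstruction of the concept, using class nodes rather than the struct-array layout described above):

```csharp
using System.Collections.Generic;

// Conceptual illustration of the "divide the hash by two at each layer" idea:
// each bit of the hash chooses left or right through a binary tree of buckets
// until a node holding entries for that exact hash is reached (or shown absent)
public class HashBitTreeLookup<TValue>
{
    private class Node
    {
        public Node Zero, One;
        public List<KeyValuePair<string, TValue>> Entries; // keys sharing this hash
    }
    private readonly Node _root = new Node();

    public void Add(string key, TValue value)
    {
        var node = _root;
        var hash = key.GetHashCode();
        // Walk one layer per bit: AND the last bit to pick a branch, then
        // shift the hash down for the next layer
        for (var bit = 0; bit < 32; bit++)
        {
            if ((hash & 1) == 0)
                node = node.Zero ?? (node.Zero = new Node());
            else
                node = node.One ?? (node.One = new Node());
            hash >>= 1;
        }
        if (node.Entries == null)
            node.Entries = new List<KeyValuePair<string, TValue>>();
        node.Entries.Add(new KeyValuePair<string, TValue>(key, value));
    }

    public bool TryGetValue(string key, out TValue value)
    {
        var node = _root;
        var hash = key.GetHashCode();
        for (var bit = 0; (bit < 32) && (node != null); bit++)
        {
            node = ((hash & 1) == 0) ? node.Zero : node.One;
            hash >>= 1;
        }
        if ((node != null) && (node.Entries != null))
        {
            foreach (var entry in node.Entries)
            {
                if (entry.Key == key)
                {
                    value = entry.Value;
                    return true;
                }
            }
        }
        value = default(TValue);
        return false;
    }
}
```

(Traversing one node per hash bit is a lot of steps per lookup, which goes some way to explaining the result below.)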

Well.. I'll not ramble on too much more about this. The .Net Dictionary still beat me.

(Just in case you have a morbid curiosity about the myriad ways in which I failed - and in which I succeeded - there will be a link to a Bitbucket repository further down with example code).

Some algorithm research!

I had a little break from it all (to wallow in the disappointment) before realising I'd basically missed a trick. How often is it that Google doesn't have the answer one way or the other?! More specifically, I'm thinking there must have been many many people who have tried to solve similar problems before.. Maybe I should have thought of this when I looked at the Framework source. Maybe I should have thought of it in the first place! But then I would have missed out on the fun of poking about with the existing Dictionary and tinkering about with all of the above.

So it's proper research time!

I must admit that my knowledge of lower level algorithms and structures is a bit lacking. I have a Maths degree so I never covered those courses from Computer Science; I'm mostly self-taught. I've heard about these mysterious red/black trees, binary search trees, self-adjusting balancing search trees and tries - it's just that I don't know what they all are! Maybe one of them can help. I mean, if these have been tested and applied to similar specialised scenarios then I can use someone else's genius to make my own code faster - good times!

The Ternary Search Tree

Having spent a little while trying to get a basic grounding in some of the more common structures and investigating which might be appropriate to the matter in hand, I settled on trying out a Ternary Search Tree - partly because its performance characteristics seemed to indicate that it could offer an improvement and partly because it seemed like it would be easy to implement (and so not frustrate me too much if it got me nowhere!).

Reading the following articles was enough to get to grips with the structure and write a C# class to implement it:

http://en.wikipedia.org/wiki/Ternary_search_tree

http://www.drdobbs.com/database/184410528

Although the partial string matching and "nearest-neighbour" possibilities seemed interesting I didn't investigate them too far since I don't think they offer too much for the problem domain I've got in mind. In the context of a full text indexer, the partial string matching will only match a "template" string - the strings must be the same length and specific characters can be considered flexible. Nearest-neighbour also relies on making comparisons between strings of the same length. Both methods would allow only minor flexibility for incorrect spellings, for example. And since the partial matching only supports single character wildcards, a search-for-substring-within-a-key facility couldn't be created in that manner.

The constructor takes a list of KeyValuePairs where the key must be a string while the value is a generic typeparam. It also takes a "keyNormaliser" which will enable a transformation to be applied to all keys before insertion into the tree and the same transformation applied to all keys passed to the retrieval method. This would enable a case-insensitive search tree to be specified. Or it could enable a transformation which replaces all accented characters with latin alternatives. Or enable basic plurality handling by appending an "s" to the end of keys unless the key ends in certain letters (so "s" can safely be appended to "chair" to make "chair[s]" whilst "category" would become "categor[y][ies]" and "cactus" "cact[us][ii]" - the approach would have to map both ways, so would map both "chair" and "chairs" to "chair[s]", it would be language-specific and probably not perfect but it opens up an interesting avenue of investigation).
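
For example, the case-insensitive behaviour could be provided by a keyNormaliser as simple as this (a sketch implementing the IStringNormaliser interface shown earlier, not code from the repository):

```csharp
using System;

// A minimal keyNormaliser that makes the tree case-insensitive: keys are
// trimmed and lower-cased both at insertion time and at retrieval time
public class CaseInsensitiveStringNormaliser : IStringNormaliser
{
    public string GetNormalisedString(string value)
    {
        if (value == null)
            throw new ArgumentNullException("value");
        return value.Trim().ToLower();
    }

    public bool Equals(string x, string y)
    {
        if (x == null)
            throw new ArgumentNullException("x");
        if (y == null)
            throw new ArgumentNullException("y");
        return GetNormalisedString(x) == GetNormalisedString(y);
    }

    public int GetHashCode(string obj)
    {
        if (obj == null)
            throw new ArgumentNullException("obj");
        return GetNormalisedString(obj).GetHashCode();
    }
}
```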

Update (31st May 2012): To illustrate, an example English-language plurality-handling normaliser can be seen in this post.

Test data

When testing some early forays into this full text indexer problem I pulled some data from the New York Times Article Search API (which I only discovered while trying to find a good source of data). I just did a search for "test" and pulled as many articles as I could in the number of API calls you're allowed for a day, then processed the content, leaving me essentially with a dictionary of data with string keys (each key being a word in an article's title, body or another field in their data).

The API is really simple to communicate with; below is the code I used to perform the retrieval (the comments have been stripped out for brevity):

using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.IO;
using System.Linq;
using System.Net;
using System.Web;
using System.Web.Script.Serialization;

namespace NewYorkTimesDataRetriever
{
    public class NewYorkTimesArticleRetriever : IArticleRetriever
    {
        private Uri _serviceUrl;
        private string _apiKey;
        public NewYorkTimesArticleRetriever(Uri serviceUrl, string apiKey)
        {
            if (serviceUrl == null)
                throw new ArgumentNullException("serviceUrl");
            if (!serviceUrl.IsAbsoluteUri)
                throw new ArgumentException("serviceUrl must be an absolute Uri");
            if (!string.IsNullOrWhiteSpace(serviceUrl.Query))
                throw new ArgumentException("serviceUrl may not have any query content");
            if (string.IsNullOrWhiteSpace(apiKey))
                throw new ArgumentException("Null/empty apiKey specified");

            _serviceUrl = serviceUrl;
            _apiKey = apiKey.Trim();
        }

        public ArticleSet GetArticles(string searchTerm, int pageIndex)
        {
            if (string.IsNullOrWhiteSpace(searchTerm))
                throw new ArgumentException("Null/empty searchTerm specified");
            if (pageIndex < 0)
                throw new ArgumentOutOfRangeException("pageIndex", "must be zero or greater");

            var request = WebRequest.Create(string.Format(
                "{0}?query={1}&api-key={2}&fields=column_facet,body,title,byline&offset={3}",
                _serviceUrl.ToString(),
                HttpUtility.UrlEncode(searchTerm.Trim()),
                HttpUtility.UrlEncode(_apiKey.Trim()),
                pageIndex
            ));

            string jsonData;
            using (var stream = request.GetResponse().GetResponseStream())
            {
                using (var reader = new StreamReader(stream))
                {
                    jsonData = reader.ReadToEnd();
                }
            }
            var results = new JavaScriptSerializer().Deserialize<Resultset>(jsonData);
            var translatedArticles = new List<Article>();
            for (var index = 0; index < results.Results.Count; index++)
            {
                var article = results.Results[index];
                if (string.IsNullOrWhiteSpace(article.Title)
                || string.IsNullOrWhiteSpace(article.Body))
                    continue;
                translatedArticles.Add(new Article(
                    (pageIndex * PageSize) + index,
                    article.Title,
                    article.ByLine,
                    article.Column_Facet,
                    article.Body
                ));
            }
            return new ArticleSet(results.Total, pageIndex, translatedArticles);
        }

        public int PageSize { get { return 10; } }

        private class Resultset
        {
            public Resultset()
            {
                Results = new List<Result>();
            }
            public int Offset { get; set; }
            public int Total { get; set; }
            public string[] Tokens { get; set; }
            public List<Result> Results { get; set; }

            public class Result
            {
                public string Title { get; set; }
                public string ByLine { get; set; }
                public string Column_Facet { get; set; }
                public string Body { get; set; }
            }
        }
    }

    public interface IArticleRetriever
    {
        ArticleSet GetArticles(string searchTerm, int pageIndex);
        int PageSize { get; }
    }

    public class ArticleSet
    {
        private ReadOnlyCollection<Article> _articles;
        public ArticleSet(int totalResultCount, int pageIndex, IEnumerable<Article> articles)
        {
            if (totalResultCount < 0)
                throw new ArgumentOutOfRangeException(
                    "totalResultCount",
                    "must be zero or greater"
                );
            if (pageIndex < 0)
                throw new ArgumentOutOfRangeException(
                    "pageIndex",
                    "must be zero or greater"
                );
            if (articles == null)
                throw new ArgumentNullException("articles");

            var articlesTidied = new List<Article>();
            foreach (var article in articles)
            {
                if (article == null)
                    throw new ArgumentException("Null entry encountered in articles");
                articlesTidied.Add(article);
            }
            if ((totalResultCount == 0) && articlesTidied.Any())
                throw new ArgumentException("Article set must be empty if total count is 0");

            TotalResultCount = totalResultCount;
            PageIndex = pageIndex;
            _articles = articlesTidied.AsReadOnly();
        }

        public int TotalResultCount { get; private set; }
        public int PageIndex { get; private set; }
        public ReadOnlyCollection<Article> Articles { get { return _articles; } }
    }

    public class Article
    {
        public Article(int key, string title, string byLine, string keywords, string body)
        {
            if (string.IsNullOrWhiteSpace(title))
                throw new ArgumentException("Null/empty title specified");
            if (string.IsNullOrWhiteSpace(body))
                throw new ArgumentException("Null/empty body specified");

            Key = key;
            Title = title.Trim();
            ByLine = (byLine ?? "").Trim();
            Keywords = (keywords ?? "").Trim();
            Body = body.Trim();
        }

        public int Key { get; private set; }
        public string Title { get; private set; }
        public string ByLine { get; private set; }
        public string Keywords { get; private set; }
        public string Body { get; private set; }
    }
}

Celebrate good times!

With this data (over 86,000 keys) I set up some initial test runs - comparing the performance of the .Net Dictionary against my Ternary Search Tree implementation by taking 1000 keys at random from the source data and requesting them all from the .Net Dictionary and then from the TST. The retrieval was performed 1000 times successively against each dictionary and then the entire process was looped 5 times. Perhaps not the ultimate in the application of the scientific method, but good enough to get an idea.
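
The shape of the timing loop was along these lines (a simplified sketch; the helper name is mine and the real test harness is in the repository linked further down):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

// Simplified sketch of the benchmark shape: time a number of passes of
// retrieving the same randomly-chosen keys against whichever lookup is
// under test (a tryGetValue delegate, so that missing keys don't throw)
public static class LookupTimer
{
    public static TimeSpan TimeRetrievals(
        Func<string, bool> tryGetValue,
        IEnumerable<string> keysToRequest,
        int numberOfPasses)
    {
        var keys = keysToRequest.ToArray();
        var timer = Stopwatch.StartNew();
        for (var pass = 0; pass < numberOfPasses; pass++)
        {
            foreach (var key in keys)
                tryGetValue(key);
        }
        timer.Stop();
        return timer.Elapsed;
    }
}
```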

Running the test app I saw a performance improvement over the .Net Dictionary by the TST with it doing the work over 2.7 times as fast. Amazing! (This was in a release build, debug builds still showed impressive improvement but not as great; around 2x the speed). This was a real result and clearly a marked improvement over my meagre efforts! :)

However.. due to the manner in which the TST performs its lookups, this is really the ideal case for it and so further investigation is required before it can be considered a complete success.

So to look further into it I took the same randomly-selected keys and reversed half of them so that only half of them should match (there will be a few single character keys and some palindromes - I ran a quick check and there were 275 in the set of 86,728 keys so they're fairly negligible). This time the TST was over 1.9 times as fast as the .Net Dictionary - still definitely an improvement but not as dramatic. If I reverse all of the keys then it turns out the TST is just slower than the .Net Dictionary (it manages over 97% of the Dictionary's speed, so it's not far behind, but it is slower nonetheless).

Before I go too far down this path, though, there's something else that's bugging me. The order in which the keys are inserted while the tree is being constructed is going to alter the internal structure of the completed tree. It seems very unlikely that every order will have identical retrieval performance. Which brings me on to thinking hard about..

Insertion Order

The Dobbs article makes a mention about varying performance due to key insertion order:

If you insert the nodes in sorted order, the result is a long skinny tree that is very costly to build and search. Fortunately, if you insert the nodes in random order, a binary search tree is usually close to balanced.

and

You can build a completely balanced tree by inserting the median element of the input set, then recursively inserting all lesser elements and greater elements.

To test it out I tried a few orderings on inserted items; as they appeared in the source data, applying a random sort, alphabetical and the median-item method recommended above.
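
The median-item ordering can be generated by sorting the keys and then recursively emitting the middle element of each sub-range; a sketch of that approach (the helper names are mine):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sort the keys, then repeatedly emit the median of each sub-range so that
// inserting the keys in the returned order builds an approximately balanced tree
public static class MedianOrdering
{
    public static IEnumerable<string> Sort(IEnumerable<string> keys)
    {
        var sorted = keys.OrderBy(k => k, StringComparer.Ordinal).ToArray();
        var results = new List<string>();
        AddMedianFirst(sorted, 0, sorted.Length - 1, results);
        return results;
    }

    private static void AddMedianFirst(string[] sorted, int min, int max, List<string> results)
    {
        if (min > max)
            return;
        var median = (min + max) / 2;
        results.Add(sorted[median]);
        AddMedianFirst(sorted, min, median - 1, results);  // lesser elements
        AddMedianFirst(sorted, median + 1, max, results);  // greater elements
    }
}
```

For example, Sort(new[] { "a", "b", "c" }) yields "b" first, so the middle key becomes the root of the tree rather than producing the long skinny tree that strictly alphabetical insertion would.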

I re-ran the earlier tests, reversing none of the keys, half of the keys and then all of the keys to get a rough idea of how that might affect the results. Each time the median-item sorting approach was fastest. But.. in the best cases it wasn't much over 2% faster than the random-ordered insertions and less than 2% faster than the unordered data. It was about 8.5% faster than the alphabetical-sorted keys so (as expected from the above) that was considerably worse. When all keys were reversed the improvements were in the region of 0.5%, 0.2% and 3.7% (for random, unordered and alphabetical). So there's definitely an improvement to be had but it's much smaller between randomly sorted keys and "ideally" sorted keys (a well-balanced tree) than between the .Net Dictionary and the TST.

If the data was infrequently updated and heavily read then ensuring that the tree is well-balanced is most likely worth it, but in other scenarios it would be a toss-up as to whether re-sorting the data at each update is worth the initial hit.

Measure of balance

If there was a standard way to measure how well balanced the tree is then it might be feasible to apply a threshold that, once crossed, triggers a re-build of the tree. Oddly, though, I had a quick scour of the internet and couldn't find a standard metric! The gist is that the less "deep" a tree is, the better - which makes sense, as the seek operations can be envisaged as travelling down the branches of the tree and if a tree is wider than it is deep then retrievals can generally be expected to require fewer operations.

So I considered trying to measure the width and depth of the tree, then the range of tree lengths (then a few other things)... and finally settled on calculating the average depth-to-key-length ratio. This will always be at least one (unless there are no items, in which case it should be considered undefined) since no key can be contained in fewer elements than there are characters in the key (after any manipulations by the keyNormaliser). For the data I was using, I got "balance factors" (for want of a better name!) of 2.17, 2.13, 6.65 and 1.96 for the randomly-sorted keys, the unsorted keys, the alphabetically-sorted keys and the median-item-sorted keys, each of which ties in roughly with the differences in seek times I described above.

Final results

Taking a well-balanced tree with my sample data, I ran it through the same batch of tests (with varying proportions of key flipped) to see how good we could get it:

No keys reversed (so all keys requested present in data): Over 2.8x the retrieval performance of the .Net Dictionary.

75% of the keys reversed: Over 1.4x

50%: Over 1.2x

97%: Almost 1.1x

98%: Between 1.0x and 1.1x

99%: Slightly slower, about 99% the speed of the .Net Dictionary

All keys reversed: Slower again, about 97% the performance

All in all, pretty impressive! Going by these numbers, so long as fewer than 99% of the requests are for keys that are missing from the data then the TST will offer an improvement - and a drastic one when most requests are for valid keys.

It's been a fun venture and it's left me thinking that I should fill in my algorithm knowledge a little more - it's not every day that I'm going to obsess over a particular structure, but it's always good to increase my awareness of what's out there.

The code can be found in the repository https://bitbucket.org/DanRoberts/dictionaryspeedtests - along with a few tests that I ran and some of the approaches that I tried out and talked about at the top of the post - but the TST class is basically the following (again, slightly trimmed for brevity). As my original premise was to consider an immutable store I haven't implemented any add / delete / update methods but now that I'm happy with the performance boost I may well do - though then I'd have to decide whether to try to build in a re-balancing threshold mechanism, so I think instead I might have a little rest! :)

using System;
using System.Collections;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Linq;

namespace TSTExample
{
    public class TernarySearchTreeDictionary<TValue>
    {
        private Node _root;
        private IStringNormaliser _keyNormaliser;
        private ReadOnlyCollection<string> _keys;

        public TernarySearchTreeDictionary(
            IEnumerable<KeyValuePair<string, TValue>> data,
            IStringNormaliser keyNormaliser)
        {
            if (data == null)
                throw new ArgumentNullException("data");
            if (keyNormaliser == null)
                throw new ArgumentNullException("keyNormaliser");

            Node root = null;
            var nodeCount = 0;
            var keys = new HashSet<string>(keyNormaliser);
            var depthToLengthRatio = new List<float>();
            foreach (var entry in data)
            {
                var key = entry.Key;
                if (key == null)
                    throw new ArgumentException("Null key encountered in data");
                var normalisedKey = keyNormaliser.GetNormalisedString(key);
                if (normalisedKey == "")
                    throw new ArgumentException("normalised key is blank: " + key);
                if (keys.Contains(normalisedKey))
                    throw new ArgumentException("duplicate normalised key: " + key);
                keys.Add(key);

                if (root == null)
                {
                    root = new Node() { Character = normalisedKey[0] };
                    nodeCount++;
                }

                var node = root;
                var index = 0;
                var depth = 1;
                while (true)
                {
                    if (node.Character == normalisedKey[index])
                    {
                        index++;
                        if (index == normalisedKey.Length)
                        {
                            node.IsKey = true;
                            node.Value = entry.Value;
                            // Cast to float to avoid integer division (which would round the ratio down to zero or one)
                            depthToLengthRatio.Add((float)depth / normalisedKey.Length);
                            break;
                        }
                        if (node.MiddleChild == null)
                        {
                            node.MiddleChild = new Node() { Character = normalisedKey[index] };
                            nodeCount++;
                        }
                        node = node.MiddleChild;
                    }
                    else if (normalisedKey[index] < node.Character)
                    {
                        if (node.LeftChild == null)
                        {
                            node.LeftChild = new Node() { Character = normalisedKey[index] };
                            nodeCount++;
                        }
                        node = node.LeftChild;
                    }
                    else
                    {
                        if (node.RightChild == null)
                        {
                            node.RightChild = new Node() { Character = normalisedKey[index] };
                            nodeCount++;
                        }
                        node = node.RightChild;
                    }
                    depth++;
                }
            }

            _root = root;
            _keyNormaliser = keyNormaliser;
            _keys = keys.ToList().AsReadOnly();
            BalanceFactor = depthToLengthRatio.Any() ? depthToLengthRatio.Average() : 0;
        }

        private class Node
        {
            public char Character { get; set; }
            public Node LeftChild { get; set; }
            public Node MiddleChild { get; set; }
            public Node RightChild { get; set; }
            public bool IsKey { get; set; }
            public TValue Value { get; set; }
        }

        public int Count
        {
            get { return _keys.Count; }
        }

        public IEnumerable<string> Keys
        {
            get { return _keys; }
        }

        public float BalanceFactor { get; private set; }

        public bool TryGetValue(string key, out TValue value)
        {
            if (key == null)
                throw new ArgumentNullException("key");
            var normalisedKey = _keyNormaliser.GetNormalisedString(key);
            if (normalisedKey != "")
            {
                var node = _root;
                var index = 0;
                while (true)
                {
                    if (node.Character == normalisedKey[index])
                    {
                        index++;
                        if (index == normalisedKey.Length)
                        {
                            if (node.IsKey)
                            {
                                value = node.Value;
                                return true;
                            }
                            break;
                        }
                        node = node.MiddleChild;
                    }
                    else if (normalisedKey[index] < node.Character)
                        node = node.LeftChild;
                    else
                        node = node.RightChild;
                    if (node == null)
                        break;
                }
            }
            value = default(TValue);
            return false;
        }

        public TValue this[string key]
        {
            get
            {
                TValue value;
                if (!TryGetValue(key, out value))
                    throw new KeyNotFoundException();
                return value;
            }
        }
    }

    public interface IStringNormaliser : IEqualityComparer<string>
    {
        string GetNormalisedString(string value);
    }
}
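
To show how the pieces fit together, here's a minimal usage sketch. It assumes the TernarySearchTreeDictionary class from the listing above and pairs it with a hypothetical LowerCaseStringNormaliser (not part of the original code) that just trims and lower-cases keys - a real normaliser might also strip accents or handle plurals, but this is enough to demonstrate the case-insensitive lookups.

```csharp
using System;
using System.Collections.Generic;

namespace TSTExample
{
    // Hypothetical, minimal normaliser for illustration: trims and lower-cases keys.
    // The IEqualityComparer<string> members delegate to the same normalisation so
    // that the HashSet used during construction treats "Cat" and "cat" as duplicates.
    public class LowerCaseStringNormaliser : IStringNormaliser
    {
        public string GetNormalisedString(string value)
        {
            if (value == null)
                throw new ArgumentNullException("value");
            return value.Trim().ToLower();
        }

        public bool Equals(string x, string y)
        {
            return GetNormalisedString(x) == GetNormalisedString(y);
        }

        public int GetHashCode(string obj)
        {
            return GetNormalisedString(obj).GetHashCode();
        }
    }

    public static class Demo
    {
        public static void Main()
        {
            var data = new Dictionary<string, int>
            {
                { "cat", 1 },
                { "cart", 2 },
                { "dog", 3 }
            };
            var tst = new TernarySearchTreeDictionary<int>(data, new LowerCaseStringNormaliser());

            int value;
            // Lookups go through the normaliser, so case doesn't matter
            Console.WriteLine(tst.TryGetValue("CAT", out value)); // True
            Console.WriteLine(value);                             // 1
            // "ca" is a prefix of a key but not a key itself (no node with IsKey set)
            Console.WriteLine(tst.TryGetValue("ca", out value));  // False
        }
    }
}
```

Note that "ca" fails the lookup even though the traversal reaches a valid node - only nodes flagged with IsKey terminate a key, which is what distinguishes a TST from a plain prefix match.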

Update (17th December 2012): This has been included as part of a later Full Text Indexer Round-up Post that brings together several Posts into one series, incorporating code and techniques from each of them.

Posted at 23:38