Productive Rage

Dan's techie ramblings

The Full Text Indexer - Going International!

Pushing on with the Full Text Indexer series I've been posting about (see Full Text Indexer and Full Text Indexer - Adding and Subtracting), I want to demonstrate how it can work with multi-lingual content (or other content variations - for example, with the data at my day job, Products have different delivery "channels" for which different descriptions may be recorded, as well as being in multiple languages).

A heads-up: This post is going to be largely code with a few explanatory paragraphs. There's nothing particularly complex going on and I think the code will - for the most part - speak for itself!

Setting the scene

In the previous examples, the TKey type of the IIndexData was an int representing an item's unique id. One way to extend this would be to specify the following as TKey:

public interface ILanguageScopedId : IEquatable<ILanguageScopedId>
{
    int Id { get; }
    bool IsApplicableForLanguage(int language);
}

Where two simple implementations might be:

public sealed class LanguageScopedId : ILanguageScopedId
{
    public LanguageScopedId(int id, int language)
    {
        Id = id;
        Language = language;
    }

    public int Id { get; private set; }

    public int Language { get; private set; }

    public bool IsApplicableForLanguage(int language)
    {
        return (language == Language);
    }

    public bool Equals(ILanguageScopedId obj)
    {
        var objLanguageScopedId = obj as LanguageScopedId;
        if (objLanguageScopedId == null)
            return false;

        return ((objLanguageScopedId.Id == Id) && (objLanguageScopedId.Language == Language));
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as ILanguageScopedId);
    }

    public override int GetHashCode()
    {
        // Since the overridden ToString method will consistently encapsulate all of the
        // information for this instance we use it to override the GetHashCode method,
        // consistent with the overridden Equals implementation
        return ToString().GetHashCode();
    }

    public override string ToString()
    {
        return String.Format("{0}:{1}-{2}", base.ToString(), Id, Language);
    }
}

public sealed class NonLanguageScopedId : ILanguageScopedId
{
    public NonLanguageScopedId(int id)
    {
        Id = id;
    }

    public int Id { get; private set; }

    public bool IsApplicableForLanguage(int language)
    {
        return true;
    }

    public bool Equals(ILanguageScopedId obj)
    {
        var objLanguageScopedId = obj as NonLanguageScopedId;
        if (objLanguageScopedId == null)
            return false;

        return (objLanguageScopedId.Id == Id);
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as ILanguageScopedId);
    }

    public override int GetHashCode()
    {
        // Since the overridden ToString method will consistently encapsulate all of the
        // information for this instance we use it to override the GetHashCode method,
        // consistent with the overridden Equals implementation
        return ToString().GetHashCode();
    }

    public override string ToString()
    {
        return String.Format("{0}:{1}", base.ToString(), Id);
    }
}

There are two implementations because it's feasible that not all content will be multi-lingual (see the Article class further down). I only really included IEquatable<ILanguageScopedId> in the ILanguageScopedId interface so that it would be easy to write the KeyComparer that the IndexGenerator requires (this was also the motivation for sealing the implementations; since they can't be inherited from, the type comparisons in the Equals methods are simpler) -

/// <summary>
/// As the ILanguageScopedId interface implements IEquatable<ILanguageScopedId>, this
/// class has very little work to do
/// </summary>
public class LanguageScopedIdComparer : IEqualityComparer<ILanguageScopedId>
{
    public bool Equals(ILanguageScopedId x, ILanguageScopedId y)
    {
        if ((x == null) && (y == null))
            return true;
        else if ((x == null) || (y == null))
            return false;
        return x.Equals(y);
    }

    public int GetHashCode(ILanguageScopedId obj)
    {
        if (obj == null)
            throw new ArgumentNullException("obj");

        return obj.GetHashCode();
    }
}

The previous posts used an Article class as an illustration. Here I'll expand that class such that the Title and Content have content that may vary across different languages (represented by the MultiLingualContent class, also below) while Author will not (and so is just a string) -

public class Article
{
    public Article(
        int id,
        DateTime lastModified,
        MultiLingualContent title,
        string author,
        MultiLingualContent content)
    {
        if (title == null)
            throw new ArgumentNullException("title");
        if (string.IsNullOrWhiteSpace(author))
            throw new ArgumentException("Null/blank author specified");
        if (content == null)
            throw new ArgumentNullException("content");

        Id = id;
        LastModified = lastModified;
        Title = title;
        Author = author.Trim();
        Content = content;
    }

    public int Id { get; private set; }

    public DateTime LastModified { get; private set; }

    /// <summary>
    /// This will never be null
    /// </summary>
    public MultiLingualContent Title { get; private set; }

    /// <summary>
    /// This will never be null or blank
    /// </summary>
    public string Author { get; private set; }

    /// <summary>
    /// This will never be null
    /// </summary>
    public MultiLingualContent Content { get; private set; }
}

public class MultiLingualContent
{
    private string _defaultContent;
    private ImmutableDictionary<int, string> _languageOverrides;
    public MultiLingualContent(
        string defaultContent,
        ImmutableDictionary<int, string> languageOverrides)
    {
        if (string.IsNullOrWhiteSpace(defaultContent))
            throw new ArgumentException("Null/blank defaultContent specified");
        if (languageOverrides == null)
            throw new ArgumentNullException("languageOverrides");
        if (languageOverrides.Keys.Select(key => languageOverrides[key]).Any(
            value => string.IsNullOrWhiteSpace(value))
        )
            throw new ArgumentException("Null/blank encountered in languageOverrides data");

        _defaultContent = defaultContent.Trim();
        _languageOverrides = languageOverrides;
    }

    /// <summary>
    /// This will never return null or blank. If there is no language-specific content for
    /// the specified language then the default will be returned.
    /// </summary>
    public string GetContent(int language)
    {
        if (_languageOverrides.ContainsKey(language))
            return _languageOverrides[language].Trim();
        return _defaultContent;
    }
}

Note: The ImmutableDictionary (along with the NonNullImmutableList and the ToNonNullImmutableList extension method which are seen elsewhere in the code) can be found in the Full Text Indexer repo on Bitbucket.

Generating and querying the new Index format

For the purposes of this example, I'm going to assume that all of the possible languages are known upfront (if not, it's feasible that the source Article data could be analysed each time an Index is built to determine which languages are present, but for now I'm going to go with the easier case of knowing all of the options beforehand).
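
If the languages did have to be derived from the source data then a minimal sketch might look something like the following, given the set of source Articles - note that "OverrideLanguages" is a hypothetical property (the MultiLingualContent class above doesn't have it) that would expose the keys of the language overrides:

// A sketch only: "OverrideLanguages" is a hypothetical property that would need to be
// added to MultiLingualContent to expose the keys of its _languageOverrides data
var languages = articles
    .SelectMany(article => article.Title.OverrideLanguages.Concat(
        article.Content.OverrideLanguages
    ))
    .Distinct()
    .ToArray();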

As we've seen before, we need to prepare an IndexGenerator (this time IndexGenerator<Article, ILanguageScopedId> instead of IndexGenerator<Article, int>, since the key type of the IIndexData that will be produced is no longer an int) with Content Retrievers, a Key Comparer, String Normaliser, Token Breaker, Weighted Entry Combiner and Logger. Here there are more Content Retrievers, as the multi-lingual content must be requested for each supported language (though the non-multi-lingual content - the Author field on Article instances - only needs a single retriever).

var languages = new[] { 1, 2, 3 };

var contentRetrievers =
    new[]
    {
        new ContentRetriever<Article, ILanguageScopedId>(
            article => new PreBrokenContent<ILanguageScopedId>(
                new NonLanguageScopedId(article.Id),
                article.Author
            ),
            token => 1f
        )
    }
    .Concat(
        languages.SelectMany(language => new[]
        {
            new ContentRetriever<Article, ILanguageScopedId>(
                article => new PreBrokenContent<ILanguageScopedId>(
                    new LanguageScopedId(article.Id, language),
                    article.Title.GetContent(language)
                ),
                token => 5f
            ),
            new ContentRetriever<Article, ILanguageScopedId>(
                article => new PreBrokenContent<ILanguageScopedId>(
                    new LanguageScopedId(article.Id, language),
                    article.Content.GetContent(language)
                ),
                token => 1f
            )
        }
    ));

var indexGenerator = new IndexGenerator<Article, ILanguageScopedId>(
    contentRetrievers.ToNonNullImmutableList(),
    new LanguageScopedIdComparer(),
    new DefaultStringNormaliser(),
    new WhiteSpaceTokenBreaker(),
    weightedValues => weightedValues.Sum(),
    new NullLogger()
);

var index = indexGenerator.Generate(new NonNullImmutableList<Article>(new[]
{
    new Article(
        1,
        new DateTime(2012, 7, 24),
        new ContentBuilder("One").AddLanguage(2, "Un").Get(),
        "Terrence",
        new ContentBuilder("First Entry").AddLanguage(2, "Première entrée").Get()
    ),
    new Article(
        2,
        new DateTime(2012, 8, 24),
        new ContentBuilder("Two").AddLanguage(2, "Deux").Get(),
        "Jeroshimo",
        new ContentBuilder("Second Entry").AddLanguage(2, "Deuxième entrée").Get()
    )
}));

Finally, there's a slight change to the querying mechanism. We have to perform a lookup for all keys that match a given token and then filter out any entries that we're not interested in. And since there are multiple key types which can relate to content in the same language (because a request for content in language 1 should combine keys of type LanguageScopedId which are marked as being for language 1 alongside keys of type NonLanguageScopedId), we may have to group and combine some of the results.

var language = 1;
var resultsEntryInLanguage1 = index.GetMatches("Entry")
    .Where(weightedMatch => weightedMatch.Key.IsApplicableForLanguage(language))
    .GroupBy(weightedMatch => weightedMatch.Key.Id)
    .Select(weightedMatchGroup => new WeightedEntry<int>(
        weightedMatchGroup.Key,
        weightedMatchGroup.Sum(weightedMatch => weightedMatch.Weight)
    ));

The earlier code uses a "ContentBuilder" to prepare the MultiLingualContent instances, just because it removes some of the clutter from the code. For the sake of completeness, that can be seen below:

private class ContentBuilder
{
    private string _defaultContent;
    private ImmutableDictionary<int, string> _languageOverrides;

    public ContentBuilder(string defaultContent)
        : this(defaultContent, new ImmutableDictionary<int, string>()) { }

    private ContentBuilder(
        string defaultContent,
        ImmutableDictionary<int, string> languageOverrides)
    {
        if (string.IsNullOrWhiteSpace(defaultContent))
            throw new ArgumentException("Null/blank defaultContent specified");
        if (languageOverrides == null)
            throw new ArgumentNullException("languageOverrides");

        _defaultContent = defaultContent;
        _languageOverrides = languageOverrides;
    }

    public ContentBuilder AddLanguage(int language, string content)
    {
        if (string.IsNullOrWhiteSpace(content))
            throw new ArgumentException("Null/blank content specified");
        if (_languageOverrides.ContainsKey(language))
            throw new ArgumentException("Duplicate language: " + language);

        return new ContentBuilder(
            _defaultContent,
            _languageOverrides.AddOrUpdate(language, content)
        );
    }

    public MultiLingualContent Get()
    {
        return new MultiLingualContent(_defaultContent, _languageOverrides);
    }
}

Extended Key Types

This approach to supporting multi-lingual data is just one way of using the generic TKey type of the IndexGenerator / IndexData classes. I mentioned at the top that the data I deal with at my real job also varies descriptive data over multiple delivery channels; this could be implemented in a similar manner to the above by extending the ILanguageScopedId interface to:

public interface ILanguageScopedId : IEquatable<ILanguageScopedId>
{
    int Id { get; }
    bool IsApplicableFor(int language, int channel);
}

And, in the same way as the above code has both the LanguageScopedId and NonLanguageScopedId, there could be various implementations for content that does/doesn't vary by language and/or does/doesn't vary by delivery channel.
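
For illustration, one such implementation might look like this (a sketch only, following the same pattern as the LanguageScopedId class earlier):

public sealed class LanguageAndChannelScopedId : ILanguageScopedId
{
    public LanguageAndChannelScopedId(int id, int language, int channel)
    {
        Id = id;
        Language = language;
        Channel = channel;
    }

    public int Id { get; private set; }

    public int Language { get; private set; }

    public int Channel { get; private set; }

    public bool IsApplicableFor(int language, int channel)
    {
        return (language == Language) && (channel == Channel);
    }

    public bool Equals(ILanguageScopedId obj)
    {
        var objLanguageAndChannelScopedId = obj as LanguageAndChannelScopedId;
        if (objLanguageAndChannelScopedId == null)
            return false;

        return ((objLanguageAndChannelScopedId.Id == Id)
            && (objLanguageAndChannelScopedId.Language == Language)
            && (objLanguageAndChannelScopedId.Channel == Channel));
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as ILanguageScopedId);
    }

    public override int GetHashCode()
    {
        // As with the classes above, the overridden ToString method encapsulates all
        // of the information for this instance so it can back the GetHashCode method
        return ToString().GetHashCode();
    }

    public override string ToString()
    {
        return String.Format("{0}:{1}-{2}-{3}", base.ToString(), Id, Language, Channel);
    }
}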

In fact, since there must be a Key Comparer passed in as a constructor argument to the IndexGenerator, any kind of key can be used with the index so long as an appropriate comparer is available!

Performance

The downside to this sort of approach is, predictably, increased resource requirements during index generation. I say predictably because it should be clear that specifying more Content Retrievers (which we are; they have to increase as the number of languages increases) means that more work will be done when an index is generated from input data.

Also, in the above example, more storage space will be required to store the results, as more content is being extracted and stored in the index - it's feasible that source data could be present which doesn't have any multi-lingual data and so returns the default values from the MultiLingualContent.GetContent(language) call for every language. For each token that is recorded for that data, keys for each of the languages will be recorded in the index, each with duplicate weight data. It's possible that a more intelligent key structure could reduce the amount of space taken up in these cases but that's outside the scope of this post I think (plus no solution springs immediately to mind at this time of night! :)

The good news is that retrieval time shouldn't be significantly increased; the additional work is only to filter the matched keys and group them together on the underlying id, so the lookup itself should still be very quick. The additional load that the filtering and grouping incur will depend upon the structure of the key class.

Update (17th December 2012): This has been included as part of a later Full Text Indexer Round-up Post that brings together several Posts into one series, incorporating code and techniques from each of them.

Posted at 23:56


The Full Text Indexer - Adding and Subtracting

The Full Text Indexer that I talked about last time took a definition for an Index Generator for a specific TSource type and produced an IndexData instance, using that generator, for a TSource set.

In the example shown there, it created an IndexGenerator for IArticle and then generated an Index for an IArticle list, producing an IIndexData<int> (TKey is an int in this case as the key on IArticle is its Id field, which is an int). This IIndexData<int> is an immutable data structure and so it may not be immediately obvious how to update it when the source data has changed.

Last time I mentioned that IIndexData<TKey> has this method:

public interface IIndexData<TKey>
{
    /// <summary>
    /// This will throw an exception for null or blank input. It will never return null.
    /// If there are no matches then an empty list will be returned.
    /// </summary>
    NonNullImmutableList<WeightedEntry<TKey>> GetMatches(string source);
}

but the full interface is:

public interface IIndexData<TKey>
{
    /// <summary>
    /// This will throw an exception for null or blank input. It will never return null.
    /// If there are no matches then an empty list will be returned.
    /// </summary>
    NonNullImmutableList<WeightedEntry<TKey>> GetMatches(string source);

    /// <summary>
    /// This will return a new instance that combines the source instance's data with the
    /// data of other IndexData instances using the specified weight combiner. In a case where
    /// there are different TokenComparer implementations on this instance and on any of the
    /// indexesToAdd, the comparer from the current instance will be used. It is recommended
    /// that a consistent TokenComparer be used at all times. An exception will be thrown
    /// for null dataToAdd or weightCombiner references.
    /// </summary>
    IIndexData<TKey> Combine(
        NonNullImmutableList<IIndexData<TKey>> indexesToAdd,
        IndexGenerators.IndexGenerator.WeightedEntryCombiner weightCombiner
    );

    /// <summary>
    /// This will return a new instance without any WeightedEntry values whose Keys match
    /// the removeIf predicate. If tokens are left without any WeightedEntry values then
    /// the token will be excluded from the new data. This will never return null. It
    /// will throw an exception for a null removeIf.
    /// </summary>
    IIndexData<TKey> Remove(Predicate<TKey> removeIf);

    /// <summary>
    /// This will never return null, the returned dictionary will have this instance's
    /// KeyNormaliser as its comparer
    /// </summary>
    IDictionary<string, NonNullImmutableList<WeightedEntry<TKey>>> ToDictionary();

    /// <summary>
    /// This will never return null
    /// </summary>
    NonNullOrEmptyStringList GetAllTokens();

    /// <summary>
    /// This will never return null
    /// </summary>
    IEqualityComparer<string> TokenComparer { get; }

    /// <summary>
    /// This will never return null
    /// </summary>
    IEqualityComparer<TKey> KeyComparer { get; }
}

The TokenComparer and KeyComparer are the instances passed into the IndexGenerator's constructor (a DefaultStringNormaliser and an IntEqualityComparer in last time's example). The GetAllTokens method returns a set of tokens that have matches in the IndexData; where multiple tokens in the data are considered equivalent, only one will be in the set returned by GetAllTokens. The example used the DefaultStringNormaliser, which ignores case, so if data for the tokens "Token" and "TOKEN" is present (and encountered in that order) then only "Token" would be in the GetAllTokens set; "TOKEN" wouldn't have been added as a distinct value since it is equivalent to "Token".
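
As a quick sketch of those members in action (given some IIndexData<int> instance named index; the Count property on the returned match list is assumed here):

// Enumerate every distinct (normalised) token in the index with its match count
foreach (var token in index.GetAllTokens())
    Console.WriteLine(token + ": " + index.GetMatches(token).Count + " match(es)");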

The interesting methods in this context are Combine and Remove.

Remove

Remove is the simpler of the two so I'll address that first: A predicate is passed to it which filters which key values should be allowed through, data which passes this filtering will be used to form a new IIndexData instance which will be returned from the method. The original IndexData instance remains unaltered while a filtered version is provided which meets the particular criteria.
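
For example (a sketch, given some IIndexData<int> instance named index):

// Produce a new index that excludes all data for the Article with Key 2; the original
// index instance is left unaltered
var indexWithoutArticle2 = index.Remove(key => key == 2);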

Combine

The Combine method will take one or more additional IIndexData instances (for the same TKey type) and bring all of the content from these and the original index into a new instance describing aggregated data. Where data for the same keys appear in the indexes, the match weights will be combined using a specified "WeightedEntryCombiner" (which just takes a set of floats and returns a single value representing them all; the most common case is to sum the values but they could be averaged or the greatest value taken - whatever's most appropriate!).
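
As a sketch (given two IIndexData<int> instances named indexA and indexB), combining with a summing "WeightedEntryCombiner" might look like:

// Merge the data of indexB into indexA; where the same key matches the same token in
// both indexes, the match weights are summed
var combinedIndex = indexA.Combine(
    new[] { indexB }.ToNonNullImmutableList(),
    weightedValues => weightedValues.Sum()
);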

Pulling an example together

To show these methods in action I've extended the IArticle IndexGenerator concept that I showed in the previous post by wrapping it in another class that maintains an index based upon changing data, keeping a "source data summary" of what keys were used to generate the current data and what the last modified dates of the source data were. I'm aiming to come up with an "IndexBuilder" that will expose the following:

/// <summary>
/// This will never return null
/// </summary>
public IIndexData<TKey> Index { get; }

/// <summary>
/// This will never return null, it will throw an exception for null input
/// </summary>
public IIndexData<TKey> UpdateIndex(NonNullImmutableList<TSource> values);

All the same types will be required in the IndexBuilder constructor that the IndexGenerator required last time (the Content Retrievers, Key Comparer, Token Breaker, String Normaliser, Weighted Entry Combiner and Logger) along with one additional dependency: a "Source Item Status Retriever". This is just a delegate that takes an instance of the generic type parameter TSource and returns a SourceDataSummary instance that reports its Key and a LastModifiedDate (so hardly rocket science!). This will enable the IndexBuilder to maintain a summary of the input data that was used to build the current index and so determine what work (if any) is required when UpdateIndex is called.

If the example code last time didn't look too scary, then neither should this:

// Instantiate an IndexBuilder that will index IArticles (which have ints as their Keys).
// - Content Retrievers describe how to extract data from each IArticle, there is a delegate
//   to retrieve Key and LastModifiedDate from IArticle (the "Source Item Status Retriever"),
//   there's a Token Breaker which breaks up the content, there's a String Normaliser which
//   compares the resulting Tokens to group them together, there's a "Weighted Entry
//   Combiner" which creates an aggregate weight for Tokens that are grouped,
//   there's an IntEqualityComparer that acts as a Key Comparer and there's
//   a Logger. See; nothing to it! :D

var indexBuilder = new IndexBuilder<IArticle, int>(
    new NonNullImmutableList<ContentRetriever<IArticle, int>>(new []
    {
        // Additional weight is given to words matched in the Title
        new ContentRetriever<IArticle, int>(
            article => new PreBrokenContent<int>(article.Id, article.Title),
            token => 5f
        ),
        new ContentRetriever<IArticle, int>(
            article => new PreBrokenContent<int>(article.Id, article.Content),
            token => 1f
        )
    }),
    article => new IndexBuilder<IArticle, int>.SourceDataSummary(
        article.Id,
        article.LastModified
    ),
    new IntEqualityComparer(),
    new WhiteSpaceTokenBreaker(),
    new DefaultStringNormaliser(),
    weightedValues => weightedValues.Sum(),
    new NullLogger()
);

Instead of instantiating an IndexGenerator directly we're going to use the IndexBuilder that I'm describing, and we'll pass data to it thusly:

var articles = new[]
{
    new Article(1, new DateTime(2012, 7, 21, 0, 0, 1), "One", "One Content"),
    new Article(2, new DateTime(2012, 8, 21, 0, 0, 1), "Two", "Two Content"),
    new Article(3, new DateTime(2012, 9, 21, 0, 0, 1), "Three", "Three Content")
};
var index = indexBuilder.UpdateIndex(new NonNullImmutableList<IArticle>(articles));
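
Calling UpdateIndex again later will then only do the work that the source data summary indicates is necessary. As a hedged illustration (the dates here are invented): if Article 2 is modified and Article 3 removed then only Article 2 gets re-indexed -

// Article 2 has a later LastModified date so its content will be re-indexed, Article 3
// is no longer present so its entries will be removed and Article 1 requires no work
var changedArticles = new[]
{
    new Article(1, new DateTime(2012, 7, 21, 0, 0, 1), "One", "One Content"),
    new Article(2, new DateTime(2012, 9, 1, 0, 0, 1), "Two", "Two Content (revised)")
};
index = indexBuilder.UpdateIndex(new NonNullImmutableList<IArticle>(changedArticles));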

The source data types are not very interesting and are here only for completeness of the example, to be honest!

public class Article : IArticle
{
    public Article(int id, DateTime lastModified, string title, string content)
    {
        if (string.IsNullOrWhiteSpace(title))
            throw new ArgumentException("Null/blank title specified");
        if (string.IsNullOrWhiteSpace(content))
            throw new ArgumentException("Null/blank content specified");

        Id = id;
        LastModified = lastModified;
        Title = title.Trim();
        Content = content.Trim();
    }

    public int Id { get; private set; }

    public DateTime LastModified { get; private set; }

    /// <summary>
    /// This will never be null or blank
    /// </summary>
    public string Title { get; private set; }

    /// <summary>
    /// This will never be null or blank
    /// </summary>
    public string Content { get; private set; }
}

public interface IArticle
{
    int Id { get; }

    DateTime LastModified { get; }

    /// <summary>
    /// This will never be null or blank
    /// </summary>
    string Title { get; }

    /// <summary>
    /// This will never be null or blank
    /// </summary>
    string Content { get; }
}

And finally (finally!) the IndexBuilder itself. The constructor takes up a chunk of space, validating all of the input. Then there's a few lines taken up by the definition of the SourceItemStatusRetriever and SourceDataSummary class. At the end there's the UpdateIndex method which determines what work needs to be done to get its IndexData instance to match the new source data - and it uses the Remove and Combine methods to synchronise the index with the data:

using System;
using System.Collections.Generic;
using System.Linq;
using Common.Lists;
using Common.Logging;
using FullTextIndexer.Indexes;
using FullTextIndexer.Indexes.TernarySearchTree;
using FullTextIndexer.IndexGenerators;
using FullTextIndexer.TokenBreaking;

namespace FullTextIndexerDemo
{
    public class IndexBuilder<TSource, TKey> where TSource : class
    {
        private List<ContentRetriever<TSource, TKey>> _contentRetrievers;
        private SourceItemStatusRetriever _sourceItemStatusRetriever;
        private IEqualityComparer<TKey> _keyComparer;
        private ITokenBreaker _tokenBreaker;
        private IStringNormaliser _stringNormaliser;
        private IndexGenerator.WeightedEntryCombiner _weightedEntryCombiner;
        private IIndexGenerator<TSource, TKey> _indexGenerator;
        private IIndexData<TKey> _index;
        private Dictionary<TKey, DateTime> _sourceDataSummary;
        private object _writeLock;
        public IndexBuilder(
            NonNullImmutableList<ContentRetriever<TSource, TKey>> contentRetrievers,
            SourceItemStatusRetriever sourceItemStatusRetriever,
            IEqualityComparer<TKey> keyComparer,
            ITokenBreaker tokenBreaker,
            IStringNormaliser stringNormaliser,
            IndexGenerator.WeightedEntryCombiner weightedEntryCombiner,
            ILogger logger)
        {
            if (contentRetrievers == null)
                throw new ArgumentNullException("contentRetrievers");
            if (!contentRetrievers.Any())
                throw new ArgumentException("No contentRetrievers specified");
            if (sourceItemStatusRetriever == null)
                throw new ArgumentNullException("sourceItemStatusRetriever");
            if (keyComparer == null)
                throw new ArgumentNullException("keyComparer");
            if (tokenBreaker == null)
                throw new ArgumentNullException("tokenBreaker");
            if (stringNormaliser == null)
                throw new ArgumentNullException("stringNormaliser");
            if (weightedEntryCombiner == null)
                throw new ArgumentNullException("weightedEntryCombiner");
            if (logger == null)
                throw new ArgumentNullException("logger");

            var contentRetrieversTidied = new List<ContentRetriever<TSource, TKey>>();
            foreach (var contentRetriever in contentRetrievers)
            {
                if (contentRetriever == null)
                    throw new ArgumentException("Null encountered in contentRetrievers set");
                contentRetrieversTidied.Add(contentRetriever);
            }
            if (!contentRetrieversTidied.Any())
                throw new ArgumentException("No contentRetrievers specified");

            _contentRetrievers = contentRetrieversTidied;
            _sourceItemStatusRetriever = sourceItemStatusRetriever;
            _keyComparer = keyComparer;
            _tokenBreaker = tokenBreaker;
            _stringNormaliser = stringNormaliser;
            _weightedEntryCombiner = weightedEntryCombiner;
            _sourceDataSummary = new Dictionary<TKey, DateTime>(keyComparer);
            _writeLock = new object();

            _indexGenerator = new IndexGenerator<TSource, TKey>(
                contentRetrieversTidied.ToNonNullImmutableList(),
                keyComparer,
                stringNormaliser,
                tokenBreaker,
                weightedEntryCombiner,
                logger
            );
            _index = _indexGenerator.Generate(new NonNullImmutableList<TSource>());
        }

        /// <summary>
        /// This will never be called with a null source reference, it must never return null
        /// </summary>
        public delegate SourceDataSummary SourceItemStatusRetriever(TSource source);
        public class SourceDataSummary
        {
            public SourceDataSummary(TKey key, DateTime lastModified)
            {
                if (key == null)
                    throw new ArgumentNullException("key");

                Key = key;
                LastModified = lastModified;
            }
            public TKey Key { get; private set; }
            public DateTime LastModified { get; private set; }
        }

        /// <summary>
        /// This will never return null
        /// </summary>
        public IIndexData<TKey> Index
        {
            get
            {
                // Since the index is an immutable data type we don't need to worry about
                // locking it for read access
                return _index;
            }
        }

        /// <summary>
        /// This will never return null, it will throw an exception for null input
        /// </summary>
        public IIndexData<TKey> UpdateIndex(NonNullImmutableList<TSource> values)
        {
            if (values == null)
                throw new ArgumentNullException("values");

            var newIndexSummary = values
                .Select(value => _sourceItemStatusRetriever(value))
                .GroupBy(
                    summary => summary.Key,
                    _keyComparer
                )
                .ToDictionary(
                    group => group.Key,
                    group => group.Max(summary => summary.LastModified),
                    _keyComparer
                );

            lock (_writeLock)
            {
                // No source data, so just generate an empty index
                if (!newIndexSummary.Any())
                {
                    _sourceDataSummary = newIndexSummary;
                    _index = _indexGenerator.Generate(new NonNullImmutableList<TSource>());
                    return _index;
                }

                // There will (probably) be some keys to remove entirely, some that have to
                // be removed so that they can be replaced with updated content and some that
                // are not present in the existing data. First determine which keys fall into
                // which category (if any).
                var keysToRemove = new HashSet<TKey>(
                    _sourceDataSummary
                        .Select(summary => summary.Key)
                        .Except(newIndexSummary.Select(summary => summary.Key)),
                    _keyComparer
                );
                var keysToUpdate = new HashSet<TKey>(
                    _sourceDataSummary
                        .Where(summary =>
                        {
                            DateTime newSummaryLastModified;
                            if (!newIndexSummary.TryGetValue(
                                summary.Key, out newSummaryLastModified))
                            {
                                return false;
                            }
                            return newSummaryLastModified > summary.Value;
                        })
                        .Select(summary => summary.Key),
                    _keyComparer
                );
                var keysToAdd = new HashSet<TKey>(
                    newIndexSummary.Keys.Except(_sourceDataSummary.Keys),
                    _keyComparer
                );
                if (!keysToAdd.Any() && !keysToRemove.Any() && !keysToUpdate.Any())
                {
                    // If there are no actions to take then don't do any work!
                    return _index;
                }

                // Prepare the new data to insert
                var indexDataToUpdate = _indexGenerator.Generate(
                    values
                        .Where(value =>
                        {
                            var key = _sourceItemStatusRetriever(value).Key;
                            return keysToUpdate.Contains(key) || keysToAdd.Contains(key);
                        })
                        .ToNonNullImmutableList()
                );

                // Update the index content by removing keys and combining in the newly
                // generated content
                _index = _index
                    .Remove(key => keysToRemove.Contains(key) || keysToUpdate.Contains(key))
                    .Combine(
                        new[] { indexDataToUpdate }.ToNonNullImmutableList(),
                        _weightedEntryCombiner
                    );

                // All that's left is to update the source data summary and return the
                // new data!
                _sourceDataSummary = newIndexSummary;
                return _index;
            }
        }
    }
}

In case this class needs to be used in a multi-threaded environment, there is a write lock used for any calls to UpdateIndex but requests for the Index property (for reading only) require no locks, since the IndexData is an immutable structure! (This includes the case where index data may be cached in memory and shared between web requests, which is an implicit multi-threading scenario rather than an explicit situation where you may be dealing with Threads / ThreadPools / whatever yourself).

(If we were nitpicking then we could be concerned that there's no way to ensure that the TKey type is immutable and so the weighted entries described by the IndexData could feasibly change - but in this case they're ints, so there's nothing to worry about, and in other cases I would strongly suggest an immutable type be used for TKey at all times. Next time I'm going to cover more complex TKey possibilities in The Full Text Indexer - Going International!).

Update (17th December 2012): This has been included as part of a later Full Text Indexer Round-up Post that brings together several Posts into one series, incorporating code and techniques from each of them.

Posted at 21:37


The Full Text Indexer

I started out on a journey a few months ago being frustrated by the Lucene.net integration we had with one of our products at work (I'm not badmouthing the Lucene project, I'm wholeheartedly blaming the integration I inherited!) and wondering just how difficult it would be to write a Full Text Indexer which could analyse generic content and generate some form of structure which could look up strings and assign weights to the source material, offering the best matches.

And now I've got it to the point that I've tried the resulting solution out in a variety of configurations and am using it to handle searches on this blog and incorporating an autocomplete functionality (that may or may not benefit from some more tweaking yet) to go with it. I'm quite proud of it!

Before I say any more, this was written to deal with the search tasks I needed and as such is not a direct replacement for Lucene, it's just an alternative I wanted to explore (for example I know that Lucene makes big claims about the number of documents it can maintain, I'm in no position right now to make any sorts of boasts on that scale!).

Here's a really basic example that would analyse data from:

public interface IArticle
{
    int Id { get; }

    /// <summary>
    /// This will never be null or blank
    /// </summary>
    string Title { get; }

    /// <summary>
    /// This will never be null or blank
    /// </summary>
    string Content { get; }
}

and generate an IIndexData<int> instance which has this method (among others, but this is all we need for this first example):

public interface IIndexData<TKey>
{
    /// <summary>
    /// This will throw an exception for null or blank input. It will never return null.
    /// If there are no matches then an empty list will be returned.
    /// </summary>
    NonNullImmutableList<WeightedEntry<TKey>> GetMatches(string source);
}

by defining "Content Retrievers" (which extract sections of keyed content; meaning content that is associated with a Key that represents each source data item), a "Key Comparer" (which defines which keyed content instances belong to the same data item in order to group results together), a "Token Breaker" (which reduces content strings into individual words), a "String Normaliser" (which compares individual words in order to group them together but will also be used to compare values passed to the GetMatches method shown above) and "Weighted Entry Combiner" (which describes how tokens that appear multiple times for the same data item should have their weights combined):

using System;
using System.Collections.Generic;
using System.Linq;
using Common.Lists;
using Common.Logging;
using FullTextIndexer.Indexes;
using FullTextIndexer.Indexes.TernarySearchTree;
using FullTextIndexer.IndexGenerators;
using FullTextIndexer.TokenBreaking;

namespace FullTextIndexerDemo
{
    public class Example
    {
        public IIndexData<int> GetIndex(NonNullImmutableList<IArticle> articles)
        {
            if (articles == null)
                throw new ArgumentNullException("articles");

            return GetIndexGenerator().Generate(articles);
        }

        private IIndexGenerator<IArticle, int> GetIndexGenerator()
        {
            var contentRetrievers = new List<ContentRetriever<IArticle, int>>
            {
                new ContentRetriever<IArticle, int>(
                    article => new PreBrokenContent<int>(article.Id, article.Title),
                    token => 5f
                ),
                new ContentRetriever<IArticle, int>(
                    article => new PreBrokenContent<int>(article.Id, article.Content),
                    token => 1f
                )
            };

            return new IndexGenerator<IArticle, int>(
                contentRetrievers.ToNonNullImmutableList(),
                new IntEqualityComparer(),
                new DefaultStringNormaliser(),
                new WhiteSpaceTokenBreaker(),
                weightedValues => weightedValues.Sum(),
                new NullLogger()
            );
        }

        private class IntEqualityComparer : IEqualityComparer<int>
        {
            public bool Equals(int x, int y) { return (x == y); }
            public int GetHashCode(int obj) { return obj; }
        }
    }
}

That was a lot of jargon that took more work to write than the code example! :)

Content Retrievers

These describe two simple things: a method to extract a particular content string from an item (along with a Key for that item) and a method to assign a weight to each token that is extracted from that content after it's been passed through a Token Breaker. In this example, more weight is given to "tokens" (for the time being we can take this to refer to a single word) matched in the Article Title than in the Article Content. Each Content Retriever can return null for a given Article if there is no content to retrieve for that instance - eg. if IArticle had an optional CategoryName property then a Content Retriever might return null for instances that have no Category assigned.
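
Such an optional retriever might look like this (a sketch; "CategoryName" is a hypothetical property that the IArticle interface above doesn't actually declare):

// "CategoryName" is a hypothetical optional property; when it has no value this
// retriever returns null and so contributes no content for that Article
new ContentRetriever<IArticle, int>(
    article => string.IsNullOrWhiteSpace(article.CategoryName)
        ? null
        : new PreBrokenContent<int>(article.Id, article.CategoryName),
    token => 2f
)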

Key Comparer

Here, the Key uniquely representing each Article is the Id value so the Key Comparer for this example is just an IEqualityComparer<int> implementation that compares integers - easy.

Token Breakers

This example uses a "WhiteSpaceTokenBreaker" which will take the string content from the Content Retrievers and break it into individual words by splitting on whitespace characters. Straightforward!

String Normaliser

The String Normaliser is essentially an IEqualityComparer<string> and will be used to generate a lookup of tokens and compare them against values passed into the GetMatches method. The DefaultStringNormaliser will remove all punctuation, exchange all non-latin characters for latin equivalents and lower-case them all. For the most basic lookups I had in mind, this does the hard work.
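
As a rough sketch of that behaviour (using the normaliser directly as the IEqualityComparer<string> that it essentially is):

var normaliser = new DefaultStringNormaliser();

// Punctuation is removed, the non-latin "é" is exchanged for "e" and casing is
// ignored, so these two values are considered equivalent tokens
var areEquivalent = normaliser.Equals("Café!", "cafe"); // true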

Weighted Entry

The Weighted Entry is a combination of a Key that identifies a data item and a numeric weight indicating the quality of the match; always positive and the higher the better.

Weighted Entry Combiner

This takes a set of match weights for a given token and must return a single value representing their combined total. In the example here I've just taken a sum of them, so if an Article has the word "giraffe" once in the Title and three times in the Content and "giraffe" was searched for, then match weights 5, 1, 1, 1 would be combined into 8 but it may be equally valid to take the maximum weight instead of considering Articles to be a better match if they have the same word more times (in which case "weightedValues => weightedValues.Max()" would be specified).
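
To make the arithmetic concrete (a plain illustration of the weights in that giraffe example):

var giraffeWeights = new[] { 5f, 1f, 1f, 1f }; // one Title match, three Content matches
var summedWeight = giraffeWeights.Sum(); // 8 - repeat occurrences improve the match
var maxWeight = giraffeWeights.Max();    // 5 - repeat occurrences are ignored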

The Logger

While the index is being generated, status messages may be logged such as "Work completed: x%". This example ignores all log messages by passing a NullLogger to the index generator.

Customisation / Variations

This is a very basic example but it can be extended easily to handle other requirements or data structures. In general the Content Retrievers and Key Comparer are altered to deal with different input data while the Token Breakers, String Normaliser and Weighted Entry Combiner are varied to process that extracted data in a different manner.

The "English Language Plurality String Normaliser (link to Bitbucket file)" (which I've gone on about at considerable length in previous posts) could replace the DefaultStringNormaliser (or wrap it, since it will take an "optionalPreNormaliser" as a constructor argument) so that the token matching is more flexible; searching for "giraffes" would now match an Article that included the word "giraffe" in the Title and/or Content even if it didn't also include "giraffes".

Likewise, the WhiteSpaceTokenBreaker could be replaced with an alternative implementation that also breaks on commas (for content that doesn't also follow commas with spaces) or on the various bracket characters (especially useful for breaking content that includes code samples; so that "List<string>" is broken down into "List" and "string"). This can be done with the "WhiteSpaceExtendingTokenBreaker" class which replaces a fixed (but customisable) set of characters with spaces and then passes off processing to another Token Breaker (usually a WhiteSpaceTokenBreaker) to handle the altered content.
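
For example (another sketch; the precise constructor signature is an assumption - again, see the Bitbucket source):

// Treat brackets and commas as whitespace so that "List<string>" is broken down into
// "List" and "string", then hand the altered content to a WhiteSpaceTokenBreaker
var tokenBreaker = new WhiteSpaceExtendingTokenBreaker(
    new ImmutableList<char>(new[] { '<', '>', '[', ']', '(', ')', '{', '}', ',' }),
    new WhiteSpaceTokenBreaker()
);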

Multi-word Matching

With the above configuration, only single words would yield results when GetMatches is called on the index data, since all of the content is broken into single words and so any multi-word "source" strings would fail to be matched without additional processing. For cases where the order of the words in a multi-word term is not important, there is an IIndexData<TKey> extension method:

/// <summary>
/// This will break a given source string and return results based upon the combination of
/// partial matches (so results that only match part of the source string may be included
/// in the returned data). The token breaker and the match combiner must be specified by the
/// caller - if the match combiner returns zero then the result will not be included in the
/// final data. To require that all tokens in the source content be present for any returned
/// results, the following matchCombiner could be specified:
///
///  (tokenMatches, allTokens) =>
///    (tokenMatches.Count < allTokens.Count)
///      ? 0 : tokenMatches.SelectMany(m => m.Weights).Sum()
///
/// </summary>
public static NonNullImmutableList<WeightedEntry<TKey>> GetPartialMatches<TKey>(
    this IIndexData<TKey> index,
    string source,
    ITokenBreaker tokenBreaker,
    MatchCombiner matchCombiner
)

If this is called with the same Token Breaker as used by the index generator then the multi-word search term can be split up in the same manner and each resulting token searched for in the index. Then a combined weight must be determined for each matched token; this calculation is handled by the specified MatchCombiner. I won't go into too much detail about it here - I may do another time, or you can look at the code for the nitty gritty (there's a Bitbucket link at the bottom of this post). I think the most common case is that illustrated in the summary comment in the code above, where all tokens that result from breaking the search term must be matched in order for results to be considered valid and where the combined weight of valid matches is taken by summing the weights of all of the component matches.
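
Putting that together, such a call might look like the following (a sketch using the matchCombiner from the summary comment above):

// Break the search term on whitespace and only return results that contain matches for
// every resulting token, summing the weights of the individual token matches
var matches = index.GetPartialMatches(
    "full text indexer",
    new WhiteSpaceTokenBreaker(),
    (tokenMatches, allTokens) =>
        (tokenMatches.Count < allTokens.Count)
            ? 0
            : tokenMatches.SelectMany(m => m.Weights).Sum()
);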

Still to come..

This has still only touched on a simple use case. I'm hoping to cover in future posts how an index could handle multi-lingual content, how it could handle multi-word matching that increases the weight of the matching if tokens that are adjacent in the search term appear adjacent in the source data, how the index can be updated or have items added and removed, how the AutoComplete on this blog is generated and how the term highlighting on the search page works! Who knows, I may even go right off in the deep end and contemplate writing a search term parser that can perform searches on the index with quoted terms, boolean operators and who knows what! But that's all definitely for another day :)

The Code!

The code for this project is all available at Bitbucket: "The Full Text Indexer".

Update (17th December 2012): This has been included as part of a later Full Text Indexer Round-up Post that brings together several Posts into one series, incorporating code and techniques from each of them.

Posted at 20:33
