Productive Rage

Dan's techie ramblings

Persistent Immutable Lists

I've written before about immutable data structures (I'm all for them; see I love Immutable Data and Problems in Immutability-land) but watching a talk recently by Clojure-creator Rich Hickey made me think about one particular area again recently. In that first post I put up some example cost for an Immutable List that wrapped the .Net List<T> class - this was very simple to implement and understand, and in many cases I was using the immutable class as a return type or a method argument which meant that the instance would be built once and futher manipulations would be limited. This meant that I wasn't too concerned with internally creating a new list instance each time a new immutable instance was required and copying the references over.

However, in this talk it was reiterated that all of the core data structures in Clojure were intended to be immutable and that considerable work was done to ensure that the performance of these structures was sufficient that it could compete with Java and C#. A persistent linked list structure was used so that operations could be performed without having to recreate the entire dataset.

This is something that I didn't know a huge amount about but sounded like it could be an interesting avenue to explore!

A basic introduction into the singly-linked list

The singly-linked list is a fairly basic structure built around nodes; each node has a value and a link to the next node, if there is one. We know we're at the end of the list when the current node has a null "next" node reference.

An empty list would have a null "head" node.

An int list with a single item would have a head node of the form

{ 1, null }

where the value of the item is 1 and there is no next node.

An int list with two items could be illustrated as

{ 1, { 2, null } }

And one with four values as

{ 1, { 2, { 3, { 4, null } } } }

Well, you get the idea!

The interesting thing comes when we look at how the structure changes as items are added. Starting off with an empty list and adding items one at a time to the front of the list, the structure grows in this manner:

{ 1, null }

{ 2, { 1, null } }

{ 3, { 2, { 1, null } } }

{ 4, { 3, { 2, { 1, null } } } }

Each time we take a list L0 and create a new instance L1 by adding a single item, the head node of L1 can be taken to be a new node that contains the new value and whose "next" reference points to the head node of L0. This is where the "persistent" part comes into play. (This is only possible if the nodes themselves are immutable as otherwise one instance of the list could affect the data in another instance if they shared node chain references where the nodes were not immutable).

This means that creating a new list with a new item is a very simple and fast action! This operation is considerably faster than the doing the same on the original immutable list approach I was using, especially as the size of the list grows.

Enumerating the list is also straight-forward; we start at the head node (if non-null) and then walk down the "next" references until we hit a null, indicating the end of the list.

A basic implementation of this could be:

public class SimplePersistentImmutableList<T> : IEnumerable<T>
{
    private readonly Node _head;
    public SimplePersistentImmutableList() : this(null) { }
    private SimplePersistentImmutableList(Node head)
    {
        _head = head;
    }

    public SimplePersistentImmutableList<T> InsertAtStart(T value)
    {
        return new SimplePersistentImmutableList<T>(
            new Node(value, _head)
        );
    }

    public IEnumerator<T> GetEnumerator()
    {
        return new SetEnumerator(_head);
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }

    private class Node
    {
        public Node(T value, Node next)
        {
            Value = value;
            Next = next;
        }

        public T Value { get; private set; }

        /// <summary>
        /// This will be null if there is no next node
        /// </summary>
        public Node Next { get; private set; }
    }

    private class SetEnumerator : IEnumerator<T>
    {
        private readonly Node _topNode;
        private Node _currentNode;
        public SetEnumerator(Node topNode)
        {
            // The way that the enumeration operates is that it will call MoveNext before
            // trying to retrieve the first value when processing a foreach loop to ensure
            // that data is present. In order to deal with this, we need to wrap the Top
            // Node in another node so that the first MoveNext call moves us to the start
            // of the data.
            _topNode = new Node(default(T), topNode);
            _currentNode = _topNode;
        }

        public void Dispose() { }

        public T Current
        {
            get
            {
                if (_currentNode == null)
                    throw new InvalidOperationException("No Current value");
                return _currentNode.Value;
            }
        }

        object IEnumerator.Current
        {
            get { return Current; }
        }

        public bool MoveNext()
        {
            if ((_currentNode == null) || (_currentNode.Next == null))
                return false;
            _currentNode = _currentNode.Next;
            return true;
        }

        public void Reset()
        {
            _currentNode = _topNode;
        }
    }
}

And most of that code is the implementation of IEnumerable!

Limitations

This example only exposes an InsertAtStart method as a manner in which to alter the list. An obvious counterpart would be to add a RemoveFromStart method, since all that need do is create a new list whose head node is the "next" node of the head node of the current list (if the head node of the intial list was null then there are no items, and so RemoveFromStart would be invalid).

public SimplePersistentImmutableList<T> RemoveFirstItem()
{
    if (_head == null)
        throw new InvalidOperationException("The list is empty");

    return new SimplePersistentImmutableList<T>(
        _head.Next
    );
}

At this point, we could very easily take this code and create an immutable stack by renaming "InsertAtStart" to "Push", "RemoveFromStart" to "Pop" and adding in a way to retrieve the current value, if any:

public T Peek()
{
    if (_head == null)
        throw new InvalidOperationException("The list is empty");

    return _head.Value;
}

public bool IsEmpty
{
    get
    {
        return (_head == null);
    }
}

However, to support the other actions that are expected from a list such as inserting-into and removing-from arbitrary locations we need to consider how to find the appropriate place in the node chain from which to snip out values or insert new ones. Unless these operations are to remove the first item(s) from a list or to add some to the start of the list, only some of the existing chain may be shared between the current and new instances.

For example, to add the value 99 into index 2 of the list that is described by the following node chain

{ 3, { 2, { 1, { 0, null } } } }

then we'd need to end up with the chain

{ 3, { 2, { 99, { 1, { 0, null } } } } }

managing to re-use only the last two nodes of the existing chain.

This brings me onto the issue that I have with the above implementation; it's my gut feeling that the majority of operations that I might perform on a list are generating an immutable list from a mutable set, adding items to the end of an existing list and enumerating through the values. Keeping a reference to the head node means that every time a new value is added to the end of the list, none of the chain may be persisted. So to optimise for this operation we can store a reference to the tail of the chain. Now the same logic from the InsertAtStart method becomes the Add method:

public SimplePersistentImmutableList<T> Add(T value)
{
    return new SimplePersistentImmutableList<T>(
        new Node(value, _tail)
    );
}

so long as the Node class is also altered to reflect this reversed nature:

private class Node
{
    public Node(T value, Node previous)
    {
        Value = value;
        Previous = previous;
    }

    public T Value { get; private set; }

    /// <summary>
    /// This will be null if there is no previous node
    /// </summary>
    public Node Previous { get; private set; }
}

This does raise one thorny issue, though; we have to re-think enumeration of the list since we can only step backwards through the list as the "master" node reference we store is the tail. A simple approach would be as follows:

public IEnumerator<T> GetEnumerator()
{
    var values = new List<T>();
    var node = _tail;
    while (_tail != null)
    {
        values.Insert(0, node.Value);
        node = node.Previous;
    }
    return values.GetEnumerator();
}

This makes enumeration potentially an expensive operation, especially if there are a large number of items in the set since a new List is built and populated for each enumeration. And if there are a lot of items to deal with then the list may have to resize its internal array multiple times (with a copy operation from one array to the next) since we don't know up front how large the list needs to be.

To address this, I'm going to make two changes. Firstly, the Node class will be given a Count property which is always the Count of the previous Node plus one, unless the previous Node is null in which case the Count is one.

private class Node
{
    public Node(T value, Node previous)
    {
        Value = value;
        Previous = previous;
        Count = (previous == null) ? 1 : (previous.Count + 1);
    }

    public T Value { get; private set; }

    /// <summary>
    /// This will be null if there is no previous node
    /// the head)
    /// </summary>
    public Node Previous { get; private set; }

    public int Count { get; private set; }
}

Secondly, I'm going to introduce a class member array "_allValues" which is only populated the first time that an enumeration is requested and that effectively caches the value set in an easily-enumerable format. This is only populated "on demand" to avoid any overhead where it is generated for a list that will never be enumerated over (if an instance L0 has a value added to it, resulting in L1, which then has a further value added to it, resulting in L2, we don't want to waste time generating the "_allValues" array for L1 if the reference to L1 is dropped when L2 is created).

/// <summary>
/// For enumerating the values we need to walk through all of the nodes and then reverse the
/// set (since we start with the tail and work backwards). This can be relatively expensive
/// so the list is cached in the "_allValues" member array so that subsequent requests are
/// fast (mightn't be a big deal for a single enumeration of the contents but it could
/// be for multiple calls to the indexed property, for example).
/// </summary>
private void EnsureAllValuesDataIsPopulated()
{
    if (_allValues != null)
        return;

    // Since we start at the tail and work backwards, we need to reverse the order of the
    // items in values array that is populated here
    var numberOfValues = Count;
    var values = new T[numberOfValues];
    var node = _tail;
    for (var index = 0; index < numberOfValues; index++)
    {
        values[(numberOfValues - 1) - index] = node.Value;
        node = node.Previous;
    }
    _allValues = values;
}

The Count property of the node allows an array to be initialised of the required size since now we know the required size. The "_allValues" array is set to null in the constructor and this EnsureAllValuesDataIsPopulated method must be called before anything references it (eg. the GetEnumerator method).

There's something potentially a bit hairy in this, though, as the internals of the class are no longer immutable and so could we be open to crazy things happening in multi-threaded scenarios? Joseph Albahari's Advanced Threading article shows a scary first example and Jon Skeet's Implementing the Singleton Pattern in C# has an example with code that looks very similar to what we're doing here, and that's clearly marked as not thread-safe. The first example illustrates how issues may arise as the "compiler, CLR or CPU may _reorder_ your program's instructions to improve efficiency" but "C# and the runtime are very careful to ensure that such optimizations don’t break ordinary single-threaded code" so in this case we needn't worry as there is only one "_allValues" reference being compared to null and then set and no significant rearrangement could be made that wouldn't affect single-threaded processing. In the Singleton example, the issue is that the work could potentially be performed multiple times if multiple threads checked for null before any thread had completed the work and set the "_allValues" reference. For the lock-free reading that be possible result when "_allValues" has been set, I'm happy with the trade-off in this case. (If multiple threads have to do the work of generating the array while they're all clamouring for the "_allValues" data at the same time that's fine since once they finish, subsequent requests will be able to access the pre-generated array with no locking or other complications). If I wasn't happy with it then I'd probably use the .Net 4.0 Lazy<T> construct I've talked about before (see Check, check it out) but this could potentially add some overhead for each new instance of the immutable list, which I wanted to avoid for instances that will never be enumerated over.

public class PersistentImmutableList<T> : IEnumerable<T>
{
    private readonly Node _tail;
    private T[] _allValues;

    public PersistentImmutableList() : this((Node)null) { }
    public PersistentImmutableList(IEnumerable<T> values)
    {
        if (values == null)
            throw new ArgumentNullException("values");

        Node node = null;
        foreach (var value in values)
        {
            if (node == null)
                node = new Node(value, null);
            else
                node = new Node(value, node);
        }
        _tail = node;
    }
    private PersistentImmutableList(Node tail)
    {
        _tail = tail;
    }

    public int Count
    {
        get { return (_tail == null) ? 0 : _tail.Count; }
    }

    public PersistentImmutableList<T> Add(T value)
    {
        return AddRange(new[] { value });
    }

    public PersistentImmutableList<T> AddRange(IEnumerable<T> values)
    {
        if (values == null)
            throw new ArgumentNullException("values");

        var node = _tail;
        foreach (var value in values)
            node = new Node(value, _tail);
        return new PersistentImmutableList<T>(node);
    }

    public IEnumerator<T> GetEnumerator()
    {
        // As documented at http://msdn.microsoft.com/en-us/library/system.array.aspx, from
        // .Net 2.0 onward, the Array class implements IEnumerable<T> but this is only
        // provided at runtime so we have to explicitly cast access its generic
        // GetEnumerator method
        EnsureAllValuesDataIsPopulated();
        return ((IEnumerable<T>)_allValues).GetEnumerator();
    }
    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }

    /// <summary>
    /// For enumerating the values we need to walk through all of the nodes and then reverse
    /// the set (since we start with the tail and work backwards). This can be relatively
    /// expensive so the list is cached in the "_allValues" member array so that subsequent
    /// requests are fast (mightn't be a big deal for a single enumeration of the contents
    /// but it could be for multiple calls to the indexed property).
    /// </summary>
    private void EnsureAllValuesDataIsPopulated()
    {
        if (_allValues != null)
            return;

        // Since we start at the tail and work backwards, we need to reverse the order of
        // the items in values array that is populated here
        var numberOfValues = Count;
        var values = new T[numberOfValues];
        var node = _tail;
        for (var index = 0; index < numberOfValues; index++)
        {
            values[(numberOfValues - 1) - index] = node.Value;
            node = node.Previous;
        }
        _allValues = values;
    }

    private class Node
    {
        public Node(T value, Node previous)
        {
            Value = value;
            Previous = previous;
            Count = (previous == null) ? 1 : (previous.Count + 1);
        }

        public T Value { get; private set; }

        /// <summary>
        /// This will be null if there is no previous node
        /// the head)
        /// </summary>
        public Node Previous { get; private set; }

        public int Count { get; private set; }
    }
}

Having a Count property on the Node enables the immutable list to expose a Count property without having to recursively loop through the nodes.

Rounding it out

Since we have a "_tail" Node reference and each Node has a Previous property, this Count on the Node represents the number of items in the list up to and including the current Node. So the tail Node's Count is the number of items in the entire list, the Count property on the node before the tail (if any) would have a Count value of one less - indicating the number of Nodes there are one place before the tail Node. I mention this is because I hope it makes the following methods easier to follow!

public PersistentImmutableList<T> InsertAt(T value, int insertAtIndex)
{
    return InsertAt(new[] { value }, insertAtIndex);
}

public PersistentImmutableList<T> InsertAt(IEnumerable<T> values, int insertAtIndex)
{
    if (values == null)
        throw new ArgumentNullException("values");
    if (!values.Any())
        return this;
    if ((insertAtIndex < 0) || (insertAtIndex > Count))
        throw new ArgumentOutOfRangeException("insertAtIndex");

    // If the insertion is at the end of the list then we can use AddRange and avoid any
    // messing about
    if (insertAtIndex == Count)
        return AddRange(values);

    // Starting with the tail, walk back to the insertion point, recording the values we
    // pass over
    var node = _tail;
    var valuesBeforeInsertionPoint = new T[Count - insertAtIndex];
    for (var index = 0; index < valuesBeforeInsertionPoint.Length; index++)
    {
        valuesBeforeInsertionPoint[index] = node.Value;
        node = node.Previous;
    }

    // Any existing node chain before the insertion point can be persisted and the new
    // value(s) appended
    foreach (var value in values)
        node = new Node(value, node);

    // Finally, add back the values we walked through before to complete the chain
    for (var index = valuesBeforeInsertionPoint.Length - 1; index >= 0; index--)
        node = new Node(valuesBeforeInsertionPoint[index], node);
    return new PersistentImmutableList<T>(node);
}

To insert into an arbitrary location in the list, we need to walk backwards from the tail to the insertion point and then insert the new value(s) by persisting the rest of the node chain (from the insertion point up to the head) and appending the new values and then the values which we have to walk through to get to the insertion point. The nodes from the tail to the insertion point can not be maintained as their "Previous" chain will not include the new values!

A very similar approach may be taken to removals:

public PersistentImmutableList<T> RemoveAt(int removeAtIndex)
{
    return RemoveRange(removeAtIndex, 1);
}

public PersistentImmutableList<T> RemoveRange(int removeAtIndex, int count)
{
    if (removeAtIndex < 0)
        throw new ArgumentOutOfRangeException(
            "removeAtIndex",
            "must be greater than or equal zero"
        );
    if (count <= 0)
        throw new ArgumentOutOfRangeException("count", "must be greater than zero");
    if ((removeAtIndex + count) > Count)
        throw new ArgumentException("removeAtIndex + count must not exceed Count");

    // Starting with the tail, walk back to the end of the removal range, recording the
    // values we passed over
    var node = _tail;
    var valuesBeforeRemovalRange = new T[Count - (removeAtIndex + count)];
    for (var index = 0; index < valuesBeforeRemovalRange.Length; index++)
    {
        valuesBeforeRemovalRange[index] = node.Value;
        node = node.Previous;
    }

    // Move past the values in the removal range
    for (var index = 0; index < count; index++)
        node = node.Previous;

    // Now add back the values we walked through above to the part of the chain that can be
    // persisted
    for (var index = valuesBeforeRemovalRange.Length - 1; index >= 0; index--)
        node = new Node(valuesBeforeRemovalRange[index], node);
    return new PersistentImmutableList<T>(node);
}

And really, that's most of the complication out of the way! We can still flesh out a few more properties like an index property:

public T this[int index]
{
    get
    {
        if ((index < 0) || (index >= Count))
            throw new ArgumentNullException("index");

        EnsureAllValuesDataIsPopulated();
        return _allValues[index];
    }
}

and a sort method:

public PersistentImmutableList<T> Sort(IComparer<T> comparison)
{
    if (comparison == null)
        throw new ArgumentNullException("comparison");

    EnsureAllValuesDataIsPopulated();
    return new PersistentImmutableList<T>(
        _allValues.OrderBy(x => x, comparison)
    );
}

but we're getting down to icing-on-the-cake now.

Conclusion

I've enjoyed this little foray and intend to replace that old simple (effective but slow) immutable list I was using before with a version of this code! In existing code that used the previous implementation, there was a measurable performance hit in some loops where lists were being built up in a method before being returned - I rewrote these to use a mutable list internally and return an immutable representation when the work was complete (because of this performance hit). But now I think I could probably get away with using this new immutable list throughout method internals as well! I need to do some profiling of previously-seen trouble areas to be sure, but I get the sneaky feeling that in some of the larger data sets where performance was seen to be taking a hit, this new immutable list variation may work even better than the built-in mutable list. And that, I'm very happy with! :)

Posted at 20:33

Comments

The WinDbg Blues

To investigate some cpu-off-the-charts hanging issue that refuses to reproduce itself in a local environment in a project I'm involved with I've had to use WinDbg to interrogate a process dump for clues. On the one hand, that all is information is available can seem amazing at times. Other times it's frustrating how primitive it feels compared to poking around with the Visual Studio debugger!

One of the problems I have is that I only do this kind of investigation infrequently so I never quite internalise all the tricks that I need from one time to the next. The first that for some minidumps there seems to be a version mismatch problem with I try to ".loadby sos mscorwks", resulting in the following error:

CLRDLL: c:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\mscordacwks.dll:2.0.50727.5448 f:0 doesn't match desired version 2.0.50727.3625 f:0

The version of the clr on my computer doesn't match that loaded by the process running on the server. One way around this is to copy sos.dll and all msc*.dll files from the server's C:\Windows\Microsoft.Net\Framework64\v2.0.50727 (or equivalent, depending upon windows location and framework version) into a local folder and then load sos with the following: ".load C:\DllsFromServer\SOS.dll". I must admit, I think I came upon the set of files to load through some trial and error rather than comprehending the full depth of the issue. But it's worked each time I've encountered the issue so far! :)

Retrieving C# struct values

Another problem I run into each time is the manner in which struct values needs to be probed. With other types you can just "!do {addr}" (DumpObject) it and nine times out of ten see the properties and values you want, with the occassional "!da {addr}" (DumpArray) thrown in for good measure. But if you try to do this for a struct value you get the cryptic response:

<Note: this object has an invalid CLASS field> Invalid object

From what I understand, the type information is not contained at the address that is indicated (my understanding of all this is a bit limited, so if that's less than accurate then please forgive me!). To access its content we need to point the debugger at the type information as well. For a DateTime, we can get this with the following:

!Name2EE * System.DateTime

This searches all of the loaded assemblies for the specified type and, if located, will report a MethodTable address. This can used with the DumpVC command to look into the DateTime data:

!DumpVC {MethodTableAddr} {addr}

For DateTimes, the value is represented by a UInt64 which can be translated by C#:

var d = new DateTime(value);

Alternatively, there is a WinDbg extension that can help with this: sosex (which can be downloaded from http://www.stevestechspot.com/SOSEXV40NowAvailable.aspx). Once loaded (by putting the dll into the "winext" folder under the WinDbg installation location and using ".load sosex") you can:

!mdt System.DateTime {addr}

Which is much easier!

There is a similar problem with .Net generic nullable types. If you want to look into the state of a nullable int aka "int?" aka "Nullable<int>" you need to look into the type "System.Nullable`1" and then following the above !Name2EE / !DumpVC approach to looking as the "hasValue" and "value" fields.

This only works following some guesswork at that precise "System.Nullable`1" name. It turns out that sosex can also help with this; it can look up type information given a partial name:

!mx System.Nullable*

This returns clickable links, amongst which are "get_Value" which exposes a MethodTable for retrieving the content with !DumpVC.

Posted at 21:03

Comments

The Full Text Indexer - Token Breaker and String Normaliser variations

I've written a few Posts now about the investigative Full Text Indexer project I've been playing with (see also Updating the Index, MultiLingual Content and various tangents into plurality handling) but I think I've got at least one more to round it off.

Among the basic components of the Index Generator that determine the manner in which it processes content are (as outlined in that First Indexer Post):

Token Breakers

This example uses a "WhiteSpaceTokenBreaker" which will take the string content from the Content Retrievers and break it into individual words by splitting on whitespace characters. Straight forward!

String Normaliser

The String Normaliser is essentially an IEqualityComparer and will be used to generate a lookup of tokens and compare them against values passed into the GetMatches method. The DefaultStringNormaliser will remove all punctuation, exchange all non-latin characters for latin equivalents and lower-case them all. For the most basic lookups I had in mind, this does the hard work.

String Normalisers

There are not many String Normaliser implementations at present, the most basic would be to compare two strings to see if they match. Very simple - but for the use cases I have in mind*, which are largely European languages passages, not very useful. The next step up is a case-insensitive match. Marginally better but there are all sorts of puncuation marks that I might want to ignore - eg. I probably want to consider "James" (the name) to be the same as "James'" (the possessive determiner; er, if my searching for the correct linguistic phrase has led me to the correct one :). I may want to try to ignore differences between accented characters (eg. consider "Jose" to match "José"). If we end up comparing strings that contain whitespace (since there's nothing from forcing all Token Breakers to break on whitespace) then we probably want to normalise the whitespace such that a tab is equivalent to a space is equivalent to a run of multiple spaces.

The DefaultStringNormaliser will iron over all of these mentioned cracks by normalising whitespace, removing punctuation characters, replacing accented characters with non-accented equivalents, lower-casing the content and trimming it before yielding the final value for comparison.

The EnglishPluralityStringNormaliser will (optionally) wrap another String Normaliser (very feasibly a DefaultStringNormaliser) and then add a further layer of processing to the output so that plurality of terms is ignored; so "cat" and "cats" are considered to match, as are "cactus" and "cactii", for example. The approach it takes isn't perfect but it gets the job done for the most common cases.

* (The fact that a String Normaliser has to be specified in order to instantiate an IndexGenerator should mean that it would be straight-forward to configure one for uses cases that I didn't specifically have in mind when I wrote it).

Token Breakers

The WhiteSpaceTokenBreaker is probably the most obvious implementation, it breaks all content on whitespace and considers each resulting segment to be a token. However, there are a lot of other characters which may constitute a break between words - normally these have whitespace around as well but that relies on the input content following particular formatting rules (like ensuring that commas are always followed by a space). So we also have the WhiteSpaceExtendingTokenBreaker. This will replace particular characters with a space before handing off processing to another Token Breaker. It may, for example, specify that all brackets (round, square, curly or triangular) be replaced with whitespace, along with full stops and commas. This is useful for a lot of common content. Note that single quotes would not be replaced since they generally do not indicate the end of a word - eg. "don't" is one word, it should not be split into "don" and "t". This would rely upon the use of a String Normaliser that ignores punctuation such as single quotes so that "O'Connor" and "OConnor" are considered equivalent.

More interesting variations on the theme are the PartialMatchingTokenBreaker and the ConsecutiveTokenCombiningTokenBreaker. The first will wrap another Token Breaker and then process each of the resulting tokens by generating all substrings from them that are at least as long as the "minlengthOfPartialmatches" and no longer than the "maxLengthOfPartialMatches" constructor arguments on the class. This provides a simple way to implement "partial word matching" and also illustrates the benefit of returning a "WeightAdjustingToken" set from the Token Breaker; these partial words can be given much less weight when stored in the Index, such that full word matches for content appear much higher in a result set (ordered by match weight aka match quality). A "partialMatchWeightDeterminer" delegate is passed to the constructor and used to calculate the weight of these partial matches.

The ConsecutiveTokenCombiningTokenBreaker is essentially the opposite, it will apply a specified Token Breaker against the input content first and then generate additional tokens by combining runs of consecutive tokens. So if a passage contains the words "Gotos considered harmful" then instead of this being broken down into just the set "Gotos", "considered", "harmful" it would may also result in (depending upon the maxNumberOfTokens constructor argument) "Gotos considered", "considered harmful" and "Gotos considered harmful". Again, greater weights may be assigned to these runs of tokens via the weightMultiplierDeterminer constructor argument (a delegate that returns a weight multipler based upon the number of tokens combined to form the extended token). This would enable the article with the phase "Gotos considered harmful" to be assigned a greater weight than one that has the separate words "Gotos", "considered" and "harmful" (but not arranged into that particular phrase). This would rely upon a search being performed using the GetPartialMatches method of the index, which breaks up the search term according using a particular Token Breaker, rather than requiring the entire phrase be matched precisely (this is covered briefly towards the end of the first Full Text Indexer post).

The use of these token breakers, whether individually or in combination, will result in more data being stored in the Index (as well as more processing of the input content required in order to generate the index) but offer the benefits that searches can also match content more loosely in some cases while prioritising the best matches in others.

Update (28th March 2013): The ConsecutiveTokenCombiningTokenBreaker is no longer the best way to deal with searches for consecutive terms, there is now a GetConsecutiveMatches extension method for IIndexData that doesn't require the additional (expensive) processing when the index is generated, see The Full Text Indexer: Source Locations.

Bonus Material: The Indexer in action (small scale, though it may be!)

All of the above is a bit dry and I wanted to include it largely to round out the introductory series to this code. So to make this post marginally more interesting, I thought I'd include the configuration in which I've used it on this blog to implement the site search and the autocomplete facility.

I have the following method which is used to generate Index content for both the site search and the auto-complete functionality. Posts have an integer Id and string Title and Content properties. They also have LastModified and Archive properties which enables the Indexes to be cached in memory and on disk, only rebuilding when content has changed (ie. a new Post has been published, an existing Post has been updated or a Post has been archived).

The bulk of the Index generation is illustrated below with comments around most of the decisions:

private IIndexData<int> GenerateIndexData(
    NonNullImmutableList<Post> posts,
    IStringNormaliser sourceStringComparer)
{
    if (posts == null)
        throw new ArgumentNullException("posts");
    if (sourceStringComparer == null)
        throw new ArgumentNullException("sourceStringComparer");

    // Define the manner in which the raw content is retrieved from Post title and body
    // - English stop words will only receive 1% the weight when match qualities are
    //   determined than other words will receive
    // - Words in the title will be given 5x the weight of words found in body content
    var englishStopWords = FullTextIndexer.Constants.GetStopWords("en");
    var contentRetrievers = new List<ContentRetriever<Post, int>>();
    contentRetrievers.Add(new ContentRetriever<Post, int>(
        p => new PreBrokenContent<int>(p.Id, p.Title),
        token => (englishStopWords.Contains(token, sourceStringComparer) ? 0.01f : 1f) * 5f
    ));
    contentRetrievers.Add(new ContentRetriever<Post, int>(
        p => new PreBrokenContent<int>(p.Id, p.Content),
        token => englishStopWords.Contains(token, sourceStringComparer) ? 0.01f : 1f
    ));

    // Specify the token breaker
    // - English content will generally break on "." and "," (unlike "'" or "-" which
    //   are commonly part of words). Also break on round brackets for written content
    //   but also the other bracket types and other common characters that might
    //   represent word breaks in code content found on the site
    var tokenBreaker = new WhiteSpaceExtendingTokenBreaker(
        new ImmutableList<char>(new[] {
            '<', '>', '[', ']', '(', ')', '{', '}',
            '.', ',', ':', ';', '"', '?', '!',
            '/', '\\',
            '@', '+', '|', '='
        }),
        new WhiteSpaceTokenBreaker()
    );

    // Generate an index using the specified StringNormaliser, 
    // - The Post class has an integer Id so a simple IntEqualityComparer (see below)
    //   will do the job fine for the dataKeyComparer
    // - If the search term is matched multiple times in a Post then combine the match
    //   weight in a simple additive manner (hence the weightedValues.Sum() call)
    var indexGenerator = new IndexGenerator<Post, int>(
        contentRetrievers.ToNonNullImmutableList(),
        new IntEqualityComparer(),
        sourceStringComparer,
        tokenBreaker,
        weightedValues => weightedValues.Sum(),
        new NullLogger()
    );
    return indexGenerator.Generate(posts.ToNonNullImmutableList());
}

[Serializable]
private class IntEqualityComparer : IEqualityComparer<int>
{
    public bool Equals(int x, int y)
    {
        return (x == y);
    }

    public int GetHashCode(int obj)
    {
        return obj;
    }
}

The generation of the search index is fairly straight-forward, the content on my blog is English with code samples (mostly C#) so I use an EnglishPluralityStringNormaliser that wraps a DefaultStringNormaliser (the PreNormaliserWorkOptions flags specified in the constructor are optimisations described in Optimising the Plurality-Handling Normaliser).

var indexDataForSearching = GenerateIndexData(
    posts,
    new EnglishPluralityStringNormaliser(
        new DefaultStringNormaliser(),
        EnglishPluralityStringNormaliser.PreNormaliserWorkOptions.PreNormaliserLowerCases
        | EnglishPluralityStringNormaliser.PreNormaliserWorkOptions.PreNormaliserTrims
    )
);

The IIndexData<T> class has a GetAllTokens() method which is useful for the autocomplete functionality but it's not as useful with the above string normaliser as that applies various manipulations to the keys (not only does it normalise word endings for plurality handling but it replaces accented characters and removes punctuation). In order to generate an index that we could extract token data from for an autocomplete list we want to avoid these manipulations. This doesn't exist in the FullTextIndex project, since it's not very useful for the intended search functionality, but as a convenient (and very simple!) example of how to vary the functionality we can use a NonAlteringStrongNormaliser:

[Serializable]
public class NonAlteringStringNormaliser : StringNormaliser
{
    public override string GetNormalisedString(string value)
    {
        if (string.IsNullOrWhiteSpace(value))
            throw new ArgumentException("Null/blank value specified");

        return value;
    }
}

This inherits from the abstract StringNormaliser class which is in the FullTextIndexer project and implements the Equals and GetHashCode methods of the IEqualityComparer<string> interface, requiring the derived class to provide the GetNormalisedString(string value) method only.

And from this we can generate an autocomplete word list with a little more work:

var indexDataForAutoCompleteExtended = GenerateIndexData(
    posts,
    new NonAlteringStringNormaliser()
);
var autoCompleteContent = new NonNullOrEmptyStringList(
    indexDataForAutoCompleteExtended.GetAllTokens()
    .Where(token =>
        (token.Length >= 3) &&
        !char.IsPunctuation(token[0])
        && !token.Any(c => char.IsNumber(c))
    )
    .Distinct(StringComparer.InvariantCultureIgnoreCase)
    .Where(token => indexDataForSearching.GetMatches(token).Any())
    .OrderBy(token => token.ToLower())
);

The filters applied here were determined by running this against the blog content at the time and making up some rules that seemed like they made the resulting data look better (yes, as scientific as that! :) It ignores words less than three characters as these are usually stop words (I considered ignoring all stop words but some of the words in the stop word list seemed like things people might search on). If there are multiple tokens that are variations of each other with differing case then only one of them will be in the final list. Only tokens that actually result in matches in the "indexDataForSearching" content that was generated are included - this should always be the case for the string normaliser I'm currently using but if I tweak that in the future then I want to ensure that I don't end up with tokens being generated for the autocomplete list that don't actually match any Posts!

It's worth noting that the autocomplete list generated is really more like a "suggested" words list since it can't cover every single match. If, for example, the input data contained the word "cats" but not "cat" then the plurality-handling string normaliser used in the search data generation will match the search term "cat" to the word "cats" in the content, but the autocomplete word list would not contain the word "cats" since that wasn't in the source content (though the word "cat" would be, as it was in the content). In practice, I don't see this being a problem as the search box on this site allows you to enter anything - you aren't restricted to only words in the autocomplete list.

Finally, everything has been declared serialisable so that the index data could be cached on disk. In practice, this means that I build the index data locally when I update a Post and view it on my local computer - and then I upload the new content along with the on-disk index content so that searches performed on the site should be fast as soon as all this new data is uploaded.

Update (17th December 2012): This has been included as part of a later Full Text Indexer Round-up Post that brings together several Posts into one series, incorporating code and techniques from each of them.

Posted at 19:12

Comments

Consuming a WCF Web Service from PHP

For the Web Service that I've been developing at work, I was asked to support a Developer who was trying to consume it for a project. The problem being that he was developing in PHP and had never connected to a .Net Web Service before - at least not knowingly; the APIs he'd integrated with before would all communicate in JSON. The other half of the problem is that I've never developed anything in PHP! In fact I'd not written a line in my life before last week.

From what I'm led to understand of PHP it's somewhat of a mess of 1000s of functions with inconsistent names, signatures and approaches. It's dynamically typed and may or may not have namespaces in any usable manner. But it's got an enormous set of well-established libraries available for everything under the sun. So this should be a walk in the park, right??

Below is a simplified version of what we're working with; the idea that you could search for Hotels that meet various criteria, where the criteria can be built up with the application of multiple Filters - at least one must be specified but multiple may be, in which case Hotels must meet the criteria in all Filters in order to be returned.

[ServiceContract]
public class HotelService
{
    [OperationContract]
    public Hotel[] GetHotels(HotelSearchRequest request)
    {
        ..
    }
}

[DataContact]
public class HotelSearchRequest
{
    [DataMember]
    public string APIKey { get; set; }

    [DataMember]
    public Filter[] Filters { get; set; }
}

[KnownType(CategoryFilter)]
[KnownType(ProximityFilter)]
[DataContract]
public abstract class Filter { }

[DataContract]
public class CategoryFilter : Filter
{
    [DataMember]
    public int[] CategoryKeys { get; set; }
}

[DataContract]
public class ProximityFilter : Filter
{
    [DataMember]
    public CoordinateDetails Centre { get; set; }

    [DataMember(IsRequired = true)]
    public int MaxDistanceInMetres { get; set; }
}

We have to mess about a bit with KnownType declarations in the right place in the service contract but once that's all sorted the service is easy to consume from a .Net WCF Client, all of the inheritance is easily understood (as they're documented in the xsd that the service exposes) and queries are easy to construct.

Getting things working in PHP is another matter, however. For someone who doesn't know what he's doing, at least! All of the basic examples suggest something along the lines of:

$client = new SoapClient(
    "http://testhotelservice.com/HotelService.svc?wsdl",
    array("trace" => 1)
);
$data = $client->GetHotels(
    array(
        "request" => array(    
            "ApiKey" => "{KeyGoesHere}",
            "Filters" => array(
                // TODO: What to write here?!
            )
        )
    )
);

It seems like the SoapClient can do some sort of magic with what are presumably associative arrays to build up the web request. All looks good initially for declaring data for the "request" argument of the GetHotels method; we set the "ApiKey" property of the request to an appropriate string.. the SoapClient must be doing something clever with the wsdl to determine the xml to generate, which must include the type name of the request to specify. But if the type names are intended to be hidden, how am I going to specify them when I build the "Filters" array?? I can't just carry on with this type-name-less associated-array approach because there will be no way for the request to know that I want the CategoryFilter or the ProximityFilter (or any other Filter that might be available).

Hmmm...

More googling brings me to discover the SoapVar class for use with the SoapClient. If we do the following:

$client = new SoapClient(
    "http://testhotelservice.com/HotelService.svc?wsdl",
    array("trace" => 1)
);
$data = $client->GetHotels(
    array(
        "request" => array(    
            "ApiKey" => "{KeyGoesHere}",
            "Filters" => array(
                new SoapVar(
                    array(
                        "CategoryKeys" => array(1, 2, 3)
                    ), 
                    SOAP_ENC_OBJECT,
                    "CategoryFilter",
                    "http://schemas.datacontract.org/2004/07/DemoService.Messages.Requests"
                )
            )
        )
    )
);

Then we are able to include information about the type. Progress! The namespace string specified references the C# namespace of the CategoryFilter class.

As with so many things, it looks all so easy. But I didn't know what exactly I should be searching for, getting this far took me quite a while - and the resource out there explaining this are thin on the ground! With the maturity of both PHP and WCF I would have thought that information about calling into WCF Services from PHP in this manner would be much readily available!

But wait; there's more

While I was at it, I thought I'd dig a little further. When I first communicated with this other Developer, I asked him to send me a trace of the requests that were being generated by his PHP code, using Fiddler or something. This information was not forthcoming and when I was running test PHP scripts on my local machine the requests weren't being captured by Fiddler anyway. But I found these handy methods:

$client->__getLastRequest()
$client->__getLastRequestHeaders()

which will retrieve the sent content, ideal for looking into exactly what messages were being generated! These only work if the "trace" => "1" argument is specified when the SoapClient is instantiated. Which explained to me what that was for, which was nice :)

Enabling gzip support

The next issue I had was another one that I thought would be beyond easy to solve, with easy-to-follow and accurate information all over the place. I was wrong again! At least, my first searches didn't bring me immediately to the answer :(

A lot of resources suggest the following:

$client = new SoapClient(
    "http://testhotelservice.com/HotelService.svc?wsdl",
    array(
        "compression" => SOAP_COMPRESSION_ACCEPT | SOAP_COMPRESSION_GZIP | 9,
        "trace" => 1
    )
);

and some suggest this variation (quoting the compression value):

$client = new SoapClient(
    "http://testhotelservice.com/HotelService.svc?wsdl",
    array(
        "compression" => "SOAP_COMPRESSION_ACCEPT | SOAP_COMPRESSION_GZIP | 9",
        "trace" => 1
    )
);

These do not work. The first results in a "can't uncompress compressed response" after a delay which makes me think that it's doing work. The latter does cause any error but also doesn't include the "Accept-encoding: gzip" HTTP header that I'm looking for.

They both feel wrong, anyway; presumably the 9 relates to gzip compression level 9. The compression level should surely be set on the server only, not referenced by the client?? And what are these SOAP_COMPRESS_ACCEPT and SOAP_COMPRESSION_GZIP values? These values are just numeric constants, it turns out, which are OR'd together in the first variation. But what's the 9 for; is it supposed to be there at all; is it some other mystery constant?? And surely the quoted version is incorrect unless PHP has some mad string rules that I don't know about (totally possible with my knowledge of PHP but not the case here! :)

The correct version is simply:

$client = new SoapClient(
    "http://testhotelservice.com/HotelService.svc?wsdl",
    array(
        "compression" => SOAP_COMPRESSION_ACCEPT | SOAP_COMPRESSION_GZIP,
        "trace" => 1
    )
);

Again, oh so simple yet so much harder to come to in the end than it should have been.

One more thing

A final note, that actually was commonly documented and pointed out, was that if you are developing against a service that is still in flux that you should disable wsdl caching in PHP. This will affect performance as the wsdl will be retrieved on each request (I presume), but it may prevent some headaches if the contract changes. I changed a value in the php.ini file but apparently it can also be done in the PHP script with:

ini_set("soap.wsdl_cache_enabled", 0);

(Courtesy of a StackOverflow answer).

Conclusion

This may well be simple stuff to the real PHP developers out there but since I struggled, maybe this post will help others in the future with this particular problem!

Posted at 23:03

Comments