Productive Rage

19 August 2015

Strongly-typed React (with Bridge.net)

A few weeks ago, I wrote about using React with Bridge.net. I described that I'd only written the bare minimum of bindings required to get my samples working - so, while I had a function for React.DOM.div -

[Name("div")]
public extern static ReactElement Div(
  HTMLAttributes properties,
  params ReactElementOrText[] children
);

The HTMLAttributes class I had written really was the bare minimum:

[Ignore]
[ObjectLiteral]
public class HTMLAttributes
{
  public string className;
}

It's time to revisit this and build up my bindings library!

A starting point

An obvious resource to work from initially is the "DefinitelyTyped" bindings that allow you to use React from TypeScript. But I'd identified a pattern that I didn't like with them in my earlier post - the type system isn't being used to as full effect as it could be. For example, in the declaration of "input" elements. Let me explain (and please bear with me, I need to go through a few steps to get to the point)..

The TypeScript bindings describe a function for creating input elements:

React.DOM.input(props: HTMLAttributes, ...children: ReactNode[]): DOMElement

For any non-TypeScripters, this is a function that takes an argument named "props" that is of type HTMLAttributes, and then 0, 1, .. n arguments of type ReactNode that are wrapped up into an array (the same principle as "params" arguments in C#). It returns a DOMElement instance.

HTMLAttributes has 115 of its own properties (such as "className", "disabled" and "itemScope" - to take three at random) and extends DOMAttributes, which has 34 more properties (such as "onChange" and "onDragStart").

The "onChange" property is a FormEventHandler, which is derived from EventHandler<FormEvent>, where EventHandler<E> is a delegate which has a single "event" argument of type "E" which returns no value. It's a callback, in other words.

This looks promising and is, on the whole, a good use of TypeScript's generics system.

However, I don't think it uses this system enough. The FormEvent (that the "onChange" property passes in the callback) is a specialisation of a SyntheticEvent type:

interface FormEvent extends SyntheticEvent { }

interface SyntheticEvent {
  bubbles: boolean;
  cancelable: boolean;
  currentTarget: EventTarget;
  defaultPrevented: boolean;
  eventPhase: number;
  isTrusted: boolean;
  nativeEvent: Event;
  preventDefault(): void;
  stopPropagation(): void;
  target: EventTarget;
  timeStamp: Date;
  type: string;
}

(The EventTarget, which is what the "target" property is an instance of, is a DOM concept and is not a type defined by the React bindings, it just means that it is one of the DOM elements that are able to raise events).

The problem I have is that if we write code such as

React.DOM.input({
  value: "hi"
  onChange: e => { alert(e.target.value); }
})

Then we'll get a TypeScript compile error because "e.target" is only known to be of type EventTarget, it is not known to be an input element and so it is not known to have a "value" property. But we're specifying this "onChange" property while declaring an input element.. the type system should know that the "e.target" reference will be an input!

In fact, in TypeScript, we actually have to skirt around the type system to make it work:

// "<any>" means cast the "e.target" reference to the magic type "any", which
// is like "dynamic" in C# - you can specify any property or method and the
// compiler will assume you know what you're doing and allow it (resulting
// in a runtime exception if you get it wrong)
React.DOM.input({
  value: "hi"
  onChange: e => { alert((<any>e.target).value); }
})

In my React bindings for Bridge I improved this by defining an InputAttributes type:

[Ignore]
[ObjectLiteral]
public class InputAttributes : HTMLAttributes
{
  public Action<FormEvent<InputEventTarget>> onChange;
  public string value;
}

And having a generic FormEvent<T> which inherits from FormEvent -

[Ignore]
public class FormEvent<T> : FormEvent where T : EventTarget
{
  public new T target;
}

This means that the "target" property can be typed more specifically. And so, when you're writing this sort of code in C# with Bridge.net, you can write things like:

// No nasty casts required! The type system knows that "e.target" is an
// "InputEventTarget" and therefore knows that it has a "value" property
// that is a string.
DOM.Input(new InputAttributes
{
  value = "hi",
  onChange = e => Global.Alert(e.target.value)
})

This is great stuff! And I'm not changing how React works in any way, I'm just changing how we interpret the data that React is communicating; the event reference in the input's "onChange" callback has always had a "target" which had a "value" property, it's just that the TypeScript bindings don't tell us this through the type system.

So that's all good.. but it did require me to write more code for the bindings. The InputEventTarget class, for example, is one I had to define:

[Ignore]
public class InputEventTarget : EventTarget
{
  public string value;
}

And I've already mentioned having to define the FormEvent<T> and InputAttributes classes..

What I'm saying is that these improvements do not come for free, they required some analysis and some further effort putting into the bindings (which is not to take anything away from DefinitelyTyped, by the way - I'm a big fan of the work in that repository and I'm very glad that it's available, both for TypeScript / React work I've done in the past and to use as a starting point for Bridge bindings).

Seeing how these more focussed / specific classes can improve things, I come to my second problem with the TypeScript bindings..

Why must the HTMLAttributes have almost 150 properties??

The place that I wanted to start in extending my (very minimal) bindings was in fleshing out the HTMLAttributes class. Considering that it had only a single property ("className") so far, and that it would be used by so many element types, that seemed like a reasonable plan. But looking at the TypeScript binding, I felt like I was drowning in properties.. I realised that I wasn't familiar with everything that appeared in html5, but I was astonished by how many options there were - and convinced that they couldn't all be applicable to all elements types. So I picked one at random, of those that stood out as being completely unfamiliar to me: "download".

w3schools has this to say about the HTML <a> download Attribute:

The download attribute is new for the <a> tag in HTML5.

and

The download attribute specifies that the target will be downloaded when a user clicks on the hyperlink.
This attribute is only used if the href attribute is set.
The value of the attribute will be the name of the downloaded file. There are no restrictions on allowed values, and the browser will automatically detect the correct file extension and add it to the file (.img, .pdf, .txt, .html, etc.).

So it appears that this attribute is only applicable to anchor tags. Therefore, it would make more sense to not have a "React.DOM.a" function such as:

[Name("a")]
public extern static ReactElement A(
  HTMLAttributes properties,
  params ReactElementOrText[] children
);

and, like the "input" function, to be more specific and create a new "attributes" type. So the function would be better as:

[Name("a")]
public extern static ReactElement A(
  AnchorAttributes properties,
  params ReactElementOrText[] children
);

and the new type would be something like:

[Ignore]
[ObjectLiteral]
public class AnchorAttributes : HTMLAttributes
{
  public string download;
}

This would allow the "download" property to be pulled out of HTMLAttributes (so that it couldn't be a applied to a "div", for example, where it has no meaning).

So one down! Many, many more to go..

Some properties are applicable to multiple element types, but these elements may not have anything else in common. As such, I think it would be more sensible to duplicate some properties in multiple attributes classes, rather than trying to come up with a complicated inheritance tree that tries to avoid any repeating of properties, at the cost of the complexities that inheritance can bring. For example, "href" is a valid attribute for both "a" and "link" tags, but these elements do not otherwise have much in common - so it might be better to have completely distinct classes

[Ignore]
[ObjectLiteral]
public class AnchorAttributes : HTMLAttributes
{
  public string href;
  public string download;
  // .. and other attributes specified to anchor tags
}

[Ignore]
[ObjectLiteral]
public class LinkAttributes : HTMLAttributes
{
  public string href;
  // .. and other attributes specified to link tags
}

than to try to create a base class

[Ignore]
[ObjectLiteral]
public abstract class HasHrefAttribute : HTMLAttributes
{
  public string href;
}

which AnchorAttributes and LinkAttributes could be derived from. While it might appear initially to make sense, I imagine that it will all come unstuck quite quickly and you'll end up finding yourself wanting to inherit from multiple base classes and all sorts of things that C# doesn't like. I think this is a KISS over DRY scenario (I'd rather repeat "public string href;" in a few distinct places than try to tie the classes together in some convoluted manner).

More type shenanigans

So, with more thought and planning, I think a reduced HTMLAttributes class could be written and a range of attribute classes produced that make the type system work for us. I should probably admit that I haven't actually done any of that further thought or planning yet! I feel like I've spent this month coming up with grandiose schemes and then writing about doing them rather than actually getting them done! :D

Anyway, enough about my shortcomings, there's another issue I found while looking into this "download" attribute. Thankfully, it's a minor problem that can easily be solved with the way that bindings may be written for Bridge..

There was an issue on React's GitHub repo: "Improve handling of download attribute" which says the following:

Currently, the "download" attribute is handled as a normal attribute. It would be nice if it could be treated as a boolean value attribute when its value is a boolean. ... For example,

a({href: 'thing', download: true}, 'clickme'); // => <a href="thing" download>clickme</a>

a({href: 'thing', download: 'File.pdf'}, 'clickme'); // => <a href="thing" download="File.pdf">

This indicates that

[Ignore]
[ObjectLiteral]
public class AnchorAttributes : HTMLAttributes
{
  public string href;
  public string download;
  // .. and other attributes specified to anchor tags
}

is not good enough and that "download" needs to be allowed to be a string or a boolean.

This can be worked around by introducing a new class

[Ignore]
public sealed class StringOrBoolean
{
  private StringOrBoolean() { }

  public static implicit operator StringOrBoolean(bool value)
    => new StringOrBoolean();

  public static implicit operator StringOrBoolean(string value)
    => new StringOrBoolean();
}

This looks a bit strange at first glance. But it is only be used to describe a way to pass information in a binding, that's why it's got the "Ignore" attribute on it - that means that this class will not be translated into any JavaScript by Bridge, it exists solely to tell the type system how one thing talks to another (my React with Bridge.net post talked a little bit about this attribute, and others similar to it, that are used in creating Bridge bindings - so if you want to know more, that's a good place to start).

This explains why the "value" argument used in either of the implicit operators is thrown away - it's because it's never used by the binding code! It is only so that we can use this type in the attribute class:

[Ignore]
[ObjectLiteral]
public class AnchorAttributes : HTMLAttributes
{
  public string href;
  public StringOrBoolean download;
  // .. and other attributes specified to anchor tags
}

And this allows to then write code like

DOM.a(new AnchorAttributes
{
  href: "/instructions.pdf",
  download: "My Site's Instructions.pdf"
})

DOM.a(new AnchorAttributes
{
  href: "/instructions.pdf",
  download: true
})

We only require this class to exist so that we can tell the type system that React is cool with us giving a string value for "download" or a boolean value.

The "ObjectLiteral" attribute on these classes means that the code

DOM.a(new AnchorAttributes
{
  href: "/instructions.pdf",
  download: true
})

is not even translated into an instantiation of a class called "AnchorAttributes", it is instead translated into a simple object literal -

// It is NOT translated into this
React.DOM.a(
  Bridge.merge(
    new Bridge.React.AnchorAttributes(),
    { name: "/instructions.pdf", download: true }
  )
)

// It IS just translated into this
React.DOM.a({ name: "/instructions.pdf", download: true })

Again, this illustrates why the "value" argument was thrown away in the StringOrBoolean implicit operator calls - because those calls do not exist in the translated JavaScript.

A nice bonus

Another thing that I like about the "ObjectLiteral" attribute that I've used on these {Whatever}Attributes classes is that the translated code only includes the properties that have been explicitly set.

This means that, unlike in the TypeScript definitions, we don't have to declare all value types as nullable. If, for example, we have an attributes class for table cells - like:

[Ignore]
[ObjectLiteral]
public class TableCellAttributes : HTMLAttributes
{
  public int colSpan;
  public int rowSpan;
}

and we have C# code like this:

DOM.td(new TableCellAttributes { colSpan = 2 }, "Hello!")

Then the resulting JavaScript is simply:

React.DOM.td({ colSpan = 2 }, "Hello!")

Note that the unspecified "rowSpan" property does not appear in the JavaScript.

If we want it to appear, then we can specify a value in the C# code -

DOM.td(new TableCellAttributes { colSpan = 2, rowSpan = 1 }, "Hello!")

That will be translated as you would expect:

React.DOM.td({ colSpan = 2, rowSpan = 1 }, "Hello!")

This has two benefits, actually, because not only do we not have to mark all of the properties as nullable (while that wouldn't be the end of the world, it's nicer - I think - to have the attribute classes have properties that match the html values as closely as possible and using simple value types does so) but it also keeps the generated JavaScript succint. Imagine the alternative, where every property was included in the JavaScript.. every time a div element was declared it would have 150 properties listed along with it. The JavaScript code would get huge, very quickly!*

* (Ok, ok, it shouldn't be 150 properties for every div since half the point of this post is that it will be much better to create attribute classes that are as specific as possible - but there would still be a lot of properties that appear in element initialisations in the JavaScript which were not present in the C# code, it's much better only having the explicitly-specified values wind up in the translated output).

A change in Bridge 1.8

I was part way through writing about how pleased I was that unspecified properties in an [ObjectLiteral]-decorated class do not appear in the generated JavaScript when I decided to upgrade to Bridge 1.8 (which was just released two days ago).. and things stopped doing what I wanted.

With version 1.8, it seems like if you have an [ObjectLiteral] class then all of the properties will be included in the JavaScript - with default values if you did not specify them explicitly. So the example above:

DOM.td(new TableCellAttributes { colSpan = 2 }, "Hello!")

would result in something like:

React.DOM.td({
    colSpan = 2,
    rowSpan = 0,
    id = null,
    className = null,
    // .. every other HTMLAttribute value here with a default value
  },
  "Hello!"
)

Which is a real pity.

The good news is that it appears to be as easy as also including an [Ignore] attribute on the type - doing so re-enables the behaviour that only includes explicitly-specified properties in the JavaScript. However, I have been unable to find authoritative information on how [ObjectLiteral] should behave and how it should behave with or without [Ignore]. I had a quick flick through the 1.8 release notes and couldn't see any mention of this being an explicit change from 1.7 to 1.8 (but, I will admit, I wasn't super thorough in that investigation).

I only came across the idea of combining [Ignore] with [ObjectLiteral] when I was looking through their source code on GitHub (open source software, ftw!) and found a few places where there are checks for one of those attributes or both of them in some places.

(I've updated the code samples in this post to illustrate what I mean - now anywhere that has [ObjectLiteral] also has [Ignore]).

I'm a little bit concerned that this may change again in the future or that I'm not using these options correctly, but I've raised a bug in their forums and they've been very good at responding to these in the past - ObjectLiteral classes generate values for all properties in 1.8 (changed from 1.7).

What's next

So.. how am I intending to progress this? Or am I going to just leave it as an interesting initial investigation, something that I've looked briefly into and then blogged about??

Well, no. Because I am actually planning to do some useful work on this! :) I'm a big fan of both React and Bridge and hope to be doing work with both of them, so moving this along is going to be a necessity as much as a nice idea to play around with. It's just a case of how to proceed - as the I-have-never-heard-of-this-new-download-attribute story goes to show, I'm not intimately familiar with every single tag and every single attribute, particular in regards to some of the less well-known html5 combinations.

Having done some research while writing this post, I think the best resource that I've found has been MDN (the Mozilla Developer Network). It seems like you can look up any tag - eg.

https://developer.mozilla.org/en-US/docs/Web/HTML/Element/a

And then find details of every attribute that it has, along with compatibility information. For example, the td table cell documentation..

https://developer.mozilla.org/en-US/docs/Web/HTML/Element/td

.. mentions "colSpan" and "rowSpan", with no particular mentions of compatibility (these have existed from day one, surely, and I don't think they're going to disappear any time soon) but also mentions attributes such as "align" and "valign" and highlights them as deprecated in html 4.01 and obsolete in html 5.

I'm strongly considering scraping these MDN pages and trying to parse out the attribute names and compatibility information (probably only supporting html5, since what's the point in supporting anything older when Bridge and React are new and and so I will be using them for writing new code and taking advantage of current standards). It doesn't provide type information (like "colSpan" is numeric or "download" may be a string or a boolean), but the DefinitelyTyped definitions will go some way in helping out with that. And MDN says that its wiki documents are available under the creative commons license, so I believe that this would acceptable use of the data, so long as they are given the appropriate credit in the bindings code that I will eventually generate (which only seems fair!).

So I think that that is what will come next - trying to glean all of the information I need about the attributes specific to particular tags and then using this to produce bindings that take as much advantage of the C# type system as possible!

Unless I'm missing something and someone else can think of a better way? Anyone??

Update (8th October 2015): I've had some suggestions from a member of the Bridge.net Team on how to reuse some of their work on html5 element definitions to make this a lot easier - so hopefully I'll have an update before too long based upon this. Before I can do so, the Bridge Team are looking into some improvements, such as allowing the "CurrentTarget" property of elements to be more strongly-typed (see http://forums.bridge.net/forum/general/feature-requests/630-open-461-generic-html5-element-and-event-classes), but hopefully we'll all have an update before too long!

Posted at 23:07

Tags:

Comments

14 August 2015

Hassle-free immutable type updates in C#

Earlier this week, I was talking about parsing TypeScript definitions in an inspired-by-function-programming manner. Like this:

public static Optional<MatchResult<PropertyDetails>> Property(IReadStringContent reader)
{
  IdentifierDetails name = null;
  ITypeDetails type = null;
  var readerAfterProperty = Optional.For(reader)
    .Then(Identifier, value => name = value)
    .ThenOptionally(Whitespace)
    .Then(Match(':'))
    .ThenOptionally(Whitespace)
    .Then(TypeScriptType, value => type = value)
    .ThenOptionally(Whitespace)
    .Then(Match(';'));

  if (!readerAfterProperty.IsDefined)
    return null;

  return MatchResult.New(
    new PropertyDetails(name, type),
    readerAfterProperty.Value
  );
}

"Identifier", "Whitespace" and "TypeScriptType" are functions that match the following delegate:

public delegate Optional<MatchResult<T>> Parser<T>(
  IReadStringContent reader
);

.. while "Match" is a function that returns a Parser<char>.

The MatchResult class looks like this:

public sealed class MatchResult<T>
{
  public MatchResult(T result, IReadStringContent reader)
  {
    Result = result;
    Reader = reader;
  }
  public T Result { get; }
  public IReadStringContent Reader { get; }
}

/// <summary>
/// This non-generic static class is just to expose a helper method that takes advantage
/// of C#'s type inference to allow you to say "MatchResult.New(value, reader)" rather than
/// having to write out the type of the value in "new MatchResult<string>(value, reader)"
/// </summary>
public static class MatchResult
{
  /// <summary>
  /// Convenience method to utilise C# type inherence
  /// </summary>
  public static MatchResult<T> New<T>(T value, IReadStringContent reader)
  {
    if (value == null)
      throw new ArgumentNullException(nameof(value));
    if (reader == null)
      throw new ArgumentNullException(nameof(reader));

    return new MatchResult<T>(value, reader);
  }
}

.. and Optional is basically a way to identify a type as being maybe-null (the convention then being that any non-Optional type should never be null).

Feel free to fresh your memory at Parsing TypeScript definitions!

One thing that I thought was very un-functional-like (a very precise term! :) was the way that the "name" and "type" values were updated via callbacks from the "Then" methods. This mechanism felt wrong for two reasons; the repeat assignment to the references (setting them to null and then setting them again to something else) and the fact that the assignments were effectively done as side effects of the work of the "Then" function.

So I thought I'd have a look into some alternatives and see if I could whip up something better.

The verbose approach

The current approach chains together functions that take and return Optional<IReadStringContent> instances. If content is encountered that does not match the specified Parser then a "Missing" value will be returned from the "Then" call. If a "Then" call receives a "Missing" value then it passes that straight out. So, any time that a match is missed, all subsequent calls pass the "Missing" value straight through.

This is why the side effect callbacks are required to pass out the values, because each "Then" call only returns the next position for the reader (or "Missing" if content did not meet requirements).

To change this, the "Then" function will need to return additional information. Conveniently, there is already a structure to do this - the MatchResult<T>. As long as we had one result type that we wanted to thread through the "Then" calls then we could write an alternate version of "Then" -

public static Optional<MatchResult<TResult>> Then<TResult, TValue>(
  this Optional<MatchResult<TResult>> resultSoFar,
  Parser<TValue> parser,
  Func<TResult, TValue, TResult> updater)
{
  if (!resultSoFar.IsDefined)
    return null;

  var result = parser(resultSoFar.Value.Reader);
  if (!result.IsDefined)
    return null;

  return MatchResult.New(
    updater(resultSoFar.Value.Result, result.Value.Result),
    result.Value.Reader
  );
}

This takes an Optional<MatchResult<T>> and tries to match content in the reader inside that MatchResult using a Parser (just like before) - if it successfully matches the content then it uses an "updater" which takes the previous values from the MatchResult and the matched value from the reader, and returns a new result that combines the two. It then returns a MatchResult that combines this new value with the reader in a position after the matched content. (Assuming the content met the Parser requirements.. otherwise it would return null).

This all sounds very abstract, so let's make it concrete. Continuing with the parsing-a-TypeScript-property (such as "name: string;") example, let's declare an interim type -

public sealed class PartialPropertyDetails
{
  public PartialPropertyDetails(
    Optional<IdentifierDetails> name,
    Optional<ITypeDetails> type)
  {
    Name = name;
    Type = type;
  }
  public Optional<IdentifierDetails> Name { get; }
  public Optional<ITypeDetails> Type { get; }
}

This has Optional values because we are going to start with them being null (since we don't have any real values until we've done the parsing). This is the reason that I've introduced this interim type, rather than using the final PropertyDetails type - that type is very similar but it has non-Optional properties because it doesn't make sense for a correctly-parsed TypeScript property to be absent either a name or a type.

Now, the parsing method could be rewritten into this:

public static Optional<MatchResult<PropertyDetails>> Property(IReadStringContent reader)
{
  var result = Optional.For(MatchResult.New(
      new PartialPropertyDetails(null, null),
      reader
    ))
    .Then(Identifier, (value, name) => new PartialPropertyDetails(name, value.Type))
    .ThenOptionally(Whitespace)
    .Then(Match(':'))
    .ThenOptionally(Whitespace)
    .Then(TypeScriptType, (value, type) => new PartialPropertyDetails(value.Name, type))
    .ThenOptionally(Whitespace)
    .Then(Match(';'));

  if (!result.IsDefined)
    return null;

  return MatchResult.New(
    new PropertyDetails(result.Value.Result.Name, result.Value.Result.Type),
    result.Value.Reader
  );
}

Ta-da! No reassignments or reliance upon side effects!

And we could make this a bit cleaner by tweaking PartialPropertyDetails -

public sealed class PartialPropertyDetails
{
  public static PartialPropertyDetails Empty { get; }
    = new PartialPropertyDetails(null, null);

  private PartialPropertyDetails(
    Optional<IdentifierDetails> name,
    Optional<ITypeDetails> type)
  {
    Name = name;
    Type = type;
  }

  public Optional<IdentifierDetails> Name { get; }
  public Optional<ITypeDetails> Type { get; }

  public PartialPropertyDetails WithName(IdentifierDetails value)
    => new PartialPropertyDetails(value, Type);
  public PartialPropertyDetails WithType(ITypeDetails value)
    => new PartialPropertyDetails(Name, value);
}

and then changing the parsing code into this:

public static Optional<MatchResult<PropertyDetails>> Property(IReadStringContent reader)
{
  var result = Optional.For(MatchResult.New(
      PartialPropertyDetails.Empty,
      reader
    ))
    .Then(Identifier, (value, name) => value.WithName(name))
    .ThenOptionally(Whitespace)
    .Then(Match(':'))
    .ThenOptionally(Whitespace)
    .Then(TypeScriptType, (value, type) => value.WithType(name))
    .ThenOptionally(Whitespace)
    .Then(Match(';'));

  if (!result.IsDefined)
    return null;

  return MatchResult.New(
    new PropertyDetails(result.Value.Result.Name, result.Value.Result.Type),
    result.Value.Reader
  );
}

This makes the parsing code look nicer, at the cost of having to write more boilerplate code for the interim type.

What if we could use anonymous types and some sort of magic for performing the copy-and-update actions..

One way to write less code

The problem with the PartialPropertyDetails is not only that it's quite a lot of code to write out (and that was only for two properties, it will quickly get bigger for more complicated structures) but also the fact that it's only useful in the context of the "Property" function. So this non-negligible chunk of code is not reusable and it clutters up the scope of whatever class or namespace has to contain it.

Anonymous types sound ideal, because they would just let us start writing objects to populate - eg.

var result = Optional.For(MatchResult.New(
    new
    {
      Name = (IdentifierDetails)null,
      Type = (ITypeDetails)null,
    },
    reader
  ))
  .Then(Identifier, (value, name) => new { Name = name, Type = value.Type })
  .ThenOptionally(Whitespace)
  .Then(Match(':'))
  .ThenOptionally(Whitespace)
  .Then(TypeScriptType, (value, type) => new { Name = value.Name, Type = Type })
  .ThenOptionally(Whitespace)
  .Then(Match(';'));

They're immutable types (so nothing is edited in-place, which is just as bad as editing via side effects) and, despite looking like they're being defined dynamically, the C# compiler defines real classes for them behind the scenes, so the "Name" property will always be of type IdentifierDetails and "Type" will always be an ITypeDetails.

The compiler creates new classes for every distinct combination of properties (considering both property name and property type). This means that if you declare two anonymous objects that have the same properties then they will be represented by the same class. This is what allows the above code to declare "updater" implementations such as

(value, name) => new { Name = name, Type = value.Type }

The "value" in the lambda will be an instance of an anonymous type and the returned value will be an instance of that same anonymous class because it specifies the exact same property names and types. This is key, because the "updater" is a delegate with the signature

Func<TResult, TValue, TResult> updater

(and so the returned value must be of the same type as the first value that it was passed).

This is not actually a bad solution, I don't think. There was no need to create a PartialPropertyDetails class and we have full type safety through those anonymous types. The only (admittedly minor) thing is that if the data becomes more complex then there will be more and more properties and so every instantiation of the anonymous types will get longer and longer. It's a pity that there's no way to create "With{Whatever}" functions for the anonymous types.

A minor side-track

Before I go any further, there's another extension method I want to introduce. I just think that the way that these parser chains are initiated feels a bit clumsy -

var result = Optional.For(MatchResult.New(
    new
    {
      Name = (IdentifierDetails)null,
      Type = (ITypeDetails)null,
    },
    reader
  ))
  .Then(Identifier, (value, name) => new { Name = name, Type = value.Type })
  // .. the rest of the parsing code continues here..

This could be neatened right up with an extension method such as this:

public static Optional<MatchResult<T>> StartMatching<T>(
  this IReadStringContent reader,
  T value)
{
  return MatchResult.New(value, reader);
}

This uses C#'s type inference to ensure that you don't have to declare the type of T (which is handy if we're using an anonymous type because we have no idea what its type name might be!) and it relies on the fact that the Optional struct has an implicit operator that allows a value T to be returned as an Optional<T>; it will wrap the value up automatically. (For more details on the Optional type, read what I wrote last time).

Now, the parsing code that we have look like this:

var resultWithAnonymousType = reader
  .StartMatching(new
  {
    Name = (IdentifierDetails)null,
    Type = (ITypeDetails)null
  })
  .Then(Identifier, (value, name) => new { Name = name, Type = value.Type })
  .ThenOptionally(Whitespace)
  .Then(Match(':'))
  .ThenOptionally(Whitespace)
  .Then(TypeScriptType, (value, type) => new { Name = value.Name, Type = Type })
  .ThenOptionally(Whitespace)
  .Then(Match(';'));

Only a minor improvement but another step towards making the code match the intent (which was one of the themes in my last post).

A cleverer (but less safe) alternative

Let's try turning the volume up to "silly" for a bit. (Fair warning: "clever" here refers more to "clever for the sake of it" than "intelligent).

A convenient property of the anonymous type classes is that they each have a constructor whose arguments directly match the properties on it - this is an artifact of the way that they're translated into regular classes by the compiler. You don't see this in code anywhere since the names of these mysterious classes is kept secret and you can't directly call a constructor without knowing the name of the class to call. But they are there, nonetheless. And there is one way to call them.. REFLECTION!

We could use reflection to create something like the "With{Whatever}" methods - that way, we could go back to only having to specify a single property-to-update in each "Then" call. The most obvious way that this could be achieved would be by specifying the name of the property-to-update as a string. But this is particularly dirty and prone to breaking if any refactoring is done (such as a change to a property name in the anonymous type). There is one way to mitigate this, though.. MORE REFLECTION!

Let me code-first, explain-later:

public static Optional<MatchResult<TResult>> Then<TResult, TValue>(
  this Optional<MatchResult<TResult>> resultSoFar,
  Parser<TValue> parser,
  Expression<Func<TResult, TValue>> propertyRetriever)
{
  if (!resultSoFar.IsDefined)
    return null;

  var result = parser(resultSoFar.Value.Reader);
  if (!result.IsDefined)
    return null;

  var memberAccessExpression = propertyRetriever.Body as MemberExpression;
  if (memberAccessExpression == null)
  {
    throw new ArgumentException(
      "must be a MemberAccess",
      nameof(propertyRetriever)
    );
  }

  var property = memberAccessExpression.Member as PropertyInfo;
  if ((property == null)
  || !property.CanRead
  || property.GetIndexParameters().Any())
  {
    throw new ArgumentException(
      "must be a MemberAccess that targets a readable, non-indexed property",
      nameof(propertyRetriever)
    );
  }

  foreach (var constructor in typeof(TResult).GetConstructors())
  {
    var valuesForConstructor = new List<object>();
    var encounteredProblemWithConstructor = false;
    foreach (var argument in constructor.GetParameters())
    {
      if (argument.Name == property.Name)
      {
        if (!argument.ParameterType.IsAssignableFrom(property.PropertyType))
        {
          encounteredProblemWithConstructor = false;
          break;
        }
        valuesForConstructor.Add(result.Value.Result);
        continue;
      }
      var propertyForConstructorArgument = typeof(TResult)
        .GetProperties()
        .FirstOrDefault(p =>
          (p.Name == argument.Name) &&
          p.CanRead && !property.GetIndexParameters().Any()
        );
      if (propertyForConstructorArgument == null)
      {
        encounteredProblemWithConstructor = false;
        break;
      }
      var propertyGetter = propertyForConstructorArgument.GetGetMethod();
      valuesForConstructor.Add(
        propertyGetter.Invoke(
          propertyGetter.IsStatic ? default(TResult) : resultSoFar.Value.Result,
          new object[0]
        )
      );
    }
    if (encounteredProblemWithConstructor)
      continue;

    return MatchResult.New(
      (TResult)constructor.Invoke(valuesForConstructor.ToArray()),
      result.Value.Reader
    );
  }
  throw new ArgumentException(
    $"Unable to locate a constructor that can be used to update {property.Name}"
  );
}

This allows the parsing code to be rewritten (again!) into:

var result = reader
  .StartMatching(new
  {
    Name = (IdentifierDetails)null,
    Type = (ITypeDetails)null
  })
  .Then(Identifier, x => x.Name)
  .ThenOptionally(Whitespace)
  .Then(Match(':'))
  .ThenOptionally(Whitespace)
  .Then(TypeScriptType, x => x.Type)
  .ThenOptionally(Whitespace)
  .Then(Match(';'));

Well now. Isn't that easy on the eye! Ok.. maybe beauty is in the eye of the beholder, so let me hedge my bets and say: Well now. Isn't that succint!

Those lambdas ("x => x.Name" and "x => x.Type") satisfy the form:

Expression<Func<TResult, TValue>>

This means that they are expressions which must take a TResult and return a TValue. So in the call

.Then(Identifier, x => x.Name)

.. the Expression describes how to take the anonymous type that we're threading through the "Then" calls and extract an IdentifierDetails instance from it (the type of this is dictated by the TValue type parameter on the "Then" method, which is inferred from the "Identifier" call - which returns an Optional<IdentifierDetails>).

This is the difference between an Expression and a Func - the Func is executable and tells us how to do something (such as "take the 'Name' property from the 'x' reference") while the Expression tells us how the Func is constructed.

This information allows the new version of "Then" to not only retrieve the specified property but also to be aware of the name of that property. And this is what allows the code to say "I've got a new value for one property now, I'm going to try to find a constructor that I can call which has an argument matching this property name (so I can satisfy that argument with this new value) and which has other arguments that can all be satisfied by other properties on the type".

Anonymous types boil down to simple classes, a little bit like this:

private sealed CRAZY_AUTO_GEN_NAME<T1, T2>
{
  public CRAZY_AUTO_GEN_NAME(T1 Name, T2 Type)
  {
    this.Name = Name;
    this.Type = Type;
  }
  public T1 Name { get; }
  public T2 Type { get; }
}

Note: I said earlier that the compiler generates distinct classes for anonymous types that have unique combinations of property names and types. That's a bit of a lie, it only has to vary them based upon the property names since it can use generic type parameters for the types of those properties. I confirmed this by using ildasm on my binaries, which also showed that the name of the auto-generated class was <>f_AnonymousType1.. it's not really called "CRAZY_AUTO_GEN_NAME" :)

So we can take the Expression "x => x.Name" and extract the fact the "Name" property is being targeted. This allows us to match the constructor that takes a "Name" argument and a "Type" argument and to call that constructor - passing the new value into the "Name" argument and passing the existing "Type" property value into the "Type" argument.

This has the benefit that everything would still work if one of the properties was renamed in a refactor (since if the "Name" property was renamed to "SomethingElse" then Visual Studio would update the lambda "x => x.Name" to "x => x.SomethingElse", just as it would for any other reference to that "Name" property).

The major downside is that the "Then" function requires that the Expression relate to a simple property retrieval, failing at runtime if this is not the case.* Since an Expression could be almost anything then this could be a problem. For example, the following is valid code and would compile -

.Then(Identifier, x => null)

.. but it would fail at runtime. This is what I mean by it not being safe.

But I've got to admit, this approach has a certain charm about it! Maybe it's not an appropriate mechanism for critical life-or-death production code, but for building a little parser for a personal project.. maybe I could convince myself it's worth it!

(Credit where it's due, I think I first saw this specify-a-property-with-an-Expression technique some years ago in AutoMapper, which is an example of code that is often used in production despite not offering compile-time guarantees about mappings - but has such convenience that the need to write tests around the mappings is outweighed by the benefits).

* (Other people might also point out that reflection is expensive and that that is a major downside - however, the code that is used here is fairly easy to wrap up in LINQ Expressions that are dynamically compiled so that repeated executions of hot paths are as fast as hand-written code.. and if the paths aren't hot and executed many times, what's the problem with the reflection being slower??)

Posted at 00:10

Comments

10 August 2015

Parsing TypeScript definitions (functional-ly.. ish)

A couple of things have happened recently, I've started using Visual Studio 2015 and C# 6 (read-only auto-properties, where have you been all my life???) and I've been playing around with using Bridge.net to write React applications in C#. So far, I've only written the bare minimum of bindings that I required to use React in Bridge so I was wondering if there was a shortcut to extending these - like taking the DefinitelyTyped TypeScript/React bindings and converting them somehow.

Looking at those TypeScript bindings, there really isn't that much content and it's not particularly complicated. Translating the majority of it into C# Bridge.net bindings by hand would probably not be a huge deal. But I don't really want to have to spend much time trying to update bindings when new versions of React come out (in fairness, React is fairly stable now and I suspect that any changes in the future will be very minor for most of the surface area of these bindings).

.. if I'm being really honest, though, I didn't just start out trying to manually translate the TypeScript bindings because I wanted to play around with writing some code! In the professional world, if there's a chance that a one-off(ish) operation like this will be much quicker (and not much less accurate) to perform by hand than by code then it's not an easy sell to say that you should code up an awesome general solution (that probably.. or.. hopefully will work) instead.-But this is all for my own enjoyment, so I'm going to do it however I damn well please! :D

Back in 2013, I wrote a post about parsing CSS. It pulled the style sheet input through an interface:

public interface IReadStringContent
{
  char? Current { get; }
  uint Index { get; }
  IReadStringContent GetNext();
}

This allowed the source to be abstracted away (is it a fixed string, is it a huge file being streamed in chunks; who cares!) and it allowed the logic around where-is-the-current-position-in-the-content to be hidden somewhat (which is where a lot of the complications and the off-by-one errors creep into things like this).

So I figured this seemed like a resonable place to start. But I also wanted to throw a few more ideas into the mix - I'm all for describing structures with immutable classes (if you've read any other post on this site, you probably already know that!) but I wanted to play around with using an F#-esque Optional class to allow the type system to indicate where nulls are and aren't acceptable. In the past, I've relied on comments - eg.

public interface IReadStringContent
{
  /// <summary>
  /// This will be null if all of the content has been consumed
  /// </summary>
  char? Current { get; }

  /// <summary>
  /// This will be an index within the string or the position immediately
  /// after it (if all content has been read)
  /// </summary>
  uint Index { get; }

  /// <summary>
  /// This will never return null. If all of the content has been read
  /// then the returned reference will have a null Current character
  /// </summary>
  IReadStringContent GetNext();
}

I think that, really, I want to give Code Contracts another try (so that I could describe in source code where nulls are not acceptable and then have the compiler verify that every call to such functions matches those constraints). But every time I try contracts again I find it daunting. And for this little project, I wanted to have fun with coding and not get bogged down again with something that I've struggled with in the past*.

* (Even typing that, it sounds like a poor excuse.. which it is; one day I'll give them a proper go!)

Anyway.. my point is, I'm getting bored of writing "This will never return null" or "This is optional and may return null" in. Over.. and over.. and over again. What I'd like is to know that a string property will never be null in my own code because if it was allowed to be null then the return type would be Optional<string> and not just string.

Well, there just happens to be such a construct that someone else has written that I'm going to borrow (steal). There's a GitHub project ImmutableObjectGraph that is about using T4 templates to take a minimal representation of a class and transforming it into a fully-populated immutable type, saving you from having to write the repetitive boiler plate code yourself. (There's actually less of this type of boiler pate required if you use C# 6 and the project is in the process of changing to use Roslyn instead of dirty T4 templates). I digress.. the point is that there is an optional type out there that does the job pretty well.

// Borrowed from https://github.com/AArnott/ImmutableObjectGraph but tweaked slightly
[DebuggerDisplay("{IsDefined ? Value.ToString() : \"<missing>\",nq}")]
public struct Optional<T>
{
  private readonly T value;
  private readonly bool isDefined;

  [DebuggerStepThrough]
  public Optional(T value)
  {
    // Tweak - if this is initialised with a null value then treat that the same
    // as this being a "Missing" value. I think you should either have a value or
    // not have a value (which is what null should be). I don't think you should
    // be able to choose between having a "real" value, having a value that is
    // null and having a Missing value. Not having a value (ie. null) and
    // Missing should be the same thing.
    this.isDefined = (value != null);
    this.value = value;
  }

  public static Optional<T> Missing
  {
    [DebuggerStepThrough]
    get { return new Optional<T>(); }
  }

  public bool IsDefined
  {
    [DebuggerStepThrough]
    get { return this.isDefined; }
  }

  public T Value
  {
    [DebuggerStepThrough]
    get { return this.value; }
  }

  [DebuggerStepThrough]
  public static implicit operator Optional<T>(T value)
    => new Optional<T>(value);

  public T GetValueOrDefault(T defaultValue)
    => this.IsDefined ? this.value : defaultValue;
}

public static class Optional
{
  public static Optional<T> For<T>(T value) => value;
}

I'm also thinking of trying to make all classes either abstract or sealed - inheritance always makes me a bit nervous when I'm trying to write "value-like" classes (I might want to write a nice custom Equals implementation, but then I start worrying that if someone derives from the type then there's no way to require that they also write their own Equals function that ensures that any new data they add to the type is considered in the comparison). Interestingly, someone posted a question Stack Overflow about this a few years ago, so rather than talk any more about it I'll just link to that: why are concrete types rarely useful as bases for further derivation.

On a less significant note, I've been meaning to create functions as static by default for a while now. I'm going to make a concerted effort this time. If a function in a class does not need to access any instance state, then marking it as static is a good way to highlight this fact to the reader - it's a sort-of indicator that the function is pure (nitpickers might highlight the fact that a static function could still interact with any static state that the containing class has, but - even in these cases - marking it as static still reduces the possible references that it could affect). Little touches like this can make code more readable, I find. The only problem comes when you could pass just that one piece of state into it, keeping the function static and pure.. but then you find you need to make a change and pass it another piece of state - at which point should you say that the number of arguments required offsets the benefits of purity? (I don't have an easy answer to that, in case you were wondering!).

By applying these principles, I've found that the style of code that I wrote changed a bit. And I think that the patterns that came out are quite interesting - maybe some of this is due to the changing nature of C#, or maybe it's because I've read so many blog posts over the last couple of years about functional programming that some of it was bound to stick!

Getting on with it

Imagine that, during parsing a TypeScript definition, we find ourselves within an interface and need to parse a property that it has -

name: string;

(I've picked this as a simple starting point - I realise that it seems an arbitrary place to begin at, but I had to start somewhere!)

I want to pick out the property name ("name") and its type ("string"). I could go through each character in a loop and keep track of when I hit token delimiters (such as the ":" between the name and the type), but it seems like this is going to be code that will need writing over and over again for different structures. As such, it seems like somewhere that a helper function would be very useful in cutting down on code duplication and "noise" in the code - eg.

public static Optional<MatchResult<string>> MatchAnythingUntil(
  IReadStringContent reader,
  ImmutableList<char> acceptableTerminators)
{
  var content = new StringBuilder();
  while (reader.Current != null)
  {
    if (acceptableTerminators.Contains(reader.Current.Value))
    {
      if (content.Length > 0)
        return MatchResult.New(content.ToString(), reader);
      break;
    }
    content.Append(reader.Current);
    reader = reader.GetNext();
  }
  return null;
}

(Note: I'm using the Microsoft Immutable Collections NuGet package, which is where the "ImmutableList" type comes from).

This relies upon the Optional type that I mentioned before, but also a MatchResult class -

public sealed class MatchResult<T>
{
  public MatchResult(T result, IReadStringContent reader)
  {
    Result = result;
    Reader = reader;
  }
  public T Result { get; }
  public IReadStringContent Reader { get; }
}

/// <summary>
/// This non-generic static class is just to expose a helper method that takes advantage
/// of C#'s type inference to allow you to say "MatchResult.New(value, reader)" rather than
/// having to write out the type of the value in "new MatchResult<string>(value, reader)"
/// </summary>
public static class MatchResult
{
  public static MatchResult<T> New<T>(T value, IReadStringContent reader)
    => return new MatchResult<T>(value, reader);
}

This result type is just used to say "here is the value that you asked for and here is a reader that picks up after that content". Neither the "Result" nor "Reader" properties have comments saying "This will never be null" because their properties are not wrapped in Optional types and so it is assumed that they never will be null.

The "MatchAnythingUntil" function returns a value of type Optional<MatchResult<string>>. This means that it is expected that this function may return a null value. Previously, I might have named this function "TryToMatchAnythingUntil" - to try to suggest through a naming convention that it may return null if it is unable to perform a match. I think that having this expressed through the type system is a big improvement.

Right.. now, let's try and make some more direct steps towards actually parsing something! I want to read that "name" value into a class called IdentifierDetails:

public sealed class IdentifierDetails
{
  public static ImmutableList<char> DisallowedCharacters { get; }
    = Enumerable.Range(0, char.MaxValue)
      .Select(c => (char)c)
      .Where(c => !char.IsLetterOrDigit(c) && (c != '$') && (c != '_'))
      .ToImmutableList();

  public IdentifierDetails(string value, SourceRangeDetails sourceRange)
  {
    if (value == "")
      throw new ArgumentException("may not be blank", nameof(value));

    var firstInvalidCharacter = value
      .Select((c, i) => new { Index = i, Character = c })
      .FirstOrDefault(c => DisallowedCharacters.Contains(c.Character));
    if (firstInvalidCharacter != null)
    {
      throw new ArgumentException(
        string.Format(
          "Contains invalid character at index {0}: '{1}'",
          firstInvalidCharacter.Index,
          firstInvalidCharacter.Character
        ),
        nameof(value)
      );
    }

    Value = value;
    SourceRange = sourceRange;
  }

  public string Value { get; }
  public SourceRangeDetails SourceRange { get; }

  public override string ToString() => $"{Value}";
}

public sealed class SourceRangeDetails
{
  public SourceRangeDetails(uint startIndex, uint length)
  {
    if (length == 0)
      throw new ArgumentOutOfRangeException(nameof(length), "must not be zero");

    StartIndex = startIndex;
    Length = length;
  }

  public uint StartIndex { get; }

  /// <summary>This will always be greater than zero</summary>
  public uint Length { get; }

  public override string ToString() => $"{StartIndex}, {Length}";
}

These classes illustrates more of the new C# 6 features - property initialisers, expression-bodied members and string interpolation. I think these are all fairly small tweaks that add up to large improvements overall. If you're not already familiar with these and want to know more then I highly recommend this Visual Studio extension: C# Essentials. It makes nice suggestions about how you could use the new features and has the facility to automatically apply the changes to your code; awesome! :)

Ok.. so now we can start tying this all together. Say we have an IReadStringContent implementation which is pointing at the "name: string;" content. We could write another function that tries to parse the property's details, starting with something like -

public static Optional<MatchResult<PropertyDetails>> GetProperty(IReadStringContent reader)
{
  var identifierMatch = MatchAnythingUntil(reader, IdentifierDetails.DisallowedCharacters);
  if (!identifierMatch.IsDefined)
    return null;

  var readerAfterIdentifier = identifierMatch.Value.Reader;
  var identifier = new IdentifierDetails(
    identifierMatch.Value.Result,
    new SourceRangeDetails(reader.Index, readerAfterIdentifier.Index - reader.Index)
  );

  // .. more code (to check for the ":" separator and to get the type of the property)..

This function will either return the details of the property or it will return null, meaning that it failed to do so (it is immediately obvious from the method signature that a null return value is possible since its return type is an Optional).

Before I go into details on the ".. more code.." section above, we could do with some more helper functions to parse the content after the "name" value - eg.

public static Optional<MatchResult<char>> MatchCharacter(
  IReadStringContent reader,
  char character)
{
  return (reader.Current == character)
    ? MatchResult.New(character, reader.GetNext())
    : null;
}

public static Optional<MatchResult<string>> MatchWhitespace(IReadStringContent reader)
{
  var content = new StringBuilder();
  while ((reader.Current != null) && char.IsWhiteSpace(reader.Current.Value))
  {
    content.Append(reader.Current.Value);
    reader = reader.GetNext();
  }
  return (content.Length > 0) ? MatchResult.New(content.ToString(), reader) : null;
}

While we're doing this, let's put the try-to-match-Identifier logic into its own function. It's going to be used in other places (the IdentifierDetails will be used for type names, variable names, interface names, class names, property names, etc..) so having it in a reusable method makes sense:

public static Optional<MatchResult<IdentifierDetails>> GetIdentifier(
  IReadStringContent reader)
{
  var identifierMatch = MatchAnythingUntil(reader, IdentifierDetails.DisallowedCharacters);
  if (!identifierMatch.IsDefined)
    return null;

  var readerAfterIdentifier = identifierMatch.Value.Reader;
  return MatchResult.New(
    new IdentifierDetails(
      identifierMatch.Value.Result,
      new SourceRangeDetails(reader.Index, readerAfterIdentifier.Index - reader.Index)
    ),
    readerAfterIdentifier
  );
}

Now, the work-in-progress GetProperty function looks like this:

public static Optional<MatchResult<PropertyDetails>> GetProperty(IReadStringContent reader)
{
  IdentifierDetails identifier;
  var identifierMatch = GetIdentifier(reader);
  if (identifierMatch.IsDefined)
  {
    identifier = identifierMatch.Value.Result;
    reader = identifierMatch.Value.Reader;
  }
  else
    return null;

  var colonMatch = MatchCharacter(reader, ':');
  if (colonMatch.IsDefined)
    reader = colonMatch.Value.Reader;
  else
    return null;

  var whitespaceMatch = MatchWhitespace(reader);
  if (colonMatch.IsDefined)
    reader = whitespaceMatch.Value.Reader;

  // .. more code..

Hmm.. that's actually quite a lot of code. And there are already three patterns emerging which are only apparent if you look closely and pay attention to what's going on -

The "identifier" reference is required and the retrieved value is stored (then the "reader" reference is updated to point at the content immediately after the identifier)
The colon separator is required, but the "Value" of the MatchResult is not important - we just want to know that the character is present (and to move the "reader" reference along to the content after it)
The white space matching is optional - there doesn't need to be whitespace after the colon and before the type name (and if there is such whitespace then we don't store its value, it's of no interest to us, we just want to be able to get a "reader" reference that skips over any whitespace)

This feels like it's going to get repetitive very quickly and that silly mistakes could easily slip in. I don't like the continued re-assignment of the "reader" reference, I can see myself making a stupid mistake in code like that quite easily.

On top of this, the code above misses a valid case - where there is whitespace before the colon as well as after it (eg. "name : string;").

We can do better.

Let's get functional

What I'd like to do is chain the content-matching methods together. My first thought was how jQuery chains functions one after another, but LINQ might have been just as appropriate an inspiration.

The way that both of these libraries work is that they have standard input / output types. In LINQ, you can get a lot of processing done based just on IEnumerable<T> references. With jQuery, you're commonly working with sets of DOM elements (it has methods like "map", though, so nothing forces you to operate only on sets of elements).

So what I need to do is to encourage my parsing functions into a standard form. Well.. that's not quite true. I like my parsing functions as they are! They're clear and succinct and their type signatures are informative ("GetIdentifier" takes a non-null reader reference and returns either a result-plus-reader-after-the-result or a no-match response; similarly MatchCharacter has a very simple but descriptive signature).

What I really want to do is to create something generic that will wrap these parser functions and let me chain them together.

Before I do this, I'm going to try to create a definition for the parser signatures that I want to wrap. (I know, I know - I've just said that I'm happy with my little matching functions just the way they are! But this will only be a way to guide them all together into a happy family). So this is what I've come up with:

public delegate Optional<MatchResult<T>> Parser<T>(
  IReadStringContent reader
);

The "GetIdentifier" method already matches this delegate perfectly. As does "MatchWhitespace". The "MatchCharacter" function doesn't, however, due to its argument that specifies what character it is to look for.

What I'm going to do, then, is change "MatchCharacter" into a function that creates a Parser delegate. We go from:

public static Optional<MatchResult<char>> MatchCharacter(
  IReadStringContent reader,
  char character)
{
  return (reader.Current == character)
    ? MatchResult.New(character, reader.GetNext())
    : null;
}

public static Parser<char> MatchCharacter(char character)
{
  return reader =>
  {
    return (reader.Current == character)
      ? MatchResult.New(character, reader.GetNext())
      : null;
  };
}

That was pretty painless, yes?

But now that our parser functions take a consistent form, we can wrap them up in some sort of structure that allows them to be chained together. I'm going to create an extension method -

public static Optional<IReadStringContent> Then<T>(
  this Optional<IReadStringContent> reader,
  Parser<T> parser,
  Action<T> report)
{
  if (!reader.IsDefined)
    return null;

  var result = parser(reader.Value);
  if (!result.IsDefined)
    return null;

  report(result.Value.Result);
  return Optional.For(result.Value.Reader);
}

This is an extension for Optional<IReadStringContent> that also returns an Optional<IReadStringContent>. So multiple calls to this extension method could be chained together. It takes a Parser<T> which dictates what sort of parsing it's going to attempt. If it succeeds then it passes the matched value out through an Action<T> and then returns a reader for the position in the content immediately after that match. If the parser fails to match the content then it returns a "Missing" optional reader.

Any time that "Then" is called, if it is given a "Missing" reader then it returns a "Missing" response immediately. That means that we could string these together and when one of them fails to match, any subsequent "Then" calls will effectively be skipped over and the "Missing" value will pop out the bottom.

Before I show some demonstration code, I want to introduce two variations. For values such as the property name (going back to the "name: string;" example), the name of that property is vitally important - it's a key part of the content. However, with the colon separator, we only want to know that it's present, we don't really care about it's value. So the first variation of "Then" doesn't bother with the I've-matched-this-content callback:

public static Optional<IReadStringContent> Then<T>(
  this Optional<IReadStringContent> reader,
  Parser<T> parser)
{
  return Then(reader, parser, value => { });
}

The second variation relates to whitespace-matching. The thing with whitespace around symbols is that, not only do we not care about the "value" of that whitespace, we don't care whether it's there or not! The space in "name: string;" is not significant. Enter "ThenOptionally":

public static Optional<IReadStringContent> ThenOptionally<T>(
  this Optional<IReadStringContent> reader,
  Parser<T> parser)
{
  if (!reader.IsDefined)
    return null;

  return Optional.For(
    Then(reader, parser).GetValueOrDefault(reader.Value)
  );
}

This will try to apply a parser to the current content (if there is any). If the parser matches content, that's great! We'll return a reader that points to the content immediately after the match. However, if the parser doesn't match any content, then we'll just return the same reader that we were given in the first place.

This becomes less abstract if we refactor the "GetProperty" function to take advantage of these new functions -

public static Optional<MatchResult<PropertyDetails>> GetProperty(IReadStringContent reader)
{
  IdentifierDetails name = null;
  ITypeDetails type = null;
  var readerAfterProperty = Optional.For(reader)
    .Then(GetIdentifier, value => name = value)
    .ThenOptionally(MatchWhitespace)
    .Then(MatchCharacter(':'))
    .ThenOptionally(MatchWhitespace)
    .Then(GetTypeScriptType, value => type = value)
    .ThenOptionally(MatchWhitespace)
    .Then(MatchCharacter(';'));

  if (!readerAfterProperty.IsDefined)
    return null;

  return MatchResult.New(
    new PropertyDetails(name, type),
    readerAfterProperty.Value
  );
}

Well, isn't that much nicer!

Before, it looked like this method was going to run away with itself. And I said that I didn't like the fact that there seemed be a lot of "noise" required to describe the patterns "if this is matched then record the value and move on" / "if this is matched, I don't care about its value, just move the reader on" / "I don't care if this is matched or not, but if it is then move the reader past it".

This is a lovely example of C#'s type inference - each call to "Then" or "ThenOptionally" has a generic type, but those types did not need to be explicitly specified in the code above. When "Then" is provided "GetIdentifier" as an argument, it knows that the "T" in "Then<T>" is an IdentifierDetails because the return type of "GetIdentifier" is an Optional<MatchResult<IdentifierDetails>>.

It's also an example of.. if I've got this right.. the Monad Pattern. That concept that seems impossibly theoretical in 99% of articles you read (or maybe that's just me). Even this highly-upvoted answer on Stack Overflow by the ever-readable Eric Lippert starts off sounding very dense and full of abstract terms: Monad in plain English.

I don't want to get too hung up on all that, though, I just thought it was interesting to highlight the concept in use in some practical code. Plus, if I try to make myself out as some functional programming guru then someone might pick on me for writing "Then" in such a way that it requires a side effect to report the value it matches - side effects are so commonly (and fairly, imho) seen as an evil that must be minimised that it seems strange to write a function that will be used in many places that relies upon them. However, with C#, I think it would be complicated to come up with a structure that is passed through the "Then" calls, along with the Optional<IReadStringContent>, and since the side effects are so localised here, it's not something I'm going to lose sleep over. Maybe it will be easier if "record types" get implemented in C# 7, or maybe it's just an exercise for the reader to come up with an elegant solution and let me know! :)

Speaking of exercises for the reader, you might have noticed that I haven't included the source code for the ITypeDetails interface or the "GetTypeScriptType" function. That's because I haven't finished writing them yet! To be honest, I've been having a bit of an attention span problem so I thought I'd switch from writing code to writing a blog post for a little while. Hopefully, at some point, I'll release the source in case it's of any use to anyone else - but for now it's such work-in-progress that it's not even much use to me yet!

After wrapping the "GetIdentifier", "MatchCharacter", etc.. functions in "Then" calls, I felt like renaming them. In general, I prefer to have variables and properties named as nouns and functions named as verbs. However, when functions are being passed around as arguments for other functions, it blurs the borders! So, since "GetIdentifier" and "MatchWhitespace" are only ever passed into "Then" as arguments, I've switched them from being verbs into being nouns (simply "Identifer" and "Whitespace").

I've also shortened the "MatchCharacter" function to just "Match". It's still a verb, but that's because it will be passed a character to match - and the Parser that the "Match" function returns is passed to "Then". (To make things worse, "Match" can be either a verb or a noun, but in this case I'm taking it to be a verb!).

Now "GetPropertyDetails" looks like this:

public static Optional<MatchResult<PropertyDetails>> GetProperty(IReadStringContent reader)
{
  IdentifierDetails name = null;
  ITypeDetails type = null;
  var readerAfterProperty = Optional.For(reader)
    .Then(Identifier, value => name = value)
    .ThenOptionally(Whitespace)
    .Then(Match(':'))
    .ThenOptionally(Whitespace)
    .Then(TypeScriptType, value => type = value)
    .ThenOptionally(Whitespace)
    .Then(Match(';'));

  if (!readerAfterProperty.IsDefined)
    return null;

  return MatchResult.New(
    new PropertyDetails(name, type),
    readerAfterProperty.Value
  );
}

They're only little changes, but now the code is closer to directly expressing the intent ("match an Identifier then optionally match whitespace then match the colon separator then optionally whitespace then the type, etc..").

To take another example from the work I've started - this is how I parse a TypeScript interface header:

IdentifierDetails interfaceName = null;
ImmutableList<GenericTypeParameterDetails> genericTypeParameters = null;
ImmutableList<NamedTypeDetails> baseTypes = null;
var readerAfterInterfaceHeader = Optional.For(reader)
  .Then(Match("interface"))
  .Then(Whitespace)
  .Then(Identifier, value => interfaceName = value)
  .ThenOptionally(Whitespace)
  .If(Match('<')).Then(GenericTypeParameters, value => genericTypeParameters = value)
  .ThenOptionally(Whitespace)
  .ThenOptionally(InheritanceChain, value => baseTypes = value)
  .ThenOptionally(Whitespace)
  .Then(Match('{'));

There are a couple of functions I haven't included the source for here ("GenericTypeParameters" and "InheritanceChain") because I really wanted this example to be used to show how easy it is to extend functionality like "Then" and "Match" for different cases. I introduced a "Match" method signature that takes a string instead of a single character -

public static Parser<string> Match(string value)
{
  if (value == "")
    throw new ArgumentException("may not be blank", nameof(value));

  return reader =>
  {
    // Enumerate the characters in the "value" string and ensure that the reader
    // has characters that match, moving the reader along one character each
    // time until either every character in the string is tested or until
    // a non-matching character is encountered from the reader
    var numberOfMatchedCharacters = value
      .TakeWhile(c =>
      {
        if (reader.Current != c)
          return false;
        reader = reader.GetNext();
        return true;
      })
      .Count();
    if (numberOfMatchedCharacters < value.Length)
      return null;

    return MatchResult.New(value, reader);
  };
}

And I introduced an "If" function, which basically says "if this Parser is found to be a match, then move past it and use this other Parser on the subsequent content - but if that first Parser did not match, then ignore the second one". This makes it easy to parse either of the following interface headers:

// This interface has no generic type parameters..
interface ISomething {

// .. but this one does
interface ISomethingElse<TKey, TValue> {

The "If" function allows us to say "if this interface has a pointy bracket that indicates some generic type params then parse them, otherwise don't bother trying". And it could be implemented thusly:

public static Optional<IReadStringContent> If<TCondition, TResult>(
  this Optional<IReadStringContent> reader,
  Parser<TCondition> condition,
  Parser<TResult> parser,
  Action<TResult> report)
{
  if (!reader.IsDefined)
    return null;

  var readerAfterCondition = condition(reader.Value);
  if (!readerAfterCondition.IsDefined)
    return reader;

  return Then(Optional.For(readerAfterCondition.Value.Reader), parser, report);
}

However, that function takes three arguments in one continuous list: 1. the condition, 2. the parser-if-condition-is-met, 3. callback-if-condition-is-met-and-then-subsequent-parser-matches-content. So it would be called like:

.If(Match('<'), GenericTypeParameters, value => genericTypeParameters = value)

which isn't terrible.. but I think it's even better to have "If" return an interim type (a "ConditionalParser<T>") that has its own "Then" function that is only called if the condition parser matches content - ie.

// Using C# 6 "expression-bodied member" syntax..
public static ConditionalParser<T> If<T>(
  this Optional<IReadStringContent> reader,
  Parser<T> parser) => new ConditionalParser<T>(reader, parser);

public sealed class ConditionalParser<TCondition>
{
  private readonly Optional<IReadStringContent> _reader;
  private readonly Parser<TCondition> _condition;
  public ConditionalParser(
    Optional<IReadStringContent> reader,
    Parser<TCondition> condition)
  {
    _reader = reader;
    _condition = condition;
  }

  public Optional<IReadStringContent> Then<T>(Parser<T> parser, Action<T> report)
  {
    if (!_reader.IsDefined || !_reader.Then(_condition).IsDefined)
      return _reader;

    return _reader.Then(parser, report);
  }
}

That way, the resulting code is just a little closer to the original intent. Instead of

.If(Match('<'), GenericTypeParameters, value => genericTypeParameters = value)

we can use

.If(Match('<')).Then(GenericTypeParameters, value => genericTypeParameters = value)

And because the ConditionalParser's "Then" function returns an Optional<IReadStringContent> we can go back to chaining more "Then" calls right on to it (as I do in the "InterfaceHeader" example above).

In conclusion..

I really like how the techniques I've talked about work together and how combining these small functions enables complex forms to be described while still keeping things manageable. Having the complexity grow at a nice linear rate, rather than multiplying out every time something new needs to be considered, is very gratifying. Not to mention that having these small and simple functions also makes automated testing a lot easier.

All of the functions shown are pure (although some technically have side effects, these effects are only
actioned through lambdas that are passed to the functions as arguments) and nothing shares state, which makes the code very easy to reason about.

And it's nice to be able to see my own coding style evolving over time. It's an often-heard cliché that you'll hate your own code from six months ago, but lately I feel like I'm looking back to code from a couple of years ago and being largely content with it.. there's always things I find that I'd do slightly differently now, but rarely anything too egregious (I had to fix up some code I wrote more than five years ago recently, now that was an unpleasant surprise!) so it's gratifying to observe myself changing (in positive ways, I hope!).

Maybe now's a good time to give F# another proper go..

Posted at 22:06

Comments

Strongly-typed React (with Bridge.net)

A starting point

Why must the HTMLAttributes have almost 150 properties??

More type shenanigans

A nice bonus

A change in Bridge 1.8

What's next

Hassle-free immutable type updates in C#

The verbose approach

One way to write less code

A minor side-track

A cleverer (but less safe) alternative

Parsing TypeScript definitions (functional-ly.. ish)

Getting on with it

Let's get functional

In conclusion..

About

Recent Posts

Highlights

Archives

You may also be interested in:

You may also be interested in:

You may also be interested in (see here for information about how these are generated):

About

Recent Posts

Highlights

Archives