Productive Rage

11 February 2016

Easy "PureComponent" React performance boosts for Bridge.Net

React's great strength is that it makes creating UIs simple(r) because you can treat the view as a pure function - often, you essentially give a props reference into a top level component and it works out what to draw. Then, when something changes, you do the same again; trigger a full re-draw and rely upon React's Virtual DOM to work out what changed in an efficient manner and apply those changes to the browser DOM. The browser DOM is slow, which is why interactions with it should be minimised. The Virtual DOM is fast.

The common pre-React way to deal with UIs was to have some code to render the UI in an initial state and then further code that would change the UI based upon user interactions. React reduces these two types of state-handling (initial-display and update-for-specific-interaction) into one (full re-render).

And a lot of the time, the fast Virtual DOM performs quickly enough that you don't have to worry about what it's doing. But sometimes, you may have a UI that is so complicated that it's a lot of work for the Virtual DOM to calculate the diffs to apply to the browser DOM. Or you might have particularly demanding performance requirements, such as achieving 60 fps animations on mobile.

Handily, React has a way for you to give it hints - namely the ShouldComponentUpdate method that components may implement. This method can look at the component's current props and state values and the next props and state values and let React know if any changes are required. The method returns a boolean - false meaning "no, I don't need to redraw, this data looks the same" and true meaning "yes, I need to redraw for this new data". The method is optional, if a component doesn't implement it then it's equivalent to it always returning true. Remember, if a component returns true for "do I need to be redrawn?", the Virtual DOM is still what is responsible for dealing with the update - and it usually deals with it in a very fast and efficient manner. Returning true is not something to necessarily be worried about. However, if you can identify cases where ShouldComponentUpdate can return false then you can save the Virtual DOM from working out whether that component or any of its child components need to be redrawn. If this can be done high up in a deeply-nested component tree then it could save the Virtual DOM a lot of work.

The problem is, though, that coming up with a mechanism to reliably and efficiently compare object references (ie. props and / or state) to determine whether they describe the same data is difficult to do in the general case.

Let me paint a picture by describing a very simple example React application..

The Message Editor Example

Imagine an app that can read a list of messages from an API and allow the user of the app to edit these messages. Each message has "Content" and "Author" properties that are strings. Either of these values may be edited in the app. These messages are part of a message group that has a title - this also may be edited in the app.

(I didn't say that it was a useful or realistic app, it's just one to illustrate a point :)

The way that I like to create React apps is to categorise components as one of two things; a "Container Component" or a "Presentation Component". Presentation Components should be state-less, they should just be handed a props reference and then go off and draw themselves. Any interactions that the user makes with this component or any of its child components are effectively passed up (via change handlers on the props reference) until it reaches a Container Component. The Container Component will translate these interaction into actions to send to the Dispatcher. Actions will be handled by a store (that will be listening out for Dispatcher actions that it's interested in). When a store handles an action, it emits a change event. The Container Component will be listening out for change events on stores that it is interested in - when this happens, the Container Component will trigger a re-render of itself by updating its state based upon data now available in the store(s) it cares about. This is a fairly standard Flux architecture and, I believe, the terms "Container Component" / "Presentation Component are in reasonably common use (I didn't make them up, I just like the principle - one of the articles that I've read that uses these descriptions is Component Brick and Mortar: The React documentation I wish I had a year ago).

So, for my example app, I might have a component hierarchy that looks this:

AppContainer
  Title
    TextInput
      Input
  MessageList
    MessageRow
      TextInput
        Input
      TextInput
        Input
    MessageRow
      TextInput
        Input
      TextInput
        Input

There will be as many "MessageRow" components as there are messages to edit. Input is a standard React-rendered element and all of the others (AppContainer, Title, MessageList, MessageRow and TextInput) are custom components.

(Note: This is not a sufficiently deeply-nested hierarchy that React would have any problems with rendering performance, it's intended to be just complicated enough to demonstrate the point that I'm working up to).

The AppContainer is the only "Container Component" and so is the only component that has a React state reference as well as props. A state reference is, essentially, what prevents a component from being what you might consider a "pure function" - where the props that are passed in are all that affects what is rendered out. React "state" is required to trigger a re-draw of the UI, but it should be present in as few places as possible - ie. there should only be one, or a small number of, top level component(s) that have state. Components that render only according to their props data are much easier to reason about (and hence easier to write, extend and maintain).

My Bridge.NET React bindings NuGet package makes it simple to differentiate between stateful (ie. Container) components and stateless (ie. Presentation) components as it has both a Component<TProps, TState> base class and a StatelessComponent<TProps> base class - you derive from the appropriate one when you create custom components (for more details, see React (and Flux) with Bridge.net - Redux).

To start with the simplest example, below is the TextInput component. This just renders a text Input with a specified value and communicates up any requests to change that string value via an "OnChange" callback -

public class TextInput : StatelessComponent<TextInput.Props>
{
  public TextInput(Props props) : base(props) { }

  public override ReactElement Render()
  {
    return DOM.Input(new InputAttributes
    {
      Type = InputType.Text,
      Value = props.Content,
      OnChange = OnTextChange
    });
  }

  private void OnTextChange(FormEvent<InputElement> e)
  {
    props.OnChange(e.CurrentTarget.Value);
  }

  public class Props
  {
    public string Content { get; set; }
    public Action<string> OnChange { get; set; }
  }
}

It is fairly easy to envisage how you might try to implement "ShouldComponentUpdate" here - given a "this is the new props value" reference (which gets passed into ShouldComponentUpdate as an argument called "nextProps") and the current props reference, you need only look at the "Content" and "OnChange" references on the current and next props and, if both Content/Content and OnChange/OnChange references are the same, then we can return false (meaning "no, we do not need to re-draw this TextInput").

(Two things to note here: Firstly, it is not usually possible to directly compare the current props reference with the "nextProps" reference because it is common for the parent component to create a new props instance for each proposed re-render of a child component, rather than re-use a previous props instance - so the individual property values within the props references may all be consistent between the current props and nextProps, but the actual props references will usually be distinct. Secondly, the Bridge.NET React bindings only support React component life cycle method implementations on custom components derived from Component<TProps, TState> classes and not those derived from StatelessComponent<TProps>, so you couldn't actually write your own "ShouldComponentUpdate" for a StatelessComponent - but that's not important here, we're just working through a thought experiment).

Now let's move on to the MessageList and MessageRow components, since things get more complicated there -

public class MessageList : StatelessComponent<MessageList.Props>
{
  public MessageList(Props props) : base(props) { }

  public override ReactElement Render()
  {
    var messageRows = props.IdsAndMessages
      .Select(idAndMessage => new MessageRow(new MessageRow.Props
      {
        Key = idAndMessage.Item1,
        Message = idAndMessage.Item2,
        OnChange = newMessage => props.OnChange(idAndMessage.Item1, newMessage)
      }));
    return DOM.Div(
      new Attributes { ClassName = "message-list" },
      messageRows
    );
  }

  public class Props
  {
    public Tuple<int, MessageEditState>[] IdsAndMessages;
    public Action<int, MessageEditState> OnChange;
  }
}

public class MessageRow : StatelessComponent<MessageRow.Props>
{
  public MessageRow(Props props) : base(props) { }

  public override ReactElement Render()
  {
    // Note that the "Key" value from the props reference does not explicitly need
    // to be mentioned here, the React bindings will deal with it (it is important
    // to give dynamic children components unique key values, but it is handled by
    // the bindings and the React library so long as a "Key" property is present
    // on the props)
    // - See https://facebook.github.io/react/docs/multiple-components.html for
    //   more details
    return DOM.Div(new Attributes { ClassName = "message-row" },
      new TextInput(new TextInput.Props
      {
        Content = props.Message.Content,
        OnChange = OnContentChange
      }),
      new TextInput(new TextInput.Props
      {
        Content = props.Message.Author,
        OnChange = OnAuthorChange
      })
    );
  }

  private void OnContentChange(string newContent)
  {
    props.OnChange(new MessageEditState
    {
      Content = newContent,
      Author = props.Message.Author
    });
  }
  private void OnAuthorChange(string newAuthor)
  {
    props.OnChange(new MessageEditState
    {
      Content = props.Message.Content,
      Author = newAuthor
    });
  }

  public class Props
  {
    public int Key;
    public MessageEditState Message;
    public Action<MessageEditState> OnChange;
  }
}

public class MessageEditState
{
  public string Content;
  public string Author;
}

If the MessageList component wanted to implement "ShouldComponentUpdate" then its job is more difficult as it has an array of message data to check. It could do one of several things - the first, and most obviously accurate, would be to perform a "deep compare" of the arrays from the current props and the "nextProps"; ensuring firstly that there are the same number of items in both and then comparing each "Content" and "Author" value in each item of the arrays. If everything matches up then the two arrays contain the same data and (so long as the "OnChange" callback hasn't changed) the component doesn't need to re-render. Avoiding re-rendering this component (and, subsequently, any of its child components) would be a big win because it accounts for a large portion of the total UI. Not re-rendering it would give the Virtual DOM much less work to do. But would a deep comparison of this type actually be any cheaper than letting the Virtual DOM do what it's designed to do?

The second option is to presume that whoever created the props references would have re-used any MessageEditState instances that haven't changed. So the array comparison could be reduced to ensuring that the current and next props references both have the same number of elements and then performing reference equality checks on each item.

The third option is to presume that whoever created the props reference would have re-used the array itself if the data hadn't changed, meaning that a simple reference equality check could be performed on the current and next props' arrays.

The second and third options are both much cheaper than a full "deep compare" but they both rely upon the caller following some conventions. This is why I say that this is a difficult problem to solve for the general case.

Immutability to the rescue

There is actually another option to consider, the object models for the props data could be rewritten to use immutable types. These have the advantage that if you find that two references are equal then they are guaranteed to contain the same data. They also have the advantage that it's much more common to re-use instances to describe the same data - partly because there is some overhead to initialising immutable types and partly because there is no fear that "if I give this reference to this function, I want to be sure that it can't change the data in my reference while doing its work" because it is impossible to change an immutable reference's data. (I've seen defensively-written code that clones mutable references that it passes into other functions, to be sure that no other code can change the data in the original reference - this is never required with immutable types).

Conveniently, I've recently written a library to use with Bridge.NET which I think makes creating and working with immutable types easier than C# makes it on its own. I wrote about it in "Friction-less immutable objects in Bridge (C# / JavaScript) applications" but the gist is that you re-write MessageEditState as:

// You need to pull in the "ProductiveRage.Immutable" NuGet package to use IAmImmutable
public class MessageEditState : IAmImmutable
{
  public MessageEditState(string content, string author)
  {
    this.CtorSet(_ => _.Content, content);
    this.CtorSet(_ => _.Author, author);
  }
  public string Content { get; private set; }
  public string Author { get; private set; }
}

It's still a little more verbose than the mutable version, admittedly, but I'm hoping to convince you that it's worth it (if you need convincing!) for the benefits that we'll get.

When you have an instance of this new MessageEditState class, if you need to change one of the properties, you don't have to call the constructor each time to get a new instance, you can use the "With" extension methods that may be called on any IAmImmutable instance - eg.

var updatedMessage = message.With(_ => _.Content, "New information");

This would mean that the change handlers from MessageRow could be altered from:

private void OnContentChange(string newContent)
{
  props.OnChange(new MessageEditState
  {
    Content = newContent,
    Author = props.Message.Author
  });
}
private void OnAuthorChange(string newAuthor)
{
  props.OnChange(new MessageEditState
  {
    Content = props.Message.Content,
    Author = newAuthor
  });
}

and replaced with:

private void OnContentChange(string newContent)
{
  props.OnChange(props.Message.With(_ => _.Content, newContent));
}
private void OnAuthorChange(string newAuthor)
{
  props.OnChange(props.Message.With(_ => _.Author, newAuthor));
}

Immediately, the verbosity added to MessageEditState is being offset with tidier code! (And it's nice not having to set both "Content" and "Author" when only changing one of them).

The "With" method also has a small trick up its sleeve in that it won't return a new instance if the new property value is the same as the old property value. This is an eventuality that could happen in the code above as an "Input" element rendered by React will raise an "OnChange" event for any action that might have altered the text input's content. For example, if you had a text box with the value "Hello" in it and you selected all of that text and then pasted in text from the clipboard over the top of it, if the clipboard text was also "Hello" then the "OnChange" event will be raised, even though the actual value has not changed (it was "Hello" before and it's still "Hello" now). The "With" method will deal with this, though, and just pass the same instance straight back out. This is an illustration of the "reuse of instances for unchanged data" theme that I alluded to above.

The next step would be to change the array type in the MessageList.Props type from

public Tuple<int, MessageEditState>[] IdsAndMessages;

public NonNullList<Tuple<int, MessageEditState>> IdsAndMessages;

The NonNullList class is also in the ProductiveRage.Immutable NuGet package. It's basically an immutable IEnumerable that may be used in Bridge.NET projects. A simple example of it in use is:

// Create a new set of values (the static "Of" method uses type inference to determine
// the type of "T" in the returned "NonNullList<T>" - since 1, 2 and 3 are all ints, the
// "numbers" reference will be of type "NonNullList<int>")
var numbers = NonNullList.Of(1, 2, 3);

// SetValue takes an index and a new value, so calling SetValue(2, 4) on a set
// containing 1, 2, 3 will return a new set containing the values 1, 2, 4
numbers = numbers.SetValue(2, 4);

// Calling SetValue(2, 4) on a set containing values 1, 2, 4 does not require any
// changes, so the input reference is passed straight back out
numbers = numbers.SetValue(2, 4);

As with IAmImmutable instances we get two big benefits - we can rely on reference equality comparisons more often, since the data with any given reference can never change, and references will be reused in many cases if operations are requested that would not actually change the data. (It's worth noting that the guarantees fall apart if any property on an IAmImmutable reference is a of a mutable type, similarly if a NonNullList has elements that are a mutable type, or that have nested properties that are of a mutable type.. but so long as immutability is used "all the way down" then all will be well).

If this philosophy was followed, then suddenly the "ShouldComponentUpdate" implementation for the MessageList component would be very easy to write - just perform reference equality comparisons on the "IdsAndMessages" and "OnChange" values on the current props and on the nextProps. While solving the problem for the general case is very difficult, solving it when you introduce some constraints (such as the use of immutable and persistent data types) can be very easy!

If we did implement this MessageList "ShouldComponentUpdate" method, then we could be confident that when a user makes changes to the "Title" text input that the Virtual DOM would not have to work out whether the MessageList or any of its child components had changed - because we'd have told the Virtual DOM that they hadn't (because the "IdsAndMessages" and "OnChange" property references wouldn't have changed).

We could take this a step further, though, and consider the idea of implementing "ShouldComponentUpdate" on other components - such as MessageRow. If the user edits a text value within one row, then the MessageList will have to perform some re-rendering work, since one of its child components needs to be re-rendered. But there's no need for any of the other rows to re-render, it could be just the single row in which the change was requested by the user.

So the MessageRow could look at its props values and, if they haven't changed between the current props and the nextProps, then inform React (via "ShouldComponentUpdate") that no re-render is required.

And why not go even further and just do this on all Presentation Components? The TextInput could avoid the re-render of its child Input if the props' "Content" and "OnChange" reference are not being updated.

Introducing the Bridge.React "PureComponent"

To make this easy, I've added a new base class to the React bindings (available in 1.4 of Bridge.React); the PureComponent<TProps>.

This, like the StatelessComponent<TProps>, is very simple and does not support state and only allows the "Render" method to be implemented - no other React lifecycle functions (such "ComponentWillMount", "ShouldComponentUpdate", etc..) may be defined on components deriving from this class.

The key difference is that it has its own "ShouldComponentUpdate" implementation that presumes that the props data is immutable and basically does what I've been describing above automatically - when React checks "ShouldComponentUpdate", it will look at the "props" and "nextProps" instances and compare their property values. (It also deals with the cases where one or both of them are null, in case you want components whose props reference is optional).

This is not an original idea, by a long shot. I first became aware of people doing this in 2013 when I read The Future of JavaScript MVC Frameworks, which was talking about using ClojureScript and its React interface "Om". More recently, I was reading Performance Engineering with React (Part 1), which talks about roughly the same subject but with vanilla JavaScript. And, of course, Facebook has long had its PureRenderMixin - though mixins can't be used with ES6 components (which seems to be the approach to writing components that Facebook is pushing at the moment).

So, this is largely just making it easy it when writing React applications with Bridge. However, using Bridge to do this does give us some extra advantages (on top of the joy of being able to write React apps in C#!). In the code earlier (from the MessageRow Render method) -

new TextInput(new TextInput.Props
{
  Content = props.Message.Content,
  OnChange = OnContentChange
})

Bridge will bind the "OnContentChange" method to the current MessageRow instance so that when it is called by the TextInput's "OnChange" event, "this" is the MessageRow and not the TextInput (which is important because OnContentChange needs to access the "props" reference scoped to the MessageRow).

This introduces a potential wrinkle in our plan, though, as this binding process creates a new JavaScript method each time and means that each time the TextInput is rendered, the "OnChange" reference is new. So if we try to perform simple reference equality checks on props values, then we won't find the current "OnChange" and the new "OnChange" to be the same.

This problem is mentioned in the "Performance Engineering" article I linked above:

Unfortunately, each call to Function.bind produces a new function.. No amount of prop checking will help, and your component will always re-render.

..

The simplest solution we've found is to pass the unbound function.

When using Bridge, we don't have the option of using an unbound function since the function-binding is automatically introduced by the C#-to-JavaScript translation process. And it's very convenient, so it's not something that I'd ideally like to have to workaround.

Having a dig through Bridge's source code, though, revealed some useful information. When Bridge.fn.bind is called, it returns a new function (as just discussed).. but with some metadata attached to it. When it returns a new function, it sets two properties on it "$scope" and "$method". The $scope reference is what "this" will be set to when the bound function is called and the $method reference is the original function that is being bound. This means that, when the props value comparisons are performed, if a value is a function and it the reference equality comparison fails, a fallback approach may be attempted - if both functions have $scope and $method references defined then compare them and, if they are both consistent between the function value on the current props and the function value on the nextProps, then consider the value to be unchanged.

The PureComponent's "ShouldComponentUpdate" implementation deals with this automatically, so you don't have to worry about it.

It's possibly worth noting that the "Performance Engineering" post did briefly consider something similar -

Another possibility we've explored is using a custom bind function that stores metadata on the function itself, which in combination with a more advanced check function, could detect bound functions that haven't actually changed.

Considering that Bridge automatically includes this additional metadata, it seemed to me to be sensible to use it.

There's one other equality comparison that is supported; as well as simple referential equality and the function equality gymnastics described above, if both of the values are non-null and the first has an "Equals" function then this function will be considered. This means that any custom "Equals" implementations that you define on classes will be automatically taken into consideration by the PureComponent's logic.

Another Bridge.NET bonus: Lambda support

When I started writing this post, there was going to be a section here with a warning about using lambdas as functions in props instances, rather than using named functions (which the examples thus far have done).

As with bound functions, anywhere that an anonymous function is present in JavaScript, it will result in a new function value being created. If, for example, we change the MessageRow class from:

public class MessageRow : PureComponent<MessageRow.Props>
{
  public MessageRow(Props props) : base(props) { }

  public override ReactElement Render()
  {
    return DOM.Div(new Attributes { ClassName = "message-row" },
      new TextInput(new TextInput.Props
      {
        Content = props.Message.Content,
        OnChange = OnContentChange
      }),
      new TextInput(new TextInput.Props
      {
        Content = props.Message.Author,
        OnChange = OnAuthorChange
      })
    );
  }

  private void OnContentChange(string newContent)
  {
    props.OnChange(props.Message.With(_ => _.Content, newContent));
  }
  private void OnAuthorChange(string newAuthor)
  {
    props.OnChange(props.Message.With(_ => _.Author, newAuthor));
  }

  public class Props
  {
    public int Key;
    public MessageEditState Message;
    public Action<MessageEditState> OnChange;
  }
}

to:

public class MessageRow : PureComponent<MessageRow.Props>
{
  public MessageRow(Props props) : base(props) { }

  public override ReactElement Render()
  {
    return DOM.Div(new Attributes { ClassName = "message-row" },
      new TextInput(new TextInput.Props
      {
        Content = props.Message.Content,
        OnChange = newContent =>
          props.OnChange(props.Message.With(_ => _.Content, newContent))
      }),
      new TextInput(new TextInput.Props
      {
        Content = props.Message.Author,
        OnChange = newAuthor =>
          props.OnChange(props.Message.With(_ => _.Author, newAuthor))
      })
    );
  }

  public class Props
  {
    public int Key;
    public MessageEditState Message;
    public Action<MessageEditState> OnChange;
  }
}

then there would be problems with the "OnChange" props values specified because each new lambda - eg..

OnChange = newContent =>
  props.OnChange(props.Message.With(_ => _.Content, newContent))

would result in a new JavaScript function being passed to Bridge.fn.bind every time that it was called:

onChange: Bridge.fn.bind(this, function (newContent) {
  this.getprops().onChange(
    ProductiveRage.Immutable.ImmutabilityHelpers.$with(
       this.getprops().message,
      function (_) { return _.getContent(); },
      newContent
    )
  );
})

And this would prevent the PureComponent's "ShouldComponentUpdate" logic from being effective, since the $method values from the current props "OnChange" and the nextProps "OnChange" bound functions would always be different.

I was quite disappointed when I realised this and was considering trying to come up with some sort of workaround - maybe calling "toString" on both $method values and comparing their implementations.. but I couldn't find definitive information about the performance implications of this and I wasn't looking forward to constructing my own suite of tests to investigate any potential performance impact of this across different browsers and different browser versions.

My disappointment was two-fold: firstly, using the lambdas allows for more succinct code and less syntactic noise - since the types of the lambda's argument(s) and return value (if any) are inferred, rather than having to be explicitly typed out.

newContent => props.OnChange(props.Message.With(_ => _.Content, newContent))

is clearly shorter than

private void OnContentChange(string newContent)
{
  props.OnChange(props.Message.With(_ => _.Content, newContent));
}

The other reason that I was deflated upon realising this was that it meant that the "ShouldComponentUpdate" implementation would, essentially, silently fail for components that used lambdas - "ShouldComponentUpdate" would return true in cases where I would like it to return false. There would be no compiler error and the UI code would still function, but it wouldn't be as efficient as it could be (the Virtual DOM would have to do more work than necessary).

Instead, I had a bit of a crazy thought.. lambdas like this, that only need to access their own arguments and the "this" reference, could be "lifted" into named functions quite easily. Essentially, I'm doing this manually by writing methods such as "OnContentChange". But could the Bridge translator do something like this automatically - take those C# lambdas and convert them into named functions in JavaScript? That way, I would get the benefit of the succinct lambda format in C# and the PureComponent optimisations would work.

Well, once again the Bridge.NET Team came through for me! I raised a Feature Request about this, explained what I'd like in an ideal world (and why) and five days later there was a branch on GitHub where I could preview changes that did precisely what I wanted!

This is not just an example of fantastic support from the Bridge Team, it is also, I believe, an incredible feature for Bridge and a triumph for writing front-end code in C#! Having this "translation step" from C# to JavaScript provides the opportunity for handy features to be included for free - earlier we saw how the insertion of Bridge.fn.bind calls by the translator meant that we had access to $method and $scope metadata (which side-steps one of the problems that were had by the author of Performance Engineering with React) but, here, the translation step can remove the performance overhead that anonymous functions were going to cause for our "ShouldComponentUpdate" implementation, without there being any burden on the developer writing the C# code.

It's also worth considering the fact that every allocation made in JavaScript is a reference that needs to be tidied up by the browser's garbage collector at some point. A big reason why judicious use of "ShouldComponentUpdate" can make UIs faster is that there is less work for the Virtual DOM to do, but it also eases the load on the garbage collector because none of the memory allocations need to be made for child components of components that do not need to be re-rendered. Since anonymous JavaScript functions are created over and over again (every time that the section of code that declares the anonymous function is executed), lifting them into named functions means that there will be fewer allocations in your SPA and hence even less work for the garbage collector to do.

Note: As of the 11th of February 2016, this Bridge.NET improvement has not yet been made live - but their release cycles tend to be fairly short and so I don't imagine that it will be very long until it is included in an official release. If you were desperate to write any code with PureComponent before then, you could either avoid lambdas in your C# code or you could use lambdas now, knowing that the PureComponent won't be giving you the full benefit immediately - but that you WILL get the full benefit when the Bridge Team release the update.

So it's an unequivocable success then??

Well, until it transpired that the Bridge translator would be altered to convert these sorts of lambdas into named functions, I was going to say "this is good, but..". However, with that change in sight, I'm just going to say outright "yes, and I'm going to change all classes that derive from StatelessComponent in my projects to derive from PureComponent". This will work fine, so long as your props references are all immutable (meaning that they are immutable all the way down - you shouldn't have, say, a props property that is an immutable NonNullList of references, but where those references have mutable properties).

And, if you're not using immutable props types - sort yourself out! While a component is being rendered (according to the Facebook React Tutorial):

props are immutable: they are passed from the parent and are "owned" by the parent

So, rather than having props only be immutable during component renders (by a convention that the React library enforces), why not go whole-hog and use fully immutable classes to describe your props types - that way props are fully immutable and you can use the Bridge.React's PureComponent to get performance boosts for free!

(Now seems like a good time to remind you of my post "Friction-less immutable objects in Bridge (C# / JavaScript) applications", which illustrates how to use the ProductiveRage.Immutable NuGet package to make defining immutable classes just that bit easier).

Posted at 20:11

Tags:

Comments

3 February 2016

Using Roslyn to identify unused and undeclared variables in VBScript WSC components

(Note: I had intended to keep this aside for April Fools since it's intended to be a bit tongue-in-cheek and really just an excuse to play with technology for technology's sake.. but I haven't got many other posts that I'm working on at the moment so I'm going to just unleash this now, rather than waiting!)

Imagine that you maintain a project which was migrated over time from an old and fragile platform to a new and improved (C#) code base. But there are various complicated external components that have been left untouched since they were (mostly) working and the new code could continue to use them - allowing the valuable rewriting time to be spent elsewhere, on less compartmentalised areas of code.

For some projects, these could be C++ COM components - I'm no expert on C++, but since people are still writing a lot of code in it and there are powerful IDEs to support this (such as Visual Studio), I presume that maintaining these sorts of components is possibly a little annoying (because COM) but not the worst thing in the world. For other projects, though, these could be "Windows Scripting Components" - these are basically COM components that are written in scripting languages, such as VBScript. They look something like the following:

<?xml version="1.0" ?>
<?component error="false" debug="false" ?>
<package>
  <component id="ExampleComponent">
    <registration progid="ExampleComponent" description="Example Component" version="1" />
    <public>
      <method name="DoSomething" />
    </public>
    <script language="VBScript">
    <![CDATA[

      Function DoSomething(ByVal objOutput)
        Dim intIndex: For intIndex = 1 To 5
          objOutput.WriteLine "Entry " & iIndex
        Next
      End Function

    ]]>
    </script>
  </component>
</package>

Creating "Classic ASP" web projects using these components had the advantage that interfaces between components could be limited and documented, enabling a semblance of organisation to be brought to bear on large solutions.. but "Classic ASP" and VBScript are technologies that, by this point, should have long since been put to bed. They do not have good IDE support or debugging tools (nor do they perform well, nor is it easy to hire good people to work on your solutions that contain code in this language).

If you have components that work and that will never be needed to change, then maybe that's no big deal. Or maybe there is something in the migration plan that says that legacy components that work (and do not require adapting or extending) will be left as-is and any components that need work will be rewritten.

If this is the case, then it's easy enough to use these working components from C# -

var filename = "ExampleComponent.wsc";
dynamic component = Microsoft.VisualBasic.Interaction.GetObject(
  "script:" + new FileInfo(filename).FullName,
  null
);
component.DoSomething(new ConsoleWriter());

Note: In order for the above code to run with the WSC presented further up, the C# code needs to provide a ComVisible "objOutput" reference which has a "WriteLine" method that takes a single (string) argument. The snippet above uses a ConsoleWriter class, which could be implemented as follows:

[ComVisible(true)]
public class ConsoleWriter
{
  public void WriteLine(string value)
  {
    Console.WriteLine(value);
  }
}

But what if there isn't an agreement to rewrite any WSCs that need work and what if there are some that need bug-fixing or new functionality? Well, good luck! Error messages from these components tend to be vague and - just to really add a little extra joy to your life - they don't include line numbers. Oh, "Object expected"? Great.. will you tell me where? No. Oh, good.

If you were so intrigued by what I've written here so far that you've actually been playing along and have saved the WSC content from the top of this post into a file and executed it using the C# above, you might have noticed another problem when you ran it. Below is what is output to the console:

Entry

But, since the VBScript is performing a simple loop and writing a message that includes that loop variable in it, shouldn't it be this instead??

Entry 1

Entry 2

Entry 3

Entry 4

Entry 5

Identifying unused and undeclared variables with the VBScriptTranslator and Roslyn

Well, I do have a glimmer of hope for the problem above and, potentially, for other VBScript-writing pitfalls.

What we could do is process WSC files to -

Extract VBScript section(s) from them
Run the VBScript content through the VBScriptTranslator to generate C#
Parse and build the resulting C# using Roslyn
Use information gleaned from steps 2 and 3 to identify errors that might otherwise not be apparent before runtime

The packages we want are available through NuGet -

Before I go through these steps, let me just explain briefly what the problem was in the VBScript sample code shown further up - just in case you're not familiar with VBScript or didn't spot it.

The loop variable in the code

Dim intIndex: For intIndex = 1 To 5
  objOutput.WriteLine "Entry " & iIndex
Next

is named "intIndex" but the line that writes out the text refers to "iIndex", which is an undeclared variable.

In C#, if we tried to do something similar then the compiler would bring it immediately to our attention - eg.

for (var i = 1; i <= 5; i++)
  Console.WriteLine("Entry " + j);

Presuming that "j" was not defined elsewhere within the scope of the above code, we would be informed that

The name 'j' does not exist in the current context

But VBScript doesn't care about this, declaring variables (such as with the use of "Dim intIndex") is generally optional. The "iIndex" value in the code above is never defined, which means it gets the special VBScript "Empty" value, which is treated as an empty string when introduced into a string concatenation operation.

VBScript does support a mode that requires that variables be declared before they are referenced; "Option Explicit". If we changed the code to the following:

Option Explicit

Dim intIndex: For intIndex = 1 To 5
  objOutput.WriteLine "Entry " & iIndex
Next

then we would get an error at runtime:

Variable is undefined: 'iIndex'

Which seems much better, but there's one big gotcha to "Option Explicit" - it is not enforced when the VBScript code is parsed, it is only enforced as the code is executed. This means that enabling Option Explicit and having a script run successfully does not mean that it contains no undeclared variables, it only means that the code path that just ran contained no undeclared variables.

To illustrate, the following script will run successfully except on Saturdays -

Option Explicit

Dim intIndex: For intIndex = 1 To 5
  If IsSaturday() Then
    objOutput.WriteLine "Entry " & iIndex
  Else
    objOutput.WriteLine "Entry " & intIndex
  End If
Next

Function IsSaturday()
  IsSaturday = WeekDay(Now()) = 7
End Function

This is a pity. I think that it would have been much better for Option Explicit to have been enforced when the script was loaded. But that ship has loooooong since sailed.

So, instead of crying about spilt milk, let's look at something positive. We've got a four step plan to crack on with!

1. Extracting VBScript content from a WSC

This is the most boring step and so I'll try not to get bogged down too much here. A WSC file is xml content and we want to identify CDATA content sections within "script" tags that have a "language" attribute with the value "VBScript".

The below is some rough-and-ready code, taken from a project that I wrote years ago, dusted off to reuse here -

private static IEnumerable<Tuple<string, int>> GetVBScriptSections(string wscContent)
{
  var document = new XPathDocument(new StringReader(wscContent));
  var nav = document.CreateNavigator();
  if (nav.HasChildren && nav.MoveToFirstChild())
  {
    while (true)
    {
      foreach (var scriptSection in TryToGetVBScriptContentFromNode(nav))
        yield return scriptSection;
      if (!nav.MoveToNext())
        break;
    }
  }
}

private static IEnumerable<Tuple<string, int>> TryToGetVBScriptContentFromNode(XPathNavigator nav)
{
  if (nav.NodeType == XPathNodeType.Text)
  {
    var navParent = nav.Clone();
    navParent.MoveToParent();
    if (navParent.Name.Equals("script", StringComparison.OrdinalIgnoreCase)
    && DoesNodeHaveVBScriptLanguageAttribute(navParent))
      yield return Tuple.Create(nav.Value, ((IXmlLineInfo)nav).LineNumber - 1);
  }
  if (nav.HasChildren)
  {
    var navChildren = nav.Clone();
    if (navChildren.MoveToFirstChild())
    {
      while (true)
      {
        foreach (var scriptSection in TryToGetVBScriptContentFromNode(navChildren))
          yield return scriptSection;
        if (!navChildren.MoveToNext())
          break;
      }
    }
  }
}

private static bool DoesNodeHaveVBScriptLanguageAttribute(XPathNavigator node)
{
  node = node.Clone();
  if (!node.HasAttributes || !node.MoveToFirstAttribute())
    return false;

  while (true)
  {
    if (node.Name.Equals("language", StringComparison.OrdinalIgnoreCase)
    && node.Value.Equals("VBScript", StringComparison.OrdinalIgnoreCase))
      return true;
    if (!node.MoveToNextAttribute())
      return false;
  }
}

The "GetVBScriptSections" function will return a set of Tuples - pairs of values where the first value is the VBScript content and the second value is the line index that the content starts at in the WSC. It returns a set, rather than a single Tuple, since it is valid for WSC files to contain multiple script tags.

The source line index will be important for identifying where in the WSC that any warnings we generate later originate.

2. Translate the VBScript sections

Now that we've got VBScript content, let's translate it into C#!

After the VBScriptTranslator NuGet package is installed, the following code may be written -

foreach (var vbscriptCodeSection in GetVBScriptSections(wscContent))
{
  // When translating the VBScript, add in new lines before the content so
  // that the lines indexes in the only-VBScript content match the line
  // indexes in the WSC
  var lineIndexInSourceFile = vbscriptCodeSection.Item2;
  var blankLinesToInject = string.Join(
    "",
    Enumerable.Repeat(Environment.NewLine, lineIndexInSourceFile)
  );

  var vbscriptContent = vbscriptCodeSection.Item1;
  var translatedStatements = DefaultTranslator.Translate(
    blankLinesToInject + vbscriptContent,
    externalDependencies: new string[0],
    warningLogger: message =>
    {
      if (message.StartsWith("Undeclared variable:"))
        Console.WriteLine(message);
    }
  );

This actually goes a long way to identifying my original problem - in order for the VBScriptTranslator to do its thing, it needs to identify any undeclared variables (because it will have to create explicitly declared variables in the resulting C# code). When it encounters an undeclared variable, it will log a warning message - the code above writes to the console any warnings about undeclared variables.

Running the above against the content at the top of this post results in the following being written out:

Undeclared variable: "iIndex" (line 14)

Success! Line 14 is, indeed, the line where an undeclared variable "iIndex" was accessed.

Now that we have a C# interpretation of the source code, though, it seems like we should be able to do more by bringing the impressive array of C# analysis tools that are now available to bear (ie. Roslyn aka "Microsoft.CodeAnalysis").

Imagine if the original VBScript content was something more like this -

Function DoSomething(ByVal objOutput)
  Dim intIndex, strName

  ' .. loads of code

  For intIndex = 1 To 5
    objOutput.Write "Entry " & iIndex
  Next

  ' .. loads more code

End Function

Those legacy VBScript writers sure did love their huge functions with 100s of lines of code! So the "loads of code" sections above really could be loads of code.

One day, someone has to change this long, long function a little bit and thinks that they've removed the only use of the "strName" variable from the function. But it's hard to be sure since the function is so long and it's got conditions nested so deeply that it's headache-inducing. The Boy Scout Rule makes it seem attractive to remove the "strName" declaration if it's no longer used.. the problem is that this someone is not utterly, 100% confident that it's safe to remove. And it's not like they could just remove the variable declaration then re-run and rely on Option Explicit to inform them if the variable is still used somewhere (for the reason outlined earlier).

One way to obtain confidence as to whether a variable is used or not is to continue to the next step..

3. Build the generated C# using Roslyn

Adding the Microsoft.CodeAnalysis.CSharp NuGet package allows us to write:

private static IEnumerable<Tuple<string, int>> GetUnusedVariables(string translatedContent)
{
  // Inspired by code from www.tugberkugurlu.com (see http://goo.gl/HYT8eo)
  var syntaxTree = CSharpSyntaxTree.ParseText(translatedContent);
  var compilation = CSharpCompilation.Create(
    assemblyName: "VBScriptTranslatedContent",
    syntaxTrees: new[] { syntaxTree },
    references:
      new[]
      {
        // VBScriptTranslator content requires System, System.Collections, System.Runtime
        // and one of its own libraries to run. To identify these assemblies, one type
        // from each is identified, then its Assembly location is used to create the
        // MetadataReferences that we need here
        typeof(object),
        typeof(List<string>),
        typeof(ComVisibleAttribute),
        typeof(DefaultRuntimeSupportClassFactory),
      }
      .Select(type => MetadataReference.CreateFromFile(type.Assembly.Location)),
    options: new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary)
  );
  EmitResult result;
  using (var ms = new MemoryStream())
  {
    result = compilation.Emit(ms);
  }
  if (!result.Success)
  {
    var errorMessages = result.Diagnostics
      .Where(diagnostic =>
        diagnostic.IsWarningAsError || (diagnostic.Severity == DiagnosticSeverity.Error)
      )
      .Select(diagnostic => $"{diagnostic.Id}: {diagnostic.GetMessage()}");
    throw new Exception(
      "Compilation of generated C# code failed: " + string.Join(", ", errorMessages)
    );
  }
  return result.Diagnostics
    .Where(diagnostic => diagnostic.Id == "CS0219")
    .Select(diagnostic => Tuple.Create(
      diagnostic.GetMessage(),
      diagnostic.Location.GetLineSpan().StartLinePosition.Line
    ));
}

This will take the VBScriptTranslator-generated C# code and return information about any unused variables; a set of Tuples where each pair of values is a message about an unused variable and the line index of this variable's declaration.

We'll use this information in the final step..

4. Use information gleaned from steps 2 and 3 to identify errors that might otherwise not be apparent before runtime

In the VBScriptTranslator-calling code from step 2, we got a list of translated statements. Each of these represents a single line of C# code and has the properties "Content", "IndentationDepth" and "LineIndexOfStatementStartInSource". If we so desired, we could use the "Content" and "IndentationDepth" properties to print to the console the generated C# in a nicely-indented format.

But that's not important right now, what we really want are two things; a single string for the entirety of the generated C# content (to compile with Roslyn) and we want mappings for line index values in the C# back to line index values in the source VBScript. The C# code may have more or less lines than the VBScript (the translation process is not a simple line-to-line process), which is why these line index mappings will be important.

// Each "translatedStatements" item has a Content string and a
// LineIndexOfStatementStartInSource value (these are used to
// create a single string of C# code and to map each line in
// the C# back to a line in the VBScript)
var translatedContent = string.Join(
  Environment.NewLine,
  translatedStatements.Select(c => c.Content)
);
var lineIndexMappings = translatedStatements
  .Select((line, index) => new { Line = line, Index = index })
  .ToDictionary(
    entry => entry.Index,
    entry => entry.Line.LineIndexOfStatementStartInSource
  );

Now it's a simple case of bringing things together -

foreach (var unusedVariableWarning in GetUnusedVariables(translatedContent))
{
  var unusedVariableWarningMessage = unusedVariableWarning.Item1;
  var lineIndexInTranslatedContent = unusedVariableWarning.Item2;
  var lineIndexInSourceContent = lineIndexMappings[lineIndexInTranslatedContent];

  // Line index values are zero-based but warnings messages that refer to
  // a line generally refer to a line NUMBER, which is one-based (hence
  // the +1 operation)
  Console.WriteLine(
    $"{unusedVariableWarningMessage} (line {lineIndexInSourceContent + 1})"
  );
}

If this was run against our second WSC sample, then we would get a new warning reported:

The variable 'strname' is assigned but its value is never used (line 13)

Which is precisely what we wanted to find out - the "strName" variable is declared but never used, so it's safe for our Boy Scout Developer to remove it!

Ooooooo, I'm excited! What else could I do??

I must admit, I haven't thought too much about what other possibilities are available when some static analysis is available for VBScript code, I was just intending to mess about with Roslyn a bit. But, thinking about it, a few ideas come to mind.

As an example of the frankly terrible errors that you get when working with VBScript WSCs, if you took the WSC example from earlier and decided to refactor the FUNCTION into a SUB (in VBScript, a SUB is basically a FUNCTION that can not return a value) and you made the silly mistake of changing the function "header" but not its "terminator" - eg.

Sub DoSomething(ByVal objOutput)
  Dim intIndex: For intIndex = 1 To 5
    objOutput.Write "Entry " & iIndex
  Next
End Function

Then you would get a particularly unhelpful error when trying to load the WSC into the .net runtime -

Cannot create ActiveX component.

The problem is that the "END FUNCTION" should have been changed "END SUB", since the first VBScript line has had the keyword "FUNCTION" changed to "SUB". It would seem that the VBScript interpreter would have plenty of information available to it that would allow it to raise a more descriptive error. However, it chooses not to.

If this WSC content was run through the VBScriptTranslator, though, then an exception with the following error message would be raised:

Encountered must-handle keyword in statement content, this should have been handled by a previous AbstractBlockHandler: "End", line 16 (this often indicates a mismatched block terminator, such as an END SUB when an END FUNCTION was expected)

Ok.. I'll admit that this is not the friendliest error message ever formed. What exactly is a "must-handle keyword"? What is an "AbstractBlockHandler"?? But the good thing is that a line number is included along with a reference to an "END" token - and this hopefully is enough to point you at where the problem is.

Another idea that springs to mind is to try to identify functions that have inconsistent return types, in terms of whether they are value types or object references. In VBScript, you must be aware of this distinction at all times - if calling a function that you expect to return an object, then you need to write the function call using the "SET" keyword - eg.

Set objPrice = GetPriceDetails(order)

But if you expect it to return a value type, then you would write it as

sngPrice = GetPriceDetails(order)

VBScript has a special kind of null that represents an object with no value; "Nothing". This allows you to write functions that will always return an object reference, but that may return a reference that means "no result" - eg.

Function GetPriceDetails(ByVal x)
  If IsObject(x) Then
    Set GetPriceDetails = x.PriceDetails
    Exit Function
  End If
  Set GetPriceDetails = Nothing
End Function

However, I've seen code that forgets this and returns a value type "Null" instead - eg.

Function GetPriceDetails(ByVal x)
  If IsObject(x) Then
    Set GetPriceDetails = x.PriceDetails
    Exit Function
  End If
  GetPriceDetails = Null
End Function

Now, when calling GetPriceDetails, you will get an object reference sometimes and a value type other times. How do you know whether to use "SET" when calling it if you don't know whether you are expecting an object reference or a value type back? Answer: You don't. Most likely whoever wrote the code used "SET" because they tested the "happy case" (which returns an object reference) and forgot to test the less-happy case, which returned a "Null" value type (and that would fail at runtime if called with use of "SET").

Well, this is something else that the VBScriptTranslator can help with. Instead of using the DefaultTranslator's "Translate" method, we can use its "Parse" method. This will return a syntax tree describing the source code. By examining this data, we can identify cases, like the one above, which are almost certainly mistakes.

Below is a complete example. I won't go too deeply into the details, since that would send me even further off track than I am now!

static void Main(string[] args)
{
  var scriptContent = @"
    Function GetPriceDetails(ByVal x)
      If IsObject(x) Then
        Set GetPriceDetails = x.Price
        Exit Function
      End If
      GetPriceDetails = Null
    End Function";

  // Note: An "AbstractFunctionBlock" is a Function, a Sub, or a Property - they are
  // all variations on a theme
  var codeBlocks = DefaultTranslator.Parse(scriptContent);
  foreach (var function in GetAllCodeBlocks(codeBlocks).OfType<AbstractFunctionBlock>())
  {
    var returnValueSetters = GetAllCodeBlocks(function.Statements)
      .OfType<ValueSettingStatement>()
      .Where(ValueSetterTargetIs(function.Name));
    var valueTypeReturnValueSetterLineNumbers = returnValueSetters
      .Where(v => v.ValueSetType == ValueSettingStatement.ValueSetTypeOptions.Let)
      .Select(v => v.ValueToSet.Tokens.First().LineIndex + 1)
      .Distinct();
    var objectReturnValueSetterLineNumbers = returnValueSetters
      .Where(v => v.ValueSetType == ValueSettingStatement.ValueSetTypeOptions.Set)
      .Select(v => v.ValueToSet.Tokens.First().LineIndex + 1)
      .Distinct();
    if (valueTypeReturnValueSetterLineNumbers.Any()
    && objectReturnValueSetterLineNumbers.Any())
    {
      Console.WriteLine(
        "{0} \"{1}\" has both LET (lines {2}) and SET (lines {3}) return values",
        function.GetType().Name,
        function.Name.Content,
        string.Join(", ", valueTypeReturnValueSetterLineNumbers),
        string.Join(", ", objectReturnValueSetterLineNumbers)
      );
    }
  }
  Console.ReadLine();
}

private static IEnumerable<ICodeBlock> GetAllCodeBlocks(IEnumerable<ICodeBlock> blocks)
{
  foreach (var block in blocks)
  {
    yield return block;

    var parentBlock = codeBlock as IHaveNestedContent;
    if (parentBlock != null)
    {
      foreach (var nestedBlock in GetAllCodeBlocks(parentBlock.AllExecutableBlocks))
        yield return nestedBlock;
    }
  }
}

private static Func<ValueSettingStatement, bool> ValueSetterTargetIs(NameToken target)
{
  return valueSetter =>
  {
    if (valueSetter.ValueToSet.Tokens.Count() > 1)
      return false;
    var valueSetterTarget = valueSetter.ValueToSet.Tokens.Single();
    return
      (valueSetterTarget is NameToken) &&
      valueSetterTarget.Content.Equals(target.Content, StringComparison.OrdinalIgnoreCase);
  };
}

This will write out the warning

FunctionBlock "GetPriceDetails" has both LET (lines 7) and SET (lines 4) return value setters

Hurrah! Very helpful! No more waiting for run time execution to find out that some code paths return object references and some return value types!

Static analysis is very valuable. It's one of the reasons why I like C# so much because there is a lot of power in static analysis - and I'm always looking out for ways to leverage it further, such as more strongly-typed classes (should a phone number really be a string or should it be a "PhoneNumber" class?) and technologies such as code contracts (which I've been meaning to look back into for about a year now.. must stop making excuses).

But there's one other thing that could be done with VBScript WSCs and the VBScriptTranslator - instead of just translating the code to analyse it, it could be translated into C# and then executed as C#! This way the (very expensive) COM boundary would be removed between the .net hosting environment and the old legacy component. And the translated code will execute more quickly than VBScript. Double-win!

The output from a "DefaultTranslator.Translate" call is content that may be saved into a file that will then define a class called "TranslatedProgram" (this string content is what we were earlier pushing through Roslyn for further analysis). This may be executed using a runtime library included in the VBScriptTranslator NuGet package (or that is available on its own, in the VBScriptTranslator.RuntimeSupport NuGet package) with the following code -

// The "compatLayer" provides implementations of VBScript functions (like "CInt")
// to the translated code, along with functions such as "CALL", which enable late-
// bound method calls to be executed (which are then compiled into LINQ expressions
// and cached so that subsequent calls are close in performance to hand-written C#)
using (var compatLayer = DefaultRuntimeSupportClassFactory.Get())
{
  // The Runner's "Go" function returns a new instance of the translated
  // component. The "DoSomething" method from the component may then be
  // called. Translated names are all lower-cased, it makes the mismatch
  // between VBScript's case insensitivity and C#'s case SENSITIVITY
  // less important.
  var component = new TranslatedProgram.Runner(compatLayer).Go();
  component.dosomething(new ConsoleWriter());
}

So.. not actually that much Roslyn then?

Sticklers for accuracy may note, at this point, that there hasn't actually been that much use of Roslyn in a post that features that word in its title. Well.. yes, that is fair enough.

But, then, this entire post was only intended to be a slightly silly foray into "just because I can.." that included a detour through Roslyn. Let's not take things too seriously, though - I mean, really, who is still even using VBScript in any serious production applications these days??

Posted at 23:41