Productive Rage

19 June 2013

Parsing CSS

A few months ago I wrote about some extensions to the CSS Minifier to support pseudo-Source-Mapping for compiled and minified content (among other things) and I've been meaning to write about the code I used to analyse the style sheet content.

History Lesson

A long time ago I wrote some code to parse javascript to remove comments and minify the content, this was before there were a proliferation of excellent plugins and such like to do it all for you - I think the YUI Compressor might have been around but since it required java to be installed where it would be used, we couldn't use it to compile scripts on-the-fly.

The first pass through the content would break it down into strings representing javascript code, javascript strings and comments. Strings can be quoted with either single or double quotes, single-quote-wrapped strings could contain double quotes without escaping them and vice versa, either string format could contain their own quotes so long as they were formatted. Comments could be multi-line if they were wrapped in /* and */ or single line if they started with // (terminating with a line return or end-of-file). So similar to CSS in a lot of ways! (Particularly if you consider parsing LESS which supports single line comments, unlike regular CSS).

I wrote it in a fairly naive manner, trying to handle each case at a time, building up a primary loop which went through each character, deciding what to do with it based upon what character it was and whether the current content was a string (and what would indicate the end of the string), a comment (and what would terminate that) or javascript code. There were various variables to keep track of these items of state. It did the job but I was keen not to repeat the same approach when writing this "CSS Parser" I wanted.

Employing Immutability (what a surprise!)

Keeping track of the changing state in this way meant that at any point in time there was a lot to hold in my head while I was trying to understand what was going on if something appeared to be misbehaving and made each change to add new functionality increasingly difficult. But reducing the places where state change is a large part of the immutability obsession I've got going on so I figured there must be a better way.

The idea was to start with two interfaces

public interface IProcessCharacters
{
  CharacterProcessorResult Process(IWalkThroughStrings stringNavigator);
}

public interface IWalkThroughStrings
{
  char? CurrentCharacter { get; }
  IWalkThroughStrings Next { get; }
}

with corresponding class and enum

public class CharacterProcessorResult
{
  public CharacterProcessorResult(
    CharacterCategorisationOptions characterCategorisation,
    IProcessCharacters nextProcessor)
  {
    if (!Enum.IsDefined(typeof(CharacterCategorisationOptions), characterCategorisation))
      throw new ArgumentOutOfRangeException("characterCategorisation");
    if (nextProcessor == null)
      throw new ArgumentNullException("nextProcessor");

    CharacterCategorisation = characterCategorisation;
    NextProcessor = nextProcessor;
  }

  public CharacterCategorisationOptions CharacterCategorisation { get; private set; }
  public IProcessCharacters NextProcessor { get; private set; }
}

public enum CharacterCategorisationOptions
{
  Comment,
  CloseBrace,
  OpenBrace,
  SemiColon,
  SelectorOrStyleProperty,
  StylePropertyColon,
  Value,
  Whitespace
}

such that a given string can be traversed character-by-character with a processor returning the type of that character and providing a processor appropriate to the next character.

The clever part being each processor will have very tightly-scoped behaviour and responsibility. For example, if a string is encountered that starts with double quotes then a processor whose entire job is string-handling would be used. This processor would know what quote character would terminate the string and that processing should go back to the previous processor when the string has terminated. All characters encountered within the string would be identified as the same type (generally this will be of type Value since strings are most commonly used in style properties - eg. a url string as part of a background property - so if a semi-colon is encountered it would be identified as type Value despite a semi-colon having more significant meaning when not part of a string value). Handling escape characters becomes very simple if a skip-characters processor is used; when a backslash is encountered, the quoted-section processor hands off to a processor that returns a fixed type for the next character and then returns control back to the quoted-section processor. This means that the quoted-section processor doesn't need to maintain any state such as even-if-the-next-character-is-the-terminating-quote-character-do-not-terminate-the-string-yet-as-it-is-being-escaped.

Comment sections can be handled in a very similar manner, with different processors for multiline comments than single line since the termination manners are different (and this helps keep things really easy).

There is a primary processor which is a bit meatier than I'd like (but still only 320-odd commented lines) that looks out for the start of string or comments and hands off processing appropriately, but also identifies single signficant characters such as opening or closing braces, colons (usually indicating the a separator between a style property name and its value but sometimes a pseudo-class indicator - eg. in "a:hover") and semi-colons.

Parsing is made more challenging as I wanted to support LESS which allows for nesting of rules whereas the only nesting that regular CSS supports is selectors within media queries. CSS 2.1 only allows for a single media query to wrap a selector while CSS 3 may support nesting media rules - see this answer on Stack Overflow: Nesting @media rules in CSS.

As a bit of a cop-out, I don't differentiate between a selector and a property name in the CharacterCategorisationOptions enum, they are both rolled into the value SelectorOrStyleProperty (similarly, media query content is classified as a SelectorOrStyleProperty). While this feels lazy on the one hand, on the other I wanted to make this pass through the content as cheap and clean as possible and accurately determining whether a given character is a selector or a property name could involve significant reading back and forth through the content to find out for sure.

This way, not only is the implementation easier to follow but it enables the main loop to parse only as much content as required to enumerate as far through the content as the caller requires.

To explain what I mean, I need to introduce the class that wraps IProcessCharacters and IWalkThroughStrings -

public interface ICollectStringsOfProcessedCharacters
{
  IEnumerable<CategorisedCharacterString> GetStrings(
    IWalkThroughStrings contentWalker,
    IProcessCharacters contentProcessor
  );
}

and its return type..

public class CategorisedCharacterString
{
  public CategorisedCharacterString(
    string value,
    int indexInSource,
    CharacterCategorisationOptions characterCategorisation)
  {
    if (string.IsNullOrEmpty(value))
      throw new ArgumentException("Null/blank value specified");
    if (indexInSource < 0)
      throw new ArgumentOutOfRangeException("indexInSource", "must be zero or greater");
    if (!Enum.IsDefined(typeof(CharacterCategorisationOptions), characterCategorisation))
      throw new ArgumentOutOfRangeException("characterCategorisation");

    Value = value;
    IndexInSource = indexInSource;
    CharacterCategorisation = characterCategorisation;
  }

  public string Value { get; private set; }

  public int IndexInSource { get; private set; }

  public CharacterCategorisationOptions CharacterCategorisation { get; private set; }
}

The default ICollectStringsOfProcessedCharacters implementation will traverse through the IWalkThroughStrings content and group together characters of the same CharacterCategorisationOptions into a single CategorisedCharacterString, using yield return to return the values.

This means that

/* Test */ .Content { color: black; }

would return content identified as

"/* Test */"     Comment
" "              Whitespace
".Content"       SelectorOrStyleProperty
" "              Whitespace
"{"              OpenBrace
" "              Whitespace
"color"          SelectorOrStyleProperty
":"              StylePropertyColon
" "              Whitespace
"black"          Value
";"              SemiColon
" "              Whitespace
"}"              CloseBrace

But if the enumeration of the data returned from the GetStrings method stopped after the ".Content" string was returned then no more parsing of the CSS would be carried out. If accurate differentiation of selectors, media queries and style property names was required at this point then a lot more parsing may be required to ensure that that string (".Content") was indeed a selector.

Another benefit arises if a large amount of content is to be parsed; an IWalkThroughStrings implementation that wraps a TextReader may be used so the content could be loaded from disk in chunks and as much or as little parsed as desired, using relatively few resources.

No Read-ahead at all??

Having just jabbered on about how amazing it is that this SelectorOrStyleProperty categorisation requires absolutely zero reading ahead in order to categorise any given character (so long as all of the preceeding characters have been parsed), there are a couple of exceptions to this rue:

When a @media rule is encountered, all of the following content needs to be considered to be either Whitespace or SelectorOrStyleProperty until the opening brace for the rule is encountered, since the rule may contain otherwise-significant characters such as colon (eg. the ":" in "@media (min-width:500px)" is part of the media query and does not signify the separator symbol between a property name and a property value), so when a "@" is encountered, the next characters are read to determine whether it's a media rule or not
A colon in a pseudo class should also not be identified as a StylePropertyColon, it should be considered part of the SelectorOrStyleProperty, so if a colon is encountered while processing what is thought to be a selector then some reading ahead is done to try to determine whether the content indicates that it is indeed a psuedo-class selector and not a separator between a property name and its value

To make this easier, the IWalkThroughStrings interface has an additional method

/// <summary>
/// This will try to extract a string of length requiredNumberOfCharacters from the current
/// position in the string navigator. If there are insufficient characters available, then
/// a string containing all of the remaining characters will be returned. This will be an
/// empty string if there is no more content to deliver. This will never return null.
/// </summary>
string TryToGetCharacterString(int requiredNumberOfCharacters);

I contemplated making this an extension method since the data can always be retrieved using the CurrentCharacter and Next properties, but depending upon the implementation there may be more efficient ways to retrieve the data and so it became an interface method.

An original idea?

I'm really happy with the way this approach to the problem has influenced the final design. There were a few issues that I hadn't foreseen when I started (the complications with pseudo classes giving different meaning to the colon character, for example, as outlined above, had somehow slipped my mind entirely when I got going) but extending it to cover these cases wasn't particularly difficult as keeping all of the complicated bits as segregated as possible made it easy to reason about where changes needed to be made and whether they could have any unpleasant side effects.

I don't think I can take credit for the originality of the idea, though. The overarching plan is to have a processor instance which is posed to start processing content, at this point it has produced no results and is in an uninitialised state. This is the first IProcessCharacters instance. When its Process method is called, the first character from the IWalkThroughStrings is taken and a CharacterProcessorResult returned which identifies the type of that first character and specifies an IProcessCharacters instance to process the next character. That character triggered the change in state. The next call to Process might return a result with a different type of IProcessCharacters and/or a different CharacterCategorisationOptions.

The point is that for any current state, there are a finite number of states that can be moved to next (since there are a limited number of CharacterCategorisationOptions values and IProcessCharacters implementations) and a finite number of triggers for each change in state (since there are only so many possible characters, even if we do consider the huge extended alphabets available). This puts me in mind of a Finite State Machine which is a well-documented concept.. the article on Wikipedia is thorough and there's another article on learn you some Erlang for great good! which I haven't read all of, but I've heard good things about that site so intend to read that article properly before hopefully reading and following more of the tutorials on there.

Overview of processors

Just to emphasise how this approach made things easier and spread much of the logic across self-contained components, I'll spin through the processors which loop through the content, passing control back and forth as appropriate.

The first is always the SelectorOrStylePropertySegment, which is actually the one that has to deal with the most different circumstances. By default it will identify each character as being of type SelectorOrStyleProperty unless it encounters any one-offs like an OpenBrace or a SemiColon or anything that constitutes Whitespace. If it encounters the ":" character then it has to do a little reading ahead to try to determine whether that indicates that a delimiter between a Style Property Name and the Property Value or whether it's part of a pseudo class (eg. ":hover"). If it's a Property Value then it hands off to the StyleValueSegment class which walks through content, marking it as either type Value or Whitespace until it hits a ";" and returns control back to the SelectorOrStylePropertySegment.

If the StyleValueSegment encounters a quote character then it hands off control to a QuotedSegment instance which walks through the content marking it as type Value until it encounters the closing quote and returns control back to where it came from. The QuotedSegment has a constructor argument for the termination character (the closing quote) so doesn't have to do anything complicated other than wait for that character to show up!

The SelectorOrStylePropertySegment does something similar to handing off to the StyleValueSegment when it encounters an opening square bracket as that indicates the start of an attribute selector (eg. "a[href]") - control is given to a BracketedSelectorSegment which identifies all content as being type SelectorOrStyleProperty until the closing "]" character is encountered.

All three of SelectorOrStylePropertySegment, StyleValueSegment and BracketedSelectorSegment have to make exceptions for comments. When a "/" is encountered, they will look ahead to see if the next is either "/" or "" and hand off to a SingleLineCommentSegment or MultiLineCommentSegment, respectively. The first simply has to mark everything as Comment content until passing back control when a line break is encountered. The second marks content as Comment until it encounters a "" which the character after is a "/". When this "*" is encountered it hands off to a SkipCharactersSegment which marks the next character as Comment as well and then hands back to whatever handed control to the MultiLineCommentSegment. Only a single character can be identified at once, hence the use of the SkipCharactersSegment, but even this small hoop is only a small one to jump through. These three classes are very minor specialisation of a shared base class so that this logic is shared.

The QuotedSegment doesn't inherit from the same since all content should be identified as being of a particular type, comment-like content within a quoted string does not constitute an actual comment. The QuotedSegment class takes a constructor argument to indicate the type of content that it will be representing since a quoted section while processing Value content should be identified as type Value while a quoted section in SelectorOrStyleProperty content (eg. in "input[type='text']") should also be identified as type SelectorOrStyleProperty.

So essentially it all boils down to is-the-current-processor-ok-for-this-character? If yes, then continue to use it. If a condition is encountered where the processor should change (either handing control to a new processor or handing control back to a previous processor) then do that and let it continue.

When I started writing it, I somehow forgot all about attribute selectors (there's a fair argument that more planning might have been beneficial but I wanted to do it an exercise in jumping in with this approach and then hoping that the entire design would lend itself well to "changing requirements" - aka. me overlooking things!). If this had been processed in some contorted single loop full of complicated interacting conditions - like that javascript parser of my past - then adding that extra set of conditions would have filled me with dread. With this approach, it was no big deal.

The Processor Factory

There was only one thing that struck me with the idea of all of these processor instances being created left, right and centre; that there could be a lot of churn. That if there was content being processed then there could thousands of MultiLineCommentSegment instances being created, for instance, when they're nearly all to perform the same task - record comment content and pass back to the primary SelectorOrStylePropertySegment processor. If these instances could be shared then the churn could be reduced. And since each processor is immutable there is no state to worry about and so they are inherently shareable.

To achieve this, an IGenerateCharacterProcessors is passed as a constructor argument to classes that need to instantiate other processors. The simplest implementation of this is to spin up a new instance of the requested processor type, passing the provided constructor arguments. This is what the CharacterProcessorsFactory class does. But the CachingCharacterProcessorsFactory class will wrap this and keep a record of everything it's instantiated and return a previous reference if it has the same type and constructor arguments as the request specifies. This enables the reuse that I had in mind.

I will admit that there is a slight air of premature optimisation around this, worrying about churn with no evidence that it's a problem, but I intend for these processors to be used on substantial sized chunks of CSS / LESS - and when the IWalkThroughStrings interface allows for a class to be written backed onto a TextReader (as described earlier) so that only the minimum content need be held in memory at any one time, then this extra work to reuse processor instances seems to make sense.

Deeper Analysis

Ok, that explanation of how simple everything was ended up longer and quite possibly more detailed than I'd originally expected but there's one more thing I want to address!

All of the code described above really only allows for quite a simplistic representation of the data. But it paves the way for more complicated processing.

What I really needed was a way to analyse the structure of LESS content - this is all looping back to the idea of "linting" stylesheets to see if they adhere to the rules in the Non-cascading CSS Post. A simple example is being able to determine whether all content in a stylesheet (that has been identified as not being one of the Resets or Themes sheets) should have the content wrapped in a html tag which limits the scope of any declared mixins or values.

A naive way approach would be trim the raw string content and see if it starts with "html {" or some variation with whitespace, hoping that there is no comment content that needs to be ignored. A better way is to use the CSS Processor as-is and skip over any leading comment and whitespace content and look for a html tag at the start of the content. However, more work would have to be done to ensure that that html tag isn't closed and then followed with more content which may or may not be wrapped in a scope-restricting html tag.

To deal with cases like this which require "deep analysis", the "ExtendedLESSParser" project has a class, the LessCssHierarchicalParser, which takes the output from the CSSParser (a CategorisedCharacterString set) and transforms it into hierarchical data describing selectors, media queries, import statements, style property names and style property values. Selectors and media queries are containers that have child "fragments" (these could be style properties or they could be nested selectors). All mention of whitespace and comments are removed and just a representation of the raw style data remains.

// Example
html
{
  h1
  {
    color: black;
    background: white url("background.jpg") no-repeat top left;
  }
  p.Intro { padding: 8px; }
}

becomes something like

html
  h1
    color
      black
    background
      white
      url("background.jpg")
      no-repat
      top
      left
  p.Intro
    padding
      8px

(Above: "html" represent a Selector instance with a ChildFragments property containing Selector instances for the "h1" and "p", each with ChildFragments data made up of StylePropertyValue and StylePropertyValue instances. These classes implement ICSSFragment as do the Import and MediaQuery, which aren't present in the example here).

To ensure that content is wrapped in scope-restricting html tags, what must be done is that the output from the LessCssHierarchicalParser (a set of ICSSFragment implementations) must be considered and it be asserted that they are either Import instances or Selector instances whose Selectors property indicates that the selector in the source content was only "html". An implementation can be found in my NonCascadingCSSRulesEnforcer project on Bitbucket, specifically the file HtmlTagScopingMustBeAppliedToNonResetsOrThemesSheets.cs.

Unfortunately, since this level of analysis requires that the entire content be considered before the structure can be described, this is not as lightweight a process as the CSSProcessor's parsing. However, it is much more powerful in enabling you to drill down into the structure of a stylesheet. The NonCascadingCSSRulesEnforcer has code to enforce nearly all of the rules in my original Non-cascading CSS Post, along with an ITextFileLoader implementation which allows the rules validation to be integrated with my CSSMinifier project which I've been using to rebuild a real site (not just my blog) with these rules. It's been going really well and I intend to put up a concluding post to this "Non-cascading CSS" mini-series with any final insights and any evidence I can present for and against trying to apply them to all builds I'm involved with in the future.

Posted at 21:50

Tags:

Comments