Productive Rage

30 July 2013

JavaScript Compression (Putting my JSON Search Indexes on a diet)

When I wrote a couple of weeks ago about writing a way to consume Full Text Indexer data entirely on the client (so that I could recreate my blog's search functionality at productiverage.neocities.org - see The Full Text Indexer goes client-side!), I said that

Yes, the search index files are bigger than I would have liked..

This may have been somewhat of an understatement.

As a little refresher; there is one JSON file responsible for matching posts to search terms ("oh, I see you searched for 'cats'! The posts that you want are, in descending order of match quality: 26, 28, 58, 36, 27, 53, 24"). Then there is a JSON file per-post for all of the source location mappings. These describe precisely where in the posts that tokens may be matched. There's quite a lot of data to consider and it's in a fairly naive JSON structure (I used single letter property names but that was the only real attempt at size-constraining*). On top of this, there's a plain-text representation of each post so that the source location mappings can be used to generate a best match "content excerpt" for each result (à la Google's results, which highlight matched terms). And finally a JSON file for all of the titles of the posts so that the titles can be displayed when the search logic knows which posts match the query but the content excerpt generation is still being worked on in the background.

This results in 5.8MB of JSON and plain text data.

Now, in fairness, this is all gzip'd when it comes down the wire and this sort of data compresses really well. And the detail files for the posts are only requested when the corresponding posts are identified as being results for the current search. So in terms of what the user has to download, it's no big deal.

However, the hosting at neocities.org only offers 10MB at the moment so 5.8MB solely for searching seems a little excessive. To put it into perspective, this is several times more than the html required to render all of the actual posts!

I hadn't actually realised just how much of a problem it was until I published my last post. When I came to re-generate the flattened content to upload to NeoCities, that last post (being something of a beast) pushed the storage requirements past the magic 10MB mark!

* (Very minor diversion: There was actually one potential optimisation that I did consider when first serialising the index data for use on the client. Instead of a simple associative array, I wondered if using a Ternary Search Tree would reduce the space requirements if a lot of the keys had similarities. I used a TST internally in the C# project for performance reasons and talked about it in The .Net Dictionary is FAST!. Alas, when I tried it as a structure for my data, it resulted in slightly larger files).

So. Compression, yeah? In JavaScript??

The biggest contributors to size are the per-post search index files that contain the source locations data. Each is JSON data describing an associative array matching a "normalised" search term (to take into account plurality, case-insensitivity, etc) to a post key, weight and array of source locations. Each source location has the fields FieldIndex, TokenIndex, ContentIndex and ContentLength (read all about it at The Full Text Indexer: Source Locations).

I know that this data can be compressed since, not only does textual or JSON data frequently lend itself to being effectively compressed, but I can see the compression that gzip achieves when the data is delivered to the browser.

My first thought when it comes to compression is that we're dealing with binary data. Certainly, when I see gzip'd responses in Fiddler (before using the "Response is encoded.. Click here to transform" link) it looks like gobbledygook, not like any text I know!

This reminded me of something I read some time ago about a guy creating a PNG file where the pixels are generated from bytes extracted from textual content. This PNG can be read by javascript and the pixels extracted again, and from the pixel values the textual source content can be recreated. The real benefit with PNG here is that it incorporates a lossless compression scheme. Lossless compression is really important here, it means that the decompressed content will be identical to the source content. JPEG is a lossy scheme and can achieve higher compression rates if the quality is reduced sufficiently. But it loses information when it does this. The idea is that the resulting image is either acceptably close visually to the source image or that the discrepancy is evident but thought to be a worthwhile trade-off considering the file size benefits. If we're trying to extract text data that represents javascript than "close-ish, maybe" is not going to cut it!

The original article that I'd remembered was this: Compression using Canvas and PNG-embedded data. The results sound impressive, he got an older version of jQuery (1.2.3) compressed down from 53kb to 17kb. Bear in mind that this is already the minified version of the code! It's quite a short article and interesting, so give it a visit (and while you're there notice that the Mario background is interactive and can be played using the cursor keys! :)

The summary of the article, though, is that it's not suitable for mainstream use. Browser support is restricted (all modern browsers now would work, I'm sure, but I don't know how many versions of IE it would work in). And it concludes with "this is meant only as thing of interest and is not something you should use in most any real life applications, where something like gzip will outperform this".

Ok.

Glad I looked it up again, though, since it was interesting stuff.

On the same blog, there's also an article Read EXIF data with Javascript, which talks about retrieving binary data from a file and extracting the data from it. In this case, the content would have to be compressed when written and then decompressed by the javascript loading it. Unlike, the PNG approach, compression doesn't come for free. From the article, he's written the file binaryajax.js. Unfortunately, browser support apparently is still incomplete. The original plan outlined works for Chrome and Firefox but then some dirty hacks (including rendering VBScript script tags and executing those functions) are required for IE and, at the time at least, Opera wouldn't work at all.

Again, interesting stuff! But not quite what I want.

Help, Google!

So I had to fall back to asking google about javascript compression and trying not to end up in article after article about how to minify scripts.

In fairness, it didn't take too long at all until a pattern was emerging where a lot of people were talking about LZ compression (info available at the Wikipedia page). And I finally ended up here lz-string: JavaScript compression, fast!

From that page -

lz-string was designed to fulfill the need of storing large amounts of data in localStorage, specifically on mobile devices. localStorage being usually limited to 5MB, all you can compress is that much more data you can store.

What about other libraries?

All I could find was:

- some LZW implementations which gives you back arrays of numbers (terribly inefficient to store as tokens take 64bits) and don't support any character above 255.

- some other LZW implementations which gives you back a string (less terribly inefficient to store but still, all tokens take 16 bits) and don't support any character above 255.

- an LZMA implementation that is asynchronous and very slow - but hey, it's LZMA, not the implementation that is slow.

- a GZip implementation not really meant for browsers but meant for node.js, which weighted 70kb (with deflate.js and crc32.js on which it depends).

This is sounding really good (and his view of the state of the other libraries available reflects my own experiences when Googling around). And he goes on to say

localStorage can only contain JavaScript strings. Strings in JavaScript are stored internally in UTF-16, meaning every character weight 16 bits. I modified the implementation to work with a 16bit-wide token space.

Now, I'm not going to be using it with localStorage but it's gratifying to see that this guy has really understood the environment in which it's going to be used and how best to use that to his advantage.

Preliminary tests went well; I was compressing this, decompressing that, testing this, testing that. It was all going swimmingly! The only problem now was that this was a clearly a custom (and clever) implementation of the algorithm so I wouldn't be able to use anything standard on the C# side to compress the data if I wanted the javascript to be able to decompress it again. And the whole point of all of this is to "flatten" my primary blog and serialise the search index in one process, such that it can be hosted on NeoCities.

The javascript code is fairly concise, so I translated it into C#. When I'd translated C# classes from my Full Text Indexer into javascript equivalents, it had gone surprisingly painlessly. I'd basically just copied the C# code into an empty file and then removed types and tweaked things to work as javascript. So I thought I'd take a punt and try the opposite - just copy the javascript into an empty C# class and then try to fix it up. Adding appropriate types and replacing javascript methods with C# equivalents. This too seemed to go well, I was compressing some strings to text files, pulling them with ajax requests in javascript, decompressing them and viewing them. Success!

Until..

I gave it a string that didn't work. The decompress method returned null. Fail.

Troubleshooting

I figured that there's only so much that could be going wrong. If I compressed the same string with the javascript code then the javascript code could decompress it just fine. The data compressed with the C# version refused to be decompressed by the javascript, though. Chance are I made a mistake in the translation.

I got the shortest reproduce string I could (it's from the titles-of-all-posts JSON that the search facility uses) -

{"1":"I lo

and got both the C# and javascript code to print out a list of character codes that were generated when that string was compressed.

These were identical. So maybe my translation wasn't at fault.

Well something must be getting lost somewhere!

This led me on to wondering if it was somehow the encoding. The compressed content is being served up as UTF8 (basically the standard on the web) but the compression algorithm is intended to compress to UTF16. Now, surely this won't make a difference? It means that the bits sent over the wire (in UTF8) will not be the exact same bits as the javascript string (UTF16) is represented by when it's received, but these encoded bits should be describing the same character codes.

So the next step was to intercept the ajax request that the javascript client was making for the data (compressed by C# code, delivered over the wire with UTF8 encoding) and to write out the character codes at that point.

And there was a discrepancy! The character codes received were not the same that I'd generated by the C# code and that I thought I was transmitting!

Thinking still that this must somehow be encoding-related, I started playing around with the encoding options when writing the data to disk. And noticed, innocently hidden away in an alternate constructor signature, the System.Text.UTF8Encoding class has an option to "throwOnInvalidBytes". What is this?

I knew how UTF8 worked, that it uses a variable number of bytes and uses the most-signficant-bits to describe how many bytes are required for the current character (the Wikipedia article explains it nicely) and thought that that was pretty much all there was to it. So how could a byte be invalid?? Well, with this constructor argument set to true, I was getting the error

Unable to translate Unicode character \uD900 at index 6 to specified code page.

so clearly particular bytes can be invalid somehow..

UTF Limitations

With this error, it didn't actually take much searching. There's a link on www.unicode.com; Are there any 16-bit values that are invalid? that states that

Unpaired surrogates are invalid in UTF8. These include any value in the range D80016 to DBFF16 not followed by a value in the range DC0016 to DFFF16, or any value in the range DC0016 to DFFF16 not preceded by a value in the range D80016 to DBFF16

I spent a little while wandering through various searches on the internet trying to decide what the best way would be to try to address this. I didn't want to have to try to find another compressor for all of the reasons that the author of the one I'm using outlined! Which made me think, maybe there's more information about this on his site.

Lo and behold, in the changelog (at the bottom of that page), there's mention that there's a v1.3.0 available that has additional methods compressToUTF16 and decompressToUTF16 ("version 1.3.0 now stable") that "allow lz-string to produce something that you can store in localStorage on IE and Firefox".

These new methods wrap the methods "compress" and "decompress". But the "compress" and "decompress" methods in this new version of the code look different to those in the code that I had been using (and had translated). But it's no big deal to translate the newer version (and the new methods).

And now it works! I wish that this version had been the file you see when you go to the main lz-string GitHub page rather than being hidden in the "libs" folder. But considering how happy I am that the solution to my problem has been provided to me with little-to-no-effort, I'm really not going to complain! :)

Incorporating it into the solution

Step 1 was to alter my Blog-to-NeoCities Transformer to write compressed versions of the per-post source location mappings data, along with compressed versions of the plain text post data and the JSON that has the titles for each post.

The C# translation of the LZString code can be seen at: LZStringCompress.cs.

Step 2 was to alter the javascript search code to handle the compressed content. Having included the 1.3.0 version of LZString.js, I needed to replace some of the $.ajax requests with calls to one of

function LoadCompressedData(strUrl, fncSuccess, fncFailure) {
  // Note: I've seen this fail when requesting files with extension ".js" but work when the exact
  // same file is renamed to ".txt", I'm not sure if this is in IIS that it's failing or if jQuery
  // is requesting the data in a slightly different manner (despite the explicit dataType option),
  // so best just ensuring that all LZ-compressed data is stored in a file with a ".txt" extension.
  $.ajax({
    url: strUrl,
    dataType: "text",
    success: function(strCompressedContent) {
      var strContent;
      try {
        strContent = LZString.DecompressFromUTF16(strCompressedContent);
      }
      catch(e) {
        if (fncFailure) {
          fncFailure(e);
        }
        return;
      }
      fncSuccess(strContent);
    }
  });
}

function LoadCompressedJsonData(strUrl, fncSuccess, fncFailure) {
  LoadCompressedData(
    strUrl,
    function(strContent) {
      var objData;
      try {
        eval("objData = " + strContent);
      }
      catch(e) {
        if (fncFailure) {
          fncFailure(e);
        }
        return;
      }
      fncSuccess(objData);
    },
    function() {
      if (fncFailure) {
        fncFailure(arguments);
      }
    }
  );
}

Step 3 is.. there is no step 3! Everything was now working but taking up less space on the hosting.

When compressed, the detailed source location mappings data is reduced from a combined 4.7MB to 1.5MB. The plain text content was reduced from 664kb to 490kb (not as much of a saving as I'd expected, to be honest). The titles JSON shrank marginally from 2.58kb to 2.36kb. The summary data JSON wasn't compressed so that the first stage of the search could be performed as quickly as possible and one non-compressed file on the server was no problem (it's still gzip'd when passed to the client, though, so there's no bandwidth cost). In total, 5.3MB of data was condensed into requiring less than 2MB on the server. Which I am happily marking down as a win :)

So here's to me hopefully fitting many more posts (along with all of the related javascript searching data) into the NeoCities hosting limitations! I'm sure that if I ever start pushing that 10MB point again, by then the 10MB limit will have been raised - 20MB is already on the way!

Posted at 23:22

Tags:

Comments

10 July 2013

The NeoCities Challenge! aka The Full Text Indexer goes client-side!

When I heard about NeoCities last week, I thought it was a cool idea - offering some back-to-basics hosting for an outlay of absolutely zero. Yeah, the first thing that came to mind was those geocities sites that seemed to want to combine pink text, lime green backgrounds and javascript star effects that chase the cursor.. but that's just nostalgia! :)

The more I thought about it, the more I sort of wondered whether I couldn't host this blog there. I'm happy with my hosting, it's cheap and fast, but I just liked the idea of creating something simple with the raw ingredients of html, css and javascript alone. I mean, it should be easy enough to flatten the site so that all of the pages are single html files, with a different file per month, per Post, per tag.. the biggest stumbling blog was the site search which is powered by my Full Text Indexer project. Obviously that won't work when everything is on the client, when there is no server backend to power it. Unless..

.. a client-side Full Text Indexer Search Service.. ??

Challenge accepted!! :D

NeoCities Hosting

Before I get carried away, I'll breeze through the NeoCities basics and how I got most of my site running atNeoCities. I initially thought about changing the code that runs my site server-side to emit a flattened version for upload. But that started to seem more and more complicated the more I thought about it, and seemed like maintaining it could be a headache if I changed the blog over time.

So instead, I figured why not treat the published blog (published at www.productiverage.com) as a generic site and crawl it, grab content, generate new urls for a flattened structure, replace urls in the existing content with the new urls and publish it like that. How hard could it be??

Since my site is almost entirely html and css (well, LESS with some rules and structure, but it compiles down to css so why be fussy) with only a smattering of javascript, this should be easy. I can use the Html Agility Pack to parse and alter html and I very conveniently have a CSS Parser that can be used to update urls (like backgrounds) in the stylesheets.

And, in a nutshell, that's what I've done. The first version of this NeoCities-hosted blog hid the site search functionality and ran from pure html and css. I had all of the files ready to go in a local IIS site as a proof of concept. Then I went to upload them.. With the Posts and the Archive and Tags pages, there were 120-something files. I'd only used the uploader on the NeoCities page to upload a single test file up to this point and so hadn't realised that that's all it would let you do; upload a single file at a time. Even if I was willing to upload all of the files individually now (which, I must admit, I did; I was feeling overexcitable about getting the first version public! :) this wouldn't scale, I couldn't (wouldn't) do it every time I changed something - a single new Post could invalidate many Pages, considering the links between Posts, the Monthly Archive pages, the Tags pages, etc..

I spent a few minutes googling to see if anyone else could offer a sensible solution and came up empty.

So Plan B was to have a look with Fiddler to see what the upload traffic looked like. It seemed fairly straight-forward, an authorisation cookie (to identify who I was logged in as) and a "csrf_token" form value, along with the uploaded file's content. I was familiar with the phrase "Cross Site Request Forgery" (from "csrf_token") but didn't really know what it meant in this context. I thought I'd take a punt and try manipulating the request to see if that token had to vary between uploads and it didn't seem to (Fiddler lets you take a request, edit it and broadcast it - so uploading a text file provided an easy to mess-with request, I could change one character of the file and leave everything else, content-length included, the same and refresh the browser to see if the new content had arrived).

This was enough to use the .net WebRequest class and upload files individually. I wrote something to loop over the files in my local version of the site and upload them all.. and it worked! There were some stumbling blocks with getting the cookie sent and specifying the form value and the file to upload but StackOverflow came to the rescue each time. There is one outstanding issue that each upload requests received a 500 Server Error response even though it was successful, but I chose to ignore that for now - yes, this approach is rough around the edges but it's functional for now!

In case this is useful to anyone else, I've made the code available at Bitbucket: The BlogToNeocitiesTransformer.

If you plug in the auth cookie and csrf token values into the (C#) console application (obtaining those values by looking at a manual upload request in Fiddler still seems like the only way right now) then you can use it yourself, should you have need to. That app actually does the whole thing; downloads a site's content, generates a flattened version (rewriting the html and css, ensuring the urls follow NeoCities' filename restrictions) and then uploads it all to a NeoCities account.

Temporary Measures

Thankfully, this file upload situation looks to only be temporary. Kyle Drake (NeoCities' proud father) has updated the blog today with NeoCities can now handle two million web sites. Here's what we're working on next. In there he says that the file upload process is going to be improved with "things like drag-and-drop file uploading, and then with an API to allow developers to write tools to upload files", which is excellent news! He also addresses the limits on file types that may be uploaded. At the moment "ico" is not in the white list of allowed extensions and so I can't have a favicon for my blog at NeoCities. But this is soon to be replaced with a "black list" (to block executables and the like) so my favicon should soon be possible* :)

* (I half-contemplated, before this announcement, a favicon in another format - such as gif or png - but it seems that this counts out IE and I wanted a solution for all browsers).

Hopefully this black listing approach will allow me to have an RSS feed here as well - essentially that's just an xml file with an xslt to transform the content into a nice-to-view format. And since xslt is just xml I thought that it might work to have an xslt reference in the xml file that has an xml extension itself. But alas, I couldn't get it working, I just kept getting a blank screen in Chrome although "view source" showed the content was present. I'll revisit this when the file restrictions have been changed.

Update (25th July 2013): The NeoCities interface has been updated so that multiple files can be uploaded by dragging-and-dropping! I still can't upload my favicon yet though..

Update (12th August 2014): I'm not sure when, exactly, but favicons are now support (any .ico file is, in fact). It's a little thing but I do think it adds something to the identity of a site so I'm glad they changed this policy.

Site Search!

This was all well and good, but at this point the site search was missing from the site. This functionality is enabled by server-side code that takes a search string, tries to find matching Posts and then shows the results with Post titles and content excerpts with the matching term(s) highlighted. The matching is done against an index of tokens (possible words) so that the results retrieval can be very fast. The index records where in the source content it matches the token, which enables the except-highlighting. It has support for plurality matching (so it knows that "cat" and "cats" can be considered to be the same word) and has some other flexibility with ignoring letter case, ignoring accents on characters, ignoring certain characters (so "book's" is considered the same as "books") and search term splitting (so "cat posts" matches Posts with "cat" and "posts" in, rather than requiring that a Post contain the exact phrase "cat posts").

But the index is essentially a string-key dictionary onto the match data (the C# code stores it as a ternary search tree for performance but it's still basically just a dictionary). Javascript loves itself some associative arrays (all objects are associated arrays, basically bags of string-named properties) and associative arrays are basically string-key dictionaries. It seemed like the start of an idea!

If the index was still generated server-side (as it has to be for my "primary" blog hosting) then I should be able to represent this data as something that javascript could interpret and then perform the searching itself on the client..

The plan:

Get the generated IIndexData<int> from my server-side blog (it's an int since the Posts all have a unique int "Id" property)
Use the GetAllTokens and GetMatches methods to loop through each token in the data and generate JSON to represent the index, storing at this point only the Post Keys and their match Weight so that searching is possible
Do something similar to get tokens, Keys, Weights and Source Locations (where in the source content the matches were identified) for each Post, resulting in detailed match data for each individual Post
Get plain text content for each Post from the server-side data (the Posts are stored as Markdown in my source and translated into html for rendering and plain text for searching)
Write javascript classes that can use this data to perform a search and then render the results with matched terms highlighted, just like the server-side code can do
Profit!

(This is actually a slightly amended plan from the original, I tried first to generate a single JSON file with the detailed content for all Posts but it was over 4 meg uncompressed and still bigger than I wanted when delivered from NeoCities gzip'd. So I went for the file-with-summary-data-for-all-Posts and then separate files for detailed data for individual Posts).

I used JSON.Net for the serialisation (it's just the go-to for this sort of thing!) and used intermediary C# classes where each property was only a single character long to try to keep the size of the serialised data down. (There's nothing complicated about this, if more detail is of interest then the code can be found in the JsonSearchIndexDataRecorder class available on Bitbucket).

C# -> Javascript

So now I had a single JSON file for performing a search across all Posts, multiple JSON files for term-highlighting individual Posts and text files containing the content that the source locations mapped onto (also for term-highlighting). I needed to write a javascript ITokenBreaker (to use the parlance of the Full Text Indexer project) to reduce a search term into individual words (eg. "Cat posts" into "Cat" and "posts"), then an IStringNormaliser that will deal with letter casing, pluralisation and all that guff (eg. "Cat" and "posts" into "cat~" and "post~"). Then a way to take each of these words and identify Posts which contain all of the words. And finally a way to use ajax to pull in the detailed match data and plain text content to display the Post excerpts with highlighted search term matches.

To make things as snappy-feeling as possible, I wanted to display the title of matched Posts first while waiting for the ajax requests to deliver the match highlighting information - and for the content excerpts to be added in later.

The file IndexSearchGenerator.js takes a JSON index file and a search term, breaks the search term into words, "normalises" those words, identifies Posts that contain all of the normalised words and returns an array of Key, Weight and (if it was present in the data) Source Locations. It's only 264 lines of non-minified javascript and a lot of that is the mapping of accented characters to non-accented representations. (The Source Locations will not be present in the all-Posts summary JSON but will be in the per-Post detail JSON).

SearchTermHighlighter.js takes the plain text content of a Post, a maximum length for a content-excerpt to show and a set of Source Locations for matched terms and returns a string of html that shows the excerpt that best matches the terms, with those terms highlighted. And that's only 232 lines of non-minified, commented code. What I found particularly interesting with this file was that I was largely able to copy-and-paste the C# code and fudge it into javascript. There were some issues with LINQ's absence. At the start of the GetPlainTextContentAsHighlightedHtml method, for example, I need to get the max "EndIndex" of any of the highlighted terms - I did this by sorting the Source Locations data by the EndIndex property and then taking the EndIndex value of the last element of the array. Easy! The algorithm for highlighting (determining the best excerpt to take and then highlighting any terms, taking care that any overlapped terms don't cause problems) wasn't particularly complicated but it was fairly pleasant to translate it "by rote" and have it work at the end.

Finally, SearchPage.js fills in the gaps by determining whether you're on the search page and extracting (and url-decoding) the terms from the Query String if so, performing the initial search against the summary data (displaying the matched titles) and then making the ajax requests for the detailed data and rendering the match excerpts as they become available. Looping through the results, making ajax requests for detail data and handling the results for each Post when it is delivered is a bit like using a complicated asynchronous model in .net but in javascript this sort of async callback madness is parr for the course :) If this script decides that you're not on the search page then it makes an ajax request for the summary regardless so that it can be browser-cached and improve the performance of the first search you make (the entirety of the summary data is only 77k gzip'd so it's no big deal if you don't end up actually performing a search).

This last file is only 165 lines of commented, white-spaced javascript so the entire client-side implementation of the search facility is still fairly approachable and maintainable. It's effective (so long as you have javascript enabled!) and - now, bear with me if I'm just being overly impressed with my own creation - it looks cool performing the complicated search mechanics so quickly and fading in the matched excerpts! :)

Signing off

I'm really proud of this and had a lot of fun within the "NeoCities boundaries"! Yes, the source-site-grabbing-and-rewriting could be tidied up a bit. Yes, the file upload is currently a bit of a dodgy workaround. Yes, I still have a handful of TODOs about handling ajax failures in SearchPage.js. Yes, the search index files are bigger than I would have liked (significantly larger than not only the plain text content but also their full page html representations), which I may address by trying out a more complicated format rather than naive JSON (which was very easy). But it all works! And it works well. And the bits that are a bit untidy are only a bit untidy, on the whole they're robust and I'm sufficiently unashamed of them all that the code is all public!

Speaking of which, my blog is primarily hosted at www.productiverage.com* and I write about various projects which are hosted on my Bitbucket account. Among them, the source code for the Blog itself, for the Full Text Indexer which powers the server-hosted blog and which generates the source index data which I've JSON-ified, the CSSParser which I use in the rewriting / site-flattening process, the BlogToNeocitiesTransformer which performs the site-flattening and a few other things that I've blogged about at various times. Ok, self-promotion over! :)

* (If you're reading this at my primary blog address, the NeoCities version with the client-side search can be found at productiverage.neocities.org).

Posted at 20:21