The NeoCities Challenge! aka The Full Text Indexer goes client-side!

10 July 2013

The NeoCities Challenge! aka The Full Text Indexer goes client-side!

When I heard about NeoCities last week, I thought it was a cool idea - offering some back-to-basics hosting for an outlay of absolutely zero. Yeah, the first thing that came to mind was those geocities sites that seemed to want to combine pink text, lime green backgrounds and javascript star effects that chase the cursor.. but that's just nostalgia! :)

The more I thought about it, the more I sort of wondered whether I couldn't host this blog there. I'm happy with my hosting, it's cheap and fast, but I just liked the idea of creating something simple with the raw ingredients of html, css and javascript alone. I mean, it should be easy enough to flatten the site so that all of the pages are single html files, with a different file per month, per Post, per tag.. the biggest stumbling blog was the site search which is powered by my Full Text Indexer project. Obviously that won't work when everything is on the client, when there is no server backend to power it. Unless..

.. a client-side Full Text Indexer Search Service.. ??

Challenge accepted!! :D

NeoCities Hosting

Before I get carried away, I'll breeze through the NeoCities basics and how I got most of my site running atNeoCities. I initially thought about changing the code that runs my site server-side to emit a flattened version for upload. But that started to seem more and more complicated the more I thought about it, and seemed like maintaining it could be a headache if I changed the blog over time.

So instead, I figured why not treat the published blog (published at www.productiverage.com) as a generic site and crawl it, grab content, generate new urls for a flattened structure, replace urls in the existing content with the new urls and publish it like that. How hard could it be??

Since my site is almost entirely html and css (well, LESS with some rules and structure, but it compiles down to css so why be fussy) with only a smattering of javascript, this should be easy. I can use the Html Agility Pack to parse and alter html and I very conveniently have a CSS Parser that can be used to update urls (like backgrounds) in the stylesheets.

And, in a nutshell, that's what I've done. The first version of this NeoCities-hosted blog hid the site search functionality and ran from pure html and css. I had all of the files ready to go in a local IIS site as a proof of concept. Then I went to upload them.. With the Posts and the Archive and Tags pages, there were 120-something files. I'd only used the uploader on the NeoCities page to upload a single test file up to this point and so hadn't realised that that's all it would let you do; upload a single file at a time. Even if I was willing to upload all of the files individually now (which, I must admit, I did; I was feeling overexcitable about getting the first version public! :) this wouldn't scale, I couldn't (wouldn't) do it every time I changed something - a single new Post could invalidate many Pages, considering the links between Posts, the Monthly Archive pages, the Tags pages, etc..

I spent a few minutes googling to see if anyone else could offer a sensible solution and came up empty.

So Plan B was to have a look with Fiddler to see what the upload traffic looked like. It seemed fairly straight-forward, an authorisation cookie (to identify who I was logged in as) and a "csrf_token" form value, along with the uploaded file's content. I was familiar with the phrase "Cross Site Request Forgery" (from "csrf_token") but didn't really know what it meant in this context. I thought I'd take a punt and try manipulating the request to see if that token had to vary between uploads and it didn't seem to (Fiddler lets you take a request, edit it and broadcast it - so uploading a text file provided an easy to mess-with request, I could change one character of the file and leave everything else, content-length included, the same and refresh the browser to see if the new content had arrived).

This was enough to use the .net WebRequest class and upload files individually. I wrote something to loop over the files in my local version of the site and upload them all.. and it worked! There were some stumbling blocks with getting the cookie sent and specifying the form value and the file to upload but StackOverflow came to the rescue each time. There is one outstanding issue that each upload requests received a 500 Server Error response even though it was successful, but I chose to ignore that for now - yes, this approach is rough around the edges but it's functional for now!

In case this is useful to anyone else, I've made the code available at Bitbucket: The BlogToNeocitiesTransformer.

If you plug in the auth cookie and csrf token values into the (C#) console application (obtaining those values by looking at a manual upload request in Fiddler still seems like the only way right now) then you can use it yourself, should you have need to. That app actually does the whole thing; downloads a site's content, generates a flattened version (rewriting the html and css, ensuring the urls follow NeoCities' filename restrictions) and then uploads it all to a NeoCities account.

Temporary Measures

Thankfully, this file upload situation looks to only be temporary. Kyle Drake (NeoCities' proud father) has updated the blog today with NeoCities can now handle two million web sites. Here's what we're working on next. In there he says that the file upload process is going to be improved with "things like drag-and-drop file uploading, and then with an API to allow developers to write tools to upload files", which is excellent news! He also addresses the limits on file types that may be uploaded. At the moment "ico" is not in the white list of allowed extensions and so I can't have a favicon for my blog at NeoCities. But this is soon to be replaced with a "black list" (to block executables and the like) so my favicon should soon be possible* :)

* (I half-contemplated, before this announcement, a favicon in another format - such as gif or png - but it seems that this counts out IE and I wanted a solution for all browsers).

Hopefully this black listing approach will allow me to have an RSS feed here as well - essentially that's just an xml file with an xslt to transform the content into a nice-to-view format. And since xslt is just xml I thought that it might work to have an xslt reference in the xml file that has an xml extension itself. But alas, I couldn't get it working, I just kept getting a blank screen in Chrome although "view source" showed the content was present. I'll revisit this when the file restrictions have been changed.

Update (25th July 2013): The NeoCities interface has been updated so that multiple files can be uploaded by dragging-and-dropping! I still can't upload my favicon yet though..

Update (12th August 2014): I'm not sure when, exactly, but favicons are now support (any .ico file is, in fact). It's a little thing but I do think it adds something to the identity of a site so I'm glad they changed this policy.

Site Search!

This was all well and good, but at this point the site search was missing from the site. This functionality is enabled by server-side code that takes a search string, tries to find matching Posts and then shows the results with Post titles and content excerpts with the matching term(s) highlighted. The matching is done against an index of tokens (possible words) so that the results retrieval can be very fast. The index records where in the source content it matches the token, which enables the except-highlighting. It has support for plurality matching (so it knows that "cat" and "cats" can be considered to be the same word) and has some other flexibility with ignoring letter case, ignoring accents on characters, ignoring certain characters (so "book's" is considered the same as "books") and search term splitting (so "cat posts" matches Posts with "cat" and "posts" in, rather than requiring that a Post contain the exact phrase "cat posts").

But the index is essentially a string-key dictionary onto the match data (the C# code stores it as a ternary search tree for performance but it's still basically just a dictionary). Javascript loves itself some associative arrays (all objects are associated arrays, basically bags of string-named properties) and associative arrays are basically string-key dictionaries. It seemed like the start of an idea!

If the index was still generated server-side (as it has to be for my "primary" blog hosting) then I should be able to represent this data as something that javascript could interpret and then perform the searching itself on the client..

The plan:

Get the generated IIndexData<int> from my server-side blog (it's an int since the Posts all have a unique int "Id" property)
Use the GetAllTokens and GetMatches methods to loop through each token in the data and generate JSON to represent the index, storing at this point only the Post Keys and their match Weight so that searching is possible
Do something similar to get tokens, Keys, Weights and Source Locations (where in the source content the matches were identified) for each Post, resulting in detailed match data for each individual Post
Get plain text content for each Post from the server-side data (the Posts are stored as Markdown in my source and translated into html for rendering and plain text for searching)
Write javascript classes that can use this data to perform a search and then render the results with matched terms highlighted, just like the server-side code can do
Profit!

(This is actually a slightly amended plan from the original, I tried first to generate a single JSON file with the detailed content for all Posts but it was over 4 meg uncompressed and still bigger than I wanted when delivered from NeoCities gzip'd. So I went for the file-with-summary-data-for-all-Posts and then separate files for detailed data for individual Posts).

I used JSON.Net for the serialisation (it's just the go-to for this sort of thing!) and used intermediary C# classes where each property was only a single character long to try to keep the size of the serialised data down. (There's nothing complicated about this, if more detail is of interest then the code can be found in the JsonSearchIndexDataRecorder class available on Bitbucket).

C# -> Javascript

So now I had a single JSON file for performing a search across all Posts, multiple JSON files for term-highlighting individual Posts and text files containing the content that the source locations mapped onto (also for term-highlighting). I needed to write a javascript ITokenBreaker (to use the parlance of the Full Text Indexer project) to reduce a search term into individual words (eg. "Cat posts" into "Cat" and "posts"), then an IStringNormaliser that will deal with letter casing, pluralisation and all that guff (eg. "Cat" and "posts" into "cat~" and "post~"). Then a way to take each of these words and identify Posts which contain all of the words. And finally a way to use ajax to pull in the detailed match data and plain text content to display the Post excerpts with highlighted search term matches.

To make things as snappy-feeling as possible, I wanted to display the title of matched Posts first while waiting for the ajax requests to deliver the match highlighting information - and for the content excerpts to be added in later.

The file IndexSearchGenerator.js takes a JSON index file and a search term, breaks the search term into words, "normalises" those words, identifies Posts that contain all of the normalised words and returns an array of Key, Weight and (if it was present in the data) Source Locations. It's only 264 lines of non-minified javascript and a lot of that is the mapping of accented characters to non-accented representations. (The Source Locations will not be present in the all-Posts summary JSON but will be in the per-Post detail JSON).

SearchTermHighlighter.js takes the plain text content of a Post, a maximum length for a content-excerpt to show and a set of Source Locations for matched terms and returns a string of html that shows the excerpt that best matches the terms, with those terms highlighted. And that's only 232 lines of non-minified, commented code. What I found particularly interesting with this file was that I was largely able to copy-and-paste the C# code and fudge it into javascript. There were some issues with LINQ's absence. At the start of the GetPlainTextContentAsHighlightedHtml method, for example, I need to get the max "EndIndex" of any of the highlighted terms - I did this by sorting the Source Locations data by the EndIndex property and then taking the EndIndex value of the last element of the array. Easy! The algorithm for highlighting (determining the best excerpt to take and then highlighting any terms, taking care that any overlapped terms don't cause problems) wasn't particularly complicated but it was fairly pleasant to translate it "by rote" and have it work at the end.

Finally, SearchPage.js fills in the gaps by determining whether you're on the search page and extracting (and url-decoding) the terms from the Query String if so, performing the initial search against the summary data (displaying the matched titles) and then making the ajax requests for the detailed data and rendering the match excerpts as they become available. Looping through the results, making ajax requests for detail data and handling the results for each Post when it is delivered is a bit like using a complicated asynchronous model in .net but in javascript this sort of async callback madness is parr for the course :) If this script decides that you're not on the search page then it makes an ajax request for the summary regardless so that it can be browser-cached and improve the performance of the first search you make (the entirety of the summary data is only 77k gzip'd so it's no big deal if you don't end up actually performing a search).

This last file is only 165 lines of commented, white-spaced javascript so the entire client-side implementation of the search facility is still fairly approachable and maintainable. It's effective (so long as you have javascript enabled!) and - now, bear with me if I'm just being overly impressed with my own creation - it looks cool performing the complicated search mechanics so quickly and fading in the matched excerpts! :)

Signing off

I'm really proud of this and had a lot of fun within the "NeoCities boundaries"! Yes, the source-site-grabbing-and-rewriting could be tidied up a bit. Yes, the file upload is currently a bit of a dodgy workaround. Yes, I still have a handful of TODOs about handling ajax failures in SearchPage.js. Yes, the search index files are bigger than I would have liked (significantly larger than not only the plain text content but also their full page html representations), which I may address by trying out a more complicated format rather than naive JSON (which was very easy). But it all works! And it works well. And the bits that are a bit untidy are only a bit untidy, on the whole they're robust and I'm sufficiently unashamed of them all that the code is all public!

Speaking of which, my blog is primarily hosted at www.productiverage.com* and I write about various projects which are hosted on my Bitbucket account. Among them, the source code for the Blog itself, for the Full Text Indexer which powers the server-hosted blog and which generates the source index data which I've JSON-ified, the CSSParser which I use in the rewriting / site-flattening process, the BlogToNeocitiesTransformer which performs the site-flattening and a few other things that I've blogged about at various times. Ok, self-promotion over! :)

* (If you're reading this at my primary blog address, the NeoCities version with the client-side search can be found at productiverage.neocities.org).

Posted at 20:21

Last time:

Parsing CSS

Auto-releasing Event Listeners

Tags:

Productive Rage