KISS the DOM interpreter

This is the twelfth blog post, and the last in a rapid-fire series of LIFT-supported blogging exercises as prescribed in the first post. I will continue to blog about building the Smarter Grocery Store App, but as the Christmas holiday is coming to an end and my full-time day job is about to start again, the frequency of publication will be lower. Less quantity, more quality is the intention, so stay tuned!

Crawling store locations

To get a high-quality list of store locations, we plan to create some site visitors that index store locations from the retailers’ websites. I’ve created one visitor as an example of how to do such a thing. We’ll look at the ASP.NET C# code that does the job nicely. We’re still in the pilot stage, so the code quality is not brilliant, but it shows us what we need and how easy it is to achieve. We have targeted three brands that share the same retailer organization and a common website. These are Digros, Dirk van den Broek and Bas van der Heiden. The page we will index is shown in the screenshot below.
[Screenshot: bas_dirk_digros_winkels]
The list at the bottom is 101 stores long. It contains all the information that the properties of the Store entity in our information model require. I fetched a local copy of the HTML file for development purposes. You can fetch it yourself if you want to review the page source in detail, using the link above. I will only highlight the sections that are relevant for us. Our code uses two libraries, imported into our Visual Studio project using NuGet:

  • HtmlAgilityPack – Purposed to do just what we want, interpret the Document Object Model (DOM) of an external HTML page.
  • JSON.NET – Work with JavaScript Object Notation (JSON) objects in C# code.

Our code begins with loading the page into a document using the HtmlAgilityPack HtmlWeb object:
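
A minimal sketch of that loading step, assuming HtmlAgilityPack’s `HtmlWeb.Load` for the live page and `HtmlDocument.LoadHtml` for the local development copy. The class name `StorePageLoader` and both method names are hypothetical, and this is not the original code from the pilot project.

```csharp
using System.IO;
using HtmlAgilityPack;

// Hypothetical helper; the names are illustrative only.
public static class StorePageLoader
{
    // Downloads the page at the given URL and parses it into a DOM.
    public static HtmlDocument LoadFromWeb(string url)
    {
        var web = new HtmlWeb();
        return web.Load(url);
    }

    // Parses a locally saved copy of the page, handy during development.
    public static HtmlDocument LoadFromFile(string path)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(File.ReadAllText(path));
        return doc;
    }
}
```

Calling `StorePageLoader.LoadFromFile` with the path of the saved HTML file should give the same kind of `HtmlDocument` as loading from the web, so the rest of the visitor does not need to care where the page came from.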

Once the page is loaded, we find and deserialize the JSON data embedded in it. This data contains the store details, including the store location as a latitude and longitude pair and the brand the store belongs to, e.g. Digros, Dirk or Bas. In serialized format the string looks like this:

We find the node with the following piece of code:
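
As a rough sketch of such a lookup, assuming the JSON sits inside a `<script>` element that can be recognised by some marker text: the marker, the class name `StoreScriptFinder` and the method name are all hypothetical, since the actual page structure is not reproduced here.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper; the real page may need a different XPath or marker.
public static class StoreScriptFinder
{
    // Returns the first <script> node whose content contains the marker text.
    // Throws if no script matches, in line with the fail-fast approach.
    public static HtmlNode FindScriptContaining(HtmlDocument doc, string marker)
    {
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var scripts = doc.DocumentNode.SelectNodes("//script");
        if (scripts == null)
            throw new InvalidOperationException("No <script> elements found on the page.");

        // First() throws InvalidOperationException when no script contains the marker.
        return scripts.First(s => s.InnerHtml.Contains(marker));
    }
}
```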

Then we extract and deserialize the JSON object like:
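
One way this could look with JSON.NET, assuming the script text holds a single JSON array of store objects. The property names `name`, `brand`, `lat` and `lng`, and the classes `StoreDetail` and `StoreJsonReader`, are assumptions for illustration; the real field names on the page may differ.

```csharp
using System;
using System.Collections.Generic;
using Newtonsoft.Json;

// Hypothetical shape of one entry in the embedded JSON array.
public class StoreDetail
{
    [JsonProperty("name")] public string Name { get; set; }
    [JsonProperty("brand")] public string Brand { get; set; }
    [JsonProperty("lat")] public double Latitude { get; set; }
    [JsonProperty("lng")] public double Longitude { get; set; }
}

public static class StoreJsonReader
{
    // Cuts the outermost [...] out of the script text and deserializes it.
    public static List<StoreDetail> Parse(string scriptText)
    {
        int start = scriptText.IndexOf('[');
        int end = scriptText.LastIndexOf(']');
        if (start < 0 || end <= start)
            throw new FormatException("No JSON array found in the script text.");

        string json = scriptText.Substring(start, end - start + 1);
        return JsonConvert.DeserializeObject<List<StoreDetail>>(json);
    }
}
```

Taking everything between the first `[` and the last `]` is deliberately naive. It breaks as soon as the script contains a second array, which fits the "let it break" philosophy of this pilot but would need tightening later.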

There is no need for defensive programming here. If the page changes structure, I want my script to break and throw an exception. That way I put the least amount of effort into the visitor itself and still find out when a page has changed. It doesn’t need to be smarter than that yet!

This gives us the store details in an accessible format. Next up is the summary information that is visible to the customer on the page. We will use this as the main means of iteration and indexing, and inject detailed information as required from the prepared JSON array. First, let’s see how a store summary result is formatted on the page:

Pretty straightforward, isn’t it? There are 101 entries like this on the page. Because of this repeated and consistent structure, parsing is super easy. First we find all DIVs with a class attribute of result, and we prepare a list of store objects whose properties we will define on the go.
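
Selecting those DIVs could be sketched as follows, assuming an exact match on the class attribute is enough. `ResultNodeFinder` and `FindResults` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper; not the original code.
public static class ResultNodeFinder
{
    public static List<HtmlNode> FindResults(HtmlDocument doc)
    {
        // Exact match on the attribute value: class="result wide" would not match.
        var nodes = doc.DocumentNode.SelectNodes("//div[@class='result']");

        // SelectNodes returns null when nothing matches; treat that as a changed page.
        if (nodes == null)
            throw new InvalidOperationException("No result DIVs found; the page layout may have changed.");

        return nodes.ToList();
    }
}
```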

Then we loop over all the result nodes we found and parse their content to fit our formatting rules for the store properties:
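
A sketch of such a loop is shown below. The inner markup is an assumption: it supposes each result holds the store name in an `h3` and the address in a `p` with class `address`, which may not match the real page. `StoreSummary`, `ResultParser` and `Clean` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical summary shape; the real entity has more properties.
public class StoreSummary
{
    public string Name { get; set; }
    public string Address { get; set; }
}

public static class ResultParser
{
    public static List<StoreSummary> ParseAll(IEnumerable<HtmlNode> resultNodes)
    {
        var stores = new List<StoreSummary>();
        foreach (var node in resultNodes)
        {
            // SelectSingleNode returns null when the element is missing, so the
            // InnerText access throws if the page structure has changed.
            string name = node.SelectSingleNode(".//h3").InnerText;
            string address = node.SelectSingleNode(".//p[@class='address']").InnerText;

            stores.Add(new StoreSummary { Name = Clean(name), Address = Clean(address) });
        }
        return stores;
    }

    // Decodes HTML entities and collapses runs of whitespace into single spaces.
    private static string Clean(string raw)
    {
        string decoded = HtmlEntity.DeEntitize(raw);
        string[] parts = decoded.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", parts);
    }
}
```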

At this point, we incorporate reading the detailed information in the loop:
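
How the summary and the detail records are matched is not spelled out here, so the sketch below simply assumes they share the store name as a key. That key, and the names `Store`, `StoreDetail` and `StoreMerger`, are assumptions for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical shapes; property names are illustrative only.
public class StoreDetail
{
    public string Name { get; set; }
    public string Brand { get; set; }
    public double Latitude { get; set; }
    public double Longitude { get; set; }
}

public class Store
{
    public string Name { get; set; }
    public string Address { get; set; }
    public string Brand { get; set; }
    public double Latitude { get; set; }
    public double Longitude { get; set; }
}

public static class StoreMerger
{
    // Copies brand and coordinates from the JSON details onto each parsed store.
    public static void ApplyDetails(IEnumerable<Store> stores, IEnumerable<StoreDetail> details)
    {
        // ToDictionary throws on duplicate names, which surfaces bad assumptions early.
        Dictionary<string, StoreDetail> byName = details.ToDictionary(d => d.Name);

        foreach (var store in stores)
        {
            // The indexer throws KeyNotFoundException when a store has no detail record.
            StoreDetail detail = byName[store.Name];
            store.Brand = detail.Brand;
            store.Latitude = detail.Latitude;
            store.Longitude = detail.Longitude;
        }
    }
}
```

Building the dictionary once keeps the lookup inside the loop cheap, and both failure modes (duplicate or missing names) throw instead of silently producing incomplete stores.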

And that essentially does the trick! The store entity we ended up with looks like this:

Very simple, yet pretty complete already. Again, no effort has gone into getting a high-quality display in place; this just proves that it is easy peasy. And we can do the same for other retailers’ websites that expose store location information. The next blog post will describe how we can publish this data using Azure Mobile Services. Fun fun fun!
