Robust HTML Parsing (in Ruby)?

Have you ever wanted to extract information from a rather complex or totally broken (in terms of HTML standards compliance) website? Maybe you tried fighting that problem with regular expressions or with a DOM or SAX XML parser. If you did, you probably ran into some problems: your regex matched too many similar spots because the page repeats similar patterns, or your XML parser went crazy on invalidly formatted, non-XHTML-compliant content.

I wanted to watch a website that has no RSS feed for changes and generate an RSS feed from it myself. I first played around with the various ideas mentioned above, but since the website is kind of “irregular” (every item is a slight bit different) and the W3C validator reports over 11,000 errors (against 1.1 transitional), I had quite some problems.

Until I found Ruby's Hpricot, an HTML parser that lets you do robust parsing of horribly formatted, non-standard-compliant content with ease.
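Just to show what I mean, here is a tiny made-up snippet (the markup and class name are invented) where nothing is closed properly and Hpricot still finds what you ask for:

require 'hpricot'

# deliberately broken markup: not a single closing tag
broken = '<div class="box"><p>Hello <b>world'

doc  = Hpricot(broken)
elem = doc.at("div.box p")   # CSS-style search still works on the repaired tree
puts elem.inner_text         # prints "Hello world"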

Hpricot is quite simple to use. The basic idea of the parsing part is that you specify the order of tags you want to walk down in the tree. So maybe you want the content of a div inside a td inside a tr inside a table inside a table inside a div … you get the idea. (By the way, Firebug is extremely useful for finding the structures you need in the HTML tree hierarchy.) Hpricot walks down every path that matches your criteria and returns you the subtree it finds there:
require 'hpricot'
require 'open-uri'

overview = Hpricot(open("INSERT SOME URL HERE"))
prodno = 0
(overview/"table").each do |product|
  # the table we want has no attributes and doesn't contain the shop-closed notice
  if product.attributes == {} && !product.to_s.include?("closed on the days marked")
    (product/"tr/td/table").each do |article|
      prodno += 1   # one product per inner table
    end
  end
end
This is a small snippet of my parser code: it opens the URL, fetches the content and creates an Hpricot parser object, then for every table checks whether it is the table we are looking for (identified by its attributes and its text content). Then it counts every item (in a 2-column table).

As you can see, the basic idea is quite simple: fetch elements by their position in the tree, identify the element you want beyond doubt, then do the actual magic.
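If you squint a bit, the fetch-and-identify part boils down to a tiny helper like this (the URL and the marker text are just placeholders, not the real site):

require 'hpricot'
require 'open-uri'

# return all tables on a page whose markup contains a given marker string
def matching_tables(url, marker)
  doc = Hpricot(open(url))
  (doc/"table").select { |table| table.to_s.include?(marker) }
end

puts matching_tables("http://example.com/products", "closed on the days marked").size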

For the actual magic Hpricot also helps you a lot! Things like
linkurl = (content/"a")[0].attributes['href'].to_s.gsub("\r\n", "")
imgurl  = (img/"img")[0].attributes['src'].to_s.gsub("\r\n", "")
name    = (content/"a")[0].inner_text.to_s.gsub("\r\n", "")
are as simple as that (attributes should be self-explanatory; inner_text extracts only the text, not the tags that are children of the element you call it on).
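To make the difference obvious, here is a tiny invented snippet showing attributes, inner_text and inner_html side by side:

require 'hpricot'

doc  = Hpricot('<td><a href="/item42">Item <b>42</b></a></td>')
link = doc.at("a")

puts link.attributes['href']   # => "/item42"
puts link.inner_text           # => "Item 42"        (child tags stripped)
puts link.inner_html           # => "Item <b>42</b>" (child markup kept)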

A slightly more sophisticated example would be this:
detail = Hpricot(open(linkurl))

price = nil
stock = nil
sale  = nil
(detail/"table//tr//td//table").each do |data|
  # only the inner table containing the "can buy from here" text is interesting
  if data.to_s.include?("can buy from here")
    entries = (data/"tr")
    entries.delete_at(0)   # drop the header row
    entries.each do |entry|
      estr = entry.to_s.downcase
      if estr.include?("sale price")
        price = (entry/"td")[1].inner_text.to_s.gsub("\r\n", "")
      elsif estr.include?("sale status")
        sale = (entry/"td")[1].inner_text.to_s.gsub("\r\n", "")
      elsif estr.include?("stock status")
        stock = (entry/"td")[1].inner_text.to_s.gsub("\r\n", "")
      end
    end
  end
end
Here a product detail page is parsed for information like stock status, sale status and the price tag.
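Since the whole point was to turn data like this into an RSS feed, here is a rough sketch of how the extracted values could be fed into Ruby's standard rss library (the channel metadata, file name and placeholder values are made up, not from my actual script):

require 'rss/maker'

# placeholder values; in the real script these come from the Hpricot code above
name, linkurl = "Some product", "http://example.com/item42"
price, stock, sale = "1234 yen", "in stock", "on sale"

feed = RSS::Maker.make("2.0") do |maker|
  maker.channel.title       = "Product changes"
  maker.channel.link        = "http://example.com/"
  maker.channel.description = "Scraped price and stock changes"

  maker.items.new_item do |item|
    item.title = "#{name} - #{price} (#{stock}, #{sale})"
    item.link  = linkurl
    item.date  = Time.now
  end
end

File.open("feed.xml", "w") { |f| f.write(feed.to_s) }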

As you can see, this is a much more robust approach than Python's string.split or XML DOM/SAX parsing, which simply don't work for non-standard sites. It's not as perfect as I would wish for “easy” HTML parsing, but it's better than everything I've seen so far.
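To illustrate why the strict XML route fails, here is a small invented example: REXML from Ruby's standard library refuses the kind of markup Hpricot happily accepts:

require 'rexml/document'

broken = "<table><tr><td>no closing tags here"

begin
  REXML::Document.new(broken)   # strict XML parsing
rescue REXML::ParseException
  puts "REXML gives up"         # Hpricot(broken) would parse it just fine
end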

I'll also post the script for watching amiami.com for changes later (beware, no nice Ruby code, hacked together late at night in an hour :P) so you can see a more elaborate example. I hope you'll have more fun parsing HTML with Hpricot and these snippets :P.
