January 2010 – Page 2 – Kanojo

Have you ever wanted to parse information from some rather complex or totally broken (in terms of html standards compliance) website? Maybe you tried fighting that problem with regular expressions or DOM or SAX XML parser. If you did you probably ran into some problems: Maybe there were too many similar matches for your regex as there are repeating similar patterns in the website or your XML parser went crazy with invalid formatted or non-xhtml-compliant content?

I wanted to parse a website that had no RSS feed for changes and create a RSS feed. I first tried around with various of the ideas mentioned above but as the website is kind of “irregular” (every item is a slight bit different) and W3 validator shows over 11k of errors (in 1.1 transitional) i had quite some problems.

Until i found Rubies Hpricot, a HTML parser that lets you realize robust HTML parsing of fucked up formatted and non-standard-compliant content at ease.