<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Kanojo.de Blog &#187; programming</title>
	<atom:link href="http://blog.kanojo.de/tag/programming/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.kanojo.de</link>
	<description></description>
	<lastBuildDate>Fri, 06 Jan 2012 09:54:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Mahjong ruleset, TeXed</title>
		<link>http://blog.kanojo.de/2010/01/02/mahjong-ruleset-texed/</link>
		<comments>http://blog.kanojo.de/2010/01/02/mahjong-ruleset-texed/#comments</comments>
		<pubDate>Sat, 02 Jan 2010 17:45:39 +0000</pubDate>
		<dc:creator>nebuk</dc:creator>
				<category><![CDATA[Games]]></category>
		<category><![CDATA[Mahjong]]></category>
		<category><![CDATA[computer]]></category>
		<category><![CDATA[Non-Tut]]></category>
		<category><![CDATA[programming]]></category>

		<guid isPermaLink="false">http://blog.kanojo.de/?p=268</guid>
		<description><![CDATA[As we've been playing more and more Mahjong (not the solitaire version, the real thing) recently, and just stumbled upon #mahjong on Rizon, where we were linked a really, really nice ruleset, I thought: wouldn't it be nice to have this as a booklet printout so you can check the rules or [...]]]></description>
			<content:encoded><![CDATA[<p>As we've been playing more and more <a title="Mahjong" href="http://en.wikipedia.org/wiki/Japanese_Mahjong">Mahjong</a> (not the solitaire version, the real thing) recently, and just stumbled upon #mahjong on Rizon, where we were linked a really, really nice ruleset (<a title="here" href="http://www.ofb.net/~whuang/ugcs/gp/mahjong/mahjong.html">here</a>), I thought: wouldn't it be nice to have this as a booklet printout so you can check the rules or yaku right at the table if you're unsure?</p>
<p>Okay, so after almost two days of fighting with XeLaTeX to get nice Unicode support, and fighting defoma to get a nice font, it's finally done!</p>
<p>You can fetch the <a href="http://tmp.kanojo.de/rules.pdf">PDF</a> here; the booklet version (just print it, fold the whole stack in the middle, short-edge oriented, and <a href="http://www.youtube.com/watch?v=pgD1cNiVLSM">staple</a> it together) is available as PostScript <a href="http://tmp.kanojo.de/rules2up.ps">here</a>.</p>
<p>Also note that there might be a few mistakes due to the hardcore TeX action in the typesetting. Feel free to report those to me to get 'em fixed. For errors in the original document, please contact the original author or me.</p>
<p>I hope you have fun playing with these rule sheets, and I hope they came out nicely <img src='http://blog.kanojo.de/wp-includes/images/smilies/icon_razz.gif' alt=':P' class='wp-smiley' /> . For a nice, short yaku overview, have a look <a href="http://www.osamuko.com/2009/12/20/yaku-overview-pdf/">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.kanojo.de/2010/01/02/mahjong-ruleset-texed/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Robust HTML Parsing (in Ruby)?</title>
		<link>http://blog.kanojo.de/2010/01/01/robust-html-parsing-in-ruby/</link>
		<comments>http://blog.kanojo.de/2010/01/01/robust-html-parsing-in-ruby/#comments</comments>
		<pubDate>Fri, 01 Jan 2010 18:12:33 +0000</pubDate>
		<dc:creator>nebuk</dc:creator>
				<category><![CDATA[Electronic]]></category>
		<category><![CDATA[tinkering]]></category>
		<category><![CDATA[computer]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[Techniques]]></category>

		<guid isPermaLink="false">http://blog.kanojo.de/?p=227</guid>
		<description><![CDATA[Have you ever wanted to extract information from a rather complex or totally broken (in terms of HTML standards compliance) website? Maybe you tried tackling the problem with regular expressions, or with a DOM or SAX XML parser. If you did, you probably ran into problems: maybe there were too many similar matches for your regex [...]]]></description>
			<content:encoded><![CDATA[<p>Have you ever wanted to extract information from a rather complex or totally broken (in terms of HTML standards compliance) website? Maybe you tried tackling the problem with regular expressions, or with a DOM or SAX XML parser. If you did, you probably ran into problems: maybe your regex produced too many similar matches because the site repeats similar patterns, or your XML parser choked on malformed or non-XHTML-compliant content.</p>
<p>I wanted to parse a website that had no RSS feed for changes and create an RSS feed for it. I first tried several of the ideas mentioned above, but as the website is kind of "irregular" (every item is slightly different) and the W3C validator shows over 11k errors (in 1.1 transitional), I had quite some problems.</p>
<p>Until I found Ruby's Hpricot, an HTML parser that lets you robustly parse fucked-up, non-standards-compliant markup with ease.</p>
<p><span id="more-227"></span></p>
<p>Hpricot is quite simple to use. The basic idea of the parsing part is that you specify the sequence of tags you want to walk down in the tree. So maybe you want the content of a div inside a td inside a tr inside a table inside a table inside a div ... you get the idea. By the way, <a title="Firebug" href="http://getfirebug.org">Firebug</a> is extremely useful for finding the structures you need in the HTML tree hierarchy. Hpricot walks down all paths that match your criteria and returns the subtree found at the end of each:<br />
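require 'hpricot'<br />
require 'open-uri'<br />
# Fetch the page and build an Hpricot document from it<br />
overview = Hpricot(open("INSERT SOME URL HERE"))<br />
prodno = 0<br />
(overview/"table").each do |product|<br />
  # Only tables without attributes and without the "closed on the days marked" text<br />
  if product.attributes == {} and not product.to_s.include? "closed on the days marked"<br />
    # Count every nested table found under tr/td<br />
    (product/"tr/td/table").each do |article|<br />
      prodno += 1<br />
    end<br />
  end<br />
end<br />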
This is a small snippet of my parser code. It opens the URL, fetches the content and creates an Hpricot document, then checks for every table whether it's the one we're looking for (identified by its attributes and its text content). Finally, it counts every item (in a 2-column table).</p>
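<p>To make the "walk down a sequence of tags" idea concrete, here is a toy model in plain Ruby. It is not how Hpricot is implemented: Node and walk are made-up names for illustration, the tiny tree is invented sample data, and the sketch only follows direct children.</p>

```ruby
# Toy model of path-based tree walking (NOT Hpricot's implementation).
# Node and walk are hypothetical names used only for this sketch.
Node = Struct.new(:tag, :children, :text)

# Return every node reached by following the given tag names, in order,
# starting from the direct children of `node`.
def walk(node, path)
  return [node] if path.empty?
  first, *rest = path
  node.children
      .select { |child| child.tag == first }
      .flat_map { |child| walk(child, rest) }
end

# Invented sample tree: a table with two rows (one cell each) and a caption.
tree = Node.new("table", [
  Node.new("tr", [Node.new("td", [], "first")], nil),
  Node.new("tr", [Node.new("td", [], "second")], nil),
  Node.new("caption", [], "ignored")
], nil)

# Collect the text of every td that sits directly inside a tr.
cells = walk(tree, %w[tr td])
puts cells.map { |cell| cell.text }.inspect
```

<p>Every branch that does not match the next tag in the sequence is simply dropped, which is why the caption never shows up in the result.</p>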
<p>As you can see, the basic idea is quite simple: fetch elements by their position in the tree, identify the right element unambiguously, then do the actual magic.</p>
<p>Hpricot also helps a lot with the actual magic! Things like<br />
linkurl = (content/"a")[0].attributes['href'].to_s.gsub("\r\n","")<br />
imgurl = (img/"img")[0].attributes['src'].to_s.gsub("\r\n","")<br />
name = (content/"a")[0].inner_text.to_s.gsub("\r\n","")<br />
are really simple (attributes should be self-explanatory; inner_text extracts only the text, not the tags that are children of the element you call it on).</p>
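<p>Since the same .to_s.gsub("\r\n","") cleanup shows up on every one of those lines, it could be pulled into a small helper. A minimal sketch in plain Ruby, assuming the only cleanup needed is removing those line breaks; clean is a made-up name, not part of Hpricot, and the sample strings are invented:</p>

```ruby
# Hypothetical helper (not part of Hpricot): turn any value into a string
# and remove the "\r\n" line breaks found in the scraped markup.
def clean(value)
  value.to_s.gsub("\r\n", "")
end

# Invented sample values standing in for scraped attribute or text content.
puts clean("Some product\r\n name\r\n").inspect
puts clean(nil).inspect
```

<p>With such a helper, a line from above would shrink to something like name = clean((content/"a")[0].inner_text).</p>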
<p>A slightly more sophisticated example would be this:<br />
      detail = Hpricot(open(linkurl))<br />
      price = nil<br />
      stock = nil<br />
      sale = nil<br />
      (detail/"table//tr//td//table").each do |data|<br />
        if data.to_s.include? "can buy from here"<br />
          entries = (data/"tr")<br />
          entries.delete_at(0)<br />
          entries.each do |entry|<br />
            estr = entry.to_s.downcase<br />
            if estr.include? "sale price"<br />
              price = (entry/"td")[1].inner_text.to_s.gsub("\r\n","")<br />
            elsif estr.include? "sale status"<br />
              sale = (entry/"td")[1].inner_text.to_s.gsub("\r\n","")<br />
            elsif estr.include? "stock status"<br />
              stock = (entry/"td")[1].inner_text.to_s.gsub("\r\n","")<br />
            end<br />
          end<br />
        end<br />
      end<br />
Here, a page like <a title="this" href="http://www.amiami.com/shop/?set=english&amp;vgForm=ProductInfo&amp;sku=FIG-MOE-0559&amp;template=default/product/e_display.html">this</a> is parsed for its price, sale status and stock status.</p>
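<p>The if/elsif chain above boils down to "find the row whose label matches, then take its second cell". Stripped of the Hpricot calls, that logic might look like the sketch below. FIELDS, extract_fields and the sample rows are assumptions for illustration; each row stands in for the cell texts of one tr, and unlike the original the sketch only looks at the label cell instead of the whole row.</p>

```ruby
# Sketch of the label matching only; no HTML parsing happens here.
# FIELDS and extract_fields are hypothetical names, the rows are made up.
FIELDS = {
  "sale price"   => :price,
  "sale status"  => :sale,
  "stock status" => :stock
}.freeze

# rows: array of [label_text, value_text] pairs, header row already removed.
def extract_fields(rows)
  result = { price: nil, sale: nil, stock: nil }
  rows.each do |label, value|
    FIELDS.each do |needle, key|
      # Case-insensitive substring match, like estr.include? in the original
      next unless label.to_s.downcase.include?(needle)
      result[key] = value.to_s.gsub("\r\n", "")
      break
    end
  end
  result
end

rows = [
  ["Sale Price", "1,000 JPY\r\n"],
  ["Sale Status", "Preorder"],
  ["Stock Status", "In stock\r\n"]
]
puts extract_fields(rows).inspect
```

<p>Keeping the labels in one table means a new field is a one-line change instead of another elsif branch.</p>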
<p>As you can see, this is a more robust approach than Python's string.split or XML DOM/SAX parsers, which don't work for non-standard sites. It's not as perfect as I would wish for "easy" HTML parsing, but it's better than everything I've seen so far.</p>
<p>Also, I'll post the script for parsing amiami.com for changes later (beware, no nice Ruby code, hacked together late at night in 1h <img src='http://blog.kanojo.de/wp-includes/images/smilies/icon_razz.gif' alt=':P' class='wp-smiley' /> ) so you can see a more elaborate example. I hope you'll have more fun parsing HTML using Hpricot and these snippets <img src='http://blog.kanojo.de/wp-includes/images/smilies/icon_razz.gif' alt=':P' class='wp-smiley' /> .</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.kanojo.de/2010/01/01/robust-html-parsing-in-ruby/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

