Scrape 'em!

One of those moments where "scripting" beats compiled languages :-)

I wanted to prove a point about the value of needing fewer lines of code to get something done, and another about how readable that code is.

The Python code is below. Naturally, the other option is to deal with semicolons, braces and compilation in that other language.

Here’s the code.

The whole thing was about the merits and demerits of different tools for scraping existing pages and turning them into RSS feeds. Since I can't link to the original pages, I used databasejournal.com as an example.
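For anyone who wants to try the idea without hitting a live site, here is a minimal sketch of the scraping step. It assumes the link pattern quoted in the Ruby comment below (minus the escaped slashes) and runs against a small HTML fragment that I made up for illustration, so the names SAMPLE_HTML and scrape are hypothetical and the real page markup may well differ.

```python
import re

# Pattern borrowed from the Ruby comment below; Python needs no "\/" escapes.
# The four groups are: link, title, date, description.
PARSE_PATTERN = re.compile(
    r'<a class="header" href="(.*?)"><b>(.*?)</b></a>\s+'
    r'<font size="2"><b>(.*?)</b></font><br>(.*?)<p>',
    re.DOTALL,  # let "." match newlines, since descriptions can wrap
)

# Hypothetical markup, invented for this example only.
SAMPLE_HTML = '''
<a class="header" href="/features/oracle/article.php/1"><b>First article</b></a>
<font size="2"><b>October 20, 2003</b></font><br>A short summary.<p>
<a class="header" href="/features/oracle/article.php/2"><b>Second article</b></a>
<font size="2"><b>October 13, 2003</b></font><br>Another
summary spanning two lines.<p>
'''


def scrape(html):
    """Return a list of (link, title, date, description) tuples."""
    return PARSE_PATTERN.findall(html)


for link, title, date, description in scrape(SAMPLE_HTML):
    # Collapse internal whitespace so wrapped descriptions print on one line.
    print(link, title, date, " ".join(description.split()))
```

Each tuple maps fairly directly onto an RSS item (link, title, date, description), which is what makes a regex this small good enough for the job, at least until the page layout changes.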

  1. Neat code. It does fail in Python 2.2.3, though, with a "maximum recursion limit exceeded" error.

    It runs successfully under Python 2.3.

    Posted by: Glenn Stauffer on October 23, 2003 05:08 PM
  2. The same code in ruby is even more beautiful:
    require 'open-uri'
    pattern = /<a class="header" href="(.*?)"><b>(.*?)<\/b><\/a>\s+<font size="2"><b>(.*?)<\/b><\/font><br>(.*?)<p>/m
    url = 'http://www.databasejournal.com/features/oracle/'
    open(url) { |io| io.read.scan(pattern) { |a| p a } }

    Posted by: ruby fan on October 24, 2003 01:12 PM
  3. hmmmm... I like verbose code, so my judgement is that Python code (in this instance) is more readable, and hence more beautiful.

    Posted by: Babu on October 24, 2003 02:08 PM
  4. I don't like this line of Ruby, too much symbology gives me Perl flashbacks:

    open(url) { |io| io.read.scan(pattern) { |a| p a } }

    But is Python better? What is the ratio of alpha characters to total code length for each?

    # start python code

    class AlphaRatio(str):
        # Letters that count as "alpha"; a class attribute so getRatio can see it.
        alpha = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

        def __init__(self, code=''):
            self.code = code
            self.len = len(code)
            self.count = 0
            self.ratio = float(1)

        def getRatio(self):
            # Recount from zero so repeated calls return the same value.
            self.count = 0
            for position in range(0, len(self.code)):
                if self.code[position] in self.alpha:
                    self.count += 1
            self.ratio = float(self.count) / float(self.len)
            return self.ratio

    >>> ruby=AlphaRatio('open(url) { |io| io.read.scan(pattern) { |a| p a } }')

    >>> ruby.getRatio()
    0.55769230769230771

    >>> # comment: the ruby line is about 55.8% alpha

    >>> pythonVers='''import urllib, re
    parse_pattern = re.compile("""<a class="header" href="(.*?)"><b>(.*?)</b></a>\s+<font size="2"><b>(.*?)</b></font><br>(.*?)<p>""", re.DOTALL)
    toc_src = urllib.urlopen("http://www.databasejournal.com/features/oracle/").read()
    for (link, title, date, description) in parse_pattern.findall(toc_src):
        print (link, title, date, description)'''

    >>> py=AlphaRatio(pythonVers)

    >>> py.getRatio()
    0.625

    >>> # comment: the python code is 62.5% alpha

    >>> rubyVers='''require 'open-uri'
    pattern = /<a class="header" href="(.*?)"><b>(.*?)<\/b><\/a>\s+<font size="2"><b>(.*?)<\/b><\/font><br>(.*?)<p>/m
    url = 'http://www.databasejournal.com/features/oracle/'
    open(url) { |io| io.read.scan(pattern) { |a| p a } }'''

    >>> rube=AlphaRatio(rubyVers)

    >>> rube.getRatio()
    0.53941908713692943

    >>> # comment: the ruby code is only about 54% alpha characters


    This is an awfully rough measure of readability, but it seems to quantify what I subjectively feel about the ruby code: too much meaning packed into symbols, not enough use of natural language.

    But I guess that's why people call it "code"!

    Eric
    [apologies, it looks like all indenting is lost in the comments...]

    Posted by: ep on May 16, 2004 09:49 PM
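The alpha-ratio measure from the last comment also fits in a single function. This is only a sketch of the same idea, not the commenter's code: the name alpha_ratio is made up here, and it counts ASCII letters only, as the class above does.

```python
import string


def alpha_ratio(code):
    """Return the fraction of characters in `code` that are ASCII letters."""
    if not code:
        raise ValueError("code must not be empty")
    letters = sum(1 for ch in code if ch in string.ascii_letters)
    return letters / len(code)


ruby_line = 'open(url) { |io| io.read.scan(pattern) { |a| p a } }'
print(round(alpha_ratio(ruby_line), 4))  # 29 letters out of 52 characters
```

A function avoids the stored count, so calling it twice on the same string cannot drift, and the empty-string case fails with a clear message instead of a division error.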