Regex HTML extraction examples

A couple of examples that parse web pages containing news items.

Every other week, I find myself facing a job that requires regular expressions. And every other week, I need to look things up again in the Python re module documentation and the Regex HOWTO.

In an effort to reduce that time, here is working code that, at the moment, parses the National Geographic News and IBM developerWorks home pages. Perhaps it might be useful to newbies too. Each function returns a list of tuples of the form (title, url, description, date, category).
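
As a rough illustration of the approach, here is a minimal sketch of one such function. The markup in SAMPLE_HTML, the class names, and the names ITEM_RE and parse_items are all made up for this example; they are not taken from either site, and a real page would need its own pattern.

```python
import re

# Hypothetical markup, invented for this example. Real sites differ
# and change often, so the pattern must be adapted per site.
SAMPLE_HTML = """
<div class="item">
  <a href="/news/2003/09/volcano.html">Volcano wakes up</a>
  <span class="date">2003-09-15</span>
  <span class="cat">Science</span>
  <p>Scientists report new activity.</p>
</div>
<div class="item">
  <a href="/news/2003/09/storm.html">Storm nears coast</a>
  <span class="date">2003-09-17</span>
  <span class="cat">Weather</span>
  <p>Residents prepare
  for landfall.</p>
</div>
"""

# re.VERBOSE lets the pattern span several lines; re.DOTALL lets "."
# match newlines so a description may wrap across lines.
ITEM_RE = re.compile(
    r'''<a\s+href="(?P<url>[^"]+)">(?P<title>.*?)</a>\s*
        <span\s+class="date">(?P<date>.*?)</span>\s*
        <span\s+class="cat">(?P<category>.*?)</span>\s*
        <p>(?P<description>.*?)</p>''',
    re.DOTALL | re.VERBOSE,
)


def parse_items(html):
    """Return a list of (title, url, description, date, category) tuples."""
    items = []
    for m in ITEM_RE.finditer(html):
        # Collapse runs of whitespace left over from line-wrapped markup.
        description = " ".join(m.group("description").split())
        items.append((m.group("title").strip(), m.group("url"),
                      description, m.group("date"), m.group("category")))
    return items
```

Calling parse_items(SAMPLE_HTML) yields one tuple per news item. Named groups keep the tuple order independent of the order in which the fields appear in the markup.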

If you would like to generate RSS from this, check out the PyRSS2Gen module for Python.
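
The RSS module mentioned above is a third-party package, so as a dependency-free stand-in, here is a small sketch that builds a bare-bones RSS 2.0 document with the standard library's xml.etree.ElementTree instead. The function name items_to_rss and its parameters are assumptions for illustration; it expects tuples in the (title, url, description, date, category) form described earlier.

```python
import xml.etree.ElementTree as ET


def items_to_rss(feed_title, feed_link, items):
    """Build a minimal RSS 2.0 string from (title, url, description,
    date, category) tuples. Hypothetical helper, not a library API."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    # RSS 2.0 requires a channel description; the title is reused here.
    ET.SubElement(channel, "description").text = feed_title
    for title, url, description, date, category in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = url
        ET.SubElement(item, "description").text = description
        # RSS 2.0 expects RFC 822 dates in pubDate; the date string is
        # passed through unchanged, so convert it beforehand if needed.
        ET.SubElement(item, "pubDate").text = date
        ET.SubElement(item, "category").text = category
    # ElementTree escapes &, < and > in text for us.
    return ET.tostring(rss, encoding="unicode")
```

Building the tree instead of concatenating strings means titles containing characters such as "&" come out properly escaped.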

It is slowly getting harder to remember what I read in the documentation, especially when I am reading documentation for different technologies all the time. I think writing code snippets and templates for ready reference is a better way to keep things in memory a little longer.

[Update: 2003-09-22] Scrape article listing to generate wget commands for downloading
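
The script behind that update is not reproduced here, so the following is only a sketch of the general idea: pull the links out of a listing page and print one wget command per article. The "/articles/" path filter and the names LINK_RE and wget_commands are assumptions made up for this example.

```python
import re
import shlex
from urllib.parse import urljoin

# Capture the href of every anchor tag, whatever other attributes it has.
LINK_RE = re.compile(r'<a\s+[^>]*href="([^"]+)"', re.IGNORECASE)


def wget_commands(listing_html, base_url, path_marker="/articles/"):
    """Return a list of 'wget <url>' strings for article links.

    path_marker is a hypothetical filter: only URLs containing it are
    treated as articles. Adjust it to the site being scraped.
    """
    commands = []
    seen = set()
    for href in LINK_RE.findall(listing_html):
        url = urljoin(base_url, href)  # resolve relative links
        if path_marker not in url or url in seen:
            continue
        seen.add(url)
        # shlex.quote guards against characters like & or ? in the URL.
        commands.append("wget " + shlex.quote(url))
    return commands
```

The returned lines can be written to a file and run as a shell script, which keeps the scraping step and the downloading step separate.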

  1. I wrote similar code a while ago, to gather URLs from various (computer-related) news sites. The resulting program, called Mygale, can be found here: http://www.awaretek.com/nowak/mygale.html

    Most of it was written in late 2001. I cannot guarantee that the code still works on recent versions of the sites.

    Posted by: Hans on September 18, 2003 03:05 PM
  2. Well, I was wondering if you guys could show a bit of Python power. I need to extract HTML tables from about 200 web pages. I don't know Python, but wanted to use it for this task so I could learn. Any tips or pointers would be very helpful.

    Thanks

    Posted by: Geoff on September 18, 2003 03:12 PM
  3. Geoff, if the HTML pages are well formed, you can use the sgmllib module for easy parsing. See the section on HTML processing at diveintopython.org. If they are not, you might want to pass them through HTML Tidy first to make them well formed. Otherwise, as in the example code above, you could use regular expressions.

    Once you can parse one file properly, just wrap that code in a function and call it from a loop like 'for file in [file1, file2, ...]'; see the python.org tutorial. Python is very easy to start with and to keep going with.

    Posted by: Babu on September 18, 2003 05:57 PM
  4. I love your site!!!

    Posted by: Annett Joel on September 1, 2004 03:56 PM
  5. link to working code is broken :-( sniff I'd like a peek at it

    Posted by: Nicolas on November 2, 2004 05:56 AM
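
For readers following the table-extraction question in the comments today: sgmllib was removed in Python 3, and html.parser.HTMLParser is the standard-library replacement that fills the same role. The sketch below follows the advice given in comment 3 (parse one page in a function, then call it in a loop). The names TableExtractor, extract_tables and extract_from_files are invented for this example, and nested tables and unclosed rows are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every table as a list of rows, each a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def _close_cell(self):
        if self._cell is not None:
            # Join the chunks and collapse internal whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Parse one HTML string and return its tables."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


def extract_from_files(paths):
    """Apply extract_tables to each file; returns {path: tables}."""
    results = {}
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as f:
            results[path] = extract_tables(f.read())
    return results
```

For a couple of hundred pages, extract_from_files can be given the list of saved file names. Each table comes back as a list of rows, which is easy to pass on to the csv module.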