Tobiasz Cudnik

Web Scraping with cli version of phpquery piped with sed

In Web Scraping, phpQuery on 05.10.2008 at 3:05

I’ve done what i was thinking about for some time. Terminal-firendly phpQuery CLI interface. Took about 10 minutes of coding… Works like this:


phpquery http://code.google.com/p/phpquery/downloads/list --find '.vt.col_4 a:first' --contents

This will return number of downloads latest phpQuery release file. Notice there is no need to quote url in any way. I was very happy with this so i’ve added callback support in text() and htmlOuter() methods, like so:


phpquery http://code.google.com/p/phpquery/downloads/list --find '.vt.col_4 a:first' --text strip_tags trim

When i had all stuff working, i’ve used it straight away to scrap forums and categories lists from old IPB v1.x. I’ve piped phpQuery result with sed, filtering final output.

// Fetch categories
./phpquery http://forum.wiadomosc.info/ --find '.maintitle a' | sed -r 's/^.*?c=([0-9]+).+?>(.+?)]*>([^<]*)<.*$/1: // 2/g'

All comments are screened for appropriateness. Commenting is a privilege, not a right. Good comments will be cherished, bad comments will be deleted.