What's the best way to harvest information from a website?
For what it’s worth, you’re right — it would be nice if the courts and all the other government agencies started publishing their information in easily consumable formats. It’s entirely possible that they already have this capability, so you might want to drop a quick email to the administrator of the site which you’re going to spider to find out what they’ve got under the hood. You never know unless you ask…
Definitely use Perl and WWW::Mechanize. It rocks for this kind of thing. Once you've got to the requisite page, there are also Perl modules which specialise in the analysis of HTML tables, like HTML::TableContentParser and HTML::TableExtract. By the way: you say you go to the court website, select (using a pull-down menu) the type of search you want to do ("docket search"), enter a date in a text box in a different frame of the page, and get a results page. That part can possibly be automated easily for you if your results page has a URL like domain.com/script.cgi?search=docket&date=1/1/05: you could have bookmarks set up, or a JavaScript bookmarklet which automatically filled in today's date, for instance.
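To make that concrete, here's a rough Perl sketch of the Mechanize-plus-table-parsing approach. The court URL, the form number, the field names ("searchtype" and "date") and the table headers are all made-up placeholders; you'd have to swap in whatever the real site actually uses (view the page source to find the form and field names). I've used HTML::TableExtract for the results table.

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use HTML::TableExtract;
use POSIX qw(strftime);

# Placeholder URL -- the real court site will differ.
my $court_url = 'http://www.example-court.gov/search.cgi';

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get($court_url);

# If the search form lives in a separate frame, you may need to follow
# that frame first, e.g.:
#   $mech->follow_link( tag => 'frame' );

# Pick "docket search" from the pull-down and fill in today's date.
# The form number and field names are guesses for illustration.
$mech->submit_form(
    form_number => 1,
    fields      => {
        searchtype => 'docket',
        date       => strftime( '%m/%d/%y', localtime ),
    },
);

# Pull the rows out of the results table.  The header names are guesses;
# match them to the column headings on the real results page.
my $te = HTML::TableExtract->new( headers => [ 'Docket', 'Party', 'Date' ] );
$te->parse( $mech->content );

for my $ts ( $te->tables ) {
    for my $row ( $ts->rows ) {
        print join( "\t", map { defined $_ ? $_ : '' } @$row ), "\n";
    }
}

And if the results page really does take its parameters in the URL (like the script.cgi?search=docket&date=... example above), you can skip the form-filling entirely and just $mech->get() the URL you build with today's date.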
I've done this in Perl (scraping eMusic.com) and in Firefox using JavaScript (scraping a "best of the web" community web site); each program had several hundred satisfied users. In Perl, as RichardP notes, WWW::Mechanize provided an easy start. Doing it in Firefox is easier (but requires, obviously, Firefox). Now, I don't usually do my own legal work: I hire a lawyer because he can do it better and faster than I can. (This is just a trivial application of Ricardo's Law.) I'd suggest it's ultimately cheaper for you to hire a coder to do this, rather than distracting yourself from making money as a lawyer.