Web scraping databases with Ruby
Introduction
This post is more of an overview with issues that I have encountered and my way of approaching accumulating data records from websites than a technical how-to. For technical information read the documentation and consult stackoverflow or other forums. I do provide a neat way to turn your json records into csv courtesy of an unattributed forum post (if I find it again, I will attribute).
Ingredients
Alternatively to point 2. above you will need the following bundle:Method
- Examine your target database - you can usually search by some identifier number. What range of numbers does the database employ? Does it count up from 1 or 1000? Are there series of numbers associated with a type of record as part of the set you are acquiring (e.g. individual or corporation)? Does the range have zeros prepended?
- Determine the range of numbers you wish to search.
- Divide this up into manageable sub-ranges if the numbers exceed say 10,000 queries. Or if there are 20 million queries, perhaps you do 20 sub-ranges of a million each.
- Test getting to a sample target record in irb with mechanize first. Can you successfully submit the form? Save the result and examine what appears in your browser.
- If you can get to the target record with mechanize then you have two options: save the html for each record and post-process; or process on the fly. Either way you will be using nokogiri and xpath selectors.
- If you can't figure it out with mechanize (usually because there is a whole pile of javascript going on), then you will have to use the webdriver headless option: start off without headless and test with irb. You may well have to process on the fly, so monitor your records as they come through and restart if something is amiss.
- Process your data with nokogiri and xpath selectors. Using Firebug will help you visualise where things are. Remember: nokogiri structures everything into an onion of associative arrays (hashes).
- Changing the xml tags with nokogiri can sometimes be the silver bullet to help you solve a problem of logic in extracting data!
- Conditional statements are your friend. There may be 5 potential rows of data but only four, three or two appear in various records in your range. Can you use conditional statements about label text or distinguishing tags to help you?
- Save your records as one record per file by the number you searched for in json format as a hash mimicking the website record. You can convert this to csv later for presentation in spreadsheets with the following:
require 'csv'
ary = Array.new
Dir.foreach("data") do |f|
next if f == '.' or f == '..'
# put the file into the hash
record_json = File.open("data#{f}", "r").read
record = JSON.parse(record_json)
ary << record
end
CSV.open("data/data.csv", "wb") do |csv|
csv << ary.first.keys
ary.each do |hash|
csv << hash.values
end
end