
Simple example of web scraping with Nokogiri

Web scraping is a handy way to extract information from websites that don't offer a proper API for it, or simply to have fun pulling random content from the web.

For my next post (which I will publish in a few days), I need some data about the Pokémon from the first Game Boy game. One of the websites with the most information about them is http://pokemondb.net, but it has no API to request this list (or at least I haven't found one), so I use web scraping to get all the info I need.

This is something very simple, but since this year is Pokémon's 20th anniversary, I thought someone might find it useful, or at least fun :)

You can check the complete Nokogiri documentation for more details, but basically I do two main things here: I read the HTML from a URL opened with open-uri using Nokogiri::HTML, and I search for the elements I want with the css and at_css methods using CSS queries.

At the end, we'll have something like this:

#!/usr/bin/env ruby

require 'nokogiri'
require 'open-uri'

base_url = "http://pokemondb.net"
url_index = "#{base_url}/pokedex/game/firered-leafgreen"
# URI.open comes from open-uri; plain open no longer accepts URLs on Ruby 3.0+
index = Nokogiri::HTML(URI.open(url_index))

# Each card on the index page links to one Pokémon's detail page
index.css(".infocard-tall").each do |item|
  begin
    # Start with nil everywhere so the ensure block can always print
    name = number = kind = species = height = weight = abilities = nil
    link = item.at_css(".ent-name")
    name = link.text

    puts "Fetching #{name} info..."
    url_detail = "#{base_url}#{link[:href]}"
    pokemon_detail = Nokogiri::HTML(URI.open(url_detail))

    # Pick each row of the vitals table by its label and read the first cell
    number = pokemon_detail.at_css(".vitals-table tr:contains('National')").at_css("td").text
    kind = pokemon_detail.at_css(".vitals-table tr:contains('Type')").at_css("td").text.split(" ").join(", ")
    species = pokemon_detail.at_css(".vitals-table tr:contains('Species')").at_css("td").text
    height = pokemon_detail.at_css(".vitals-table tr:contains('Height')").at_css("td").text
    weight = pokemon_detail.at_css(".vitals-table tr:contains('Weight')").at_css("td").text
    abilities = pokemon_detail.at_css(".vitals-table tr:contains('Abilities')").at_css("td").text
  rescue StandardError => e
    puts "Something went wrong with #{name} :( (#{e.message})"
  ensure
    # Runs even after an error, so some fields may be empty
    puts "Pokemon info"
    puts "Name: #{name}, number: #{number}, kind: #{kind}, species: #{species}, height: #{height}, weight: #{weight}, abilities: #{abilities}"
  end
end

You can copy and edit this code to get different data from other Pokémon editions, simply by changing the URL of the index page (url_index) and playing with the CSS queries to reach the info you want.

Then simply make the script executable with chmod +x NAME_OF_SCRIPT.rb, run ./NAME_OF_SCRIPT.rb and enjoy watching the data flow through the terminal.

Is there somewhere I can grab the code for this tip?

Yeah! Here you have the GitHub gist for this tip :)
