Simple example of web scraping with Nokogiri

Web scraping is a helpful way to extract information from a website that doesn't offer a proper API for it, or simply to have fun collecting random content from the web.
For my next post (which I will publish in a few days), I need some data about the Pokémon of the first Game Boy game. One of the websites with the most information about them is http://pokemondb.net, but it has no API to request this list (or at least I haven't found one), so I use web scraping to get all the info I need.
This is something very simple, but since this year is Pokémon's 20th anniversary, I thought someone might find it useful, or at least fun :)
You can see the complete Nokogiri documentation for more details, but here I basically do two things: I read the HTML from a URL opened with open-uri using Nokogiri::HTML, and I search for the elements I want with the css and at_css methods using CSS queries.
At the end, we'll have something like this:
#!/usr/bin/env ruby

require 'nokogiri'
require 'open-uri'

base_url = "http://pokemondb.net"
url_index = "#{base_url}/pokedex/game/firered-leafgreen"

# Download and parse the index page. URI.open comes from open-uri;
# a bare open no longer accepts URLs on Ruby 3.0 and later.
index = Nokogiri::HTML(URI.open(url_index))

index.css(".infocard-tall").each do |item|
  begin
    name = item.at_css(".ent-name").text
    puts "Fetching #{name} info..."
    url_detail = "#{base_url}#{item.at_css(".ent-name")[:href]}"
    number = kind = species = height = weight = abilities = nil

    # The detail page lists each value in a row of the .vitals-table
    pokemon_detail = Nokogiri::HTML(URI.open(url_detail))
    number = pokemon_detail.at_css(".vitals-table tr:contains('National')").at_css("td").text
    kind = pokemon_detail.at_css(".vitals-table tr:contains('Type')").at_css("td").text.split(" ").join(", ")
    species = pokemon_detail.at_css(".vitals-table tr:contains('Species')").at_css("td").text
    height = pokemon_detail.at_css(".vitals-table tr:contains('Height')").at_css("td").text
    weight = pokemon_detail.at_css(".vitals-table tr:contains('Weight')").at_css("td").text
    # Take the text, as for the other fields, so the raw HTML node is not printed
    abilities = pokemon_detail.at_css(".vitals-table tr:contains('Abilities')").at_css("td").text
  rescue
    puts "Something went wrong with #{name} :("
  ensure
    # Runs on success and on failure; values not fetched stay nil
    puts "Pokemon info"
    puts "Name: #{name}, number: #{number}, kind: #{kind}, species: #{species}, height: #{height}, weight: #{weight}, abilities: #{abilities}"
  end
end
You can copy and edit this code to get different data from other Pokémon editions, simply by changing the URL and playing with the CSS queries to reach the info you want.
Then, simply make the script executable with chmod +x NAME_OF_SCRIPT.rb, run it with ./NAME_OF_SCRIPT.rb, and enjoy watching the data flow through the terminal.
Is there anywhere I can grab the code for this tip?
Yes! The code for this tip is available as a GitHub gist.
About The Author
Iván González - Software Developer
Hi, my name is Iván (aka dreamingechoes). I'm a passionate software developer from the north of Spain, interested in all kinds of technologies.