Web Scraping With Ruby

by neumanngregor in Circuits > Software

A short Q&A about this instructable.

Q: What the #$%* is web scraping and why would someone need it?

A: Most web pages on the internet do not offer a web API, and sometimes you need one. The idea is to take data from the web page and structure it so that your application (a script, an executable, a web page or even a database) can use it.

Q: Why?

A: Let's see. Say you are looking for an apartment in city X, within a certain area, and it needs to be over Y square meters. You can search with the tools the site provides (though sometimes your criteria are not searchable with them), but the results are not presented the way you need or like. Now think about a script that fetches the data for city X in the form that is best for your post-processing, filters it automatically for that area, and displays only the apartments over Y square meters as a list, sorted with the cheapest first. All this with just a double click, and it works on Windows, Mac or Linux.
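The filtering and sorting step from that answer can be sketched in a few lines of Ruby. The listings are made-up sample data standing in for whatever a scraper would return, and the field names (`area`, `size_m2`, `price`) and the `cheapest_matches` helper are hypothetical.

```ruby
# Hypothetical listings, standing in for data scraped from a real estate page.
Listing = Struct.new(:title, :area, :size_m2, :price)

LISTINGS = [
  Listing.new("Two rooms near the park", "Old Town", 54, 420),
  Listing.new("Studio with balcony",     "Old Town", 31, 300),
  Listing.new("Family flat",             "Old Town", 78, 610),
  Listing.new("Loft",                    "Harbour",  90, 550)
].freeze

# Keep only listings in the wanted area that are larger than min_size,
# sorted with the cheapest first.
def cheapest_matches(listings, area:, min_size:)
  matches = listings.select { |l| l.area == area && l.size_m2 > min_size }
  matches.sort_by(&:price)
end

cheapest_matches(LISTINGS, area: "Old Town", min_size: 50).each do |l|
  puts "#{l.price}  #{l.size_m2} m2  #{l.title}"
end
```

Once the data is in plain Ruby objects, criteria the site's own search form does not offer are just one more condition in the `select` block.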

Q: Is scraping legal?

A: It is not illegal: you don't get data that you are not supposed to get, you just get it in an automated manner, and if you do it right you don't spam the server with unnecessary requests.
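One way to avoid unnecessary requests is to cache the last page and refuse to fetch again until a minimum interval has passed. This is only a sketch: the `PoliteFetcher` class is a hypothetical helper, and the fetcher is passed in as a block so a fake one can be used here instead of a real HTTP call.

```ruby
# A tiny cache that does not hit the server more often than every
# min_interval seconds. The fetcher is passed in as a block, so any
# HTTP call (for example URI.open(url).read) can be plugged in.
class PoliteFetcher
  def initialize(min_interval:, &fetcher)
    @min_interval = min_interval
    @fetcher = fetcher
    @last_fetch = nil
    @cached = nil
  end

  # Returns the cached page unless min_interval seconds have passed.
  def get(now = Time.now)
    if @last_fetch.nil? || now - @last_fetch >= @min_interval
      @cached = @fetcher.call
      @last_fetch = now
    end
    @cached
  end
end

# Demo with a fake fetcher that counts how often it is called.
calls = 0
fetcher = PoliteFetcher.new(min_interval: 10) { calls += 1; "<html>page #{calls}</html>" }

start = Time.now
fetcher.get(start)       # first call: real request
fetcher.get(start + 3)   # too soon: served from the cache
fetcher.get(start + 12)  # interval elapsed: real request again
puts "requests sent: #{calls}"
```

In this demo only two of the three calls reach the fake server; the middle one is answered from the cache.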

Q: Will it always work, like a web API?

A: No. If the web page changes in a way that affects what you read, you will need to adapt your script to the new data layout. Nothing too big or hard; I can do it in under a minute.
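A small habit that keeps such a fix quick is to hold the layout-dependent pattern in one place and fail loudly when it stops matching. The sketch below assumes the data sits in `<li>` elements; `ITEM_PATTERN`, `LayoutChanged` and `extract_items` are hypothetical names, and both sample pages are invented.

```ruby
# The one place that knows the page layout; update it when the site changes.
ITEM_PATTERN = /<li.*?<\/li>/m

class LayoutChanged < StandardError; end

# Returns all matching items, or raises if the page no longer fits the pattern.
def extract_items(html)
  items = html.scan(ITEM_PATTERN)
  raise LayoutChanged, "no items matched #{ITEM_PATTERN.inspect}" if items.empty?
  items
end

old_page = "<ul><li>45m Venus 5000cr</li></ul>"
new_page = "<div class='alert'>45m Venus 5000cr</div>"

puts extract_items(old_page).length
begin
  extract_items(new_page)
rescue LayoutChanged => e
  puts "time to update the script: #{e.message}"
end
```

Without the check, a changed layout would simply produce an empty list, which is easy to mistake for "no data right now".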

Q: Can I get data that is not supposed to be accessed, like with SQL injection?

A: No, you can't. Scraping is not hacking; it is just a way to get only what you need from one or more websites.

Detailed Info and Example

Now, people out there will tell you that you need gem X or Y (like Nokogiri or Mechanize), but for most cases YOU DON'T NEED THEM.

All you need is a normal Ruby install and a text editor (Notepad++, or whatever you like).

I use RubyMine. It is not free, but I like it; it looks and feels like Visual Studio.

Now for the example. I play a game called Warframe (www.warframe.com), and the game has a system that offers one-time missions with nice rewards, but the missions are time limited and appear randomly. The official site has a Twitter account that announces the alert missions, and there are some fan-made sites too, even an Android application. On Windows you need to be logged in to the game or keep a browser window open on Twitter or one of the fan-made sites, because there is no application. Until now :D

I'm going to use one of the fan-made sites to get the data I need (http://deathsnacks.com/wf/index.html).

Now for the code (http://pastebin.com/153FFXJf), commented and syntax highlighted.

---------

# Source page: http://deathsnacks.com/wf/index.html
require "open-uri"

# Run the polling loop in a background thread so the main thread can wait for Enter
t = Thread.new do
  loop do
    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
    html = URI.open('http://deathsnacks.com/wf/index.html').read

    # collect every <li>...</li> element (pattern inferred from the explanation below)
    table_data = html.scan(/<li.*?<\/li>/m)

    table_data_refined = table_data.map do |data|
      # strip the HTML tags, then add a space after the price
      data.gsub(/<.+?>/, '').gsub('0cr', '0cr ')
    end

    puts ' '
    puts ' Warframe Alerts by Neumann Gregor'

    table_data_refined.each do |row|
      # only the alert records start with a digit
      next unless row[0] =~ /[[:digit:]]/

      # insert spaces between lowercase and uppercase letters in string
      puts ' ' + row.gsub(/(?<=[a-z])(?=[A-Z])/, ' ')
    end

    sleep 10
    Gem.win_platform? ? system("cls") : system("clear")
  end
end

# Wait for Enter, then stop the background thread
gets
t.kill

---------

As you can see, we just read all the data (the HTML page), then look for the <li> </li> elements and collect them in an array. We then refine that array: we strip the HTML tags, keep only the records that start with a number, and add some spaces for better readability. We repeat that every 10 seconds until you hit Enter, at which point the script quits.
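The same steps can be tried offline on a small piece of sample markup. The HTML below is invented for illustration (the real page may look different), and `alert_lines` is a hypothetical helper that bundles the steps from the script.

```ruby
# Sample markup, invented for illustration; the real page may look different.
SAMPLE_HTML = <<~HTML
  <ul>
    <li><b>News</b>Update 11 is live</li>
    <li><b>45m</b> Venus <span>5000cr</span>OrokinCatalyst</li>
    <li><b>12m</b> Mars <span>7500cr</span>ForestCamo</li>
  </ul>
HTML

def alert_lines(html)
  # 1. collect the <li> elements
  items = html.scan(/<li.*?<\/li>/m)
  # 2. strip the HTML tags
  texts = items.map { |li| li.gsub(/<.+?>/, '') }
  # 3. keep only the records that start with a digit
  alerts = texts.select { |text| text.match?(/\A[[:digit:]]/) }
  # 4. add a space after the price and between glued words
  alerts.map do |text|
    text.gsub('0cr', '0cr ').gsub(/(?<=[a-z])(?=[A-Z])/, ' ')
  end
end

puts alert_lines(SAMPLE_HTML)
```

For this sample the news item is dropped and the two alerts come out as "45m Venus 5000cr Orokin Catalyst" and "12m Mars 7500cr Forest Camo".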

I have added the source code as a .rb file and an OCRA-generated exe for people who don't have Ruby installed and don't want to install it.