Web Scraping With Ruby
A short Q&A about this instructable.
Q: What the #$%* is web scraping and why would someone need it?
A: Most web pages on the internet do not offer a web API, and sometimes you need one. The idea is to take the data from a web page and structure it in a way that can be used by your application (a script, an executable, a web page or even a database).
Q: Why?
A: Let's see. You are looking for an apartment in city X, within a certain area, and it needs to be over Y square meters. You can search with the tools the site provides (though sometimes your criteria are not searchable with them), but the results are not presented the way you need or like. Now think about a script that gets the data for city X in the form that is best for your post-processing, then automatically filters for the area you want and displays only the apartments over Y square meters as a list, sorted with the cheapest first. All this with just a double click, and it works on Windows, Mac or Linux.
Q: Is scraping legal?
A: In general it is not illegal: you don't get data that you are not supposed to get, you just get it in an automated manner, and if you do it right you don't spam the server with unnecessary requests.
Q: Will it always work, like a web API?
A: No. If the web page changes in a way that affects what you read, you will need to adapt your script to the new data layout. Nothing too big or hard; I can do it in under a minute.
Q: Can I get data that is not supposed to be accessed, like with SQL injection?
A: No, you can't. Scraping is not hacking; it is just a way to get only what you need from one or more websites.
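To make the apartment idea from the Q&A a bit more concrete, here is a minimal sketch. The HTML snippet, the class names and all the numbers are made up for the illustration; a real listing page will look different, and you would download it instead of typing it into a string.

```ruby
# Hypothetical listing markup, invented for this sketch (not from a real site).
html = <<~HTML
  <ul>
    <li class="flat"><span class="area">North</span> <span class="size">72</span> m2 <span class="price">950</span> EUR</li>
    <li class="flat"><span class="area">North</span> <span class="size">48</span> m2 <span class="price">600</span> EUR</li>
    <li class="flat"><span class="area">South</span> <span class="size">90</span> m2 <span class="price">800</span> EUR</li>
    <li class="flat"><span class="area">North</span> <span class="size">65</span> m2 <span class="price">700</span> EUR</li>
  </ul>
HTML

# One capture group per field; scan returns an array of [area, size, price] arrays.
pattern = /<span class="area">(.*?)<\/span>.*?<span class="size">(\d+)<\/span>.*?<span class="price">(\d+)<\/span>/
flats = html.scan(pattern).map do |area, size, price|
  { area: area, size: size.to_i, price: price.to_i }
end

# Keep only the wanted area and the flats over the minimum size, cheapest first.
wanted_area = 'North'
min_size = 60
matches = flats.select { |f| f[:area] == wanted_area && f[:size] > min_size }
               .sort_by { |f| f[:price] }

matches.each { |f| puts "#{f[:area]}: #{f[:size]} m2 for #{f[:price]} EUR" }
```

The page is only read once; all the filtering and sorting happens on your side, so changing the area or the minimum size does not cost the server another request.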
Detailed Info and Example
Now, people out there will tell you that you need gem X or Y (like Nokogiri or Mechanize), but in most cases YOU DON'T NEED THEM.
All you need is a normal Ruby install and a text editor (Notepad++, or whatever you like).
I use RubyMine. It is not free, but I like it; it looks and feels like Visual Studio.
Now for the example. I play a game called Warframe (www.warframe.com), and the game has a system that offers one-time missions with nice rewards, but the missions are time limited and appear randomly. The official site has a Twitter account that announces the alert missions, and there are some fan-made sites too, even an Android application. On Windows you need to be logged in to the game or keep a browser window open with Twitter or one of the fan-made sites, but there is no application. Until now :D
I am going to use one of the fan-made sites to get the data needed (http://deathsnacks.com/wf/index.html).
Now for the code (http://pastebin.com/153FFXJf), commented and syntax highlighted.
---------
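# http://deathsnacks.com/wf/index.html
require "open-uri"

# start a new thread so the main thread can wait for Enter
t = Thread.new do
  loop do
    # download the whole page as one string (Ruby 3.0+ needs URI.open, plain open no longer takes URLs)
    page = URI.open('http://deathsnacks.com/wf/index.html').read
    # collect every <li>...</li> element; the pattern was lost in this listing,
    # so this one is reconstructed from the description below
    table_data = page.scan(/<li\b.*?<\/li>/m)
    table_data_refined = table_data.map do |data|
      # strip the html tags, then add a space after the price
      data.gsub(/<.+?>/, '').gsub('0cr', '0cr ')
    end
    puts ' '
    puts ' Warframe Alerts by Neumann Gregor'
    table_data_refined.each do |entry|
      # only the records that start with a digit are alerts
      next unless entry.match?(/\A[[:digit:]]/)
      # insert spaces between lowercase and uppercase letters in the string
      puts ' ' + entry.gsub(/(?<=[a-z])(?=[A-Z])/, ' ')
    end
    sleep 10
    # clear the console before the next refresh
    Gem.win_platform? ? system("cls") : system("clear")
  end
end
# wait for Enter, then stop the thread and quit
gets
t.kill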
---------
As you can see, we just read all the data (the HTML page), then look for the <li> </li> tags and collect them in an array. We then strip the HTML tags, add some spaces for better readability, and print only the records that start with a number. This repeats every 10 seconds until you hit Enter, which quits the script.
I have added the source code as a .rb file and an OCRA-generated exe for people who don't have Ruby installed and don't want to install it.
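If you want to see what each step does without hitting the site, the same steps can be tried on a string. This is only a sketch: the sample markup and the alert text are made up, and the real page may be laid out differently.

```ruby
# Made-up sample of the page; the real markup on the site may differ.
sample = '<ul><li><b>12m</b> Venus<i>Exterminate</i> 5000cr<span>OrokinCatalyst</span></li>' \
         '<li><a href="#">About this site</a></li></ul>'

# Step 1: grab every <li>...</li> element.
items = sample.scan(/<li\b.*?<\/li>/m)

# Step 2: strip the HTML tags and put a space after the price.
cleaned = items.map { |item| item.gsub(/<.+?>/, '').gsub('0cr', '0cr ') }

# Step 3: keep only the records that start with a digit.
alerts = cleaned.select { |text| text.match?(/\A[[:digit:]]/) }

# Step 4: put a space between a lowercase and an uppercase letter.
readable = alerts.map { |text| text.gsub(/(?<=[a-z])(?=[A-Z])/, ' ') }

puts readable
```

If the site ever changes its layout, step 1 is the place to look first: when items comes back empty, the pattern no longer matches the page.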