Simple, Powerful Web Scraper in 5 Minutes

by IoTalabs in Circuits > Software



Screen Shot 2016-06-10 at 11.47.31 AM.png

We are IoTalabs and we are a group of Internet of Things enthusiasts that love hacking together different devices. Over the past few months we have learned how to scrape the internet in a really easy way. We wanted to share our hacking tips with you all. Be sure to check out our current projects at

Our Website - doteverything.co

Requirements:

1) Intermediate coding experience (JavaScript + HTML)

2) Google Chrome

3) YQL – we will teach you how to use this

The world of online APIs is a tough one. While many websites and services now have their own APIs, they are usually heavily limited in terms of usage and the information available. An alternative to using a service’s API is to simply scrape its website for the data you want. Scraping involves parsing the HTML of a web page and finding information based on the standardized structure of the site. We have figured out the quickest way to scrape a website, so get ready!

Theory Behind Scraping

pic1.png
pic2.png

Say we had a simple website that looked like the following:

We can see that the vital information we want lies in a span with the class “hiInstructables”. (Image 1) It turns out that websites are very consistent when labeling a piece of information, so we can assume that if there were multiple vital pieces of information we needed, they would all be labeled with the same class, like this: (Image 2)

So this tackles the essence of scraping. Websites use a specific format for labeling their content. If we can figure out what that format is, then we can make a program that automatically looks for those labels in that format to get the data we need.
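The idea can be sketched in a few lines of JavaScript. This is a toy illustration, not code from the article: the HTML string and the `extractByClass` helper are made up for the example, and a real scraper should use a proper HTML parser rather than a regex.

```javascript
// Toy HTML resembling the example page; the class name comes from
// the article, the snippet itself is hypothetical.
const html =
  '<div>' +
  '<span class="hiInstructables">First vital fact</span>' +
  '<span class="hiInstructables">Second vital fact</span>' +
  '<span class="other">noise</span>' +
  '</div>';

// Naive extraction: find every span carrying the target class and
// capture its text. This is only to illustrate the "find the label"
// idea behind scraping.
function extractByClass(page, className) {
  const re = new RegExp(
    '<span class="' + className + '">([^<]*)</span>', 'g');
  const out = [];
  let m;
  while ((m = re.exec(page)) !== null) out.push(m[1]);
  return out;
}

console.log(extractByClass(html, 'hiInstructables'));
// → [ 'First vital fact', 'Second vital fact' ]
```

Because every “vital” piece of data carries the same class, one pass over the page collects all of them at once.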

Your First Scrape: Grabbing the Usernames Out of a Reddit Thread

website.png
web2.png

https://www.reddit.com/r/arduino/comments/3rixq5/i...

The first step in building a scraper is always identifying what label our key information sits under. In this case, we want all the usernames in the comments of a Reddit thread, so we are going to use Google Chrome’s Inspect Element tool to find out how a username is labeled. (Image 1)

This should bring up the developer tools panel with the username highlighted: (Image 2)

We see that all usernames in a Reddit thread are links with the class “author”. Now here’s the tricky part: we need some way to sort through all the different page elements to reach the `<a>` tags with the class “author”. As you can see, it’s not an easy journey, because these links lie inside:

<div class="commentarea">

which then drops down into

<div id="siteTable_t3_3rixq5" class="sitetable nestedlisting">

which drops into even more HTML elements. To minimize the amount of JavaScript we have to write, we are going to outsource the actual parsing of the page to Yahoo’s YQL. It will traverse all the different HTML elements and return those precious `<a>` tags we desire. Don’t worry if you’re confused right now; the next step will make things clearer.

YQL (Yahoo Query Language)

yql1.png
yql2.png
yql3.png

So we’ve identified where in the web page our usernames are. We now just need to obtain that information in a traversable format. Normally, scrapers load the entire web page as a dense, tree-like structure of XML nodes, which is a headache. Loading a web page as JSON is much easier because it lets us access elements directly using the `.` operator. To get the page in JSON format, we are going to use Yahoo Query Language (YQL). Basically, YQL is an open tool built by Yahoo for querying web pages into JSON, and the language itself is very similar to SQL. This is the link to the console:

https://developer.yahoo.com/yql/console/

Here's how it looks: (image 1)

Our query is pretty straightforward:

select * from html where url = "https://www.reddit.com/r/arduino/comments/3rixq5/i_programmed_a_robot_arm_to_feed_me_breakfast/" and xpath='//a[contains(@class,"author")]'

`select *` just means select everything from the web page at the given url (our Reddit thread).

The xpath expression basically says: search through the page and return every place where there is an `<a>` tag whose class contains “author”.

As you can see, the query is successful and returns all the usernames we wanted: (Image 2)

To get this result in JSON format, just click the JSON tab: (Image 3)
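For reference, YQL wraps the matches in a `query.results` envelope: each matched `<a>` tag becomes an object whose attributes are keys and whose link text lands in `content`. A trimmed sketch of the shape (the usernames, count, and timestamp here are placeholders, not real results):

```json
{
  "query": {
    "count": 2,
    "created": "2016-06-10T00:00:00Z",
    "lang": "en-US",
    "results": {
      "a": [
        { "class": "author", "href": "https://www.reddit.com/user/some_user", "content": "some_user" },
        { "class": "author", "href": "https://www.reddit.com/user/another_user", "content": "another_user" }
      ]
    }
  }
}
```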

Manipulating With Javascript

end.png

Now, to get this from the console into a local variable, all we do is use the REST query (https://en.wikipedia.org/wiki/Representational_sta...) found at the bottom of the console page. Our code with the proper async call is below: (image 1)
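Since that code only appears in a screenshot, here is a hedged reconstruction of the same idea: build the YQL REST URL (using the public endpoint as it was documented at the time of writing), fire an async request, and pull the usernames out of the JSON response. The `usernamesFrom` helper and the variable names are our own, not necessarily what the screenshot shows.

```javascript
// Build the YQL REST URL for our query.
var query =
  'select * from html where url="https://www.reddit.com/r/arduino/' +
  'comments/3rixq5/i_programmed_a_robot_arm_to_feed_me_breakfast/" ' +
  'and xpath=\'//a[contains(@class,"author")]\'';
var url = 'https://query.yahooapis.com/v1/public/yql?q=' +
  encodeURIComponent(query) + '&format=json';

// Pull the usernames out of the parsed JSON response. YQL returns each
// matched <a> as an object whose link text sits in "content"; a single
// match comes back as a bare object rather than an array.
function usernamesFrom(response) {
  var links = (response.query.results && response.query.results.a) || [];
  if (!Array.isArray(links)) links = [links];
  return links.map(function (a) { return a.content; });
}

// Async call with XMLHttpRequest (browser only; jQuery's $.getJSON
// would work the same way).
if (typeof XMLHttpRequest !== 'undefined') {
  var xhr = new XMLHttpRequest();
  xhr.open('GET', url);
  xhr.onload = function () {
    var usernames = usernamesFrom(JSON.parse(xhr.responseText));
    console.log(usernames);
  };
  xhr.send();
}
```

The `encodeURIComponent` call matters: the query contains spaces, quotes, and slashes that must be escaped before they can travel in a URL.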

Using this code you can get all the usernames into an array and keep them forever! Thanks for reading. Let me know in the comments if you have any questions, and check out more JavaScript tutorials on our website at doteverything.co/blog.html.