STEP BY STEP PYTHON: Writing an HTML Parser to Find Webpage Links

95 Views, 2 Favorites, 0 Comments

STEP BY STEP PYTHON: Writing an HTML Parser to Find Webpage Links

Hello there! If you are still starting out in Python and want to try a simple but useful program, you’ve come to the right place! In a few steps, you’ll have a way to easily pull the links from a webpage instead of painstakingly searching a page’s HTML. You will also level up your knowledge in class inheritance, HTML, and specifically HTML parsing! We're gonna go step by step and explain every detail, so you know exactly what's going on. Let's begin!....

Today we will be inheriting from the pre existing HTMLParser class, to create a modified version called LinkCollector. This class will go through the HTML code of any website url you give it, and return back to you sets of the links it finds. It will also differentiate between absolute links and relative links, giving you the option of seeing sets containing just one of those types, or both. Absolute links start with ‘http’, and lead to places outside the current website, while relative links don’t have that base, and only lead to other pages on the same website.

Supplies

To start, we need to understand a little bit about HTML. HTML consists of a multitude of different ‘tags’ which are lines that start and end with the ‘<>’ characters. See image for example:

This is an example of a link tag. Anytime you see the ‘a’ at the beginning, that means it's a link tag even though it doesn’t refer to the word link -_-. Notice the attributes of href and target, and their respective values. We are particularly interested in the ‘href’ attribute as it contains the url that the link is pointing to. Every link in a website is made by using one of these tags with the intended url.

Importing Modules

In the Python IDE of your choice (I am using IDLE in this example), create a new file and type in these imports at the top of your program. Refer to the image.

Explanation:

This will import the 3 modules in order to use certain methods that we will need.

Create the Class

Following the image, create a class that inherits from HTMLParser.

Explanation:

The class I am writing will be called LinkCollector because its goal is to go through the HTML of a web page and copy all of the links it contains into sets. It can be named according to your preferences, though it is useful to name classes, functions, and variables according to their purpose.
We want to inherit from HTMLParser so that it has the basic functionality of that class. However, we want to add onto it for our purposes, which is why we need to create LinkCollector.

Write Function #1

Write the __init__ function with the usual ‘self’ parameter, and a ‘url’ parameter. Follow the image and type in the contents of the function.

Explanation:

Anytime we inherit from a class and define a function, the first parameter is always 'self.'
We use the url parameter so that we can pass in the url of the website we want to find the links of.
Line 7 calls the __init__ of the parent class, HTMLParser, in order to make sure our class is inheriting everything properly.
Line 8 assigns the url parameter to the self.url attribute. We do this to be able to refer to that url in other functions of this class.
Line 9 and 10 assign empty sets to self.absolutes and self.relatives which is where we will store the absolute links and relative links respectively.

Write Function #2

Write a function called handle_starttag with the parameters of self, tag, and attrs. Follow the image and type the contents of the function which are nested if statements and a for loop.

Explanation:

This function will go through all the HTML tags, find the link tags, and organize the links inside them into the appropriate sets.
Into the parameters tag and attrs, we will pass each tag that the HTML parser finds, and the attributes inside the tags respectively.
Line 13 checks if the HTML tag is an ‘a’ tag (remember ‘a’ tags are for links!).
In the tag is an ‘a’ tag, line 14 iterates through the attributes of the tag and their values.
Line 15 checks if the attribute is ‘href’, which contains the url of this link.
Line 16 creates the variable ‘link’, uses the urljoin method to combine the website’s url (self.url) with the url found in the current tag we are looking at, and assigns it to ‘link’
Line 17 checks if the first four characters of this tag’s url are ‘https’ which would mean it is an absolute link.
If line 17 is true, line 18 adds that joined link from line 16 the self.absolutes set.
Lines 19 and 20 run if line 17 is false, meaning the link is relative.
Line 20 adds the relative link to the self.relatives set.

Write Function #3

Following the image, create this small function called getAbsolutes.

Explanation:

This is one of three very short functions we will write.
The only line it contains is a return statement, returning the set of absolute links we have accumulated with the previous function.

Write Function #4

Following the image, create this function called getRelatives.

Explanation:

The only line it contains is a return statement, returning the set of relative links we have accumulated with handle_starttag.

Write Function #5

Following the image, create the last function called getLinks.

Explanation:

The only line it contains is a return statement again, but this one uses the .union() method to join both sets of relatives and absolutes, giving back a set of all the links found.

Test the Program!

Run the program, taking note of any errors that come up.

Explanation:

If there are errors, check back to see that everything is typed correctly, every detail matters in coding!

Create an Instance of Your Class

Create a variable of any name (mine is named ‘lc’, short for link collector), create an instance of the LinkCollector class, and assign it to the variable. Pass the url of the website you want to observe as a parameter. (It can be any website url, I’m using the Depaul CS site)

Explanation:

Make sure you are now typing in the shell after running the code with no errors. This is where we are gonna see the class in action!
Make sure the website url is in quotes or double quotes.

Perform Methods on Your Variable

Following the image, use the .feed() on your variable, and pass the website url into it. On the url, use the urlopen function, the .read() method. And the .decode() method.

Explanation:

The. feed() function from the html.parser module provides the HTML code into the parser.
The urlopen() function opens the url you give it, accessing the data within.
The .read() method allows python to read the code in a file.
The .decode() method is needed to change the encoded format of the string to a regular string.

Try Your Methods!

You can now try out the methods, getAbsolutes(), getRelatives(), and getLinks() by applying these to the variable containing your LinkCollector object.

Explanation:

The above image shows an example of using the getAbsolutes() method for the Depaul CS site, which displays the set of all the absolute links found on that site!

You're Done!!

Now you’re done!! Congratulations, you’ve made your own modified version of the HTMLparser class which specifically checks for links and gives you back sets of the absolute and relative links. Now, you should be able to pass any website's url into this class, and easily see a set of the links it contains. I send you forth into the world to find all the links you desire!