How to Find the Most Popular Title Words of Instructables: Title Analysis
by FuzzyPotato in Circuits > Software
This project analyzes the titles of Instructables to find the words that appear in them most often. By scraping the titles, cleaning them up, and counting how often each word occurs, we can see which topics and trends are popular within the Instructables community. Join us as we delve into the world of Instructable titles and discover the words that make them stand out.
Supplies
Here's a list of supplies you will need:
- Windows computer with Python installed
- Chrome browser
- IDE (I'm using PyCharm)
- Web Driver
- NumPy
- matplotlib
- Selenium
Install the Required Libraries
Before we begin, make sure you have the following libraries installed:
- NumPy
- matplotlib
- Selenium
- Open the PyCharm development environment (IDE).
- Navigate to File>Settings>Project>Python Interpreter
- Search for the library.
- Install the library.
- Wait for the installation process to complete. You should see a message indicating a successful installation.
- Repeat the process for the other libraries.
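If you'd like to confirm the installs worked before moving on, a quick check like the one below may help. It's only a sketch: the helper name `check_libraries` is my own invention, and it uses the standard-library `importlib.util.find_spec` to see whether each package can be found without fully importing it.

```python
from importlib.util import find_spec


def check_libraries(names):
    """Return a dict mapping each module name to True if Python can find it."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # The three libraries this project relies on
    for name, found in check_libraries(["numpy", "matplotlib", "selenium"]).items():
        print(f"{name}: {'installed' if found else 'MISSING'}")
```

Run it with the same interpreter you selected in PyCharm; anything reported as MISSING needs to be installed again.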
Get the Web Driver
To download the Chrome WebDriver (Chromedriver), you can follow these steps:
- Determine the version of Google Chrome installed on your machine:
- Open Chrome browser.
- Click on the three-dot menu at the top right corner.
- Go to "Help" > "About Google Chrome".
- Note down the version number.
- Visit the official Chromedriver download page:
- Go to the following URL: https://sites.google.com/chromium.org/driver/downloads?authuser=0
- Download the appropriate Chromedriver version:
- Locate the version of Chromedriver that matches your Chrome browser version.
- Click on the download link for your operating system (e.g., Windows, macOS, Linux).
- Extract the Chromedriver executable:
- After downloading, extract the contents of the downloaded ZIP file to a directory of your choice.
- I extracted the driver directly into the C drive. This was my path: C:\chromedriver_win32.
- If you extract the driver to a different location, make sure to replace the driver path in the code.
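One detail that is easy to trip over: Selenium's `Service` takes the path to the driver executable itself, not just the folder it sits in. The sketch below builds that path from the folder used above. It assumes the executable is called `chromedriver.exe` (the usual name in the Windows download), and the variable names are just placeholders of mine.

```python
from pathlib import PureWindowsPath

# Assumption: the folder from the step above. The r'' prefix (raw string)
# stops Python from treating the backslashes as escape characters.
DRIVER_FOLDER = PureWindowsPath(r"C:\chromedriver_win32")

# Assumption: the executable keeps its usual name, chromedriver.exe
DRIVER_EXE = DRIVER_FOLDER / "chromedriver.exe"

print(DRIVER_EXE)  # C:\chromedriver_win32\chromedriver.exe
```

If your folder is somewhere else, only `DRIVER_FOLDER` should need to change.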
Code
Now that we have the libraries installed, it's time for the Python code.
- Create a new Python script.
- Copy and paste the following code.
- Copy and paste the URLs of interest into the code.
import re
import string
import time
from collections import Counter
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import matplotlib.pyplot as plt
import numpy as np
# Function to clean and preprocess a project title
def preprocess_title(title):
    # Remove punctuation and special characters
    title = re.sub(r'[^\w\s]', '', title)
    # Convert to lowercase
    title = title.lower()
    # Remove irrelevant words
    irrelevant_words = ['a', 'an', 'the', 'in', 'on', 'of', 'for', 'to']
    words = title.split()
    words = [word for word in words if word not in irrelevant_words]
    # Join the words back into a single string
    cleaned_title = ' '.join(words)
    return cleaned_title
# Define the URLs for different categories
urls = ['https://www.instructables.com/circuits/?sort=Views',
'https://www.instructables.com/workshop/projects?sort=Views']
# Define the data dictionary
data = {}
for url in urls:
    # Set up Chrome WebDriver. The Service needs the path to the executable itself;
    # the raw string (r'...') keeps the backslashes from being read as escapes.
    webdriver_service = Service(r'C:\chromedriver_win32\chromedriver.exe')
    chrome_options = Options()
    chrome_options.add_argument('--headless')  # Run Chrome in headless mode
    driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
    # Extract the category from the URL (e.g. 'circuits' or 'workshop')
    category = url.split('/')[-2]
    print("---", category, "---")
    driver.get(url)
    # Find the "Load All" button and click it a fixed number of times to load more titles
    load_all_button = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, 'button.loadAll__Fq1eC')))
    for i in range(0, 1, 1):  # Adjust the number of iterations as needed
        load_all_button.click()
        time.sleep(2)  # Adjust the delay between clicks as needed
    # Find all <a> tags with class="title__t0fGQ"
    title_elements = driver.find_elements(By.CSS_SELECTOR, 'a.title__t0fGQ')
    # Extract the project titles from the <a> tags
    project_titles = [title.text.strip() for title in title_elements]
    # Clean and preprocess the project titles
    cleaned_titles = [preprocess_title(title) for title in project_titles]
    # Perform word frequency analysis
    word_counts = Counter()
    for title in cleaned_titles:
        words = title.split()
        word_counts.update(words)
    # Store word frequencies in the data dictionary
    data[category] = dict(word_counts)
    # Print the most common words and their frequencies
    most_common_words = word_counts.most_common(10)
    for word, frequency in most_common_words:
        print(f"{word}: {frequency}")
    # Quit the WebDriver
    driver.quit()
# Get the common words across all categories
common_words = set()
for word_counts in data.values():
common_words.update(word_counts.keys())
# Select the top N common words
top_n = 10 # Change this value as per your requirement
top_common_words = sorted(common_words, key=lambda x: sum(data[c].get(x, 0) for c in data), reverse=True)[:top_n]
# Prepare data for plotting
categories = list(data.keys())
word_data = [[data[c].get(word, 0) for word in top_common_words] for c in categories]
# Plotting the bar chart
fig, ax = plt.subplots(figsize=(10, 6))
# Calculate the positions of the bars on the x-axis
num_categories = len(categories)
bar_width = 0.8 / num_categories
positions = np.arange(len(top_common_words))  # One group per plotted word (may be fewer than top_n)
# Plot bars for each category
for i in range(num_categories):
    word_counts = word_data[i]
    bar_positions = positions + (i * bar_width)
    ax.bar(bar_positions, word_counts, label=categories[i], width=bar_width, alpha=0.7)
# Set x-axis ticks and labels
ax.set_xticks(positions + (bar_width * (num_categories - 1) / 2))
ax.set_xticklabels(top_common_words, rotation=45)
# Set labels and title
ax.set_xlabel('Words')
ax.set_ylabel('Frequency')
ax.set_title('Word Frequencies in Instructable Title Categories')
# Add a legend
ax.legend()
# Display the chart
plt.tight_layout()
plt.show()
Explanation of the Code
Here's a simplified, step-by-step explanation of the code:
- Import the required libraries for web scraping, data preprocessing, and visualization.
- Define a function to preprocess a project title by removing punctuation, converting to lowercase, and removing irrelevant words.
- Define a list of URLs representing different categories of project titles.
- Iterate over each URL category.
- Set up the Chrome WebDriver and navigate to the URL.
- Find the "Load All" button and click it a set number of times (once by default) to load more project titles.
- Extract the project titles from the webpage.
- Clean and preprocess the project titles using the defined function.
- Perform word frequency analysis on the preprocessed titles using the Counter class.
- Store the word frequencies in a data dictionary with the category as the key.
- Print the most common words and their frequencies.
- Quit the WebDriver.
- Collect every word that appears in any category by taking the union of the word sets.
- Select the top N common words based on their overall frequency.
- Prepare the data for plotting by extracting the word counts for each category and the selected common words.
- Set up the plot with the appropriate dimensions.
- Calculate the positions of the bars on the x-axis.
- Plot the bars for each category using a loop, using different bar positions for each category.
- Set the x-axis ticks and labels to display the common words.
- Set the labels and title of the plot.
- Add a legend to differentiate the categories.
- Display the chart.
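The cleaning, counting, and top-N steps above can be tried without a browser. The sketch below runs them on a handful of made-up titles (they are not real scrape results), and the helper names `count_words` and `total` are mine. One deliberate difference from the main script: ties are broken alphabetically here so the result is the same on every run.

```python
import re
from collections import Counter

STOP_WORDS = {'a', 'an', 'the', 'in', 'on', 'of', 'for', 'to'}


def preprocess_title(title):
    """Strip punctuation, lowercase, and drop the irrelevant words."""
    title = re.sub(r'[^\w\s]', '', title).lower()
    return ' '.join(w for w in title.split() if w not in STOP_WORDS)


def count_words(titles):
    """Count how often each cleaned word appears across a list of titles."""
    counts = Counter()
    for title in titles:
        counts.update(preprocess_title(title).split())
    return counts


# Made-up sample titles, for illustration only
sample = {
    'circuits': ['How to Build an Arduino Clock', 'Arduino LED Lamp!'],
    'workshop': ['Build a Wooden Lamp', 'How to Sharpen a Chisel'],
}
data = {category: dict(count_words(titles)) for category, titles in sample.items()}

# Union of the words seen in any category
all_words = set()
for counts in data.values():
    all_words.update(counts)


def total(word):
    """Total frequency of a word summed over all categories."""
    return sum(counts.get(word, 0) for counts in data.values())


# Highest total first; alphabetical order breaks ties
top_words = sorted(all_words, key=lambda w: (-total(w), w))[:3]
print(top_words)  # ['arduino', 'build', 'how']
```

Swapping in your own titles for `sample` is a quick way to check the cleaning rules before running a full scrape.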
The code combines web scraping, data preprocessing, and data visualization techniques to extract project titles, analyze word frequencies, and generate a bar chart comparing the most common words across different categories of project titles.
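The bar-position arithmetic in the plotting steps is easier to follow with small numbers. Here is a rough sketch of the same calculation using plain Python lists in place of NumPy arrays; the function name `grouped_bar_layout` and its `group_width` parameter are hypothetical, chosen only for this illustration.

```python
def grouped_bar_layout(num_words, num_categories, group_width=0.8):
    """Return the bar width, per-category bar positions, and tick positions."""
    # Each group of bars shares group_width units of the x-axis
    bar_width = group_width / num_categories
    positions = list(range(num_words))
    # Category i is shifted right by i bar widths
    bars = [[p + i * bar_width for p in positions] for i in range(num_categories)]
    # Ticks sit at the midpoint between the first and last bar of each group
    ticks = [p + bar_width * (num_categories - 1) / 2 for p in positions]
    return bar_width, bars, ticks


bar_width, bars, ticks = grouped_bar_layout(num_words=3, num_categories=2)
print(bar_width)  # 0.4
print(bars[1])    # second category's bars, shifted right by one bar width
print(ticks)      # tick positions centred under each pair of bars
```

With two categories, each bar is 0.4 wide, the second category starts 0.4 to the right of the first, and each tick lands 0.2 to the right of the whole-number position.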
If you found this Instructable helpful in any way, I would greatly appreciate it if you could show your support by hitting the like button. Your feedback makes a significant difference!
Happy Tinkering!