Quick intro into Web Scraping with Python using BeautifulSoup

Steven Kyle
4 min read · Nov 7, 2021
Photo by Ella Olsson on Unsplash

Web Scraping

Web scraping is a useful skill for any data scientist or analyst. It is the act of gathering information from a webpage by fetching the page and extracting its contents, and it can be used to collect large amounts of data. In this blog post we will cover a library called Beautiful Soup and go through an example of scraping a website.

Installation of Beautiful Soup

The version of Beautiful Soup we will be working with is version 4. The documentation for Beautiful Soup can be found here. We can use pip to install the library. We will also need to make sure we have the requests library. The following commands can be entered into the command prompt.

pip install beautifulsoup4
pip install requests

Website we’ll be using

The website we will be using to demonstrate beautifulsoup4 is a dummy website that was created for people to practice scraping. The website is called “Books to Scrape” and can be found here.

Inspect the website

The first thing to do when scraping a website is to get familiar with the site’s layout. Pay specific attention to the website’s URL and how it changes when navigating to different parts of the site. Once you have explored the website a little and are familiar with it, the next step is to understand how the data is structured for display. To do this we can open the browser’s developer tools with a keyboard shortcut while viewing the webpage: on Mac it is CMD + Option + I and on Windows it is CTRL + Shift + I. The webpage should look something like this when looking at the HTML elements.

As you can see in the picture, you can interact with the HTML elements on the right and see what elements are responsible for different parts of the website.

Gather HTML Content

Now that we are familiar with the website and the data structure, we can go ahead and download the site’s HTML code using the requests package. The example code below gathers the HTML code from the fiction catalogue.
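The original code for this step is not reproduced here, so the following is only a minimal sketch of how the download could look. The URL is my assumption of where the fiction catalogue lives (copy the address from your browser if it differs), and fetch_page is a hypothetical helper name, not something from the requests library.

```python
import requests

# Assumed address of the fiction catalogue on "Books to Scrape".
URL = "https://books.toscrape.com/catalogue/category/books/fiction_10/index.html"


def fetch_page(url, timeout=10):
    """Download a page and return the requests Response object."""
    response = requests.get(url, timeout=timeout)
    # Raise an error for 4xx/5xx answers instead of silently parsing an error page.
    response.raise_for_status()
    return response
```

Calling page = fetch_page(URL) performs the request, and page.content then holds the raw HTML of that page.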

The HTML data for that page is now stored as the variable page.

Looking through HTML

Now that we have the HTML data, we can look through it to extract the data we want. However, the raw HTML is hard to read and understand on its own. We will now use the Beautiful Soup library to look through the HTML data.
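As a rough sketch of this step, the snippet below turns HTML into a BeautifulSoup object. To keep it runnable without a network connection, a small hand-written snippet stands in for page.content; that snippet is an assumption and not the real page.

```python
from bs4 import BeautifulSoup

# Stand-in for page.content (made-up HTML, not the real page).
html = (
    b"<html><head><title>Fiction | Books to Scrape</title></head>"
    b"<body><h1>Fiction</h1></body></html>"
)

# With a real download this would be BeautifulSoup(page.content, "html.parser").
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)  # Fiction | Books to Scrape
print(soup.h1.get_text())  # Fiction
```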

page.content is the HTML content that we got earlier, and ‘html.parser’ is the parser used to parse the HTML content. It is built into Python, so nothing extra needs to be installed.

Finding the data we want

Now that we have the webpage as a BeautifulSoup object, we can find the data we want to extract from the page. All HTML elements from the webpage should be in the “soup”. Elements can be identified by attributes such as “id”, which is meant to be unique to one element, and “class”, which is usually shared by elements of the same kind. This is why the exploratory phase at the beginning of web scraping is so important: we must know the specifics of the data we are trying to extract.
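To make the difference between “id” and “class” concrete, here is a small sketch on made-up HTML. The id name and the prices are invented for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up HTML: one element with an id, two elements sharing a class.
html = """
<div id="promotions">Sale</div>
<p class="price_color">£10.00</p>
<p class="price_color">£12.50</p>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first match, which suits a unique id.
promo = soup.find(id="promotions")

# find_all returns every match, which suits a shared class.
# The keyword is class_ because class is a reserved word in Python.
prices = soup.find_all(class_="price_color")

print(promo.get_text())  # Sale
print([p.get_text() for p in prices])  # ['£10.00', '£12.50']
```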

For this example we will use the contents of the page that we turned into a “soup” to find the price of all the books on that page. Since all book information is stored in elements with the “product_pod” class, we can use that class to gather the data for all the books on the page.
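A minimal sketch of that lookup is shown below. The HTML is a simplified, made-up imitation of the page (two invented books), and the use of an article tag for each “product_pod” is my assumption about the markup, so check it against what you see in the inspector.

```python
from bs4 import BeautifulSoup

# Simplified, made-up stand-in for the catalogue page.
SAMPLE_HTML = """
<ol class="row">
  <li><article class="product_pod">
    <h3><a href="book-one/index.html" title="Book One">Book One</a></h3>
    <div class="product_price"><p class="price_color">£10.00</p></div>
  </article></li>
  <li><article class="product_pod">
    <h3><a href="book-two/index.html" title="Book Two">Book Two</a></h3>
    <div class="product_price"><p class="price_color">£12.50</p></div>
  </article></li>
</ol>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One result per book on the page.
books = soup.find_all("article", class_="product_pod")

print(len(books))  # 2
```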

As you can see, there is still a lot of HTML for each book. We can now loop through the results to gather the price of each book. We can target the class “product_price” since the price is within that element. We will also get the title of each book and add all the data to a dictionary.
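The loop could look roughly like the sketch below. It again runs on made-up HTML so it works offline. Reading the title from the link’s title attribute and the price from a “price_color” paragraph inside “product_price” are my assumptions about the markup, and book_prices is just an illustrative name.

```python
from bs4 import BeautifulSoup

# Simplified, made-up stand-in for the catalogue page.
SAMPLE_HTML = """
<article class="product_pod">
  <h3><a href="book-one/index.html" title="Book One">Book One</a></h3>
  <div class="product_price"><p class="price_color">£10.00</p></div>
</article>
<article class="product_pod">
  <h3><a href="book-two/index.html" title="Book Two">Book Two</a></h3>
  <div class="product_price"><p class="price_color">£12.50</p></div>
</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
books = soup.find_all("article", class_="product_pod")

book_prices = {}
for book in books:
    # The full title is kept in the link's title attribute.
    title = book.h3.a["title"]
    # Narrow down to the price block, then to the price itself.
    price_block = book.find("div", class_="product_price")
    price = price_block.find("p", class_="price_color").get_text(strip=True)
    book_prices[title] = price

print(book_prices)  # {'Book One': '£10.00', 'Book Two': '£12.50'}
```

The prices come back as strings with the currency symbol attached, so they would still need converting before doing any maths with them.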

Conclusion

In this blog post I gave a quick introduction to web scraping. Web scraping can be tedious to work with since every website is a little different, so don’t give up hope if you come across difficult challenges. Hopefully this gave you a little insight into how to get started. There is still much more that goes into web scraping, so make sure to read and learn from other useful guides and documentation.

Steven Kyle

25 year old Texan in the midst of a career change into DataScience.