
Scraping Goodreads using Python BeautifulSoup

What is Scraping?
Web scraping is a technique for fetching web pages and then extracting information from them. Every website serves its content as HTML or plain text, and scraping involves taking that HTML and pulling out the relevant pieces, such as the title of the page, its headings, links, or email addresses.
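To make the idea concrete, here is a minimal sketch that pulls the title, headings, and links out of a small piece of HTML. It uses Python's standard-library html.parser rather than BeautifulSoup, only so that it runs without installing anything. The SAMPLE_HTML page and the PageSummary class are made up for illustration.

```python
from html.parser import HTMLParser

# A small, made-up page used only for illustration.
SAMPLE_HTML = """
<html>
  <head><title>My Reading List</title></head>
  <body>
    <h1>Thrillers</h1>
    <a href="/book/1">First Book</a>
    <a href="/book/2">Second Book</a>
  </body>
</html>
"""


class PageSummary(HTMLParser):
    """Collects the page title, headings and link targets while parsing."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self.links = []
        self._current_tag = None  # tag whose text we are currently reading

    def handle_starttag(self, tag, attrs):
        self._current_tag = tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        self._current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace between tags
        if self._current_tag == "title":
            self.title = text
        elif self._current_tag in self.HEADING_TAGS:
            self.headings.append(text)


summary = PageSummary()
summary.feed(SAMPLE_HTML)
print("Title:", summary.title)
print("Headings:", summary.headings)
print("Links:", summary.links)
```

This is only a sketch for simple, well-formed markup. A parsing library such as BeautifulSoup handles nested and messy HTML far more comfortably.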

What is Goodreads?
Goodreads is a social cataloging website that lets people freely search for books and reviews. Users can sign up to build library catalogs and reading lists, or create their own groups of book suggestions.

How do we do Scraping?
Manual scraping: Manually copying and pasting the web page content
Text pattern matching: Using the UNIX grep command
Google Sheets: Using the IMPORTXML(url, xpath_query) function
HTML parsing: Using scripts written in a programming language (Java, Python, etc.)
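The text pattern matching approach from the list above can also be tried from Python with the standard re module, which plays the role that grep plays on the command line. The snippet below is a rough sketch: SAMPLE_HTML is a made-up string, and both patterns are deliberately simple assumptions, not complete email or URL matchers.

```python
import re

# Made-up HTML fragment used only for illustration.
SAMPLE_HTML = (
    '<p>Contact <a href="mailto:reader@example.com">reader@example.com</a> '
    'or visit <a href="https://example.com/books">our shelf</a>.</p>'
)

# Simplified patterns: good enough for a demo, not for every real-world page.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
HREF_PATTERN = re.compile(r'href="([^"]+)"')

# The same address appears twice in the fragment, so de-duplicate with a set.
emails = sorted(set(EMAIL_PATTERN.findall(SAMPLE_HTML)))
hrefs = HREF_PATTERN.findall(SAMPLE_HTML)

print("Emails:", emails)
print("Links:", hrefs)
```

Pattern matching is quick for flat text like this, but it knows nothing about the structure of the page, which is why the rest of this post uses an HTML parser instead.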

How do we scrape Goodreads using Python?
Now, that is the question that this post is all about!
We use web scraping to extract information from a website. Goodreads actually provides an API to get that information, but in this post I will extract it the old-fashioned way, using a couple of Python libraries.
So, let's start scraping!

In this post, I will scrape the page below, the "Thriller" shelf on Goodreads.
URL = https://www.goodreads.com/shelf/show/thriller
For this example, we will use the following libraries:
1. Requests to communicate with the server
2. BeautifulSoup for web scraping
pip install requests
pip install beautifulsoup4
Let’s first import the above libraries and let the fun begin!
import requests
from bs4 import BeautifulSoup as bs
Now, since we have all the libraries required, let’s start cooking.
url = "https://www.goodreads.com/shelf/show/thriller"
# Download the shelf page
page = requests.get(url)
# Parse the raw HTML into a searchable tree
soup = bs(page.content, 'html.parser')
print(soup)
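The fetch above assumes the request always succeeds. A slightly more defensive version might look like the sketch below. The fetch_soup helper, the 10-second timeout, and the User-Agent string are all my own assumptions for illustration, not something Goodreads requires.

```python
import requests
from bs4 import BeautifulSoup

SHELF_URL = "https://www.goodreads.com/shelf/show/thriller"

# Hypothetical User-Agent; replace it with a string that identifies your script.
HEADERS = {"User-Agent": "goodreads-shelf-tutorial/0.1"}


def fetch_soup(url, timeout=10):
    """Download a page and parse it, raising an error on a bad HTTP status."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

Calling fetch_soup(SHELF_URL) would then return the same kind of soup object as before, but a blocked or failed request shows up as an exception rather than as an empty result later on.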
This prints the soup, which contains all the information I need.
The next step is to extract the important data out of this soup.
We will extract two details:
1. Book Title
2. Author Name
# Each book title and author name is a link with its own CSS class
titles = soup.find_all('a', class_='bookTitle')
authors = soup.find_all('a', class_='authorName')
print(titles)
At this point, the title and author data are still raw HTML tags, not the plain text I was looking for. But we are almost there.
# Pair each title with its author and print only the text inside the tags
for title, author in zip(titles, authors):
    print("Book: ", title.get_text(), "By: ", author.get_text())
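One thing to keep in mind: zip pairs the two lists purely by position, so if any book lists more than one author, every pairing after it shifts. A possible way around this is to loop over each book's container and look up the title and authors inside it. The sketch below uses made-up HTML, and the book-row class is a hypothetical container name. On the real page you would need to inspect the source to find the element that wraps each book.

```python
from bs4 import BeautifulSoup

# Made-up markup: "book-row" is a hypothetical container class.
SAMPLE_HTML = """
<div class="book-row">
  <a class="bookTitle" href="/book/show/1">First Thriller</a>
  <a class="authorName" href="/author/show/10">Author One</a>
  <a class="authorName" href="/author/show/11">Author Two</a>
</div>
<div class="book-row">
  <a class="bookTitle" href="/book/show/2">Second Thriller</a>
  <a class="authorName" href="/author/show/12">Author Three</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Positional pairing: the second book ends up with the first book's co-author.
naive_pairs = [
    (t.get_text(strip=True), a.get_text(strip=True))
    for t, a in zip(
        soup.find_all("a", class_="bookTitle"),
        soup.find_all("a", class_="authorName"),
    )
]

# Container-based pairing: search for title and authors inside each book row.
books = []
for row in soup.find_all("div", class_="book-row"):
    title_tag = row.find("a", class_="bookTitle")
    if title_tag is None:
        continue  # skip rows without a title link
    author_tags = row.find_all("a", class_="authorName")
    books.append({
        "title": title_tag.get_text(strip=True),
        "authors": [a.get_text(strip=True) for a in author_tags],
    })

for book in books:
    print("Book:", book["title"], "By:", ", ".join(book["authors"]))
```

If every book on the page shows exactly one author, the simple zip version gives the same result and is perfectly fine.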
And boom! Here is the output we were looking for: each book title with its corresponding author.
Wondering how to scrape more detailed information about a book, such as Book_id, ISBN, ASIN, or Average_Rating? Well, that will be my next post!

Thank you for reading!


This post first appeared on What The Data Says.
