
Web-Scraping Rental Properties Using Python!

Introduction

The real estate market is something almost everyone in any country has to deal with, which makes it a great subject for data analytics.

There are two ways to capture this data from real estate sites such as Zillow, Trulia, or ForRent.com:

1- Via REST APIs exposed by the websites. Increasingly, these APIs are either blocked or tied to paid subscriptions, and they can only provide the data the website owners choose to expose, so the available information is limited.

2- Web-scraping the websites to extract the required information from the raw HTML of the pages. This approach is quite flexible, as there is practically no limit on the data you can extract. The limitation is that if the website's HTML tags change, the code must be updated accordingly, so it needs continuous maintenance.

In this post we will use Python with Beautiful Soup to implement web-scraping of a popular real estate website, Trulia.com.

Requirement

  • Gather data or text information about the real estate market in the San Francisco Bay Area. The locations include, but are not limited to: San Francisco, San Jose, Pleasanton, Fremont, Hayward, Livermore, Berkeley, Sunnyvale, etc.
  • The data may include, but is not limited to: property data (type, size, bedrooms/bathrooms), price data (current listing price and/or historical selling price), address, city/state/zip code, safety/security (crime rate in the neighborhood), agent information (name, company website, and so on), and reviews (of the property and of the agent).
  • Collect information on at least 600 distinct properties from your search. The raw data types may include string, float, integer, and so on.

Implementation

Since we want to capture data for at least 600 properties, we will scan through properties in several Bay Area locations and extract data for roughly 100 properties each. Because not all 100 results for a location fit on a single page, this also requires pagination. So let's get started.

Step 1

Open https://www.trulia.com and search for locations.

For example, searching for the Oakland, CA location on Trulia and filtering by 1+ beds and Single Family Home for rental properties produces results like the below.

Observe the URL in the browser and update it for each location as below (only the location name changes; the rest of the URL remains constant):

https://www.trulia.com/for_rent/Oakland,CA/1p_beds/SINGLE-FAMILY_HOME_type/
https://www.trulia.com/for_rent/San_Jose,CA/1p_beds/SINGLE-FAMILY_HOME_type/
https://www.trulia.com/for_rent/San_Francisco,CA/1p_beds/SINGLE-FAMILY_HOME_type/
https://www.trulia.com/for_rent/Sunnyvale,CA/1p_beds/SINGLE-FAMILY_HOME_type/
https://www.trulia.com/for_rent/Berkeley,CA/1p_beds/SINGLE-FAMILY_HOME_type/
https://www.trulia.com/for_rent/Fremont,CA/1p_beds/SINGLE-FAMILY_HOME_type/
https://www.trulia.com/for_rent/Pleasanton,CA/1p_beds/SINGLE-FAMILY_HOME_type/
https://www.trulia.com/for_rent/Livermore,CA/SINGLE-FAMILY_HOME_type/
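The list of URLs above can be generated programmatically from the location names. A minimal sketch (note: for simplicity this applies the 1p_beds filter uniformly, whereas the Livermore URL above omits it):

```python
# Template observed in the browser address bar; only the location varies
BASE = "https://www.trulia.com/for_rent/{},CA/1p_beds/SINGLE-FAMILY_HOME_type/"

locations = ["Oakland", "San_Jose", "San_Francisco", "Sunnyvale",
             "Berkeley", "Fremont", "Pleasanton", "Livermore"]

urls = [BASE.format(loc) for loc in locations]
print(urls[0])
# https://www.trulia.com/for_rent/Oakland,CA/1p_beds/SINGLE-FAMILY_HOME_type/
```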

Step 2

On the search results page we can see 30 properties per page, so to get 100+ properties per location we need to span 4 pages. To see the pattern, click on the second page and check the URL in the browser.

For each subsequent page you click, the URL is appended with 2_p, 3_p, or 4_p, depending on the page number.
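This pagination rule can be captured in a small helper function (a sketch based on the URL pattern observed above; `page_url` is our own name, not a Trulia API):

```python
def page_url(base_url, page):
    """Return the search-results URL for a given page number.

    Page 1 is the base URL itself; later pages append '2_p/', '3_p/', ...
    """
    return base_url if page == 1 else f"{base_url}{page}_p/"

base = "https://www.trulia.com/for_rent/Oakland,CA/1p_beds/SINGLE-FAMILY_HOME_type/"
print(page_url(base, 2))
# https://www.trulia.com/for_rent/Oakland,CA/1p_beds/SINGLE-FAMILY_HOME_type/2_p/
```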

Step 3

We can scrape details like price (rent), location, and size directly from the property cards on the search results page. For details like description, crime rate, commute, and suggested income, we need to open the link for each property and then scrape the details there.

So we will be capturing data in two parts-

  • Scan all properties on the search results page and extract price, location, and size details.
  • Open the link for each individual property and extract details like description, crime rate, and commute.

Once details are captured for one location, move on to the next location and repeat.
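The second part, collecting the detail-page link from each property card, can be sketched as below. The markup here is a hypothetical stand-in for a Trulia results page; the live page structure may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each card links to its detail page via an <a href>
page = '''
<div data-testid="property-card"><a href="/p/ca/oakland/home-1">Card 1</a></div>
<div data-testid="property-card"><a href="/p/ca/oakland/home-2">Card 2</a></div>
'''

soup = BeautifulSoup(page, 'html.parser')
# Relative hrefs are joined back onto the site root before fetching
links = ["https://www.trulia.com" + a['href']
         for a in soup.select('div[data-testid="property-card"] a')]
print(links)
```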

Step 4

How to check the html tags and fetch the required details?

All the details we need are enclosed in HTML tags, so all we need to do is identify the right tag and attribute to extract the required data.

For example, to fetch the price (rent) of a property:

Right-click on the rent of any property and select Inspect.

This opens the HTML tag for the rent and shows its attributes in Developer Tools.

So as highlighted, we can extract the text inside the tag with the attribute 'data-testid': 'property-price' and we should get the price. We will see how to do this in code.

In the same way, we need to find the appropriate HTML tags and attributes for each of the values we want to scrape. Note that sometimes the immediate tag and attribute are generic ones shared between different values, so we need to carefully find a tag that is unique to the value we need.
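Here is how such a lookup by the data-testid attribute works in Beautiful Soup, using a trimmed-down stand-in for a property card (the markup is illustrative; the live page will have more nesting):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking a property card, for illustration only
html = '''
<div data-testid="property-card">
  <div data-testid="property-price">$3,200/mo</div>
  <div data-testid="property-address">123 Main St, Oakland, CA</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# find() with an attrs dict locates the tag carrying our unique attribute
price = soup.find(attrs={'data-testid': 'property-price'}).get_text(strip=True)
print(price)  # $3,200/mo
```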

Web-Scraping using Beautiful Soup

Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:

– Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn’t take much code to write an application

– Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't detect one.

– Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.
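These features can be seen in a few lines. A minimal example, using the built-in 'html.parser' backend (you could pass "lxml" or "html5lib" instead if installed):

```python
from bs4 import BeautifulSoup

doc = "<html><body><p class='a'>one</p><p class='b'>two</p></body></html>"

# The second argument selects the parser backend
soup = BeautifulSoup(doc, "html.parser")

# Navigating: dotted access jumps to the first matching tag
print(soup.p.get_text())                       # one
# Searching: find() filters by tag name and attributes
print(soup.find('p', class_='b').get_text())   # two
```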

Referred from - https://www.crummy.com/software/BeautifulSoup/

Below is the code we came up with to scrape properties for the different locations and extract the required data. It stores the data in a data structure that can ultimately be used for analysis.

from urllib.request import urlopen,Request
from bs4 import BeautifulSoup as BS #BeautifulSoup is a Python library
                                    #for pulling data out of HTML and XML files.

import urllib.request
import urllib.parse
import urllib.error
import ssl
import re
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr

def get_headers():
    #Headers
    headers={'accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'accept-language':'en-US,en;q=0.9',
            'cache-control':'max-age=0',
            'upgrade-insecure-requests':'1',
            'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'}

    return headers


# Bypass SSL certificate verification (convenient for a demo,
# not recommended for production use)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
count=1 # for pagination
address=[]
rent=[]
sch_crime=[]
sugg_income=[]
add1=[]
area=[]
bed=[]
bath=[]
floor=[]
commute=[]
descp=[]
addr_link=[]
urls = ["https://www.trulia.com/for_rent/Oakland,CA/1p_beds/SINGLE-FAMILY_HOME_type/",
         "https://www.trulia.com/for_rent/San_Jose,CA/1p_beds/SINGLE-FAMILY_HOME_type/",
       "https://www.trulia.com/for_rent/San_Francisco,CA/1p_beds/SINGLE-FAMILY_HOME_type/",
       "https://www.trulia.com/for_rent/Sunnyvale,CA/1p_beds/SINGLE-FAMILY_HOME_type/",
       "https://www.trulia.com/for_rent/Berkeley,CA/1p_beds/SINGLE-FAMILY_HOME_type/",
       "https://www.trulia.com/for_rent/Fremont,CA/1p_beds/SINGLE-FAMILY_HOME_type/",
       "https://www.trulia.com/for_rent/Pleasanton,CA/1p_beds/SINGLE-FAMILY_HOME_type/",
       "https://www.trulia.com/for_rent/Livermore,CA/SINGLE-FAMILY_HOME_type/"]


for x in urls:
    count = 1   # page counter: 4 pages x 30 results ≈ 120 properties per location
    while count <= 4:
        # Page 1 is the base URL; later pages append 2_p/, 3_p/, 4_p/
        y = x if count == 1 else x + str(count) + '_p/'
        req = Request(y, headers=get_headers())
        soup = BS(urlopen(req, context=ctx).read(), 'html.parser')
        # ... locate the property cards here, append price/address/size to the
        # lists above, then follow each card's link for description, crime
        # rate, commute, etc. (full parsing logic is in the GitHub project
        # linked below) ...
        count += 1

# Combine the captured lists into a DataFrame (columns shown are indicative)
data_frame = pd.DataFrame({'address': address, 'rent': rent, 'area': area,
                           'bed': bed, 'bath': bath, 'description': descp})

Export Data Frame to CSV

#Save the obtained dataframe to csv
data_frame.to_csv('SFBayArea_Rental.csv')
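The saved CSV can later be loaded back for analysis with pandas. A self-contained sketch of the round trip, using a tiny stand-in DataFrame with hypothetical values:

```python
import pandas as pd

# Tiny stand-in DataFrame to demonstrate the save/load round trip
data_frame = pd.DataFrame({'address': ['123 Main St'], 'rent': [3200]})
data_frame.to_csv('SFBayArea_Rental.csv')

# index_col=0 restores the index column written by to_csv
restored = pd.read_csv('SFBayArea_Rental.csv', index_col=0)
print(restored.shape)  # (1, 2)
```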

The full code file, along with other data analytics operations, is available in the GitHub project below:


https://github.com/Abmun/WebScraping-RentalProperties

Thanks for checking this out.

Do comment or reach out to us in case of any issues or queries.

The post Web-Scraping rental properties using Python ! appeared first on TechManyu.


