May 17th 2017

I live in Paris, France, and yes, it is beautiful. The bistrots, the bread, the cheese and the wine, the apéros... All of these are real. But Parisians, either native or adopted, have a love-hate relationship with their city. More often than not, the level of difficulty involved in finding an apartment plays a tremendous role in this.

Too many applicants for too few places, picky landlords, and real-estate agencies to deal with are some of the issues apartment hunters face. But you can always go the rental-by-owner route: no middleman, and room for improvisation if your application is a little less than ideal but you made a nice first impression nonetheless. Some online platforms connect owners and applicants, and PAP.fr is one of the best.

When using one of these sites, the critical success factor is time. The faster you spot a listing and get in touch with the owner, the higher your chances are of getting it. But more often than not, before you can even open your email alert and follow the link to the listing, other applicants are already on it, and before you know it, the owner’s voicemail is full.

Here’s the trick to nail it: keep a tab open on the search results page on PAP.fr, and refresh it every 15 minutes or so. But you won’t, because, just like me, you’re at work and you’re busy all day, right?

Well… I ended up building my own bot to assist me in my flat search and, lo and behold, here I am now in my new Montmartre apartment. So in this post, I’ll teach you how to use Python, Google Spreadsheet and CALLR to find an apartment in Paris.

You can check out the GitHub repository to clone it, or follow along with the final code.

The gist

I assume you have minimal code literacy — enough at least to open the code and run it after changing a line or two to fit your own need. The code isn’t hard for a Pythonista, but it is not trivial either. You’ll need Python 3.5 to get started. I strongly suggest you set up a virtual environment before you write a single line of code. Installing both is outside the scope of this tutorial, but the provided links should get you going.

We’ll build a bot that automates fetching the data you’d read from a search results page of apartment listings. Python’s terseness and readability lend themselves perfectly to the task. The bot will use Google Spreadsheet as a simple data table to store the listings, and the CALLR API to send us the newest ones in concise and timely text messages. SMS is still the safest option when you need to be alerted and take action right away, and this use case is a perfect example. I know I always check my texts, but emails? Not so much.

Our bot will be triggered every 15 minutes by a task scheduler (AT or Scheduled Tasks on Windows, a cron job on Linux and macOS) from our machine. That way we can still work during the day while it does most of the search for us, running in the background.

Part 1 – Setting up your search bot

Install the dependencies as needed:

pip install requests beautifulsoup4 lxml pygsheets

We install requests to facilitate fetching the HTML pages via HTTP, beautifulsoup4 to extract data from the markup, lxml as the HTML parser Beautiful Soup will rely on, and pygsheets to ease our control of a remote Google spreadsheet via code.

Create and access the spreadsheet

The Google API requires creating a service account to use it the way we want.

Head to the Google API console.
Create a new project.
Click Enable API, then use the search function to find the Google Drive API and the Google Sheets API, and enable them both.
Create credentials for a new Web Server accessing Application Data.
Name your service and give it a role — Editor or Owner is fine.
Set Key Type to JSON and click Continue to download the file.

Rename it credentials.json, open it, and find the client_email entry. Copy the value. Now go create a new spreadsheet, and once on it, click the “Share” button. Paste the email you’ve copied, so that your project is now able to read and write in this file.
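
If you would rather not open the file by hand, a few lines of Python can read the service account address for you. This is only a minimal sketch: read_client_email is a hypothetical helper name, and it assumes the key file is a JSON object holding a client_email entry, as described above.

```python
import json


def read_client_email(path='credentials.json'):
    """Return the service account email stored in the key file."""
    with open(path, encoding='utf-8') as f:
        credentials = json.load(f)
    # This is the address to paste into the spreadsheet's "Share" dialog.
    return credentials['client_email']
```

Calling print(read_client_email()) from the folder that holds credentials.json should display the address to share the spreadsheet with.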

Copy the URL of your spreadsheet from the browser address bar and keep it handy; we’ll use it in a minute. Now we’re ready to start coding.

Part 2 – Scraping HTML content

Go to PAP.fr and make your search, like you’d normally do. Copy the URL of the result page.

Now create a bot.py file where you stored your credentials file and fill it with the following code:

# -*- coding: utf-8 -*-
import os
import re
import requests
import pygsheets
from bs4 import BeautifulSoup as Bs


SEARCH_PAGE = ''  # replace with the URL of your search results page
SPREADSHEET_URL = ''  # replace with the URL of your spreadsheet
URL_DOMAIN = 'http://www.pap.fr'

PAGINATION_SELECTOR = '.pagination li a'
LISTING_DETAIL_BTN_SELECTOR = '.btn-details'
NEXT_PAGE_SELECTOR = '.next'
GEOLOC_SELECTOR = '.item-geoloc'
SPECS_SELECTOR = '.item-summary'
DESCRIPTION_SELECTOR = '.item-description'
METRO_SELECTOR = '.item-metro .label'
PRICE_SELECTOR = '.price'

try:
    gc = pygsheets.authorize(service_file='credentials.json')

    sheet = gc.open_by_url(SPREADSHEET_URL).sheet1

    res = requests.get(SEARCH_PAGE)
    dom = Bs(res.text, 'lxml')

except Exception as e:
    print(e)

Let’s break this down.

Setting up the scene

You start by importing the necessary modules.

from bs4 import BeautifulSoup as Bs

It will import Beautiful Soup and rename it Bs in our code, so we don’t have to write its full name because we’re lazy.

Replace the SEARCH_PAGE and SPREADSHEET_URL with the values you’ve copied earlier. I personally like to store personal data like the spreadsheet URL into an environment variable.
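
One possible way to do that is sketched below. The variable names PAP_SEARCH_PAGE and PAP_SPREADSHEET_URL are made up for this example, as is the missing_settings helper; use whatever names you export in your own shell.

```python
import os

# Assumed variable names: they are not part of the original bot.
SEARCH_PAGE = os.environ.get('PAP_SEARCH_PAGE', '')
SPREADSHEET_URL = os.environ.get('PAP_SPREADSHEET_URL', '')


def missing_settings(settings):
    """Return the names of the settings that are empty or unset."""
    return [name for name, value in settings.items() if not value]
```

Checking missing_settings({'PAP_SEARCH_PAGE': SEARCH_PAGE, 'PAP_SPREADSHEET_URL': SPREADSHEET_URL}) at startup lets the bot stop early with a clear message instead of failing halfway through a run.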

The URL_DOMAIN is used later, when we rebuild full pages from chunks of data we’ll scrape from the HTML. Leave it as it is now, it will make more sense soon.

The next batch of constants are CSS selectors. You can find them in the markup of the pages you (and subsequently, your bot) will visit when browsing through the listings. We use them to target areas of interest in the pages during the crawl: blocks of relevant data, UI elements used to go to the next page or open the details of a listing, etc.

Then we wrap the interesting part in a try / except block.

It connects to the Google API using the credentials file. It accesses the spreadsheet via its URL, for upcoming use. It then fetches the HTML from our search result page, and stores it in a variable, dom. I named it like that because —surprise, surprise— it’s now an object with several properties related to extracting DOM-related data. In layman’s terms, this variable allows us to dissect the content of the HTML page it was made from, get rid of the code and pour out raw data.

Add the following snippet below the dom variable.

links = [SEARCH_PAGE] + [
    URL_DOMAIN + a.get('href')
    for a in dom.select(PAGINATION_SELECTOR)
]

We’re leveraging list concatenation and list comprehension to create a variable, links, that will gather the URLs of all results pages. We could have broken down the above code like so:

# Create an empty list.
links = []

# First, add the original results page to it.
links.append(SEARCH_PAGE)

# Then, find all results pages link
# in the pagination block at the bottom of the page
paginated_links = dom.select(PAGINATION_SELECTOR)

# For each of them...
for a in paginated_links:
    # ...extract a relative URL from its 'href' attribute.
    relative_url = a.get('href')
    # From it, create the absolute URL we will send the bot to,
    # and add it to the list we created.
    absolute_url = URL_DOMAIN + relative_url
    links.append(absolute_url)
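
Plain concatenation works as long as every href is a path starting with a slash. If the site ever returns full URLs instead, the standard library’s urljoin is a more forgiving alternative. The sketch below is not part of the original bot: to_absolute is a hypothetical helper, and the sample paths are invented.

```python
from urllib.parse import urljoin

URL_DOMAIN = 'http://www.pap.fr'


def to_absolute(href, base=URL_DOMAIN):
    """Resolve a link found in the page against the site's base URL."""
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    return urljoin(base, href)


# Hypothetical hrefs, for illustration only.
page_two = to_absolute('/annonces/page-2')
already_absolute = to_absolute('http://www.pap.fr/annonces/page-3')
```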

Scraping the results pages

Now that we have gathered all links for the results of our search, we want to do two things:

follow them and discover the listings they display,
then enter each listing to read and report back their details.

We will create two dedicated functions for this: process_listings_page and process_listing. In your file, add the following snippet after the imports and variable declarations, and before the try / except block you’ve just written. We will then examine the code in detail.

def process_listings_page(link):
    try:
        res = requests.get(link)
        dom = Bs(res.text, 'lxml')

        details_urls = [
            URL_DOMAIN + btn.get('href')
            for btn in dom.select('.btn-details')
        ]

        return [
            process_listing(listing_details_url)
            for listing_details_url in details_urls
        ]
    except Exception as e:
        print(e)
        # Return an empty list so the caller can still loop over the result.
        return []

The process_listings_page function takes a string as argument, which is the URL of an HTML page. If you remember what we did with the previous snippet, you should know that the URLs we will be passing are those of our search results pages.

So for each of our results page, we try to do the following…

We send the bot to visit this page and capture the HTML content for us to consume:

res = requests.get(link)
dom = Bs(res.text, 'lxml')

We find all the “details” buttons on the page, each linking to a listing’s full content, and gather their URLs in a new list, details_urls.

This is what this seemingly opaque snippet does:

details_urls = [
    URL_DOMAIN + btn.get('href')
    for btn in dom.select('.btn-details')
]

Let’s unfold this list comprehension to fully understand the process, and rewrite it like so:

# We'll hold the final values in this list.
details_urls = []

# Find all "details" buttons on the page.
# All ".btn-details" elements are <a> tags.
details_btns = dom.select('.btn-details')

# For each button...
for btn in details_btns:
    # Find the relative URL to this detailed listing,
    # by extracting the href attribute of its tag.
    details_relative_url = btn.get('href')

    # Now rebuild it into an absolute URL,
    # adding "http://" and the domain name.
    details_absolute_url = URL_DOMAIN + details_relative_url

    # Store the URL.
    details_urls.append(details_absolute_url)

The result of another list comprehension is returned by this function:

return [
    process_listing(listing_details_url)
    for listing_details_url in details_urls
]

Elegantly concise. Also, you may have noticed that we’re making use of the process_listing function, that we’ve not yet implemented. We’ll do so in just a moment — but again, it’s useful to take the time to unfold this snippet just for our own understanding of what it does. We could rewrite it as such:

    # We'll hold the final values in this list.
    processed_listings = []

    # For each listing details URL...
    for listing_details_url in details_urls:
        # Process the content and keep the returned values
        # in a variable.
        result = process_listing(listing_details_url)

        # Store the result.
        processed_listings.append(result)

    # Return the results. It's a list of all listings matching
    # our search, and we'll examine what this looks like in
    # detail when we implement the process_listing function.
    return processed_listings

Scraping the actual listings

Now let’s write the process_listing function, and three companion “utility” functions, clean_markup, clean_spaces, and clean_special_chars.

I’d like to draw your attention to why we clean special characters such as ² or the € sign. Characters outside the default SMS alphabet can force your text message into a different encoding, which lowers the number of characters that fit in one message and may turn a single text into a multi-part one.
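
To get a feel for the effect, here is a rough estimator of how many parts a message will take. It is a sketch under stated assumptions, not something the bot needs: sms_parts is a hypothetical helper, the limits used are the usual SMS figures (160 characters for a single GSM-7 message, 153 per part when split, 70 and 67 for UCS-2), and your carrier or the API may count slightly differently.

```python
import math

# GSM 03.38 basic alphabet: one septet per character.
GSM_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Extension table: each of these costs two septets (escape + character).
GSM_EXTENDED = "^{}\\[~]|€\f"


def sms_parts(text):
    """Estimate the encoding and the number of parts of a text message."""
    if all(ch in GSM_BASIC or ch in GSM_EXTENDED for ch in text):
        septets = sum(2 if ch in GSM_EXTENDED else 1 for ch in text)
        parts = 1 if septets <= 160 else math.ceil(septets / 153)
        return 'GSM-7', parts
    # Any other character switches the whole message to UCS-2.
    units = len(text.encode('utf-16-be')) // 2
    parts = 1 if units <= 70 else math.ceil(units / 67)
    return 'UCS-2', parts
```

With this estimate, a single ² is enough to switch a message to UCS-2, while the € sign stays in GSM-7 but counts double. Replacing both keeps the alerts as short as possible.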

Finally, some of the DOM-related snippets look idiosyncratic and convoluted. That’s a downside of scraping. The best way to approach them is to read the code side by side with the HTML source of one of your results pages, and figure out what manipulation the bot is trying to do.

def process_listing(listing):
    # Access listing page and read its HTML content.
    res = requests.get(listing)
    dom = Bs(res.text, 'lxml')

    # Select DOM nodes giving short specifications infos
    # on the apartment (number of rooms, etc).
    # For each node, turn this —e.g. “rooms 3”— 
    # into this —e.g. “rooms: 3”—, and concatenate all information
    # in a single line for a listing, separated by forward slashes.
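    specs = ' / '.join([
        clean_spaces(
            clean_markup(
                # The tag name was lost from the original snippet;
                # '<strong>' is an assumption, check the page markup.
                str(li).replace('<strong>', ': ').lower()
            )
        )
        for li in dom.select(SPECS_SELECTOR)[0].select('li')
    ])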

    # Find the DOM nodes giving the nearby metro station. 
    # Extract their text, and concatenate the stations 
    # list of names into one short comma separated string.
    metro = ', '.join([
        clean_markup(elm.get_text()) 
        for elm in dom.select(METRO_SELECTOR)
    ])

    # Extract the location info from the relevant DOM node.
    location = dom.select(GEOLOC_SELECTOR)[0].h2.text

    # Extract the description, and clean the resulting text
    # from artifacts such as unwanted whitespaces.
    description_body = dom.select(DESCRIPTION_SELECTOR)[0]
    description = clean_spaces(description_body.get_text())

    # Get the price.
    price = dom.select(PRICE_SELECTOR)[0].text

    # Return a dictionary of all extracted data from this listing.
    return {
        'specs': specs,
        'location': location,
        'description': description,
        'metro': metro,
        'url': listing,
        'price': price
    }

def clean_special_chars(string):
    """Replace special characters with plain ASCII stand-ins."""
    return string.replace('²', '2').replace('€', 'e')

def clean_markup(string):
    """Very basic removal of HTML tags + special chars."""
    string = clean_special_chars(string)
    return re.sub(r'<[^>]*>', '', string)

def clean_spaces(string):
    """Crunch multiple whitespaces into a single one."""
    string = re.sub(r'\n|\r|\t', ' ', string)
    return re.sub(r'\s{2,}', ' ', string).strip()
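
To see what these helpers produce, here is a self-contained sketch that runs them on a made-up fragment. The helpers are repeated so the snippet runs on its own. The sample markup is hypothetical (the real page may differ), and the tag-stripping pattern is an assumption about what a very basic removal of HTML tags looks like.

```python
import re


def clean_special_chars(string):
    return string.replace('²', '2').replace('€', 'e')


def clean_markup(string):
    string = clean_special_chars(string)
    # Assumed pattern: drop anything between "<" and the next ">".
    return re.sub(r'<[^>]*>', '', string)


def clean_spaces(string):
    string = re.sub(r'\n|\r|\t', ' ', string)
    return re.sub(r'\s{2,}', ' ', string).strip()


# Hypothetical list item, loosely modeled on a specs block.
sample = '<li>\n\tSurface   <strong>42 m²</strong>\n</li>'
cleaned = clean_spaces(clean_markup(sample))
print(cleaned)
```

The tags, the line breaks and the runs of spaces are gone, and the ² has become a plain 2.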

Storing the results in a spreadsheet

Let’s review the full code within our original try / except block. We’ll take the opportunity to add the snippet that’s responsible for saving our results into a spreadsheet. Review the code and its comments carefully.

try:
    gc = pygsheets.authorize(service_file='credentials.json')

    sheet = gc.open_by_url(SPREADSHEET_URL).sheet1

    res = requests.get(SEARCH_PAGE)
    dom = Bs(res.text, 'lxml')

    links = [SEARCH_PAGE] + [
        URL_DOMAIN + a.get('href')
        for a in dom.select(PAGINATION_SELECTOR)
    ]

    # This is new. It makes use of the code we wrote above.
    # It gathers URLs of listing details it got from scraping
    # the search results pages...
    for link in links:
        # ...then uses these URLs to dive into each listing and store
        # their informations in a list...
        for ls in process_listings_page(link):
            # ...and finally, it uses pygsheets to
            # write the data into our spreadsheet table.
            sheet.insert_rows(row=0, values=[
                ls['specs'], ls['location'],
                ls['description'], ls['metro'], ls['url']
            ])

except Exception as e:
    print(e)

Run the bot once, and after a few seconds, you should see some nice results starting to fill rows inside your spreadsheet. But what if you run it again? Oh noes!… we’re piling up redundant data. Let’s add a quick and dirty patch to ensure we only store unique results each time.

# Store the URLs of listings we’ve already registered.
# The column holding URLs in our spreadsheet is number 5, 
# as the index of column starts at 1.
urls_stored = sheet.get_col(5)

for link in links:
    for ls in process_listings_page(link):
        # If this URL isn’t already known, we want to add it!
        # Insert it above all the others in the table.
        if ls['url'] not in urls_stored:
            sheet.insert_rows(row=0, values=[
                ls['specs'], ls['location'], 
                ls['description'], ls['metro'], ls['url']
            ])

Now run your script again… and again a few minutes later. If new listings have appeared, they will stack up before your eyes in the spreadsheet!
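
One limit of this quick patch is that it only compares against the URLs read at the start of the run, so a listing that shows up on two results pages in the same run could still be stored twice. A slightly stricter filter could look like the sketch below. The keep_new function is a hypothetical helper, not part of the original bot, and it only assumes that each listing is a dictionary with a 'url' key.

```python
def keep_new(listings, known_urls):
    """Return the listings whose URL has not been seen yet."""
    seen = set(known_urls)
    fresh = []
    for listing in listings:
        if listing['url'] not in seen:
            # Remember the URL so a repeat in the same run is skipped too.
            seen.add(listing['url'])
            fresh.append(listing)
    return fresh
```

Using a set also keeps the membership test fast as the spreadsheet grows.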

Launching the bot automatically with a task scheduler

On Windows, you can now use your task scheduler to run the script every 15 minutes or so. On a Unix machine (including macOS), you’ll need a cron job for this.

If you have used a virtual environment for the development of the bot, as advised earlier, the Python binary bundled in this virtualenv is the one you must use to launch the bot. You cannot just run python bot.py out of the blue: you would need to open a Terminal window, enable the virtual environment, and run the script from there. More conveniently, you can find the location of the Python binary used in your virtualenv, and use its fully qualified path when running the command.

To do so, enable your virtualenv and enter the command which python in your console. The returned value is the fully qualified path to your binary.
Mine says /Users/davy.braun/.virtualenvs/househunterbot/bin/python. We also want the full path to the bot script, so that the machine can invoke it right away. Mine is /Users/davy.braun/Code/projects/house-hunter-bot/bot.py. Keep both values handy, as we’re using them right away!

Run the following command: env EDITOR=nano crontab -e. This will open the scheduled job list with the built-in nano editor.

Now enter the following line — and edit it accordingly with your own values for the binary and the file location:

*/15 * * * * /Users/davy.braun/.virtualenvs/househunterbot/bin/python /Users/davy.braun/Code/projects/house-hunter-bot/bot.py

This adds a cron job to the jobs list, basically saying “every 15 minutes, run the bot script located at that exact location, using no other than the Python binary located there”. Press Ctrl+X and confirm to save and exit.
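
If you’d rather not type the paths by hand, Python can assemble the line for you. This is only a convenience sketch: cron_line is a hypothetical helper, it assumes you run it with the virtualenv enabled (so that sys.executable points at the right binary), and it does not handle paths containing spaces.

```python
import os
import sys


def cron_line(script_path, python_path=sys.executable, every_minutes=15):
    """Build a crontab entry that runs a script every N minutes."""
    schedule = '*/{0} * * * *'.format(every_minutes)
    return '{0} {1} {2}'.format(
        schedule, python_path, os.path.abspath(script_path)
    )
```

Running print(cron_line('bot.py')) from the project folder gives you a line you can paste into the crontab editor.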

Now be smart — when you’re done with your research, just be civil and remove this cron job! Don’t leave a zombie bot at large… It’s your responsibility to turn it off by the end of the day (or not use an automatically scheduled task at all!).

Part 3 – Texting back your results

Our bot does the redundant part of the search for us. This is all fine and dandy, but so far there’s no added value compared to subscribing to an email alert on the site. So let’s add a feature that sends us a text message each time the bot finds a new result.

We want to receive, for each new listing, a text message with a summary of the apartment specifications, the price, the location, the nearest metro stations, and the URL to this listing to look it up right away. Now these are looooong URLs. That’s bad. So we’ll shorten them on the fly thanks to the Google Shortener API.

Setting up the CALLR API and Google Shortener API

Now, go to your Google Developer console. Enable the Shortener API…

…then create a Google API key.

Because I’m sharing my code, I’ll be storing my CALLR credentials and Google Shortener API key as environment variables, before declaring them in my code like so:

CALLR_API_LOGIN = os.environ.get('LOGIN')
CALLR_API_PASSWORD = os.environ.get('PASSWORD')
GOOGLE_SHORTENER_API_KEY = os.environ.get('API_KEY')

If you’re using a virtual environment, you must declare your environment variables in a specific place: a file you can open with a text editor, located at $VIRTUAL_ENV/bin/postactivate (a hook provided by virtualenvwrapper). Deactivate, then reactivate the virtualenv, and you’ll have access to the environment variables, scoped within your virtual environment.

Great! Now, let’s install the CALLR SDK, and pyshorteners, a Python library that will handle URL shortening via the Shortener API for us.

pip install callr pyshorteners

Import both libraries in your code…

import callr
from pyshorteners import Shortener

And we’re good to go. Let’s dive back into the main try / except block that holds the meat of our program.

Send text messages

Here’s the code, with the added part to send SMS alerts.

# Create the Shortener, and the CALLR api object.
shortener = Shortener('Google', api_key=GOOGLE_SHORTENER_API_KEY)
api = callr.Api(CALLR_API_LOGIN, CALLR_API_PASSWORD)

try:
    gc = pygsheets.authorize(service_file='credentials.json')

    sheet = gc.open_by_url(SPREADSHEET_URL).sheet1

    res = requests.get(SEARCH_PAGE)
    dom = Bs(res.text, 'lxml')

    links = [SEARCH_PAGE] + [
        URL_DOMAIN + a.get('href')
        for a in dom.select(PAGINATION_SELECTOR)
    ]

    urls_stored = sheet.get_col(5)

    for link in links:
        for ls in process_listings_page(link):
            if ls['url'] not in urls_stored:
                sheet.insert_rows(row=0, values=[
                    ls['specs'], ls['location'],
                    ls['description'], ls['metro'], ls['url']
                ])

                # This is new.
                # If this is not the first time we store data 
                # (i.e. urls_stored is not empty)
                # we want to receive SMS alerts 
                # on each new listing.
                if len(urls_stored) > 0:
                    send_data_via_sms(ls)

except Exception as e:
    print(e)

You’ve noticed we have created objects to deal with the CALLR and shortener APIs. We’re also invoking a send_data_via_sms function when a new listing stacks up, so let’s implement this function.

def send_data_via_sms(data):
    msg = "{0} - {1} - {2} - {3} - {4}".format(
        data['specs'], data['price'], data['location'], data['metro'],
        shortener.short(data['url'])
    )
    api.call('sms.send', 'SMS', '+33600000000', msg, None)

We’re accessing the listing data passed as argument, and formatting some of its fields into the message we expect to receive. The URL is passed to the shortener.short function which, you’ve guessed it, shortens it.
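
If you want to check the message without sending anything, one option is to separate the formatting from the sending. The sketch below is a hypothetical refactoring, not the original code: build_sms and its shorten parameter are invented names, and in the bot you would pass shortener.short as the second argument.

```python
def build_sms(data, shorten=lambda url: url):
    """Format one listing as a single line of text for an SMS."""
    fields = [
        data['specs'], data['price'], data['location'], data['metro'],
        # By default the URL is left as-is; pass a real shortener to change it.
        shorten(data['url']),
    ]
    return ' - '.join(fields)


# Made-up listing, for illustration only.
sample = {
    'specs': 'pieces: 2 / surface: 42 m2',
    'price': '1200 e',
    'location': 'Paris 18e',
    'metro': 'Abbesses',
    'url': 'http://www.pap.fr/annonce/123',
}
message = build_sms(sample)
```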

Then we leverage the CALLR API to send the text. That’s just it — a one-liner. Notice the third parameter we’re passing to the api.call function. It’s an E.164-formatted phone number — and that, obviously, should be your own.
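
A quick format check can catch a mistyped number before you wonder why no text arrives. This is a minimal sketch: is_e164 is a hypothetical helper that only verifies the general shape of an E.164 number (a plus sign followed by at most 15 digits, the first one non-zero), not that the number actually exists.

```python
import re

# E.164 shape: "+" then 2 to 15 digits, the first digit being non-zero.
E164_PATTERN = re.compile(r'\+[1-9]\d{1,14}')


def is_e164(number):
    """Return True if the string looks like an E.164 phone number."""
    return E164_PATTERN.fullmatch(number) is not None
```

For example, is_e164('+33600000000') passes, while the national form '0600000000' or a number containing spaces does not.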

Now let’s run the bot once more to verify everything is working properly. To ensure we can capture a few new listings to trigger the SMS alerts, remove two or three lines from your Google spreadsheet. After running python bot.py, the bot will then fetch them as if they were new, and your phone should start vibrating!

Final words

Building this bot, we have covered a few interesting topics — list manipulations in Python, scraping data on a website, using a Google spreadsheet as an ad-hoc database… But I find the easiest and most interesting part was triggering the SMS alerts. Take a minute and think of all the IoT, or bot-related hacks you can build now that you have added Good Ol’ Telephone to your utility belt!

Every communication channel has its pros and cons. Hacking our own HouseHunterBot has been an interesting project to get our feet wet, especially on a use-case where quick reaction was paramount — hence the SMS alerts. If you want to talk about bots, life automation hacks, telecommunication or apéros in Paris, I’m @davypeterbraun on Twitter.

The post How to automate your apartment hunt with a CALLR/Python SMS house hunter bot appeared first on CALLR Blog.
