
Scrape Data from a Lazy Loading Website with Selenium Python

Paige Niedringhaus, in ITNEXT

A few months ago, my friend wanted me to write a program to collect the data of one of the NFT collections on the NFTrade site, compute the current price of each NFT in US dollars based on the current market price of the BNB cryptocurrency it was listed for sale in, and compile all of the NFTs for sale into a CSV file that he could sort and manipulate.

Unfortunately, the NFTrade website does not have a public API, so writing a Node.js script to fetch the data from the API and format it as required was not an option. Instead, I needed to make a site scraper to actually go to the website page and "scrape" the data off of it.

Having not written a web scraper before (and also wanting to make the script easy for my friend to update and run on his own machine), I decided to write the program in Python, which seems to be a very popular language for a task such as this. Along the way, my little web scraper's requirements evolved and got more complex, and I learned a bunch of useful new techniques about using Python for my project, which I intend to share in a series of posts over the coming months.

My first attempt to scrape the data from NFTrade was unsuccessful beyond locating the first 75 NFTs on the page. I figured out this was because NFTrade (as many other websites do) lazy loads NFTs onto the page 75 at a time: once the user has scrolled down far enough to reach the end of the currently visible items, the site loads the next batch of elements onto the page (essentially a fancier version of pagination).
So I needed a way to have my web scraper collect whatever data was available on the page, scroll down far enough to trigger more data to load, collect that, and repeat.

After some trial and error, I finally found a working solution with the help of a Python package named Selenium Python, and today I'll share how to write your own Python script to scrape data from a lazy loading website with Selenium WebDriver.

NOTE: I am not normally a Python developer, so my code examples may not be the most efficient or elegant Python code ever written, but they get the job done.

There are a few popular Python packages for web scraping that I tried before reaching for Selenium, but in my experience they only worked for static sites whose HTML is generated at build time, not for sites like NFTrade that render their content on the client side via JavaScript.

So I had to do a little digging to find a package that could scrape sites with dynamically loaded data, and I ran across the Selenium Python package during my investigation.

Selenium Python is a Python API that lets users write scripts or automated tests against Selenium WebDriver in an intuitive, Python-flavored way. Selenium WebDriver, in turn, is software that drives a browser natively, as a user would, either locally or on a remote machine. Originally created back in 2004, Selenium is considered one of the earliest tools for automated testing that emulates user actions on a web page (commonly known today as end-to-end testing).

The cool thing about WebDriver, though, is that its uses span beyond test automation: scripts can also be written to scrape data off of live web pages, and that's just what I ended up doing in my Python script, so let's get started.

As with most projects, the first thing to do is add the Selenium Python package to the Python project.
The easiest way is to use pip. Assuming you have pip on your machine, run pip install selenium from a terminal at the root of your Python project folder.

Then, add the selenium package to your requirements.txt file so anyone downloading the repo in the future can install all the necessary project dependencies.

And that's all it takes to be ready to use WebDriver in your Python script. Simple enough.

After adding the Selenium Python bindings to the project, it's time to import Selenium's WebDriver and some of its helpful configuration options into the Python script that does the website scraping. I named my file for_sale_scraper.py since I was specifically looking for NFTs that are for sale (not all of the NFTs listed on NFTrade are; some are visible but not actually available to purchase), but you can choose any file name that makes sense for you.

My file has four Selenium imports, and I'll break down what each one does.

The very first import brings in the selenium.webdriver module, which provides all the WebDriver implementations.

Next, as I chose Chrome as the browser I wanted WebDriver to interact with (Selenium supports Firefox, Chrome, Edge, and Safari), I imported the Options class from the selenium.webdriver.chrome.options module.
This allowed me to add specific config details about how I want the Chrome browser set up when the Python script runs against it: things like headless mode, disabled extensions, etc. I'll cover the arguments I passed in detail in the next section.

WebDriverWait, the third import, is part of the special sauce that makes WebDriver a good solution for sites like NFTrade that dynamically fetch data on the client side: it waits for a condition to be met, up to a timeout, before the script carries on, which gives the browser time for data to come back from the server and populate the DOM.

This type of wait is an "explicit wait", meaning I manually set the maximum period the code will wait for that condition before giving up. (Selenium also has "implicit waits", but those are configured on the driver itself rather than through WebDriverWait.)

And finally, there is the import for By. By is what allows me to tell WebDriver how to locate elements on the page, and it is immeasurably useful and powerful.

The By class provides locator strategies for element IDs, names, XPaths, link text (full or partial), tag names, class names, and CSS selectors, and once again, it is a key player when it comes to scraping data off of the web page, as I'll demonstrate soon.

Right, all the Selenium WebDriver imports are now present in the Python file. Time to initialize them and get to work.

Before WebDriver can begin scraping the data from NFTrade, an instance of the browser that WebDriver will interact with must be created and the proper options supplied to it.
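Going back to the explicit wait and the By locator described above, here is a rough sketch of how the two can work together: it waits until at least one element matching an XPath is present, then returns the matches. The helper name wait_for_elements and the 10-second default are illustrative assumptions, not part of the script described in this post, and the imports sit inside the function only so the snippet can be loaded on a machine without Selenium (in a real script they would go at the top of the file).

```python
def wait_for_elements(driver, xpath, timeout_seconds=10):
    """Block until at least one element matches `xpath`, then return the matches.

    `wait_for_elements` is a hypothetical helper, not something Selenium ships.
    WebDriverWait raises selenium.common.exceptions.TimeoutException if
    nothing matches within `timeout_seconds`.
    """
    # Imported here (rather than at module top) so this sketch can be loaded
    # without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    # Explicit wait: until() keeps calling the function with the driver and
    # returns its result as soon as that result is truthy. An empty list is
    # falsy, so the wait continues until at least one element is found.
    return WebDriverWait(driver, timeout_seconds).until(
        lambda d: d.find_elements(By.XPATH, xpath)
    )
```

With a real Chrome driver this would be called as wait_for_elements(driver, "//div") or with whatever XPath identifies the elements of interest.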
Initialize the Selenium WebDriver instance

In my attempt to follow good Python coding practices (again, disclaimer: I don't write Python as my primary coding language), I created a class for the file named ForSaleNFTScraper, and immediately inside of it an __init__() method that creates the Chrome WebDriver instance the rest of the class's methods can reference.

The first thing I did inside of the __init__() method was to add a couple of Chrome browser configs via the Options import from the last section by declaring a new options variable.

Since I wanted this script to run without actually opening a browser window on the user's local machine, I added the config argument --headless, along with --start-maximized so the (unseen) window would take up as much screen size as was available (and hopefully load as many NFTs as quickly as possible by doing so).

Then I passed the new options object to the instance of webdriver.Chrome, which was assigned to self.driver (self refers to the class instance, so anything stored on it is accessible from the rest of the methods in this ForSaleNFTScraper class), and instructed the new WebDriver to wait for 5 seconds after startup (which would presumably give it time to go to the specified NFTrade URL and load the data onto the page before attempting to scrape it).

There's plenty happening in that first method, but it's all pretty straightforward once you go through the code line by line and understand what each argument means to the Chrome WebDriver instance. Now that the WebDriver instance was configured and ready to go, I could write the code that fetches the NFT card data and lazy loads more data once the end of the currently visible info is reached.
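The setup described above might look roughly like the sketch below. It is a reconstruction under assumptions, not the original code: build_chrome_driver and the driver_factory parameter are hypothetical names added so the class can be created without launching a browser, and the 5-second wait is modelled as an implicit wait because the post does not show which call was used.

```python
# Chrome arguments named in the post.
CHROME_ARGUMENTS = ("--headless", "--start-maximized")


def build_chrome_driver(arguments=CHROME_ARGUMENTS, implicit_wait_seconds=5):
    """Create a configured Chrome WebDriver (hypothetical helper)."""
    # Imported lazily so this sketch loads even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in arguments:
        options.add_argument(argument)

    driver = webdriver.Chrome(options=options)
    # Assumption: an implicit wait, i.e. element lookups keep retrying for up
    # to this many seconds before giving up.
    driver.implicitly_wait(implicit_wait_seconds)
    return driver


class ForSaleNFTScraper:
    def __init__(self, driver_factory=build_chrome_driver):
        # self.driver is stored on the instance, so every other method of
        # the class can reach the same browser session.
        self.driver = driver_factory()
```

Calling ForSaleNFTScraper() with no arguments starts headless Chrome, while passing a different driver_factory makes it possible to swap in another browser or a stand-in object.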
Write the get_cards() and get_current_card_count() methods

This is where the code really starts to get interesting in my opinion, because it's where I learned to collect whatever data was currently visible in a (headless) browser and then load more data to add to the list. Pay close attention, because this is where the lazy loading code lives.

Ok, here we go.

For starters, there are two methods in this part of the script. The first method, get_current_card_count(), is how I keep track of how many NFT cards in a collection are currently loaded on the page.

As I've said, NFTrade lazy loads its NFT collections onto the page to make the initial page load quicker, and when a user scrolls down to the end of the currently loaded batch of elements, the page loads more cards into the DOM.

The second method is get_cards(), which handles going to the NFTrade collection URL and scraping all the available card data. It relies on get_current_card_count() to know whether to keep loading more NFT cards until the desired number of cards is in the browser to scrape data from.

get_cards() method

I'll talk about get_cards() first as it's the more complicated of the two methods. The first thing the method does is declare a new variable named URL, set to the address of the NFTrade collection page I want WebDriver to navigate to and scrape the data from.
I used the Selenium WebDriver driver.get() method to navigate to the page given by the URL.

After navigating to the proper URL, I created a variable called last_card_count and set it equal to 0: this variable tracks how many NFTs are currently loaded on the page so it can be compared to the max_card_count argument passed to the get_cards() method (if a number isn't passed for max_card_count, it defaults to 500).

The key to lazy loading more and more data in the browser is a while loop inside of get_cards() that compares the last_card_count and max_card_count variables. As long as last_card_count is less than max_card_count, the loop runs, and each time it executes, WebDriver uses the driver.execute_script() method to scroll down the page, waits for 3 seconds (allowing more cards to load onscreen), and then sets last_card_count to the new number of cards on the page using the get_current_card_count() method.

NOTE: The window.scrollTo() method is critical

driver.execute_script() allows for the synchronous execution of JavaScript in the current window, so when you see the code self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);"), what's happening is that WebDriver is using the JavaScript window.scrollTo() method to scroll the browser all the way to the bottom of the page (that's why document.body.scrollHeight is present: it's a measurement of the height of the whole document.body element), which triggers the page to load more NFT cards into view.

And this is a perfect time to segue into discussing the get_current_card_count() method, which is short and sweet.

get_current_card_count() method

This method exists simply to find the count of the card elements currently loaded in the browser, and it does so by combining the WebDriver find_elements() method with the By.XPATH locator strategy.

Due to how the NFTrade site is built, there are no easily identifiable classes, IDs, or other consistent
ways to identify all the cards on the page, so I had to resort to an XPath expression to identify each element and include it in my count to update the last_card_count variable. I cobbled the XPath together by using Chrome DevTools to inspect the elements on the page, constructing the expression from there through trial and error.

NOTE: What is XPath?

If you're unfamiliar like I was, XPath is a syntax for navigating through the elements and attributes of an XML document (or a webpage). W3Schools has some good examples of what typical XPath expressions look like and how to interpret them.

So the code inside of the get_current_card_count() method is just one line: I get the count (using the built-in Python function len()) of all the elements on the page that match an XPath for an element whose class contains "Item_itemContent__1XIcH", because each NFT on the page is wrapped by an element with that class. It's not the prettiest thing to read and understand, but it gets the job done.

And finally, jumping back to the get_cards() method again: once last_card_count has been updated and has reached or surpassed max_card_count (i.e. enough NFT cards are loaded into the browser), the while loop ends, and all the cards on the screen are located (using the very same XPath used in the get_current_card_count() method, I might add) and assigned to the cards variable defined at the top of the get_cards() method. That variable then gets returned to the __main__ block running the whole script, which I'll cover next.

There's quite a bit going on here, but hopefully it makes more sense now what these methods are doing. Time to test out this lazy loading functionality and see how WebDriver does.

Run the Python script

All right, now that all the code and logic to load multiple sets of NFTrade cards into the browser and collect the data has been written, it's time to run the code.

To do that, I added a __main__ block at the bottom of the file so the script can be started from the terminal by running the file with Python.

The first thing the block does is create a new instance named scraper by calling the ForSaleNFTScraper() class. It then fetches all the card data and assigns it to a variable named cards by calling scraper.get_cards(max_card_count=200), supplying a max_card_count of 200.

After this step, as a sanity check, I used the Python print() and len() functions to print out all the card data and a count of the total cards fetched by the get_cards() method, and to ensure all the info I needed to include in the CSV (NFT price, NFT ID, etc.) was available to me.
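Pulling the two methods and the entry point together, the scroll-and-count loop can be sketched as below. This is a trimmed-down illustration under stated assumptions rather than the original script: the scraper takes its driver as an argument, FakeLazyPage is a made-up stand-in for a browser so the loop can be run without Selenium, the collection URL is a placeholder, the XPath uses a tag-agnostic pattern, and the early exit when no new cards appear is an added safeguard that the post does not describe.

```python
import time

XPATH = "xpath"  # same string value as Selenium's By.XPATH
# Assumed locator: any element whose class contains the card wrapper class.
CARD_XPATH = "//*[contains(@class, 'Item_itemContent__1XIcH')]"
SCROLL_TO_BOTTOM = "window.scrollTo(0, document.body.scrollHeight);"


class ForSaleNFTScraper:
    def __init__(self, driver, pause_seconds=3):
        self.driver = driver
        self.pause_seconds = pause_seconds  # time allowed for new cards to load

    def get_current_card_count(self):
        # Count the card elements currently present in the DOM.
        return len(self.driver.find_elements(XPATH, CARD_XPATH))

    def get_cards(self, url, max_card_count=500):
        self.driver.get(url)
        last_card_count = 0
        while last_card_count < max_card_count:
            # Scrolling to the bottom triggers the next lazy-loaded batch.
            self.driver.execute_script(SCROLL_TO_BOTTOM)
            time.sleep(self.pause_seconds)
            new_card_count = self.get_current_card_count()
            if new_card_count == last_card_count:
                break  # added safeguard: nothing new loaded, so stop scrolling
            last_card_count = new_card_count
        return self.driver.find_elements(XPATH, CARD_XPATH)


class FakeLazyPage:
    """Stand-in for a browser: each scroll reveals one more batch of cards."""

    def __init__(self, total_cards, batch_size=75):
        self.total_cards = total_cards
        self.batch_size = batch_size
        self.visible = 0
        self.url = None

    def get(self, url):
        self.url = url
        self.visible = min(self.batch_size, self.total_cards)

    def execute_script(self, script):
        self.visible = min(self.visible + self.batch_size, self.total_cards)

    def find_elements(self, by, value):
        return [f"card-{index}" for index in range(self.visible)]


if __name__ == "__main__":
    # Placeholder URL; a real run would pass a Selenium driver instead.
    scraper = ForSaleNFTScraper(FakeLazyPage(total_cards=400), pause_seconds=0)
    cards = scraper.get_cards("https://example.com/collection", max_card_count=200)
    print(cards)
    print(len(cards))
```

Because cards arrive in batches, the loop can overshoot max_card_count: it stops at the first count that reaches or passes the limit. Without the safeguard, a collection smaller than max_card_count would keep the loop scrolling forever.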
[Screenshot: some of the card data printed out in my console, confirming the code does what I expect.]

And after verifying the right data's there (and the right amount of data as well), I continued on to extracting the data, calculating the current price in USD for each NFT, and assembling a CSV of all the data. But I'll save those steps for future blog posts.

Building a Python-based website scraper to create a CSV of NFTs available for sale on NFTrade was a unique challenge that taught me a lot of new things.

After my first attempt failed due to NFTrade lazy loading NFTs in batches of 75 as a user scrolled further down the page, I had to come up with a more creative solution that would let me trigger the site to load more cards first, then grab the data on the cards for sale.

I found the solution I was looking for with the help of a Python package called Selenium Python. Selenium Python is a powerful Python API that allows users to write scripts or automated tests leveraging Selenium WebDriver. And it was up to the task at hand: with just a few methods I was able to specify how many NFTs I wanted loaded on the page before scraping and collecting all their data at once.

Check back in a few weeks: I'll be writing more posts about the problems I had to solve while building this Python website scraper, in addition to other topics on JavaScript or something else related to web development.

If you'd like to make sure you never miss an article I write, sign up for my newsletter here: https://paigeniedringhaus.substack.com

Thanks for reading. I hope seeing how to make a Python Selenium WebDriver load data onto a dynamic webpage before scraping it comes in handy for you in the future.

Originally published at https://www.paigeniedringhaus.com.

Staff Software Engineer at Blues, previously a digital marketer. Technical writer & speaker. Co-host of Front-end Fire & LogRocket podcasts.

