This seemed like a simple project at first, but after working through a few difficulties it turned into a tough challenge and a great learning experience. Here is a tutorial I’ve created (loosely, but hopefully thoroughly enough to help you) so you can follow along and build your own!
My goal was to code a web scraper that checks whether an item on a particular site is in stock and, if it is, sends an email notification to my inbox. The item I wanted was out of stock at the time. If the item you want is currently in stock, the scraper works the same way, but you will want to modify the conditional statement that determines when to send the email.
Getting Set Up:
The first step is to install the necessary tools and frameworks to complete the project.
Note: I use Selenium here, though I also have experience with Beautiful Soup. If you want to learn more about Beautiful Soup and how to use it to scrape information from the web and place it into a data frame, check out my machine learning project here.
Install Selenium: Selenium is a framework for testing and automating websites and web applications. Further documentation and info can be found here.
Install WebDriver: First, confirm which version of Google Chrome you have. You can do this by going to Chrome > Help > About Chrome, which lists your version. Once you have it, visit https://chromedriver.chromium.org/downloads and choose the driver that matches your version. Download it and store it somewhere easily accessible, since we will need its path later.
Writing the Code:
Imports and Configurations
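Here is a minimal sketch of what this setup can look like. It assumes Selenium 4, where the driver path is passed through a `Service` object. The path, the URL, and the `make_driver` helper name are placeholders for illustration, not values from the original project.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# Placeholders (assumptions): swap in your own driver path and product page.
CHROMEDRIVER_PATH = "/path/to/chromedriver"
URL = "https://example.com/item-you-want"


def make_driver(chromedriver_path=CHROMEDRIVER_PATH):
    """Start headless Chrome using the ChromeDriver binary at `chromedriver_path`."""
    options = Options()
    options.add_argument("--headless")    # run without a visible browser window
    options.add_argument("--no-sandbox")  # sometimes needed in containers; weakens isolation
    return webdriver.Chrome(service=Service(chromedriver_path), options=options)
```

Calling `make_driver()` launches the browser, so call it once and remember to run `driver.quit()` when the script is finished.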
The code above imports the libraries and configures the web driver. Headless Chrome runs the Chrome browser without the full browser UI, which gives you a real browser context without the memory overhead of running a full version of Chrome. You can find a great walkthrough and explanation of sandbox here.
Imports aside, the rest of the code sets up the driver, and this is where you need the path you saved your ChromeDriver to. Find the path via the terminal or get it from the file itself and paste it in. Next, copy the link to the page you want to scrape and paste that in as well.
Building the Scraper
Only 10 lines? With spaces? Yep. The real work is done in the “presence of element” line. The scraper goes through the HTML of the site, finds the element with the class you asked for, and lets you know what’s there. The delay gives the page time to load before the scraper looks for the element; otherwise it would be trying to scrape information that isn’t there yet. Getting the class is fairly simple, and I hope not too bizarre!
- Navigate to the page where your item and its availability are listed.
- Right click on the text, button, etc. that displays the availability and click “Inspect”.
- Copy and paste the class = ‘…’ from your HTML item; when you click “Inspect”, it should highlight that item within the code. Be sure to add an @ in front of class = in your code. Your final line should look something like this:
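As a sketch of that locator, here is a small helper that builds the XPath string. The helper name and the class name `product-availability` are made up for illustration; use the class you copied from the Inspect panel.

```python
def class_xpath(tag, class_name):
    """Build an XPath matching `tag` elements whose class attribute equals `class_name`."""
    return f"//{tag}[@class='{class_name}']"


# "product-availability" is a hypothetical class name.
AVAILABILITY_XPATH = class_xpath("div", "product-availability")
print(AVAILABILITY_XPATH)  # //div[@class='product-availability']

# With Selenium, this string is what goes into the "presence of element" line:
# EC.presence_of_element_located((By.XPATH, AVAILABILITY_XPATH))
```

Note that `@class='...'` compares against the whole attribute value, so if the element carries several classes, paste the entire class string.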
If the text sits in an element nested inside that class, e.g. you are scraping a button and the text displaying its availability is in a span tag, you can find it by keeping your code the same and adding the specific tag you are looking for, like ‘span’ or ‘p’, directly after a // following the brackets, so your code would look something like this:
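A sketch of the nested version, again with made-up names (`nested_xpath`, `add-to-cart`). The `//` after the brackets tells XPath to look for the child tag anywhere inside the matched element.

```python
def nested_xpath(parent_tag, parent_class, child_tag):
    """Build an XPath for `child_tag` elements anywhere inside the matching parent."""
    return f"//{parent_tag}[@class='{parent_class}']//{child_tag}"


# Hypothetical example: availability text inside a <span> within a button.
BUTTON_TEXT_XPATH = nested_xpath("button", "add-to-cart", "span")
print(BUTTON_TEXT_XPATH)  # //button[@class='add-to-cart']//span
```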
Sending the Email
For me, this was honestly the tricky part. First let’s check out the code, and then I’ll walk you through the steps to ensure it works. NOTE: I used Gmail here.
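Here is a minimal sketch of the email step using Python’s built-in `smtplib`, which is one common way to send mail through Gmail and may differ from the exact code used for this project. The function names and the `GMAIL_PASSWORD` environment variable are assumptions for illustration. Depending on your account settings, Gmail may require an app password instead of your normal password.

```python
import os
import smtplib
from email.message import EmailMessage

SOLD_OUT_TEXT = "Sold Out"


def should_notify(status_text, sold_out_text=SOLD_OUT_TEXT):
    """Return True when the scraped text no longer says the item is sold out."""
    return status_text is not None and status_text.strip() != sold_out_text


def build_message(sender, recipient, status_text, url):
    """Create the notification email."""
    msg = EmailMessage()
    msg["Subject"] = "Stock alert: availability changed"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"The item now reads: {status_text!r}\n{url}")
    return msg


def send_email(msg, password):
    """Send `msg` through Gmail's SMTP server over SSL."""
    with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
        server.login(msg["From"], password)
        server.send_message(msg)


def notify_if_changed(status_text, url, sender, recipient):
    """Email only when the status changed; return whether an email was sent."""
    if not should_notify(status_text):
        return False
    # GMAIL_PASSWORD is a hypothetical environment variable holding your password.
    send_email(build_message(sender, recipient, status_text, url), os.environ["GMAIL_PASSWORD"])
    return True
```

If your item is currently in stock and you want an alert when it sells out, flip the comparison in `should_notify`.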
In my case, the conditional statement checks whether the text next to the item has changed from “Sold Out” to something else. If it has, the script sends me an email.
Before the script can send that email, though, we need to authenticate with Gmail.
Step 1: Turn on the Gmail API
Step 2: Install the Google Client Library
Check out the step-by-step Google tutorial here. If you run into trouble, check the common issues below.
- Enable access to third-party apps (you can enable your script specifically if you choose). You can do this by going to Google > Manage your Google Account > Security > Manage third-party access and choosing to enable it.
- One of the issues I ran into was following these steps and still getting a login error. This is likely because the reCAPTCHA is still enabled. Search Google for “gmail turn off reCAPTCHA” and follow the result that points to accounts.google.com.
On the page that opens, click “Continue” and then rerun your script.
Automating the Program
We’ve made it to the end! The next step is to automate the scraper to run on its own. JC Chouinard has two great tutorials that walk through exactly how to do this with the task scheduler on your local machine.
If you have Windows, go here.
If you have Mac, go here.
Alright, that’s the end of the tutorial! I hope it was helpful. If you have any questions, or if you struggled with parts I could have explained more clearly, please leave a comment or message me.
If you see mistakes or better ways to do things, please also leave a message or contact me. I’m always looking to learn more. Check me out by following the link below, and feel free to explore more of my stories!