Semalt Expert Explains How To Scrape An AJAX Website Using Python

Web scraping is a method that employs the use of software to extract data from a web page. There are lots of tools to use for scraping the web with python, some of them being; Sky, Scrapy, Requests, and Beautiful Soup. However, most of these tools are limited by the fact that they only retrieve static HTML that comes from the server and not the dynamic part rendered by JavaScript.

However, there are some techniques in which this problem can be overcome:

1. Automated Browsers

You can make use of automated browsers such as Selenium or Splash which are full browsers that run headless. However, setting them up can be quite complex, and so we will focus on the second option below.

2. Intercept AJAX calls

This involves trying to intercept the AJAX calls from the page and trying to replay or reproduce them.

In this article, we will focus how to catch AJAX calls and replay them by making use of the Requests Library and Google Chrome browser. Though frameworks like Scrapy may provide you with a more efficient solution when it comes to scraping, it is not required for all cases. AJAX calls are mostly performed against an API that will return a JSON object which the Requests library can easily handle.

The first thing you need to know is that trying to replay an AJAX call is like using an undocumented API. Therefore, you have to look at all the call the pages make. You can go to the site, play with it a while and see how some information is rendered. After you are done playing, come back and start scraping.

Before we get into the details, let us first understand how the page works. If you visit a stores page by state, select any state, and the page will render information on the store. Every time you select a state, the website renders new stores to replace the old ones. This is achieved by using, and AJAX call to a server asking for the information. Our intention now is to catch that call and replay it.

To do so, all you have to do is open the Chrome browser DevTools consoled and go to the XHR subsection. XHR is an interface that performs HTTP and HTTPS requests. Thus the AJAX requests will be shown here. When you double-click the AJAX call, you will find a lot of information on the stores. You can also preview the requests.

You will note that a lot of data is sent to the server. However, don't worry since not all of it is required. To see what data you need, you can open a console and perform various post requests to the website. Now that you know how the page works and have deciphered the AJAX call, you can write your scraper.

You may be asking, 'why not use an automated browser?' The solution is simple; always try to replay the AJAX calls before embarking on something much more heavy and complicated such as an automated browser. It is simpler and lighter.


mass gmail