Anyone involved in web scraping knows how difficult it can be to extract data from modern websites. In this blog post, I’ll show you the fastest and most reliable way I know to extract data from a website. Once you’ve mastered this technique, you’ll come to love these websites, since it’s “nearly” as fast and reliable as pulling data straight out of a database.
(Please note that this technique doesn’t work with every website, but it’s worth checking before you start building a web scraping bot.)
The solution lies in the asynchronous calls modern websites make to load data. The web server functionality that provides the data is often called a Web API, so the asynchronous calls are often referred to as Web API requests. A Web API normally provides structured data in JSON format, which is very easy to work with, and Web API requests are very fast compared to loading a full web page.
You can make the Web API requests without loading the full web page, and then just parse the returned JSON and save your data. The hardest part is figuring out how to call the Web APIs, since they’re rarely publicly documented.
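To make the idea concrete, here’s a minimal sketch in Python of calling a Web API and parsing its JSON reply. The URL, the `Products` key, and the field names are all made-up placeholders, not the real endpoint from this example — real APIs may also need extra headers or cookies to respond.

```python
import json
import urllib.request

def fetch_json(url):
    # Send the request directly (no full page load) and parse the JSON reply.
    # Some APIs require extra headers such as User-Agent or cookies.
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# A made-up stand-in for a real API response, so the parsing step is clear:
sample = '{"Products": [{"Name": "Apples", "Price": 3.5}]}'
data = json.loads(sample)
for product in data["Products"]:
    print(product["Name"], product["Price"])
```

Once the JSON is parsed, pulling fields out is just dictionary and list access — no HTML parsing, no brittle CSS selectors.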
You can use many different tools to examine how a website uses a Web API. Fiddler is a popular tool, but you can also use the developer tools that exist in most web browsers, or you can use a dedicated web scraping tool such as Content Grabber. Once you know how a website uses a Web API, you can replicate that in your own program to extract the data.
In this example, I’ll extract data from the website http://www.woolworths.com.au, which is a highly dynamic website that loads its data from a Web API.
The target website is a grocery website that has many categories of products, but to keep this example simple I’ll extract data from just one category. It’s relatively easy to extend the web scraping bot to extract data from all categories on the website.
First I’ll open the website in the Content Grabber editor and navigate to the category I’m interested in.
I can now open the browser activity screen to view all the requests that have been sent to the web server. Content Grabber even tells me if any interesting JSON data has been returned from the server.
I can view the JSON content to make sure it contains the data I’m interested in.
Now that I’m satisfied I’ve found the right Web API request, I’ll take a look at the request URL, which looks like this:
If you look closely at the URL you’ll notice a URL parameter named pageSize. This parameter controls how many products are returned by the Web API request. I want to return all products in the category in a single request, so I’ll increase this number to a value that is high enough to return all products in the category. The URL now looks like this.
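If you’re doing this in code rather than in a tool, the same trick is just a query-string rewrite. The sketch below uses Python’s standard library; the base URL here is a hypothetical placeholder, not the real Woolworths endpoint — only the `pageSize` parameter name comes from this example.

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def set_page_size(url, size):
    # Rewrite the pageSize query parameter so one request returns all products.
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query["pageSize"] = [str(size)]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))

# Hypothetical endpoint; the real URL will differ.
url = "https://example.com/api/products?categoryId=123&pageSize=36"
print(set_page_size(url, 1000))
# https://example.com/api/products?categoryId=123&pageSize=1000
```

One caveat: some APIs cap the page size server-side, so it’s worth checking that the response actually contains the full product count.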
All I need to do now is load the URL into a JSON Parser in Content Grabber, so I’ll change the Content Grabber browser type from Dynamic Browser to JSON Parser. You can choose the browser type from the Agent Settings menu.
I get a nice view of the JSON data available, and creating the bot is an easy point-and-click process.
The bot I’ve created extracts 185 products in 8 seconds, and 5 of those seconds are spent initializing the agent, so the actual extraction takes only a couple of seconds. Doing the same job with a full-featured web browser would easily take 10-20 times as long.
Since I’m using Content Grabber, I can easily save my extracted data almost anywhere with the click of a button, including databases, CSV, Excel, XML, JSON and PDF. Here’s what the data looks like in Excel.
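If you’re scripting the export yourself instead of using a tool, flattening the parsed JSON into CSV takes only a few lines with Python’s standard library. The product records and field names below are hypothetical stand-ins for whatever the real API returns.

```python
import csv
import io

# Hypothetical product records, as they might come back from the Web API.
products = [
    {"Name": "Apples", "Price": 3.5},
    {"Name": "Bread", "Price": 2.0},
]

# Write to an in-memory buffer here; swap in open("products.csv", "w", newline="")
# to write a real file.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Name", "Price"])
writer.writeheader()
writer.writerows(products)
print(buf.getvalue())
```

The resulting CSV opens directly in Excel, which gives you roughly the same end result as the one-click export described above.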
Extracting data from a modern website can be a nightmare, but it doesn’t have to be. I’ve shown how you can hook up directly to a Web API and get rich data fast and reliably. You don’t even have to get your hands dirty with programming if you use the right web scraping tool.
If you have any questions, please use the comment section below or contact us here.