How to scrape a website - Part 1
In this tutorial we will go through how to turn a website into a nicely formatted API. We're going to pretend that we're building an app and need some player statistics from nhl.com. Since nhl.com doesn't have an API, we need to extract the information from their website. One option would be to programmatically make an HTTP request using curl (or a similar tool) and then parse and validate the HTML response that is returned. Here we will show you how APFy.me can be used to transform the original HTML into something much more useful.
Your application will still make an HTTP request, but instead of calling the source site directly you call APFy.me. Parsing and validation are handled there, and you get JSON or XML back, containing only the data you need and structured the way you want it.
All you need to follow this tutorial is a web browser and some basic understanding of XML and XSLT. We will create and test our API method using the APFy.me playground, so there is no need to write any code at this stage.
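To make the client side concrete, here is a minimal Python sketch of an application calling an API method instead of the source site. The endpoint URL and the JSON shape ({"stats": {"player": [...]}}) are assumptions made for illustration; the real URL and payload depend on the API method you end up creating.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint, used only for illustration.
API_URL = "https://api.example.invalid/nhl/stats"


def parse_players(body):
    """Return the list of player records found in a JSON response body.

    Assumes a payload shaped like {"stats": {"player": [...]}}.
    """
    payload = json.loads(body)
    players = payload.get("stats", {}).get("player", [])
    # A lone player may arrive as a single object rather than a list.
    if isinstance(players, dict):
        players = [players]
    return players


def fetch_players(url=API_URL, timeout=10.0):
    """Request the API method and return its player records."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return parse_players(response.read().decode("utf-8"))
```

The application only deals with parsed records here; it never touches the HTML of the source site.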
You can see the final result of the tutorial here: API for www.nhl.com
1. Fetch the source data
Fetching the source URL retrieves the HTML and converts it to valid XML so that we can manipulate it using XSLT in the following steps. It is also possible to pass parameters when fetching the page. We will examine the source URL and add some parameters later in this tutorial, but for starters we just fetch the default page without filters or sorting.
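APFy.me performs the HTML-to-XML conversion for you. Purely to illustrate what "valid XML" means here, the following Python sketch (standard library only, and not APFy.me's actual implementation) closes void tags such as br and any tags the page left open. It assumes the document has a single root element, normally html, as its first tag.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HtmlToXml(HTMLParser):
    """Build a well-formed element tree from loosely written HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []
        self.started = False

    def handle_starttag(self, tag, attrs):
        if self.started and not self.open_tags:
            return  # ignore markup after the root element has closed
        self.started = True
        # Attributes without a value (e.g. "disabled") get an empty string.
        self.builder.start(tag, {key: value or "" for key, value in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag
        # Close every tag that was left open inside this one.
        while self.open_tags:
            current = self.open_tags.pop()
            self.builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        if self.open_tags:
            self.builder.data(data)

    def to_element(self):
        if not self.started:
            raise ValueError("no elements found")
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def html_to_xml(html):
    """Return the root element of a well-formed tree for the given HTML."""
    parser = HtmlToXml()
    parser.feed(html)
    parser.close()
    return parser.to_element()
```

Once the markup is well-formed, every cell of the statistics table can be addressed with an XPath expression, which is what the XSLT template relies on.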
2. Define the output layout
It's always easier to configure an API method if we know what we want the output to look like. So before doing anything else, we take a look at the data that exists on the source site and decide how we would like this data returned to us.
We start with a root node, stats, whose attributes describe the current page, the total number of pages and the rating span shown on this page. Each player is represented by a player element. We add the rank as an attribute together with the player id. The statistics are presented as separate elements. We also wrap all statistics concerning goals inside a parent goals element.
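The layout described above can be sketched by building one sample record in Python. The element names stats, player and goals and the attributes rank and id come from the description; the root attribute names (page, pages, span), the other element names and all values are placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET


def build_sample_layout():
    """Return a stats element holding one placeholder player."""
    # Root node with attributes describing the page (names are assumed).
    stats = ET.Element("stats", {"page": "1", "pages": "30", "span": "1-30"})

    # One player element; rank and id are attributes.
    player = ET.SubElement(stats, "player", {"rank": "1", "id": "101"})
    ET.SubElement(player, "name").text = "Example Player"
    ET.SubElement(player, "assists").text = "12"

    # All goal-related statistics are grouped under a parent goals element.
    goals = ET.SubElement(player, "goals")
    ET.SubElement(goals, "total").text = "10"
    ET.SubElement(goals, "powerplay").text = "3"
    return stats


if __name__ == "__main__":
    root = build_sample_layout()
    ET.indent(root)  # pretty-print; available from Python 3.9
    print(ET.tostring(root, encoding="unicode"))
```

Running the script prints the indented XML, which is a convenient target to compare against while writing the template.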
3. Create the template
3.1. Get the base template
Click Get base template, tick the following boxes and click Insert template:
- Exslt Regular Expressions
- Exslt Sets
Now we have a base template and can start extracting the data we want.
3.2. Add our root node and attributes
We start by adding the root node, a general setting that strips spaces from all element values, and a variable that will make the XPath expressions that follow simpler.
We also add another template that retrieves the informational attributes of the root node. These could of course be added directly in the main template, but putting them in a separate template keeps things tidier. We pass a reference to the table footer node as a parameter to the new template, which makes it easy to navigate to the values we want.
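The tutorial does this step in XSLT with EXSLT regular expressions. As a Python analogue of the same idea, the sketch below takes the table footer node and derives the root attributes from its text. The footer wording ("31-60 of 879 results"), the page size of 30 and the attribute names are all assumptions; the real source page may present these values differently.

```python
import re
import xml.etree.ElementTree as ET

# Assumed footer wording such as "31-60 of 879 results".
FOOTER_PATTERN = re.compile(r"(\d+)\s*-\s*(\d+)\s+of\s+(\d+)")


def root_attributes(footer, page_size=30):
    """Derive the attributes of the stats root node from the table footer."""
    # Collapse whitespace, similar in spirit to stripping spaces in the XSLT.
    text = " ".join("".join(footer.itertext()).split())
    match = FOOTER_PATTERN.search(text)
    if match is None:
        raise ValueError(f"unrecognized footer text: {text!r}")
    first, last, total = (int(group) for group in match.groups())
    return {
        "page": str((first - 1) // page_size + 1),
        # Ceiling division: a partly filled last page still counts.
        "pages": str((total + page_size - 1) // page_size),
        "span": f"{first}-{last}",
    }
```

Passing the footer node in as a parameter keeps the lookup local, in the same way the separate XSLT template receives the footer reference.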
Once this is added you can click Transform to see what the output looks like right now. This will give you the root element, stats, with attributes containing information fetched from the source URL.
3.3. Get the player statistics
Next we want to add the actual player statistics to our output. Start by adding a for-each loop and paste the player element layout inside it. Then for each element we just fetch the corresponding data from the source.
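The for-each loop can be pictured with the following Python analogue, which walks the table rows and fills in one player element per row. The column order (rank, name, goals, power-play goals, assists), the link format used to find the player id, and the output element names other than player and goals are assumptions made for this sketch.

```python
import re
import xml.etree.ElementTree as ET


def cell_text(cell):
    """Return the text of a cell with surrounding whitespace removed."""
    return "".join(cell.itertext()).strip()


def build_players(tbody):
    """Create one player element per table row."""
    players = []
    for row in tbody.findall("tr"):
        cells = row.findall("td")
        # Assumed: the name cell links to the player page and the
        # first number in that link is the player id.
        link = cells[1].find("a")
        href = link.get("href", "") if link is not None else ""
        id_match = re.search(r"\d+", href)
        player = ET.Element("player", {
            "rank": cell_text(cells[0]),
            "id": id_match.group() if id_match else "",
        })
        ET.SubElement(player, "name").text = cell_text(cells[1])
        goals = ET.SubElement(player, "goals")
        ET.SubElement(goals, "total").text = cell_text(cells[2])
        ET.SubElement(goals, "powerplay").text = cell_text(cells[3])
        ET.SubElement(player, "assists").text = cell_text(cells[4])
        players.append(player)
    return players
```

Each output element is filled from exactly one source cell, so a change in the source table only affects the line that reads that cell.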
Click Transform again to see the output. This is actually all that is needed to create an API method. From here we could click Create API from Playground and celebrate that we managed to create an API!
How to scrape a website - Part 2
In Part 2 of this tutorial we go through a few simple steps to make our API even better and more useful by changing our template and adding some validation. We will also add query parameters to filter, sort and fetch specific pages. Last but not least, we will learn how to test the API to make sure it works as it should.
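As a preview of what query parameters mean for a client, here is a small Python sketch that appends them to an API URL. The endpoint and the parameter names (page, sort, team) are hypothetical; the actual names are whatever you define for your API method.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_params(base_url, **params):
    """Return base_url with the given query parameters added or replaced."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    # Skip parameters that were not set by the caller.
    query.update({key: str(value) for key, value in params.items()
                  if value is not None})
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, with_params("https://api.example.invalid/nhl/stats", page=2, sort="goals") produces a URL ending in ?page=2&sort=goals.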