How to scrape a website - Part 2
In Part 1 of this tutorial we went through the basics of configuring an API based on data from a website.
Here in the second part we will go through how to improve the XSL template and how to add validation, both for the request parameters and for the output.
1. Improving the template
1.1. Format the numbers
We already have something that would serve quite well as an API method, but we're going to improve it a little. First we want the numeric fields to contain only numbers, not the formatted version found in the source document. This ensures the consuming application can use the data as numeric values directly, without having to parse a string. To clean the numeric fields we define a simple regex pattern that strips a numeric string down to its digits.
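As an aside, if your template runs on plain XSLT 1.0 without regex extensions, the same cleanup can be done with the classic double-translate idiom. The element name and column index below are placeholders, not the actual template:

```xml
<!-- Keep only the digits of a formatted number such as "1,234".
     The inner translate() returns every non-digit character found in $raw;
     the outer translate() then strips those characters from $raw. -->
<xsl:variable name="raw" select="td[3]"/>
<Goals>
  <xsl:value-of select="translate($raw, translate($raw, '0123456789', ''), '')"/>
</Goals>
```

This avoids any dependency on extension functions, at the cost of being digits-only (it drops decimal separators too), so it suits integer columns like goals and assists.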
Click Transform again to make sure we have clean numbers.
1.2. Smarter column references
Now we'll make one more adjustment to make our method less vulnerable to changes in the source document. Right now we're referencing the data columns by their position, so our method will break if any column is moved, added or removed. For example, if the Goals and Assists columns swapped places, our method would still return what appears to be correct data, but the values for Goals and Assists would be mixed up. To make it a bit more stable, we use the table headers as the reference for fetching the correct column: we add variables that find each column's position based on its header text, and when fetching the data we use those variables instead of static indexes.
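A sketch of what this can look like in the template, assuming a table layout and the header text "Goals" (both placeholders for the real source page):

```xml
<!-- Find the 1-based index of the "Goals" column by counting the header
     cells that precede it, instead of hard-coding the column number. -->
<xsl:variable name="goalsCol"
  select="count(//table/tr[1]/th[. = 'Goals']/preceding-sibling::th) + 1"/>

<!-- Later, inside the row template, fetch the cell by that index. -->
<Goals>
  <xsl:value-of select="td[position() = $goalsCol]"/>
</Goals>
```

The same pattern is repeated with one variable per column we care about.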
Columns can now change place in the source document, and new columns can be added, without causing any trouble. Of course, we are now more vulnerable to text changes in the column headers instead, but we assume those are less likely to happen.
2. Validating the output
Our API doesn't really need anything more to work, but we can do one more thing to make it more robust and usable: add XSD validation. With it we can detect changes, or scenarios we didn't take into account when defining the API method. When we detect a change we can go to APFy.me and correct it, without having to make any alterations to the consuming application. Writing XSDs can be a bit tedious, so to speed things up we have added an Infer XSD button that gives you a base for validation. This can be further improved by adding null restrictions, pattern validations and so on, to make sure the data that is returned is valid.
We're going to adjust the inferred XSD a little to suit our needs better. We will do the following:
- Change all xs:unsignedByte to xs:unsignedShort. Since we were on page 1 when inferring the XSD, some values were low and were therefore interpreted as bytes.
- Add some custom types to validate specific conditions.
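As a sketch, the adjustments above could look like this in the schema. The element name and the custom type are illustrative, not copied from the actual inferred XSD:

```xml
<!-- A custom simple type that only accepts digits, usable for any
     column we cleaned with the regex pattern earlier. -->
<xs:simpleType name="numericString">
  <xs:restriction base="xs:string">
    <xs:pattern value="[0-9]+"/>
  </xs:restriction>
</xs:simpleType>

<!-- A numeric element widened from the inferred xs:unsignedByte to
     xs:unsignedShort so larger season totals still validate. -->
<xs:element name="Goals" type="xs:unsignedShort"/>
```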
3. Adding parameters
As you can see on the source site, we can filter the list on some values and sort it by the columns. There is also a parameter for paging. All of these are passed to the source site via the query string.
If you click Next on the source site you will see that pg=2 is added to the query string, and when you click any of the column headers a sort parameter appears. For the filters, you can inspect the drop-downs: the element names give you the parameter names, and the value attributes show the possible values.
3.1. Validating the parameters
If we click Create API from the Playground, all of our settings are sent to the page for adding a new API method. Here we can also define how we would like to validate parameters passed to the method. In our example, some values alter the layout of the table, so we want to make sure these are never passed to our method; Position=Goalie, for example.
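If the parameter validation accepts a regular-expression pattern, a whitelist that rejects Goalie could look like the pattern below. The allowed values are assumptions for illustration, not taken from the actual source site:

```
^(Center|LeftWing|RightWing|Defense)$
```

Whitelisting the values you know are safe is generally more robust than blacklisting the ones you know are not, since new problematic values are rejected by default.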
4. Testing the API
Since APFy.me acts as a proxy and accepts HTTP requests just like any other web site, you can use the Playground to test the API method and inspect the response. From the admin pages you can click Test method, and from the public page of the method definition you can click Test. Either takes you to the Playground with the API definition pre-filled, together with your API key (if you're logged in). Once there, just set the parameter values you want to test and click Fetch.
You can also use curl to test the API method; an example can be found in the public API method definition. Just copy the example (remove any "\" characters, they are only there to show that everything belongs on a single line) and add your API key. To check the JSON response, do the same but also add the x-apfy-accept header, set its value to application/json and run again.
- To learn more about XML, XSLT, XPath and XSD you can have a look at w3schools
- To learn more about the extension methods
- EXSLT - http://exslt.org/
- To learn more about Regular Expressions
- More on APFy.me
Go back to Part 1 of this tutorial if you need to refresh your memory of what we did there.