Sometimes the hardest, or at least most time-consuming, part of data journalism is getting your hands on the data. This is particularly the case when you’re dealing with organisations and government departments that are not overly keen to have their data scrutinised. Take, for example, the South African crime statistics released every year by the South African Police Service (Saps). For reasons best known to Saps itself, the crime statistics are released as PDFs, literally hundreds of them.
PDFs are the antithesis of useful for a data journalist but in this case assume you have no alternative but to download each PDF. If you take a look at the crime stats section on the website you’ll see that there is a separate PDF for each station in each province. That’s a good couple of hours of downloading.
Just the PDFs
The alternative is to find a tool for downloading just the files of a particular format from a website. There are many tools around but, being an open source fan, my preferred tool is Wget.
Wget is open source, runs on most platforms (including Windows) and is a command line tool, which sounds like a pain but is actually relatively easy to use. Wget’s speciality is downloading files from websites in bulk, including downloading entire websites automatically. In this case it can, very usefully, download all the files of a particular format on a website.
To do this, I open a terminal window and type in the following:
wget -r --no-parent -A .pdf http://www.saps.gov.za/statistics/reports/crimestats/2012/crime_stats.htm
It looks like a mouthful but it’s actually pretty easy to understand if you know what the various bits mean:
- wget obviously calls the program.
- The -r tells wget to follow links recursively to find all the appropriate files. Without the -r wget will just retrieve the file you list in the command.
- The --no-parent flag tells wget to only look in directories below the one you specify. By giving this option we restrict wget’s search to the …/crimestats/2012/ directory. Without it we’d end up downloading all the PDFs on the Saps website, which is not what we want.
- The -A .pdf tells wget that we only want the files with a .pdf extension. This could, of course be changed to something like .xls or .doc to find other files.
- Finally, we tell wget where to start looking: in this case, the page listing the links to the crime statistics for 2012.
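Putting the pieces together, a minimal sketch of the same command adapted for a different file type might look like this. The URL is the 2012 crime-stats page from above; building the command in a variable first lets you inspect it before running it:

```shell
# Starting page from the article (the 2012 crime stats listing)
url="http://www.saps.gov.za/statistics/reports/crimestats/2012/crime_stats.htm"

# Same recipe, but accepting Excel files instead of PDFs:
#   -r           follow links recursively
#   --no-parent  stay below the .../crimestats/2012/ directory
#   -A .xls      accept only files ending in .xls
cmd="wget -r --no-parent -A .xls $url"

# Print the command so it can be checked before executing it with: eval "$cmd"
echo "$cmd"
```

Swapping the -A argument is all it takes to target .doc, .csv or any other extension.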
The result (it takes a good few minutes to run) is a complete set of the PDFs from the crime stats website.
As useful as this is, if you’re downloading large volumes of data this way it’s important to be considerate of the server you’re connecting to. Don’t pull everything from a website unless you’re absolutely sure you need it. You can also limit the download rate to reduce the load on the server by adding the option --limit-rate=20k, which drops the download rate to 20 kilobytes a second. Just change the 20k to something appropriate.
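As a sketch, the polite version of the full download combines the earlier flags with rate limiting; adding --wait (a standard wget option) to pause between requests is a further courtesy, though the article only mentions --limit-rate:

```shell
url="http://www.saps.gov.za/statistics/reports/crimestats/2012/crime_stats.htm"

# Throttled variant of the recursive PDF download:
#   --limit-rate=20k  cap bandwidth at 20 kilobytes per second
#   --wait=2          pause two seconds between retrievals
cmd="wget -r --no-parent -A .pdf --limit-rate=20k --wait=2 $url"

# Inspect before running with: eval "$cmd"
echo "$cmd"
```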
There are many other options available for wget. To read more about them, run man wget on your system or read the manual online.
If you’re looking for specific crime statistics for provinces and stations then you’re probably better off looking at the excellent Crime Hub run by the Institute for Security Studies than trying to download hundreds of PDFs.
This same technique could be used to download all the images from a website, all the Excel files or even all the videos.
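To illustrate that last point, here is a hedged sketch for images: the -A flag accepts a comma-separated list of extensions, so one command can cover several formats at once. The URL below is purely hypothetical, a stand-in for whatever site you’re targeting:

```shell
# Hypothetical starting page -- replace with the real site you want
url="http://example.com/gallery/"

# Accept several image extensions in one pass; -A takes a
# comma-separated list, so no separate runs are needed.
cmd="wget -r --no-parent -A jpg,jpeg,png,gif $url"

# Inspect before running with: eval "$cmd"
echo "$cmd"
```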