Data journ: Sifting through a website to find files of a specific format

Sometimes the hardest, or at least most time-consuming, part of data journalism is getting your hands on the data. This is particularly the case when you’re dealing with organisations and government departments that are not overly keen to have their data scrutinised. Take, for example, the South African crime statistics released every year by the South African Police Service (Saps). For reasons best known to Saps itself, the crime statistics are released as PDFs, literally hundreds of them.

PDFs are the antithesis of useful for a data journalist but in this case assume you have no alternative but to download each PDF. If you take a look at the crime stats section on the website you’ll see that there is a separate PDF for each station in each province. That’s a good couple of hours of downloading.

Just the PDFs

The alternative is to find a tool for downloading just the files of a particular format from a website. There are many tools around but, being an open source fan, my preferred tool is Wget.

Wget is open source, runs on most platforms (including Windows) and is a command line tool, which sounds like a pain but is actually relatively easy to use. Wget’s speciality is downloading files from websites in bulk, including downloading entire websites automatically. In this case, very usefully, it can download all the files of a particular format on a website.

To do this, I open a terminal window and type in the following:

wget -r --no-parent -A .pdf http://www.saps.gov.za/statistics/reports/crimestats/2012/crime_stats.htm

It looks like a mouthful but it’s actually pretty easy to understand if you know what the various bits mean:

  • wget obviously calls the program.
  • The -r tells wget to follow links recursively to find all the appropriate files. Without the -r wget will just retrieve the file you list in the command.
  • The --no-parent flag tells wget to only look in directories below the one you specify. By giving this option we restrict wget’s search to the …/crimestats/2012/ directory. Without it we could end up downloading all the PDFs on the Saps website, which is not what we want.
  • The -A .pdf tells wget that we only want the files with a .pdf extension. This could, of course, be changed to something like .xls or .doc to find other files.
  • Finally we tell wget where to start looking: in this case, the page listing the links to the crime statistics for 2012.

The result (it takes a good few minutes to run) is a complete set of the PDFs from the crime stats website.
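
If you find yourself doing this for different file types, the command can be wrapped in a small shell function. The following is only a sketch: fetch_by_type is a name made up for illustration, and it assumes wget is installed and on your path. It relies on the fact that -A accepts a comma-separated list of suffixes, so "xls,doc" fetches both kinds of file in one run.

```shell
#!/bin/sh
# fetch_by_type is a hypothetical helper name, not part of wget.
# Usage: fetch_by_type EXTENSIONS START_URL
fetch_by_type() {
    extensions=$1   # comma-separated suffixes, e.g. "pdf" or "xls,doc"
    start_url=$2    # page where wget starts following links

    # Same flags as the command above: recurse, stay below the
    # starting directory, keep only files matching the suffixes.
    wget -r --no-parent -A "$extensions" "$start_url"
}

# Example call, commented out so nothing downloads by accident:
# fetch_by_type pdf http://www.saps.gov.za/statistics/reports/crimestats/2012/crime_stats.htm
```

Wget saves what it finds in a folder named after the website, inside whichever directory you ran the command from.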

Some considerations

As useful as this is, if you’re downloading large volumes of data this way it’s important to be considerate of the server you’re connecting to. Don’t pull everything from a website unless you’re absolutely sure you need it. You can also limit the download rate to reduce the load on the server. You can do this by adding the option --limit-rate=20k, which would cap the download rate at 20 kilobytes a second. Just change the 20k to something appropriate.
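
As a rough sketch of a more considerate download, the rate limit can be combined with a pause between requests. The --wait and --random-wait options are my additions here rather than part of the original command, the two-second pause is an arbitrary choice, and polite_fetch is a made-up name.

```shell
#!/bin/sh
# polite_fetch is a hypothetical helper name, not part of wget.
# Usage: polite_fetch EXTENSIONS START_URL [RATE]
polite_fetch() {
    extensions=$1       # comma-separated suffixes, e.g. "pdf"
    start_url=$2        # page where wget starts following links
    rate=${3:-20k}      # falls back to the 20k used in the text

    # --limit-rate caps the download speed.
    # --wait=2 pauses about two seconds between retrievals (an assumed value);
    # --random-wait varies that pause so requests are less regular.
    wget -r --no-parent -A "$extensions" \
        --limit-rate="$rate" \
        --wait=2 --random-wait \
        "$start_url"
}

# Example call, commented out so nothing downloads by accident:
# polite_fetch pdf http://www.saps.gov.za/statistics/reports/crimestats/2012/crime_stats.htm 50k
```

With hundreds of files the pauses add up, so expect the run to take noticeably longer than the unthrottled version.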

There are many other options available for wget. To read more about those you can look for the manual package in your version of wget or read the manual online.

If you’re looking for specific crime statistics for provinces and stations then you’re probably better off looking at the excellent Crime Hub run by the Institute for Security Studies than trying to download hundreds of PDFs.

This same technique could be used to download all the images from a website, all the Excel files or even all the videos.

This is why big media is broken

If you have anything to do with big media you’ll have heard, no doubt repeatedly, how the internet is killing the news business; how a rag-tag bunch of internet lowlifes are ruining the business for everyone. At least that’s the concept that big media would like to have us believe.

But is it true?

Last night South African swimming sensation Chad le Clos scored an historic win in the Olympic 200m butterfly event to take home the country’s second gold. After his win, Chad’s emotional father gave a BBC interview that has widely been described as the “media moment” of the games so far. In the interview Bert le Clos openly showed all the emotion that goes with an achievement this significant. It’s a wonderful moment.

If you’re connecting to the internet from outside the UK, however, you won’t be able to watch the interview. Visit the page on the BBC site and you’ll be told that the video is only available to UK residents.

However, if you visit any of a number of other sites such as Deadspin you can watch the entire interview, no matter where you are in the world.

So, BBC records a superb interview but then restricts who can watch it? Clearly the BBC is happy that all the readers that would have visited their site are instead going elsewhere to watch their content.

I’m sure there are all sorts of limitations, restrictions and protocols involved, and the BBC will happily tell us about those to explain away the situation. What is abundantly clear, however, is how significantly out of touch with the internet many of these larger media organisations are.

I don’t really care why they don’t want to or can’t show the interview to me in South Africa, I just want to watch it. And so do thousands of others.

We will find it elsewhere and we won’t be watching it on the BBC site.

Whose fault is that?

 

Making data science fun

I’m a geek at heart. I may work in the world of media but I suspect I’m more of a geek than I ever realised. This occurred to me recently while watching an episode of Triangulation, a fantastic podcast run by Leo Laporte. In this particular episode Leo and his sidekick, Tom Merritt, interviewed Jeremy Howard, creator of Fastmail and a host of other startups. In archetypal laid-back Aussie style JH described his newest venture, a platform for creating data science competitions.

What? Data science and fun? It’s a rare and, one would think, unlikely pairing of words, but a compelling one, and the ensuing discussion is fascinating. Well, at least I found it so, which is why I suspect I’m even more of a geek than I previously believed.

Take a watch of the podcast below. It’s long but intriguing.

How the Daily Mail Conquered England

“The paper’s defining ideology is that Britain has gone to the dogs.”

A fantastic piece in the New Yorker about the UK’s Daily Mail and its unprecedented success both in the UK and now in the US. Also some great insight into Paul Dacre, the DM’s long-serving editor whose grasp of what the average person actually wants to read has driven the paper forward. Love the Daily Mail or hate it, its success can’t be ignored.

http://m.newyorker.com/reporting/2012/04/02/120402fa_fact_collins?currentPage=all

Sink or swim: Digital publishers need to be bold

With 270,000 digital subscribers and close to one-third of our revenues coming from digital, the FT’s impossible plan for paid content online is now a success.

FT.com’s MD Rob Grimshaw writes in Wired that, from the FT’s experience, the market for digital news is more flexible than most imagine. Initially the FT’s plan to hide its best stuff behind a paywall was considered by many to be publishing suicide, but with an ever-expanding digital (paying) subscriber base the FT is proving its critics wrong. The key, for the FT at least, has been to focus on a “quality, not quantity” ad sales model, says Grimshaw. Of course, being in a niche market gagging for the most up-to-date financial news and insight also helps.

Similarly, Reuters and Bloomberg are in the pound seats, earning the bulk of their revenues not from paying readers but from big publishers and businesses that rely on timeous and specialised financial news.

 

400+ South African journalists on Twitter

The past month has been pretty hectic with big changes afoot on the work front (more on that shortly). As a result of the general mayhem I’ve been ignoring this blog for far too long. Time to rectify that.

In my previous post I wrote about how I (with the help of others) had set up a Google Spreadsheet to list South African journalists active on Twitter. The spreadsheet approach worked for a while but then I decided to refresh some of my PHP skills. In the process of doing that I created the Hacks List, a slightly more useful version of the spreadsheet.

That list can be found here and now lists 430+ South African journalists that are active on Twitter. Journalists that are not already on the list are free to add their names and Twitter handles.

So tell your friends and colleagues about it and let’s get as many SA journalists as possible listed here.

300+ SA journalists on Twitter and counting

UPDATE: I have moved the lists of South African journalists on Twitter to a new home. That list can be found here. You can read more about the new list here.

Last year @RayJoe and I developed a list of South African journalists on Twitter. Over the past few months this list has grown from the initial handful into a list of more than 300 working journalists on Twitter, and the list keeps on growing. The list is open to all working journalists in South Africa and is maintained as a Google Spreadsheet to make it easy for everyone to add and edit their details. So, if you’re a South African journalist that uses Twitter but you’re not on the list you can add yourself here: http://goo.gl/IXGqL

Media training in an increasingly digital age

I’m fortunate enough to be involved, albeit part-time, with the recently re-established Independent Newspapers Cadet School. Three years ago the group, which publishes titles such as The Star, Cape Times, Argus and Pretoria News, decided to revive the cadet school which had been closed down many years before because of financial constraints and a changing media environment.

In February we took in our third set of cadets, nine in total. All of them have some form of tertiary training although having a journalism degree was not a pre-requisite. In fact very few of the candidates we interviewed had been anywhere near a journalism school. What we looked for instead were raw talent, a desire to learn and a sense of news. We completed that process at the end of November last year and in February the nine cadets selected began their nine-month on-the-job training.

The media is undergoing a massive transition and the days of simply printing words on paper and distributing them are long over. Today the internet and mobile devices are pushing news provision into new arenas, many of which we as old-school journalists are still uncomfortable with. Nonetheless, there is little value in teaching cadets exclusively old-school skills while the world around them is moving rapidly towards the new.

An understanding of digital and social media is today an important part of the journalist’s toolbox and, in order to promote these skills in a practical way, we have tried to integrate a number of popular online tools into our daily work with cadets. We use a Facebook group for basic communication with cadets, such as listing timetables for the week. We also have a WordPress-based blog which we use to publish cadet work online. We have also insisted that cadets set up and manage their own blogs. These are not personal blogs: the cadets from each region (Gauteng, Western Cape and KwaZulu-Natal) jointly manage a blog for their region. These are linked from the main cadet school blog.

It’s still early days. We’re less than a month into this year’s training and no doubt have many lessons still to learn. But in this period of transition there is not a lot of time to waste making sure absolutely everything works properly and exactly as need be. It’s a time for learning on the job, even for us.

A Black Tuesday clothing hack

My Black Tuesday shirt. I’m not sure this was the original intention when @mybroadband gave me this shirt but with the addition of a small red gag I think it serves a purpose.

Support the Right2Know campaign and oppose the Protection of [State] Information Bill.

eBooks: part of a survival strategy for news organisations

It’s pretty obvious now that if news publishers want to survive this rough period they’re going to have to be smart and brave. In particular they’re going to have to look beyond their traditional markets to new ventures that draw on existing skills but tap into new income streams.

Two things that all news organisations have are writing skills and an archive of news and information. Why not combine these two and produce a series of ebooks on issues close to readers’ hearts?

The Guardian is doing exactly that with its Guardian Shorts series. The ebooks, on topics as diverse as Dr Who and the Murdoch phone hacking saga, cost between £1.99 and £3.99 and are available for the Kindle or through the iTunes store. Each book is an edited collection of Guardian coverage of a specific topic. The Phone Hacking ebook, for example, collects Guardian coverage of the issues that goes all the way back to 2005.

It’s a genius idea. The bulk of the writing is already done and, even with editing work, the books ought to be relatively easy to produce.

Clearly ebook sales aren’t going to make up entirely for ongoing newspaper losses but they are part of a strategy that builds on news organisations’ strengths and adds a new revenue stream with relatively little additional work. It also extends the organisational brand into new arenas, which can’t be all bad.

Or, as John Paton says, these new avenues may not replace the lost dollars but perhaps it’s time for news organisations to start “stacking the dimes”.