As part of some testing earlier today I was playing around with Instagram. I’ve never been a heavy user of Instagram but I was looking at setting up an account for our newsroom. Along the way I got distracted by some of the great pictures I found. To test Instagram’s embed capabilities I put together a small gallery of photos of Johannesburg, one of my favourite photographic subjects. I even indulgently included one of my own photos. Click on my name if you want to follow me.
In the world of data visualisation d3.js is the toolset of choice for most interactive journalists. But D3 also comes with a steep learning curve that makes it relatively inaccessible to the average journalist with just a small amount of coding experience.
I’ve done some very basic D3 stuff in the past but it took absolutely ages to get it working properly and I broke the visualisations more often than I improved them. And I certainly never got close to making anything that even resembled a map visualisation in all the time I spent trying to learn D3. So Mapstarter is something of a blessing.
Mapstarter makes it simple to create a basic interactive map from shapefiles as well as GeoJSON and TopoJSON files. And even if you do know how to programme your own D3 map, Mapstarter cuts the time it takes to get a map from concept to reality.
I heard about Mapstarter a couple of days ago (ht: @siyafrica) and decided to give it a spin. I downloaded a shapefile from the South African Demarcation Board website (in this case an election ward map for Johannesburg) and 10 minutes later I had a functioning map of the number of registered voters in each ward.
Mapstarter is literally that, a “starter”. Once you have the map created you can tweak the information and styles and, if you know some programming, you can build out even more impressive map visualisations.
The original map I made popped up the number of voters in each ward when you hovered over it, but it was extremely basic. So I opened up the shapefile’s database (.dbf) file in LibreOffice, added another column for label text and rebuilt the map. This version is still very basic but at least you can tell what the numbers represent.
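If you would rather script that step than edit the attribute table by hand, something like the following Python sketch could do a similar job. To be clear, this is not what I did: it assumes the ward data is in GeoJSON (which Mapstarter also accepts) rather than a shapefile, and the field names VOTERS and label, the helper add_labels and the two sample wards are all made up for illustration.

```python
import json


def add_labels(geojson, value_key, label_key="label",
               template="{value} registered voters"):
    """Return a copy of a GeoJSON FeatureCollection with a label property added.

    value_key and label_key are hypothetical field names; use whatever
    your own file calls them.
    """
    labelled = json.loads(json.dumps(geojson))  # simple deep copy
    for feature in labelled["features"]:
        props = feature.get("properties") or {}
        feature["properties"] = props
        value = props.get(value_key)
        if value is not None:
            props[label_key] = template.format(value=value)
    return labelled


# Made-up two-ward sample; a real ward file has geometry and more fields.
wards = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "geometry": None,
         "properties": {"WARD": "1", "VOTERS": 1200}},
        {"type": "Feature", "geometry": None,
         "properties": {"WARD": "2", "VOTERS": 950}},
    ],
}

labelled = add_labels(wards, "VOTERS")
print(labelled["features"][0]["properties"]["label"])
```

The function works on a copy, so the original data stays untouched if you want to try a different label template.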
Once the map is created you can download it as an SVG or image file, which you could use in Illustrator as the basis for an illustration, or you can download the code and include that in your website. That’s what I’ve done above.
So the tour is under way. Some numbers:
I’m rather pleased with this interactive graph, not because it’s particularly good but because it’s my first foray into D3.js.
So, after spending most of my long weekend glued to my computer I can finally show off something that I wrote from scratch. In all honesty it’s not particularly wonderful and not as attractive as some of the other graphs I’ve done using free tools on the web but it is a start.
If for some reason you decide to try and learn D3.js, do yourself a favour and start with this excellent tutorial by Scott Murray: Interactive Data Visualization. If it wasn’t for this excellent introduction I’m pretty sure I wouldn’t be anywhere close to where I have managed to get to.
Earlier this week I put together a small set of scripts to track the amount of attention the Oscar Pistorius trial was getting on Twitter. With not only local but also international audiences keen to follow the murder trial of the celebrity athlete it wasn’t surprising that Oscar was big on the social network. Twitter activity around the Oscar trial regularly topped 2,000 tweets an hour while court was in session and on Wednesday peaked at just under 4,000 an hour.
So, when on Wednesday the Public Protector Thuli Madonsela finally released her investigative report into Jacob Zuma’s Nkandla residence I decided to do the same to see if people cared as much about the allegations against the president as they did about the gory details of a celebrity murder trial.
The good news is that they do, and significantly so (hover over chart for details):
Using the exact same method as I used with the Oscar Pistorius trial, I tracked mentions of Nkandla on Twitter from 10am on the morning of the announcement until 1pm the following day. As we can see there were just over 500 Nkandla-related tweets an hour at 12pm, half an hour before the announcement, but that quickly spiked to just over 6,000 an hour by 2pm, just over an hour into the announcement.
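For anyone wondering how tweets-per-hour figures like these can be worked out, here is a minimal Python sketch. It is not the set of scripts behind the chart; it simply assumes you already have a list of tweet timestamps and counts them per clock hour. The function name tweets_per_hour and the sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def tweets_per_hour(timestamps):
    """Count timestamps per clock hour; returns a sorted list of (hour, count)."""
    counts = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return sorted(counts.items())


# Made-up timestamps, not real data from the Nkandla tracking.
sample = [
    datetime(2014, 1, 1, 12, 5),
    datetime(2014, 1, 1, 12, 47),
    datetime(2014, 1, 1, 13, 2),
]

for hour, count in tweets_per_hour(sample):
    print(hour.strftime("%Y-%m-%d %H:00"), count)
```

Rounding each timestamp down to the start of its hour keeps the buckets aligned with the clock, which matches how the hourly numbers above are described.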
We’re into the 11th day of the Oscar Pistorius murder trial and during a particularly long, drawn-out pre-lunch session I was messing around with a large collection of tweets that I had collected from the morning’s session. What to do with almost 10,000 tweets but make a word cloud? So here it is: a quick wordcloud generated from 9,996 tweets over the course of the morning’s session.
It’s worth mentioning that the wordcloud was created with the excellent Wordle wordcloud generator. The tweets that were used in this were collected automatically using a series of scripts I wrote over the weekend. But more on that later.
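Wordle does the counting for you when you paste text into it, but if you want to see which words dominate before making the cloud, a rough Python sketch along these lines would work. It is an illustration only, not one of the collection scripts mentioned above; the stopword list is a tiny made-up sample and so are the two tweets.

```python
import re
from collections import Counter

# A deliberately tiny, made-up stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "rt"}


def word_counts(tweets, stopwords=STOPWORDS):
    """Count words across tweets, ignoring links and stopwords."""
    counts = Counter()
    for tweet in tweets:
        # Lowercase, then blank out links so URL fragments are not counted.
        text = re.sub(r"https?://\S+", " ", tweet.lower())
        for word in re.findall(r"[a-z']+", text):
            word = word.strip("'")
            if word and word not in stopwords:
                counts[word] += 1
    return counts


# Made-up tweets for illustration.
tweets = [
    "RT The court is adjourned http://example.com/x",
    "Court resumes after lunch",
]

counts = word_counts(tweets)
print(counts.most_common(3))
```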
This week I spent some time working on an education project I have been dabbling with for a few weeks. Inevitably I eventually hit the PDF wall. The only place to get the extra data I needed to move the project forward was locked in a 276 page PDF. Even worse, the tables I wanted to pull out of the PDF were huge and spanned most of the 276 pages.
Split the PDF
Fortunately the tables were neatly divided into sections so the first thing I did was split the PDF up into smaller chunks. The easiest way to do this was to open the PDF in the Chrome browser and then “print” the pages I wanted to a new PDF. It’s pretty painless: open the PDF in Chrome, hover over the bottom right of the screen and select the printer icon. In the print dialog, switch from your standard printer to PDF and select the pages you want to print.
Convert to CSV
Next I tried every online PDF to Excel converter I could find. And each one failed, maybe because the files were still pretty big or because of the annoying layout which had column header text flipped 90 degrees. Either way it was going nowhere so I decided to give Tabula a try.
Tabula is a free, Java-based program designed specifically to “liberate data” stuck in PDFs. It’s pretty simple to install and it runs in your browser, a lot like OpenRefine.
Select and download
Once Tabula is open in your browser you need to find the PDF you want to extract data from. Upload the file and it opens in the editing window. Next, drag a rectangle around the table you want. If the table continues on the next page(s), click the “Repeat Selection” button and the selection area is duplicated across the following pages.
Scrolling down you can review the selections and either move them if they’re slightly off, resize them or even remove certain pages from the selection. When you’re done, click the download button and download the content as a CSV or TSV file.
My selections covered dozens of pages at a time but Tabula did an excellent job. There were a few relatively minor formatting issues with the final files but given that I was working with 276 pages and more than 6,000 lines of data in the final document they were relatively insignificant.
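Cleaning up after an extraction like this can also be scripted. The Python sketch below is an illustration only and assumes the problems are stray whitespace around cell values and completely empty rows, which may not match the issues in your own files. The helper clean_rows and the sample data are made up.

```python
import csv
import io


def clean_rows(rows):
    """Strip whitespace from every cell and skip rows that are entirely empty."""
    for row in rows:
        cells = [cell.strip() for cell in row]
        if any(cells):
            yield cells


# Made-up sample standing in for a CSV downloaded from Tabula.
raw = "School , Pupils\n,\n Example High ,  412 \n"

cleaned = list(clean_rows(csv.reader(io.StringIO(raw))))
print(cleaned)
```

For a real file you would pass an open file object to csv.reader instead of the in-memory string, and write the cleaned rows back out with csv.writer.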