    awards data journalism Passmark

    Media Hack’s Passmark project shortlisted for two global data journalism awards

    Posted By Alastair Otter

    Media Hack’s Passmark project has been shortlisted in two categories in the worldwide Data Journalism Awards run by the Global Editors Network. Passmark has been shortlisted in both the Data Journalism Website of the Year and the Public Choice categories.

    According to the organisers there were 630 entries from more than 50 countries in this year’s awards. The shortlisted projects were named during the International Journalism Festival in Perugia, Italy in early April.

    The Public Choice award is decided by votes from members of the public. The 11 projects shortlisted in the category can be found here. Voting is open until 15 May. You can vote for your favourite project here.

    The winners of the Data Journalism Awards will be announced on 31 May 2018 during the GEN Summit in Lisbon, Portugal.




    data journalism visualisation work

    Six State of the Nation Addresses compared

    Posted By Alastair Otter

    The 2018 State of the Nation Address (SONA) in South Africa was an event that most of the country was eagerly looking forward to. The reason was that it was to be the first SONA by the country’s newest president, Cyril Ramaphosa, following the resignation of Jacob Zuma. Amidst the ongoing “State Capture” investigations, the country was hoping Ramaphosa would offer new hope for ridding the country of corruption.

    Ramaphosa dedicated a substantial chunk of his first State of the Nation Address in Parliament, on February 16 2018, to persuading South Africans to put “the era of discord, disunity and disillusionment” behind them. “A new dawn is upon us,” he said.

    What does that mean? Working on the assumption that the number of words the president devotes to a topic is an indication of the level of importance he places on that topic, Passmark looked at the past six State of the Nation addresses and visualised the 12 topics that got the most airtime. Education, we are happy to report, has been in the top 12 in all but one year, 2016, which was, ironically, just after the #FeesMustFall protests started on campuses around the country.
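The counting idea described above can be sketched in a few lines of Python. This is only an illustration and not Passmark's actual method: the topic keyword lists, the sample text and the function name are all assumptions, and it counts keyword matches as a rough stand-in for "words devoted to a topic".

```python
import re
from collections import Counter

# Hypothetical topic keywords; a real analysis would need a far more careful list.
TOPICS = {
    "education": {"education", "school", "schools", "students", "university"},
    "economy": {"economy", "economic", "growth", "investment"},
    "corruption": {"corruption", "corrupt", "capture"},
}


def count_topic_mentions(speech: str) -> Counter:
    """Count how many words in the speech match each topic's keyword set."""
    words = re.findall(r"[a-z']+", speech.lower())
    counts = Counter()
    for word in words:
        for topic, keywords in TOPICS.items():
            if word in keywords:
                counts[topic] += 1
    return counts


# Made-up sample text, not a quotation from any address.
sample = (
    "We will invest in education and schools. "
    "Economic growth needs investment. We will fight corruption."
)
ranked = count_topic_mentions(sample).most_common()
print(ranked)
```

Running the same function over each year's speech and comparing the ranked lists would give a year-by-year view of which topics rise and fall.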

    We also compared Cyril Ramaphosa’s top 12 topics with the number of words Jacob Zuma dedicated to those same topics in his SONA speeches. To see that comparison, click the ‘compare’ button below.

    You can view the full report and visualisation here.


    data journalism visualisation work

    12 years in South Africa’s schools

    Posted By Alastair Otter

    Last week we published a new data story for our Passmark project. The story looks at the South African schooling system and how pupils progress (or don’t progress) through the system, and paints a bleak picture of the state of education in the country. “In 2005 just over 1.2 million children started grade one. In an ideal world these students would have taken 12 years to complete their schooling. In reality, less than half of them will reach grade 12 within 12 years, and even fewer of them will pass the national senior certificate (matric) exams and gain a university entrance.” You can view the full data story here.



    tutorials video tutorials

    Tutorial: Extracting data from PDFs with Tabula

    Posted By Alastair Otter

    It’s one of data journalism’s most frustrating challenges. You have a dataset which is perfect for your project, but the data is locked in a PDF that makes it almost impossible to extract. You could try and retype the data yourself, but what if the PDF is multiple pages long and there are hundreds or even thousands of rows of data? Manually copying the data is not an option.

    One of the best tools we’ve found for extracting data is Tabula, an open source, easy-to-use tool that does a better than average job of pulling recalcitrant data from PDFs. In this tutorial we look at a fairly simple example of using Tabula to do just this.

    You can find more of our tutorial videos here.


    data journalism video tutorials

    Working with importHtml: Part 2

    Posted By Alastair Otter

    Earlier this week we released the first in our video tutorial series called “Getting Started with Data Journalism”. In that tutorial we looked at using the importHtml command to import data from a Wikipedia page into a Google Sheet. In this, the second part of the tutorial, we look at some of the things you can do with the data once you’ve imported it. We’ll also look at some of the things to watch out for when working with importHtml.
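In Google Sheets the formula takes a page address, the word "table" or "list", and the position of that table or list on the page, counting from 1. For readers who prefer a script, here is a rough Python sketch of the same idea using only the standard library. It is not how Google Sheets works internally: the class and function names and the sample page are made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> in a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # all tables found so far
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_table(html: str, index: int) -> list:
    """Return the index-th table on the page, counting from 1."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[index - 1]


# Made-up page content standing in for a downloaded web page.
page = """
<table><tr><th>City</th><th>Population</th></tr>
<tr><td>Alpha</td><td>1,200</td></tr>
<tr><td>Beta</td><td>950</td></tr></table>
"""
rows = import_table(page, 1)

# In this sketch every cell is text, so numbers need cleaning before use.
populations = [int(row[1].replace(",", "")) for row in rows[1:]]
print(rows)
print(populations)
```

The cleaning step at the end is the kind of thing worth checking after any import: a column that looks numeric may still be stored as text.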

    You can find more of our tutorial videos here.



    data journalism video tutorials

    Getting started with data journalism: a video tutorial series

    Posted By Alastair Otter

    Are you keen to learn some data journalism skills but don’t know where to start? Do you want to get up to speed quickly with finding, cleaning, filtering, analysing or visualising data?

    Last week we released the first of our tutorial videos on essential skills for digital and data journalists.

    Subscribe to our YouTube channel for updates on new tutorials or subscribe to our newsletter for media/journalism news and updates on the tutorial series.

    You can find more of our tutorial videos here.

    Here at Media Hack we’ve spent the past four or five years working on data journalism, storytelling and visualisation. When we started, we knew almost nothing about these emerging fields. And there weren’t many resources, unless you were a trained technologist. Today there are so many options and resources when it comes to data journalism that it’s hard to know where to start. It can be daunting.

    Over the years we have developed an understanding that doing data journalism is not a single skill, nor is it an entire job description. Data journalism encompasses a variety of (often small) skills that make our lives as journalists easier, and make us better journalists, even if we don’t consider ourselves data journalists.

    Data journalism isn’t necessarily about producing the most impressive data visualisation. Sometimes it is just about understanding basic numbers so we can tell better stories, or being able to sort a spreadsheet to find meaningful insight. Sometimes it is about cleaning, or even creating, a large data set. But most times it is just about knowing how to find and use relevant information in an increasingly data-driven world.

    Just what you need to know

    One of the things we’ve learned over the years is that the best way to learn a new skill is to have a problem you need to solve. Our new video tutorial series looks at some of the basic skills that journalists working with data might need. In each tutorial we try and look at real-world data challenges and the most effective way to deal with them.




    data journalism Passmark visualisation work

    Hidden danger: Asbestos in Gauteng schools

    Posted By Alastair Otter

    There are more than 200 schools in Gauteng that are built partially or fully from asbestos, a construction material banned in South Africa and a known cause of cancer. Of these, 29 are built entirely of asbestos, and should be replaced according to South African Education Department regulations.

    Over the past few months Passmark, our education project, has been collecting, cleaning and analysing public records and statements by education officials to determine exactly which of the schools in the province are built of asbestos. The final investigation was published by the Times media group (online and in print) on June 20. We also published a full version on the Passmark website.

    The final report combined narrative with a range of interactive visualisations as well as a simple tool for parents and learners to look up their schools to see if they are on the asbestos list.

    The project was funded by the Taco Kuiper Fund for Investigative Journalism.



    data journalism Journalism Toolbox

    The beginner’s guide to extracting data from PDFs

    Posted By Laura Grant

    This is part 3 of an occasional series on useful tools for data journalists. You can see all the other parts in the series here: The Journalism Toolbox.

    Journalists get lots of data in PDF format – they can be tables of data that are embedded in reports, or spreadsheets that have been thoughtfully saved as PDFs before they’re emailed to you – but until you can get that data into a spreadsheet, there’s not much you can do with it.

    Luckily, there are a few great tools that can liberate your data quickly and relatively easily. I’ve listed some of the ones that I’ve used here, but there are no doubt loads more out there.

    Tabula
    I love Tabula. It’s my go-to option, firstly because it’s free, and secondly because it’s really easy to use. Its website says it was created “by journalists for journalists”, which is probably why it’s so popular with non-techie people like me. I often need to extract tables of data from biggish PDF reports. Tabula lets you upload an entire document and select just the tables you want. Depending on the layout of your document, you can convert one table at a time, or a few, into a CSV, TSV or JSON file, which you can import to Google Sheets (free), LibreOffice Calc (free), Excel (not free), or whatever program you prefer.

    The only times I don’t go straight to Tabula are when I have PDFs that have been scanned in, or when the tables I want to convert are rotated 90°. But I’ll deal with those later.

    Cometdocs
    This one is also popular with journalists – not least because IRE members get free premium membership – and it’s really easy to use. You can convert up to five documents a week for free, but you have to subscribe if you want to do more. I quite like the fact that you can subscribe for a month at a time for $9.99, but if you really like it you can get a lifetime membership for about $130. You upload or import the PDF you want to convert, click the convert button and choose between Excel and .ODS (which you can open in LibreOffice). Unfortunately .CSV isn’t an option, but if you don’t have either of those spreadsheet packages, you can upload the file to Google Drive and open it in Google Sheets. It works quickly and well, but the really nice thing about Cometdocs is that it does optical character recognition (OCR), so it can convert scanned PDFs. You need to check the converted document against the original, though, just to be sure it picked everything up correctly. Like Tabula, it can’t handle tables that are rotated.

    Adobe Export PDF
    This one’s not free, but it’s not terribly expensive either – about $24 a year. If you use Acrobat Reader, which is Adobe’s free PDF reader, Export PDF allows you to convert a PDF document that you’ve opened in the reader to Excel, Word, PowerPoint or RTF. It works well and quickly with fairly big documents. But, like Tabula, it can’t do scanned documents or rotated tables.

    Nitro Pro
    If you have a Windows machine, Nitro is a great tool for editing and converting PDFs to useful formats, but it’s not free (about $160) and the fact that it only works with Windows means it’s out of reach for me and my MacBook. I have tried it out on somebody else’s machine, though, and I was suitably impressed.

    Acrobat Pro
    This one is accessible for Mac users, but it’s also not free (about $15 a month and it requires an annual commitment).

    This UK-based company has developed software to automate PDF processing. It’s not free, but you can see what it can do by trying out its demo document converter – as long as your document is 1.5MB or smaller. You upload your PDF, tell them what you want it converted to, give them your email address and they’ll mail you the converted document.

    This is another online conversion tool where you can upload your document, choose the format you want to convert it to and it’ll email the converted document to the email address of your choice.

    Rotated tables
    Sometimes the tables in PDF documents have been rotated 90°. You need to be able to rotate the tables back to a normal orientation before any conversion tool will be able to identify them as text. Just rotating the page in Acrobat Reader or Preview, for example, won’t work. You need to rotate the table itself. To do this you need a proper PDF editor such as Acrobat Pro or Nitro Pro.

    If you have Acrobat Pro, here’s what you do:

    • If your tables are part of a larger document, open your document and, using the Organise Pages option, extract the pages with the tables you want to rotate. If you want to extract a number of consecutive pages, it’s simpler to extract them into separate files.
    • Open the page with the table on it. Go to the View menu and rotate view until your table is upright.
    • If there are headers and footers, or any other text, that are not rotated in the same direction as your table, remove them using the Edit PDF function – you need to delete them; covering them up doesn’t work.
    • Go to the Enhance Scans option and choose Recognise Text, then check the settings to make sure the option “Save as editable text and images” is selected. This may take a few minutes, and when it’s finished your table may be rotated 90° again.
    • Go back to View and rotate your page till the table is upright again. Then save your file.
    • You can try to convert your page to an Excel spreadsheet using the Export PDF function, but I find that Tabula generally does the job better.

    Always check the converted data against the original documents because sometimes 8s can be mistaken for 6s or Bs. But even if your converted document isn’t absolutely perfect, converting it this way will be much quicker than manually typing it into a spreadsheet.
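Part of that checking can be automated. The sketch below is a rough Python example, not a feature of any of the tools above: it scans a converted CSV and flags cells in columns that should be numeric but contain other characters, such as a B or a letter O. The column names and sample data are invented for illustration. It cannot catch one digit misread as another (an 8 read as a 6), so comparing totals or spot-checking against the original is still needed.

```python
import csv
import io
import re

# Digits, spaces and commas, with an optional leading minus sign and decimal part.
NUMERIC = re.compile(r"^-?[\d ,]+(\.\d+)?$")


def flag_suspect_cells(csv_text, numeric_columns):
    """Return (row_number, column, value) for cells that should be numeric but are not."""
    suspects = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column in numeric_columns:
            value = row[column].strip()
            if not NUMERIC.match(value):
                suspects.append((row_number, column, value))
    return suspects


# Invented sample of a converted table with two OCR-style mistakes.
converted = (
    "school,learners\n"
    "Alpha Primary,812\n"
    "Beta High,1 0B4\n"
    "Gamma Primary,65O\n"
)
suspects = flag_suspect_cells(converted, ["learners"])
for row_number, column, value in suspects:
    print(f"Row {row_number}, column '{column}': check '{value}'")
```

The row numbers match what you would see when the file is opened in a spreadsheet, which makes it easier to go back to the original PDF and correct each flagged cell by hand.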

    Converting scanned PDFs
    In a scanned PDF, a table will be identified as an image rather than text, so if you want to extract the data from a table you first need to convert it to text with something that has optical character recognition (OCR). You can use Cometdocs, Acrobat Pro or Nitro Pro. Acrobat Pro’s Enhance Scans tool should recognise the text in your pdf as long as the quality of the scan isn’t terrible. Sometimes it helps to save a snapshot of the table you want to extract into its own pdf before you use the Enhance Scans tool. Once the scan is converted to text and images I still save it as a pdf and convert it to a CSV with Tabula. And, of course, always check your data against the original.

    Password protected PDFs
    Sometimes PDFs are password protected so that you can’t edit them or convert them to any other format. If you have a Mac with Preview, try opening your PDF in Preview, then select the Export as PDF option under the File menu. Open the new version of your PDF and see if you’re able to convert it to a spreadsheet now.

    Do you have a favourite tool for extracting data from PDFs? Let me know. You can find me on Twitter: @laurajgrant

    If you found this useful, consider signing up for our weekly media and journalism newsletter for more tips.


    data journalism Journalism Toolbox new media tools visualisation

    A mapping toolbox for journalists: 10+ tools worth checking out

    Posted By Alastair Otter

    This is part 2 of an occasional series on useful tools for data journalists. Part 1: Want to be a data journalist? Learn these important tools or see all the other parts in the series here: The Journalism Toolbox

    Maps are one of the most popular ways to visualise data and are an easy way to add context to geographically-based datasets in stories. They can also be beautiful and offer a different view of our world.  

    So learning to make appealing and informative maps to support stories is a great skill to have.

    Over the past few years I’ve found myself making an increasing number of maps to illustrate stories. Partly that’s because I’ve always had a love of maps and partly that’s because so much of the data I’ve worked with lends itself to being visualised on a map. Over this time I’ve also experimented with dozens of mapping tools and libraries and the list below is a shortlist based on the tools I find myself using most often.

    Some of these tools are easy to use. Add some data, choose your preferred settings and you create your map. Some of them (mostly towards the end of the list) require some programming knowledge but if you’re willing to invest the time you can create some great custom data visualisations with them.

    1 – My Maps
    Ease of use: Easy
    This is a great place to start if you’re new to mapping. If you have a dataset that includes columns for addresses or GPS co-ordinates then you’re set. In My Maps create a new map, import your dataset and select the column you want to use as the point marker location and you’re done. You can also add points to the map, or draw polygons to indicate areas of interest or even add direction information to maps. It’s easy to use and versatile.

    2 – BatchGeo
    Ease of use: Easy
    BatchGeo does a pretty simple job but it does it well. Again, if you have a dataset with columns indicating address or GPS co-ordinates then you can paste this into BatchGeo and it will automatically look up these positions and add them to a map. It has fewer features than My Maps but if you’re just after a map with multiple, labelled points of interest then BatchGeo is worth taking a look at.

    3 – MapJam
    Ease of use: Easy
    MapJam is another easy-to-use mapping tool. Its maps are nicely styled and have a fairly distinctive look. Adding information and annotations to MapJam maps is straightforward. MapJam produces either flat image maps or interactive, embeddable maps.

    A Fusion Table map of the provincial crime rates in South Africa

    4 – Fusion Tables
    Ease of use: Easy/Medium
    Fusion Tables is part of the Google Drive suite of tools. At first it may not seem obvious what this tool does but it is pretty powerful once you get used to it. Fusion Tables imports most common file formats such as CSV, Excel and KML. Fusion Tables is excellent at geocoding your data, so long as you’ve got a decent location column. One of its most useful features is the ability to merge datasets. Mapping with FT is relatively simple and there are options for creating your own color buckets to illustrate your data. If you want to learn more about Fusion Tables take a look at my Fusion Tables and KML files primer.

    5 – CartoDB
    Ease of use: Easy/Medium
    CartoDB is a lot like Fusion Tables but with loads more styling options. CartoDB can use different basemap styles and can handle multiple layers of information. Creating a basic, attractive map in CartoDB is pretty easy. It imports most common formats and can geocode country-level data. CartoDB has an extensive set of tools for making really detailed maps which can take a little while to master. The ability to export datasets in multiple formats makes this a go-to tool for me. I tend to do some initial planning work in CartoDB and then export the data as geojson files for use in a mapping library like Leaflet.js.
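If you have never looked inside a geojson file, the short Python sketch below shows roughly what one contains by building it from a spreadsheet-style CSV of points. It is an illustration only and not CartoDB's export code: the function name, the column names and the sample row are assumptions.

```python
import csv
import io
import json


def csv_to_geojson(csv_text, lat_field="latitude", lon_field="longitude"):
    """Convert CSV rows with coordinate columns into a GeoJSON FeatureCollection."""
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lat = float(row.pop(lat_field))
        lon = float(row.pop(lon_field))
        features.append({
            "type": "Feature",
            # GeoJSON lists longitude first, then latitude.
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            # Every remaining column becomes a property of the point.
            "properties": dict(row),
        })
    return {"type": "FeatureCollection", "features": features}


# Sample data for illustration; the column names are assumptions.
points = "name,latitude,longitude\nJohannesburg,-26.2041,28.0473\n"
collection = csv_to_geojson(points)
print(json.dumps(collection, indent=2))
```

The longitude-first ordering is the detail that most often trips people up, because most spreadsheets and GPS readouts list latitude first.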

    6 – MapShaper.org
    Ease of use: Easy
    MapShaper is brilliantly simple. It does only a couple of things but it does them well, which makes it an essential part of my toolset. MapShaper opens most mapping formats such as shapefiles, geojson and topojson and renders those for you. This makes it really easy to quickly investigate the map data that you have. There is also an option to simplify the contour lines of your map which is essential in reducing your eventual file sizes. The inspector makes it easy to see the data that’s attached to all your map points and polygons. Maps can then be exported in a range of formats to use in other mapping tools.
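To give a feel for what line simplification does, here is a small Python sketch of one well-known technique, the Ramer-Douglas-Peucker algorithm. It is shown purely as an example of the idea and is not a description of how MapShaper works internally; the function names and sample line are made up.

```python
import math


def point_line_distance(point, start, end):
    """Perpendicular distance from point to the line through start and end."""
    (x, y), (x1, y1), (x2, y2) = point, start, end
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * (x - x1) - dx * (y - y1)) / math.hypot(dx, dy)


def simplify(points, tolerance):
    """Simplify a line with the Ramer-Douglas-Peucker algorithm."""
    if len(points) < 3:
        return list(points)
    # Find the point farthest from the line between the first and last points.
    distances = [point_line_distance(p, points[0], points[-1]) for p in points[1:-1]]
    farthest = max(range(len(distances)), key=distances.__getitem__) + 1
    if distances[farthest - 1] <= tolerance:
        # Everything in between is close enough to the straight line: drop it.
        return [points[0], points[-1]]
    # Otherwise keep the farthest point and simplify each half separately.
    left = simplify(points[:farthest + 1], tolerance)
    right = simplify(points[farthest:], tolerance)
    return left[:-1] + right


# A made-up line with one tiny wobble and one real peak.
line = [(0, 0), (1, 0.05), (2, 0), (3, 2), (4, 0)]
simplified = simplify(line, tolerance=0.1)
print(simplified)
```

A larger tolerance removes more points and gives a smaller file, at the cost of a coarser outline, which is the same trade-off you make when simplifying a map before publishing it.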

    7 – Mapstarter
    Ease of use: Medium
    Mapstarter is exactly what it sounds like. It’s a quick and easy way to turn a dataset into a visual map. Mapstarter opens shapefiles as well as Topojson and Geojson files and immediately renders them to the screen. You can then change styles such as colors and the interactive elements like mouseover info boxes. One of the useful features is the ability to edit the data in your file such as removing features. Maps can then be exported as SVG or image files. But where it really gets interesting is that maps can also be exported as basic D3.js maps. Anyone who has coded a D3.js map knows what a time saver this could be. I previously wrote up a slightly longer introduction to Mapstarter.

    A Leaflet.js-based map showing the marker cluster plugin.

    8 – Leaflet.js
    Ease of use: Medium/Hard
    Leaflet is a JavaScript library for mapping and it’s my go-to tool whenever I need something more customised than most other mapping tools offer. You do need to be able to programme to get the best out of Leaflet but once you do get the hang of it there’s no going back. Being JavaScript-based, Leaflet also works well with most other libraries, which makes the options for customising your maps endless. Combine those with this handy basemap chooser and your maps will be unique.

    9 – Mapbox
    Ease of use: Medium/Hard
    Mapbox is for when you’re looking for that completely unique map, something that no-one else has. And this is where you enter mapping geekdom. With Mapbox you can design your own maps from the ground up. Things like customising road colours, boundary lines or place names are done in Mapbox Studio. You can then save your own unique map styles and use those in other applications such as Leaflet or CartoDB. Mapbox also has its own JavaScript-based mapping library which you could use to build your own custom maps.

    10 – A few other tools
    There are tons of other mapping or map-related tools available. Some of the ones I do use include Color Brewer, which is great for finding a color scheme for your maps; QGIS, which is great for more high-end map manipulations though quite daunting at first; Tableau Public, which is great for general visualisation including maps; and Google’s Tour Builder, which is an interesting way to tell a story visually using a combination of maps and multimedia.

    This list is far from a comprehensive list of mapping tools but it is hopefully a useful starting place and overview for anyone getting into mapping.

    If you think there is a mapping tool I ought to be trying out send me an email (alastair@mediahack.co.za) or find me on Twitter (@alastairotter).

    If you found this useful, consider signing up for my weekly media and journalism newsletter for more tips.


    Data data journalism visualisation work

    How the petrol price is calculated

    Posted By Alastair Otter

    For some time now I’ve been working on some ideas around presenting the monthly fluctuation in the petrol price in an easy-to-understand and interactive form. For many people the sudden and seemingly volatile changes in the petrol price inspire thoughts of conspiracy (clearly government is fleecing us) and yet the actual changes in the petrol price are mostly rational and based on well-established principles and numbers. Obviously we, as citizenry, can complain that the various taxes applied to the price of petrol are excessive, or indeed insufficient, but the factors that affect the price of petrol month-to-month are largely out of our hands and even out of the direct hands of government (though their impact on the exchange rate is often fairly clear).

    Yesterday (May 3, 2017) the petrol price again went up, this time by 49c for 95 octane petrol in inland provinces (the usual benchmark).

    With this in mind I decided to try and complete a portion of the ongoing petrol price project and release that on its own, which I did. This portion tries to explain the major inputs and costs that are used to calculate the price of petrol at the pump. The chart does some grouping of costs to simplify it and breaks it down into the two major components: the basic fuel price on the one hand, and the various taxes, tariffs and costs on the other.
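The two-part breakdown is simple arithmetic, sketched here in Python. Every number and category name below is a made-up placeholder chosen to keep the sums easy to follow; none of them are real prices or the actual line items used in the chart.

```python
# All figures are invented placeholders in cents per litre, not real prices.
basic_fuel_price = 600
other_costs = {
    "taxes": 450,
    "tariffs": 200,
    "other costs": 100,
}


def pump_price(basic, costs):
    """Pump price is the basic fuel price plus everything added on top of it."""
    return basic + sum(costs.values())


def shares(basic, costs):
    """Percentage of the pump price contributed by each component."""
    total = pump_price(basic, costs)
    parts = {"basic fuel price": basic, **costs}
    return {name: round(100 * value / total, 1) for name, value in parts.items()}


total = pump_price(basic_fuel_price, other_costs)
breakdown = shares(basic_fuel_price, other_costs)
print(total)
print(breakdown)
```

Expressing each component as a share of the total is what makes a chart like this readable: it shows at a glance how much of a price change comes from the basic fuel price and how much from the add-ons.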

    If the embedded version below doesn’t work correctly please take a look at the full version on the petrol page.


    For the technically inclined, the majority of the graphic is built using D3.js. There’s a little bit of jQuery in there which I could probably replace with D3, and I may do that in the next iteration.

    If you have any thoughts on this, or questions, you can always find me on Twitter.
