    The beginner’s guide to extracting data from PDFs

    Posted By Laura Grant

    This is part 3 of an occasional series on useful tools for data journalists. You can see all the other parts in the series here: The Journalism Toolbox.

    Journalists get lots of data in PDF format – they can be tables of data that are embedded in reports, or spreadsheets that have been thoughtfully saved as PDFs before they’re emailed to you – but until you can get that data into a spreadsheet, there’s not much you can do with it.

    Luckily, there are a few great tools that can liberate your data quickly and relatively easily. I’ve listed some of the ones that I’ve used here, but there are no doubt loads more out there.

    Tabula
    http://tabula.technology/
    I love Tabula. It’s my go-to option, firstly because it’s free and secondly because it’s really easy to use. Its website says it was created “by journalists for journalists”, which is probably why it’s so popular with non-techie people like me. I often need to extract tables of data from biggish PDF reports. Tabula lets you upload an entire document and select just the tables you want. Depending on the layout of your document, you can convert one table at a time or several at once into a CSV, TSV or JSON file, which you can import into Google Sheets (free), LibreOffice Calc (free), Excel (not free), or whatever program you prefer.
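
    If you’re comfortable with a little Python, a CSV file can also be read with a few lines of code instead of a spreadsheet program. What follows is only a minimal sketch: the sample data, the column names and the helper names `parse_number` and `load_table` are invented for illustration and are not part of Tabula. It assumes the first row of the export holds the column headers.

```python
import csv
import io


def parse_number(text):
    """Turn a string such as '1,200' into a number; return None if it isn't one."""
    cleaned = text.strip().replace(",", "")
    if not cleaned:
        return None
    try:
        return float(cleaned) if "." in cleaned else int(cleaned)
    except ValueError:
        return None


def load_table(csv_file):
    """Read a CSV export into a list of dicts keyed by the header row."""
    return list(csv.DictReader(csv_file))


# Invented sample standing in for a CSV file exported from a PDF table.
# With a real file you would use: open("my_table.csv", newline="")
sample = io.StringIO('Province,Population\nA,"1,200"\nB,350\n')

rows = load_table(sample)
total = sum(parse_number(row["Population"]) for row in rows)
print(total)
```

    Numbers in PDF tables often come through as text with thousands separators, which is why the sketch strips the commas before adding anything up.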

    The only times I don’t go straight to Tabula are when I have PDFs that have been scanned in, or when the tables I want to convert are rotated 90°. But I’ll deal with those later.

    Cometdocs
    https://www.cometdocs.com
    This one is also popular with journalists – not least because IRE members get free premium membership – and it’s really easy to use. You can convert up to five documents a week for free, but you have to subscribe if you want to do more. I quite like the fact that you can subscribe for a month at a time for $9.99, but if you really like it you can get a lifetime membership for about $130. You upload or import the PDF you want to convert, click the convert button and choose between Excel and .ODS (which you can open in LibreOffice). Unfortunately .CSV isn’t an option, but if you don’t have either of those spreadsheet packages, you can upload the file to Google Drive and open it in Google Sheets. It works quickly and well, but the really nice thing about Cometdocs is that it does optical character recognition (OCR), so it can convert scanned PDFs. You need to check the converted document against the original, though, just to be sure it picked everything up correctly. Like Tabula, it can’t handle tables that are rotated.

    Adobe Export PDF
    This one’s not free, but it’s not terribly expensive either – about $24 a year. If you use Acrobat Reader, which is Adobe’s free PDF reader, Export PDF allows you to convert a PDF document that you’ve opened in it to Excel, Word, PowerPoint or RTF. It works well and quickly with fairly big documents. But, like Tabula, it can’t do scanned documents or rotated tables.

    Nitro Pro
    https://www.gonitro.com
    If you have a Windows machine, Nitro is a great tool for editing and converting PDFs to useful formats, but it’s not free (about $160) and the fact that it only works with Windows means it’s out of reach for me and my MacBook. I have tried it out on somebody else’s machine, though, and I was suitably impressed.

    Acrobat Pro
    https://acrobat.adobe.com/us/en/acrobat/acrobat-pro.html
    This one is accessible for Mac users, but it’s also not free (about $15 a month and it requires an annual commitment).

    Zanran
    https://pdf.zanran.com/
    This UK-based company has developed software to automate PDF processing. It’s not free, but you can see what it can do by trying out its demo document converter – as long as your document is 1.5MB or smaller. You upload your PDF, tell them what you want it converted to, give them your email address and they’ll mail you the converted document.

    Zamzar
    http://www.zamzar.com/
    This is another online conversion tool where you can upload your document, choose the format you want to convert it to and it’ll email the converted document to the email address of your choice.

    Rotated tables
    Sometimes the tables in PDF documents have been rotated 90°. You need to be able to rotate the tables back to a normal orientation before any conversion tool will be able to identify them as text. Just rotating the page in Acrobat Reader or Preview, for example, won’t work. You need to rotate the table itself. To do this you need a proper PDF editor such as Acrobat Pro or Nitro Pro.

    If you have Acrobat Pro, here’s what you do:

    • If your tables are part of a larger document, open your document and, using the Organise Pages option, extract the pages with the tables you want to rotate. If you want to extract a number of consecutive pages, it’s simpler to extract them into separate files.
    • Open the page with the table on it. Go to the View menu and rotate view until your table is upright.
    • If there are headers and footers, or any other text, that are not rotated in the same direction as your table, remove them using the Edit PDF function. You need to delete them; covering them up doesn’t work.
    • Go to the Enhance Scans option and choose Recognise Text, then check the settings to make sure the option “Save as editable text and images” is selected. This may take a few minutes, and when it’s finished your table may be rotated 90° again.
    • Go back to View and rotate your page till the table is upright again. Then save your file.
    • You can try to convert your page to an Excel spreadsheet using the Export PDF function, but I find that Tabula generally does the job better.

    Always check the converted data against the original documents because sometimes 8s can be mistaken for 6s or Bs. But even if your converted document isn’t absolutely perfect, converting it this way will be much quicker than manually typing it into a spreadsheet.
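
    Part of this check can be automated. The sketch below is just one possible approach, not a feature of any of the tools above: it flags cells in columns that should contain only numbers but contain something else, such as a B or a letter O where a digit should be. The function name `find_suspect_cells`, the column names and the sample rows are all invented for illustration. It cannot catch an 8 that was misread as a 6, because both are valid digits, so it doesn’t replace comparing the data with the original.

```python
import re

# A clean number here means: optional minus sign, digits with optional
# thousands commas, and an optional decimal part (e.g. 1,280 or -12.5).
NUMBER_PATTERN = re.compile(r"^-?\d[\d,]*(\.\d+)?$")


def find_suspect_cells(rows, numeric_columns):
    """Return (row_number, column, value) for every non-empty cell in a
    numeric column that does not look like a clean number."""
    suspects = []
    for row_number, row in enumerate(rows, start=1):
        for column in numeric_columns:
            value = (row.get(column) or "").strip()
            if value and not NUMBER_PATTERN.match(value):
                suspects.append((row_number, column, value))
    return suspects


# Invented rows, as they might look after a slightly faulty conversion.
rows = [
    {"Item": "Salaries", "Amount": "1,280"},
    {"Item": "Travel", "Amount": "1B4"},
    {"Item": "Printing", "Amount": "36O"},
]

for row_number, column, value in find_suspect_cells(rows, ["Amount"]):
    print(f"Row {row_number}, {column}: {value!r} is not a clean number")
```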

    Converting scanned PDFs
    In a scanned PDF, a table will be identified as an image rather than text, so if you want to extract the data from a table you first need to convert it to text with something that has optical character recognition (OCR). You can use Cometdocs, Acrobat Pro or Nitro Pro. Acrobat Pro’s Enhance Scans tool should recognise the text in your PDF as long as the quality of the scan isn’t terrible. Sometimes it helps to save a snapshot of the table you want to extract into its own PDF before you use the Enhance Scans tool. Once the scan is converted to text and images, I still save it as a PDF and convert it to a CSV with Tabula. And, of course, always check your data against the original.
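
    One quick way to check scanned data, when the original table prints a total, is to add up the converted column and compare the result with that printed figure. Here is a minimal sketch of the idea, assuming whole numbers with optional thousands commas; the function name `column_matches_total` and the figures are invented for illustration.

```python
def column_matches_total(values, printed_total):
    """Sum a column of whole-number strings and compare the result with
    the total printed in the original document.

    Returns a (matches, computed_sum) pair.
    """
    computed = sum(int(value.replace(",", "")) for value in values)
    return computed == printed_total, computed


# Invented figures: the original table is assumed to print a total of 1,804.
converted = ["1,280", "164", "360"]
matches, computed = column_matches_total(converted, 1804)
print(matches, computed)

# If OCR had misread 164 as 184, the mismatch would show up immediately.
misread = ["1,280", "184", "360"]
print(column_matches_total(misread, 1804))
```

    A matching total is a good sign but not proof, because two errors can cancel each other out, so it’s still worth reading through the figures.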

    Password protected PDFs
    Sometimes PDFs are password protected so that you can’t edit them or convert them to any other format. If you have a Mac with Preview, try opening your PDF in Preview, then select the Export as PDF option under the File menu. Open the new version of your PDF and see if you’re able to convert it to a spreadsheet now.

    Do you have a favourite tool for extracting data from PDFs? Let me know. You can find me on Twitter: @laurajgrant
