    Tutorial: Extracting data from PDFs with Tabula

    Posted By Alastair Otter

    It’s one of data journalism’s most frustrating challenges. You have a dataset that is perfect for your project, but the data is locked in a PDF that makes it almost impossible to extract. You could try to retype the data yourself, but what if the PDF runs to multiple pages and there are hundreds or even thousands of rows of data? Manually copying the data is not an option.

    One of the best tools we’ve found for extracting data is Tabula, an open-source, easy-to-use tool that does a better-than-average job of pulling recalcitrant data from PDFs. In this tutorial we look at a fairly simple example of using Tabula to do just this.
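Once Tabula has exported a table as CSV, the result often needs a light tidy-up before analysis. The sketch below is not part of the video; it assumes a multi-page export in which the header row repeats on each page and some rows come out blank, which may or may not match your PDF. The function name `clean_tabula_csv` and the sample text are hypothetical, and only Python's standard library is used.

```python
import csv
import io


def clean_tabula_csv(text):
    """Drop blank rows and repeated header rows from CSV text.

    Assumes the first non-blank row is the header (an assumption,
    not something Tabula guarantees).
    """
    rows = list(csv.reader(io.StringIO(text)))
    # Strip stray whitespace from every cell.
    rows = [[cell.strip() for cell in row] for row in rows]
    # Discard rows in which every cell is empty.
    rows = [row for row in rows if any(row)]
    if not rows:
        return []
    header = rows[0]
    # Keep the first header and skip any later copies of it.
    return [header] + [row for row in rows[1:] if row != header]


# Hypothetical sample standing in for a two-page export.
sample = "name,value\nalpha, 1\n,\nname,value\nbeta,2\n"
cleaned = clean_tabula_csv(sample)
```

In practice you would read the exported file's contents and pass them in as `text`, then check the cleaned rows against the original PDF, since no extraction tool is perfect.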

    You can find more of our tutorial videos here.

    Written by Alastair Otter

    Data visualisation & design, journalist, hacker.

    http://mediahack.co.za/alastair
