It’s one of data journalism’s most frustrating challenges. You have a dataset that is perfect for your project, but the data is locked in a PDF that makes it almost impossible to extract. You could try to retype the data yourself, but what if the PDF runs to multiple pages, with hundreds or even thousands of rows of data? Manually copying the data is not an option.

One of the best tools we’ve found for extracting data is Tabula, an open-source, easy-to-use tool that does a better-than-average job of pulling recalcitrant data from PDFs. In this tutorial we look at a fairly simple example of using Tabula to do just that.

