This week I spent some time working on an education project I have been dabbling with for a few weeks. Inevitably, I hit the PDF wall: the only place to get the extra data I needed to move the project forward was locked in a 276-page PDF. Even worse, the tables I wanted to pull out were huge and spanned most of those 276 pages.
Split the PDF
Fortunately the tables were neatly divided into sections, so the first thing I did was split the PDF into smaller chunks. The easiest way to do this was to open the PDF in the Chrome browser and then “print” the pages I wanted to a new PDF. It’s pretty painless: open the PDF in Chrome, hover over the bottom right of the screen and select the printer icon. In the print dialog, switch the destination from your standard printer to PDF and select the pages you want to print.
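If you have many sections to split, the same step can be scripted instead of clicked through. This is not what I did for this project; it is only a minimal sketch that assumes the third-party pypdf package is installed. The function names and the "1-3,7" page syntax (borrowed from the print dialog) are my own illustrative choices.

```python
def parse_page_ranges(spec):
    """Turn a print-dialog style string such as "1-3,7" into zero-based page indexes."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            # Page numbers are 1-based and inclusive; indexes are 0-based.
            pages.extend(range(int(start) - 1, int(end)))
        else:
            pages.append(int(part) - 1)
    return pages


def split_pdf(source_path, output_path, spec):
    """Copy the pages named in spec from source_path into a new PDF."""
    # pypdf is a third-party package (pip install pypdf). It is imported here
    # so the range helper above still works without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(source_path)
    writer = PdfWriter()
    for index in parse_page_ranges(spec):
        writer.add_page(reader.pages[index])
    with open(output_path, "wb") as handle:
        writer.write(handle)
```

A call such as split_pdf("report.pdf", "section-1.pdf", "12-48") would write pages 12 to 48 to a new file. Both file names and the page range are placeholders.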
Convert to CSV
Next I tried every online PDF-to-Excel converter I could find, and each one failed, maybe because the files were still pretty big or because of the annoying layout, which had column header text rotated 90 degrees. Either way it was going nowhere, so I decided to give Tabula a try.
Tabula is a free, Java-based program designed specifically to “liberate data” stuck in PDFs. It’s pretty simple to install and it runs in your browser, a lot like OpenRefine.
Select and download
Once Tabula is open in your browser, find the PDF you want to extract data from and upload it; the file then opens in the editing window. Next, drag a rectangle around the table you want. If the table continues on the next page(s), click the “Repeat Selection” button and the selection area is duplicated across the following pages.
Scrolling down, you can review the selections and move them if they’re slightly off, resize them, or remove certain pages from the selection. When you’re done, click the download button and save the content as a CSV or TSV file.
My selections covered dozens of pages at a time, but Tabula did an excellent job. There were a few minor formatting issues in the final files, but given that I was working with 276 pages and more than 6,000 lines of data in the final document, they were insignificant.
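Small issues like these are usually quick to tidy up with a script. The sketch below uses only Python's standard csv module. It assumes the problems are stray whitespace in cells, blank rows, and the header row repeating at the top of each page. Those are common in multi-page extractions, but they are an assumption here, not a record of what I actually had to fix, and the function names are illustrative.

```python
import csv
import io


def clean_rows(rows):
    """Strip cell whitespace and drop blank rows and repeats of the header row."""
    header = None
    cleaned = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # blank line, or a row of empty cells
        if header is None:
            header = cells  # the first non-blank row is treated as the header
        elif cells == header:
            continue  # header repeated on a later page
        cleaned.append(cells)
    return cleaned


def clean_csv_text(text):
    """Clean CSV content held in a string and return the result as a string."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(clean_rows(reader))
    return out.getvalue()
```

For a file on disk, read its contents, pass them through clean_csv_text, and write the result to a new file so that the original download stays untouched.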