Even today, PDFs remain a common source of data (for a variety of reasons).
However, extracting the relevant parts of a PDF can be tricky, especially if the data you want sits somewhere in the middle of a 200-page report full of facts, figures, illustrations and text.
The objectives of my task were:
1. Use an unstructured source of information for data analysis.
2. Extract the required data.
3. Transform, clean and load the extracted data in the relevant format for use.
Now, there could be more than one way to do this.
Charles Bordet in his blog post explains two techniques using the pdftools and tm packages in R.
In another blog post, Troy Walters explains a working example by using the tabulizer package in R.
In this article, we will use the ‘Global Peace Index Report’ as the source of our unstructured data.
We are interested in the tables on pages 10 and 11, which give the rank, country name and score, as shown in the image below.
Global Peace Index Rankings
The following is a detailed, step-by-step approach to extracting the relevant data.
Step 1: Install the necessary packages.
The first step is to install the tidyverse and tabulizer packages in R.
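As a rough sketch, the installation could look like the following. Installing tabulizer from its GitHub repository via the remotes package is an assumption on my part (it may not be available from CRAN for your R version), and the package also needs a working Java setup because it depends on rJava.

```r
# Run once. tidyverse is installed from CRAN.
install.packages("tidyverse")

# Assumption: tabulizer is installed from GitHub rather than CRAN.
# It depends on rJava, so Java must already be installed and configured.
install.packages("remotes")
remotes::install_github(c("ropensci/tabulizerjars", "ropensci/tabulizer"))

# Load the packages for the current session.
library(tidyverse)
library(tabulizer)
```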
Step 2: Extracting the required data.
The next step uses the extract_tables() function.
Its first argument is the URL of the PDF from which we want to extract the data.
The other arguments we rely on specify which pages to read and which area of each page holds a table.
Notice that pages 10 and 11 are each listed more than once in the ‘pages’ argument.
To be precise, page 10 is listed three times and page 11 twice.
This is because we are extracting the tables one at a time instead of selecting the entire ‘area’.
The ‘area’ argument is an “optional list, of length equal to the number of pages specified, where each entry contains a four-element numeric vector of coordinates (top,left,bottom,right) containing the table for the corresponding page.”
When you are interested in a specific region of a page, defining the area manually makes the extraction with tabulizer a cinch.
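To make the pairing of pages and areas concrete, here is a minimal sketch of how the call could be put together. The file path, the coordinates and the helper name extract_rank_tables are placeholders I made up for illustration; they are not the real values for the report.

```r
# Hypothetical location of the report; replace with the real URL or a local path.
pdf_file <- "path/to/global-peace-index-report.pdf"

# One entry per table: the page it sits on and its bounding box as
# c(top, left, bottom, right). These coordinates are placeholders, not the
# real positions of the tables in the report.
regions <- list(
  list(page = 10, area = c(100,  40, 700, 200)),
  list(page = 10, area = c(100, 210, 700, 370)),
  list(page = 10, area = c(100, 380, 700, 540)),
  list(page = 11, area = c(100,  40, 700, 200)),
  list(page = 11, area = c(100, 210, 700, 370))
)

# Page 10 appears three times and page 11 twice, one entry per table.
pages <- vapply(regions, function(r) r$page, numeric(1))
areas <- lapply(regions, function(r) r$area)

# Hypothetical wrapper around tabulizer::extract_tables().
extract_rank_tables <- function(file, pages, areas) {
  tabulizer::extract_tables(
    file,
    pages  = pages,
    area   = areas,          # must have the same length as `pages`
    guess  = FALSE,          # use the supplied areas, not auto-detection
    output = "data.frame"
  )
}

# Not run here, since it needs the PDF and a Java setup:
# raw_tables <- extract_rank_tables(pdf_file, pages, areas)
```

Keeping each page next to its area in one list makes it harder for the two arguments to drift out of sync when a table is added or removed.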
The locate_areas() function in tabulizer is a great way to do this interactively.
It lets you drag a box over a region of the PDF (in our case, the area around the table structure).
Once you are done, it returns the coordinates, which can be plugged into the ‘area’ argument.
See the code snippet and screenshots below.
Step 1: After running the code, open the link in a browser to mark the areas.
Step 2: Drag a box around the area of the table you are interested in and click ‘Done’.
Step 3: Check the list of coordinates generated as the output of the selected areas and use it in the arguments.
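The interactive selection described above could be sketched as follows. The file path and the wrapper name select_table_areas are hypothetical, and I am assuming that a page can be listed several times so that you draw one box per table.

```r
# Hypothetical location of the report; replace with the real URL or a local path.
pdf_file <- "path/to/global-peace-index-report.pdf"

# Hypothetical wrapper: opens the interactive widget for each requested page.
# Draw a box around one table per page shown, then click 'Done'.
select_table_areas <- function(file, pages) {
  tabulizer::locate_areas(file, pages = pages)
}

# Not run here, since it needs an interactive R session:
# areas <- select_table_areas(pdf_file, pages = c(10, 10, 10, 11, 11))
# str(areas)  # inspect the coordinates, then pass them to extract_tables()
```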
Step 3: Cleaning and transforming the required data.
The final step is to transform the extracted rows by re-ordering/re-naming/re-structuring them to get the final output.
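A minimal sketch of this clean-up, assuming every extracted table comes back as a three-column data frame (rank, country, score). The input below is made-up stand-in data, not values from the report, and clean_rankings is a hypothetical helper. It uses base R to stay self-contained; the same steps can be written with tidyverse verbs.

```r
# Made-up stand-ins for the list of data frames returned by extract_tables().
raw_tables <- list(
  data.frame(X1 = c(2, 1), X2 = c("Country B ", "Country A"), X3 = c(1.2, 1.1)),
  data.frame(X1 = 3, X2 = " Country C", X3 = 1.3)
)

# Hypothetical helper: rename, stack, fix types and sort by rank.
clean_rankings <- function(tables) {
  # Give every table the same column names so they can be stacked.
  renamed <- lapply(tables, setNames, c("rank", "country", "score"))
  combined <- do.call(rbind, renamed)

  # Fix column types and strip stray whitespace from the country names.
  combined$rank <- as.integer(combined$rank)
  combined$country <- trimws(combined$country)
  combined$score <- as.numeric(combined$score)

  # Order by rank and reset the row names.
  combined <- combined[order(combined$rank), ]
  rownames(combined) <- NULL
  combined
}

rankings <- clean_rankings(raw_tables)
print(rankings)
```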
Final Output
The above is a relatively easy-to-understand example.
Note that this method typically works when the structure of the table in the PDF is somewhat fixed.
The process could be streamlined and generalized further to suit individual needs.
Nevertheless, learning something new is always fun. Cheers!