Python for PDF

Umer Farooq
Jan 22

Table of contents

- Introduction
- Why Python for PDF processing
- Common Python Libraries
- Extracting Text from PDF
- Reading the Table data from PDF
- Export PDF into Excel
- Further Reading

Introduction

Being a high-level, interpreted language with a relatively easy syntax, Python is perfect even for those who don’t have prior programming experience.
Popular Python libraries are well integrated and offer ways to handle unstructured data sources such as PDF, turning their contents into something more structured and useful.

PDF is one of the most important and widely used digital formats for presenting and exchanging documents. PDFs can contain useful information, links and buttons, form fields, audio, video, and business logic.

Why Python for PDF processing

PDF processing falls under text analytics. Many text analytics libraries and frameworks are written in Python or are available from it, which gives Python an advantage for this kind of work. Once you have extracted the useful information from a PDF, you can easily feed that data into a machine learning or natural language processing model.
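
As a rough illustration of that last step, here is a minimal sketch that turns already-extracted text into word counts, a simple kind of feature an NLP model could consume. The `bag_of_words` helper and the sample string are hypothetical and only stand in for text pulled from a real PDF; this is one simple option, not a recommended pipeline.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


# Hypothetical stand-in for text already extracted from a PDF page.
sample_text = "Invoice total: 120 USD. Invoice date: Jan 22. Total due on receipt."

features = bag_of_words(sample_text)

# Show the most frequent tokens and their counts.
print(features.most_common(3))
```

In practice you would pass in the string returned by whichever extraction library you use, and a dedicated text analytics library would usually handle tokenization more carefully than this regular expression does.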

Common Python Libraries

Here is a list of some Python libraries that can be used to handle PDF files:

- PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data.
- PyPDF2 is a pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files, and it can retrieve text and metadata from PDFs.
- tabula-py is a simple Python wrapper of tabula-java, which can read tables in a PDF. You can read tables from a PDF and convert them into a pandas DataFrame. tabula-py also lets you convert a PDF file into a CSV, TSV, or JSON file.
- Slate is a wrapper implementation of PDFMiner.
- PDFQuery is a light wrapper around pdfminer, lxml, and pyquery. It’s designed to reliably extract data from sets of PDFs with as little code as possible.
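- xpdf is a Python wrapper for xpdf (currently just the “pdftotext” utility).

Extracting Text from PDF

First, we need to install PyPDF2:

```
!pip install PyPDF2
```

The following code extracts simple text from a PDF using PyPDF2:

```python
# import the PyPDF2 module
import PyPDF2

# PDF file object
# you can find the PDF file with the complete code in the link below
pdfFileObj = open('example.pdf', 'rb')

# PDF reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# number of pages in the PDF
print(pdfReader.numPages)

# a page object (pages are numbered from 0)
pageObj = pdfReader.getPage(0)

# extract the text from the page
# this prints the text; you can also save it into a string
print(pageObj.extractText())

# close the file once you are done with it
pdfFileObj.close()
```

These names belong to the older PyPDF2 API. In PyPDF2 3.0 and later they were replaced by PdfReader, len(reader.pages), reader.pages[0], and extract_text().

You can read more details here.

Reading the Table data from PDF

To work with table data in a PDF, we can use tabula-py:

```
pip install tabula-py
```

The following code extracts table data from a PDF using tabula-py:

```python
import tabula

# read the PDF file that contains the table data
# you can find the PDF file with the complete code in the link below
# read_pdf loads the PDF table into a pandas DataFrame
df = tabula.read_pdf("….pdf")

# print the first 5 rows of the table
df.head()
```

In tabula-py 2.0 and later, read_pdf returns a list of DataFrames by default, so take the first element (df[0]) before calling head().

If your PDF file contains multiple tables:

```python
df = tabula.read_pdf("….pdf", multiple_tables=True)
```

You can extract information from a specific part of a specific page of the PDF:

```python
# area is given as (top, left, bottom, right), measured in points
tabula.read_pdf("….pdf", area=(126,149,212,462), pages=1)
```

If you want the output in JSON format:

```python
tabula.read_pdf("….pdf", output_format="json")
```

Export PDF into Excel

You can use the code below to convert the PDF data into Excel or CSV:

```python
tabula.convert_into("….pdf", "….xlsx", output_format="xlsx")
```

Depending on your tabula-py version, convert_into may accept only csv, tsv, and json as output_format. In that case, read the table into a DataFrame with read_pdf and save it with the DataFrame's to_excel method.

Further Reading

- You can find the complete code and PDF files in this GitHub link.
- This question on StackOverflow also has a lot of useful links in its answers: How to extract table as text from the PDF using Python?
- Working with PDF files in Python using PyPDF2
- Working with PDF and Word Documents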