Data Import Mechanism of PDF using R in SQL Server Machine Learning Language

Harisha Lakshan Warnakulasuriya
4 min readApr 9, 2022

In today’s digital world, data is available in many formats such as Excel, CSV, PDF, HTML, JSON, XML, TXT. We use SQL Server Integration Services for data imports and exports SQL Server. It can read data from various data sources, databases, transform them in the required format.

You cannot import or export data in PDF format from the SSIS package directly. The PDF is a popular format for all useful documents such as agreements, contracts, policy documents, research. Generally, we use Microsoft word for preparing the documents and later converts PDF formats.

SQL Server Machine Learning Language provides various functionality in the SQL Server. You can directly run your external R script from the SQL Server console. In this 4th article on the SQL Server R script, we will use the R functionality to read and import data from a portal file format (PDF) file.

Environment details

You need the SQL Server environment as specified below in this article.

  • Version: SQL Server 2019
  • Installed Feature: R Machine learning Language
  • SQL Server Launchpad and Database engine service should be in running state
  • SQL Server Management Studio

Read data from PDF files using R Scripts SQL Server

We can use the external libraries to use the required functions in the R Scripts SQL Server. For this article, we use pdftools external library.

Launch the administrative R console from the C:\Program Files\Microsoft SQL Server\MSSQL15.INST1\R_SERVICES\bin and install the PDF Tools library using the below SQL Server R Script.

install.packages(“pdftools”)

EXECUTE sp_execute_external_script

@language = N’R’,

@script = N’

library(rio)

print (packageVersion(“pdftools”))

I download the sample PDF file from the URL. The File content is as below.

Import the library in your R script session. If we want to check the PDF files available in our current R directory, you can use the list.files() function in R Scripts SQL Server and filter the results for PDF files.

> library(pdftools)
> files <- list.files(pattern = “pdf$”)

Now, to read the PDF file content, we use the function pdf.text() in the SQL Server R script.

data <- pdf_text(“MyData.pdf”)
data

In the output, you get file contents. You can compare the output with your original PDF document to verify the contents.

The R Scripts SQL Server displays PDF text without any formatting. It does not consider any space or line break between the sentences. We might want to display all characters in a single line. We can use the strsplit() function, and it breaks down the extracted text in multiple lines using the new line character(\n).

> data<- pdf_text(“MyData.pdf”)
> Splitdata<-strsplit(data,”\n”)
> Splitdata

Let’s convert the script into the SQL Server R script format using the sp_execute_external_scripts stored procedure.

In the below script, we use the following arguments:

  • @language: Specify the machine learning language R
  • @Script: In the script section, we specify the R script that we want to execute SQL Server

EXECUTE sp_execute_external_script

@language = N’R’,

@script = N’

library(pdftools)

data<- pdf_text(“C://Temp//MyData.pdf”)

OutputDataSet = data.frame(data)

with result sets ((“Extracted Text from PDF” varchar(max)))

In the Grid format, the PDF text appears in a single block.

If you display results in the text ( Result to text — CTRL+T) in SSMS, it gives you formatted data because it reads whole data in a text format ( 1 row affected).

Similarly, you can modify the R Scripts SQL Server with the strsplit() function and split the extracted text into multiple lines.

EXECUTE sp_execute_external_script

@language = N’R’,

@script = N’

library(pdftools)

data<- pdf_text(“C://Temp//MyData.pdf”)

Splitdata<-strsplit(data,”\n”)

OutputDataSet = data.frame(Splitdata)

with result sets ((“Extracted Text from PDF” varchar(max)))

With this end of this lesson you will be able to build and understand the skill to extract data from PDF file using R as a SQL Machine Learning Language.

Good Luck!

This Lesson has been edited and gathering information and well articulated to structured subject-matter expert on,

L.P.Harisha Lakshan Warnakulasuriya

Who is well-experienced Software Engineer of field in Software/Database/Network/Business Intelligence(BI) /Business Analyst/Network and Security/System/Mobile development with subject matter expertise in Business development and Leadership practices.

Follow me LinkedIn @ Harisha Lakshan Warnakulasuriya | LinkedIn

Follow me website @ http://www.unicornprofessional.ml

Email me @ unicornprofessionalbay@gmail.com

Surf my Portfolio Website @ https://www.srilankancodingchamp.ml/

You can find out more information about my completed project in-detail.

--

--

Harisha Lakshan Warnakulasuriya

Experinced Certified Software Engineer with holding an outstanding project portfolio and Experienced Certified Markteer with experience in many industries.