Mining Electronic Documents for Fun and Profit - Raymond Camden

October 07, 2022

The good news -
PDFs are easy to store
easy to share with people
electronic docs inherently are great
accessible, easy to share, etc.

The bad news --
2.5 Trillion PDFs out there
That's a lot of data
Data is safe but "hidden" in other documents

Our goal: go from sad to happy

How do we do this?

Begin by getting "info" from the PDFs

(lots of info can be gotten from PDFs)

Getting stuff from PDFs --
Text
Styling info
tables
images

The Solution:
Adobe Document Services
APIs related to documents
created by people that REALLY know PDFs
(PDF is a very complex standard)

Adobe Document Services - PDF services
- the "catch all" for "take a document and make it a PDF"
- not much different than CFPDF
- if you're on CF, just stick with CFPDF since that's local
- if you're not happy with CFPDF, then look at the APIs
- Utility type operations - joining docs, etc.

If you're selling a PDF, can make the "snippet" version that people can download for free, just grabbing the first 3 pages, etc.

Document Generation
Allows me to take MS Word, put code in there and do things like "here's a variable for a person's name and their salary" or "if person is in CA, show this text", looping over things, etc.
can do this in MS word, simple template language
after that, call the API, and we spit out dynamic PDF docs
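The tags in the Word template are {{placeholder}} style, and the data you send along with the API call is plain JSON. A toy sketch of the idea only -- the real merge (including the conditional and looping syntax) happens server-side via the API:

```javascript
// Illustration only: a Document Generation template uses {{tags}} like these,
// and the data is plain JSON. This naive replace is NOT the real engine.
const template = 'Dear {{name}}, your salary is {{salary}}.';
const data = { name: 'Ray', salary: '$1' };
const merged = template.replace(/\{\{(\w+)\}\}/g, (_, key) => data[key]);
console.log(merged); // Dear Ray, your salary is $1.
```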

PDF Embed
Browsers do a good job of rendering PDFs now
but it's the ENTIRE web page, you've lost your navigation, branding, other context
PDF Embed is "here's a DIV, put your PDF in there" and it's a better UX
also have hooks into docs
snoop on the doc, "they selected text, they went from page 2 to page 3", etc
Lawyers: we really need you to MAKE people read a document, watch to make sure people have scrolled all the way to the end of the document, etc.
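In code, the Embed API boils down to pointing a viewer at a div. A browser-side sketch: it assumes the page has loaded Adobe's viewer.js and has a div with the id shown; the clientId is a placeholder, and AdobeDC is passed in as a parameter here only to make the shape easy to see (in a real page you'd use the AdobeDC global once the 'adobe_dc_view_sdk.ready' event fires).

```javascript
// Render a PDF into a div with the Embed API (sketch; clientId is a placeholder)
function showPdf(AdobeDC, url, fileName) {
  const adobeDCView = new AdobeDC.View({
    clientId: 'YOUR_CLIENT_ID',
    divId: 'adobe-dc-view' // "here's a DIV, put your PDF in there"
  });
  adobeDCView.previewFile({
    content: { location: { url } },
    metaData: { fileName }
  });
  return adobeDCView; // the event hooks hang off this object
}
```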

PDF Extract
uses Adobe Sensei - ML, AI, Skynet, etc.
extracts text, tables, images, styling info
will take everything possible out of a PDF
tables can be stored as CSV, XLSX, or images
extracts document structure
- not just "the text says Ray" but also "this text was an H1 header"
- will auto OCR when necessary

Details
SDKs for Node, Java, .NET, and Python
(Ben Forta has a great book on Python)
REST APIs
- can use this with CF
- free trial. 1000 calls over 6 months

Go to the Adobe PDF Services API website and click "get credentials"

Code Process --
1. get credentials (one time)
(all the doc services stuff kind of follows this same process)
2. Get the SDK (or use REST API)
3. Write code to extract from PDF xyz
4. Automate the previous step
5. That's it! Profit!

Adobe gives you the data from the doc. Interesting part comes AFTER that.

For today - using Node.js
ColdFusion - Java
Tony Junkes' GitHub repo did the work of getting the Java SDK to magically work in CF; it can be dropped into "this.javaSettings" to load in the PDF stuff.
Or just consider using the new REST API

General pseudo-code flow:
make a credentials obj
create an execution context specific to your operation (extract pdf, ocr, etc)
set your input and options
execute
save results

// Load the SDK and create credentials (one-time setup from the downloaded
// pdfservices-api-credentials.json file)
const PDFServicesSDK = require('@adobe/pdfservices-node-sdk');
const credentials = PDFServicesSDK.Credentials.serviceAccountCredentialsBuilder()
    .fromFile('pdfservices-api-credentials.json')
    .build();
// Create an ExecutionContext using credentials
const executionContext = PDFServicesSDK.ExecutionContext.create(credentials);
// Build extractPDF options
const options = new PDFServicesSDK.ExtractPDF.options.ExtractPdfOptions.Builder()
    .addElementsToExtract(
        PDFServicesSDK.ExtractPDF.options.ExtractElementType.TEXT,
        PDFServicesSDK.ExtractPDF.options.ExtractElementType.TABLES)
    .addElementsToExtractRenditions(
        PDFServicesSDK.ExtractPDF.options.ExtractRenditionsElementType.FIGURES,
        PDFServicesSDK.ExtractPDF.options.ExtractRenditionsElementType.TABLES)
    .addTableStructureFormat(PDFServicesSDK.ExtractPDF.options.TableStructureType.CSV)
    .build();
// Create a new operation instance and point it at the input PDF
const extractPDFOperation = PDFServicesSDK.ExtractPDF.Operation.createNew();
const input = PDFServicesSDK.FileRef.createFromLocalFile(
    'PlanetaryScienceDecadalSurvey.pdf',
    PDFServicesSDK.ExtractPDF.SupportedSourceFormat.pdf);
extractPDFOperation.setInput(input);
extractPDFOperation.setOptions(options);
// Execute and save the result zip (structuredData.json plus renditions)
extractPDFOperation.execute(executionContext)
    .then(result => result.saveAsFile('output/extract.zip'))
    .catch(err => console.log(err));


What it means --
documentation that talks about stuff at a high level in terms of what the JSON is reporting
also have a JSON schema
- a way to define the structure of a JSON file
Visualizer - shows parts of the PDF doc and how they relate to the parts of the JSON doc
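A hypothetical, heavily trimmed example of the structuredData.json shape (real elements carry more properties -- Bounds, Page, attributes, etc. -- so check the schema docs):

```javascript
// Trimmed stand-in for structuredData.json; the real file has far more detail.
const sample = {
  elements: [
    { Path: '//Document/H1', Text: 'Introduction', Font: { name: 'ArialMT' } },
    { Path: '//Document/P', Text: 'PDFs are everywhere.', Font: { name: 'ArialMT' } }
  ]
};
// Path gives each bit of text its structural role, so filtering is trivial:
console.log(sample.elements.filter(e => e.Path.includes('H1')).map(e => e.Text));
// [ 'Introduction' ]
```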

So now what?

Scenario - get text
useful for search engine
present text fragments to people
- grab the 1st paragraph out of a PDF, etc.

const fs = require('fs');

let data = JSON.parse(fs.readFileSync('./output/structuredData.json', 'utf8'));
let text = data.elements.reduce((text, el) => {
    if(el.Text) text += el.Text + '\n';
    return text;
}, '');
console.log(text);

...returns a "wall of text", not terribly useful by itself
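One way to tame the wall of text: use the Path values to grab just the first paragraph, per the "grab the 1st paragraph" idea above. A sketch -- the inline data here is a made-up stand-in for the parsed structuredData.json:

```javascript
// Stand-in for JSON.parse(fs.readFileSync('./output/structuredData.json', 'utf8'))
const data = {
  elements: [
    { Path: '//Document/H1', Text: 'Introduction' },
    { Path: '//Document/P', Text: 'This is the opening paragraph.' },
    { Path: '//Document/P[2]', Text: 'This is the second paragraph.' }
  ]
};
// Find the first element whose Path marks it as a paragraph (P or P[n]).
const firstPara = data.elements.find(el => el.Text && /\/P(\[\d+\])?$/.test(el.Path));
console.log(firstPara.Text); // This is the opening paragraph.
```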

Scenario - get headers
slightly better, look at the return data and see what the headers are
what are the high level topics in this doc, etc.
maybe we only want to give headers to the search engine, not ALL the text

 

const fs = require('fs');

let data = JSON.parse(fs.readFileSync('./output/structuredData.json', 'utf8'));
let text = data.elements.reduce((text, el) => {
    if(el.Path.includes('H1')) text += el.Text + '\n';
    return text;
}, '');
console.log(text);

 

Tip: any code you write, don't do the extract() over and over.
Extract once, then run code against the JSON extraction -- WAY faster.

 

Scenario - Style compliance
- look for fonts

const fs = require('fs');

let data = JSON.parse(fs.readFileSync('./output/structuredData.json', 'utf8'));
let fonts = new Set();
data.elements.forEach(e => {
    if(e.Font) fonts.add(e.Font.name);
});
console.log('List of fonts from input PDF:\n');
for(let font of fonts) console.log(font);


Scenario - text compliance
look for words we want/don't want
words that must include others

 

const fs = require('fs');

let data = JSON.parse(fs.readFileSync('./output/structuredData.json', 'utf8'));
// Gather all non-header text
let text = data.elements.reduce((text, el) => {
    if(el.Text &&
        (el.Path.indexOf('H1') === -1) &&
        (el.Path.indexOf('H2') === -1) &&
        (el.Path.indexOf('H3') === -1)
    ) text += el.Text + '\n';
    return text;
}, '');
// Flag unique all-caps words of 3+ letters as possible acronyms
let words = text.split(/\s+/);
let possibleAcro = words.reduce((matches, word) => {
    if(word.match(/^[A-Z]{3,}$/) && matches.indexOf(word) === -1) {
        matches.push(word);
    }
    return matches;
}, []);
console.log(possibleAcro);


Scenario - process tabular data
analyze the table data, etc.

 

// Same flow as before, but asking for text plus tables as CSV
const options = new PDFServicesSDK.ExtractPDF.options.ExtractPdfOptions.Builder()
    .addElementsToExtract(
        PDFServicesSDK.ExtractPDF.options.ExtractElementType.TEXT,
        PDFServicesSDK.ExtractPDF.options.ExtractElementType.TABLES)
    .addTableStructureFormat(PDFServicesSDK.ExtractPDF.options.TableStructureType.CSV)
    .build();
// Create a new operation instance.
const extractPDFOperation = PDFServicesSDK.ExtractPDF.Operation.createNew();
const input = PDFServicesSDK.FileRef.createFromLocalFile(
    inputPDF,
    PDFServicesSDK.ExtractPDF.SupportedSourceFormat.pdf);
extractPDFOperation.setInput(input);
extractPDFOperation.setOptions(options);
extractPDFOperation.execute(executionContext)
    .then(result => result.saveAsFile('output/tables.zip'))
    .catch(err => console.log(err));
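Once the result zip is unpacked, each extracted table is its own CSV file (with names like tables/fileoutpart0.csv; the exact names vary). A dependency-free sketch of crunching one -- the sample CSV here is made up:

```javascript
// Hypothetical sample standing in for a CSV pulled from the extraction zip.
const csvText = 'Year,Budget\n2020,100\n2021,150\n2022,200';
// Naive split; use a real CSV parser if cells can contain quoted commas.
const rows = csvText.trim().split('\n').map(line => line.split(','));
const header = rows.shift();
const total = rows.reduce((sum, r) => sum + Number(r[1]), 0);
console.log(`${header[1]} total: ${total}`); // Budget total: 450
```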

 

Turning it up -
ML / AI
What is the text discussing?
Who is the text discussing?
"If this doc mentions Bill Gates..."
legal clauses that might apply to us, etc.
problematic language

 

Scenario - Sentiment of a doc
flag anything that's negative, etc.
Service called Diffbot
- returns things about docs, including sentiment

 

Scenario - Facts
what does this doc talk about?
Fact != truth
If the doc says "Ray is a good dancer"
that is a fact put forth in the doc
Diffbot can check this as well.

 

Scenario - Entities
which "things" are discussed
people, places, orgs

 

Scenario - summarize
what's the gist of a doc
MeaningCloud.com service that can do this -- free trial available

 

Image Analysis
this PDF had 5 images
what's IN those images, is it problematic, etc
can flag NSFW images, etc.
Microsoft Computer Vision service
extract the images, get the images, pass those to a 3rd party service to see what's in them.

 

Resources -
Documentation
Support Forum
on Stack Overflow tags: adobe-documentgeneration, adobe-embed-api, adobe-pdfservices
Adobe Tech Blog
- "Adobe Document Cloud" section, various blog posts about PDFs in there.

 

Ray's Contact info:
Twitter: @raymondcamden
raymondcamden.com
Demos can all be found in this GitHub repo.