Mining Electronic Documents for Fun and Profit - Raymond Camden
October 07, 2022
The good news -
PDFs are easy to store
easy to share with people
electronic docs are inherently great
accessible, easy to share, etc.
The bad news --
2.5 Trillion PDFs out there
That's a lot of data
Data is safe but "hidden" inside those documents
Our goal: go from sad to happy
How do we do this?
Begin by getting "info" from the PDFs
(lots of info can be gotten from PDFs)
Getting stuff from pdfs --
Text
Styling info
tables
images
The Solution:
Adobe Document Services
APIs related to documents
created by people who REALLY know PDFs
(PDF is a very complex standard)
Adobe Document Services - PDF services
- the "catch all" for "take a document and make it a PDF"
- not much different than CFPDF
- if you're on CF, just stick with CFPDF since that's local
- if you're not happy with CFPDF, then look at the APIs
- Utility type operations - joining docs, etc.
If you're selling a PDF, you can make a "snippet" version that people can download for free -- just grab the first 3 pages, etc.
Document Generation
Allows me to take an MS Word doc, put code in there, and do things like "here's a variable for a person's name and their salary", "if the person is in CA, show this text", looping over things, etc.
can do this in MS Word with a simple template language
after that, call the API and it spits out dynamic PDF docs (rough sketch below)
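No template code was shown in the session, but a hypothetical sketch of the idea follows: the Word doc carries simple merge tags, and you send it to the API along with a JSON payload. The tag syntax in the comment is only illustrative -- check the Document Generation docs for the real grammar.

// Illustrative data payload for a Word template that contains tags along the
// lines of "Hello {{name}}, your salary is {{salary}}" plus a CA-only paragraph.
// The exact tag syntax is defined in the Document Generation docs.
const data = {
    name: 'Raymond Camden',
    salary: 90000,
    state: 'CA',
    purchases: [
        { item: 'Cat food', amount: 25 },
        { item: 'More cat food', amount: 40 }
    ]
};

// This JSON plus the .docx template are sent to the Document Generation API,
// which returns the merged, dynamic PDF (or Word) document.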
PDF Embed
Browsers do a good job of rendering PDFs now
but it's the ENTIRE web page, you've lost your navigation, branding, other context
PDF Embed is "here's a DIV, put your PDF in there" and it's a better UX
also have hooks into docs
snoop on the doc, "they selected text, they went from page 2 to page 3", etc. (see the sketch below)
Lawyers: we really need you to MAKE people read a document, watch to make sure people have scrolled all the way to the end of the document, etc.
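A rough sketch of what the embed plus those event hooks look like in client-side JavaScript; it assumes the Embed API's viewer.js is already loaded on the page, and the client ID, div ID, and PDF URL are placeholders:

// Assumes viewer.js from the PDF Embed API is on the page and an empty
// <div id="pdf-view"></div> exists. Client ID and URL are placeholders.
document.addEventListener('adobe_dc_view_sdk.ready', () => {
    const adobeDCView = new AdobeDC.View({ clientId: 'YOUR_CLIENT_ID', divId: 'pdf-view' });

    // Render the PDF inside the div instead of taking over the whole page
    adobeDCView.previewFile({
        content: { location: { url: 'https://example.com/whitepaper.pdf' } },
        metaData: { fileName: 'whitepaper.pdf' }
    });

    // The "snoop on the doc" part: page changes, text selection, etc.
    adobeDCView.registerCallback(
        AdobeDC.View.Enum.CallbackType.EVENT_LISTENER,
        event => console.log(event.type, event.data),
        { enablePDFAnalytics: true }
    );
});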
PDF Extract
uses Adobe Sensei - ML, AI, Skynet, etc.
extracts text, tables, images, styling info
will take everything possible out of a PDF
tables can be stored as CSV, XLSX, or images
extracts document structure
- not just "the text says Ray" but also "this text was an H1 header"
- will auto OCR when necessary
Details
SDKs for Node, Java, .NET, and Python
(Ben Forta has a great book on Python)
REST APIs
- can use this with CF
- free trial. 1000 calls over 6 months
Go to the Adobe PDF Services API website and click "get credentials"
Code Process --
1. get credentials (one time)
(all the doc services stuff kind of follows this same process)
2. Get the SDK (or use REST API)
3. Write code to extract from PDF xyz
4. Automate the previous step
5. That's it! Profit!
Adobe gives you the data from the doc. Interesting part comes AFTER that.
For today - using Node.js
ColdFusion - Java
Tony Junkes has a GitHub repo where he did the work of getting the Java SDK to magically work in CF; it can be dropped into "this.javaSettings" to load the PDF stuff.
Or just consider using the new REST API
General pseudo-code flow:
make a credentials obj
create an execution context specific to your operation (extract pdf, ocr, etc)
set your input and options
execute
save results
// Create an ExecutionContext using credentials
const executionContext = PDFServicesSDK.ExecutionContext.create(credentials);

// Build extractPDF options
const options = new PDFServicesSDK.ExtractPDF.options.ExtractPdfOptions.Builder()
    .addElementsToExtract(
        PDFServicesSDK.ExtractPDF.options.ExtractElementType.TEXT,
        PDFServicesSDK.ExtractPDF.options.ExtractElementType.TABLES)
    .addElementsToExtractRenditions(
        PDFServicesSDK.ExtractPDF.options.ExtractRenditionsElementType.FIGURES,
        PDFServicesSDK.ExtractPDF.options.ExtractRenditionsElementType.TABLES)
    .addTableStructureFormat(PDFServicesSDK.ExtractPDF.options.TableStructureType.CSV)
    .build();

// Create a new operation instance.
const extractPDFOperation = PDFServicesSDK.ExtractPDF.Operation.createNew(),
    input = PDFServicesSDK.FileRef.createFromLocalFile(
        'PlanetaryScienceDecadalSurvey.pdf',
        PDFServicesSDK.ExtractPDF.SupportedSourceFormat.pdf
    );
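The sample above stops before the operation actually runs. A minimal sketch of how the rest of the flow typically looks with the Node SDK (the output file name is just an example; Extract returns a ZIP containing structuredData.json plus any renditions):

// Wire the input and options into the operation, run it, save the result
extractPDFOperation.setInput(input);
extractPDFOperation.setOptions(options);

extractPDFOperation.execute(executionContext)
    .then(result => result.saveAsFile('output/extract.zip'))
    .catch(err => console.log('Exception while executing operation', err));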
What it means --
documentation that explains, at a high level, what the JSON is reporting
also have a JSON Schema
- a way to define the structure of a JSON file
Visualizer - shows the parts of the PDF doc and how they relate to the parts of the JSON doc (a trimmed example of that JSON is below)
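For reference, the structuredData.json that Extract returns is essentially one big array of elements. A trimmed, illustrative sample (the values are made up, and real elements carry more attributes, like Bounds):

{
  "elements": [
    {
      "Path": "//Document/H1",
      "Text": "Planetary Science Decadal Survey",
      "Font": { "name": "ArialMT" },
      "TextSize": 24,
      "Page": 0
    },
    {
      "Path": "//Document/P[2]",
      "Text": "The committee recommends...",
      "Font": { "name": "ArialMT" },
      "TextSize": 11,
      "Page": 0
    }
  ]
}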
So now what?
Scenario - get text
useful for search engine
present text fragments to people
- grab the 1st paragraph out of a PDF, etc.
const fs = require('fs');

let data = JSON.parse(fs.readFileSync('./output/structuredData.json', 'utf8'));

// Concatenate the Text property of every element
let text = data.elements.reduce((text, el) => {
    if(el.Text) text += el.Text + '\n';
    return text;
}, '');

console.log(text);
...returns a "wall of text", not terribly useful by itself
Scenario - get headers
slightly better, look at the return data and see what the headers are
what are the high level topics in this doc, etc.
maybe we only want to give headers to the search engine, not ALL the text
let data = JSON.parse(fs.readFileSync('./output/structuredData.json', 'utf8'));

let text = data.elements.reduce((text, el) => {
    if(el.Path.includes('H1')) text += el.Text + '\n';
    return text;
}, '');

console.log(text);
Tip: in any code you write, don't run the extract over and over.
Extract once, then run your code against the JSON extraction -- WAY faster.
Scenario - Style compliance
- look for fonts
let data = JSON.parse(fs.readFileSync('./output/structuredData.json', 'utf8'));

let fonts = new Set();
data.elements.forEach(e => {
    if(e.Font) fonts.add(e.Font.name);
});

console.log('List of fonts from input PDF:\n');
for(let font of fonts) console.log(font);
Scenario - text compliance
look for words we want/don't want
words that must include others
let data = JSON.parse(fs.readFileSync('./output/structuredData.json', 'utf8'));

let text = data.elements.reduce((text, el) => {
    if(el.Text && (el.Path.indexOf('H1') === -1) &&
       (el.Path.indexOf('H2') === -1) && (el.Path.indexOf('H3') === -1)) text += el.Text + '\n';
    return text;
}, '');

let words = text.split(/\s+/);

let possibleAcro = words.reduce((matches, word) => {
    if(word.match(/^[A-Z]{3,}$/) && matches.indexOf(word) === -1) {
        matches.push(word);
    }
    return matches;
}, []);

console.log(possibleAcro);
Scenario - process tabular data
analyze the table data, etc.
const options = new PDFServicesSDK.ExtractPDF.options.ExtractPdfOptions.Builder()
    .addElementsToExtract(
        PDFServicesSDK.ExtractPDF.options.ExtractElementType.TEXT,
        PDFServicesSDK.ExtractPDF.options.ExtractElementType.TABLES)
    .addTableStructureFormat(PDFServicesSDK.ExtractPDF.options.TableStructureType.CSV)
    .build();

// Create a new operation instance.
const extractPDFOperation = PDFServicesSDK.ExtractPDF.Operation.createNew(),
    input = PDFServicesSDK.FileRef.createFromLocalFile(
        inputPDF,
        PDFServicesSDK.ExtractPDF.SupportedSourceFormat.pdf
    );
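Once the extract result is unzipped, the table CSVs can be processed with plain Node. A minimal sketch, assuming the result was unzipped to ./output and one table landed at ./output/tables/fileoutpart0.csv (the path and file name are just examples):

const fs = require('fs');

// Hypothetical path - adjust to wherever the Extract result was unzipped
const csvPath = './output/tables/fileoutpart0.csv';

// Naive CSV parsing (no handling of quoted commas) - fine for a quick look
const rows = fs.readFileSync(csvPath, 'utf8')
    .trim()
    .split('\n')
    .map(line => line.split(','));

const [header, ...data] = rows;
console.log(`Table has ${data.length} rows with columns: ${header.join(' | ')}`);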
Turning it up -
ML / AI
What is the text discussing?
Who is the text discussing?
"If this doc mentions Bill Gates..."
legal clauses that might apply to us, etc.
problematic language
Scenario - Sentiment of a doc
flag anything that's negative, etc.
Service called Diffbot
- returns things about docs, including sentiment
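No Diffbot code was shown in the talk; the sketch below is only the general shape of calling its Natural Language API from Node -- the endpoint, parameters, and response fields are assumptions to verify against Diffbot's docs:

// Hypothetical call shape - endpoint, query params, and response fields are
// assumptions; check Diffbot's Natural Language API docs for the real ones.
// Uses the global fetch available in Node 18+.
async function analyzeText(text, token) {
    const resp = await fetch(`https://nl.diffbot.com/v1/?fields=sentiment,entities,facts&token=${token}`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ content: text, lang: 'en' })
    });
    const result = await resp.json();

    // e.g. flag anything strongly negative
    if(result.sentiment < -0.5) console.log('Flagged as negative:', result.sentiment);
    return result;
}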
Scenario - Facts
what does this doc talk about?
Fact != truth
If the doc says "Ray is a good dancer"
that is a fact put forth in the doc
Diffbot can pull these out as well.
Scenario - Entities
which "things" are discussed
people, places, orgs
Scenario - summarize
what's the gist of a doc
MeaningCloud.com service that can do this -- free trial available
Image Analysis
this PDF had 5 images
what's IN those images, is it problematic, etc
can flag NSFW images, etc.
Microsoft Computer Vision service
extract the images, then pass them to a 3rd party service to see what's in them (sketch below)
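A sketch of handing one extracted figure to Microsoft's Computer Vision Analyze endpoint; the resource URL, API version, and response fields should be verified against the current Azure docs:

const fs = require('fs');

// Endpoint and API version are assumptions - verify against the Azure
// Computer Vision docs. Uses the global fetch available in Node 18+.
async function analyzeImage(path, endpoint, key) {
    const resp = await fetch(`${endpoint}/vision/v3.2/analyze?visualFeatures=Description,Adult`, {
        method: 'POST',
        headers: {
            'Ocp-Apim-Subscription-Key': key,
            'Content-Type': 'application/octet-stream'
        },
        body: fs.readFileSync(path)
    });
    const result = await resp.json();

    // result.description / result.adult are the parts to inspect for captions and NSFW flags
    console.log(JSON.stringify(result, null, 2));
    return result;
}

// e.g. run it over a figure rendition Extract pulled out of the PDF
// analyzeImage('./output/figures/fileoutpart1.png', 'https://myresource.cognitiveservices.azure.com', process.env.CV_KEY);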
Resources -
Documentation
Support Forum
on Stack Overflow tags: adobe-documentgeneration, adobe-embed-api, adobe-pdfservices
Adobe Tech Blog
- "Adobe Document Cloud" section, various blog posts about PDFs in there.
Ray's Contact info:
Twitter: @raymondcamden
raymondcamden.com
Demos can all be found in this GitHub repo.