Scanned PDF purchase orders: OCR that actually holds up
Roughly 30% of the POs an industrial distributor sees are scanned PDFs, not text-based ones. Fax-quality, sometimes faded, occasionally upside-down. Every PO automation pitch glosses over this part because the demo always shows a clean text PDF. Here's what the OCR layer looks like when it has to handle the actual mail.
The shape of the problem
"Scanned PDF" covers more ground than people assume. A few things we see in the wild:
- Faxed POs — buyer prints a PO, walks to the fax machine, sends it to your fax-to-email number, you get a PDF with grayscale noise around every character.
- Photographed POs — buyer takes a phone picture of a printed PO and emails it. Geometric distortion, glare, sometimes a thumb in frame.
- Scanned PDFs with bordered tables — a copier feeds the page through and the table grid lines become OCR confusion. Every "1" becomes "I"; every "0" becomes "O" inside a black border.
- Handwritten edits — a printed PO with a quantity scratched out and a different number written in by hand. The printed part OCRs fine; the handwriting needs a different pass.
- Mixed pages — a 6-page PDF where 4 pages are clean and 2 are scanned (a cover letter scanned plus the actual PO printed digitally). The mix is harder than a single mode.
Tesseract, the default open-source OCR engine, handles maybe 60% of these cleanly out of the box. It does better than 70% on the better scans and drops to 30-40% on bordered tables and faxed pages. The naive integration just runs Tesseract on every PDF and trusts the result, which is why most "PO automation" demos avoid scanned files entirely.
The trust gate
SideQuest runs Tesseract first because it's fast, local, and free. But the output goes through a trust gate before the matcher ever sees it. The gate checks four signals:
- Structural confidence. Did Tesseract find rows that look like line items — qty, part number, description, price columns? If a "scanned PO" came back with no parseable structure, the gate rejects it.
- Character confidence. Tesseract reports a confidence score from 0 to 100 for each recognized character. An average below 65 on a tabular page is a rejection signal.
- Garble detection. Bordered tables produce a specific kind of OCR mess where every other character is a pipe or hash. If the output is more than 15% noise tokens, the gate rejects it.
- Length sanity. If Tesseract returned 5 tokens for a one-page PO that obviously contains 30 items based on the file size and resolution, something went wrong.
When the gate fails, the connector falls through to a second pass: vision passthrough. The PDF page gets rendered as a PNG, the PNG goes directly to Claude Desktop's native vision capability, and Claude writes the line items as structured JSON. Bordered tables that Tesseract destroys, Claude reads natively. Handwriting that Tesseract refuses to attempt, Claude writes out.
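The four signals can be sketched as one function. This is a minimal illustration, not SideQuest's actual code: the `trust_gate` and `GateResult` names, the row regex, the definition of a noise token, and the `min_tokens` floor are all assumptions. Only the confidence floor of 65 and the 15% noise ceiling come from the text above. The length check is simplified to a fixed token floor rather than an estimate from file size and resolution. Input is the OCR text split into lines plus (token, confidence) pairs, the shape you would get by zipping the `text` and `conf` fields of pytesseract's `image_to_data` output.

```python
import re
from dataclasses import dataclass, field

# Hypothetical row shape: qty, part number, description, price.
ROW_RE = re.compile(r"^\s*\d+\s+[A-Z0-9][A-Z0-9\-]+\s+.+?\s+\d+\.\d{2}\s*$")
# Hypothetical noise definition: tokens made only of border debris.
NOISE_RE = re.compile(r"^[|#_=~\-]+$")


@dataclass
class GateResult:
    passed: bool
    reasons: list = field(default_factory=list)


def trust_gate(lines, words, min_conf=65.0, max_noise=0.15, min_tokens=10):
    """lines: OCR text lines; words: (token, confidence 0-100) pairs."""
    reasons = []
    tokens = [t for t, _ in words if t.strip()]

    # 1. Structural confidence: at least one line parses as a line item.
    if not any(ROW_RE.match(line) for line in lines):
        reasons.append("no line-item rows found")

    # 2. Character confidence: mean confidence over non-empty tokens.
    confs = [c for t, c in words if t.strip() and c >= 0]
    if not confs or sum(confs) / len(confs) < min_conf:
        reasons.append("mean confidence below threshold")

    # 3. Garble detection: share of tokens that are pure border debris.
    noise = sum(1 for t in tokens if NOISE_RE.match(t))
    if tokens and noise / len(tokens) > max_noise:
        reasons.append("too many noise tokens")

    # 4. Length sanity: suspiciously little text for a full page.
    if len(tokens) < min_tokens:
        reasons.append("too few tokens")

    return GateResult(passed=not reasons, reasons=reasons)
```

Returning the reasons alongside the pass/fail flag makes it easy to log why a page fell through to the vision pass.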
Why both passes matter
A pure-Claude pipeline would work too, but it wouldn't be cheap. Claude vision calls cost real money per page; running every PO through them at scale eats the unit economics. The gate is an economic decision as much as an accuracy decision. Cheap path first, fall through to expensive path only when the cheap path fails.
The numbers we see on a 270-PO sample from one industrial distributor:
- 185 POs (69%) parsed cleanly with Tesseract: text PDFs and well-scanned ones
- 52 POs (19%) failed the trust gate and parsed cleanly with Claude vision: bordered tables, faded scans
- 22 POs (8%) needed human review even after Claude vision: mixed-mode or genuinely illegible
- 11 POs (4%) failed both passes and were routed to "send the buyer a reply asking for a clean copy"
That last 4% is honest. There is a tail of POs no OCR pipeline reads. The right move is to drop them into a "needs buyer follow-up" queue with a pre-written reply asking for a fresh PDF, not to pretend the system handles 100%.
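The same counts show how much of the expensive path the gate avoids. A quick tally using only the numbers from the sample; the `vision_spend` helper, its per-page price argument, and the one-page-per-PO default are placeholders for illustration, not figures from this post.

```python
# Counts from the 270-PO sample above.
routes = {
    "tesseract_clean": 185,
    "claude_vision_clean": 52,
    "human_review_after_vision": 22,
    "buyer_follow_up": 11,
}
total = sum(routes.values())

# Every PO that failed the gate incurred a vision call, whatever happened next.
vision_calls = total - routes["tesseract_clean"]
vision_share = vision_calls / total


def vision_spend(n_pos, share, price_per_page, pages_per_po=1):
    """Estimated vision spend. price_per_page is a placeholder you supply."""
    return n_pos * share * pages_per_po * price_per_page


for name, count in routes.items():
    print(f"{name}: {count} ({count / total:.0%})")
print(f"vision calls: {vision_calls} of {total} ({vision_share:.0%})")
```

On this sample, 85 of 270 POs (about 31%) reach the vision pass, so the gate keeps roughly two thirds of the volume on the free local path.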
Bordered tables, specifically
Bordered tables are the single biggest reason a buyer gets a wrong Estimate from a "PO automation" tool. The OCR reads:
| Qty | Part #     | Description     | Price |
| 25  | BR-ELB-050 | 1/2 in elbow    | 4.85  |
| 50  | VLV-BL-100 | 1 in ball valve | 28.00 |
...as something like:
I Qty I Part # I Description I Price I
I 25 I BR-EL8-05O I I/2 in elbow I 4.85 I
I 5O I VLV-8L-IOO I I in ball valve I 28.OO I
Nearly every 0 is now an O, every 1 is an I, and some Bs have turned into 8s. The price column is now alphanumeric. The matcher chokes, but worse: if the trust gate isn't there, the connector silently writes an Estimate to QuickBooks for part BR-EL8-05O, which doesn't exist. The customer gets an Estimate full of garbage SKUs and the AR team finds out three days later.
SideQuest's gate catches this specific failure mode. The 15%-noise-token signal lights up on bordered tables almost every time. The connector falls through to Claude vision and the page parses cleanly. Try it on the parser playground if you have a bordered-table PO handy.
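Independent of the noise-token signal, a row like the garbled ones above can also be caught by checking that the numeric columns are numeric and that the part number exists before anything is written. A minimal sketch: the `check_row` helper and the two-item `CATALOG` are hypothetical, and this is not a description of SideQuest's matcher.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical catalog; in practice this would come from the QuickBooks item list.
CATALOG = {"BR-ELB-050", "VLV-BL-100"}


def check_row(qty, part, price, catalog=CATALOG):
    """Return a list of problems for one OCR'd row; empty means usable."""
    problems = []
    if not qty.isdigit():
        problems.append(f"qty not numeric: {qty!r}")
    try:
        Decimal(price)
    except InvalidOperation:
        problems.append(f"price not numeric: {price!r}")
    if part not in catalog:
        problems.append(f"unknown part: {part!r}")
    return problems


clean = check_row("50", "VLV-BL-100", "28.00")
garbled = check_row("5O", "VLV-8L-IOO", "28.OO")
```

The clean row returns no problems; the garbled row fails all three checks, which is enough to keep it out of a draft Estimate.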
Handwritten edits
A printed PO with a handwritten correction is a special case. The page OCRs mostly fine — the printed parts are clean — but Tesseract returns nothing useful for the inked-in correction. The naive integration would use the printed quantity and ignore the handwriting; the careful one notices that there's pixel data over the printed quantity and flags the line for review.
SideQuest treats handwriting as a confidence-zero region. The printed value gets through with a confidence-zero override flag. The reviewer sees "this line had a handwritten edit — verify the qty matches what the buyer wrote in" before clicking submit. That's the operator UI behavior; the matcher never silently accepts the printed value when ink overlaps it.
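The override flag can be pictured as a bounding-box overlap test. This sketch assumes some upstream step has already located ink regions as pixel boxes, which the post does not describe; `LineItem`, `overlaps`, and `flag_handwritten_edits` are hypothetical names.

```python
from dataclasses import dataclass

Box = tuple  # (x0, y0, x1, y1) in page pixels


@dataclass
class LineItem:
    qty: int
    part: str
    qty_box: Box
    confidence: float = 100.0
    needs_review: bool = False
    note: str = ""


def overlaps(a, b):
    """True when two axis-aligned boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def flag_handwritten_edits(items, ink_boxes):
    """Mark lines whose printed qty is overlapped by a detected ink region."""
    for item in items:
        if any(overlaps(item.qty_box, ink) for ink in ink_boxes):
            item.confidence = 0.0  # printed value is kept but not trusted
            item.needs_review = True
            item.note = "handwritten edit: verify qty against what the buyer wrote in"
    return items
```

The printed quantity stays on the line so the reviewer has something to compare against, but the zero confidence keeps it from being accepted without a human look.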
Where the OCR layer sits in the workflow
End to end, here's the flow for a PO with a scanned PDF attachment:
- Connector reads the labeled Gmail thread.
- Body text gets parsed against the standard anchors (Bill To, Ship To, PO #, terms, need-by).
- Each PDF attachment goes through pdfplumber for embedded text. If the PDF has selectable text, we use that directly — fastest path.
- If pdfplumber returns fewer than 20 characters from a page, we treat the page as a scanned image and send it to Tesseract.
- Tesseract output runs through the trust gate. Pass = use it. Fail = render the page as a PNG and call Claude vision.
- Whichever pass produced text feeds the same line-item parser the body uses.
- Matcher runs against the QB catalog. Any line that came from the OCR path with under-threshold confidence gets flagged for review.
- Draft Estimate gets built. The OCR confidence is shown on each line so the reviewer knows what to trust.
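The per-page routing in the middle of that flow can be sketched as follows. This is an illustration under assumptions, not the connector's code: `route_page` and `tesseract_ocr` are hypothetical names, and the gate and vision steps are passed in as callables so the routing logic stands on its own. The library calls used are pdfplumber's `page.extract_text()` and `page.to_image(...).original`, and pytesseract's `image_to_string`.

```python
MIN_EMBEDDED_CHARS = 20  # threshold from the flow above


def route_page(page, ocr, gate, vision):
    """Return (source, text) for one PDF page.

    page   : object with extract_text(), e.g. a pdfplumber page
    ocr    : callable(page) -> text, e.g. a Tesseract wrapper
    gate   : callable(text) -> bool, the trust gate
    vision : callable(page) -> text, the vision fallback
    """
    embedded = page.extract_text() or ""  # handles both None and ""
    if len(embedded.strip()) >= MIN_EMBEDDED_CHARS:
        return "embedded", embedded  # fastest path: selectable text

    ocr_text = ocr(page)
    if gate(ocr_text):
        return "tesseract", ocr_text

    return "vision", vision(page)


def tesseract_ocr(page, resolution=300):
    """One possible `ocr` callable; needs pdfplumber pages and pytesseract."""
    import pytesseract  # imported lazily so route_page has no hard dependency

    image = page.to_image(resolution=resolution).original  # PIL image
    return pytesseract.image_to_string(image)
```

Returning the source label with the text lets the later steps attach the right confidence to each line, since only lines from the OCR paths need the under-threshold review flag.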
The whole pipeline runs on your computer. The page image only leaves your machine when the trust gate fails and the connector calls Claude — and even then, it goes to Anthropic's API directly, not via SideQuest's servers. We never see your PO content.
What this means for evaluation
If you're evaluating PO automation tools, the test that matters is: send the tool a stack of your real scanned POs, including the ugly ones, and see what comes back. Demo videos with clean text PDFs prove nothing. Customer references with mature integrations also prove nothing about cold-start accuracy. The real test is the messy middle.
The parser playground handles uploaded PDFs the same way the production connector does, with one exception: pdfplumber runs first and browser-side Tesseract second (gated), but Claude vision is not available in the browser version. You can paste any PO email body or drop a PDF and see exactly what the parser would do with it. No signup, no upload to our servers; content never leaves your browser.