How OCR Works: Extracting Text from Images and PDFs
Learn how Optical Character Recognition (OCR) technology works and how it enables text extraction from scanned documents and images.
Convert-To Editorial Team
Somewhere in your office, there's a filing cabinet full of paper documents that someone will eventually need to search through. Or maybe the problem is more immediate: a vendor sends invoices as scanned PDFs, and your accounting team rekeys every number by hand because they can't copy text from the files. OCR — Optical Character Recognition — is the technology that bridges the gap between images of text and actual, editable, searchable text data. But the process is far more complex than most people realize, and understanding its mechanics explains why OCR results range from near-perfect to borderline unusable.
The OCR Pipeline: From Pixels to Characters
OCR isn't a single operation — it's a multi-stage pipeline where each step depends on the quality of the previous one. A failure at any stage cascades through the rest of the process.
Stage 1: Image Pre-Processing
Before any character recognition happens, the OCR engine prepares the input image. Raw scans are rarely clean enough for direct analysis. Pre-processing steps include:
- Deskewing — straightening pages that were scanned at a slight angle. Even a 2-degree tilt causes character misalignment that degrades recognition accuracy.
- Binarization — converting the image to pure black and white. Most OCR engines work on binary images because it simplifies the distinction between ink (foreground) and paper (background).
- Noise removal — eliminating specks, dust, and scanner artifacts that could be mistaken for punctuation or character components.
- Contrast normalization — evening out lighting variations across the page, particularly for photographs of documents taken in uneven light.
The quality of this pre-processing stage has an outsized impact on final accuracy. In our testing, running OCR on a raw phone photo of a document produced 78% character accuracy, while the same image after pre-processing (deskewing, contrast normalization, noise removal) jumped to 94% accuracy.
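Binarization, the second step above, is usually done with an adaptive threshold rather than a fixed cutoff. A minimal sketch of one classic approach, Otsu's method, which picks the threshold that best separates the ink and paper pixel populations (the grayscale values and cluster sizes here are illustrative, not from any particular engine):

```python
def otsu_threshold(pixels):
    """Pick the threshold that maximizes between-class variance
    (Otsu's method) over a list of 0-255 grayscale values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_bg = 0.0          # running sum of background intensities
    weight_bg = 0         # running count of background pixels
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(pixels):
    """Map each pixel to pure ink (0) or pure paper (255)."""
    t = otsu_threshold(pixels)
    return [0 if p <= t else 255 for p in pixels]
```

Because the threshold adapts to the actual histogram, the same code handles a dark photocopy and a bright flatbed scan without hand tuning, which is exactly why engines binarize rather than trusting the scanner's output.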
Stage 2: Layout Analysis
Before reading individual characters, the engine must understand the document's structure. Where are the text columns? Where are the headers, footers, and page numbers? Where are the tables, images, and captions?
Layout analysis identifies:
| Element | Detection Method | Challenge |
|---|---|---|
| Text blocks | Connected component analysis | Multi-column layouts can merge or split incorrectly |
| Tables | Line detection + grid inference | Borderless tables are notoriously difficult |
| Images | Region classification | Text overlaid on images confuses detection |
| Headers/footers | Position-based rules | Repeated headers might be duplicated in output |
| Reading order | Left-to-right, top-to-bottom heuristics | Complex layouts (magazines, forms) break standard order |
This is where multi-column documents cause problems. If the layout analyzer misidentifies a two-column page as single-column, the output interleaves text from both columns — producing gibberish that requires complete manual reconstruction.
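The reading-order heuristic in the table above can be sketched in a few lines: group text blocks into columns by horizontal overlap, then read each column top to bottom, columns left to right. This is a simplified illustration (real layout analyzers use much richer features), with the `column_gap` tolerance chosen arbitrarily:

```python
def reading_order(blocks, column_gap=50):
    """Order text blocks for a columnar page.
    blocks = [(x, y, width, height), ...] in pixel coordinates.
    Blocks whose x-ranges overlap (within column_gap) are grouped
    into one column; columns are read left to right, and each
    column top to bottom."""
    columns = []
    for b in sorted(blocks, key=lambda b: b[0]):
        for col in columns:
            cx0 = min(c[0] for c in col)
            cx1 = max(c[0] + c[2] for c in col)
            # Does this block horizontally overlap the column?
            if b[0] < cx1 + column_gap and b[0] + b[2] > cx0 - column_gap:
                col.append(b)
                break
        else:
            columns.append([b])  # start a new column
    ordered = []
    for col in sorted(columns, key=lambda col: min(c[0] for c in col)):
        ordered.extend(sorted(col, key=lambda c: c[1]))
    return ordered
```

The failure mode described above falls out directly: if the overlap test wrongly merges two columns into one, the final sort by vertical position interleaves lines from both columns.
<test>
blocks = [(200, 0, 100, 20), (0, 30, 100, 20), (0, 0, 100, 20), (200, 30, 100, 20)]
assert reading_order(blocks) == [(0, 0, 100, 20), (0, 30, 100, 20), (200, 0, 100, 20), (200, 30, 100, 20)]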
Stage 3: Character Segmentation
Once text regions are identified, the engine isolates individual characters. For printed text in standard fonts, this is straightforward — characters have consistent spacing and clear boundaries. But several real-world scenarios make segmentation difficult:
- Touching characters where ink bleed causes letters to connect (common in faxed documents and low-resolution scans)
- Broken characters where thin strokes don't scan clearly, splitting a single letter into fragments
- Proportional fonts where character widths vary, making it harder to determine where one letter ends and another begins
- Ligatures where multiple characters are rendered as a single glyph (fi, fl, ff in many fonts)
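A common baseline for segmenting printed text is the vertical projection profile: sum the ink in each pixel column of a text line, and treat all-white columns as gaps between characters. A minimal sketch, operating on a binarized line where 1 means ink:

```python
def segment_characters(bitmap):
    """Split a binarized text-line bitmap (list of pixel rows,
    1 = ink, 0 = paper) into (start, end) column ranges, one per
    character, using a vertical projection profile."""
    width = len(bitmap[0])
    ink_per_col = [sum(row[x] for row in bitmap) for x in range(width)]
    segments, start = [], None
    for x, ink in enumerate(ink_per_col):
        if ink and start is None:
            start = x                    # entering a character
        elif not ink and start is not None:
            segments.append((start, x))  # leaving a character
            start = None
    if start is not None:
        segments.append((start, width))
    return segments
```

The bullet points above map directly onto this code's failure modes: touching characters share ink in every column between them and come back as one oversized segment, while a broken character with a white gap through it comes back as two fragments.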
Stage 4: Character Recognition
This is the core of OCR — matching each segmented character image to a known character. Modern OCR engines use two main approaches:
Pattern matching compares character images against a stored library of templates. Fast and effective for standard printed fonts, but struggles with unusual typefaces, degraded print quality, or any variation from the template library.
Feature extraction with machine learning analyzes structural features of each character (curves, intersections, line angles, stroke counts) and uses trained neural networks to classify them. This approach handles font variations, minor damage, and degraded quality much better than pattern matching.
Current engines (Tesseract 5, ABBYY FineReader, Google Vision) primarily use deep learning — specifically LSTM (Long Short-Term Memory) networks that process characters in sequence, using context from surrounding characters to improve predictions. This is why OCR can sometimes correctly identify a poorly scanned letter based on the word it appears in.
| Recognition Method | Accuracy (clean print) | Accuracy (degraded scan) | Speed |
|---|---|---|---|
| Pattern matching | 99%+ | 85-90% | Very fast |
| Feature extraction (traditional ML) | 98-99% | 90-95% | Fast |
| Deep learning (LSTM) | 99.5%+ | 92-97% | Moderate |
| Transformer-based (latest) | 99.7%+ | 95-98% | Slower |
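The pattern-matching row of the table is the easiest to illustrate. A toy sketch, using hypothetical 3x3 glyph templates (real engines use far larger, font-specific libraries and normalize the glyph size first): classify a binarized glyph by its pixel agreement with each template.

```python
# Hypothetical 3x3 templates for illustration only.
TEMPLATES = {
    "I": ((0, 1, 0),
          (0, 1, 0),
          (0, 1, 0)),
    "L": ((1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)),
}

def match_character(glyph):
    """Return (best_char, similarity): the template with the
    highest fraction of matching pixels."""
    def score(tmpl):
        agree = sum(g == t
                    for grow, trow in zip(glyph, tmpl)
                    for g, t in zip(grow, trow))
        return agree / 9
    best = max(TEMPLATES, key=lambda ch: score(TEMPLATES[ch]))
    return best, score(TEMPLATES[best])
```

The table's weakness column follows from the mechanism: a glyph in an unusual typeface or with a few damaged pixels scores poorly against every template, whereas a feature-based or learned classifier can still recognize the underlying strokes.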
Stage 5: Post-Processing
Raw character recognition output contains errors. Post-processing attempts to catch and correct them:
- Dictionary lookup — flagging words that don't exist in the language dictionary
- Context analysis — using language models to predict likely words (e.g., "tle" after "bot" is likely "bottle")
- Format validation — checking that dates, phone numbers, and other structured data match expected patterns
- Confidence scoring — marking low-confidence characters for human review
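The first two post-processing steps can be combined into a simple corrector: accept dictionary words as-is, and for unknown words try substituting common OCR confusion pairs until a real word appears. A minimal sketch (the confusion table is an illustrative subset, not any engine's actual list):

```python
# Frequently confused character sequences in OCR output (illustrative).
CONFUSIONS = {"0": "o", "1": "l", "5": "s", "8": "b", "rn": "m"}

def correct_word(word, dictionary):
    """Dictionary-based post-correction: return the word unchanged
    if it is known; otherwise try each confusion substitution and
    keep the first variant that is a real word."""
    if word.lower() in dictionary:
        return word
    for wrong, right in CONFUSIONS.items():
        if wrong in word:
            candidate = word.replace(wrong, right)
            if candidate.lower() in dictionary:
                return candidate
    return word  # still unknown: flag for human review
```

For example, `correct_word("inv0ice", {"invoice"})` recovers `"invoice"`. This also shows the limits of the approach: it cannot fix a misread that happens to produce another valid word, which is why confidence scoring and human review remain part of the pipeline.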
OCR Accuracy: What to Realistically Expect
Marketing materials for OCR software often claim 99%+ accuracy. That number is achievable — but only under ideal conditions. Real-world accuracy depends heavily on input quality:
| Input Quality | Expected Accuracy | Example |
|---|---|---|
| Clean print, 300+ DPI scan | 99-99.5% | Modern laser-printed document, flatbed scanner |
| Good print, 200 DPI scan | 97-99% | Standard office document, multifunction printer scan |
| Older print, 150 DPI scan | 92-97% | Faxed document, dated laser print |
| Photocopy of a photocopy | 80-92% | Third-generation copy, faded text |
| Phone photo (good lighting) | 90-96% | Straight-on shot, even lighting, clean document |
| Phone photo (poor conditions) | 70-85% | Angled, shadows, page curl, low light |
| Handwriting (printed letters) | 70-90% | Neat block letters, consistent spacing |
| Handwriting (cursive) | 40-70% | Connected script, personal style variations |
The "99% accuracy" figure sounds impressive until you consider what it means for a real document. A single page of text contains roughly 2,000-3,000 characters. At 99% accuracy, that's 20-30 errors per page. At 97% accuracy, it's 60-90 errors per page. For a 50-page document at 97% accuracy, you're looking at 3,000-4,500 character errors — enough to require thorough manual proofreading.
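The arithmetic above is worth having on hand when estimating proofreading effort. Assuming a nominal 2,500 characters per page:

```python
def expected_errors(pages, chars_per_page, accuracy):
    """Expected number of character errors for a document
    OCR'd at the given accuracy rate (0.0-1.0)."""
    return round(pages * chars_per_page * (1 - accuracy))

expected_errors(1, 2500, 0.99)    # 25 errors on a single dense page
expected_errors(50, 2500, 0.97)   # 3750 errors across a 50-page document
```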
For the best OCR results from scanned documents, scan at 300 DPI or higher in grayscale rather than color; color scans are larger and don't improve text recognition. For clean, high-contrast documents scanned specifically for OCR, black-and-white (1-bit) scans can work just as well, since binarization has effectively been done at scan time. Grayscale remains the safer default, though, because it leaves the engine's own binarization free to recover faint strokes and uneven backgrounds that a 1-bit scan discards permanently.
When OCR Fails: Common Failure Modes
Understanding where OCR breaks helps you decide when to trust automated results and when manual verification is essential.
Tabular Data Extraction
Tables are OCR's weakness. Even when character recognition is accurate, reconstructing table structure — which cells belong to which columns, how merged cells span rows — frequently produces mangled output. A 10-row, 5-column financial table might OCR with 99% character accuracy but have columns misaligned, values shifted to wrong cells, or header rows merged with data.
If you're extracting tables from scanned PDFs for spreadsheet use, expect to spend time on manual cleanup. Our PDF to Excel converter handles native (text-based) PDF tables well, but scanned tables still require careful verification.
Mixed-Language Documents
Most OCR engines are trained on single-language text. Documents that mix languages (English body text with French quotations, Japanese product names in an English report) confuse the language model and reduce accuracy for both languages. Specialized multilingual models exist but are slower and still less accurate than single-language processing.
Degraded Historical Documents
Museum archives, century-old newspapers, and historical legal records present extreme challenges: faded ink, inconsistent typefaces, non-standard character sets, yellowed paper with show-through from the reverse side. Accuracy on these documents can drop below 70%, making automated OCR useful primarily as a rough index for human researchers rather than a definitive transcription.
Low-Resolution Input
OCR engines need a minimum of about 200 DPI (8 pixels per character height) to function reliably. Below that threshold — common in fax transmissions, low-resolution phone photos, and heavily compressed web images — characters blur together and segmentation fails. An invoice photographed from across a desk at an angle might technically contain the text, but the OCR engine receives too few pixels per character to distinguish "0" from "O" or "l" from "1".
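The relationship between scan resolution and character size is simple arithmetic: one typographic point is 1/72 of an inch, so a character's rendered pixel height is its point size divided by 72, times the DPI. A quick back-of-envelope helper (exact per-engine minimums vary):

```python
def char_height_px(point_size, dpi):
    """Approximate rendered character height in pixels:
    1 point = 1/72 inch, so height_px = points / 72 * dpi."""
    return point_size / 72 * dpi

char_height_px(10, 300)  # ~41.7 px: comfortable for OCR
char_height_px(10, 200)  # ~27.8 px: workable
char_height_px(10, 72)   # 10.0 px: segmentation starts to fail
```

This is why a fax (roughly 200x100 DPI) sits near the edge of reliability, and why a phone photo taken from across a desk, where each character may span only a handful of pixels, falls below it.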
OCR Workflow: A Practical Example
A small business receives 50 invoices per month as scanned PDFs from various vendors. The accounting team wants to extract amounts, dates, and vendor names for their bookkeeping software.
Step 1: Evaluate the PDFs. Check if any are already text-based (text is selectable). Those don't need OCR — extract directly with PDF to text conversion.
Step 2: For scanned PDFs, run OCR. If the scans are clean 300 DPI documents, expect 97-99% character accuracy.
Step 3: Verify critical numbers. OCR might read "$1,234.56" correctly in the text body but misread the same amount in a table where cell borders overlap with digits. Always cross-check financial figures.
Step 4: Handle exceptions manually. About 5-10% of invoices will have issues: handwritten notes, rubber stamps, low-resolution scans, or unusual formatting that confuses the layout analyzer.
This workflow reduces manual data entry by 80-90% but doesn't eliminate it. The remaining 10-20% of manual correction is the practical overhead OCR introduces, and planning for it upfront prevents the frustration of discovering errors downstream.
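Step 4's exception handling is easiest to operationalize with the confidence scores most OCR engines report. A sketch of the triage logic, assuming each invoice page comes back with an overall confidence value (the 0.90 cutoff is an arbitrary illustrative choice, not a recommendation from any engine):

```python
def triage(pages, min_confidence=0.90):
    """Split OCR'd pages into auto-accepted and manual-review piles
    based on the engine's reported per-page confidence.
    pages = [(page_id, confidence), ...]"""
    accepted = [p for p, c in pages if c >= min_confidence]
    review = [p for p, c in pages if c < min_confidence]
    return accepted, review

accepted, review = triage([
    ("inv-001", 0.98),  # clean 300 DPI scan
    ("inv-002", 0.81),  # rubber stamp over the total line
    ("inv-003", 0.95),
])
```

Tuning the cutoff is the real work: too high and the review pile defeats the automation, too low and misread financial figures slip through unchecked.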
Scanned documents processed through OCR often contain sensitive information — financial records, personal identification numbers, medical data. When you convert a file on Convert-To.co, it is processed by CloudConvert, a GDPR-compliant and ISO 27001 certified service. All files are automatically deleted within 15 minutes after conversion. Convert-To.co does not store your files on its own servers. For documents with strict confidentiality requirements, consider offline OCR software.
Related Tools and Resources
- PDF to Word Converter — convert PDF documents (including scanned) to editable Word
- PDF to Excel Converter — extract tables from PDF documents
- PDF to Text Converter — extract plain text from PDFs
- Compress PDF — reduce PDF file size after processing
- PDF format guide — understand PDF structure and types
- DOCX format guide — Word document format details
- XLSX format guide — Excel spreadsheet format details
- Why PDF Formatting Breaks — troubleshoot conversion issues after OCR
- What Is a PDF? — understand the three types of PDF (text, image, hybrid)
Related Guides
- The Complete Guide to File Formats and Conversion: a comprehensive guide to understanding file formats and converting between them, covering documents, images, audio, and more.
- Preserving Excel Formatting When Converting to and from PDF (troubleshooting): maintain tables, formulas, and layouts across formats.
- Image Resolution Explained: DPI vs PPI (explainer): how resolution, DPI, and PPI affect print quality and screen display.
- Lossy vs Lossless Compression: What You Need to Know (explainer): the difference between lossy and lossless compression for images and audio, and when quality loss matters.