Explainer

How OCR Works: Extracting Text from Images and PDFs

Learn how Optical Character Recognition (OCR) technology works and how it enables text extraction from scanned documents and images.

February 22, 2026 · 9 min read

Convert-To Editorial Team

Somewhere in your office, there's a filing cabinet full of paper documents that someone will eventually need to search through. Or maybe the problem is more immediate: a vendor sends invoices as scanned PDFs, and your accounting team rekeys every number by hand because they can't copy text from the files. OCR — Optical Character Recognition — is the technology that bridges the gap between images of text and actual, editable, searchable text data. But the process is far more complex than most people realize, and understanding its mechanics explains why OCR results range from near-perfect to borderline unusable.

The OCR Pipeline: From Pixels to Characters

OCR isn't a single operation — it's a multi-stage pipeline where each step depends on the quality of the previous one. A failure at any stage cascades through the rest of the process.

Stage 1: Image Pre-Processing

Before any character recognition happens, the OCR engine prepares the input image. Raw scans are rarely clean enough for direct analysis. Pre-processing steps include:

  • Deskewing — straightening pages that were scanned at a slight angle. Even a 2-degree tilt causes character misalignment that degrades recognition accuracy.
  • Binarization — converting the image to pure black and white. Most OCR engines work on binary images because it simplifies the distinction between ink (foreground) and paper (background).
  • Noise removal — eliminating specks, dust, and scanner artifacts that could be mistaken for punctuation or character components.
  • Contrast normalization — evening out lighting variations across the page, particularly for photographs of documents taken in uneven light.

The quality of this pre-processing stage has an outsized impact on final accuracy. In our testing, running OCR on a raw phone photo of a document produced 78% character accuracy, while the same image after pre-processing (deskewing, contrast normalization, noise removal) jumped to 94% accuracy.
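The binarization step above can be sketched in a few lines. Here is a minimal pure-Python version using Otsu's method, which picks the ink/paper threshold automatically by maximizing the separation between the two pixel populations. Production engines use optimized library implementations (OpenCV, Leptonica), so treat this as an illustration of the idea, not a recipe:

```python
def otsu_threshold(gray: list[int]) -> int:
    """Find the threshold that best separates ink from paper
    by maximizing between-class variance (Otsu's method)."""
    hist = [0] * 256
    for px in gray:
        hist[px] += 1
    total = len(gray)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_bg = 0.0
    weight_bg = 0
    best_thresh, best_var = 0, 0.0
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_thresh = var_between, t
    return best_thresh

def binarize(gray: list[int]) -> list[int]:
    """Map pixels to 0 (ink) or 1 (paper) using the Otsu threshold."""
    t = otsu_threshold(gray)
    return [0 if px <= t else 1 for px in gray]
```

On a page with dark ink (pixel values near 30) and light paper (near 220), the threshold lands between the two clusters, so every pixel is cleanly classified even if lighting shifts both clusters up or down together.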

Stage 2: Layout Analysis

Before reading individual characters, the engine must understand the document's structure. Where are the text columns? Where are the headers, footers, and page numbers? Where are the tables, images, and captions?

Layout analysis identifies:

| Element | Detection Method | Challenge |
| --- | --- | --- |
| Text blocks | Connected component analysis | Multi-column layouts can merge or split incorrectly |
| Tables | Line detection + grid inference | Borderless tables are notoriously difficult |
| Images | Region classification | Text overlaid on images confuses detection |
| Headers/footers | Position-based rules | Repeated headers might be duplicated in output |
| Reading order | Left-to-right, top-to-bottom heuristics | Complex layouts (magazines, forms) break standard order |

This is where multi-column documents cause problems. If the layout analyzer misidentifies a two-column page as single-column, the output interleaves text from both columns — producing gibberish that requires complete manual reconstruction.
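Connected component analysis, the method listed above for text-block detection, can be illustrated with a toy flood fill over a binary page: contiguous ink regions become candidate blocks, and a column of whitespace between them is what keeps a two-column page from collapsing into one block. This is a sketch of the idea, not a production layout analyzer:

```python
from collections import deque

def text_blocks(grid: list[str]) -> int:
    """Count connected ink regions (4-connectivity) in a tiny
    binary page, where '#' is ink and '.' is background.
    Each region is a candidate text block."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    blocks = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == '#' and not seen[r][c]:
                blocks += 1
                queue = deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == '#' and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return blocks
```

A page drawn as `["##..##", "##..##"]` yields two blocks (two columns); remove the whitespace gutter and the same ink merges into one, which is exactly the misidentification that produces interleaved output.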

Stage 3: Character Segmentation

Once text regions are identified, the engine isolates individual characters. For printed text in standard fonts, this is straightforward — characters have consistent spacing and clear boundaries. But several real-world scenarios make segmentation difficult:

  • Touching characters where ink bleed causes letters to connect (common in faxed documents and low-resolution scans)
  • Broken characters where thin strokes don't scan clearly, splitting a single letter into fragments
  • Proportional fonts where character widths vary, making it harder to determine where one letter ends and another begins
  • Ligatures where multiple characters are rendered as a single glyph (fi, fl, ff in many fonts)
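A classic segmentation technique for printed text is the vertical projection profile: columns of the line image with no ink are treated as gaps between characters. This minimal sketch also exhibits the failure mode described above, because touching characters leave no empty column between them and come back as a single span:

```python
def segment_columns(bitmap: list[str]) -> list[tuple[int, int]]:
    """Split a binary text line into character spans using a
    vertical projection profile. '#' is ink, '.' is background;
    each (start, end) span is a candidate character."""
    width = len(bitmap[0])
    # True for every column that contains at least one ink pixel.
    ink = [any(row[x] == '#' for row in bitmap) for x in range(width)]
    spans: list[tuple[int, int]] = []
    start = None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x                  # entering an ink run
        elif not has_ink and start is not None:
            spans.append((start, x))   # leaving an ink run
            start = None
    if start is not None:
        spans.append((start, width))
    return spans
```

Two glyphs separated by a blank column come back as two spans; if ink bleed fills that column, the same code returns one wide span, which is why touching characters defeat naive segmentation.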

Stage 4: Character Recognition

This is the core of OCR — matching each segmented character image to a known character. Modern OCR engines use two main approaches:

Pattern matching compares character images against a stored library of templates. Fast and effective for standard printed fonts, but struggles with unusual typefaces, degraded print quality, or any variation from the template library.
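A toy version of pattern matching: compare a glyph bitmap against a small template library and pick the closest match by pixel difference. The 3x3 templates here are invented for illustration; real template libraries store many sizes and typefaces per character, which is exactly why the approach breaks on fonts outside the library:

```python
# Hypothetical 3x3 glyph templates; '#' is ink, '.' is background.
TEMPLATES = {
    "I": ".#." ".#." ".#.",
    "O": "###" "#.#" "###",
    "-": "..." "###" "...",
}

def classify(glyph: str) -> str:
    """Pattern matching: return the template character whose
    bitmap differs from the glyph in the fewest pixels."""
    def distance(a: str, b: str) -> int:
        return sum(x != y for x, y in zip(a, b))
    return min(TEMPLATES, key=lambda ch: distance(glyph, TEMPLATES[ch]))
```

A clean "O" matches its template with zero differing pixels; an "O" with one corner pixel missing still matches, because the damaged glyph is closer to "O" than to any other template. Degrade it further and the nearest template can flip to the wrong character.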

Feature extraction with machine learning analyzes structural features of each character (curves, intersections, line angles, stroke counts) and uses trained neural networks to classify them. This approach handles font variations, minor damage, and degraded quality much better than pattern matching.

Current engines (Tesseract 5, ABBYY FineReader, Google Vision) primarily use deep learning — specifically LSTM (Long Short-Term Memory) networks that process characters in sequence, using context from surrounding characters to improve predictions. This is why OCR can sometimes correctly identify a poorly scanned letter based on the word it appears in.

| Recognition Method | Accuracy (clean print) | Accuracy (degraded scan) | Speed |
| --- | --- | --- | --- |
| Pattern matching | 99%+ | 85-90% | Very fast |
| Feature extraction (traditional ML) | 98-99% | 90-95% | Fast |
| Deep learning (LSTM) | 99.5%+ | 92-97% | Moderate |
| Transformer-based (latest) | 99.7%+ | 95-98% | Slower |

Stage 5: Post-Processing

Raw character recognition output contains errors. Post-processing attempts to catch and correct them:

  • Dictionary lookup — flagging words that don't exist in the language dictionary
  • Context analysis — using language models to predict likely words (e.g., "tle" after "bot" is likely "bottle")
  • Format validation — checking that dates, phone numbers, and other structured data match expected patterns
  • Confidence scoring — marking low-confidence characters for human review
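Two of these post-processing steps, confusion repair and format validation, can be sketched in a few lines. The confusion table below is an illustrative assumption; real engines weight candidate corrections by learned confusion probabilities and context rather than a fixed lookup:

```python
import re

# Letters OCR commonly confuses with digits, used only when the
# field is expected to be numeric (illustrative, not exhaustive).
CONFUSIONS = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}

def repair_numeric(field: str) -> str:
    """In a field expected to hold digits, replace letters that
    OCR commonly mistakes for digits."""
    return "".join(CONFUSIONS.get(ch, ch) for ch in field)

def looks_like_date(text: str) -> bool:
    """Format validation: does the field match MM/DD/YYYY?"""
    return re.fullmatch(r"\d{2}/\d{2}/\d{4}", text) is not None
```

Note the order matters: the raw OCR output `1O/O2/2O26` fails date validation, but after confusion repair it becomes `10/02/2026` and passes. Fields that still fail validation after repair are good candidates for the low-confidence human-review queue.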

OCR Accuracy: What to Realistically Expect

Marketing materials for OCR software often claim 99%+ accuracy. That number is achievable — but only under ideal conditions. Real-world accuracy depends heavily on input quality:

| Input Quality | Expected Accuracy | Example |
| --- | --- | --- |
| Clean print, 300+ DPI scan | 99-99.5% | Modern laser-printed document, flatbed scanner |
| Good print, 200 DPI scan | 97-99% | Standard office document, multifunction printer scan |
| Older print, 150 DPI scan | 92-97% | Faxed document, dated laser print |
| Photocopy of a photocopy | 80-92% | Third-generation copy, faded text |
| Phone photo (good lighting) | 90-96% | Straight-on shot, even lighting, clean document |
| Phone photo (poor conditions) | 70-85% | Angled, shadows, page curl, low light |
| Handwriting (printed letters) | 70-90% | Neat block letters, consistent spacing |
| Handwriting (cursive) | 40-70% | Connected script, personal style variations |

The "99% accuracy" figure sounds impressive until you consider what it means for a real document. A single page of text contains roughly 2,000-3,000 characters. At 99% accuracy, that's 20-30 errors per page. At 97% accuracy, it's 60-90 errors per page. For a 50-page document at 97% accuracy, you're looking at 3,000-4,500 character errors — enough to require thorough manual proofreading.

Convert-To Tip

For the best OCR results from scanned documents, scan at 300 DPI or higher in grayscale (not color). Color scans are larger and don't improve text recognition. If you're scanning specifically for OCR processing, black-and-white (1-bit) scans of clean documents actually produce the best results because they eliminate the need for the binarization pre-processing step.

When OCR Fails: Common Failure Modes

Understanding where OCR breaks helps you decide when to trust automated results and when manual verification is essential.

Tabular Data Extraction

Tables are OCR's weakness. Even when character recognition is accurate, reconstructing table structure — which cells belong to which columns, how merged cells span rows — frequently produces mangled output. A 10-row, 5-column financial table might OCR with 99% character accuracy but have columns misaligned, values shifted to wrong cells, or header rows merged with data.

If you're extracting tables from scanned PDFs for spreadsheet use, expect to spend time on manual cleanup. Our PDF to Excel converter handles native (text-based) PDF tables well, but scanned tables still require careful verification.

Mixed-Language Documents

Most OCR engines are trained on single-language text. Documents that mix languages (English body text with French quotations, Japanese product names in an English report) confuse the language model and reduce accuracy for both languages. Specialized multilingual models exist but are slower and still less accurate than single-language processing.

Degraded Historical Documents

Museum archives, century-old newspapers, and historical legal records present extreme challenges: faded ink, inconsistent typefaces, non-standard character sets, yellowed paper with show-through from the reverse side. Accuracy on these documents can drop below 70%, making automated OCR useful primarily as a rough index for human researchers rather than a definitive transcription.

Low-Resolution Input

OCR engines need a minimum of about 200 DPI (8 pixels per character height) to function reliably. Below that threshold — common in fax transmissions, low-resolution phone photos, and heavily compressed web images — characters blur together and segmentation fails. An invoice photographed from across a desk at an angle might technically contain the text, but the OCR engine receives too few pixels per character to distinguish "0" from "O" or "l" from "1".

OCR Workflow: A Practical Example

A small business receives 50 invoices per month as scanned PDFs from various vendors. The accounting team wants to extract amounts, dates, and vendor names for their bookkeeping software.

Step 1: Evaluate the PDFs. Check if any are already text-based (text is selectable). Those don't need OCR — extract directly with PDF to text conversion.

Step 2: For scanned PDFs, run OCR. If the scans are clean 300 DPI documents, expect 97-99% character accuracy.

Step 3: Verify critical numbers. OCR might read "$1,234.56" correctly in the text body but misread the same amount in a table where cell borders overlap with digits. Always cross-check financial figures.

Step 4: Handle exceptions manually. About 5-10% of invoices will have issues: handwritten notes, rubber stamps, low-resolution scans, or unusual formatting that confuses the layout analyzer.

This workflow reduces manual data entry by 80-90% but doesn't eliminate it. The remaining 10-20% of corrections is the practical overhead that OCR introduces, and planning for it upfront prevents the frustration of discovering errors downstream.
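The triage behind Steps 3 and 4 can be sketched as a confidence-threshold router that splits a batch into auto-accepted records and a manual-review queue. The field names and the 0.95 cutoff are illustrative assumptions, not properties of any particular OCR engine:

```python
def triage(invoices: list[dict]) -> tuple[list[dict], list[dict]]:
    """Route OCR'd invoices: auto-accept high-confidence results
    with a recognized amount, queue everything else for review.
    The 'confidence' and 'amount' keys and the 0.95 threshold
    are hypothetical choices for this sketch."""
    auto, review = [], []
    for inv in invoices:
        if inv["confidence"] >= 0.95 and inv["amount"] is not None:
            auto.append(inv)
        else:
            review.append(inv)
    return auto, review
```

Tuning the threshold is the whole game: raise it and more clean invoices land in the review queue; lower it and misread amounts slip into the books. Starting strict and loosening as you measure real error rates is the safer direction.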

Privacy Note

Scanned documents processed through OCR often contain sensitive information — financial records, personal identification numbers, medical data. When you convert a file on Convert-To.co, it is processed by CloudConvert, a GDPR-compliant and ISO 27001 certified service. All files are automatically deleted within 15 minutes after conversion. Convert-To.co does not store your files on its own servers. For documents with strict confidentiality requirements, consider offline OCR software.

Tags

ocr, text extraction, pdf, scanning, technology
