What is OCR?

OCR (Optical Character Recognition) is a technology that converts documents such as scanned paper, PDF files, or images captured by a digital camera into editable and searchable data.

What file formats are supported?

We support all major image formats including JPG, PNG, WEBP, TIFF, BMP, and PDF documents.

How accurate is the text recognition?

Our OCR engine achieves 99.9% accuracy on clear, high-quality documents. Accuracy may vary based on image quality, handwriting, and document complexity.

Is my data secure?

Yes, all data is encrypted in transit and at rest. We use industry-standard security practices and do not share your data with third parties.

What languages are supported?

We support 107+ languages, including English, Spanish, French, German, Chinese, Japanese, Arabic, and many more.

Can I scan Spanish documents?

Yes! Our OCR fully supports Spanish-language documents. Select 'Español' as your document language before scanning to get optimized results for Spanish text.

Can I translate scanned text?

Yes! ScanThisText offers AI-powered translation for extracted text. After scanning a document, you can translate it to any of our 107+ supported languages instantly.

How many languages can I translate to?

Our translation service supports 107+ languages including English, Spanish, French, German, Chinese, Japanese, Arabic, Portuguese, Italian, Russian, Korean, and many more.

Custom AI Document Models for Enterprise

Generic document AI models are trained on public corpora — invoices, receipts, passports, standard contracts. They're great at the common case. They struggle the moment your documents are industry-specific: insurance loss runs with carrier-unique layouts, clinical trial case report forms, utility bills with regulatory line items, freight bills of lading, or any form your team designed in-house a decade ago.

That's where custom models earn their keep.

The Long-Tail Problem

Enterprises rarely have one document type. They have 40. And while generic extraction handles 25 of them at 95% accuracy, the remaining 15 sit somewhere between 60% and 80% — low enough that you still need humans reviewing every one. The long tail is where manual cost actually accumulates.
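
To make the long-tail arithmetic concrete, here is a rough sketch. The type counts and accuracy figures come from the paragraph above; the equal monthly volume per type and the reading of "accuracy" as a per-document success rate are assumptions made purely for illustration.

```python
# Hypothetical illustration: only the type counts (25 and 15) and the
# accuracy figures (95%, 60-80%) come from the text. Everything else is assumed.
DOCS_PER_TYPE = 1_000  # assumed equal monthly volume for every document type


def expected_error_docs(num_types: int, accuracy: float,
                        docs_per_type: int = DOCS_PER_TYPE) -> float:
    """Expected documents per month containing an extraction error,
    treating accuracy as a per-document success rate (a simplification)."""
    return num_types * docs_per_type * (1.0 - accuracy)


head = expected_error_docs(25, 0.95)  # document types the generic model handles well
tail = expected_error_docs(15, 0.70)  # midpoint of the 60-80% range
tail_share = tail / (head + tail)

print(f"head errors: {head:.0f}, tail errors: {tail:.0f}, tail share: {tail_share:.0%}")
```

Under these assumptions, the 15 long-tail types account for well over half of the documents with errors even though they are a minority of the types. If every long-tail document is sent to review, as the paragraph suggests, the gap widens further.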

What Custom Training Changes

ScanThisText's Enterprise tier supports training on your specific document templates. Provide 50–200 examples of a document type with labeled fields, and the model learns the layout, the vocabulary, the edge cases, and the conventions your vendors or internal teams use. Accuracy on that document type typically jumps from 70–80% to 97%+ on the validation set.
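
As a rough illustration of what "accuracy on the validation set" can mean, the sketch below holds out part of a labeled set and scores predictions field by field. This is not ScanThisText's training code: the function names, the 20% hold-out fraction, the exact-match scoring rule, and the sample field names are all assumptions.

```python
import random


def split_examples(examples, validation_fraction=0.2, seed=0):
    """Shuffle labeled examples and split them into (training, validation).
    The 20% hold-out fraction is an assumed default, not a documented one."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[cut:], shuffled[:cut]


def field_accuracy(predictions, labels):
    """Fraction of labeled fields whose predicted value matches exactly."""
    total = correct = 0
    for predicted, expected in zip(predictions, labels):
        for field, value in expected.items():
            total += 1
            if predicted.get(field) == value:
                correct += 1
    return correct / total if total else 0.0


# Toy labeled documents with hypothetical field names.
labels = [
    {"policy_no": "A-100", "loss_total": "1200.00"},
    {"policy_no": "B-200", "loss_total": "0.00"},
]
predictions = [
    {"policy_no": "A-100", "loss_total": "1200.00"},
    {"policy_no": "B-200", "loss_total": "8.00"},  # one misread field
]

accuracy = field_accuracy(predictions, labels)
train, validation = split_examples(range(100))
print(f"field accuracy: {accuracy:.2f}, train: {len(train)}, validation: {len(validation)}")
```

Exact-match scoring is the strictest option. A real evaluation might normalize whitespace, dates, or currency formats before comparing values.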

Where Custom Models Pay Off Fastest

Insurance: Loss runs, ACORD forms with carrier overlays, declaration pages, endorsements.
Healthcare: Clinical trial CRFs, specialty referral forms, prior authorization packets.
Logistics: Bills of lading, customs forms, delivery receipts with carrier-specific formats.
Energy & utilities: Meter readings, regulatory filings, land lease documents.
Financial services: Loan applications, settlement statements, trade confirmations from specific counterparties.
Internal forms: Expense reports, intake forms, legacy PDFs your org has used for 15 years.

How the Training Loop Works

Label a starter set: Upload 50–200 representative documents. Our team annotates fields in collaboration with your SMEs.
Train and validate: The model is fine-tuned on your data and evaluated against a held-out validation set.
Shadow run: The custom model runs in parallel with your current process for 2–4 weeks. You see exactly where it agrees, where it flags low confidence, and where it disagrees.
Go live: Custom model handles the long-tail document types via the same API and UI as the generic pipeline.
Continuous improvement: Reviewer corrections feed back into the training set on a monthly cadence — accuracy climbs as volume grows.
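
The shadow-run step above can be sketched as a simple comparison between the custom model's output and the value produced by the current process. The record layout, the field names, and the 0.80 confidence threshold are hypothetical and chosen only for illustration.

```python
from collections import Counter

LOW_CONFIDENCE = 0.80  # assumed threshold, not a documented product setting


def compare_field(model_value, model_confidence, current_value,
                  threshold=LOW_CONFIDENCE):
    """Classify one extracted field as agree, disagree, or low_confidence."""
    if model_confidence < threshold:
        return "low_confidence"
    return "agree" if model_value == current_value else "disagree"


def summarize_shadow_run(records):
    """Count outcomes across all field comparisons from a shadow run."""
    return Counter(
        compare_field(r["model_value"], r["confidence"], r["current_value"])
        for r in records
    )


# Hypothetical bill-of-lading fields from a shadow run.
records = [
    {"field": "bol_number", "model_value": "778812", "confidence": 0.98,
     "current_value": "778812"},
    {"field": "carrier", "model_value": "ACME Freight", "confidence": 0.62,
     "current_value": "Acme Freight Inc."},
    {"field": "weight_lb", "model_value": "4200", "confidence": 0.91,
     "current_value": "4800"},
]

summary = summarize_shadow_run(records)
print(dict(summary))
```

Checking confidence before comparing values keeps low-confidence fields in their own bucket, so the "disagree" count reflects only cases where the model was confident and still differed from the current process.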

Your Data Stays Yours

Custom models are trained in an isolated tenant. Your training data isn't used to improve the generic model, and the custom model isn't shared across customers. Data processing terms and the BAA (where applicable) cover the full training loop.

When It's Worth the Investment

The math is straightforward: if you process more than a few thousand documents a month of a specific type, and generic accuracy leaves more than 10% needing manual review, a custom model typically pays back in 60–90 days. Below that volume, generic extraction plus reviewer-in-the-loop is usually the right call.
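
A back-of-the-envelope version of that math might look like the sketch below. Every input is a hypothetical placeholder (volume, review rates, cost per review, and model cost are not ScanThisText pricing or measured results), so substitute your own figures.

```python
def payback_days(docs_per_month, review_rate_before, review_rate_after,
                 cost_per_review, model_cost):
    """Days until cumulative review savings cover the custom model cost.
    Returns None when the model does not reduce review volume."""
    reviews_saved = docs_per_month * (review_rate_before - review_rate_after)
    monthly_savings = reviews_saved * cost_per_review
    if monthly_savings <= 0:
        return None
    return model_cost / monthly_savings * 30  # approximate a month as 30 days


# All figures below are assumptions chosen for illustration only.
days = payback_days(
    docs_per_month=5_000,
    review_rate_before=0.25,  # share of documents needing manual review today
    review_rate_after=0.03,   # share still needing review with a custom model
    cost_per_review=2.00,     # assumed fully loaded cost per reviewed document
    model_cost=6_000,         # assumed one-time training and setup cost
)
print(f"payback in about {days:.0f} days")
```

With these placeholder inputs the payback lands at roughly 82 days. Lower volume or a smaller drop in the review rate stretches it out quickly, which is the reasoning behind the volume threshold described above.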

Scope a Custom Model

Bring a document type and a sample set, and we'll scope the training effort, projected accuracy, and ROI in a 45-minute working session. Book an Enterprise document modeling call — and see what your long tail would look like at 97% accuracy.
