OCR (optical character recognition) has existed for decades, but anyone who has tried to digitize a document containing complex tables, handwriting, or mathematical equations knows that results are often disappointing. Merged cells get mixed up, columns shift, handwriting becomes illegible, and formulas are simply ignored.
Chandra OCR, developed by Datalab, a Brooklyn-based AI startup, tackles precisely this problem. Its second version, Chandra 2, a 4 billion parameter model, achieves 85.9% on the olmOCR benchmark, positioning it as the best open-source OCR model available. It supports 90+ languages, produces structured outputs in Markdown, HTML, or JSON with full layout metadata, and handles complex cases (tables with merged cells, handwriting, LaTeX equations, forms with checkboxes) that conventional solutions fail to process.
Datalab, founded in June 2024 by Vik Paruchuri and Sandy Kwon, raised $3.5 million in seed funding from Pebblebed. The code is licensed under Apache 2.0, and model weights under modified OpenRAIL-M (free for startups under $2 million in revenue).
What distinguishes Chandra OCR from its predecessors, including Datalab's own earlier tools (Marker and Surya), is its full-page decoding approach.
Traditional OCR systems operate as pipelines: first segment the document into blocks (text, table, image), then process each block separately. This approach works reasonably for simple documents but fails when layout becomes complex. A table whose merged cells span multiple columns, or a page mixing printed text and handwriting, overwhelms segmentation pipelines.
Chandra takes a radically different approach. Built on a vision-language model (reportedly based on Qwen3 VL), it processes the entire page in a single pass. The model "sees" the page as a human would: it simultaneously identifies content types, extracts and captions images, preserves table structures (including colspan and rowspan), reconstructs forms, and handles handwriting and mathematical equations.
The result is a structured output that preserves the logical hierarchy of the original document. An HTML table produced by Chandra retains its merged cells intact. An equation is rendered in LaTeX. A form preserves the relationship between labels and checkboxes.
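To see why preserving colspan and rowspan matters downstream, here is a minimal sketch that expands a merged-cell table into a dense grid where every position carries its value. The cell tuples below are hand-written stand-ins for the HTML a model like Chandra emits; the schema is an illustration, not Chandra's actual output format.

```python
def expand(rows):
    """rows: list of rows; each cell is (text, colspan, rowspan).
    Returns a rectangular grid with merged values duplicated into
    every position they cover."""
    grid, carry = [], {}  # carry: column index -> [rows_remaining, text]
    for cells in rows:
        row, col, queue = [], 0, list(cells)
        while queue or carry:
            if col in carry:  # position held by a rowspan from a row above
                carry[col][0] -= 1
                row.append(carry[col][1])
                if carry[col][0] == 0:
                    del carry[col]
                col += 1
            elif queue:
                text, cspan, rspan = queue.pop(0)
                for _ in range(cspan):  # duplicate across merged columns
                    row.append(text)
                    if rspan > 1:  # remember for the rows below
                        carry[col] = [rspan - 1, text]
                    col += 1
            else:
                break
        grid.append(row)
    return grid

# "Q1" spans two rows; "Subtotal" spans two columns, as in the
# financial-statement case described above.
table = [
    [("Q1", 1, 2), ("Revenue", 1, 1), ("100", 1, 1)],
    [("Cost", 1, 1), ("40", 1, 1)],
    [("Subtotal", 2, 1), ("60", 1, 1)],
]
print(expand(table))
# [['Q1', 'Revenue', '100'], ['Q1', 'Cost', '40'], ['Subtotal', 'Subtotal', '60']]
```

An OCR that flattens the merged cells instead would misalign "Subtotal" with its value, which is exactly the failure mode segmentation pipelines exhibit.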
Inference can run locally via Hugging Face Transformers or through a high-performance vLLM server. In vLLM mode on an H100 GPU, Chandra processes up to 4 pages per second, or roughly 345,000 pages per day. In concurrent mode with 96 parallel instances, per-instance throughput is 1.44 pages per second.
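The daily figure follows directly from the per-second rate; a quick sanity check on the quoted numbers (the rates themselves are taken from the article, and real-world throughput will vary with document complexity and batching):

```python
# Sanity arithmetic on the quoted vLLM/H100 throughput figures.
pages_per_second = 4
pages_per_day = pages_per_second * 60 * 60 * 24
print(pages_per_day)  # 345600, matching the ~345,000 pages/day claim
```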
Installation is straightforward: pip install chandra-ocr followed by chandra input.pdf output/ on the command line. A Python API via InferenceManager is also available for integration into processing pipelines.
The olmOCR benchmark has become a reference in the community for measuring models' ability to correctly extract structured text from complex documents. Here is how Chandra 2 positions itself.
| Model | Overall olmOCR Score | Tables | Math | Headers/Footers |
|---|---|---|---|---|
| Chandra 2 | 85.9% (SOTA) | 89.9% | 89.3% | 92.5% |
| dots.ocr | 83.9% | - | - | - |
| Chandra 1 | 83.1% | - | - | - |
| olmOCR 2 | 78.5% | - | - | - |
| DeepSeek OCR | 75.4% | - | - | - |
| Gemini 2.5 Flash | - | - | - | - |
Chandra 2's performance is particularly impressive on tables (89.9%), mathematical equations (89.3%), and headers/footers (92.5%). These are precisely the areas where other solutions fall short.
In multilingual testing, Chandra 2 achieves an average of 77.8% across 43 languages, compared to 67.6% for Gemini 2.5 Flash. This 10-point multilingual advantage is significant for organizations processing documents in multiple languages.
Technical specifications: 4 billion parameters, 90+ language support, maximum 8,192 output tokens. Quantized versions (8B and 2B) are available commercially for resource-constrained deployments.
Chandra's ability to process complex documents opens varied use cases across different sectors.
In finance and accounting, extracting data from invoices, bank statements, and financial reports is a repetitive and costly task. Chandra can extract numeric tables while preserving merged-cell structures, a critical point for financial statements where subtotals span multiple columns. One user (Purchaser.ai) mentioned in community discussions that automating this type of processing produced six-figure savings.
In legal, digitizing contracts, court decisions, and regulatory documents benefits from Chandra's ability to preserve document hierarchical structure. Article numbers, paragraphs, and reference tables remain correctly associated.
For archives and historical research, handwriting support and 90+ languages make Chandra a valuable tool for digitizing archival documents, notebooks, or historical correspondence.
In education and research, the ability to extract equations as LaTeX is a rare differentiator. Scientific papers, textbooks, and exams containing mathematical formulas can be digitized with fidelity that other OCR solutions cannot offer.
For RAG pipelines and AI applications, Chandra can serve as a document preprocessor to feed retrieval-augmented systems. JSON outputs with bounding boxes enable fine-grained indexing and zone-based search within documents.
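Zone-based search over such output reduces to filtering blocks by their bounding boxes. Here is a minimal sketch; the block schema (`type`, `bbox` as `[x0, y0, x1, y1]` in page coordinates, `text`) is an assumption for illustration, not Chandra's documented JSON format.

```python
def blocks_in_zone(blocks, zone):
    """Keep blocks whose bounding box lies entirely inside `zone`,
    given as (x0, y0, x1, y1)."""
    zx0, zy0, zx1, zy1 = zone
    return [
        b for b in blocks
        if b["bbox"][0] >= zx0 and b["bbox"][1] >= zy0
        and b["bbox"][2] <= zx1 and b["bbox"][3] <= zy1
    ]

# Hypothetical single-page output on a US Letter page (612 x 792 points).
page = [
    {"type": "header", "bbox": [0, 0, 612, 60], "text": "ACME Corp Q3 Report"},
    {"type": "table", "bbox": [50, 200, 560, 480], "text": "<table>...</table>"},
    {"type": "footer", "bbox": [0, 760, 612, 792], "text": "Page 3"},
]
body = blocks_in_zone(page, (0, 80, 612, 700))  # exclude header/footer bands
print([b["type"] for b in body])  # ['table']
```

The same filter, applied at indexing time, lets a RAG system skip running headers and footers that would otherwise pollute retrieved chunks.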
Chandra offers multiple deployment options depending on your needs and resources.
For local, occasional use, pip installation is the most direct path. The command pip install chandra-ocr installs the model and its dependencies. Command-line usage (chandra input.pdf output/) is sufficient for processing individual documents. A GPU is recommended but not strictly necessary for low volumes.
For high-throughput production use, vLLM deployment is the recommended option. On an H100 GPU, vLLM can serve Chandra at 4 pages per second. For large volumes (hundreds of thousands of pages), a multi-GPU configuration with vLLM and 96 concurrent instances achieves industrial throughput.
The Python API via InferenceManager enables integrating Chandra into existing pipelines. The code is structured to interface naturally with document processing frameworks (Haystack, LlamaIndex, etc.).
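Most such frameworks ingest a generic "document with metadata" shape, so the glue code is thin. The sketch below maps Chandra-style blocks (same assumed schema as earlier; `Document` here is a local stand-in, not Haystack's or LlamaIndex's actual class) onto that shape:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Generic document unit, mirroring what RAG frameworks ingest."""
    text: str
    metadata: dict = field(default_factory=dict)

def to_documents(pages):
    """pages: list of (page_number, blocks). Emits one Document per
    content-bearing block, carrying page and position as metadata."""
    docs = []
    for page_no, blocks in pages:
        for b in blocks:
            if b["type"] in ("text", "table"):  # skip images, decorations
                docs.append(Document(
                    text=b["text"],
                    metadata={"page": page_no, "type": b["type"],
                              "bbox": b["bbox"]},
                ))
    return docs

pages = [(1, [
    {"type": "text", "bbox": [50, 80, 560, 200], "text": "Introduction..."},
    {"type": "image", "bbox": [50, 220, 300, 400], "text": ""},
])]
docs = to_documents(pages)
print(len(docs), docs[0].metadata["page"])  # 1 1
```

Keeping the bounding box in metadata preserves the option of zone-aware retrieval later without re-running OCR.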
The model is available on Hugging Face as datalab-to/chandra, and the GitHub repository (github.com/datalab-to/chandra) has 4,700 stars. A free playground is accessible at datalab.to for testing the model without installation.
Chandra's license deserves particular attention because it differs across components.
The source code is under the Apache 2.0 license, meaning it can be used, modified, and redistributed freely, including in commercial products.
The model weights are under a modified OpenRAIL-M license. The important distinction is this: startups and companies with annual revenue below $2 million can use the weights for free. Above that threshold, a commercial license is required.
This two-tier licensing approach is increasingly common in the open-source AI ecosystem. It allows Datalab to support the open-source community while generating revenue from larger enterprises that benefit most from the model.
Datalab also offers a hosted API at datalab.to with a free playground for testing and paid tiers for production. Quantized versions (8B and 2B parameters) are available exclusively through commercial licenses, targeting resource-limited deployments where reduced model size is an advantage.
The OCR market is vast, and the best choice depends on your specific use case.
Against Tesseract, the historical open-source standard, Chandra wins decisively on complex documents (tables, handwriting, multilingual). Tesseract remains relevant for simple OCR of printed text, where it is lighter and faster. If your documents are essentially well-structured printed text, Tesseract suffices. As soon as tables or handwriting enter the picture, Chandra takes the lead.
Against DeepSeek OCR (75.4% olmOCR), Chandra offers a gain of over 10 percentage points. The difference is particularly marked on tables and multilingual documents.
Against olmOCR 2 (78.5%), the model associated with the eponymous benchmark, Chandra exceeds it by 7 points. The advantage is notable but less dramatic, and olmOCR 2 may be preferable in certain specific use cases where its optimizations shine.
Against proprietary solutions like GPT-4o, Mistral OCR, and Gemini 2.5 Flash, Chandra offers the advantage of running locally without API dependency. For organizations subject to confidentiality constraints (financial, medical, legal data), on-premise deployment of Chandra eliminates the risk of data leakage to third-party APIs.
Against PaddleOCR, the major open-source alternative for layout and tables, Chandra distinguishes itself through superior performance on the most complex documents, while PaddleOCR offers a more mature ecosystem and larger community.
| Solution | olmOCR | Handwriting | Tables | Multilingual | Local Deployment |
|---|---|---|---|---|---|
| Chandra 2 | 85.9% | Yes | Excellent | 90+ languages | Yes |
| dots.ocr | 83.9% | Partial | Good | Variable | Variable |
| olmOCR 2 | 78.5% | Limited | Fair | Variable | Yes |
| Tesseract | Low | No | Poor | 100+ languages | Yes |
| GPT-4o | Variable | Yes | Good | Multilingual | No (API) |
Chandra OCR 2 represents an important milestone in OCR's evolution. By processing the entire page as an image and using a vision-language model to understand it, Datalab made an architectural bet that is paying off: documents are no longer decomposed into independent pieces but understood in their totality.
For developers, Chandra significantly simplifies document processing pipelines. Instead of chaining a layout detector, text OCR, table extractor, and formula parser, a single model handles everything. The structured output (JSON with bounding boxes) integrates naturally into RAG pipelines and data extraction applications.
For businesses, the opportunity is automating a process that remains largely manual. Millions of paper documents and PDFs are processed manually every day in the financial, legal, and administrative sectors. An OCR capable of correctly handling tables and handwriting can automate a significant fraction of this work.
The challenge for Datalab will be maintaining its lead in a field attracting increasingly more players. DeepSeek, Google (Gemini), OpenAI, and open-source communities are investing heavily in AI-powered OCR. The pace of iteration will be decisive: between Chandra 1 (83.1%) and Chandra 2 (85.9%), the 2.8-point gain shows steady progress that will need to accelerate to stay ahead.
The GitHub repository (4,700 stars) and active Discord community demonstrate genuine developer engagement. For those who regularly process complex documents and are tired of the approximations from conventional OCR, Chandra is worth testing. The free playground at datalab.to makes it possible with zero commitment.
