OCR (optical character recognition) has existed for decades, but anyone who has tried to digitize a document containing complex tables, handwriting, or mathematical equations knows that results are often disappointing. Merged cells get mixed up, columns shift, handwriting becomes illegible, and formulas are simply ignored.
Chandra OCR, developed by Datalab, a Brooklyn-based AI startup, tackles precisely this problem. Its second version, Chandra 2, a 4 billion parameter model, achieves 85.9% on the olmOCR benchmark, positioning it as the best open-source OCR model available. It supports 90+ languages, produces structured outputs in Markdown, HTML, or JSON with full layout metadata, and handles complex cases (tables with merged cells, handwriting, LaTeX equations, forms with checkboxes) that conventional solutions fail to process.
Datalab, founded in June 2024 by Vik Paruchuri and Sandy Kwon, raised $3.5 million in seed funding from Pebblebed. The code is licensed under Apache 2.0, and model weights under modified OpenRAIL-M (free for startups under $2 million in revenue).
What distinguishes Chandra OCR from its predecessors, including Datalab's own earlier tools (Marker and Surya), is its full-page decoding approach.
Traditional OCR systems operate as pipelines: first segment the document into blocks (text, table, image), then process each block separately. This approach works reasonably for simple documents but fails when layout becomes complex. A table whose merged cells span multiple columns, or a page mixing printed text and handwriting, overwhelms segmentation pipelines.
Chandra takes a radically different approach. Built on a vision-language model (reportedly based on Qwen3 VL), it processes the entire page in a single pass. The model "sees" the page as a human would: it simultaneously identifies content types, extracts and captions images, preserves table structures (including colspan and rowspan), reconstructs forms, and handles handwriting and mathematical equations.
The result is a structured output that preserves the logical hierarchy of the original document. An HTML table produced by Chandra retains its merged cells intact. An equation is rendered in LaTeX. A form preserves the relationship between labels and checkboxes.
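To see why preserving colspan and rowspan matters downstream, here is a minimal sketch that expands a merged-cell table into a dense grid where every position carries its value. The cell tuples below are hand-written stand-ins for the HTML a model like Chandra emits; the schema is an illustration, not Chandra's actual output format.

```python
def expand(rows):
    """rows: list of rows; each cell is (text, colspan, rowspan).
    Returns a rectangular grid with merged values duplicated into
    every position they cover."""
    grid, carry = [], {}  # carry: column index -> [rows_remaining, text]
    for cells in rows:
        row, col, queue = [], 0, list(cells)
        while queue or carry:
            if col in carry:  # position held by a rowspan from a row above
                carry[col][0] -= 1
                row.append(carry[col][1])
                if carry[col][0] == 0:
                    del carry[col]
                col += 1
            elif queue:
                text, cspan, rspan = queue.pop(0)
                for _ in range(cspan):  # duplicate across merged columns
                    row.append(text)
                    if rspan > 1:  # remember for the rows below
                        carry[col] = [rspan - 1, text]
                    col += 1
            else:
                break
        grid.append(row)
    return grid

# "Q1" spans two rows; "Subtotal" spans two columns, as in the
# financial-statement case described above.
table = [
    [("Q1", 1, 2), ("Revenue", 1, 1), ("100", 1, 1)],
    [("Cost", 1, 1), ("40", 1, 1)],
    [("Subtotal", 2, 1), ("60", 1, 1)],
]
print(expand(table))
# [['Q1', 'Revenue', '100'], ['Q1', 'Cost', '40'], ['Subtotal', 'Subtotal', '60']]
```

An OCR that flattens the merged cells instead would misalign "Subtotal" with its value, which is exactly the failure mode segmentation pipelines exhibit.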
Inference can run locally via Hugging Face Transformers or through a high-performance vLLM server. In vLLM mode on an H100 GPU, Chandra processes up to 4 pages per second, or roughly 345,000 pages per day. In concurrent mode with 96 parallel instances, per-instance throughput is 1.44 pages per second.
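The daily figure follows directly from the per-second rate; a quick sanity check on the quoted numbers (the rates themselves are taken from the article, and real-world throughput will vary with document complexity and batching):

```python
# Sanity arithmetic on the quoted vLLM/H100 throughput figures.
pages_per_second = 4
pages_per_day = pages_per_second * 60 * 60 * 24
print(pages_per_day)  # 345600, matching the ~345,000 pages/day claim
```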
Installation is straightforward: pip install chandra-ocr followed by chandra input.pdf output/ on the command line. A Python API via InferenceManager is also available for integration into processing pipelines.
The olmOCR benchmark has become a reference in the community for measuring models' ability to correctly extract structured text from complex documents. Here is how Chandra 2 positions itself.
| Model | Overall olmOCR Score | Tables | Math | Headers/Footers |
|---|---|---|---|---|
| Chandra 2 | 85.9% (SOTA) | 89.9% | 89.3% | 92.5% |
| dots.ocr | 83.9% | - | - | - |
| Chandra 1 | 83.1% | - | - | - |
| olmOCR 2 | 78.5% | - | - | - |
| DeepSeek OCR | 75.4% | - | - | - |
| Gemini 2.5 Flash | - | - | - | - |
Chandra 2's performance is particularly impressive on tables (89.9%), mathematical equations (89.3%), and headers/footers (92.5%). These are precisely the areas where other solutions fall short.
In multilingual testing, Chandra 2 achieves an average of 77.8% across 43 languages, compared to 67.6% for Gemini 2.5 Flash. This 10-point multilingual advantage is significant for organizations processing documents in multiple languages.
Technical specifications: 4 billion parameters, 90+ language support, maximum 8,192 output tokens. Quantized versions (8B and 2B) are available commercially for resource-constrained deployments.
Chandra's ability to process complex documents opens varied use cases across different sectors.
In finance and accounting, extracting data from invoices, bank statements, and financial reports is a repetitive and costly task. Chandra can extract numeric tables while preserving merged-cell structures, a critical point for financial statements where subtotals span multiple columns. One user (Purchaser.ai) mentioned in community discussions that automating this type of processing produced six-figure savings.
In legal, digitizing contracts, court decisions, and regulatory documents benefits from Chandra's ability to preserve document hierarchical structure. Article numbers, paragraphs, and reference tables remain correctly associated.
For archives and historical research, handwriting support and 90+ languages make Chandra a valuable tool for digitizing archival documents, notebooks, or historical correspondence.
In education and research, the ability to extract equations as LaTeX is a rare differentiator. Scientific papers, textbooks, and exams containing mathematical formulas can be digitized with fidelity that other OCR solutions cannot offer.
For RAG pipelines and AI applications, Chandra can serve as a document preprocessor to feed retrieval-augmented systems. JSON outputs with bounding boxes enable fine-grained indexing and zone-based search within documents.
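Zone-based search over such output reduces to filtering blocks by their bounding boxes. Here is a minimal sketch; the block schema (`type`, `bbox` as `[x0, y0, x1, y1]` in page coordinates, `text`) is an assumption for illustration, not Chandra's documented JSON format.

```python
def blocks_in_zone(blocks, zone):
    """Keep blocks whose bounding box lies entirely inside `zone`,
    given as (x0, y0, x1, y1)."""
    zx0, zy0, zx1, zy1 = zone
    return [
        b for b in blocks
        if b["bbox"][0] >= zx0 and b["bbox"][1] >= zy0
        and b["bbox"][2] <= zx1 and b["bbox"][3] <= zy1
    ]

# Hypothetical single-page output on a US Letter page (612 x 792 points).
page = [
    {"type": "header", "bbox": [0, 0, 612, 60], "text": "ACME Corp Q3 Report"},
    {"type": "table", "bbox": [50, 200, 560, 480], "text": "<table>...</table>"},
    {"type": "footer", "bbox": [0, 760, 612, 792], "text": "Page 3"},
]
body = blocks_in_zone(page, (0, 80, 612, 700))  # exclude header/footer bands
print([b["type"] for b in body])  # ['table']
```

The same filter, applied at indexing time, lets a RAG system skip running headers and footers that would otherwise pollute retrieved chunks.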
Chandra offers multiple deployment options depending on your needs and resources.
For local, occasional use, pip installation is the most direct path. The command pip install chandra-ocr installs the model and its dependencies. Command-line usage (chandra input.pdf output/) is sufficient for processing individual documents. A GPU is recommended but not strictly necessary for low volumes.
For high-throughput production use, vLLM deployment is the recommended option. On an H100 GPU, vLLM can serve Chandra at 4 pages per second. For large volumes (hundreds of thousands of pages), a multi-GPU configuration with vLLM and 96 concurrent instances achieves industrial throughput.
The Python API via InferenceManager enables integrating Chandra into existing pipelines. The code is structured to interface naturally with document processing frameworks (Haystack, LlamaIndex, etc.).
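Most such frameworks ingest a generic "document with metadata" shape, so the glue code is thin. The sketch below maps Chandra-style blocks (same assumed schema as earlier; `Document` here is a local stand-in, not Haystack's or LlamaIndex's actual class) onto that shape:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Generic document unit, mirroring what RAG frameworks ingest."""
    text: str
    metadata: dict = field(default_factory=dict)

def to_documents(pages):
    """pages: list of (page_number, blocks). Emits one Document per
    content-bearing block, carrying page and position as metadata."""
    docs = []
    for page_no, blocks in pages:
        for b in blocks:
            if b["type"] in ("text", "table"):  # skip images, decorations
                docs.append(Document(
                    text=b["text"],
                    metadata={"page": page_no, "type": b["type"],
                              "bbox": b["bbox"]},
                ))
    return docs

pages = [(1, [
    {"type": "text", "bbox": [50, 80, 560, 200], "text": "Introduction..."},
    {"type": "image", "bbox": [50, 220, 300, 400], "text": ""},
])]
docs = to_documents(pages)
print(len(docs), docs[0].metadata["page"])  # 1 1
```

Keeping the bounding box in metadata preserves the option of zone-aware retrieval later without re-running OCR.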
The model is available on Hugging Face as datalab-to/chandra, and the GitHub repository (github.com/datalab-to/chandra) has 4,700 stars. A free playground is accessible at datalab.to for testing the model without installation.
Chandra's license deserves particular attention because it differs across components.
The source code is under the Apache 2.0 license, meaning it can be used, modified, and redistributed freely, including in commercial products.
The model weights are under a modified OpenRAIL-M license. The important distinction is this: startups and companies with annual revenue below $2 million can use the weights for free. Above that threshold, a commercial license is required.
This two-tier licensing approach is increasingly common in the open-source AI ecosystem. It allows Datalab to support the open-source community while generating revenue from larger enterprises that benefit most from the model.
Datalab also offers a hosted API at datalab.to with a free playground for testing and paid tiers for production. Quantized versions (8B and 2B parameters) are available exclusively through commercial licenses, targeting resource-limited deployments where reduced model size is an advantage.
The OCR market is vast, and the best choice depends on your specific use case.
Against Tesseract, the historical open-source standard, Chandra wins decisively on complex documents (tables, handwriting, multilingual). Tesseract remains relevant for simple OCR of printed text, where it is lighter and faster. If your documents are essentially well-structured printed text, Tesseract suffices. As soon as tables or handwriting enter the picture, Chandra takes the lead.
Against DeepSeek OCR (75.4% olmOCR), Chandra offers a gain of over 10 percentage points. The difference is particularly marked on tables and multilingual documents.
Against olmOCR 2 (78.5%), the model associated with the eponymous benchmark, Chandra exceeds it by 7 points. The advantage is notable but less dramatic, and olmOCR 2 may be preferable in certain specific use cases where its optimizations shine.
Against proprietary solutions like GPT-4o, Mistral OCR, and Gemini 2.5 Flash, Chandra offers the advantage of running locally without API dependency. For organizations subject to confidentiality constraints (financial, medical, legal data), on-premise deployment of Chandra eliminates the risk of data leakage to third-party APIs.
Against PaddleOCR, the major open-source alternative for layout and tables, Chandra distinguishes itself through superior performance on the most complex documents, while PaddleOCR offers a more mature ecosystem and larger community.
| Solution | olmOCR | Handwriting | Tables | Multilingual | Local Deployment |
|---|---|---|---|---|---|
| Chandra 2 | 85.9% | Yes | Excellent | 90+ languages | Yes |
| dots.ocr | 83.9% | Partial | Good | Variable | Variable |
| olmOCR 2 | 78.5% | Limited | Fair | Variable | Yes |
| Tesseract | Low | No | Poor | 100+ languages | Yes |
| GPT-4o | Variable | Yes | Good | Multilingual | No (API) |
Chandra OCR 2 represents an important milestone in OCR's evolution. By processing the entire page as an image and using a vision-language model to understand it, Datalab made an architectural bet that is paying off: documents are no longer decomposed into independent pieces but understood in their totality.
For developers, Chandra significantly simplifies document processing pipelines. Instead of chaining a layout detector, text OCR, table extractor, and formula parser, a single model handles everything. The structured output (JSON with bounding boxes) integrates naturally into RAG pipelines and data extraction applications.
For businesses, the opportunity is automating a process that remains largely manual. Millions of paper documents and PDFs are processed manually every day in the financial, legal, and administrative sectors. An OCR capable of correctly handling tables and handwriting can automate a significant fraction of this work.
The challenge for Datalab will be maintaining its lead in a field attracting increasingly more players. DeepSeek, Google (Gemini), OpenAI, and open-source communities are investing heavily in AI-powered OCR. The pace of iteration will be decisive: between Chandra 1 (83.1%) and Chandra 2 (85.9%), the 2.8-point gain shows steady progress that will need to accelerate to stay ahead.
The GitHub repository (4,700 stars) and active Discord community demonstrate genuine developer engagement. For those who regularly process complex documents and are tired of the approximations from conventional OCR, Chandra is worth testing. The free playground at datalab.to makes it possible with zero commitment.
