Z.ai has open-sourced GLM-OCR, a compact yet powerful multimodal model that ranks #1 on OmniDocBench V1.5 with a score of 94.62 — all with just 0.9 billion parameters. Released in early February 2026, GLM-OCR challenges the assumption that large model size is a prerequisite for state-of-the-art document understanding.
GLM-OCR is a multimodal optical character recognition model developed by Z.ai (the commercial arm of ZhipuAI) and designed specifically for complex document understanding. It can handle a broad spectrum of real-world materials: scanned PDFs, photos of handwritten notes, dense academic papers with formulas, multi-column tables, code listings, and documents containing stamps or seals.
The model supports three core recognition modes:
Beyond recognition, GLM-OCR also supports structured information extraction — given a JSON schema, the model extracts key-value pairs from invoices, certificates, receipts, and forms.
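The schema-guided extraction described above can be illustrated with a minimal sketch. The schema layout and field names below are assumptions for illustration, not the actual GLM-OCR SDK interface; the point is that the caller supplies a JSON schema and the model returns JSON conforming to it.

```python
# Hypothetical illustration of schema-guided extraction. The field names and
# schema shape are assumptions, not the actual GLM-OCR request format.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string"},
        "total_amount": {"type": "number"},
    },
    "required": ["invoice_number", "total_amount"],
}

# The model would return JSON matching the schema, e.g. for a scanned invoice:
extracted = {
    "invoice_number": "INV-0421",
    "issue_date": "2026-02-03",
    "total_amount": 1299.50,
}

# A caller can cheaply validate that all required fields came back.
missing = [k for k in invoice_schema["required"] if k not in extracted]
assert not missing
```

Constraining output to a schema is what makes the extraction usable downstream: the result can be loaded straight into a database or form-processing pipeline without regex post-processing.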
GLM-OCR is built on the GLM-V encoder–decoder architecture and combines three components:
Two key training innovations distinguish GLM-OCR from conventional OCR systems:
Rather than processing an entire page in a single pass, GLM-OCR uses a two-stage pipeline: it first runs layout analysis via PP-DocLayout-V3 to detect regions of interest, then performs OCR on those regions in parallel — enabling both higher accuracy and faster throughput.
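The two-stage pipeline can be sketched as follows. `detect_regions` and `ocr_region` are hypothetical stand-ins for the layout-analysis and recognition steps, not the real GLM-OCR internals; the sketch only shows why detected regions can be recognized concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def detect_regions(page):
    # Stage 1 (stand-in for PP-DocLayout-V3): return bounding boxes for
    # text / table / formula blocks. Dummy boxes for illustration.
    return [(0, 0, 100, 40), (0, 50, 100, 90)]

def ocr_region(page, box):
    # Stage 2 (stand-in for the recognition model): OCR one cropped region.
    return f"text@{box}"

def parse_page(page):
    boxes = detect_regions(page)
    # Regions are independent once layout is known, so recognition can
    # fan out in parallel instead of decoding the whole page serially.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda b: ocr_region(page, b), boxes))
```

Splitting the page this way also helps accuracy: each recognition pass sees a small, homogeneous crop (one table, one formula block) rather than a full mixed-layout page.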
GLM-OCR sets a new bar across major document understanding benchmarks:
In throughput terms, the model achieves 1.86 pages per second for PDF documents and 0.67 images per second — significantly outperforming comparable models at this parameter scale.
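As a back-of-the-envelope check on what those rates mean in practice (the corpus size is an arbitrary example):

```python
# Reported throughput figures.
pdf_pages_per_sec = 1.86
images_per_sec = 0.67

# Wall-clock time for a 1,000-page PDF corpus at the reported rate.
pdf_minutes = 1000 / pdf_pages_per_sec / 60
print(f"{pdf_minutes:.1f} min")  # prints "9.0 min"
```

Roughly nine minutes for a thousand pages on a single instance, which is the scale at which batch document ingestion becomes practical without a GPU cluster.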
At 0.9B parameters, GLM-OCR is designed to run efficiently on commodity hardware. It supports multiple inference backends:
ollama run glm-ocr
Z.ai also provides a hosted cloud API at $0.03 per million tokens, with uniform pricing for both input and output. The SDK wraps the full pipeline (layout detection, parallel OCR, result formatting) behind a single Python call:
from glmocr import parse

# Runs the full pipeline (layout detection, parallel OCR, result formatting)
# and writes the formatted output to a directory.
result = parse("document.pdf")
result.save(output_dir="./results")
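At the quoted rate, API costs are straightforward to estimate. The tokens-per-page figure below is an assumed illustration, not a published number:

```python
# Uniform hosted-API price quoted above: $0.03 per million tokens.
PRICE_PER_MILLION_USD = 0.03

def cost_usd(total_tokens):
    # Same rate applies to input and output tokens.
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_USD

# Example: a 500-page batch averaging 2,000 tokens per page (assumed figure).
print(f"${cost_usd(500 * 2000):.2f}")  # prints "$0.03"
```

In other words, a million tokens of document traffic costs three cents, which is why the hosted API is positioned for high-volume ingestion workloads.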
The model weights are available on Hugging Face and ModelScope under the MIT License, with the layout component under Apache 2.0. As of late February 2026, the model has seen over 1.45 million downloads on Hugging Face.
GLM-OCR is a meaningful contribution to the open-source AI ecosystem for several reasons. First, it demonstrates that efficient architecture choices — MTP loss, parallel region processing, and a small but well-trained decoder — can outperform much larger models on document tasks. Second, its sub-1B footprint makes enterprise-grade OCR accessible at the edge, in mobile apps, and in high-concurrency services without the GPU costs of deploying a 7B+ VLM.
For researchers and developers working in document AI, GLM-OCR offers a compelling baseline: faster inference than Tesseract-class tools, better structured-output quality than general VLMs, and the flexibility of full open weights. The upcoming technical report from Z.ai is expected to detail the training data and methodology behind these results.
