GLM-OCR: Z.ai’s 0.9B Model Takes the Top Spot on Document Understanding Benchmarks

Z.ai has open-sourced GLM-OCR, a compact yet powerful multimodal model that ranks #1 on OmniDocBench V1.5 with a score of 94.62 — all with just 0.9 billion parameters. Released in early February 2026, GLM-OCR challenges the assumption that large model size is a prerequisite for state-of-the-art document understanding.
What Is GLM-OCR?
GLM-OCR is a multimodal optical character recognition model developed by Z.ai (the commercial arm of ZhipuAI) and designed specifically for complex document understanding. It can handle a broad spectrum of real-world materials: scanned PDFs, photos of handwritten notes, dense academic papers with formulas, multi-column tables, code listings, and documents containing stamps or seals.
The model supports three core recognition modes:
- Text Recognition — general OCR for printed and handwritten content
- Formula Recognition — structured LaTeX output for mathematical notation
- Table Recognition — Markdown or HTML table output from complex table layouts
Beyond recognition, GLM-OCR also supports structured information extraction — given a JSON schema, the model extracts key-value pairs from invoices, certificates, receipts, and forms.
Architecture and Training
GLM-OCR is built on the GLM-V encoder–decoder architecture and combines three components:
- CogViT visual encoder — pre-trained on large-scale image–text pairs for rich visual feature extraction
- Lightweight cross-modal connector — bridges vision and language with efficient token downsampling
- GLM-0.5B language decoder — generates structured text output
Two key training innovations distinguish GLM-OCR from conventional OCR systems:
- Multi-Token Prediction (MTP) loss — improves training efficiency and output accuracy
- Stable full-task reinforcement learning — boosts generalization across diverse layouts and document types
Rather than processing an entire page in a single pass, GLM-OCR uses a two-stage pipeline: it first runs layout analysis via PP-DocLayout-V3 to detect regions of interest, then performs OCR on those regions in parallel — enabling both higher accuracy and faster throughput.
Benchmark Performance
GLM-OCR sets a new bar across major document understanding benchmarks:
- OmniDocBench V1.5: 94.62 — ranked #1 overall
- OCRBench: 94.0
- UniMERNet (formula recognition): 96.5
In throughput terms, the model achieves 1.86 pages per second for PDF documents and 0.67 images per second — significantly outperforming comparable models at this parameter scale.
Deployment and Accessibility
At 0.9B parameters, GLM-OCR is designed to run efficiently on commodity hardware. It supports multiple inference backends:
- vLLM — recommended for production, with speculative decoding via MTP
- SGLang — high-performance serving with speculative generation
- Ollama — the simplest path to local deployment:
ollama run glm-ocr - Apple Silicon (mlx-vlm) — optimized for Mac deployments
Z.ai also provides a hosted cloud API at $0.03 per million tokens — uniform pricing for both input and output. The SDK wraps the full pipeline (layout detection, parallel OCR, result formatting) behind a single Python call:
from glmocr import parse
result = parse("document.pdf")
result.save(output_dir="./results")
The model weights are available on Hugging Face and ModelScope under the MIT License, with the layout component under Apache 2.0. As of late February 2026, the model has seen over 1.45 million downloads on Hugging Face.
What This Means
GLM-OCR is a meaningful contribution to the open-source AI ecosystem for several reasons. First, it demonstrates that efficient architecture choices — MTP loss, parallel region processing, and a small but well-trained decoder — can outperform much larger models on document tasks. Second, its sub-1B footprint makes enterprise-grade OCR accessible at the edge, in mobile apps, and in high-concurrency services without the GPU costs of deploying a 7B+ VLM.
For researchers and developers working in document AI, GLM-OCR offers a compelling baseline: faster inference than Tesseract-class tools, better structured-output quality than general VLMs, and the flexibility of full open weights. The upcoming technical report from Z.ai is expected to detail the training data and methodology behind these results.
Related Coverage
- What is GLM-Image? — Z.ai’s open-source image generation model released in January 2026
- GLM-TTS — High-Quality Text-to-Speech Model — Z.ai’s open-source TTS system
- Inside GLM-4.6: Z.ai’s Latest Breakthrough in Large Language Models — background on the GLM model family





沪公网安备31011502017015号