GLM-OCR: Z.ai’s 0.9B Model Takes the Top Spot on Document Understanding Benchmarks

Z.ai has open-sourced GLM-OCR, a compact yet powerful multimodal model that ranks #1 on OmniDocBench V1.5 with a score of 94.62 — all with just 0.9 billion parameters. Released in early February 2026, GLM-OCR challenges the assumption that large model size is a prerequisite for state-of-the-art document understanding.

GLM-OCR document parsing output showing text, tables, and formulas extracted from a complex document
Image credit: zai-org/GLM-OCR on GitHub

What Is GLM-OCR?

GLM-OCR is a multimodal optical character recognition model developed by Z.ai (the commercial arm of ZhipuAI) and designed specifically for complex document understanding. It can handle a broad spectrum of real-world materials: scanned PDFs, photos of handwritten notes, dense academic papers with formulas, multi-column tables, code listings, and documents containing stamps or seals.

The model supports three core recognition modes:

  • Text Recognition — general OCR for printed and handwritten content
  • Formula Recognition — structured LaTeX output for mathematical notation
  • Table Recognition — Markdown or HTML table output from complex table layouts

Beyond recognition, GLM-OCR also supports structured information extraction — given a JSON schema, the model extracts key-value pairs from invoices, certificates, receipts, and forms.
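To make the schema-guided extraction idea concrete, here is a minimal sketch of how a caller might phrase such a request. The schema and prompt template below are illustrative assumptions — the article does not document the exact schema format GLM-OCR expects:

```python
import json

# Hypothetical invoice schema; the exact schema format the model
# accepts is an assumption for illustration, not the official API.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "total": {"type": "number"},
    },
}

def extraction_prompt(schema: dict) -> str:
    # Ask the model to fill the schema from the document and reply
    # with JSON only, so the output can be parsed mechanically.
    return ("Extract the following fields from the document and "
            "reply with JSON only:\n" + json.dumps(schema, indent=2))
```

Pairing a prompt like this with a document image is the usual pattern for turning free-form OCR output into machine-readable key-value pairs.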

Architecture and Training

GLM-OCR is built on the GLM-V encoder–decoder architecture and combines three components:

  • CogViT visual encoder — pre-trained on large-scale image–text pairs for rich visual feature extraction
  • Lightweight cross-modal connector — bridges vision and language with efficient token downsampling
  • GLM-0.5B language decoder — generates structured text output
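The connector's token downsampling can be pictured as merging neighboring vision tokens before they reach the decoder. The toy function below averages each 2×2 block of tokens, quartering the sequence length; it is purely illustrative, since GLM-OCR's actual connector design is not detailed here (real connectors typically concatenate and project rather than average):

```python
def downsample_2x2(tokens, grid_w):
    """Merge each 2x2 block of a row-major token grid into one token.

    tokens: flat list of token vectors over a grid_w-wide grid.
    Returns a list one quarter the length (illustrative averaging).
    """
    grid_h = len(tokens) // grid_w
    merged = []
    for y in range(0, grid_h, 2):
        for x in range(0, grid_w, 2):
            block = [tokens[(y + dy) * grid_w + (x + dx)]
                     for dy in (0, 1) for dx in (0, 1)]
            # Average the four vectors element-wise
            merged.append([sum(v) / 4 for v in zip(*block)])
    return merged
```

Fewer vision tokens per page is one of the levers that lets a 0.5B decoder stay fast on dense documents.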

Two key training innovations distinguish GLM-OCR from conventional OCR systems:

  • Multi-Token Prediction (MTP) loss — improves training efficiency and output accuracy
  • Stable full-task reinforcement learning — boosts generalization across diverse layouts and document types
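As a rough intuition for the MTP objective, imagine extra prediction heads that each score a token further in the future, with their cross-entropy losses averaged. This is a generic textbook-style sketch of multi-token prediction, not the loss Z.ai actually used (which is unpublished pending the technical report):

```python
import math

def mtp_loss(gold_probs_per_head):
    """Generic multi-token-prediction loss sketch.

    gold_probs_per_head[k] holds, for head k (which predicts the
    (k+1)-th future token), the probability the model assigned to
    the true token at each position. Returns mean cross-entropy
    averaged over heads.
    """
    head_nlls = [sum(-math.log(p) for p in head) / len(head)
                 for head in gold_probs_per_head]
    return sum(head_nlls) / len(head_nlls)
```

Supervising several future tokens per step gives the model more learning signal per document, which is one common motivation cited for MTP-style losses.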

Rather than processing an entire page in a single pass, GLM-OCR uses a two-stage pipeline: it first runs layout analysis via PP-DocLayout-V3 to detect regions of interest, then performs OCR on those regions in parallel — enabling both higher accuracy and faster throughput.
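The two-stage flow can be sketched as follows. Here `detect_regions` and `ocr_region` are hypothetical stand-ins for PP-DocLayout-V3 and the GLM-OCR model call, not the project's real APIs:

```python
from concurrent.futures import ThreadPoolExecutor

def detect_regions(page):
    # Stand-in for PP-DocLayout-V3: return (kind, bbox) pairs
    # in reading order for the page.
    return [("text", (0, 0, 100, 40)), ("table", (0, 50, 100, 90))]

def ocr_region(page, region):
    kind, bbox = region
    # Stand-in for model inference on one cropped region.
    return f"<{kind} @ {bbox}>"

def parse_page(page):
    regions = detect_regions(page)
    with ThreadPoolExecutor() as pool:
        # OCR every detected region concurrently; map preserves
        # the layout order returned by the detector.
        return list(pool.map(lambda r: ocr_region(page, r), regions))
```

Because each region is small and independent, the OCR calls batch and parallelize well, which is where the pipeline's throughput advantage comes from.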

GLM-OCR handling diverse real-world document types including stamps, tables, and mixed-layout pages
Image credit: zai-org/GLM-OCR on GitHub

Benchmark Performance

GLM-OCR sets a new bar across major document understanding benchmarks:

  • OmniDocBench V1.5: 94.62 — ranked #1 overall
  • OCRBench: 94.0
  • UniMERNet (formula recognition): 96.5

On throughput, the model processes PDF documents at 1.86 pages per second and standalone images at 0.67 images per second — significantly faster than comparable models at this parameter scale.

GLM-OCR inference speed benchmarks compared to other OCR models
Image credit: zai-org/GLM-OCR on GitHub
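Those throughput figures translate directly into batch-time estimates. A back-of-the-envelope helper, using the rates reported above (actual speed will vary with hardware and serving backend):

```python
# Throughput figures from this article; real-world rates depend on
# hardware, batch size, and inference backend.
PDF_PAGES_PER_SEC = 1.86
IMAGES_PER_SEC = 0.67

def batch_seconds(pdf_pages: int, images: int = 0) -> float:
    """Estimated wall-clock seconds to process a mixed batch."""
    return pdf_pages / PDF_PAGES_PER_SEC + images / IMAGES_PER_SEC
```

At those rates, a 100-page PDF takes roughly 54 seconds end to end.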

Deployment and Accessibility

At 0.9B parameters, GLM-OCR is designed to run efficiently on commodity hardware. It supports multiple inference backends:

  • vLLM — recommended for production, with speculative decoding via MTP
  • SGLang — high-performance serving with speculative generation
  • Ollama — the simplest path to local deployment: ollama run glm-ocr
  • Apple Silicon (mlx-vlm) — optimized for Mac deployments
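When served through vLLM or SGLang, the model is reachable over an OpenAI-compatible chat endpoint, so a request carries the page image as a data URL. The sketch below only builds the request payload; the model name and prompt are assumptions for illustration:

```python
import base64

def build_ocr_request(image_path: str,
                      prompt: str = "Extract all text from this page.") -> dict:
    """Build an OpenAI-compatible chat payload for a vLLM/SGLang server.

    The model identifier "zai-org/GLM-OCR" and the prompt wording are
    illustrative assumptions, not taken from the official repo.
    """
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "model": "zai-org/GLM-OCR",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }
```

POSTing this payload to the server's `/v1/chat/completions` route returns the recognized text in the usual chat-completion response shape.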

Z.ai also provides a hosted cloud API at $0.03 per million tokens — uniform pricing for both input and output. The SDK wraps the full pipeline (layout detection, parallel OCR, result formatting) behind a single Python call:

from glmocr import parse

# One call runs layout detection, parallel OCR, and result formatting
result = parse("document.pdf")
result.save(output_dir="./results")
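The flat per-token rate makes cost estimation trivial. A quick sketch, using the price quoted above:

```python
# Flat rate from this article: $0.03 per million tokens, applied
# uniformly to input and output.
PRICE_PER_MILLION_TOKENS = 0.03  # USD

def api_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated hosted-API cost for one job."""
    total = input_tokens + output_tokens
    return total / 1_000_000 * PRICE_PER_MILLION_TOKENS
```

For example, a batch consuming five million tokens in total costs about $0.15.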

The model weights are available on Hugging Face and ModelScope under the MIT License, with the layout component under Apache 2.0. As of late February 2026, the model has seen over 1.45 million downloads on Hugging Face.

What This Means

GLM-OCR is a meaningful contribution to the open-source AI ecosystem for several reasons. First, it demonstrates that efficient architecture choices — MTP loss, parallel region processing, and a small but well-trained decoder — can outperform much larger models on document tasks. Second, its sub-1B footprint makes enterprise-grade OCR accessible at the edge, in mobile apps, and in high-concurrency services without the GPU costs of deploying a 7B+ VLM.

For researchers and developers working in document AI, GLM-OCR offers a compelling baseline: faster inference than Tesseract-class tools, better structured-output quality than general VLMs, and the flexibility of full open weights. The upcoming technical report from Z.ai is expected to detail the training data and methodology behind these results.
