GLM-OCR: Z.ai’s 0.9B Model Takes the Top Spot on Document Understanding Benchmarks

Z.ai has open-sourced GLM-OCR, a compact yet powerful multimodal model that ranks #1 on OmniDocBench V1.5 with a score of 94.62 — all with just 0.9 billion parameters. Released in early February 2026, GLM-OCR challenges the assumption that large model size is a prerequisite for state-of-the-art document understanding.

GLM-OCR document parsing output showing text, tables, and formulas extracted from a complex document
Image credit: zai-org/GLM-OCR on GitHub

What Is GLM-OCR?

GLM-OCR is a multimodal optical character recognition model developed by Z.ai (the commercial arm of ZhipuAI) and designed specifically for complex document understanding. It can handle a broad spectrum of real-world materials: scanned PDFs, photos of handwritten notes, dense academic papers with formulas, multi-column tables, code listings, and documents containing stamps or seals.

The model supports three core recognition modes:

  • Text Recognition — general OCR for printed and handwritten content
  • Formula Recognition — structured LaTeX output for mathematical notation
  • Table Recognition — Markdown or HTML table output from complex table layouts

Beyond recognition, GLM-OCR also supports structured information extraction — given a JSON schema, the model extracts key-value pairs from invoices, certificates, receipts, and forms.
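To make the schema-guided extraction idea concrete, here is a minimal sketch of how a caller might phrase such a request. The schema and prompt template below are illustrative assumptions — the article does not document the exact schema format GLM-OCR expects:

```python
import json

# Hypothetical invoice schema; the exact schema format the model
# accepts is an assumption for illustration, not the official API.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "total": {"type": "number"},
    },
}

def extraction_prompt(schema: dict) -> str:
    # Ask the model to fill the schema from the document and reply
    # with JSON only, so the output can be parsed mechanically.
    return ("Extract the following fields from the document and "
            "reply with JSON only:\n" + json.dumps(schema, indent=2))
```

Pairing a prompt like this with a document image is the usual pattern for turning free-form OCR output into machine-readable key-value pairs.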

Architecture and Training

GLM-OCR is built on the GLM-V encoder–decoder architecture and combines three components:

  • CogViT visual encoder — pre-trained on large-scale image–text pairs for rich visual feature extraction
  • Lightweight cross-modal connector — bridges vision and language with efficient token downsampling
  • GLM-0.5B language decoder — generates structured text output
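The connector's token downsampling can be pictured as merging neighboring vision tokens before they reach the decoder. The toy function below averages each 2×2 block of tokens, quartering the sequence length; it is purely illustrative, since GLM-OCR's actual connector design is not detailed here (real connectors typically concatenate and project rather than average):

```python
def downsample_2x2(tokens, grid_w):
    """Merge each 2x2 block of a row-major token grid into one token.

    tokens: flat list of token vectors over a grid_w-wide grid.
    Returns a list one quarter the length (illustrative averaging).
    """
    grid_h = len(tokens) // grid_w
    merged = []
    for y in range(0, grid_h, 2):
        for x in range(0, grid_w, 2):
            block = [tokens[(y + dy) * grid_w + (x + dx)]
                     for dy in (0, 1) for dx in (0, 1)]
            # Average the four vectors element-wise
            merged.append([sum(v) / 4 for v in zip(*block)])
    return merged
```

Fewer vision tokens per page is one of the levers that lets a 0.5B decoder stay fast on dense documents.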

Two key training innovations distinguish GLM-OCR from conventional OCR systems:

  • Multi-Token Prediction (MTP) loss — improves training efficiency and output accuracy
  • Stable full-task reinforcement learning — boosts generalization across diverse layouts and document types
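As a rough intuition for the MTP objective, imagine extra prediction heads that each score a token further in the future, with their cross-entropy losses averaged. This is a generic textbook-style sketch of multi-token prediction, not the loss Z.ai actually used (which is unpublished pending the technical report):

```python
import math

def mtp_loss(gold_probs_per_head):
    """Generic multi-token-prediction loss sketch.

    gold_probs_per_head[k] holds, for head k (which predicts the
    (k+1)-th future token), the probability the model assigned to
    the true token at each position. Returns mean cross-entropy
    averaged over heads.
    """
    head_nlls = [sum(-math.log(p) for p in head) / len(head)
                 for head in gold_probs_per_head]
    return sum(head_nlls) / len(head_nlls)
```

Supervising several future tokens per step gives the model more learning signal per document, which is one common motivation cited for MTP-style losses.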

Rather than processing an entire page in a single pass, GLM-OCR uses a two-stage pipeline: it first runs layout analysis via PP-DocLayout-V3 to detect regions of interest, then performs OCR on those regions in parallel — enabling both higher accuracy and faster throughput.
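The two-stage flow can be sketched as follows. Here `detect_regions` and `ocr_region` are hypothetical stand-ins for PP-DocLayout-V3 and the GLM-OCR model call, not the project's real APIs:

```python
from concurrent.futures import ThreadPoolExecutor

def detect_regions(page):
    # Stand-in for PP-DocLayout-V3: return (kind, bbox) pairs
    # in reading order for the page.
    return [("text", (0, 0, 100, 40)), ("table", (0, 50, 100, 90))]

def ocr_region(page, region):
    kind, bbox = region
    # Stand-in for model inference on one cropped region.
    return f"<{kind} @ {bbox}>"

def parse_page(page):
    regions = detect_regions(page)
    with ThreadPoolExecutor() as pool:
        # OCR every detected region concurrently; map preserves
        # the layout order returned by the detector.
        return list(pool.map(lambda r: ocr_region(page, r), regions))
```

Because each region is small and independent, the OCR calls batch and parallelize well, which is where the pipeline's throughput advantage comes from.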

GLM-OCR handling diverse real-world document types including stamps, tables, and mixed-layout pages
Image credit: zai-org/GLM-OCR on GitHub

Benchmark Performance

GLM-OCR sets a new bar across major document understanding benchmarks:

  • OmniDocBench V1.5: 94.62 — ranked #1 overall
  • OCRBench: 94.0
  • UniMERNet (formula recognition): 96.5

On throughput, the model processes PDF documents at 1.86 pages per second and standalone images at 0.67 images per second — significantly faster than comparable models at this parameter scale.

GLM-OCR inference speed benchmarks compared to other OCR models
Image credit: zai-org/GLM-OCR on GitHub
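Those throughput figures translate directly into batch-time estimates. A back-of-the-envelope helper, using the rates reported above (actual speed will vary with hardware and serving backend):

```python
# Throughput figures from this article; real-world rates depend on
# hardware, batch size, and inference backend.
PDF_PAGES_PER_SEC = 1.86
IMAGES_PER_SEC = 0.67

def batch_seconds(pdf_pages: int, images: int = 0) -> float:
    """Estimated wall-clock seconds to process a mixed batch."""
    return pdf_pages / PDF_PAGES_PER_SEC + images / IMAGES_PER_SEC
```

At those rates, a 100-page PDF takes roughly 54 seconds end to end.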

Deployment and Accessibility

At 0.9B parameters, GLM-OCR is designed to run efficiently on commodity hardware. It supports multiple inference backends:

  • vLLM — recommended for production, with speculative decoding via MTP
  • SGLang — high-performance serving with speculative generation
  • Ollama — the simplest path to local deployment: ollama run glm-ocr
  • Apple Silicon (mlx-vlm) — optimized for Mac deployments
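When served through vLLM or SGLang, the model is reachable over an OpenAI-compatible chat endpoint, so a request carries the page image as a data URL. The sketch below only builds the request payload; the model name and prompt are assumptions for illustration:

```python
import base64

def build_ocr_request(image_path: str,
                      prompt: str = "Extract all text from this page.") -> dict:
    """Build an OpenAI-compatible chat payload for a vLLM/SGLang server.

    The model identifier "zai-org/GLM-OCR" and the prompt wording are
    illustrative assumptions, not taken from the official repo.
    """
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "model": "zai-org/GLM-OCR",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }
```

POSTing this payload to the server's `/v1/chat/completions` route returns the recognized text in the usual chat-completion response shape.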

Z.ai also provides a hosted cloud API at $0.03 per million tokens — uniform pricing for both input and output. The SDK wraps the full pipeline (layout detection, parallel OCR, result formatting) behind a single Python call:

from glmocr import parse

# One call runs layout detection, parallel OCR, and result formatting
result = parse("document.pdf")
result.save(output_dir="./results")
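The flat per-token rate makes cost estimation trivial. A quick sketch, using the price quoted above:

```python
# Flat rate from this article: $0.03 per million tokens, applied
# uniformly to input and output.
PRICE_PER_MILLION_TOKENS = 0.03  # USD

def api_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated hosted-API cost for one job."""
    total = input_tokens + output_tokens
    return total / 1_000_000 * PRICE_PER_MILLION_TOKENS
```

For example, a batch consuming five million tokens in total costs about $0.15.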

The model weights are available on Hugging Face and ModelScope under the MIT License, with the layout component under Apache 2.0. As of late February 2026, the model has seen over 1.45 million downloads on Hugging Face.

What This Means

GLM-OCR is a meaningful contribution to the open-source AI ecosystem for several reasons. First, it demonstrates that efficient architecture choices — MTP loss, parallel region processing, and a small but well-trained decoder — can outperform much larger models on document tasks. Second, its sub-1B footprint makes enterprise-grade OCR accessible at the edge, in mobile apps, and in high-concurrency services without the GPU costs of deploying a 7B+ VLM.

For researchers and developers working in document AI, GLM-OCR offers a compelling baseline: faster inference than Tesseract-class tools, better structured-output quality than general VLMs, and the flexibility of full open weights. The upcoming technical report from Z.ai is expected to detail the training data and methodology behind these results.
