GLM-4.7-Flash: Z.ai’s Efficient 30B MoE Model for Coding and Agents

Z.ai has released GLM-4.7-Flash, a 30B-A3B Mixture-of-Experts (MoE) language model that punches well above its weight class in coding, agentic reasoning, and multi-step tool use — while remaining efficient enough to run on a single consumer GPU. Released on January 20, 2026, the model is fully open-source under the MIT license and available for free via the Z.ai API platform.

[Illustration generated by AI: GLM-4.7-Flash neural network architecture visualization]

What Is GLM-4.7-Flash?

GLM-4.7-Flash is the latest entry in Z.ai’s (formerly Zhipu AI) GLM series of large language models. Despite its “30B” label, the model uses a Mixture-of-Experts architecture with 31.2 billion total parameters but only approximately 3 billion active parameters during any given inference pass. This design — sometimes written as “30B-A3B” — means the model routes each token through a small subset of specialized expert sub-networks, achieving dense-model quality with a fraction of the compute.
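The routing idea can be sketched in a few lines of NumPy. This is a toy illustration of top-k expert gating, not GLM's actual implementation; the dimensions, expert count, and single-matrix "experts" are arbitrary assumptions for clarity.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Toy Mixture-of-Experts layer: route each token to its top-k experts.

    x:       (tokens, d) input activations
    gate_w:  (d, n_experts) router weights
    experts: list of (d, d) weight matrices, one per expert (simplified)
    """
    logits = x @ gate_w                            # router score per expert
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, top[t]]
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                   # softmax over selected experts only
        for w, e in zip(weights, top[t]):
            out[t] += w * (x[t] @ experts[e])      # only k experts do any compute
    return out

# 8 experts but only 2 active per token: roughly a quarter of the dense FLOPs
rng = np.random.default_rng(0)
d, n_experts = 16, 8
x = rng.standard_normal((4, d))
gate_w = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
y = moe_forward(x, gate_w, experts)
print(y.shape)  # (4, 16)
```

The same principle, scaled up, is how a 31.2B-parameter model can run inference at roughly 3B-parameter cost.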

The model supports a context window of up to 128,000 tokens, making it practical for large codebases, multi-file repositories, and extended conversations. It also supports function calling with auto-tool-choice and a dedicated “thinking” mode for multi-turn reasoning tasks.
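To make the function-calling support concrete, here is a sketch of a request body in the widely used OpenAI-style tools format, which auto-tool-choice servers generally accept. The model identifier and the `run_tests` tool are hypothetical; consult the official model card or API docs for exact names.

```python
import json

# Hypothetical tool definition in the common OpenAI-style function schema.
tool = {
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool for illustration
        "description": "Run the project's test suite and return the results.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Test file or directory"},
            },
            "required": ["path"],
        },
    },
}

request_body = {
    "model": "glm-4.7-flash",  # assumed model identifier
    "messages": [{"role": "user", "content": "Fix the failing test in utils/"}],
    "tools": [tool],
    "tool_choice": "auto",     # let the model decide when to call the tool
}

print(json.dumps(request_body, indent=2))
```

With `tool_choice` set to `auto`, the model either answers directly or emits a structured tool call, which the calling application executes and feeds back as a tool message.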

Benchmark Results

GLM-4.7-Flash posts strong numbers across a range of coding, reasoning, and agentic benchmarks, frequently outperforming models with similar or larger active parameter counts:

Benchmark            GLM-4.7-Flash   Qwen3-30B-A3B   GPT-OSS-20B
AIME 25                   91.6            85.0           91.7
GPQA                      75.2            73.4           71.5
LCB v6                    64.0            66.0           61.0
SWE-bench Verified        59.2            22.0           34.0
τ²-Bench                  79.5            49.0           47.7
BrowseComp                42.8             2.29          28.3

The SWE-bench Verified score of 59.2% is particularly notable — it measures the model’s ability to autonomously resolve real GitHub issues, and GLM-4.7-Flash more than doubles the score of Qwen3-30B-A3B (22.0%) on this task. The τ²-Bench result of 79.5% (versus 49.0% for Qwen3-30B-A3B) similarly highlights strong agentic capabilities.

Deployment and Accessibility

GLM-4.7-Flash is designed to be practically deployable without specialized hardware. The model can run on a single 24GB GPU (such as an RTX 3090 or RTX 4090) or on Apple Silicon Mac systems, achieving speeds of 60–80 tokens per second under typical conditions.
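A back-of-the-envelope calculation shows why the 24GB figure implies quantization. This is simple arithmetic over the stated 31.2B total parameter count, not a measured footprint, and it ignores KV cache and activation overhead.

```python
def model_memory_gb(params_b, bits_per_param):
    """Approximate weight memory in GB for a parameter count (billions) and precision."""
    return params_b * 1e9 * bits_per_param / 8 / 1e9

total_params_b = 31.2  # GLM-4.7-Flash total parameters, in billions

bf16 = model_memory_gb(total_params_b, 16)  # full bfloat16 weights
q4   = model_memory_gb(total_params_b, 4)   # 4-bit quantized weights

# bf16 weights (~62 GB) exceed a 24GB card; 4-bit weights (~16 GB) fit,
# leaving headroom for KV cache and activations.
print(f"bf16: {bf16:.1f} GB, 4-bit: {q4:.1f} GB")
```

The small active parameter count (~3B) is what keeps per-token compute, and thus tokens-per-second, high even on consumer hardware.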

For developers, the model is available through multiple deployment paths:

  • Hugging Face Transformers: Load with AutoModelForCausalLM in bfloat16
  • vLLM: Supports tensor parallelism and speculative decoding with the MTP method
  • SGLang: Supported with EAGLE speculative decoding for higher throughput
  • Z.ai API: Free API access with no credit card required
  • LM Studio: Available as a quantized model for desktop use

More than 68 quantized variants of the model are available on Hugging Face, and it has been integrated into 53 Spaces, reflecting rapid community adoption since its release.

What This Means for the AI Community

GLM-4.7-Flash represents a maturing of the MoE approach for smaller, more accessible models. The gap between its SWE-bench score (59.2%) and that of comparable open-source competitors is striking and suggests that Z.ai’s training recipe — described in the accompanying paper “GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models” (arXiv 2508.06471) — has made meaningful advances in agentic task performance.

For researchers and developers at institutions like NYU Shanghai, the model’s ability to run locally on consumer hardware lowers the barrier to building agentic coding assistants, automated research tools, and complex multi-step pipelines without relying on cloud inference costs. The MIT license also means it can be freely used, modified, and deployed in academic projects.

With over 1.7 million downloads in its first month, GLM-4.7-Flash has found rapid uptake in the open-source community and positions Z.ai as a serious contender in the competitive landscape of efficient frontier models.
