GLM-4.7-Flash: Z.ai’s Efficient 30B MoE Model for Coding and Agents

Z.ai has released GLM-4.7-Flash, a 30B-A3B Mixture-of-Experts (MoE) language model that punches well above its weight class in coding, agentic reasoning, and multi-step tool use — while remaining efficient enough to run on a single consumer GPU. Released on January 20, 2026, the model is fully open-source under the MIT license and available for free via the Z.ai API platform.

[Illustration generated by AI: GLM-4.7-Flash neural network architecture visualization]

What Is GLM-4.7-Flash?

GLM-4.7-Flash is the latest entry in Z.ai’s (formerly Zhipu AI) GLM series of large language models. Despite its “30B” label, the model uses a Mixture-of-Experts architecture with 31.2 billion total parameters but only approximately 3 billion active parameters during any given inference pass. This design — sometimes written as “30B-A3B” — means the model routes each token through a small subset of specialized expert sub-networks, achieving dense-model quality with a fraction of the compute.
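The routing idea can be sketched in a few lines of NumPy. This is a toy illustration of top-k expert gating, not GLM's actual implementation; the dimensions, expert count, and single-matrix "experts" are arbitrary assumptions for clarity.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Toy Mixture-of-Experts layer: route each token to its top-k experts.

    x:       (tokens, d) input activations
    gate_w:  (d, n_experts) router weights
    experts: list of (d, d) weight matrices, one per expert (simplified)
    """
    logits = x @ gate_w                            # router score per expert
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, top[t]]
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                   # softmax over selected experts only
        for w, e in zip(weights, top[t]):
            out[t] += w * (x[t] @ experts[e])      # only k experts do any compute
    return out

# 8 experts but only 2 active per token: roughly a quarter of the dense FLOPs
rng = np.random.default_rng(0)
d, n_experts = 16, 8
x = rng.standard_normal((4, d))
gate_w = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
y = moe_forward(x, gate_w, experts)
print(y.shape)  # (4, 16)
```

The same principle, scaled up, is how a 31.2B-parameter model can run inference at roughly 3B-parameter cost.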

The model supports a context window of up to 128,000 tokens, making it practical for large codebases, multi-file repositories, and extended conversations. It also supports function calling with auto-tool-choice and a dedicated “thinking” mode for multi-turn reasoning tasks.
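To make the function-calling support concrete, here is a sketch of a request body in the widely used OpenAI-style tools format, which auto-tool-choice servers generally accept. The model identifier and the `run_tests` tool are hypothetical; consult the official model card or API docs for exact names.

```python
import json

# Hypothetical tool definition in the common OpenAI-style function schema.
tool = {
    "type": "function",
    "function": {
        "name": "run_tests",  # hypothetical tool for illustration
        "description": "Run the project's test suite and return the results.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Test file or directory"},
            },
            "required": ["path"],
        },
    },
}

request_body = {
    "model": "glm-4.7-flash",  # assumed model identifier
    "messages": [{"role": "user", "content": "Fix the failing test in utils/"}],
    "tools": [tool],
    "tool_choice": "auto",     # let the model decide when to call the tool
}

print(json.dumps(request_body, indent=2))
```

With `tool_choice` set to `auto`, the model either answers directly or emits a structured tool call, which the calling application executes and feeds back as a tool message.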

Benchmark Results

GLM-4.7-Flash posts strong numbers across a range of coding, reasoning, and agentic benchmarks, frequently outperforming models with similar or larger active parameter counts:

Benchmark            GLM-4.7-Flash   Qwen3-30B-A3B   GPT-OSS-20B
AIME 25                   91.6            85.0           91.7
GPQA                      75.2            73.4           71.5
LCB v6                    64.0            66.0           61.0
SWE-bench Verified        59.2            22.0           34.0
τ²-Bench                  79.5            49.0           47.7
BrowseComp                42.8             2.29          28.3

The SWE-bench Verified score of 59.2% is particularly notable — it measures the model’s ability to autonomously resolve real GitHub issues, and GLM-4.7-Flash more than doubles the score of Qwen3-30B-A3B (22.0%) on this task. The τ²-Bench result of 79.5% (versus 49.0% for Qwen3-30B-A3B) similarly highlights strong agentic capabilities.

Deployment and Accessibility

GLM-4.7-Flash is designed to be practically deployable without specialized hardware. The model can run on a single 24GB GPU (such as an RTX 3090 or RTX 4090) or on Apple Silicon Mac systems, achieving speeds of 60–80 tokens per second under typical conditions.
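A back-of-the-envelope calculation shows why the 24GB figure implies quantization. This is simple arithmetic over the stated 31.2B total parameter count, not a measured footprint, and it ignores KV cache and activation overhead.

```python
def model_memory_gb(params_b, bits_per_param):
    """Approximate weight memory in GB for a parameter count (billions) and precision."""
    return params_b * 1e9 * bits_per_param / 8 / 1e9

total_params_b = 31.2  # GLM-4.7-Flash total parameters, in billions

bf16 = model_memory_gb(total_params_b, 16)  # full bfloat16 weights
q4   = model_memory_gb(total_params_b, 4)   # 4-bit quantized weights

# bf16 weights (~62 GB) exceed a 24GB card; 4-bit weights (~16 GB) fit,
# leaving headroom for KV cache and activations.
print(f"bf16: {bf16:.1f} GB, 4-bit: {q4:.1f} GB")
```

The small active parameter count (~3B) is what keeps per-token compute, and thus tokens-per-second, high even on consumer hardware.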

For developers, the model is available through multiple deployment paths:

  • Hugging Face Transformers: Load with AutoModelForCausalLM in bfloat16
  • vLLM: Supports tensor parallelism and speculative decoding with the MTP method
  • SGLang: Supported with EAGLE speculative decoding for higher throughput
  • Z.ai API: Free API access with no credit card required
  • LM Studio: Available as a quantized model for desktop use

More than 68 quantized variants of the model are available on Hugging Face, and it has been integrated into 53 Spaces, reflecting rapid community adoption since its release.

What This Means for the AI Community

GLM-4.7-Flash represents a maturing of the MoE approach for smaller, more accessible models. The gap between its SWE-bench score (59.2%) and that of comparable open-source competitors is striking and suggests that Z.ai’s training recipe — described in the accompanying paper “GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models” (arXiv 2508.06471) — has made meaningful advances in agentic task performance.

For researchers and developers at institutions like NYU Shanghai, the model’s ability to run locally on consumer hardware lowers the barrier to building agentic coding assistants, automated research tools, and complex multi-step pipelines without relying on cloud inference costs. The MIT license also means it can be freely used, modified, and deployed in academic projects.

With over 1.7 million downloads in its first month, GLM-4.7-Flash has found rapid uptake in the open-source community and positions Z.ai as a serious contender in the competitive landscape of efficient frontier models.
