Qwen3‑235B‑A22B‑Instruct‑2507: Alibaba’s Updated 235B-Parameter Instruction-Tuned LLM

Alibaba’s Qwen team has just released Qwen3‑235B‑A22B‑Instruct‑2507, an enhanced instruction-tuned large language model with 235B total parameters (22B active), designed for chat, reasoning, coding, and long-context understanding.

🚀 Key Enhancements

  • Instruction following & reasoning: Broad gains in general capabilities, with arithmetic, logic, coding, and tool-use performance surpassing the previous non-thinking version. (Hugging Face)
  • Long-tail knowledge: Expanded coverage across multiple languages. (Hugging Face)
  • Better alignment: Subjective and open-ended tasks now yield more helpful and higher-quality outputs. (Hugging Face)
  • Extended context: Natively supports up to 256K tokens, well suited to processing or generating very long documents (a quick length check is sketched after this list). (GitHub)
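
Because the 256K window is a shared budget for prompt and output, it can help to count tokens before sending a very long input. A minimal sketch, assuming the Hugging Face tokenizer for this checkpoint and a placeholder file path (long_report.txt):

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B-Instruct-2507")
MAX_CONTEXT = 262_144  # native window, per the specs below
text = open("long_report.txt", encoding="utf-8").read()  # placeholder document
n_tokens = len(tokenizer(text)["input_ids"])
print(f"{n_tokens} tokens; fits natively: {n_tokens <= MAX_CONTEXT}")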

📊 Technical Specs

  • Architecture: Mixture-of-Experts (MoE), causal LLM
  • Parameters: 235B total, 22B activated per token
  • Layers & heads: 94 layers; 64 query heads, 4 key/value heads (GQA)
  • Experts: 128 experts, 8 active per token
  • Context length: 262,144 tokens (≈256K)
  • Inference mode: Non-thinking only (no <think>...</think> blocks) (Hugging Face)
  • License: Apache 2.0 (open source) (Hugging Face, GitHub)
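
Most of these numbers can be read straight off the published configuration. A minimal sketch, assuming the field names used by the Qwen MoE configs on Hugging Face (num_experts, num_experts_per_tok, and so on), which may vary between transformers versions:

from transformers import AutoConfig
cfg = AutoConfig.from_pretrained("Qwen/Qwen3-235B-A22B-Instruct-2507")
# Field names are assumed from the Qwen MoE config family; expected values in comments.
print("layers:          ", cfg.num_hidden_layers)                      # 94
print("query heads:     ", cfg.num_attention_heads)                    # 64
print("key/value heads: ", cfg.num_key_value_heads)                    # 4
print("experts:         ", getattr(cfg, "num_experts", None))          # 128
print("active per token:", getattr(cfg, "num_experts_per_tok", None))  # 8
print("max positions:   ", cfg.max_position_embeddings)                # 262144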

🎯 Benchmark Performance Highlights

On industry-standard benchmarks, Qwen3‑235B‑A22B‑Instruct‑2507 shows strong results:

  • Knowledge Tasks: MMLU‑Pro score 83.0% vs GPT‑4o at 81.1%; SuperGPQA 62.6% (vs GPT‑4o’s 57.2%). (Hugging Face)
  • Reasoning: AIME25 70.3% (GPT‑4o: 49.5%), HMMT25 55.4% (vs 38.8%), ZebraLogic 95.0% (vs 89.0%). (Hugging Face)
  • Coding: MultiPL‑E 87.9% (GPT‑4o: 85.7%), LiveCodeBench 51.8% (vs 48.9%). (Hugging Face)
  • Creativity & Alignment: WritingBench 85.2% (vs GPT‑4o’s 86.2%), Creative Writing v3 at 87.5%. (Hugging Face)

Overall, the model is competitive with leading open-source and closed-source systems on these benchmarks. (arXiv)

🧰 Quickstart & Deployment Options

from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "Qwen/Qwen3-235B-A22B-Instruct-2507"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
# Build a chat prompt and generate up to 16K new tokens (the recommended output budget)
messages = [{"role": "user", "content": "Give a short introduction to mixture-of-experts models."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=16384)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))

Accelerated inference is supported via SGLang, vLLM, Ollama, llama.cpp, LMStudio, and other runtimes. (Hugging Face, GitHub)

An FP8-quantized checkpoint is also available for improved speed and memory efficiency. (Hugging Face)
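
For a serving-style setup, here is a minimal vLLM sketch using the offline LLM API together with the sampling settings recommended in the next section; tensor_parallel_size=8 and max_model_len are assumptions to adjust for your hardware:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "Qwen/Qwen3-235B-A22B-Instruct-2507"  # the FP8 variant has its own repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=8, max_model_len=262144)  # adjust to hardware
params = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=16384)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize the Qwen3 MoE architecture in three sentences."}],
    tokenize=False, add_generation_prompt=True,
)
print(llm.generate([prompt], params)[0].outputs[0].text)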

🧭 Best Practices & Usage Tips

  • Sampling recommendations: temperature 0.7, top_p 0.8, top_k 20, presence_penalty between 0 and 2 to curb repetition (applied in the sketch after this list)
  • Output & context length: an output budget of around 16K tokens covers most tasks; reserve the full 256K context for genuinely long inputs
  • Mode: the model runs exclusively in non-thinking mode and never emits <think> blocks, so there is no need to disable thinking via the API (Hugging Face, Reddit, GitHub)
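
Putting the sampling tips together, a minimal client sketch, assuming an OpenAI-compatible endpoint (as exposed by vLLM or SGLang) at a placeholder local address; top_k is not part of the standard OpenAI schema, so it is passed through extra_body, which vLLM accepts:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder local server
resp = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507",
    messages=[{"role": "user", "content": "Draft a short release note for this model."}],
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.0,      # anywhere in the recommended 0-2 range
    max_tokens=16384,
    extra_body={"top_k": 20},  # non-standard parameter; accepted by vLLM's server
)
print(resp.choices[0].message.content)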

🧑‍💻 Community Feedback

From r/LocalLLaMA:

“I’ve been kinda disappointed in Qwen3‑235’s non‑thinking quality… now, an inherent non‑thinking, improved Qwen3‑235B? It feels like a dream come true.” (Reddit)

Users appreciate performance gains and native non-thinking behavior, though some remain skeptical about real-world advantages over closed-source models. (Hugging Face)