Qwen3‑235B‑A22B‑Instruct‑2507: Alibaba’s Updated 235B-Parameter Instruction-Tuned LLM
Alibaba’s Qwen team has just released Qwen3‑235B‑A22B‑Instruct‑2507, an enhanced instruction-tuned large language model with 235B total parameters (22B active), designed for chat, reasoning, coding, and long-context understanding.
🚀 Key Enhancements
- Instruction following & reasoning: Major improvements across general capabilities—arithmetic, logic, coding, and tool-using performance have been measured to surpass the previous non-thinking version. (Hugging Face)
- Long-tail knowledge: Expanded coverage across multiple languages. (Hugging Face)
- Better alignment: Subjective and open-ended tasks now yield more helpful and higher-quality outputs. (Hugging Face)
- Extended context: Supports up to 256 K tokens natively—ideal for processing or generating very long documents or context. (GitHub)
📊 Technical Specs
| Specification | Details |
|---|---|
| Architecture | Mixture-of-Experts (MoE), causal LLM |
| Parameter Size | 235 B total, 22 B activated |
| Layers & Heads | 94 layers; 64 Q-heads, 4 K/V |
| Experts | 128 experts, 8 active per token |
| Context Length | 262,144 tokens (≈256K) |
| Inference Mode | Non-thinking (no <think>...</think>) (Hugging Face) |
| License | Apache 2.0 (open-source) (Hugging Face, GitHub) |
🎯 Benchmark Performance Highlights
On industry-standard benchmarks, Qwen3‑235B‑A22B‑Instruct‑2507 shows strong results:
- Knowledge Tasks: MMLU‑Pro score 83.0% vs GPT‑4o at 81.1%; SuperGPQA 62.6% (vs GPT‑4o’s 57.2%). (Hugging Face)
- Reasoning: AIME25 70.3% (GPT‑4o: 49.5%), HMMT25 55.4% (vs 38.8%), Zebralogic 95.0% (vs 89.0%). (Hugging Face)
- Coding: MultiPL‑E 87.9% (GPoT‑4o: 85.7%), LiveCodeBench 51.8% (vs 48.9%). (Hugging Face)
- Creativity & Alignment: WritingBench 85.2% (vs GPT‑4o’s 86.2%), Creative Writing v3 at 87.5%. (Hugging Face)
The model benchmarks very competitively with top open-source and closed-source systems. (arXiv)
🧰 Quickstart & Deployment Options
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "Qwen/Qwen3-235B-A22B-Instruct-2507"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
# Generate with up to 16 K tokens
Supports accelerated inference via SGLang, vLLM, vLLM, Ollama, llama.cpp, LMStudio, and more. (Hugging Face, GitHub)
Also available in FP8 quantized format for improved speed and memory efficiency. (Hugging Face)
🧭 Best Practices & Usage Tips
- Sampling recommendations: temperature 0.7, top_p 0.8, top_k 20, presence_penalty 0–2
- Context length: Use up to 16k for most tasks; full 256k only when needed
- Mode: Model is permanently in non-thinking mode—no need to disable thinking via API (Hugging Face, Reddit, GitHub)
🧑💻 Community Feedback
From r/LocalLLaMA:
“I’ve been kinda disappointed in Qwen3‑235’s non‑thinking quality… now, an inherent non‑thinking, improved Qwen3‑235B? It feels like a dream come true.” (Reddit)
Users appreciate performance gains and native non-thinking behavior, though some remain skeptical about real-world advantages over closed-source models. (Hugging Face)


沪公网安备31011502017015号