Qwen3‑235B‑A22B‑Instruct‑2507: Alibaba’s Updated 235B-Parameter Instruction-Tuned LLM

Alibaba’s Qwen team has just released Qwen3‑235B‑A22B‑Instruct‑2507, an enhanced instruction-tuned large language model with 235B total parameters (22B active), designed for chat, reasoning, coding, and long-context understanding.

🚀 Key Enhancements

  • Instruction following & reasoning: Broad gains in general capabilities, with arithmetic, logic, coding, and tool-use performance surpassing the previous non-thinking version. (Hugging Face)
  • Long-tail knowledge: Expanded coverage across multiple languages. (Hugging Face)
  • Better alignment: Subjective and open-ended tasks now yield more helpful and higher-quality outputs. (Hugging Face)
  • Extended context: Natively supports up to 256K tokens, well suited to processing or generating very long documents (a quick length check is sketched after this list). (GitHub)
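
Because the 256K window is a shared budget for prompt and output, it can help to count tokens before sending a very long input. A minimal sketch, assuming the Hugging Face tokenizer for this checkpoint and a placeholder file path (long_report.txt):

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B-Instruct-2507")
MAX_CONTEXT = 262_144  # native window, per the specs below
text = open("long_report.txt", encoding="utf-8").read()  # placeholder document
n_tokens = len(tokenizer(text)["input_ids"])
print(f"{n_tokens} tokens; fits natively: {n_tokens <= MAX_CONTEXT}")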

📊 Technical Specs

  • Architecture: Mixture-of-Experts (MoE), causal LLM
  • Parameters: 235B total, 22B activated per token
  • Layers & heads: 94 layers; 64 query heads, 4 key/value heads (GQA)
  • Experts: 128 experts, 8 active per token
  • Context length: 262,144 tokens (≈256K)
  • Inference mode: Non-thinking only (no <think>...</think> blocks) (Hugging Face)
  • License: Apache 2.0 (open source) (Hugging Face, GitHub)
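
Most of these numbers can be read straight off the published configuration. A minimal sketch, assuming the field names used by the Qwen MoE configs on Hugging Face (num_experts, num_experts_per_tok, and so on), which may vary between transformers versions:

from transformers import AutoConfig
cfg = AutoConfig.from_pretrained("Qwen/Qwen3-235B-A22B-Instruct-2507")
# Field names are assumed from the Qwen MoE config family; expected values in comments.
print("layers:          ", cfg.num_hidden_layers)                      # 94
print("query heads:     ", cfg.num_attention_heads)                    # 64
print("key/value heads: ", cfg.num_key_value_heads)                    # 4
print("experts:         ", getattr(cfg, "num_experts", None))          # 128
print("active per token:", getattr(cfg, "num_experts_per_tok", None))  # 8
print("max positions:   ", cfg.max_position_embeddings)                # 262144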

🎯 Benchmark Performance Highlights

On industry-standard benchmarks, Qwen3‑235B‑A22B‑Instruct‑2507 shows strong results:

  • Knowledge Tasks: MMLU‑Pro score 83.0% vs GPT‑4o at 81.1%; SuperGPQA 62.6% (vs GPT‑4o’s 57.2%). (Hugging Face)
  • Reasoning: AIME25 70.3% (GPT‑4o: 49.5%), HMMT25 55.4% (vs 38.8%), ZebraLogic 95.0% (vs 89.0%). (Hugging Face)
  • Coding: MultiPL‑E 87.9% (GPT‑4o: 85.7%), LiveCodeBench 51.8% (vs 48.9%). (Hugging Face)
  • Creativity & Alignment: WritingBench 85.2% (vs GPT‑4o’s 86.2%), Creative Writing v3 at 87.5%. (Hugging Face)

Overall, the model is competitive with leading open-source and closed-source systems on these benchmarks. (arXiv)

🧰 Quickstart & Deployment Options

from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = "Qwen/Qwen3-235B-A22B-Instruct-2507"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
# Build a chat prompt and generate up to 16K new tokens (the recommended output budget)
messages = [{"role": "user", "content": "Give a short introduction to mixture-of-experts models."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=16384)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))

Accelerated inference is supported via SGLang, vLLM, Ollama, llama.cpp, LMStudio, and other runtimes. (Hugging Face, GitHub)

An FP8-quantized checkpoint is also available for improved speed and memory efficiency. (Hugging Face)
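
For a serving-style setup, here is a minimal vLLM sketch using the offline LLM API together with the sampling settings recommended in the next section; tensor_parallel_size=8 and max_model_len are assumptions to adjust for your hardware:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "Qwen/Qwen3-235B-A22B-Instruct-2507"  # the FP8 variant has its own repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=8, max_model_len=262144)  # adjust to hardware
params = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=16384)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize the Qwen3 MoE architecture in three sentences."}],
    tokenize=False, add_generation_prompt=True,
)
print(llm.generate([prompt], params)[0].outputs[0].text)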

🧭 Best Practices & Usage Tips

  • Sampling recommendations: temperature 0.7, top_p 0.8, top_k 20, presence_penalty between 0 and 2 to curb repetition (applied in the sketch after this list)
  • Output & context length: an output budget of around 16K tokens covers most tasks; reserve the full 256K context for genuinely long inputs
  • Mode: the model runs exclusively in non-thinking mode and never emits <think> blocks, so there is no need to disable thinking via the API (Hugging Face, Reddit, GitHub)
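
Putting the sampling tips together, a minimal client sketch, assuming an OpenAI-compatible endpoint (as exposed by vLLM or SGLang) at a placeholder local address; top_k is not part of the standard OpenAI schema, so it is passed through extra_body, which vLLM accepts:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder local server
resp = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507",
    messages=[{"role": "user", "content": "Draft a short release note for this model."}],
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.0,      # anywhere in the recommended 0-2 range
    max_tokens=16384,
    extra_body={"top_k": 20},  # non-standard parameter; accepted by vLLM's server
)
print(resp.choices[0].message.content)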

🧑‍💻 Community Feedback

From r/LocalLLaMA:

“I’ve been kinda disappointed in Qwen3‑235’s non‑thinking quality… now, an inherent non‑thinking, improved Qwen3‑235B? It feels like a dream come true.” (Reddit)

Users appreciate performance gains and native non-thinking behavior, though some remain skeptical about real-world advantages over closed-source models. (Hugging Face)