DFlash: Block Diffusion Delivers 6x Faster LLM Inference

DFlash is a new speculative decoding framework that uses block diffusion models to generate draft tokens in parallel rather than sequentially, achieving over 6× lossless acceleration on large language models, up to 2.5× faster than the previous state-of-the-art method, EAGLE-3. The paper was published in February 2026 by Jian Chen, Yesheng Liang, and Zhijian Liu, and the project gained significant traction in the open-source community after a viral demo on April 7.


Figure: visualization of parallel token generation using block diffusion for speculative decoding (illustration generated by AI).

The Problem with Sequential Drafting

Autoregressive LLMs generate tokens one at a time, leading to high inference latency and poor GPU utilization. Speculative decoding addresses this by having a smaller “draft” model propose multiple tokens that the larger “target” model verifies in a single forward pass. However, most draft models are themselves autoregressive — they still generate tokens sequentially, creating a bottleneck that limits the overall speedup.
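The draft-then-verify loop above can be sketched in a few lines. This is a toy, greedy-acceptance version of generic speculative decoding (not DFlash's implementation): the callables `draft_next` and `target_next` are stand-ins for the small draft model and large target model, each mapping a token sequence to its next token.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of speculative decoding (greedy-acceptance toy sketch)."""
    # 1. Draft model proposes k tokens *sequentially* -- this is the
    #    bottleneck that DFlash removes by drafting a whole block at once.
    draft, seq = [], list(prefix)
    for _ in range(k):
        t = draft_next(seq)
        seq.append(t)
        draft.append(t)

    # 2. Target model checks all k positions (conceptually one forward
    #    pass) and accepts the longest matching prefix of the draft.
    accepted, seq = [], list(prefix)
    for t in draft:
        if target_next(seq) != t:
            break
        accepted.append(t)
        seq.append(t)

    # 3. The target always contributes one token of its own, so each
    #    round makes at least one token of progress.
    accepted.append(target_next(seq))
    return accepted
```

With toy integer "models" that agree for a while and then diverge, one round commits every matching draft token plus the target's bonus token.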

How DFlash Works

DFlash replaces the autoregressive drafter with a lightweight block diffusion model that generates an entire block of draft tokens in a single forward pass. The key innovation is conditioning the diffusion draft model on context features extracted from the target model, which yields high-quality outputs and higher acceptance rates during verification.
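The idea can be illustrated with a minimal masked-denoising sketch. Everything here is hypothetical (the function names, the denoiser, and the two-step schedule are illustrative assumptions, not the DFlash architecture): the drafter starts from a fully masked block and fills every position in parallel over a fixed number of denoising passes, conditioned on context features from the target model.

```python
MASK = -1  # sentinel for a not-yet-decided draft position

def diffusion_draft(context_features, block_len=8, steps=2, denoise=None):
    """Draft a whole block in `steps` parallel denoising passes,
    instead of `block_len` sequential autoregressive passes."""
    block = [MASK] * block_len
    for _ in range(steps):
        # One forward pass predicts every masked position at once.
        block = denoise(context_features, block)
    return block

def toy_denoiser(ctx, block):
    # Stand-in denoiser: fill each masked slot from the context.
    # A real drafter would be a small transformer conditioned on
    # features extracted from the target model.
    base = ctx[-1]
    return [base + i + 1 if t == MASK else t for i, t in enumerate(block)]
```

The point of the sketch is the cost structure: the loop runs `steps` times no matter how long the block is, which is what makes long drafts cheap.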

Because the drafting cost remains relatively flat regardless of block length, DFlash transforms speculative decoding from an optimization trick into a scalable serving architecture. The diffusion drafter can produce 8, 16, or even more tokens simultaneously without the linear cost scaling of sequential generation.
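The scaling claim reduces to a simple cost model (the numbers here are assumed for illustration, not measured): an autoregressive drafter needs one forward pass per draft token, while a block diffusion drafter needs a small constant number of denoising passes regardless of block length.

```python
def drafter_passes(block_len, mode, denoise_steps=2):
    """Forward passes needed to draft one block (illustrative cost model)."""
    if mode == "autoregressive":
        return block_len       # one pass per token: cost grows linearly
    if mode == "block_diffusion":
        return denoise_steps   # flat: independent of block length
    raise ValueError(mode)
```

Under this model, drafting a 16-token block costs 16 passes autoregressively but only a couple of passes with block diffusion, so widening the block is nearly free for the diffusion drafter.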

Performance Results

  • 6× lossless acceleration across a range of models and tasks
  • 2.5× faster than EAGLE-3, the previous state-of-the-art speculative decoding method
  • Community benchmarks show Qwen3.5 27B running at ~65 tokens/second with DFlash speculation on dual RTX 3090s
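Why longer blocks translate into these speedups can be seen from the standard speculative decoding model (a textbook formula, not from the DFlash paper): if each draft token is accepted independently with probability `alpha`, a draft of length `k` commits an expected `(1 - alpha**(k+1)) / (1 - alpha)` tokens per target forward pass.

```python
def expected_tokens_per_verify(alpha, k):
    """Expected tokens committed per target forward pass under the
    i.i.d.-acceptance model: (1 - alpha**(k+1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)
```

For example, at an 80% acceptance rate an 8-token draft yields roughly 4.3 tokens per verification step; a drafter that can propose long blocks cheaply gets to exploit the large-`k` end of this curve.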

The authors plan to open-source the training recipe so users can train their own DFlash draft models to accelerate any LLM. Integration with serving frameworks like SGLang and vLLM is already underway, and discussions are active in the llama.cpp community.

What This Means

DFlash could fundamentally change how LLMs are served. By eliminating the sequential drafting bottleneck, it makes speculative decoding viable for production workloads that previously couldn’t justify the complexity. For local LLM enthusiasts, the 65 t/s result on consumer hardware with a 27B model is particularly exciting — it puts real-time interactive use within reach for models that previously felt sluggish.
