DFlash is a new speculative decoding framework that uses block diffusion models to generate draft tokens in parallel rather than sequentially, achieving over 6× lossless acceleration on large language models — up to 2.5× faster than the previous state-of-the-art method EAGLE-3. The paper was published in February 2026 by Jian Chen, Yesheng Liang, and Zhijian Liu, and has gained significant traction in the open-source community after a viral demo on April 7.
Autoregressive LLMs generate tokens one at a time, leading to high inference latency and poor GPU utilization. Speculative decoding addresses this by having a smaller “draft” model propose multiple tokens that the larger “target” model verifies in a single forward pass. However, most draft models are themselves autoregressive — they still generate tokens sequentially, creating a bottleneck that limits the overall speedup.
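The draft-then-verify loop can be sketched in a few lines. This is a hypothetical toy illustration of greedy speculative verification, not DFlash's implementation: the target accepts the longest draft prefix that matches its own greedy choices, then contributes one token of its own at the first disagreement (or a bonus token if every draft token is accepted).

```python
# Toy sketch of speculative decoding's verification step (hypothetical
# toy models, not the DFlash code). A real target model scores all draft
# positions in a single forward pass; the loop below simulates that.

def verify(draft_tokens, target_next_token, context):
    """Keep draft tokens while the target's greedy choice agrees.

    target_next_token(context) returns the target model's greedy next
    token given the tokens generated so far.
    """
    accepted = []
    for tok in draft_tokens:
        expected = target_next_token(context + accepted)
        if tok == expected:
            accepted.append(tok)        # draft agrees: free progress
        else:
            accepted.append(expected)   # first mismatch: take the target's token
            break
    else:
        # every draft token was accepted; append the target's bonus token
        accepted.append(target_next_token(context + accepted))
    return accepted

# Toy target model that always continues the sequence 0, 1, 2, ...
target = lambda ctx: len(ctx)

print(verify([0, 1, 9], target, []))  # accepts 0, 1; corrects 9 to 2
```

Note that even in the failure case the loop makes progress: each verification pass yields at least one token the target itself would have produced, which is why the method is lossless.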
DFlash replaces the autoregressive drafter with a lightweight block diffusion model that generates an entire block of draft tokens in a single forward pass. The key innovation is conditioning the diffusion draft model on context features extracted from the target model, which yields high-quality outputs and higher acceptance rates during verification.
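As a rough sketch of how a block diffusion drafter differs from a sequential one (this is an illustrative toy, with an invented `predict` interface, and does not reflect DFlash's actual architecture): the drafter starts from a fully masked block and, over a small fixed number of denoising steps, commits the positions it is most confident about, filling all of them in parallel rather than left to right.

```python
# Hypothetical toy sketch of block-diffusion drafting. `predict` stands in
# for a drafter network that, conditioned on target-model features, scores
# every position of the block at once; its interface is an assumption made
# for illustration.

MASK = -1

def draft_block(block_len, predict, steps=2):
    """predict(block) -> list of (token, confidence), one per position.

    Each denoising step commits the most confident still-masked
    positions, so the whole block takes `steps` parallel passes instead
    of `block_len` sequential ones.
    """
    block = [MASK] * block_len
    per_step = -(-block_len // steps)  # ceil(block_len / steps)
    for _ in range(steps):
        preds = predict(block)
        masked = [i for i in range(block_len) if block[i] == MASK]
        # unmask the highest-confidence masked positions this step
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = preds[i][0]
    return block

# Toy drafter that proposes the even numbers with uniform confidence.
toy_predict = lambda block: [(2 * i, 1.0) for i in range(len(block))]
print(draft_block(8, toy_predict, steps=2))  # 8 tokens in 2 passes
```

The conditioning on target-model context features, which the paper credits for the high acceptance rates, would enter through `predict` in a real system.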
Because the drafting cost remains relatively flat regardless of block length, DFlash transforms speculative decoding from an optimization trick into a scalable serving architecture. The diffusion drafter can produce 8, 16, or even more tokens simultaneously without the linear cost scaling of sequential generation.
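A back-of-the-envelope cost model makes the scaling difference concrete. The numbers below are illustrative assumptions, not measurements from the paper: a sequential drafter spends one forward pass per draft token, while a block-diffusion drafter spends a small fixed number of denoising passes per block regardless of its length.

```python
# Illustrative cost model (assumed numbers, not benchmarks): forward
# passes a drafter spends to propose one block of draft tokens.

def draft_passes(block_len, sequential=True, denoise_steps=2):
    """Sequential drafting costs one pass per token; block diffusion
    costs a fixed number of denoising passes per block."""
    return block_len if sequential else denoise_steps

# Proposing a 16-token draft block:
print(draft_passes(16))                    # sequential: 16 passes
print(draft_passes(16, sequential=False))  # diffusion:  2 passes
```

Under these assumptions, doubling the block length from 8 to 16 doubles a sequential drafter's cost but leaves the diffusion drafter's cost unchanged, which is what makes long draft blocks practical.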
The authors plan to open-source the training recipe so users can train their own DFlash draft models to accelerate any LLM. Integration with serving frameworks like SGLang and vLLM is already underway, and discussions are active in the llama.cpp community.
DFlash could fundamentally change how LLMs are served. By eliminating the sequential drafting bottleneck, it makes speculative decoding viable for production workloads that previously couldn’t justify the complexity. For local LLM enthusiasts, the 65 t/s result on consumer hardware with a 27B model is particularly exciting — it puts real-time interactive use within reach for models that previously felt sluggish.
