Instead of costly training from scratch, LLaDA2.0 follows the principles of knowledge inheritance, progressive adaptation, and efficiency-aware design, and seamlessly converts a pre-trained AR model into a dLLM with a novel 3-phase, block-level WSD-based training scheme: progressively increasing the block size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to a compact block size in block diffusion (decay).
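As a rough illustration of the warm-up/stable/decay phasing, here is a minimal sketch of a block-size schedule in Python. The function name, phase fractions, and block-size bounds are assumptions made for illustration; they are not LLaDA2.0's actual hyperparameters.

```python
# Hypothetical sketch of a 3-phase (warm-up / stable / decay) block-size schedule.
# All thresholds and bounds are illustrative assumptions, not LLaDA2.0's real config.

def block_size_schedule(step: int, total_steps: int,
                        warmup_frac: float = 0.1,
                        decay_frac: float = 0.1,
                        min_block: int = 4,
                        max_block: int = 4096) -> int:
    """Return the diffusion block size for a given training step.

    Warm-up: block size grows from min_block toward max_block.
    Stable:  full-sequence diffusion (block size == max sequence length).
    Decay:   revert to a compact block size for efficient inference.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1.0 - decay_frac))

    if step < warmup_steps:
        # Warm-up: progressively increase the block size.
        frac = step / max(1, warmup_steps)
        return max(min_block, int(min_block + frac * (max_block - min_block)))
    elif step < decay_start:
        # Stable: treat the whole sequence as a single block.
        return max_block
    else:
        # Decay: revert to a compact block size.
        return min_block


if __name__ == "__main__":
    total = 100_000
    for s in (0, 5_000, 50_000, 95_000):
        print(s, block_size_schedule(s, total))
```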
Training Paradigm
Models:
- https://huggingface.co/inclusionAI/LLaDA2.0-mini (16B)
- http://huggingface.co/inclusionAI/LLaDA2.0-flash (103B)
Good to see that Diffusion Language Models are getting more momentum now. I have some doubts about converting existing LLMs into diffusion models, but maybe this is ultimately useful tooling for models trained with high precision, such as what ARTEMIS is trying to move towards (#1337284)?