Instead of costly training from scratch, LLaDA2.0 follows the principles of knowledge inheritance, progressive adaptation, and efficiency-aware design, and seamlessly converts a pre-trained AR model into a dLLM with a novel 3-phase, block-level WSD-based training scheme: progressively increasing the block size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to a compact block size in block diffusion (decay).
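As a rough illustration of the warm-up/stable/decay phasing, here is a minimal sketch of a block-size schedule in Python. The function name, phase fractions, and block-size bounds are assumptions made for illustration; they are not LLaDA2.0's actual hyperparameters.

```python
# Hypothetical sketch of a 3-phase (warm-up / stable / decay) block-size schedule.
# All thresholds and bounds are illustrative assumptions, not LLaDA2.0's real config.

def block_size_schedule(step: int, total_steps: int,
                        warmup_frac: float = 0.1,
                        decay_frac: float = 0.1,
                        min_block: int = 4,
                        max_block: int = 4096) -> int:
    """Return the diffusion block size for a given training step.

    Warm-up: block size grows from min_block toward max_block.
    Stable:  full-sequence diffusion (block size == max sequence length).
    Decay:   revert to a compact block size for efficient inference.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1.0 - decay_frac))

    if step < warmup_steps:
        # Warm-up: progressively increase the block size.
        frac = step / max(1, warmup_steps)
        return max(min_block, int(min_block + frac * (max_block - min_block)))
    elif step < decay_start:
        # Stable: treat the whole sequence as a single block.
        return max_block
    else:
        # Decay: revert to a compact block size.
        return min_block


if __name__ == "__main__":
    total = 100_000
    for s in (0, 5_000, 50_000, 95_000):
        print(s, block_size_schedule(s, total))
```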
Training Paradigm
Models:
- https://huggingface.co/inclusionAI/LLaDA2.0-mini (16B)
- http://huggingface.co/inclusionAI/LLaDA2.0-flash (103B)
Good to see that Diffusion Language Models are getting more momentum now. I have some doubts about converting existing LLMs into diffusion models, but maybe this is ultimately useful tooling for models trained with high precision, such as what ARTEMIS is trying to move towards (#1337284)?