Even though this is a paper published by researchers who seem to be quite serious and professional, it does have a little of the flavor of: learn this one trick that will make your AI prompting 100x better!
But they may be on to something:
Post-training alignment methods like RLHF can unintentionally cause mode collapse (Janus, 2022; O’Mahony et al., 2024; Kirk et al., 2024b), whereby the model favors a narrow set of responses (the “mode”) over all plausible outputs.
Critically, this means that even with a perfect reward model and optimization process, inherent bias within preference datasets may still drive mode collapse, affecting the majority of alignment methods that rely on reward models.
The authors identify something they call typicality bias:
typicality bias in human preference data is one pervasive cause of mode collapse. This bias sharpens the probability distribution towards a few stereotypical completions. When many high-quality completions are possible (e.g., in joke generation), this sharpening becomes a tie-breaker, resulting in mode collapse.
This bias seems to affect creative tasks more than computational ones, and it helps explain why LLMs don't offer a very diverse set of responses when given creative tasks. But the authors have a solution:
instead of a traditional, direct prompt asking for a single instance (e.g., “tell me a joke about coffee”), we reformulate the prompt to explicitly ask the model to verbalize a distribution of responses with corresponding probabilities (e.g., “generate 5 responses with their probabilities”). We call our method Verbalized Sampling.
Basically, you tell the model to give you a set of responses across a range of probabilities.
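Here's a minimal sketch of what that reformulation looks like in practice, using the OpenAI Python SDK. The model name, the choice of k=5, and the "probability | response" output format are my own assumptions for illustration, not the authors' exact prompt template.

```python
# Sketch: direct prompt vs. a Verbalized Sampling-style prompt.
# Assumptions: model name, k=5, and the output line format are illustrative only.
from openai import OpenAI

client = OpenAI()

# Direct prompt: after alignment, this tends to return the modal completion.
direct_prompt = "Tell me a joke about coffee."

# Verbalized Sampling: ask for a distribution of responses with probabilities.
vs_prompt = (
    "Generate 5 different jokes about coffee. For each one, also give the "
    "probability that you would produce it. "
    "Return one joke per line in the format: <probability> | <joke>"
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any capable chat model
    messages=[{"role": "user", "content": vs_prompt}],
)
print(response.choices[0].message.content)
```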
Building on this foundation, we conduct comprehensive experiments across creative writing (poem, joke, story generation, §5), social dialogue simulation (§6), open-ended QA tasks (§7), and synthetic data generation (§8). As shown in examples in Figure 3, we find that (1) on creative writing, Verbalized Sampling significantly improves output diversity; (2) on social dialogue simulation, VS induces substantially more human-like behaviors, with some models performing on par with a dedicated fine-tuned model; (3) on open-ended QA tasks with multiple valid answers, it generates a broader and more realistic response distribution, and (4) on synthetic data generation, VS generates more diverse synthetic data that improves downstream math task performance. We also confirm that VS improves performance without sacrificing the models’ factual accuracy (§G.7) or safety (§G.8).
Moral of the story:
When you ask for ONE response, you get the mode joke. When you ask for a DISTRIBUTION, you get the actual diverse distribution the model learned during pretraining.
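And once you have that verbalized distribution, one way to use it is to parse the lines and draw a single response weighted by the stated probabilities. This is a sketch that assumes the "probability | joke" line format from the example above; the probabilities are the model's own verbalized estimates, not calibrated values, and random.choices normalizes the weights so they need not sum to 1.

```python
import random

def sample_from_verbalized(text: str) -> str:
    """Parse '<probability> | <response>' lines and pick one, weighted by probability."""
    items = []
    for line in text.strip().splitlines():
        prob_str, _, joke = line.partition("|")
        try:
            prob = float(prob_str.strip())
        except ValueError:
            continue  # skip lines that don't follow the expected format
        items.append((joke.strip(), prob))
    jokes, probs = zip(*items)
    return random.choices(jokes, weights=probs, k=1)[0]
```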