A fundamental technique lets researchers use a big, expensive "teacher" model to train a "student" model for less.

The Chinese AI company DeepSeek released a chatbot earlier this year called R1, which drew a huge amount of attention. Most of it focused on the fact that a relatively small and unknown company said it had built a chatbot that rivaled the performance of those from the world's most famous AI companies, but using a fraction of the computing power and cost. As a result, the stocks of many Western tech companies plummeted; Nvidia, which sells the chips that run leading AI models, lost more stock value in a single day than any company in history.

Some of that attention involved an element of accusation. Sources alleged that DeepSeek had obtained, without permission, knowledge from OpenAI's proprietary o1 model by using a technique known as distillation. Much of the news coverage framed this possibility as a shock to the AI industry, implying that DeepSeek had discovered a new, more efficient way to build AI.

But distillation, also called knowledge distillation, is a widely used tool in AI, a subject of computer science research going back a decade and one that big tech companies use on their own models. "Distillation is one of the most important tools that companies have today to make models more efficient," said Enric Boix-Adsera, a researcher who studies distillation at the University of Pennsylvania's Wharton School.
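To make the teacher-student idea concrete, here is a minimal sketch of the core of knowledge distillation, using NumPy and hypothetical logits (the function names, temperature value, and example numbers are illustrative, not from any particular system): the student is trained to match the teacher's full, softened output distribution rather than just its top answer.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # A temperature above 1 softens the distribution, exposing the
    # teacher's relative confidence across all answers, not just its best one.
    z = logits / temperature
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Kullback-Leibler divergence from the teacher's softened distribution
    # to the student's: minimizing this trains the student to imitate
    # the teacher's behavior. (A real training loop would typically blend
    # this with an ordinary loss on ground-truth labels.)
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * np.log(p / q)))

teacher = np.array([4.0, 1.0, 0.5])   # hypothetical teacher logits
student = np.array([3.0, 1.5, 0.2])   # hypothetical student logits
loss = distillation_loss(student, teacher)
```

A gradient step on `loss` nudges the student's logits toward the teacher's, which is how a small model absorbs a large model's knowledge without seeing its weights.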