
Apple has managed to squeeze a 20B-parameter multimodal model onto phones with 12 GB of RAM. This is possible because of a technique Apple developed for sparse-architecture models: the full model is stored in flash memory, and only a subset of its parameters is activated for each prompt. This makes me wonder what would be possible if this were combined with the 1-bit model work that the Bonsai AI folks are doing.

AFM 3 Core Advanced, our most powerful on-device model. It’s natively multimodal, enabling helpful features like expressive voices and higher-accuracy dictation. Built on cutting-edge Apple research, this 20-billion-parameter model uses a sparse architecture, activating just 1 to 4 billion parameters at a time depending on the request. AFM 3 Core Advanced is unlocked by and optimized for our most capable Apple silicon systems.
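
For a rough sense of scale, here is a back-of-envelope sketch of weight storage at a few bit-widths. The 20B total and the 1 to 4B active range come from the quote above; the bit-widths are my own assumptions, since the excerpt does not say what precision Apple uses, and the figures ignore the KV cache, activations, and everything else competing for RAM.

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes): params * bits / 8."""
    return params_billions * bits_per_weight / 8


TOTAL_B = 20.0            # total parameters, from the quoted text
ACTIVE_B = (1.0, 4.0)     # active parameters per request, from the quoted text
BIT_WIDTHS = (16, 4, 1)   # assumed precisions, NOT stated in the source

for bits in BIT_WIDTHS:
    full = weight_gb(TOTAL_B, bits)
    low = weight_gb(ACTIVE_B[0], bits)
    high = weight_gb(ACTIVE_B[1], bits)
    print(f"{bits:>2}-bit: full model {full:5.1f} GB, active set {low:.3f} to {high:.3f} GB")
```

Under these assumptions, only the active set has to fit in DRAM, which is why the per-prompt activation matters more for a 12 GB phone than the full-model size does.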

Maximizing on-device AI capabilities

One area of deep innovation is our most powerful on-device model, AFM 3 Core Advanced. Traditional large language models—whether dense or sparsely activated—require all weights to reside in active memory (DRAM), creating a massive footprint that limits scalability on consumer hardware. To break this barrier, AFM 3 Core Advanced introduces a novel sparsely activated architecture built on Instruction-Following Pruning (IFP), a technique developed by Apple researchers (see Figure 1).
Instead of forcing the entire model into DRAM, the full model is stored in flash memory (NAND). Because NAND-to-DRAM bandwidth is too slow to swap weights token by token, as standard MoE models require, AFM 3 Core Advanced makes routing decisions per prompt. A lightweight, dense block selects a fixed set of experts during initial processing, periodically reselecting them during generation. To minimize data movement, the model relies on a high percentage of always-active “shared experts” alongside input-dependent “routed experts” swapped into DRAM only when needed.
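
To make the per-prompt routing idea concrete, here is a toy sketch, not Apple's implementation. It assumes a router that scores routed experts from the context, a top-k selection made once for the prompt, and reselection every few generated tokens. `ExpertCache`, `toy_router`, `KEYWORDS`, and `reselect_every` are all hypothetical names, and a dict stands in for flash storage. Shared experts are omitted because they would simply stay resident.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ExpertCache:
    flash: dict[str, bytes]                       # stand-in for NAND: all routed experts
    k: int                                        # routed experts kept in DRAM at once
    dram: dict[str, bytes] = field(default_factory=dict)
    flash_reads: int = 0                          # counts expert loads from flash

    def select(self, scores: dict[str, float]) -> list[str]:
        # Highest score first; ties broken by name so the result is deterministic.
        chosen = sorted(scores, key=lambda n: (-scores[n], n))[: self.k]
        for name in list(self.dram):              # evict experts no longer selected
            if name not in chosen:
                del self.dram[name]
        for name in chosen:                       # load only the experts that are missing
            if name not in self.dram:
                self.dram[name] = self.flash[name]
                self.flash_reads += 1
        return chosen


def generate(prompt: list[str], n_tokens: int, cache: ExpertCache,
             router: Callable[[list[str]], dict[str, float]],
             reselect_every: int) -> list[tuple[str, ...]]:
    """Return the routed-expert set in use at each generated token."""
    context = list(prompt)
    active = cache.select(router(context))        # one routing decision for the prompt
    trace = []
    for step in range(n_tokens):
        if step > 0 and step % reselect_every == 0:
            active = cache.select(router(context))  # periodic reselection
        trace.append(tuple(active))
        context.append(f"tok{step}")              # placeholder for real decoding
    return trace


# Hypothetical router: score each expert by keyword hits in the context.
KEYWORDS = {"code": {"python", "bug"}, "math": {"sum", "proof"},
            "chat": {"hello"}, "vision": {"image"}}


def toy_router(context: list[str]) -> dict[str, float]:
    return {name: float(sum(w in kws for w in context)) for name, kws in KEYWORDS.items()}


cache = ExpertCache(flash={name: b"weights" for name in KEYWORDS}, k=2)
trace = generate(["fix", "python", "bug", "sum"], 6, cache, toy_router, reselect_every=3)
print(trace[0], cache.flash_reads)
```

The contrast with token-level MoE routing is in `flash_reads`: it grows only when a reselection changes the chosen set, instead of potentially on every token.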

This design also introduces crucial inference-time elasticity. Rather than using a single model for all tasks or managing an ensemble of smaller models, AFM 3 Core Advanced uses a predetermined number of active parameters tailored to each specific use case. This allows weights to be loaded incrementally across requests of varying difficulty, scaling the model size far beyond traditional DRAM limits while minimizing latency.
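
One way to picture the incremental loading, purely as a sketch under my own assumptions: give each use case a fixed active-parameter budget and let larger budgets reuse whatever smaller ones already loaded. The use-case names, the 0.5B shared block, the 0.5B expert size, and the nested-prefix scheme are all hypothetical. Only the 1 to 4B range comes from the quoted text.

```python
# Hypothetical per-use-case budgets, in billions of active parameters.
BUDGETS_B = {"dictation": 1.0, "summarize": 2.0, "reasoning": 4.0}


class ElasticLoader:
    def __init__(self, shared_b: float = 0.5, expert_b: float = 0.5):
        self.shared_b = shared_b      # assumed size of the always-resident block
        self.expert_b = expert_b      # assumed size of one routed expert
        self.resident = 0             # routed experts currently held in DRAM

    def experts_for(self, budget_b: float) -> int:
        # Routed experts that fit in the budget after the shared block.
        return max(0, int((budget_b - self.shared_b) // self.expert_b))

    def request(self, use_case: str) -> float:
        """Return billions of parameters newly read from flash for this request."""
        needed = self.experts_for(BUDGETS_B[use_case])
        extra = max(0, needed - self.resident)
        # Keep the larger set resident; a real system would presumably
        # evict under memory pressure, which this sketch does not model.
        self.resident = max(self.resident, needed)
        return extra * self.expert_b


loader = ElasticLoader()
for use_case in ("dictation", "summarize", "reasoning", "dictation"):
    print(use_case, loader.request(use_case))
```

In this toy version a harder request pays only for the difference between its budget and what is already resident, and an easy request that follows a hard one reads nothing from flash.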