
[Groq LPU inference white paper (Nvidia_Groq_Tsp.pdf)](https://substack-post-media.s3.us-east-1.amazonaws.com/post-files/183040467/493c210e-30ba-486b-9da8-30869e0e1562.pdf)

Groq achieves its remarkable inference speed by holding all of a model's parameters in static RAM (SRAM) directly on the LPUs. Each LPU has only a limited amount of SRAM, so a large model typically has to be spread across hundreds of LPUs networked together.
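To see why the chip count climbs into the hundreds, here's a back-of-the-envelope sketch. The ~230 MB per-chip SRAM figure matches Groq's published GroqChip specs; the FP16 weight size and the model sizes are illustrative assumptions, not vendor sizing guidance:

```python
import math

SRAM_PER_LPU_MB = 230   # ~230 MB of on-chip SRAM per GroqChip (published spec)
BYTES_PER_PARAM = 2     # assuming FP16/BF16 weights

def lpus_needed(params_billions: float) -> int:
    """Minimum chip count just to hold the weights in SRAM.
    Ignores KV cache, activations, and scheduling overhead,
    so real deployments need more chips than this lower bound."""
    weight_bytes = params_billions * 1e9 * BYTES_PER_PARAM
    sram_bytes = SRAM_PER_LPU_MB * 1e6
    return math.ceil(weight_bytes / sram_bytes)

for size in (7, 70):
    print(f"{size}B params -> at least {lpus_needed(size)} LPUs")
# 7B params  -> at least 61 LPUs
# 70B params -> at least 609 LPUs
```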

This also means that LPUs don't depend on access to expensive and scarce High Bandwidth Memory (HBM).

Unlike GPUs, which rely on asynchronous execution and complex runtime schedulers, Groq's LPU is fully deterministic. The compiler, not runtime hardware logic, controls every aspect of execution. When you compile a model for Groq, the compiler generates a static, cycle-accurate execution plan that maps every operation and data movement to physical hardware resources ahead of time. This eliminates runtime decision-making entirely.
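To make "the compiler controls everything" concrete, here is a toy static scheduler. This is a pedagogical sketch, not Groq's actual compiler: the op list, the fixed latencies, and the MXM/VXM unit names (borrowed from the functional-unit naming in the TSP paper) are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScheduledOp:
    name: str         # e.g. "matmul_layer0"
    unit: str         # functional unit that executes it
    start_cycle: int  # issue cycle, fixed at compile time
    duration: int     # fixed latency in cycles

def compile_schedule(ops):
    """Assign fixed cycles to a linear chain of ops: each op starts when
    its predecessor's result is ready AND its unit is free. With fixed
    latencies, the entire timeline is known before any input arrives."""
    next_free = {}   # unit -> first cycle the unit is free
    ready = 0        # cycle the previous op's result is ready
    plan = []
    for name, unit, duration in ops:
        start = max(ready, next_free.get(unit, 0))
        plan.append(ScheduledOp(name, unit, start, duration))
        ready = start + duration
        next_free[unit] = ready
    return plan

for op in compile_schedule([
    ("matmul_layer0", "MXM", 4),
    ("gelu_layer0",   "VXM", 1),
    ("matmul_layer1", "MXM", 4),
]):
    print(f"cycle {op.start_cycle:>2}: {op.unit} issues {op.name}")
```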

When running inference (sketched in code after the list):

  1. Weights are pre-loaded into SRAM during initialization
  2. Input tokens trigger the pre-compiled execution sequence
  3. Data streams through the chip following the compiler's schedule
  4. Each core performs its assigned operations (matrix multiplies, activations, etc.) in perfect synchronization
  5. Results flow out deterministically with single-digit millisecond latency for large models
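
Taken together, the runtime side amounts to little more than replaying a fixed plan. A minimal sketch of steps 1 through 5, with a made-up two-layer model standing in for a real network (all names and shapes here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: weights are loaded once, up front (here: two tiny layers).
WEIGHTS = [rng.standard_normal((8, 8)) for _ in range(2)]

# The "compiled" plan is a fixed list of (op, layer_index) pairs.
PLAN = [("matmul", 0), ("relu", None), ("matmul", 1)]

def run_token(x: np.ndarray) -> np.ndarray:
    """Steps 2-5: an input vector streams through the fixed plan.
    No scheduler or cache hierarchy makes decisions at runtime:
    the same input triggers the same ops in the same order, every time."""
    for op, idx in PLAN:
        if op == "matmul":
            x = WEIGHTS[idx] @ x
        elif op == "relu":
            x = np.maximum(x, 0.0)
    return x

out = run_token(rng.standard_normal(8))
print(out.shape)  # (8,)
```

Because the plan never branches on data, two identical inputs take exactly the same path through the chip, which is what makes the latency so predictable.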

The trade-off is flexibility: models must be recompiled for specific hardware configurations, and the architecture is optimized for inference workloads with static compute graphs, not dynamic control flow or training.
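
To see why dynamic control flow breaks this model, compare two toy function bodies (illustrative Python, nothing Groq-specific):

```python
def static_body(x: float) -> float:
    # Same ops in the same order for every input: a compiler can pin
    # every operation to a cycle before the first token arrives.
    return (x * 2.0) + 1.0

def dynamic_body(x: float) -> float:
    # Op count depends on runtime data: the iteration count is unknown
    # at compile time, so no cycle-accurate plan can be fixed up front.
    while x < 100.0:
        x = x * 2.0
    return x
```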