
[Groq LPU inference white paper (Nvidia_Groq_Tsp.pdf)](https://substack-post-media.s3.us-east-1.amazonaws.com/post-files/183040467/493c210e-30ba-486b-9da8-30869e0e1562.pdf)

Groq achieves its remarkable inference speed by holding all of a model's parameters in static RAM (SRAM) directly on the LPUs. Each LPU has only a limited amount of SRAM, so a large model typically has to be spread across hundreds of LPUs networked together.
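To see why the chip count climbs into the hundreds, here's a back-of-the-envelope sketch. The ~230 MB per-chip SRAM figure matches Groq's published GroqChip specs; the FP16 weight size and the model sizes are illustrative assumptions, not vendor sizing guidance:

```python
import math

SRAM_PER_LPU_MB = 230   # ~230 MB of on-chip SRAM per GroqChip (published spec)
BYTES_PER_PARAM = 2     # assuming FP16/BF16 weights

def lpus_needed(params_billions: float) -> int:
    """Minimum chip count just to hold the weights in SRAM.
    Ignores KV cache, activations, and scheduling overhead,
    so real deployments need more chips than this lower bound."""
    weight_bytes = params_billions * 1e9 * BYTES_PER_PARAM
    sram_bytes = SRAM_PER_LPU_MB * 1e6
    return math.ceil(weight_bytes / sram_bytes)

for size in (7, 70):
    print(f"{size}B params -> at least {lpus_needed(size)} LPUs")
# 7B params  -> at least 61 LPUs
# 70B params -> at least 609 LPUs
```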

This also means that LPUs don't depend on access to expensive and scarce High Bandwidth Memory (HBM).

Unlike GPUs, which rely on asynchronous execution and complex runtime schedulers, Groq's LPU is fully deterministic. The compiler, not runtime hardware logic, controls every aspect of execution. When you compile a model for Groq, the compiler generates a static, cycle-accurate execution plan that maps every operation and data movement to physical hardware resources ahead of time. This eliminates runtime decision-making entirely.
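To make "the compiler controls everything" concrete, here is a toy static scheduler. This is a pedagogical sketch, not Groq's actual compiler: the op list, the fixed latencies, and the MXM/VXM unit names (borrowed from the functional-unit naming in the TSP paper) are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScheduledOp:
    name: str         # e.g. "matmul_layer0"
    unit: str         # functional unit that executes it
    start_cycle: int  # issue cycle, fixed at compile time
    duration: int     # fixed latency in cycles

def compile_schedule(ops):
    """Assign fixed cycles to a linear chain of ops: each op starts when
    its predecessor's result is ready AND its unit is free. With fixed
    latencies, the entire timeline is known before any input arrives."""
    next_free = {}   # unit -> first cycle the unit is free
    ready = 0        # cycle the previous op's result is ready
    plan = []
    for name, unit, duration in ops:
        start = max(ready, next_free.get(unit, 0))
        plan.append(ScheduledOp(name, unit, start, duration))
        ready = start + duration
        next_free[unit] = ready
    return plan

for op in compile_schedule([
    ("matmul_layer0", "MXM", 4),
    ("gelu_layer0",   "VXM", 1),
    ("matmul_layer1", "MXM", 4),
]):
    print(f"cycle {op.start_cycle:>2}: {op.unit} issues {op.name}")
```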

When running inference (sketched in code after the list):

  1. Weights are pre-loaded into SRAM during initialization
  2. Input tokens trigger the pre-compiled execution sequence
  3. Data streams through the chip following the compiler's schedule
  4. Each core performs its assigned operations (matrix multiplies, activations, etc.) in perfect synchronization
  5. Results flow out deterministically with single-digit millisecond latency for large models
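
Taken together, the runtime side amounts to little more than replaying a fixed plan. A minimal sketch of steps 1 through 5, with a made-up two-layer model standing in for a real network (all names and shapes here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: weights are loaded once, up front (here: two tiny layers).
WEIGHTS = [rng.standard_normal((8, 8)) for _ in range(2)]

# The "compiled" plan is a fixed list of (op, layer_index) pairs.
PLAN = [("matmul", 0), ("relu", None), ("matmul", 1)]

def run_token(x: np.ndarray) -> np.ndarray:
    """Steps 2-5: an input vector streams through the fixed plan.
    No scheduler or cache hierarchy makes decisions at runtime:
    the same input triggers the same ops in the same order, every time."""
    for op, idx in PLAN:
        if op == "matmul":
            x = WEIGHTS[idx] @ x
        elif op == "relu":
            x = np.maximum(x, 0.0)
    return x

out = run_token(rng.standard_normal(8))
print(out.shape)  # (8,)
```

Because the plan never branches on data, two identical inputs take exactly the same path through the chip, which is what makes the latency so predictable.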

The trade-off is flexibility: models must be recompiled for specific hardware configurations, and the architecture is optimized for inference workloads with static compute graphs, not dynamic control flow or training.
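
To see why dynamic control flow breaks this model, compare two toy function bodies (illustrative Python, nothing Groq-specific):

```python
def static_body(x: float) -> float:
    # Same ops in the same order for every input: a compiler can pin
    # every operation to a cycle before the first token arrives.
    return (x * 2.0) + 1.0

def dynamic_body(x: float) -> float:
    # Op count depends on runtime data: the iteration count is unknown
    # at compile time, so no cycle-accurate plan can be fixed up front.
    while x < 100.0:
        x = x * 2.0
    return x
```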