The rate at which an AI system can process work, typically measured in tokens per second. Higher throughput enables serving more users simultaneously. Data center design increasingly optimizes for throughput, because inference workloads grow continuously with user demand, whereas a training run is a fixed, bounded job.
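As a minimal sketch of what "tokens per second" means in practice, the snippet below times a generation call and divides token count by elapsed wall-clock time. The `generate` function and the toy model standing in for it are hypothetical, not from the source.

```python
import time

def measure_throughput(generate, prompt: str, n_tokens: int) -> float:
    """Time one generation call and return tokens per second."""
    start = time.perf_counter()
    generate(prompt, n_tokens)  # hypothetical generation callable
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Toy stand-in for a model: pretends to emit one token per millisecond.
def fake_generate(prompt: str, n_tokens: int) -> None:
    time.sleep(n_tokens * 0.001)

tps = measure_throughput(fake_generate, "hello", 100)
print(f"{tps:.0f} tokens/s")  # roughly 1000 tokens/s for this toy model
```

Real serving systems report the same ratio, usually split into prefill and decode phases and aggregated across concurrent requests.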
Discussed in Chapter 1 of This Is Server Country