Key Takeaways
- AI systems process text as numerical tokens, not words—about 1,300 tokens for every 1,000 words
- Inference (generating responses) happens billions of times daily and requires massive infrastructure
- Modern AI is built on transformer architecture with attention mechanisms that understand context
- Memory bandwidth, not raw computation speed, is often the limiting factor in AI performance
- The shift from training-dominant to inference-dominant economics drives data center buildout
The Token: Where AI Begins
Every time you ask ChatGPT a question, something happens before the AI "thinks" about your request: your words get converted into numbers. Not just any numbers, but specific numerical tokens that the AI can process.
This might seem like a technical detail, but it's fundamental to understanding why AI requires the infrastructure it does. When you type "How is the weather today?", the AI doesn't see those words. It sees something more like [2437, 318, 262, 6193, 1909, 30]. The exact numbers vary by model, but the principle remains: language becomes mathematics.
The ratio matters, too. Modern large language models typically use about 1,300 tokens for every 1,000 words of English text. Some words become one token. Others, especially uncommon words or technical terms, might split into two, three, or more tokens. The phrase "data center" might be two tokens, while "servercountry" could be three or four, depending on which fragments happen to be in the tokenizer's vocabulary.
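To make this concrete, here is a minimal sketch using OpenAI's open-source tiktoken library with its cl100k_base encoding as a stand-in for whatever tokenizer a given model actually ships. Exact IDs and counts vary from model to model.

```python
# Rough tokenization demo using the open-source tiktoken library.
# cl100k_base is only a stand-in: every model family has its own tokenizer,
# so the exact IDs and counts will differ.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "How is the weather today?"
token_ids = enc.encode(text)
print(token_ids)                                   # the integers the model actually sees
print(f"{len(token_ids)} tokens for {len(text.split())} words")

# Compound or uncommon strings tend to split into several pieces.
for phrase in ["data center", "servercountry"]:
    pieces = [enc.decode([t]) for t in enc.encode(phrase)]
    print(f"{phrase!r} -> {pieces}")
```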
This seemingly simple conversion has profound implications. Every token requires memory. Every token requires computation. And when you're processing billions of requests per day, those tokens add up to infrastructure measured in gigawatts.
From Words to Numbers: Tokenization
The process of converting text into tokens is called tokenization, and the most common approach is something called byte-pair encoding (BPE). Think of it as finding the most efficient way to represent language using a fixed vocabulary of about 50,000 to 100,000 tokens.
Here's a simplified way to understand it: Imagine you're creating a codebook. The most common words—"the," "is," "and"—get their own codes. But instead of having a code for every possible word, you also include codes for common fragments. The suffix "-ing" gets a code. The prefix "un-" gets a code. Common syllables get codes.
This approach is efficient, but it has consequences. English tends to tokenize efficiently because tokenizer vocabularies are built mostly from English-heavy text. A sentence in English might use 20 tokens, while the same sentence in a less common language might require 40 or 50. That isn't just inefficient; the same request costs more to run and takes longer to process.
Why not just use words? Because words don't capture enough structure. The model needs to understand that "running," "runs," and "ran" are related. It needs to handle made-up words, typos, and technical jargon. Tokens give the model flexibility while keeping the vocabulary manageable.
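The merging idea behind BPE fits in a few lines. The toy sketch below is not a production tokenizer; it just starts from single characters and repeatedly merges the most frequent adjacent pair across a tiny made-up corpus, which is the core of how those fragment codes get chosen.

```python
# Toy byte-pair encoding: repeatedly merge the most frequent adjacent pair
# of symbols in a tiny corpus. Real tokenizers work on bytes over terabytes
# of text, but the core loop is the same idea.
from collections import Counter

corpus = ["running", "runs", "ran", "unrunnable"]
words = [list(w) for w in corpus]          # start from single characters

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge(words, pair):
    a, b = pair
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(a + b)          # fuse the pair into one symbol
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

for step in range(5):                      # a handful of merges is enough to see it
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = merge(words, pair)
    print(f"merge {step + 1}: {pair} -> {words}")
```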
The Transformer Revolution
In 2017, researchers at Google published a paper with an audacious title: "Attention Is All You Need." That paper introduced the transformer architecture, which now powers every major AI system from ChatGPT to Claude to Gemini.
Before transformers, AI systems processed language sequentially, word by word, like reading a book from left to right. This created problems. By the time the model reached the end of a long sentence, it had partially forgotten the beginning. Context was hard to maintain.
Transformers solve this through a mechanism called attention. Instead of processing words in order, the model looks at all the words simultaneously and figures out which ones matter most for understanding each other word. It's the difference between reading a sentence word-by-word versus scanning the whole thing, seeing how the pieces relate, and then making sense of it.
Consider the word "bank." Does it mean a financial institution or the side of a river? In the sentence "I went to the bank to deposit money," the words "deposit" and "money" provide strong signals. The attention mechanism lets the model weigh those signals heavily when determining what "bank" means in this context.
This sounds simple, but it requires massive computation. For every token in a sequence, the model needs to compare it against every other token. In a 2,000-token document (about 1,500 words), that's 4 million pairwise comparisons. And that's just one layer—modern models have dozens of layers.
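Here is a minimal numpy sketch of that core attention step for the "bank" sentence. It uses random stand-in embeddings and skips the learned query/key/value projections, multiple heads, and stacked layers of a real transformer; the point is only the all-pairs comparison and the weighted mixing.

```python
# Minimal scaled dot-product attention over a few tokens, in numpy.
# Real models learn separate Q/K/V projections, run many heads in parallel,
# and stack dozens of layers; this keeps only the core math.
import numpy as np

rng = np.random.default_rng(0)

tokens = ["I", "went", "to", "the", "bank", "to", "deposit", "money"]
d = 16                                   # tiny embedding size, for illustration
x = rng.normal(size=(len(tokens), d))    # stand-in embeddings, one row per token

Q, K, V = x, x, x                        # real models would project x three ways

scores = Q @ K.T / np.sqrt(d)            # every token compared against every other
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
output = weights @ V                     # each token becomes a weighted mix of values

# The n x n score matrix is why cost grows with the square of sequence length:
print("pairwise comparisons:", scores.size)      # 8 tokens -> 64 comparisons
print("attention paid by 'bank' to each token:")
for tok, w in zip(tokens, weights[tokens.index("bank")]):
    print(f"  {tok:8s} {w:.3f}")
```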
Inference vs. Training: Why This Matters Now
There are two distinct phases in the life of an AI model: training and inference. Understanding the difference is critical to understanding why data center buildout has accelerated so dramatically.
Training is teaching the model. You feed it trillions of tokens from books, websites, and conversations. The model adjusts billions of internal parameters to predict what comes next. Training GPT-4 reportedly cost over $100 million in computing resources and took months. But you only do it once (or occasionally, when updating the model).
Inference is using the model. Every time someone asks ChatGPT a question, that's inference. Every time an AI writes an email, generates code, or analyzes an image, that's inference. And inference happens constantly.
Consider the scale: ChatGPT reportedly handles about 2.5 billion prompts daily. That's roughly 29,000 requests per second, every second, all day. Each request might generate 500 tokens of response. That's 14.5 million tokens per second that need to be generated, which means 14.5 million trips through a model with hundreds of billions of parameters.
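Those figures are back-of-envelope arithmetic, and it is worth writing out, using the reported numbers as inputs:

```python
# Back-of-envelope version of the inference numbers above.
prompts_per_day = 2.5e9                  # reported daily ChatGPT prompts
seconds_per_day = 24 * 60 * 60
tokens_per_response = 500                # assumed average response length

requests_per_second = prompts_per_day / seconds_per_day
tokens_per_second = requests_per_second * tokens_per_response

print(f"{requests_per_second:,.0f} requests per second")        # ~29,000
print(f"{tokens_per_second / 1e6:.1f} million tokens per second")  # ~14.5 million
```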
For years, the AI industry focused on training. Training required the most advanced chips, the most power, the most infrastructure. But around 2023, something shifted. Inference began to dominate. Not because training became less important, but because the number of people using AI exploded.
This shift is what's driving the current infrastructure boom. A training cluster might have 10,000 or 50,000 GPUs. But to serve billions of daily requests, you need inference capacity at a completely different scale. Hence the trillion-dollar buildout.
The Memory Bandwidth Bottleneck
Here's a counterintuitive fact: the limiting factor in AI performance often isn't how fast your chip can compute—it's how fast it can move data.
Modern AI models have hundreds of billions of parameters. GPT-4 is rumored to have over a trillion. Each parameter is a number, typically stored as a 16-bit or 8-bit value. When the model processes a token, it needs to access these parameters, perform computations, and move results around.
The problem is memory bandwidth. Even the most advanced AI chips can compute much faster than they can load data from memory. The NVIDIA H100 chip, the workhorse of modern AI infrastructure, can perform about 2,000 trillion operations per second. But it can only move about 3 terabytes per second from its memory to its processors.
This creates a bottleneck. The chip sits idle, waiting for data. Engineers call this being "memory-bound" rather than "compute-bound." It's like having a chef who can chop vegetables incredibly fast but has to wait for someone to hand them one carrot at a time.
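A rough estimate shows how lopsided this is. The sketch below assumes a hypothetical 70-billion-parameter model stored at 16-bit precision, batch size one, and the H100's headline figures; it ignores caches, batching tricks, and the fact that a model this large would really be sharded across several GPUs. It is an illustration of the imbalance, not a benchmark.

```python
# Why generating one token at a time is memory-bound: a rough estimate.
# Assumptions (not from the text): a hypothetical 70B-parameter model in
# 16-bit precision, batch size 1, and headline H100 figures of roughly
# 2e15 operations/second and 3.35e12 bytes/second of HBM bandwidth.
params = 70e9
bytes_per_param = 2                      # 16-bit weights
peak_flops = 2.0e15                      # ~2,000 trillion ops/sec
mem_bandwidth = 3.35e12                  # ~3.35 TB/s

# Generating one token touches (roughly) every weight once:
bytes_moved = params * bytes_per_param
flops_needed = 2 * params                # ~2 operations per parameter per token

time_loading = bytes_moved / mem_bandwidth
time_computing = flops_needed / peak_flops

print(f"time streaming weights : {time_loading * 1e3:6.1f} ms per token")
print(f"time doing the math    : {time_computing * 1e3:6.3f} ms per token")
print(f"the chip waits on memory ~{time_loading / time_computing:,.0f}x longer than it computes")
```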
This is why High Bandwidth Memory (HBM) exists. HBM is specialized memory stacked directly on or very close to the processor, reducing the distance data needs to travel. It's dramatically more expensive than regular memory, but for AI workloads, it's essential. The NVIDIA H100 includes 80GB of HBM3, and by some estimates that memory accounts for more of the chip's cost than the processor die itself.
The memory bottleneck also explains why AI chips consume so much power. Moving data takes energy. The faster you move it, the more energy you use. And at the scales required for modern AI, that energy adds up to tens of kilowatts per rack and megawatts per data hall.
Why This Requires Infrastructure
All of this—tokens, transformers, attention mechanisms, memory bandwidth—converges on a single reality: modern AI requires massive physical infrastructure.
The scaling laws are relentless. Research has shown that AI capabilities improve predictably with three factors: more parameters, more training data, and more computation. Doubling the model size doesn't just incrementally improve performance—it often unlocks entirely new capabilities. GPT-3 struggled to hold a long essay together; GPT-4 can. The difference? Primarily scale.
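That predictability can be written down. Scaling-law papers fit model loss as a sum of power laws in parameter count and training tokens; the sketch below uses that functional form with illustrative placeholder constants rather than any published fit, just to show the shape of the curve.

```python
# Scaling-law sketch: loss modeled as an irreducible term plus power laws in
# parameter count N and training tokens D (the shape used in the Chinchilla
# line of work). The constants here are placeholders, not the published fits.
def predicted_loss(n_params, n_tokens, E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    return E + A / n_params**alpha + B / n_tokens**beta

for n in (1e9, 10e9, 100e9, 1e12):            # 1B -> 1T parameters
    loss = predicted_loss(n, n_tokens=20 * n)  # ~20 training tokens per parameter
    print(f"{n / 1e9:7,.0f}B params -> predicted loss {loss:.3f}")
```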
This creates a compounding infrastructure requirement. You need more chips to hold more parameters. You need more power to run those chips. You need more cooling to remove the heat those chips generate. You need more network bandwidth to connect the chips. And you need redundancy for all of it, because if the system goes down, millions of users notice immediately.
Consider a single large inference cluster: 50,000 NVIDIA H100 chips, each drawing about 700 watts. That's 35 megawatts just for the chips, before cooling, before networking, before power conversion losses. At typical power usage effectiveness (PUE) of 1.2, you're looking at 42 megawatts total.
That's roughly enough power for 35,000 homes. But instead of homes, it's serving AI requests. And that's just one cluster. The industry is building dozens of them.
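The arithmetic behind those figures is short. The only number below that isn't in the text is the per-household comparison, which assumes an average US home draws roughly 1.2 kilowatts over the course of a year.

```python
# The cluster power arithmetic, spelled out.
chips = 50_000
watts_per_chip = 700                     # approximate H100 board power
pue = 1.2                                # total facility power / IT power

it_power_mw = chips * watts_per_chip / 1e6
facility_power_mw = it_power_mw * pue

# Assumption for the comparison (not a figure from the text): an average US
# household draws about 1.2 kW over a year.
avg_home_kw = 1.2
homes_equivalent = facility_power_mw * 1000 / avg_home_kw

print(f"IT load       : {it_power_mw:.0f} MW")          # 35 MW
print(f"with PUE {pue}  : {facility_power_mw:.0f} MW")   # 42 MW
print(f"~{homes_equivalent:,.0f} homes' average draw")   # ~35,000
```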
The physical reality behind the digital experience is staggering. When you ask ChatGPT to write you a poem, somewhere in a data center—likely in Northern Virginia or Iowa or Texas—thousands of chips wake up, billions of parameters flow through memory, trillions of calculations execute, and heat pours into cooling systems. All in the second or two it takes for your poem to appear.
This is why fields in rural Michigan become sites for $7 billion projects. This is why electrical substations become more valuable than proximity to fiber. This is why the AI boom isn't just a software story—it's fundamentally an infrastructure story.
And the infrastructure requirements are only growing. As models get larger and usage increases, the gap between what we have and what we need widens. The trillion-dollar buildout isn't speculative excess. Given current AI scaling trends and adoption rates, it might not be enough.
Go Deeper
The technical foundations of AI and their infrastructure implications are explored in depth in Chapter 1 of This Is Server Country, which traces the path from transformers to trillion-dollar buildout.
The book examines how attention mechanisms work, why memory bandwidth matters more than computation speed, and how the shift from training to inference changed the economics of AI infrastructure.
Learn more about the book →