Key Takeaways
- AI systems process text as numerical tokens, not words—about 1,300 tokens for every 1,000 words
- Inference (generating responses) happens billions of times daily and requires massive infrastructure
- Modern AI is built on transformer architecture with attention mechanisms that understand context
- Memory bandwidth, not raw computation speed, is often the limiting factor in AI performance
- The shift from training-dominant to inference-dominant economics drives data center buildout
The Token: Where AI Begins
Every time you ask ChatGPT a question, something happens before the AI "thinks" about your request: your words get converted into numbers. Not just any numbers, but specific numerical tokens that the AI can process.
This might seem like a technical detail, but it's fundamental to understanding why AI requires the infrastructure it does. When you type "How is the weather today?", the AI doesn't see those words. It sees something more like [2437, 318, 262, 6193, 1909, 30]. The exact numbers vary by model, but the principle remains: language becomes mathematics.
The ratio matters, too. Modern large language models typically use about 1,300 tokens for every 1,000 words of English text. Some words become one token. Others, especially uncommon words or technical terms, might split into two, three, or more tokens. The phrase "data center" might be two tokens, while "servercountry" could be three or four, depending on which fragments happen to be in the tokenizer's vocabulary.
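To make this concrete, here is a minimal sketch using OpenAI's open-source tiktoken library with its cl100k_base encoding as a stand-in for whatever tokenizer a given model actually ships. Exact IDs and counts vary from model to model.

```python
# Rough tokenization demo using the open-source tiktoken library.
# cl100k_base is only a stand-in: every model family has its own tokenizer,
# so the exact IDs and counts will differ.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "How is the weather today?"
token_ids = enc.encode(text)
print(token_ids)                                   # the integers the model actually sees
print(f"{len(token_ids)} tokens for {len(text.split())} words")

# Compound or uncommon strings tend to split into several pieces.
for phrase in ["data center", "servercountry"]:
    pieces = [enc.decode([t]) for t in enc.encode(phrase)]
    print(f"{phrase!r} -> {pieces}")
```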
This seemingly simple conversion has profound implications. Every token requires memory. Every token requires computation. And when you're processing billions of requests per day, those tokens add up to infrastructure measured in gigawatts.
From Words to Numbers: Tokenization
The process of converting text into tokens is called tokenization, and the most common approach is something called byte-pair encoding (BPE). Think of it as finding the most efficient way to represent language using a fixed vocabulary of about 50,000 to 100,000 tokens.
Here's a simplified way to understand it: Imagine you're creating a codebook. The most common words—"the," "is," "and"—get their own codes. But instead of having a code for every possible word, you also include codes for common fragments. The suffix "-ing" gets a code. The prefix "un-" gets a code. Common syllables get codes.
This approach is efficient, but it has consequences. English tends to tokenize efficiently because tokenizer vocabularies are built mostly from English-heavy text. A sentence in English might use 20 tokens, while the same sentence in a less common language might require 40 or 50. That isn't just inefficient; the same request costs more to run and takes longer to process.
Why not just use words? Because words don't capture enough structure. The model needs to understand that "running," "runs," and "ran" are related. It needs to handle made-up words, typos, and technical jargon. Tokens give the model flexibility while keeping the vocabulary manageable.
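The merging idea behind BPE fits in a few lines. The toy sketch below is not a production tokenizer; it just starts from single characters and repeatedly merges the most frequent adjacent pair across a tiny made-up corpus, which is the core of how those fragment codes get chosen.

```python
# Toy byte-pair encoding: repeatedly merge the most frequent adjacent pair
# of symbols in a tiny corpus. Real tokenizers work on bytes over terabytes
# of text, but the core loop is the same idea.
from collections import Counter

corpus = ["running", "runs", "ran", "unrunnable"]
words = [list(w) for w in corpus]          # start from single characters

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge(words, pair):
    a, b = pair
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == pair:
                out.append(a + b)          # fuse the pair into one symbol
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

for step in range(5):                      # a handful of merges is enough to see it
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = merge(words, pair)
    print(f"merge {step + 1}: {pair} -> {words}")
```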
The Transformer Revolution
In 2017, researchers at Google published a paper with an audacious title: "Attention Is All You Need." That paper introduced the transformer architecture, which now powers every major AI system from ChatGPT to Claude to Gemini.
Before transformers, AI systems processed language sequentially, word by word, like reading a book from left to right. This created problems. By the time the model reached the end of a long sentence, it had partially forgotten the beginning. Context was hard to maintain.
Transformers solve this through a mechanism called attention. Instead of processing words in order, the model looks at all the words simultaneously and figures out which ones matter most for understanding each other word. It's the difference between reading a sentence word-by-word versus scanning the whole thing, seeing how the pieces relate, and then making sense of it.
Consider the word "bank." Does it mean a financial institution or the side of a river? In the sentence "I went to the bank to deposit money," the words "deposit" and "money" provide strong signals. The attention mechanism lets the model weigh those signals heavily when determining what "bank" means in this context.
This sounds simple, but it requires massive computation. For every token in a sequence, the model needs to compare it against every other token. In a 2,000-token document (about 1,500 words), that's 4 million pairwise comparisons. And that's just one layer—modern models have dozens of layers.
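Here is a minimal numpy sketch of that core attention step for the "bank" sentence. It uses random stand-in embeddings and skips the learned query/key/value projections, multiple heads, and stacked layers of a real transformer; the point is only the all-pairs comparison and the weighted mixing.

```python
# Minimal scaled dot-product attention over a few tokens, in numpy.
# Real models learn separate Q/K/V projections, run many heads in parallel,
# and stack dozens of layers; this keeps only the core math.
import numpy as np

rng = np.random.default_rng(0)

tokens = ["I", "went", "to", "the", "bank", "to", "deposit", "money"]
d = 16                                   # tiny embedding size, for illustration
x = rng.normal(size=(len(tokens), d))    # stand-in embeddings, one row per token

Q, K, V = x, x, x                        # real models would project x three ways

scores = Q @ K.T / np.sqrt(d)            # every token compared against every other
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
output = weights @ V                     # each token becomes a weighted mix of values

# The n x n score matrix is why cost grows with the square of sequence length:
print("pairwise comparisons:", scores.size)      # 8 tokens -> 64 comparisons
print("attention paid by 'bank' to each token:")
for tok, w in zip(tokens, weights[tokens.index("bank")]):
    print(f"  {tok:8s} {w:.3f}")
```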
Inference vs. Training: Why This Matters Now
There are two distinct phases in the life of an AI model: training and inference. Understanding the difference is critical to understanding why data center buildout has accelerated so dramatically.
Training is teaching the model. You feed it trillions of tokens from books, websites, and conversations. The model adjusts billions of internal parameters to predict what comes next. Training GPT-4 reportedly cost over $100 million in computing resources and took months. But you only do it once (or occasionally, when updating the model).
Inference is using the model. Every time someone asks ChatGPT a question, that's inference. Every time an AI writes an email, generates code, or analyzes an image, that's inference. And inference happens constantly.
Consider the scale: ChatGPT reportedly handles about 2.5 billion prompts daily. That's roughly 29,000 requests per second, every second, all day. Each request might generate 500 tokens of response. That's 14.5 million tokens per second that need to be generated, which means 14.5 million trips through a model with hundreds of billions of parameters.
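Those figures are back-of-envelope arithmetic, and it is worth writing out, using the reported numbers as inputs:

```python
# Back-of-envelope version of the inference numbers above.
prompts_per_day = 2.5e9                  # reported daily ChatGPT prompts
seconds_per_day = 24 * 60 * 60
tokens_per_response = 500                # assumed average response length

requests_per_second = prompts_per_day / seconds_per_day
tokens_per_second = requests_per_second * tokens_per_response

print(f"{requests_per_second:,.0f} requests per second")        # ~29,000
print(f"{tokens_per_second / 1e6:.1f} million tokens per second")  # ~14.5 million
```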
For years, the AI industry focused on training. Training required the most advanced chips, the most power, the most infrastructure. But around 2023, something shifted. Inference began to dominate. Not because training became less important, but because the number of people using AI exploded.
This shift is what's driving the current infrastructure boom. A training cluster might have 10,000 or 50,000 GPUs. But to serve billions of daily requests, you need inference capacity at a completely different scale. Hence the trillion-dollar buildout.
The Memory Bandwidth Bottleneck
Here's a counterintuitive fact: the limiting factor in AI performance often isn't how fast your chip can compute—it's how fast it can move data.
Modern AI models have hundreds of billions of parameters. GPT-4 is rumored to have over a trillion. Each parameter is a number, typically stored as a 16-bit or 8-bit value. When the model processes a token, it needs to access these parameters, perform computations, and move results around.
The problem is memory bandwidth. Even the most advanced AI chips can compute much faster than they can load data from memory. The NVIDIA H100 chip, the workhorse of modern AI infrastructure, can perform about 2,000 trillion operations per second. But it can only move about 3 terabytes per second from its memory to its processors.
This creates a bottleneck. The chip sits idle, waiting for data. Engineers call this being "memory-bound" rather than "compute-bound." It's like having a chef who can chop vegetables incredibly fast but has to wait for someone to hand them one carrot at a time.
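A rough estimate shows how lopsided this is. The sketch below assumes a hypothetical 70-billion-parameter model stored at 16-bit precision, batch size one, and the H100's headline figures; it ignores caches, batching tricks, and the fact that a model this large would really be sharded across several GPUs. It is an illustration of the imbalance, not a benchmark.

```python
# Why generating one token at a time is memory-bound: a rough estimate.
# Assumptions (not from the text): a hypothetical 70B-parameter model in
# 16-bit precision, batch size 1, and headline H100 figures of roughly
# 2e15 operations/second and 3.35e12 bytes/second of HBM bandwidth.
params = 70e9
bytes_per_param = 2                      # 16-bit weights
peak_flops = 2.0e15                      # ~2,000 trillion ops/sec
mem_bandwidth = 3.35e12                  # ~3.35 TB/s

# Generating one token touches (roughly) every weight once:
bytes_moved = params * bytes_per_param
flops_needed = 2 * params                # ~2 operations per parameter per token

time_loading = bytes_moved / mem_bandwidth
time_computing = flops_needed / peak_flops

print(f"time streaming weights : {time_loading * 1e3:6.1f} ms per token")
print(f"time doing the math    : {time_computing * 1e3:6.3f} ms per token")
print(f"the chip waits on memory ~{time_loading / time_computing:,.0f}x longer than it computes")
```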
This is why High Bandwidth Memory (HBM) exists. HBM is specialized memory stacked directly on or very close to the processor, reducing the distance data needs to travel. It's dramatically more expensive than regular memory, but for AI workloads, it's essential. The NVIDIA H100 includes 80GB of HBM3, and by some estimates that memory accounts for more of the chip's cost than the processor die itself.
The memory bottleneck also explains why AI chips consume so much power. Moving data takes energy. The faster you move it, the more energy you use. And at the scales required for modern AI, that energy adds up to tens of kilowatts per rack and megawatts per data hall.
Why This Requires Infrastructure
All of this—tokens, transformers, attention mechanisms, memory bandwidth—converges on a single reality: modern AI requires massive physical infrastructure.
The scaling laws are relentless. Research has shown that AI capabilities improve predictably with three factors: more parameters, more training data, and more computation. Doubling the model size doesn't just incrementally improve performance—it often unlocks entirely new capabilities. GPT-3 struggled to hold a long essay together; GPT-4 can. The difference? Primarily scale.
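That predictability can be written down. Scaling-law papers fit model loss as a sum of power laws in parameter count and training tokens; the sketch below uses that functional form with illustrative placeholder constants rather than any published fit, just to show the shape of the curve.

```python
# Scaling-law sketch: loss modeled as an irreducible term plus power laws in
# parameter count N and training tokens D (the shape used in the Chinchilla
# line of work). The constants here are placeholders, not the published fits.
def predicted_loss(n_params, n_tokens, E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    return E + A / n_params**alpha + B / n_tokens**beta

for n in (1e9, 10e9, 100e9, 1e12):            # 1B -> 1T parameters
    loss = predicted_loss(n, n_tokens=20 * n)  # ~20 training tokens per parameter
    print(f"{n / 1e9:7,.0f}B params -> predicted loss {loss:.3f}")
```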
This creates a compounding infrastructure requirement. You need more chips to hold more parameters. You need more power to run those chips. You need more cooling to remove the heat those chips generate. You need more network bandwidth to connect the chips. And you need redundancy for all of it, because if the system goes down, millions of users notice immediately.
Consider a single large inference cluster: 50,000 NVIDIA H100 chips, each drawing about 700 watts. That's 35 megawatts just for the chips, before cooling, before networking, before power conversion losses. At typical power usage effectiveness (PUE) of 1.2, you're looking at 42 megawatts total.
That's roughly enough power for 35,000 homes. But instead of homes, it's serving AI requests. And that's just one cluster. The industry is building dozens of them.
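The arithmetic behind those figures is short. The only number below that isn't in the text is the per-household comparison, which assumes an average US home draws roughly 1.2 kilowatts over the course of a year.

```python
# The cluster power arithmetic, spelled out.
chips = 50_000
watts_per_chip = 700                     # approximate H100 board power
pue = 1.2                                # total facility power / IT power

it_power_mw = chips * watts_per_chip / 1e6
facility_power_mw = it_power_mw * pue

# Assumption for the comparison (not a figure from the text): an average US
# household draws about 1.2 kW over a year.
avg_home_kw = 1.2
homes_equivalent = facility_power_mw * 1000 / avg_home_kw

print(f"IT load       : {it_power_mw:.0f} MW")          # 35 MW
print(f"with PUE {pue}  : {facility_power_mw:.0f} MW")   # 42 MW
print(f"~{homes_equivalent:,.0f} homes' average draw")   # ~35,000
```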
The physical reality behind the digital experience is staggering. When you ask ChatGPT to write you a poem, somewhere in a data center—likely in Northern Virginia or Iowa or Texas—thousands of chips wake up, billions of parameters flow through memory, trillions of calculations execute, and heat pours into cooling systems. All in the second or two it takes for your poem to appear.
This is why fields in rural Michigan become sites for $7 billion projects. This is why electrical substations become more valuable than proximity to fiber. This is why the AI boom isn't just a software story—it's fundamentally an infrastructure story.
And the infrastructure requirements are only growing. As models get larger and usage increases, the gap between what we have and what we need widens. The trillion-dollar buildout isn't speculative excess. Given current AI scaling trends and adoption rates, it might not be enough.
Go Deeper
The technical foundations of AI and their infrastructure implications are explored in depth in Chapter 1 of This Is Server Country, which traces the path from transformers to trillion-dollar buildout.
The book examines how attention mechanisms work, why memory bandwidth matters more than computation speed, and how the shift from training to inference changed the economics of AI infrastructure.
Learn more about the book →