Three-doc walkthrough — token, agentic, SemiAnalysis

The three punchlines

Token primer: The unit-level setup

One forward pass per token; tokens stress four GPU resources: compute, HBM bandwidth, HBM capacity, and NVLink. Output tokens are priced ~3–10× input tokens (decode is memory-bound, prefill is compute-bound).
Deflation is bimodal. Commodity tier collapses ~80%/yr (a16z "LLMflation": 1,000× in three years at fixed capability). Frontier is flat — Anthropic held Opus at $5/$25 across three releases.
Stanford: agentic tasks burn 1,000× more tokens than chat. Huang at MS-TMT: industry compute "grew 1,000-fold in two years."
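
The annual rate and the three-year multiple in the commodity-tier bullet are two views of the same compounding. A minimal Python sketch of the conversion follows; the function names are illustrative and not from the source. Under a constant-rate assumption, an 80%/yr decline compounds to 125× over three years, while 1,000× in three years corresponds to roughly 90%/yr (10× per year), so the two figures are best read as separate estimates.

```python
def cumulative_factor(annual_decline: float, years: float) -> float:
    """Price-reduction multiple after `years` of a constant fractional decline."""
    return 1.0 / (1.0 - annual_decline) ** years


def implied_annual_decline(total_factor: float, years: float) -> float:
    """Constant annual decline that yields `total_factor` over `years`."""
    return 1.0 - total_factor ** (-1.0 / years)


# Assumption: the decline rate is constant year over year.
print(round(cumulative_factor(0.80, 3)))            # 125
print(round(implied_annual_decline(1000.0, 3), 3))  # 0.9
```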

Punchline: Volume × intensity is outrunning per-token deflation. Total token spend grows even as $/MTok falls.
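
The punchline rests on a product: spend = tokens × price. A rough sketch, assuming a single blended $/MTok and treating token growth as a free parameter (the 10× input below is illustrative, not a figure from the source):

```python
def spend_multiplier(token_growth: float, annual_price_decline: float) -> float:
    """Year-over-year change in total token spend (1.0 means flat)."""
    return token_growth * (1.0 - annual_price_decline)


def breakeven_token_growth(annual_price_decline: float) -> float:
    """Token-volume growth needed to keep total spend flat."""
    return 1.0 / (1.0 - annual_price_decline)


# With ~80%/yr deflation, volume must grow more than 5x/yr for spend to rise.
print(round(breakeven_token_growth(0.80), 2))  # 5.0
# Hypothetical 10x volume growth against 80%/yr deflation doubles spend.
print(round(spend_multiplier(10.0, 0.80), 2))  # 2.0
```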

Agentic primer: What it does to the BOM

Agent = LLM in a ReAct loop + tool-call schema + MCP (all three labs converged on it within five months).
Three new cluster-level shifts: (1) KV-cache spills past on-package HBM into a brand-new "warm context" tier (NVIDIA CMX = BlueField-4 + DRAM + NAND SSD); (2) CPU pull-through: Microsoft Fairwater runs 48 MW CPU vs 295 MW GPU (~1:6); (3) east-west traffic dominates (70–90% of flows), and NVIDIA named a new networking category, "scale-across", for DC-to-DC fabric.
Agentic BOM looks different: network + interconnect = 10.4% of cluster all-in (vs 6.5–7.9% elsewhere); fiber + optics = 3.7% (vs 1.3–2.9%); land/shell/EPC drops to 10.4%.

Punchline: Consensus underprices on two axes: bottom-up 2030 capex pool runs ~$2.6T Base vs Goldman's $1.46T (1.78×). BofA already lifted 2030 to $1.7T, called it "additive."

SemiAnalysis: Value capture shifted to model labs

Anthropic inference-infra GM moved 38% → 70%+ in ~12 months; ARR went $9B → $44B.
Mechanism is agentic workload structure: realized Opus 4.7 blends to $0.99/MTok (vs $5/$25 sticker) because traces run 300:1 input:output and 90%+ cache hits ($0.50/MTok cached input).
Cost-per-token fell faster than realized price. Blackwell + TPUv7/Trainium 3 = 30× tokens/sec/chip vs prior gen; SW (wide-EP, disagg prefill, MTP) adds up-to-14× on identical HW.
Premium SKUs selling: Opus Fast (6× regular), Mythos ($25/$125, 5× regular). Buyers pay because per-token productivity exceeds price.
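
The realized-price mechanism above is a weighted average, and can be sketched as follows. This is a simplified model under stated assumptions: it uses only the sticker prices, cached-input price, and 300:1 ratio quoted above, ignores any cache-write surcharge, and treats the cache-hit rate as a free parameter; the function and parameter names are illustrative.

```python
def blended_price(input_price: float, cached_input_price: float,
                  output_price: float, input_to_output: float,
                  cache_hit_rate: float) -> float:
    """Realized $/MTok averaged over all input and output tokens in a trace."""
    # Effective input price after mixing cached and uncached reads.
    input_blend = (cache_hit_rate * cached_input_price
                   + (1.0 - cache_hit_rate) * input_price)
    # Weight input by the input:output ratio; output has weight 1.
    total = input_to_output * input_blend + output_price
    return total / (input_to_output + 1.0)


# $5/$25 sticker, $0.50 cached input, 300:1 input:output.
print(round(blended_price(5.0, 0.50, 25.0, 300, 0.90), 2))  # 1.03
print(round(blended_price(5.0, 0.50, 25.0, 300, 0.91), 2))  # 0.99
```

Under these assumptions a hit rate of exactly 90% blends to about $1.03/MTok, and about 91% reproduces the quoted $0.99, consistent with the "90%+" wording.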

Punchline: "Real agentic AI has permanently increased the market-clearing price per token, and there's no going back." Inverts The Information's deteriorating-margin narrative.

Investible themes — where the dollars land

Infrastructure layer	Thesis	Names to watch
Networking — east-west fabric	Agentic shifts traffic inside the DC. Backend NIC + switch silicon scales with port count, not GPU FLOPS.	ANET (Tomahawk/Jericho merchant silicon), AVGO, MRVL (custom-ASIC tailwind)
Networking — "scale-across" (DC-to-DC)	Brand-new category NVIDIA just named. Long-haul + metro fiber moves from optional to mandatory once GPU clusters exceed single-site power.	CIEN (TD Cowen "No DC Is An Island"), DY (JPM fiber-build read), GLW as fiber substrate
Networking — optical in-rack	400G → 800G → 1.6T transceivers; co-packaged optics is the credible next step when copper runs out of room.	COHR, LITE, FN; Celestial AI (private — optical interconnect for new memory tier)
KV-cache offload tier (CMX)	Brand-new line item. BlueField-4 + DRAM + NAND SSDs holding "warm context" between HBM and bulk storage. Didn't exist nine months ago.	NVIDIA BlueField; MU / SK Hynix / Samsung (HBM + GDDR7 + NAND); Solidigm, PSTG for SSD layer
CPU pull-through	Per-token economics are GPU-led; per-cluster capex is not. Agents pull CPU for orchestration, RAG, KV-spill management.	AMD (server CPU), INTC (Xeon 6+); Arm-custom angle via MRVL (AWS Graviton silicon), ARM licensing
Disaggregated prefill/decode	Rubin CPX puts GDDR7 onto AI-server bills alongside HBM4 — first time commodity memory enters the inference TAM.	Same memory names — but GDDR7 is a new line, incremental volume for MU / SK Hynix not in prior DC mix
Voice/agent application stack	Voice agents reroute call audio to 3 new endpoints per turn (STT/LLM/TTS) on sub-200 ms latency budget.	ElevenLabs, Deepgram (private); TWLO ConversationRelay as orchestration layer
Vector DB + indexing	RAG and codebase indexing are persistent new infrastructure layers — Cursor indexes every customer codebase.	Pinecone, Weaviate, Qdrant (all private); MDB Atlas Vector as public adjacency
Agent security	OWASP LLM06 ("Excessive Agency") is a real category; agent-driven egress + supply-chain risk through MCP servers.	PANW, ZS, CRWD; Protect AI, Lasso private-side
Model labs themselves	If SemiAnalysis is right, the labs are the highest-margin link — not the wrappers (Cursor/Devin run negative GM today).	Anthropic, OpenAI secondaries; wrapper exposure is the wrong end of the bar-bell

The asymmetry to press on next call

The two bull cases stack rather than cancel:

If bottom-up is right (agentic primer §6): Capex pool is ~1.78× consensus → networking, fiber, CPU, KV-tier memory all under-counted in published TAMs.

If SemiAnalysis is right: Model-lab margins keep widening → value capture concentrates with Anthropic / OpenAI; wrapper/app layer is a trap.

The least-priced corner is networking — specifically the "scale-across" DC-to-DC fiber + optics layer, because (a) it didn't exist as a named category nine months ago, (b) it shows up in every agentic capex shock scenario, and (c) the public-comp set (CIEN, DY, COHR, LITE) trades at infrastructure multiples, not AI multiples.