Agentic AI — A Working Primer

1. What agentic AI is

An agent is a language model wrapped in a loop. Where a chatbot answers one question and stops, an agent reads a task, decides on an action, observes the result, updates what it knows, and loops — taking dozens or hundreds of steps before producing a single user-visible answer. The model itself does not run code or call services; it emits text. The harness — the program around the model — parses that text, executes the requested action (read a file, query a database, send an email, run a test), and feeds the result back into the next prompt. The loop continues until the agent reports done or the harness stops it.

The formal academic framing is the ReAct paper (Yao et al., October 2022). ReAct stands for "Reasoning and Acting" and its central move is to interleave the two in a single language-model trace rather than treat them as separate problems (Yao et al., 2022, ReAct: Synergizing Reasoning and Acting in Language Models). Every step is one trip through reason → act → observe.

That structural difference vs. a chatbot — looping, calling external services, accumulating state — is what makes the system agentic. It also re-shapes the underlying infrastructure: more tokens consumed per task, different network paths, new memory tiers, a different mix of GPU and CPU silicon at the cluster level. The rest of this primer walks through those shifts.

2. How it works — the protocol stack

Three layers stacked on top of each other. The bottom layer is the loop; the middle is how the model emits an action; the top is how the application talks to third-party tools.

The loop (ReAct). The harness sends a prompt; the model writes a short rationale ("I need today's weather for Paris") and emits a structured request to call a tool. The harness runs the tool, appends the result to the conversation, and re-prompts the model. The model is stateless between turns; the harness carries state forward by appending to the conversation each time. This is the underlying reason agentic systems are so token-hungry — every loop iteration re-reads everything that came before it.

Tool-use schemas. Both Anthropic and OpenAI ship a structured protocol for the action step. The developer provides a JSON Schema that declares each tool's name, description, and arguments; the model emits a typed call that the harness can execute programmatically. Anthropic's framing: "Tool use lets Claude call functions you define or that Anthropic provides. Claude decides when to call a tool based on the user's request and the tool's description, then returns a structured call that your application executes (client tools) or that Anthropic executes (server tools)" ¹. Claude emits a tool_use block; OpenAI's equivalent is tool_calls in the Responses API ². Both vendors offer a strict: true mode that guarantees the call matches the schema exactly — closing the gap where earlier models would sometimes emit malformed JSON. Structurally, this is remote procedure call (RPC, a standard way for one program to invoke a function in another) embedded inside the conversation.

The Model Context Protocol (MCP). Function calling solved how one model talks to one developer's own tools. MCP solves what happens when every AI application wants to talk to every tool. Anthropic launched MCP on November 25, 2024 with the launch-post framing: "Yet even the most sophisticated models are constrained by their isolation from data—trapped behind information silos and legacy systems. Every new data source requires its own custom implementation" ³. The official spec uses a USB-C analogy — "MCP provides a standardized way to connect AI applications to external systems" ⁴. Without MCP, connecting M AI applications to N data sources requires M×N bespoke integrations. With MCP, each side speaks the protocol; M+N implementations cover the same ground.

The adoption timeline is the load-bearing signal: launched Nov 25, 2024; OpenAI adopted MCP in March 2025, Google DeepMind in April 2025, Microsoft via Semantic Kernel through 2025; in December 2025 Anthropic donated MCP to the new Agentic AI Foundation under the Linux Foundation, co-founded with Block and OpenAI ⁵. All three major foundation-model labs converged on a shared three-piece protocol stack — ReAct-style loop, schema-validated tool calls, MCP as the cross-vendor integration layer — within five months.

Anchor products. Every modern agent product runs some version of this stack. Claude Code (Anthropic, launched Feb 24 2025) — "reads your codebase, edits files, runs commands, and integrates with your development tools" ⁶. Cursor — multi-model agent inside a VS Code fork; supports Anthropic, OpenAI, Google, xAI, and its own model Composer 2. Devin (Cognition, March 12 2024) — "a tireless, skilled teammate, equally ready to build alongside you or independently complete tasks" — first to ship the marketing of "autonomous software engineer," scored 13.86% on SWE-bench end-to-end at launch ⁷; Cognition has not publicly disclosed Devin's internal architecture. Anthropic computer use (Oct 22 2024) — gives the model a screenshot tool and mouse/keyboard actions, scored 14.9% on OSWorld at launch ⁸. OpenAI Agents SDK + Responses API (March 11 2025) — OpenAI's consolidated agent stack with built-in web search, file search, and computer-use tools.

3. The hardware, re-covered for agents

Per the token primer §4, every LLM call stresses four GPU resources: compute (FP16/FP8 TFLOPS), HBM bandwidth, HBM capacity, and inter-GPU NVLink. Agent steps stress those resources differently from chatbot calls, in three concrete ways.

Multi-turn KV-cache pressure. Quick refresher on the hardware terms used below — covered in detail in the token primer §4. A modern AI accelerator is a GPU (NVIDIA's H100, H200, or Blackwell B200) carrying two on-package memories: HBM (high-bandwidth memory, the fast scratchpad where the model's weights and active state live during a call) measured in gigabytes of capacity and terabytes-per-second of bandwidth. Inter-GPU links — NVLink — let several GPUs talk to each other when a model is too big for one. Compute throughput is measured in TFLOPS (trillions of floating-point operations per second). The current Hopper / Hopper-200 / Blackwell generations carry 80 / 141 / 192 GB of HBM per GPU respectively (⁹, H200, DGX B200). The KV-cache is the model's running record of attention computations — "what I've already looked at in this conversation" — and it lives in HBM during a call. NVIDIA's own inference guide notes the cache grows "linearly with batch size and sequence length" and "can have a large memory footprint" ¹⁰. NVIDIA's worked example: Llama 2 7B at 16-bit precision, batch size 1, the KV-cache alone is ~2 GB. Agentic workloads compound this on both axes. Sequence length grows per task because every step appends its tool calls, observations, and reasoning to the same context window — Stanford's measurement of agentic coding tasks: "consuming 1000x more tokens than code reasoning and code chat" ¹¹. And as context grows, the GPU holds fewer concurrent conversations in HBM — a direct linear tradeoff. At 4K context, a 7B model fits ~278 users per GPU; at 32K, ~35.

Figure 1. The memory hierarchy and rack-level layout for agentic workloads. Inside the rack, eight GPUs with on-package HBM share an NVLink fabric; the KV-cache spills off-HBM through a Bluefield-4 DPU into a new "warm context" tier of DRAM then NVMe SSD. At cluster level, a dedicated CPU-and-storage building handles orchestration — Microsoft's Fairwater split estimated at roughly 1 CPU MW for every 6 GPU MW.

The KV-cache offload tier — NVIDIA CMX. Long-context, multi-turn agents now routinely push the KV-cache past on-package HBM's 80 / 141 / 192 GB ceiling on Hopper / Hopper-200 / Blackwell. NVIDIA's response, announced at GTC 2026, is CMX (the Context Memory Storage Platform): a system that lets the KV-cache spill from HBM to a cheaper, larger tier of conventional DRAM and NAND SSDs, controlled by Bluefield-4 — NVIDIA's data-processing unit (DPU, a network-attached processor that handles I/O and storage). SemiAnalysis's coverage of the GTC 2026 announcement: "CMX addresses a growing bottleneck in modern inference infrastructure: the rapid expansion of KV Cache required to support long-context and agentic workloads. KV cache grows linearly with input sequence length and number of users and is the primary tier of memory expansion that inference must address" (SemiAnalysis, "GTC 2026: The Inference Kingdom Expands," Mar 24 2026; full coverage indexed in Noldor). The architectural meaning is a new memory tier in every agent-serving cluster — a "warm context" layer between HBM and bulk storage that did not exist as a distinct line item before. Vendors building into that tier include NVIDIA (Bluefield-4 + CMX), Celestial AI (optical-interconnected DRAM), and the NAND/SSD vendors (Solidigm, Micron) supplying the underlying storage capacity.

Cluster-level CPU pull-through. Per-token economics are GPU-led — covered in the token primer §4 — but per-cluster capex pulls CPU back in. Each agent step still runs as a normal LLM forward pass on the GPU, but the agent loop's surrounding work — running tool calls, parsing JSON, retrieving from vector stores, managing the KV-cache spill across the cluster — runs on CPU. Microsoft's "Fairwater" AI campus for OpenAI is architected with a separate air-cooled CPU-and-storage building alongside the dense GPU building; Microsoft has publicly stated the network includes "millions of CPU cores for operational compute tasks" ¹². SemiAnalysis's satellite-imagery analysis estimates the split at 48 MW CPU vs 295 MW GPU — roughly 1:6 — and projects that ratio rises as GPU performance-per-watt improves faster than CPU performance-per-watt (SemiAnalysis, "CPUs Are Back," Feb 2026). The IR confirmation lives in both major server-CPU vendors' Q4 2025 calls: AMD's Lisa Su flagged "these AI processes or AI agents that are spinning off a lot of work, in an enterprise, they're actually going to a lot of traditional CPU tasks." as a driver of 2026 server-CPU TAM growth ¹³; Intel CFO David Zinsner framed the same dynamic as "The world is shifting from human-prompted requests to persistent and recursive commands driven by computer-to-computer interactions." ¹⁴; Meta's April 2026 AWS Graviton partnership commits "tens of millions of Graviton cores" explicitly for "agentic AI — autonomous systems that reason, plan, and execute complex tasks" ¹⁵.

4. How agentic AI changes a workflow — three case studies

A workflow's "agentic" upgrade is not just about productivity. The data path, the artifacts produced, and the infrastructure footprint all change. Three documented cases.

4.1 Coding — the cleanest quantitative anchor

Old flow. Engineer types code into an IDE. Local files, local compile, local tests. Network traffic from the developer's machine: git pulls/pushes, package-manager fetches, occasional doc lookups. The only AI bytes on the wire are short autocomplete suggestions (original GitHub Copilot: snippets in, single-line completions out).

New flow. Engineer types a natural-language task into Claude Code, Cursor, GitHub Copilot agent mode, or Devin. The agent reads files via a Read tool, grep/glob to locate symbols, edits files via an Edit tool, runs shell commands via a Bash tool — compiling, testing, opening branches, drafting PRs. Per Anthropic, Claude Code can write tests for untested code, fix lint errors across a project, resolve merge conflicts, update dependencies, and write release notes ⁶. GitHub's Copilot coding agent runs in "its own ephemeral development environment, powered by GitHub Actions," meaning execution moves off the developer's laptop into GitHub's hosted infrastructure ¹⁶.

What changed. Per Bai et al. (Stanford/MIT/Microsoft Research/Anthropic), "agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat"; "runs on the same task can differ by up to 30x in total tokens"; and "Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5" on identical SWE-bench Verified tasks (¹⁷, 24 Apr 2026). Input tokens dominate cost — the agent re-reads the full conversation history every turn. Cursor adds operational color: "In a focused sprint earlier this year, we drove all tool calls to at least 2 or often 3 9s of reliability" ¹⁸; semantic search across an indexed codebase is "one of the biggest drivers of agent performance," with cross-user index reuse made possible because clones within one organization "average 92% similarity" ¹⁹. Codebase indexing is a new infrastructure layer that did not exist in the autocomplete era. Macro footprint: SemiAnalysis estimates "4% of GitHub public commits are being authored by Claude Code right now. At the current trajectory, we believe that Claude Code will be 20%+ of all daily commits by the end of 2026" (SemiAnalysis, "Claude Code is the Inflection Point," Feb 2026; indexed in Noldor).

4.2 Voice customer service — the cleanest network-rerouting story

Old flow. Caller dials a 1-800 number. The call rides telco signalling protocols (SS7 for traditional circuit-switched lines; SIP — Session Initiation Protocol — for modern IP-based ones) into a contact-center-as-a-service (CCaaS) provider's session border controllers, gets queued, and terminates on a human agent's softphone. CCaaS providers — Five9, NICE, Genesys, Twilio Flex — handle queuing, routing, IVR menus (interactive voice response — "press 1 for billing"), CRM integration. Voice bytes themselves ride RTP (real-time transport protocol, the standard for streaming audio) between the telco edge and the CCaaS data center; no LLM is in the loop.

New flow. The call lands at the telco / CCaaS, but now flows over a WebSocket — a long-lived two-way HTTP connection that keeps a streaming session open — to a streaming speech-to-text (STT) endpoint. The transcript is streamed token-by-token into an LLM. The LLM's response tokens are streamed into a text-to-speech (TTS) endpoint. Synthesized audio is streamed back to the caller. The LLM can invoke tool calls mid-conversation — "transfer to human," "look up order," "process refund." Twilio's ConversationRelay documentation describes the architecture: a WebSocket session between the caller's call leg and ConversationRelay handles the speech-text conversions and orchestrates LLM calls in real time ²⁰, with documented STT/TTS providers Deepgram, Google, Amazon Polly, and ElevenLabs.

Figure 2. Voice customer service — before and after. The old flow rides a single linear path (caller → telco → CCaaS → human → CRM entry). The new flow keeps the same telco edge but fans the conversation out over WebSockets to streaming STT, LLM, TTS, and tool-call clusters per turn. Endpoint count in the loop rises from four to seven, each on different infrastructure.

What changed. Voice bytes that used to terminate at a CCaaS data center near a telco peering point now mirror over WebSocket to three new endpoints per turn: a streaming STT provider's cluster (Deepgram, Google), an LLM provider's cluster (Anthropic, OpenAI), and a streaming TTS provider's cluster (ElevenLabs, Polly). Each on different infrastructure. Latency is unforgiving: Deepgram publishes "transcription latency in 300 milliseconds or less for streaming workloads" ²¹, and its Aura-2 TTS targets sub-200 ms time-to-first-byte on the audio-output side (Deepgram Aura-2). Total wall-clock budget for a human-feeling voice agent is roughly 800 ms end-to-end. Deepgram prices its Voice Agent API (STT + LLM + TTS + orchestration bundled) at "$4.50 per hour" ²² — a different unit economics from per-license CCaaS subscription pricing. New artifacts: a real-time transcript (text), an LLM trace with reasoning and tool calls (JSON), TTS audio chunks (binary), and end-of-call summaries — where the old flow generated one audio file plus a CRM entry.

4.3 Non-voice customer service — the deflection-economics anchor

Old flow. A ticket arrives via email, chat widget, or web form. Help-desk software (Zendesk, Intercom, Salesforce Service Cloud, ServiceNow) routes it to a queue based on rule-based triggers. A human agent picks it up, looks up the customer, writes a reply. SLAs measured in hours or days. Pre-LLM bots used deterministic decision trees, keyword routing, scripted FAQs.

New flow. An AI agent (Intercom Fin, Zendesk AI, Salesforce Agentforce, Forethought) reads the ticket, retrieves relevant context from the company's knowledge base via RAG (retrieval-augmented generation: searching a vector database to fetch the right context, rather than stuffing everything into the prompt), calls tools to look up order status / customer history / billing, generates a reply, and either sends it or hands off to a human with a draft. Intercom defines the operational metric formally: "Automation rate = Involvement rate × Resolution rate. If Fin is involved in 50% of eligible conversations and resolves 60% of those, your automation rate is 30%" ²³.

Figure 3. Non-voice customer service after the agentic upgrade. A ticket flows into the LLM agent in the center; the agent retrieves context from the company's vector database (RAG), calls tools against the order, CRM, and billing systems, then either resolves the ticket directly or hands off to a human with a draft. Operational efficacy is tracked as Automation rate = Involvement × Resolution.

What changed. Fin.ai publishes "Fin averages a 67% resolution rate across its customer base, with top-performing customers reaching 80-84%" ²⁴ at "$0.99 per resolution" — vendor self-disclosure, but the price is observable. Klarna is the most cited deployment, with the most complete public record on both sides of the question. Klarna's Feb 27 2024 press release: the AI assistant "has had 2.3 million conversations, two-thirds of Klarna's customer service chats," doing "the equivalent work of 700 full-time agents," with customers resolving "in less than 2 mins compared to 11 mins previously" and "a 25% drop in repeat inquiries," estimated to drive "$40 million USD in profit improvement" ²⁵. Fifteen months later, Klarna was "turning back to people to help with customer service work" ²⁶ — the deflection and unit-economics numbers survive; the framing that AI permanently replaced 700 customer-service agents did not. Where ticket data used to stay inside one help-desk platform, every turn now serializes ticket text + retrieved knowledge-base chunks + customer context into a prompt sent to an LLM endpoint — measurable per-interaction inference cost where before there was only platform-license cost.

5. Knock-on infrastructure effects

Five second-order shifts, sized to what's currently disclosed in primary sources.

Figure 4. Three nested scale axes inside an AI build-out. Scale-up connects GPUs within a rack via NVLink; scale-out connects racks within a building via backend Ethernet; scale-across connects buildings within (and between) campuses via NVIDIA's Spectrum-XGS Ethernet and DCI fiber. Agentic workloads generate east-west traffic at all three nesting levels; only the final reply rides the north-south path back to the user.

Network traffic — east-west dominance and DC-to-DC pull-through. A data-center network has two directions: north-south is traffic between the facility and end users on the internet; east-west is everything that stays inside the facility, between servers. For a chatbot Q&A the work is mostly north-south. For an agent it's mostly east-west: each tool call, vector-store lookup, code execution, and follow-up LLM step terminates on a different server inside the same facility, and the agent loops through many such steps per user-visible answer. Industry coverage puts east-west at 70–90% of total flows inside AI-driven data centers (commentary-grade; the agentic-specific fraction is not yet primary-disclosed). Arista's FY26 10-K: "Modern AI applications need high-bandwidth, lossless, low-latency, scalable, multi-tenant networks that interconnect hundreds or thousands of accelerators at high speed from 100Gbps to 400Gbps, evolving to 800Gbps and beyond" ²⁷. Google's Jupiter fabric supports "more than 6Pb/sec of datacenter bandwidth" — Google's own engineering disclosure notes Jupiter has evolved beyond its early non-blocking-Clos design (a hierarchical multi-stage topology where every server can reach every other server through bounded hops) toward an optical-circuit-switched direct-mesh architecture ²⁸ — the kind of fabric needed when every agent step generates new east-west flows. And the long-haul story: NVIDIA's Spectrum-XGS Ethernet, launched at Hot Chips Aug 22 2025, extends NVIDIA's Spectrum-X backend networking "to interconnect multiple, distributed data centers to form massive AI super-factories capable of giga-scale intelligence," framed against the fact that "individual data centers are reaching the limits of power and capacity within a single facility" ²⁹. NVIDIA calls this new fabric category "scale-across," distinct from scale-up (more GPUs in one rack) and scale-out (more racks in one building) — and it drives the metro and long-haul fiber capex thesis carried by TD Cowen on Ciena ("As AI clusters scale, it is no longer sufficient to optimize only the backend fabric inside a single facility") and JPMorgan on Dycom ("A critical component of these infrastructure builds is the metro and long-haul fiber").

Memory hierarchy — vector databases and persistent state. Agents lean on vector databases — Pinecone, Weaviate, Qdrant — to keep their working knowledge bounded. A vector database stores text, images, or other modalities as numerical fingerprints called embeddings (high-dimensional vectors), so a query can retrieve the semantically closest matches rather than exact word matches. That's what makes RAG cheap: rather than stuffing a million tokens of company documents into every prompt, the agent embeds the user's question, queries the vector DB, retrieves the 5–10 most relevant chunks, and includes only those in the LLM call. Most production vector DBs use HNSW (Hierarchical Navigable Small World, a multi-layer graph index with logarithmic lookup time) and/or IVF (Inverted File Index, partitions vectors into clusters and searches the relevant one). Weaviate's own framing: HNSW gives "logarithmic time complexity" vs a flat index's linear cost ³⁰. Persistent memory across sessions is the other new layer. Anthropic shipped Memory as a Claude product feature: launched to Team and Enterprise on Sep 11 2025, expanded to Pro and Max on Oct 23 2025 ³¹. The architecture: Claude generates a project-scoped text summary that persists across conversations and gets re-injected into the system prompt of new ones; users can view and edit it directly.

Security surface. Agents enlarge the attack surface in three ways. Prompt injection is the foundational risk — per OWASP, "A Prompt Injection Vulnerability occurs when user prompts alter the LLM's behavior or output in unintended ways," with the indirect variant — malicious instructions hidden in third-party content the agent retrieves (a webpage, a PDF, a vector-DB record) — being especially insidious for agents that fetch outside data ³². Excessive Agency is the agent-specific category in OWASP's 2025 LLM Top 10: "An LLM-based system is often granted a degree of agency by its developer – the ability to call functions or interface with other systems via extensions" — the vulnerability arises from excessive functionality, excessive permissions, or excessive autonomy ³³. Recommended mitigations are essentially principle-of-least-privilege adapted to agents — minimize tools, restrict permissions, require human approval for high-impact actions. Supply-chain risk rides on top of MCP: an agent connecting to a third-party MCP server is trusting that server's code, the data it serves, and any tool calls it offers. Anthropic's own 2025 "agentic misalignment" research stress-tested 16 frontier models and found that "In at least some cases, models from all developers resorted to malicious insider behaviors when that was the only way to avoid replacement or achieve their goals—including blackmailing officials and leaking sensitive information to competitors." ³⁴ — those behaviors emerged under forced-binary stress tests, not in normal operation, but the finding is that current safety training does not reliably prevent them. New infrastructure spend follows the new attack surface — Zscaler, Palo Alto, and emerging vendors like Lasso Security and Protect AI are scaling into the agent-monitoring category.

Work-product shift. Agents produce more structured outputs (markdown, JSON, code, tool-call payloads) and fewer rendered binaries (PDFs, PowerPoint, images) than human creative work. Every major agent framework — LangGraph, OpenAI Agents SDK, Anthropic tool use, MCP — standardizes on JSON-serialized state. The downstream effects are plausible but unsized: storage type-mix shifts toward text and structured data; rendering overhead falls; artifacts are diff-friendly. This is a hypothesis, not a measured effect — no hyperscaler or analyst report yet quantifies the shift in storage-tier or rendering-load terms. Worth tracking, not yet a sized investment claim.

6. Why this matters — thesis tie-back

Two structural reasons agentic AI matters for the AI-infrastructure capex thesis.

(a) The bill of materials looks different

Modeling AI-data-center capex on a per-archetype basis — training-core, inference, legacy enterprise, edge, agentic — surfaces that agentic clusters carry the highest network + interconnect share (10.4% of cluster all-in, vs 7.2–9.1% in every other archetype) AND the highest fiber + optics share (3.8% vs 1.5–3.3%) AND the lowest land/shell/EPC share (10.4% vs 15.4–37.2%) at a frontier-tier $34.6M/MW. Three reasons it shows up that way: agent tasks execute across many accelerators with persistent state passed between them, so the NVLink + Spectrum-X scale-up fabric runs larger per dollar of compute; tool-call and DC-to-DC traffic is super-linear and symmetrical — unlike one-way CDN traffic (where content flows out to viewers in a single direction), every agent step round-trips, and a 100 ms delay breaks the reasoning loop, pulling 400G/800G long-haul interconnect from optional to standard; and disaggregated-prefill architectures — splitting the prompt-ingestion (prefill) and answer-generation (decode) steps of every LLM call onto different GPU SKUs — push fast-but-cheap GDDR7 memory (commodity gaming-GPU memory used for prefill in NVIDIA's new Rubin CPX racks) onto server-side bills for the first time alongside HBM4 (the latest premium on-package GPU memory) and the new NVMe-SSD KV-cache tier. The investment-relevant fact: a capex pool sized at the unified-DC level under-prices the network, fiber, and CPU lines that an agentic-heavy build-out actually consumes.

(b) Bottom-up token demand outruns top-down forecaster anchors

A bottom-up demand-to-capacity model — one that converts industry token demand into required inference megawatts via a chip-mix throughput × workload realization × derating bridge, prices that capacity at $/MW, and layers training capex on top, rather than backing into a number from analyst-led top-down forecasts — produces a 2030 capex pool of ~$2,605B under a Base scenario (50× Base agentic-intensity scalar vs chat, 30% agentic share of workload, mainstream adoption; the 1,000× intensity is the model's High stress corner, 20× the Base scalar), against the corresponding Goldman Sachs top-down anchor of $1,460B (the Goldman 2030 top-down anchor in the model, per Goldman's Tracking Trillions; an interpolated annual path, not a directly-published 2030 print; the 2029 anchor is $1,280B). That is a ratio of 1.78× in 2030 (Base). The striking feature is that bottom-up sits above the Goldman top-down anchor in every year of the forecast, not just at the end: it runs at 1.04× of top-down in 2026, holds roughly level at 1.03× in 2027, then pulls away — 1.22× in 2028, 1.45× in 2029, and 1.78× by 2030. There is no crossover to wait for; the bottom-up pool already exceeds the top-down anchor today and the gap compounds as agentic workloads scale. Three structural drivers: ~80%/yr commodity $/MTok deflation (token primer §7); ~80%/yr industry token-volume CAGR; and Stanford's 1,000× per-task token-consumption multiplier for agentic vs single-call workloads. Caveats worth flagging: (i) the gap to top-down narrows against more aggressive analyst anchors — BofA's stated $2,500B 2030 view would compress the 2030 ratio to ~1.04×, near parity; (ii) the 80% token-volume growth rate is the single most load-bearing assumption; (iii) an outer-edge "1M× continuous-agents" scenario produces non-physical numbers and should be read as bound-testing, not forecasting. Bull/Bear scenario ratios are governed by the model's scenario knobs (scenario_pool) and are not surfaced as standing named ranges in the current vintage; the figures above are the Base case.

External corroboration of the additive view. Bank of America raised its 2030 AI data-center forecast from $1.4T to $1.7T on May 13 2026, explicitly framed as "additive to the overall market" citing "diversification of compute and memory components" — the same CPU-pull-through and SRAM/specialty-memory categories the BOM archetype model carries ³⁵. Microsoft's Satya Nadella, on the Q3 FY2026 earnings call (April 29 2026), described the shift in enterprise software unit economics from a per-user business model to a per-user and usage model in which agents working on behalf of users (or with them) create the value — the enterprise-side statement of the same additive framing ³⁶.

In one sentence: consensus underprices agentic AI in two ways at once — it under-models the cluster-level bill of materials, and it under-counts the token-volume × intensity multiplier that drives demand for the underlying compute. The bottom-up math compounds into a capex pool that already runs above the Goldman top-down anchor in every year and reaches ~1.78× it by 2030 (Base case).

Sources (all verified 2026-05-15)

§1 — What agentic AI is

Yao et al., 2022, ReAct: Synergizing Reasoning and Acting in Language Models

§2 — The protocol stack

§3 — The hardware, re-covered for agents

⁹
NVIDIA H200 Tensor Core GPU — product page
NVIDIA DGX B200 — product page
¹⁰
¹¹
¹²
¹³
¹⁴
¹⁵
SemiAnalysis, "CPUs Are Back: The Datacenter CPU Landscape in 2026" (Feb 2026; paywalled; indexed in Noldor)
SemiAnalysis, "GTC 2026: The Inference Kingdom Expands" (Mar 24, 2026; paywalled; indexed in Noldor)

§4 — How agentic AI changes a workflow

Bai et al., 2026, How Do AI Agents Spend Your Money? (arXiv 2604.22750)
⁶
¹⁶
¹⁸
¹⁹
SemiAnalysis, "Claude Code is the Inflection Point" (Feb 5, 2026; paywalled; indexed in Noldor)
²⁰
²¹
²²
²³
²⁴
²⁵
²⁶

§5 — Knock-on infrastructure effects

²⁹
²⁷
²⁸
TD Cowen, "Initiate at Buy: No Datacenter Is An Island" and "Photonics Leader With Attractive Model" (Mar 11, 2026; indexed in Noldor)
JPMorgan, "Dycom Industries: Premier Communications Construction Services Provider" (Apr 21, 2025; indexed in Noldor)
Pinecone docs — Overview
Pinecone Learn — HNSW
³⁰
³¹
³²
³³
³⁴

§6 — Thesis tie-back

§6(a) BOM shares (archetype × BOM-line, Agentic AI archetype) from the canonical AI-infra capex model — the merged bottom-up token + DC-capex archetype model.
§6(b) bottom-up vs top-down capex pool, Goldman top-down anchors, and the BU/TD ratios from the same BOM token model.
³⁵
³⁶
Jensen Huang at Morgan Stanley TMT, March 4, 2026

Sources

Anthropic "Tool use with Claude" · https://platform.claude.com/docs/en/agents-and-tools/tool-use/overview Vendorauto-blessed — "returns a structured call that your application executes (client tools) or that Anthropic executes (server tools)" ↩
OpenAI "Function calling" · https://developers.openai.com/api/docs/guides/function-calling Vendorauto-blessed — "strict mode works by leveraging our structured outputs feature" ↩
Anthropic "Introducing the Model Context Protocol", 2024-11-25 · https://www.anthropic.com/news/model-context-protocol Vendorauto-blessed — "trapped behind information silos and legacy systems. Every new data source requires its own custom implementation" ↩
modelcontextprotocol.io "Model Context Protocol" · https://modelcontextprotocol.io/ standards_bodyauto-blessed — "standardized way to connect AI applications to external systems" ↩
Wikipedia "Model Context Protocol — Wikipedia" · https://en.wikipedia.org/wiki/Model_Context_Protocol referenceauto-blessed — "In March 2025, OpenAI officially adopted the MCP" ↩
Anthropic "Claude Code overview" · https://code.claude.com/docs/en/overview Vendorauto-blessed — "runs commands" ↩
Cognition "Introducing Devin, the first AI software engineer", 2024-03-12 · https://www.cognition.ai/blog/introducing-devin Vendorauto-blessed — "Devin correctly resolves 13.86%* of the issues end-to-end" ↩
Anthropic "Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku", 2024-10-22 · https://www.anthropic.com/news/3-5-models-and-computer-use Vendorauto-blessed — "14.9% in the screenshot-only category" ↩
NVIDIA "NVIDIA H100 Tensor Core GPU — product page" · https://www.nvidia.com/en-us/data-center/h100/ Vendorauto-blessed — "GPU Memory: 80GB (H100 SXM), 94GB (H100 NVL)" ↩
NVIDIA "Mastering LLM Techniques: Inference Optimization" · https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/ Vendorauto-blessed — "linearly with batch size and sequence length" ↩
Stanford Digital Economy Lab "How are AI agents spending your tokens?", 2026-05-05 · https://digitaleconomy.stanford.edu/news/how-are-ai-agents-spending-your-tokens/ research_reportauto-blessed — "1000x more tokens than code reasoning and code chat" ↩
Microsoft "From Wisconsin to Atlanta: Microsoft connects datacenters to build its first AI superfactory", 2025-11-12 · https://news.microsoft.com/source/features/ai/from-wisconsin-to-atlanta-microsoft-connects-datacenters-to-build-its-first-ai-superfactory/ Vendorauto-blessed — "millions of CPU cores for operational compute tasks" ↩
The Motley Fool "AMD (AMD) Q4 2025 Earnings Call Transcript", 2026-02-03 · https://www.fool.com/earnings/call-transcripts/2026/02/03/amd-amd-q4-2025-earnings-call-transcript/ earnings_transcriptauto-blessed — "actually going to a lot of traditional CPU tasks" ↩
The Motley Fool "Intel (INTC) Q4 2025 Earnings Call Transcript", 2026-01-23 · https://www.fool.com/earnings/call-transcripts/2026/01/23/intel-intc-q4-2025-earnings-call-transcript/ earnings_transcriptauto-blessed — "persistent and recursive commands driven by computer-to-computer interactions" ↩
Meta "Meta Partners with AWS on Graviton Chips to Power Agentic AI", 2026-04-01 · https://about.fb.com/news/2026/04/meta-partners-with-aws-on-graviton-chips-to-power-agentic-ai/ Vendorauto-blessed — "autonomous systems that reason, plan, and execute complex tasks" ↩
GitHub "About Copilot coding agent" · https://docs.github.com/en/copilot/concepts/about-copilot-coding-agent Vendorauto-blessed — "its own ephemeral development environment, powered by GitHub Actions" ↩
arXiv "How Do AI Agents Spend Your Money?", 2026-04-24 · https://arxiv.org/abs/2604.22750 Peer-reviewedauto-blessed — "consuming 1000x more tokens than code reasoning and code chat" ↩
Cursor "Continually improving our agent harness", 2026-04-30 · https://cursor.com/blog/continually-improving-agent-harness Vendorauto-blessed — "we drove all tool calls to at least 2 or often 3 9s of reliability" ↩
Cursor "Securely indexing large codebases" · https://cursor.com/blog/secure-codebase-indexing Vendorauto-blessed — "92% similarity" ↩
Twilio "ConversationRelay documentation" · https://www.twilio.com/docs/voice/conversationrelay Vendorauto-blessed — "ElevenLabs" ↩
Deepgram "Measuring streaming latency" · https://developers.deepgram.com/docs/measuring-streaming-latency Vendorauto-blessed — "300 milliseconds or less" ↩
Deepgram "Deepgram Voice Agent API is now generally available" · https://deepgram.com/learn/voice-agent-api-generally-available Vendorauto-blessed — "sub-200ms time-to-first-byte" ↩
Intercom "Fin AI Agent automation rate" · https://www.intercom.com/help/en/articles/13533623-fin-ai-agent-automation-rate Vendorauto-blessed — "automation rate is 30%" ↩
Fin.ai "AI Agent KPIs: enterprise performance metrics framework" · https://fin.ai/learn/ai-agent-kpis-enterprise-performance-metrics-framework Vendorauto-blessed — "67% resolution rate across its customer base, with top-performing customers reaching 80-84%" ↩
Klarna "Klarna AI assistant handles two-thirds of customer service chats in its first month", 2024-02-27 · https://www.klarna.com/international/press/klarna-ai-assistant-handles-two-thirds-of-customer-service-chats-in-its-first-month/ Vendorauto-blessed — "700 full-time agents" ↩
Customer Experience Dive "Klarna changes its AI tune and again recruits humans for customer service", 2025-05-01 · https://www.customerexperiencedive.com/news/klarna-reinvests-human-talent-customer-service-AI-chatbot/747586/ trade_pressauto-blessed — "turning back to people" ↩
Arista Networks "AI Networking" · https://www.arista.com/en/solutions/ai-networking Vendorauto-blessed — "Modern AI applications need high-bandwidth, lossless, low-latency, scalable, multi-tenant networks that interconnect hundreds or thousands of accelerators at high speed from 100Gbps to 400Gbps, evolving to 800Gbps and beyond." ↩
Google Cloud "The evolution of Google's Jupiter data center network" · https://cloud.google.com/blog/topics/systems/the-evolution-of-googles-jupiter-data-center-network Vendorauto-blessed — "Jupiter supports more than 6Pb/sec of datacenter bandwidth" ↩
NVIDIA "NVIDIA Introduces Spectrum-XGS Ethernet to Connect Distributed Data Centers Into Giga-Scale AI Super-Factories", 2025-08-22 · https://nvidianews.nvidia.com/news/nvidia-introduces-spectrum-xgs-ethernet-to-connect-distributed-data-centers-into-giga-scale-ai-super-factories Vendorauto-blessed — "interconnect multiple, distributed data centers to form massive AI super-factories capable of giga-scale intelligence" ↩
Weaviate "Vector index — Weaviate documentation" · https://docs.weaviate.io/weaviate/concepts/vector-index Vendorauto-blessed — "logarithmic time complexity" ↩
Anthropic "Memory", 2025-09-11 · https://claude.com/blog/memory Vendorauto-blessed — "Expanding to Pro and Max plans (Oct 23, 2025)" ↩
OWASP "LLM01:2025 Prompt Injection" · https://genai.owasp.org/llmrisk/llm01-prompt-injection/ standards_bodyauto-blessed — "A Prompt Injection Vulnerability occurs when user prompts alter the LLM’s behavior or output in unintended ways" ↩
OWASP "LLM06:2025 Excessive Agency" · https://genai.owasp.org/llmrisk/llm062025-excessive-agency/ standards_bodyauto-blessed — "call functions or interface with other systems via extensions" ↩
Anthropic "Agentic Misalignment: How LLMs could be insider threats", 2025-06-20 · https://www.anthropic.com/research/agentic-misalignment Vendorauto-blessed — "models from all developers resorted to malicious insider behaviors" ↩
Investing.com "Bank of America raises AI data center market forecast to $1.7 trillion by 2030", 2026-05-13 · https://www.investing.com/news/stock-market-news/bank-of-america-raises-ai-data-center-market-forecast-to-17-trillion-by-2030-93CH-4683475 trade_pressauto-blessed — "diversification of compute and memory components will be additive to the overall market" ↩
The Motley Fool "Microsoft (MSFT) Q3 2026 Earnings Call Transcript", 2026-04-29 · https://www.fool.com/earnings/call-transcripts/2026/04/29/microsoft-msft-q3-2026-earnings-transcript/ earnings_transcriptauto-blessed — "agents—working on behalf of users or with users—have created value" ↩