M12E1: The Memory Wall, GPU Architecture, and Compute Geopolitics
Module 12, Episode 1: The Memory Wall, GPU Architecture, and Compute Geopolitics
Bandwidth Beats FLOP Count
Every serious conversation about AI infrastructure eventually runs into a number that seems wrong at first. The NVIDIA H100 SXM is rated at roughly 989 teraFLOPS of FP16 compute — that is, floating-point operations per second, the standard measure of raw arithmetic throughput. Nearly one quadrillion floating-point operations per second in a single card. And yet, when you run LLM inference on an H100, the relevant spec is the 3.35 terabytes per second of bandwidth delivered by its HBM3 (high-bandwidth memory). That number, smaller, less dramatic, buried in the third row of most spec sheets, determines how many tokens per second your model produces.
Understanding why requires the concept of arithmetic intensity: the ratio of floating-point operations performed to bytes of data moved from memory. Every workload sits somewhere on this ratio. High arithmetic intensity means the GPU's compute units stay busy relative to the rate at which data is being fetched — the chip is compute-bound. Low arithmetic intensity means the compute units are starved, waiting for data to arrive from memory — the chip is memory-bound. The location of a workload on this spectrum determines which hardware characteristic matters.
For LLM training, arithmetic intensity is high. During a forward pass over a batch of sequences, the fundamental operation is a matrix-matrix multiplication: the weight matrix multiplies against a batch of activations. Because the batch dimension can be large, the same weight parameters are reused across many different input vectors in a single kernel call. That reuse drives up arithmetic intensity — the bytes loaded to fetch the weight matrix are amortized across many multiply-accumulate operations. Training a 70B model on a batch of 512 sequences is a compute-efficient activity.
Inference at deployment time is structurally different. During autoregressive decoding — the process by which the model generates one token at a time, each depending on the last — the operation at each step is a matrix-vector multiplication: the weight matrix multiplies against a single vector representing the current token state. The batch dimension collapses to one, or in practice to whatever batch size the serving system can assemble before latency requirements force an answer. Every loaded element of the cached key matrix, stored in BF16 (brain float 16, a compact 16-bit numerical format), performs one multiply-accumulate with the single-token query element already in registers, yielding a 1:1 FLOP-to-byte ratio. That intensity sits far below the ridge point of an H100 SXM's dense BF16 roofline: 989 teraFLOPS of theoretical FP16 compute against 3.35 TB/s of bandwidth works out to roughly 295 FLOPs per byte, while single-token inference delivers one FLOP per byte. The tensor cores — the specialized matrix arithmetic units on the chip — are idle 99.7% of the time, waiting for memory to deliver the next weight.
This is a consequence of the autoregressive decoding algorithm itself, not a fixable inefficiency. The model cannot produce token N+1 until it knows token N, so parallelism across the output sequence is unavailable. Increasing the batch size helps — serving many concurrent users simultaneously raises arithmetic intensity because the same weights are now multiplied against multiple query vectors before being evicted from cache — but latency requirements cap how large batches can grow before users wait unacceptably long.
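The crossover can be made concrete with a few lines of arithmetic. The sketch below uses the spec-sheet numbers quoted above (989 teraFLOPS dense BF16, 3.35 TB/s of HBM bandwidth) and a deliberately simplified traffic model that counts only weight bytes; it illustrates the roofline logic rather than predicting real kernel performance.

```python
# Roofline sketch for the H100 SXM figures quoted above.
# Simplification: only weight bytes are counted, so this illustrates the
# scaling argument rather than modeling a real kernel.

PEAK_FLOPS = 989e12        # dense BF16/FP16 tensor-core throughput, FLOP/s
HBM_BANDWIDTH = 3.35e12    # HBM3 bandwidth, bytes/s
BYTES_PER_PARAM = 2        # BF16

# Ridge point: the arithmetic intensity at which a workload stops being
# memory-bound and becomes compute-bound.
ridge = PEAK_FLOPS / HBM_BANDWIDTH   # ~295 FLOPs per byte

def weight_reuse_intensity(batch_size: int) -> float:
    """FLOPs per weight byte when one weight matrix is applied to a batch
    of token vectors: each 2-byte BF16 weight does one multiply-accumulate
    (2 FLOPs) per sequence in the batch."""
    return 2 * batch_size / BYTES_PER_PARAM

for batch in (1, 8, 64, 512):
    intensity = weight_reuse_intensity(batch)
    busy = min(1.0, intensity / ridge)
    print(f"batch {batch:3d}: {intensity:5.0f} FLOPs/byte, "
          f"tensor cores at most {busy:.1%} busy")

# batch 1 sits at ~1 FLOP/byte against a ~295 FLOP/byte ridge point:
# memory-bound, with the arithmetic units idle ~99.7% of the time.
```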
HBM3 pushes bandwidth to 3.35 TB/s, but for memory-bound inference that number is the ceiling, not a cure. Tokens per second per GPU, at typical serving batch sizes, tracks almost linearly with memory bandwidth. Doubling the FLOP count on an otherwise identical chip changes inference throughput by almost nothing. Doubling the memory bandwidth doubles the token rate. The memory wall is a physical constraint that restructures which hardware investment makes economic sense.
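The same logic gives a quick way to bound decode throughput from the spec sheet alone. The numbers below are illustrative assumptions (a 70B-parameter model quantized to FP8, a modest KV cache per sequence), not measurements, and the estimate ignores capacity limits and kernel overheads.

```python
# Bandwidth-bound decode estimate: each generated token requires streaming
# the active weights through HBM once, plus each sequence's KV cache.
# Illustrative assumptions only; ignores capacity limits and overheads.

def decode_tokens_per_sec(weight_bytes, kv_bytes_per_seq, hbm_bytes_per_sec,
                          batch_size=1):
    bytes_per_step = weight_bytes + batch_size * kv_bytes_per_seq
    steps_per_sec = hbm_bytes_per_sec / bytes_per_step
    return steps_per_sec * batch_size   # one new token per sequence per step

weights = 70e9 * 1      # ~70 GB: a 70B model at FP8 (assumption)
kv_per_seq = 2.5e9      # ~2.5 GB of KV cache per sequence (assumption)

for name, bw in (("H100, 3.35 TB/s", 3.35e12), ("H200, 4.8 TB/s", 4.8e12)):
    tps = decode_tokens_per_sec(weights, kv_per_seq, bw, batch_size=4)
    print(f"{name}: ~{tps:.0f} tokens/s at batch 4")

# The FLOP rating never appears in this estimate; only bandwidth does.
```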
Flash Attention and the KV Cache: Algorithmic Responses to a Physical Limit
When the FlashAttention paper appeared in 2022 from Tri Dao, Dan Fu, and collaborators at Stanford, its core contribution was to reorder the attention computation to minimize round trips between HBM and the GPU's on-chip SRAM — static random-access memory, the fast but small scratchpad that sits directly on the chip. Standard attention materializes the full N×N attention matrix in HBM — reading keys and queries, writing the attention score matrix, reading it back to apply softmax, then multiplying by values. For long sequences, this pattern of repeated HBM reads and writes becomes the dominant cost, not the floating-point operations themselves. FlashAttention tiles the computation into blocks that fit in SRAM, avoids materializing the full attention matrix in HBM at all, and produces exact results with no approximation. The original paper reported roughly a 15% end-to-end wall-clock training speedup on BERT-large and about 3x on GPT-2, with outputs identical to standard attention because the computation is exact.
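The reordering itself is compact enough to sketch. The NumPy fragment below processes keys and values block by block with a running ("online") softmax, so the full set of attention scores is never held at once; it is a single-query, single-head illustration of the idea, not the fused CUDA kernel, and the names and block size are arbitrary.

```python
import numpy as np

def tiled_attention(q, K, V, block=128):
    """Single-query attention computed block by block with a running softmax,
    the same reordering FlashAttention uses so that the full set of attention
    scores is never materialized at once. Pure-NumPy illustration."""
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    m = -np.inf            # running maximum of scores seen so far
    l = 0.0                # running sum of exp(score - m)
    acc = np.zeros(d)      # running weighted sum of value rows
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = (Kb @ q) * scale               # scores for this block only
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)     # rescale earlier partial results
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ Vb
        m = m_new
    return acc / l

# Sanity check against the naive version that materializes every score.
rng = np.random.default_rng(0)
N, d = 1000, 64
q, K, V = rng.normal(size=d), rng.normal(size=(N, d)), rng.normal(size=(N, d))
scores = (K @ q) / np.sqrt(d)
weights = np.exp(scores - scores.max())
naive = (weights / weights.sum()) @ V
assert np.allclose(tiled_attention(q, K, V), naive)
```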
FlashAttention and KV caching — where KV stands for key-value, the intermediate computations the model stores to avoid redundant work — solve orthogonal bottlenecks. KV caching skips redundant work; FlashAttention makes the remaining work faster and more memory-efficient. The KV cache avoids recomputing key and value projections for tokens already seen. FlashAttention ensures that attention computation over those cached keys and values wastes as few HBM bandwidth cycles as possible. Modern inference stacks require both.
The KV cache creates its own memory accounting problem that becomes severe at long contexts. Its size scales as a product of context length, number of attention layers, number of KV heads, and head dimension. For a model like LLaMA-3 70B — Meta's open-weight large language model — running in BF16 with grouped query attention, the KV cache for a single sequence at 128K context length occupies tens of gigabytes. As both context lengths and model sizes scale up, the KV cache footprint expands substantially, leading to reduced throughput and elevated latencies. At 200K token contexts — now routine for models like Gemini 2.5 and Claude 4 — the KV cache for a single request can exhaust a substantial fraction of an H100's 80GB of HBM before model weights are accounted for.
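That accounting is easy to reproduce. The snippet below uses the published LLaMA-3 70B shape (80 layers, 8 KV heads under grouped query attention, head dimension 128) and 2-byte BF16 cache entries; the figures are approximate and ignore per-framework overheads.

```python
# KV cache footprint: 2 (keys and values) x context length x layers
# x KV heads x head dimension x bytes per element. Approximate figures.

def kv_cache_bytes(context_len, n_layers=80, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    return 2 * context_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

print(f"per token: ~{kv_cache_bytes(1) / 1024:.0f} KB")        # ~320 KB

for ctx in (8_192, 128_000, 200_000):
    print(f"{ctx:>7} tokens: ~{kv_cache_bytes(ctx) / 1e9:.0f} GB per sequence")

# With all 64 query heads cached instead of 8 GQA heads, every figure above
# would be 8x larger, which is the pressure GQA exists to relieve.
```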
This creates a direct tension: the same memory that holds the KV cache determines how large a model you can fit and how many concurrent users you can serve. The attention mechanism contributes up to 80% of overall latency at context lengths above 80K tokens. Long-context inference is a qualitatively different regime where memory capacity, not just memory bandwidth, becomes binding. Every token in a 200K context window must have its keys and values resident in HBM for attention to proceed.
In long contexts, the cost of inference is dominated by reading the KV cache from memory for every generated token. Because those reads are local HBM traffic rather than network communication, long-context inference performance is dominated almost entirely by memory bandwidth. An organization serving models at long contexts needs to optimize for memory capacity and bandwidth simultaneously. The H100's 80GB HBM3 at 3.35 TB/s is a specific point in that tradeoff space. The H200's 141GB HBM3e at 4.8 TB/s is a different point. The choice between them at long contexts is driven almost entirely by this accounting.
FlashAttention shows that the memory wall generates algorithmic pressure — the pressure to keep all necessary data closer to compute, to fuse operations that would otherwise require multiple HBM round trips, to redesign attention mechanisms around hardware arithmetic intensity rather than mathematical elegance. Techniques like grouped query attention, or GQA, which reduces the number of KV heads while preserving most model quality, are direct responses to KV cache pressure. The original multi-query attention paper and subsequent GQA work emerged from exactly this constraint. The hardware limit shapes the algorithm.
The H100 to GB200 Arc: Three Years of Hardware Trajectory
The H100 SXM, introduced in 2022 on TSMC's 4N process node, delivered 3.35 TB/s of HBM3 bandwidth across 80GB of memory — roughly 1.6 to 2 times the A100's memory bandwidth, depending on the A100 variant. The architectural addition that most mattered for LLMs was the Transformer Engine: a hardware unit that dynamically switches between FP8 and FP16 precision, roughly doubling effective throughput for matrix operations in transformer blocks without requiring changes to model weights, which NVIDIA credited with up to 4x faster GPT-3 training and up to 30x faster inference compared with the A100.
The H200 was an incremental step — same Hopper die, upgraded memory. The H200 carries 141GB HBM3e with approximately 4.8 TB/s memory bandwidth versus 3.35 TB/s on the H100 SXM. The FLOP count was identical. For workloads that fit comfortably in 80GB, the H100 and H200 perform nearly identically. For long-context inference or very large models, the H200's expanded capacity matters substantially.
The Blackwell generation represented a more fundamental redesign. The B200 GPU, built on TSMC's 4NP process, substantially increased low-precision compute throughput through fifth-generation Tensor Cores and a second-generation Transformer Engine supporting FP4 with microscaling formats — a technique that compresses numerical representations to squeeze more arithmetic operations per memory byte. For dense models like Llama 3.3 70B, NVIDIA reports that Blackwell B200 delivers over 10,000 tokens per second per GPU at 50 tokens per second per user, roughly 4x the per-GPU throughput of the H200. That 4x improvement reflects both higher memory bandwidth in the B200 and aggressive software optimization of serving stacks — TensorRT-LLM, vLLM, and SGLang (all inference serving frameworks) saw major Blackwell-specific kernel improvements through 2025.
The most architecturally significant Blackwell product is not the individual B200. It is the GB200 NVL72. The GB200 NVL72 connects 36 Grace CPUs and 72 Blackwell GPUs in a rack-scale, liquid-cooled design, with a 72-GPU NVLink domain — NVLink being NVIDIA's proprietary high-speed chip interconnect — that acts as a single massive GPU, delivering 30x faster real-time trillion-parameter LLM inference and 10x greater performance for mixture-of-experts architectures, a model design where only a fraction of the network activates for any given input.
The architectural departure here is significant. Previous multi-GPU configurations communicated over InfiniBand or NVLink between separate servers, each maintaining its own CPU and PCIe interconnect. The NVL72 eliminates the intra-rack network topology problem by treating 72 GPUs as a single compute domain. NVLink Switch System provides 130 terabytes per second of low-latency GPU communications. At 130 TB/s of intra-rack bandwidth, the KV cache access patterns that bottleneck distributed inference across slower networks become dramatically more tractable.
What 13.4 TB of unified GPU memory enables is running a 671B-parameter model entirely within one rack. DeepSeek R1 — the Chinese lab's open-weight reasoning model released in early 2025 — at FP8 precision requires roughly 700-750GB for weights and runtime buffers. The NVL72 can hold a model of that scale within a single NVLink domain, eliminating the inter-node communication overhead that fragments inference performance when models are distributed across multiple InfiniBand-connected servers.
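A quick check of that accounting, with an assumed ~10% overhead for runtime buffers:

```python
# Weight footprint for a 671B-parameter model at FP8 (1 byte per parameter),
# plus an assumed ~10% for runtime buffers, scales, and fragmentation.
weights_gb = 671e9 * 1 / 1e9          # ~671 GB
total_gb = weights_gb * 1.10          # ~740 GB, in the 700-750 GB range quoted
print(f"~{total_gb:.0f} GB: exceeds an 8x H100 node ({8 * 80} GB of HBM), "
      f"fits comfortably inside one NVL72 domain (~13,400 GB)")
```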
The economics are correspondingly aggressive. The rack-scale server alone costs approximately $3.1M for a typical hyperscaler, with all-in cost reaching about $3.9M per rack. Infrastructure requirements are severe: GB200 requires 480V power, direct liquid cooling, and $5-10M per megawatt retrofit cost; the NVL72 requires 2,000-plus cables per rack with exact routing specifications. The decision to build NVL72 clusters is a capital commitment that requires multi-year planning, purpose-built facilities, and dedicated operational teams.
The claimed 30x inference throughput improvement over H100 compares the NVL72 running 1.8 trillion parameter mixture-of-experts models against H100 clusters connected over InfiniBand, at specific latency targets. For smaller, well-tuned models on H100, the improvement is more modest. The gains are real but concentrated in exactly the regime — very large models, long contexts, high concurrency — where the NVL72's architectural advantages most directly address the bottlenecks described above.
Through mid-2025, no large-scale training runs had been completed on GB200 NVL72, as software continued to mature and reliability challenges were being worked through. NVIDIA's H100 and H200 as well as Google TPUs — tensor processing units, Google's custom AI accelerators — remained the only systems successfully used to complete frontier-scale training. By late 2025, CoreWeave and Azure had deployed NVL72 clusters at production scale for inference workloads, and the software situation had improved materially.
The Alternative Hardware Landscape: TPUs, Inferentia, and the CUDA Gravity Well
NVIDIA's architectural dominance is real but not uncontested. The alternative hardware ecosystem has matured substantially, particularly for inference workloads.
Google's TPU program represents the most credible long-running alternative. For TPU v4 and v5, compute throughput was much lower than the NVIDIA flagship of the time. TPU v6 Trillium came very close to the H100/H200 on FLOPs but arrived two years after the H100. Google designed TPU v5e explicitly for inference economics rather than peak training performance: it delivers up to 2x higher training performance per dollar and up to 2.5x inference performance per dollar for LLMs compared to Cloud TPU v4, at less than half the cost. For Google's internal workloads — Gemini inference serving billions of daily requests — the economics of purpose-built inference silicon are compelling in a way that general-purpose GPU pricing cannot match.
Google's TPU v7 Ironwood, announced in late 2025, represented a more aggressive inference-focused design. Each Ironwood chip boasts 4,614 teraFLOPS of compute, 192GB of HBM memory, and 7.2 TB/s of memory bandwidth. Scale that to a 9,216-chip pod and you reach 42.5 exaflops of peak compute. The 7.2 TB/s per chip is more than double the H100's 3.35 TB/s, aimed directly at the inference bottleneck. Google's TPU architecture addresses the inference memory wall by hosting massive KV caches entirely on-silicon, using expanded on-chip SRAM, combined with a SparseCore engine to offload communication tasks and reduce core idle time.
The constraint on TPU adoption is software ecosystem depth, not hardware capability. NVIDIA's CUDA — Compute Unified Device Architecture — platform has nearly two decades of libraries, tooling, and community knowledge. PyTorch, which runs natively on CUDA, is the default framework for the vast majority of AI development. An NVIDIA GPU runs virtually any framework or programming model — PyTorch, TensorFlow, JAX, OpenCL, raw CUDA — giving developers considerable flexibility. A TPU demands TensorFlow, JAX, or XLA, trading breadth for depth and optimization. Most model development happens on GPU clusters, and porting to TPU requires meaningful engineering investment. Google has been steadily reducing this friction — adding vLLM and SGLang support for TPU v5p and v6 — but the gap remains.
AWS Inferentia2 takes a different approach: purpose-built inference silicon available through standard EC2 instances, optimized for high-throughput, low-latency serving. The Inferentia2 generation delivers 4x higher throughput and 10x lower latency than Inferentia1, with up to 70% cost reduction per inference compared to GPU-based alternatives. For organizations already embedded in AWS, Inferentia offers a path to inference cost reduction without the architectural replumbing that TPU adoption requires.
The strategic picture emerging from this landscape is a bifurcation between training and inference hardware. Google TPUs and H100/H200 remain the primary vehicles for frontier-scale training. Inference is diversifying: H100 clusters at general-purpose cloud providers, TPUs for Google's own models and selected external workloads, Inferentia for AWS-native serving, and NVL72 racks for organizations serving the very largest models at scale. The market is not winner-take-all at the inference layer, which creates genuine procurement optionality that did not exist in 2022.
Export Controls, the Controlled Supply Chain, and China's Hardware Response
In 2022, the U.S. Department of Commerce's Bureau of Industry and Security announced sweeping new export rules explicitly targeting AI chips, banning the export to China of top-tier GPUs such as the NVIDIA A100 and H100, as well as similar chips from AMD. The initial 2022 framework captured chips that combined high aggregate compute performance with chip-to-chip interconnect bandwidth of 600 GB/s or more.
NVIDIA's response showed the precise nature of the controls: rather than lose the Chinese market entirely, the company engineered compliant variants. The A800 was an A100 with its NVLink interconnect bandwidth reduced to 400 GB/s from 600 GB/s. When the H100 was banned, the China-specific H800 was introduced with approximately 300 GB/s of interconnect bandwidth compared to 900 GB/s on the original H100. These modified chips also had slightly lower throughput and memory speeds through firmware and clock adjustments.
The October 2023 update closed these gaps. The Bureau of Industry and Security added controls on performance density as well as total processing performance, and added Chinese entities involved in advanced computing and AI to the Entity List — the U.S. government's roster of organizations subject to strict export licensing requirements. Those rules were broadened to cover any processors built on the same architectures, forcing U.S. vendors to either apply for special licenses or deliberately throttle performance before shipping. The H800 was blocked immediately — on October 23, 2023, the U.S. government notified NVIDIA to halt H800 exports.
What emerged from the October 2023 controls was the H20: a chip designed from the outset to comply with the new thresholds, with substantially reduced compute capability but full HBM3 memory bandwidth. The H20 has very high memory and network bandwidth relative to its arithmetic power, because these variables were not subject to export limitations and were largely carried over from the H200. This produced a paradoxical result: the H20, while dramatically weaker than the H100 for training, was faster than the H100 for inference at long contexts — because long-context inference is bandwidth-bound, and the H20's bandwidth was not throttled.
The controls have not been static. The second Trump administration, beginning in early 2025, added 42 Chinese entities to the Entity List in March 2025 and another 23 in September 2025, and required NVIDIA to apply for a license to sell its H20 GPU in China. After a period of uncertainty, NVIDIA's H20 GPU resumed sales to China in mid-July 2025 as part of broader trade negotiations — showing the volatility of the regulatory environment for any organization building a China-facing AI supply chain.
China's domestic response has been led by Huawei's Ascend series. The Ascend 910B and 910C are the most capable domestically produced AI accelerators, but their gap from NVIDIA hardware is significant and structural. Although the Ascend 910C approaches the specifications of the H100 on paper, the H100 outperforms the Ascend 910C by 60 percent in real-world performance, and China cannot produce enough 910Cs to meet domestic demand. That performance gap is driven by process node constraints: SMIC — China's Semiconductor Manufacturing International Corporation — cannot currently produce chips at nodes more advanced than 7nm, given U.S. and allied export controls on production tools. The last NVIDIA chip made at 7nm was the A100, released in 2020.
The production capacity problem compounds the performance gap. SemiAnalysis — an independent semiconductor research firm — estimated that Huawei could produce as many as 1.5 million AI chip dies in 2025, but would complete only 200,000-300,000 chips due to a shortage of high-bandwidth memory, which the United States export-controlled in December 2024. HBM is a chokepoint: produced primarily by SK Hynix, Samsung, and Micron, all subject to U.S. jurisdiction. Without access to cutting-edge HBM, Huawei's chips cannot reach the memory bandwidth levels that modern LLM inference demands. The December 2024 controls adopted, for the first time, country-wide restrictions on the export of advanced HBM to China.
Huawei has attempted to compensate through interconnect scale rather than individual chip performance. Its CloudMatrix 384 interconnects 384 units of Ascend 910C, delivering aggregate compute that competes with NVIDIA's rack-scale systems. Whether aggregate cluster performance can compensate for per-chip bandwidth deficits in LLM inference — which is inherently memory-bandwidth-bound at the per-chip level — remains an open empirical question; published benchmarks on Huawei hardware at LLM workloads are not yet comprehensive.
The performance gap between the best U.S. and Chinese AI chips is significant and widening. The best U.S. AI chips are currently about five times more powerful than the best Chinese AI chips, measured by total processing performance, and on current roadmaps NVIDIA's best AI chips are projected to be seventeen times more powerful than Huawei's best by the second half of 2027. That widening trajectory reflects the compounding effect of access to advanced process nodes, HBM supply, and CUDA software optimization — none of which China can close rapidly.
Two Capability Trajectories: The Bifurcated Market
The practical consequence of the export control regime is that the global AI hardware market has split into two distinct capability trajectories, and organizations on one trajectory are not competing on equal terms with organizations on the other.
Organizations with access to H100, H200, and Blackwell-generation hardware — located in Tier 1 jurisdictions under the January 2025 AI Diffusion Rule framework, or operating through major cloud providers in those jurisdictions — are running on a capability arc that by 2026 includes GB200 NVL72 rack-scale inference, Blackwell Ultra training, and the full depth of the CUDA software ecosystem. Roughly 75 percent of the chips powering AI model training in Chinese data centers still run on NVIDIA's CUDA platform, with the company having shipped more than a million export-compliant H20 chips to China since late 2024. The H20 provides a floor of capability — adequate for inference serving and for distillation-based approaches that avoid frontier-model training compute requirements — but it is not a path to training new frontier-scale models.
The DeepSeek R1 release in early 2025 demonstrated what is possible when sophisticated algorithmic work is applied to constrained hardware. Unable to access NVIDIA's H100-class GPUs, DeepSeek redesigned its model architecture and optimized training efficiency to achieve near-frontier performance on a cluster of a few thousand less-capable H800 chips — hardware engineered to fall below the U.S. export-control threshold. The GRPO reinforcement learning technique that powered R1's reasoning capabilities, the multi-head latent attention mechanism that reduced KV cache pressure, and the aggressive use of mixture-of-experts architectures that allowed a large parameter count at low inference cost — these were not concessions to constrained hardware. They were innovations that emerged from it. Those techniques are now being adopted by labs with unrestricted hardware access.
As analyst Dawani observed: "Chinese researchers are learning to get more out of less. Once those techniques mature, they travel quickly across the ecosystem and reduce the strategic value of raw compute supremacy." The underlying asymmetry remains. Algorithmic efficiency gains travel across borders. Hardware capability differences do not. A lab with GB200 NVL72 access can run DeepSeek's algorithmic innovations on dramatically superior hardware, compounding rather than negating the hardware advantage.
The CUDA ecosystem remains unmatched in breadth and maturity. Huawei's CANN architecture — Compute Architecture for Neural Networks — while competitive in specific applications, lacks CUDA's breadth. Even as Huawei gains in hardware, NVIDIA retains a critical edge in software infrastructure. CUDA is not just a programming interface — it is nearly two decades of numerical computing libraries, inference optimization frameworks, profiling tools, and community knowledge. An organization moving to Huawei hardware faces re-engineering costs that dwarf the hardware procurement decision itself.
Across the broader hardware landscape, the market for inference compute is now served by multiple distinct ecosystems: NVIDIA Hopper/Blackwell for general-purpose AI, Google TPU for workloads within the Google Cloud orbit, AWS Inferentia/Trainium for AWS-native serving at scale, and a Chinese domestic stack anchored by Huawei Ascend for organizations in or dependent on Chinese supply chains. Each represents not just different hardware but different software environments, different optimization toolchains, and different limits on what models can be deployed.
The bifurcation is more granular than "H100 access" versus "no H100 access." Organizations in Tier 2 jurisdictions — most of Southeast Asia, the Middle East, Eastern Europe, Latin America — face country-level caps on GPU procurement that slow infrastructure scale-up even when individual chips are not prohibited. The October 2023 framework established total processing performance thresholds that effectively blocked A100, H100, MI300X, and even some gaming GPUs like the RTX 4090 from restricted markets. A sovereign wealth fund or national research lab in a Tier 2 country building out AI infrastructure is not operating in the same market as a hyperscaler in the United States or Germany, and no amount of procurement sophistication fully bridges that gap under current controls.
What a CAIO Actually Needs to Know
The memory wall, the hardware trajectory, and the export control regime are connected causally, not coincidentally. The memory wall drives hardware investment decisions — why HBM capacity and bandwidth command premium pricing, why GB200 NVL72 racks cost $3.9M, why inference-optimized silicon is a growing market distinct from training silicon. The hardware trajectory determines what capability is available to your organization over the next two to three years, and whether the hardware you are procuring now will still be the relevant baseline when your deployed models reach production scale. The export controls determine which hardware trajectory your organization is on — a fact that is geopolitical before it is economic.
For a Chief AI Officer — a CAIO — in a global enterprise, the procurement question is not simply "how many H100s should we reserve?" The real questions run deeper: which regulatory tier are our operating entities in, how stable are those tier assignments given the observed volatility of export control policy, what happens to our inference infrastructure if a key component of our supply chain crosses a control threshold, and do we have the software engineering capacity to operate on alternative hardware if the control environment forces a migration? The regulations determine which countries can access H100, H200, and Blackwell GPUs — and by extension, which organizations can build competitive AI infrastructure. For enterprises operating globally, understanding export controls has become as essential as understanding the chips themselves.
The volatility point deserves direct attention. Between October 2022 and mid-2025, the export control framework was revised substantively at least three times — each revision closing loopholes that the previous version had left open, and each revision creating new uncertainty about which products and which jurisdictions were affected. The H800 went from being an approved China-market product to being blocked within months of its introduction. The H20 went from approved to requiring a special license and back to approved within roughly a year. Organizations that built China-facing inference infrastructure around the H20 in early 2025 faced a period in which that infrastructure's legal status was genuinely unclear. That is not a procurement risk that can be managed with a standard vendor contract — it requires ongoing legal monitoring, contingency architecture planning, and the organizational capacity to migrate workloads across hardware platforms on compressed timelines.
The hardware substitution question is harder than it looks. The CUDA gravity well is real. An inference stack built on H100s and optimized with TensorRT-LLM, vLLM, or SGLang carries thousands of engineering hours of CUDA-specific optimization — kernel tuning, memory layout decisions, batching logic calibrated to specific HBM bandwidth profiles. Migrating that stack to Huawei Ascend or even to Google TPU is not a matter of recompiling. It requires re-profiling every hot kernel, rebuilding the serving framework from scratch in a different programming model, and revalidating output quality across the full distribution of production inputs. A realistic migration timeline for a mid-sized inference deployment is six to eighteen months of dedicated engineering effort. Hardware decisions made today carry software lock-in consequences that extend well past the hardware's own depreciation cycle.
Against this background, the emergence of hardware-agnostic inference abstraction layers looks less like a convenience feature and more like a strategic hedge. Frameworks like vLLM that support multiple hardware backends, or cloud-provider managed inference services that abstract the underlying silicon, give an organization real optionality in a volatile control environment. An organization whose inference stack runs on a managed service that can redirect traffic from H100 instances to TPU v5e instances without application-layer changes has choices. An organization that has co-designed its inference stack tightly with NVIDIA-specific kernel libraries does not.
The workload classification question runs parallel to the hardware selection question. A document processing pipeline that runs batch inference overnight is not the same problem as a real-time customer-facing chatbot. The batch pipeline can tolerate higher latency, can amortize large batch sizes across the run, and can be scheduled on spot or preemptible instances that cut cost by 60-70% versus on-demand. The chatbot must respond within two seconds, must serve unpredictable traffic spikes, and needs reserved or dedicated capacity that can scale quickly. Running both workloads on the same hardware configuration is almost never optimal. A CAIO who has not mapped the organization's AI workloads onto this latency-throughput spectrum before making hardware commitments is paying for the wrong thing at significant scale.
The long-context trend sharpens this. As models with 200K and 1M token context windows move from research artifacts to production deployments — for document analysis, code review across large repositories, long-session agents — the memory capacity constraint becomes the primary cost driver. At 200K context, the KV cache for a single active session can occupy 40-60GB of HBM on a large model. An H100 with 80GB can serve very few concurrent long-context sessions before memory is exhausted. The H200's 141GB provides more headroom. The NVL72's 13.4TB of unified memory, pooled across 72 GPUs in a single NVLink domain, provides a qualitatively different operating point: the ability to serve thousands of concurrent long-context sessions from a single rack without the inter-node communication overhead that fragments performance when sessions must be distributed across InfiniBand-connected servers. For an organization whose AI use cases are trending toward agents and document-scale reasoning rather than short-form generation, that architectural advantage is directly relevant — not as a benchmark achievement, but as a capacity architecture.
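The capacity arithmetic behind that comparison, under illustrative assumptions (a 70B-class model held in BF16 at roughly 140 GB, the ~65 GB per 200K-token session from the earlier KV calculation, and no allowance for activations, fragmentation, or reserved buffers):

```python
# Concurrent 200K-token sessions that fit once the model weights are resident.
# Illustrative assumptions: ~140 GB of BF16 weights sharded across the domain,
# ~65 GB of KV cache per 200K-token session, nothing reserved for activations.

def max_long_context_sessions(domain_hbm_gb, weights_gb=140, kv_per_session_gb=65):
    return int(max(0, domain_hbm_gb - weights_gb) // kv_per_session_gb)

domains = [
    ("single H100 (80 GB)", 80),
    ("8x H100 node (640 GB)", 640),
    ("8x H200 node (1,128 GB)", 1128),
    ("GB200 NVL72 domain (13,400 GB)", 13_400),
]
for name, hbm in domains:
    print(f"{name}: ~{max_long_context_sessions(hbm)} concurrent 200K sessions")
```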
The upcoming Vera Rubin generation sharpens the planning horizon further. NVIDIA has confirmed that the Vera Rubin NVL144 will carry HBM4 at 13 TB/s bandwidth — nearly four times the H100's bandwidth — in a 144-GPU rack configuration, delivering 3.3x the FP8 training performance of the GB300 NVL72. Organizations making compute infrastructure commitments today are implicitly betting on a specific position in this trajectory. A three-year reserved instance commitment on H100 hardware made in mid-2025 runs through mid-2028, by which point Vera Rubin clusters will be in production at the major hyperscalers and the H100's position on the capability curve will have moved from current-generation to two generations behind. The economics of reserved pricing versus on-demand can justify locking in capacity even on hardware that will be superseded. The argument is for understanding exactly what you are committing to and why, rather than making capacity decisions based on what is currently visible on benchmark sheets.
The Google Ironwood TPU situation shows a strategic consideration that applies beyond TPU specifically. When a major cloud provider builds purpose-designed inference silicon — optimized for the memory bandwidth profiles of large transformer models, with 7.2 TB/s per chip and massively scaled pod configurations — and deploys it primarily for their own frontier model inference, the cost structure of serving those models changes in ways that affect competitive dynamics across the industry. Google serving Gemini 2.5 inference on Ironwood at internal transfer pricing has a fundamentally different cost basis than a competitor serving an equivalent model on commercially priced H100 or H200 instances. That asymmetry does not determine competitive outcomes — model quality, product integration, and developer ecosystem matter enormously — but it is a structural advantage that compounds over time as inference volume scales. For a CAIO evaluating whether to build AI capabilities on third-party foundation models or invest in custom model development and dedicated inference infrastructure, the cost structure asymmetry is a material input to that decision.
The semiconductor supply chain itself carries risks that extend beyond export controls. HBM production is concentrated at SK Hynix, Samsung, and Micron. Advanced packaging — the CoWoS process that integrates HBM with GPU dies, where CoWoS stands for Chip on Wafer on Substrate — is almost entirely sourced through TSMC. A sustained disruption to TSMC's advanced packaging capacity, whether from geopolitical events, natural disaster, or technical yield problems at new process nodes, would affect the entire H100/H200/Blackwell supply chain simultaneously. CoWoS capacity was already a constraint on H100 supply in 2023, limiting NVIDIA's ability to ship against demand. Organizations building critical AI infrastructure on hardware whose supply chain runs through a small number of facilities in Taiwan are carrying concentration risk that sits outside any individual vendor relationship.
None of this complexity is navigable by treating hardware as a commodity procurement decision. The memory wall is physics. The hardware trajectory is engineering's response to that physics, generation by generation. The export control regime is policy's attempt to use hardware as a lever on capability development — a lever that has proven blunter than its architects intended, volatile in its application, and increasingly difficult to calibrate as algorithmic innovation partially decouples capability from raw compute access. The bifurcated market that has emerged is not a temporary condition pending policy normalization. It is the operating environment for AI infrastructure planning through at least the end of the decade.
A CAIO who can trace a number on a spec sheet — 3.35 TB/s, 130 TB/s, 13 TB/s — back to the physical constraint it addresses and forward to the operational consequence it implies is not dependent on vendor narratives to make infrastructure decisions. In a market where vendor narratives are optimized for procurement velocity rather than strategic clarity, that independence is worth considerably more than any single hardware choice.
The compute question is, at its base, a physics question. Everything downstream — the architecture of inference stacks, the economics of serving costs, the strategic significance of export controls, the planning horizon for capacity commitments — follows from understanding what the physics permits and what it does not.