Opening#
When an aircraft breaks the sound barrier, it creates a shockwave because it moves faster than the surrounding air can propagate the disturbance. The medium cannot accommodate the perturbation, creating a discontinuity. A sonic boom.
Like ripples on the surface of a pond, the shockwave propagates outward, its effects dissipating with distance. Demand for AI services is propagating faster than the technology supply chain can absorb it. First it was GPUs to train and run large language models. Now it is energy to power them, according to Microsoft’s CEO. These ripples reverberate in our everyday lives; electricity prices rise as datacenter demand strains grid capacity, and consumer electronics grow more expensive as they compete for the same constrained semiconductor supply.
Physical supply chains do not respond at the speed of software. Semiconductor fabs and power plants require billions of dollars in capital investment and years of construction, along with the human capital required to build and operate them. Even when demand is clear and capital is available, capacity cannot be expanded overnight. The medium resists sudden acceleration.
In a world where computation is constrained by physical and economic reality, software architecture increasingly becomes an expression of corporate strategy. It mediates customer demand against scarce, expensive resources, and in doing so determines what scales, what stalls, and what becomes uneconomical.
Software built on abundance#
For decades, software has been shaped in a world of consistent hardware performance gains. Thanks to Moore’s Law, transistor count doubled roughly every two years, giving us faster processors, faster networks, and more memory. Dennard scaling allowed us to sustain this progress: as transistors shrank, their operating voltages and currents shrank with them, keeping power density roughly constant even as transistor density increased. This trend slowed as manufacturers pushed into the multi-GHz range and core voltages approached the transistor threshold voltage. Combined with leakage currents that grow worse at small feature sizes, power density approached unsustainable thermal profiles.
Modern high-end datacenter CPUs such as Intel’s Xeon 6 and AMD’s EPYC 9005 series operate in the same clock-frequency range as their predecessors, with most performance gains coming from increased core count, often approaching two hundred cores per socket, as opposed to increased single-core performance. However, just as adding more cooks in the kitchen doesn’t necessarily help prepare dinner faster, adding more cores does not automatically make our computers faster. Multi-core CPUs require sophisticated coordination mechanisms, and they push these new challenges of coordination up the stack into software, giving rise to locks, semaphores, and entire areas of research devoted to parallel computing. These concepts impose a significant cognitive load on software teams that must think not just about how to deliver features to customers, but also about how to maximize utilization of the underlying hardware to deliver a performant user experience.
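The cooks-in-the-kitchen intuition is commonly formalized as Amdahl's law, which bounds the speedup from adding cores by the fraction of work that must remain serial. The sketch below is illustrative only: the 95% parallel fraction is an assumed figure rather than a measurement of any real workload, and the function name is hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumption: 95% of the work parallelizes perfectly, 5% stays serial.
for cores in (2, 16, 192):
    print(f"{cores:>3} cores -> at most {amdahl_speedup(0.95, cores):.1f}x faster")
```

Under that assumption, 192 cores yield a speedup of roughly 18x, and no number of cores can push it past 20x, because the serial 5% never shrinks.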
To grapple with such complexities exposed by the hardware, we create abstractions to hide the details. A mobile app developer should only have to think about APIs and not be required to understand cellular standards such as LTE, specified by 3GPP, that support them. Somebody building a website should be able to focus on their rich animations and graphics, and not have to worry about the graphics rendering pipelines that power them. John Ousterhout captures this principle in A Philosophy of Software Design – building “deep” modules with simple interfaces that encapsulate complexity, thereby reducing cognitive load and enabling organizations to scale.
The advent of the public cloud pushed this abstraction further. Physical computing resources moved behind APIs; no longer did we need to wait on server procurement or data center commissioning. A new computing cluster was just a click away. Even the capital expenditure required to source all of the hardware became abstracted as a metered operating expense. By leveraging economies of scale, cloud providers offered an apparent abundance of compute, giving us the option to scale vertically and horizontally to thousands of servers with little friction. We now had the option to solve certain software performance challenges with financing, by simply throwing more compute and memory at the problem. However, the sudden change in the demands of modern workloads is beginning to exhaust some of these workarounds.
The memory crunch#
Abstractions can help decompose problems and make them tractable for humans, but they do not solve the underlying performance challenges. As parallelized workloads have increased, most visibly in GPUs and other accelerators, the challenge has shifted from performing parallelized computation to feeding it efficiently with enough data.
Popularized by Nvidia in the late 1990s, GPUs were originally developed to accelerate highly parallelizable and repetitive graphics processing pipelines. These specialized technologies have pushed the limits of parallelism. In fact, modern datacenter GPUs contain tens of thousands of cores that can operate simultaneously, performing quadrillions of calculations per second – too many zeros to keep track of. As in a kitchen with tens of thousands of cooks, massive quantities of raw ingredients must flow continuously to keep them all busy. The rate at which data can be fed to the compute cores, referred to as memory bandwidth, can quickly become the bottleneck.
Memory technology has improved over the years, but just as single-core CPU performance gains have slowed significantly due to transistor and power density challenges, memory density improvements have also slowed. Higher bandwidth can be achieved by placing more memory chips in parallel, but physical space on the PCB is finite, and density and bandwidth are linked constraints. Highly parallelized workloads require correspondingly large amounts of memory, along with the bandwidth to shuttle the bits back and forth. Most computers use a memory technology known as DRAM which has gone through several revisions since its inception half a century ago. The most recent revision, DDR5, achieves upwards of 50 GB/s per channel – roughly double the speed of DDR4. These are significant improvements, but as our GPUs have become increasingly powerful, the large language model workloads now demand much more from memory. This phenomenon, where compute capability outgrows the available memory technology, is often called the “memory wall”.
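The per-channel figures above follow from simple arithmetic: peak bandwidth is the transfer rate multiplied by the width of the data bus. The sketch below assumes a 64-bit data path and the DDR4-3200 and DDR5-6400 speed grades, which are illustrative choices rather than figures taken from the text. Sustained bandwidth in practice falls below this theoretical peak.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: int, bus_width_bits: int = 64) -> float:
    """Theoretical peak bandwidth: transfers per second times bytes per transfer."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# Assumed speed grades, used only for illustration.
ddr4 = peak_bandwidth_gb_s(3200)  # DDR4-3200
ddr5 = peak_bandwidth_gb_s(6400)  # DDR5-6400
print(f"DDR4-3200: {ddr4:.1f} GB/s, DDR5-6400: {ddr5:.1f} GB/s, ratio {ddr5 / ddr4:.1f}x")
```

With these assumptions the result is about 25.6 GB/s for DDR4 and 51.2 GB/s for DDR5, consistent with the "roughly double" comparison.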
One of the technologies helping us break through this memory wall is HBM (high bandwidth memory). First introduced in the mid-2010s, HBM stacks memory chips vertically to achieve higher bit density and bandwidth. Imagine a high-rise residence, with dedicated elevators to every floor, enabling far more people to move in and out of the building simultaneously. Owing to the complexity involved in stacking silicon and weaving these “elevators” through, HBM has yield and efficiency challenges during the manufacturing process when compared to standard DRAM. This, combined with the demand from hyperscalers, contributes to the memory crunch we see today. The shockwave that began in the GPU supply chain has reached memory.
The CPU stops being central#
Long before today’s memory constraints became visible, hardware and software architectures were already reorganizing to accommodate rising throughput and increasing specialization. Over the last decade, while single-core CPU performance has mostly stagnated, network bandwidth has increased by orders of magnitude. Modern office networks are generally wired for Gigabit Ethernet, while most datacenters now run well north of 100GbE. Just two years ago in 2024, we saw the standardization of 800GbE, which is enough bandwidth to transfer a full-length 4K film in under a second – far faster than a single CPU core can process.
The demand for inter-GPU communication has catalyzed this trend towards higher bandwidth. Today’s large language models are so large that they cannot be run on a single GPU. They require an entire cluster of them. As data volumes have grown beyond what a single CPU can efficiently handle, modern systems have adapted by moving the CPU off the data path. Rather than routing high-speed traffic through the CPU, dedicated silicon handles the data movement directly, while the CPU takes on an orchestration role, setting up operations, enforcing policies, and handling exceptions. Specialized interlinks such as NVLink exemplify this shift, directly connecting GPUs with each other without going through the CPU, allowing for low-jitter, low-latency, and high-bandwidth communication.
This pattern is not unique to GPUs. Ethernet NICs (network interface cards) have evolved into highly advanced computing devices, allowing the CPU to offload everything from modern encapsulation protocols such as VXLAN, to technologies such as RDMA (remote direct memory access). Traditionally, NICs would pass along network packets to the operating system’s network stack, which would extract the payload and forward it to a userspace process for action. Today, network traffic is too fast for efficient CPU processing. Even at 100 Gbps, considered relatively pedestrian by modern datacenter standards, it is almost impossible to saturate the network with a single CPU core. The network is so fast that the CPU cannot push out enough bytes quickly enough to utilize the available bandwidth. To avoid this CPU bottleneck, the NIC will read data directly from RAM, process it, and push it out over the network, all without CPU involvement.
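To make the single-core bottleneck concrete, it helps to work out how little time a core has for each packet at line rate. The sketch below is back-of-the-envelope arithmetic under stated assumptions: 1500-byte packets, Ethernet framing overhead ignored, and a 3 GHz core. None of these figures come from the text, and the function name is hypothetical.

```python
def per_packet_budget(link_gbps: float, packet_bytes: int, cpu_ghz: float):
    """Return (packets per second, nanoseconds per packet, CPU cycles per packet)."""
    packets_per_second = link_gbps * 1e9 / (packet_bytes * 8)
    nanoseconds_per_packet = 1e9 / packets_per_second
    cycles_per_packet = nanoseconds_per_packet * cpu_ghz  # 1 GHz = 1 cycle per ns
    return packets_per_second, nanoseconds_per_packet, cycles_per_packet


# Assumptions: 100 Gbps link, 1500-byte packets, one 3 GHz core.
pps, ns, cycles = per_packet_budget(link_gbps=100, packet_bytes=1500, cpu_ghz=3.0)
print(f"{pps / 1e6:.1f} million packets/s, {ns:.0f} ns and {cycles:.0f} cycles per packet")
```

Under these assumptions a single core has roughly 120 ns, or about 360 cycles, for each packet, a budget that a few cache misses or one context switch can exhaust.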
Similarly, in the storage world, NVMe SSDs leverage DMA (direct memory access) over the PCIe bus to transfer data to and from system memory, instead of handling the data via the CPU. This is equivalent to creating a dedicated pathway from a restaurant’s truck bay to the pantry, so that the cooks in the kitchen are not involved in moving ingredients. These advances in both storage and networking now allow us to access remote NVMe drives over a network at almost locally attached speeds, utilizing technologies such as NVMe-oF (NVMe-over-Fabrics) that leverage advanced modern NICs to offload high-bandwidth data shuttling from the CPU.
These technological shifts due to changing workload patterns are not isolated to hardware alone. Large demand shocks reshape the medium in which systems operate, and software architectures respond alongside the hardware they run on.
Where physics meets software#
Databases provide a uniquely clear view into these dynamics because they sit at the intersection of compute, memory, storage, and networking. They operate at the technological limits of performance, and are forced to make tradeoffs – latency vs throughput, consistency vs availability, to name a few.
Shared-nothing architectures, employed by well-known databases such as Cassandra, emphasize complete separation of resources, with each node containing its own CPU, memory, and storage. There are no dependencies on other nodes, but it also means that hardware resources may idle if the workloads are not constant. The ratio between CPU, memory, and storage cannot change dynamically, requiring relatively stable workload patterns to maximize hardware utilization. A database used for event logging purposes will require large storage capacity, but may have very few read queries, resulting in low computational requirements. A product database used in an e-commerce store likely has many read queries from shoppers, but a relatively modest amount of storage capacity that only scales with the number of products being sold. In other words, in shared-nothing architectures, a mismatch in resource ratios can translate to paying for idle silicon or bottlenecking performance.
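The cost of a fixed resource ratio can be shown with a small sizing calculation. This is a minimal sketch with made-up numbers: the node shape (16 cores, 4 TB) and the event-logging workload (8 cores of compute, 100 TB of data) are hypothetical and chosen only to illustrate the mismatch.

```python
import math


def nodes_needed(cores_needed: float, storage_tb_needed: float,
                 cores_per_node: int, storage_tb_per_node: float) -> int:
    """With fixed node shapes, the scarcest resource dictates the node count."""
    by_cpu = math.ceil(cores_needed / cores_per_node)
    by_storage = math.ceil(storage_tb_needed / storage_tb_per_node)
    return max(by_cpu, by_storage)


def cpu_utilization(cores_needed: float, nodes: int, cores_per_node: int) -> float:
    """Fraction of provisioned cores the workload actually uses."""
    return cores_needed / (nodes * cores_per_node)


# Hypothetical storage-heavy workload on hypothetical 16-core, 4 TB nodes.
nodes = nodes_needed(cores_needed=8, storage_tb_needed=100,
                     cores_per_node=16, storage_tb_per_node=4)
print(f"{nodes} nodes, CPU utilization {cpu_utilization(8, nodes, 16):.0%}")
```

In this example storage forces 25 nodes, which brings 400 cores along for a workload that needs 8 of them: about 2% CPU utilization.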
In contrast, a shared-storage architecture separates the data storage layer from computing resources such as CPU and memory, a concept with roots in Google’s early work on BigQuery. This allows CPU and memory resources to scale independently of storage requirements, enabling higher resource utilization, particularly in cloud environments. Neon, a database company acquired by Databricks in 2025, re-engineered large parts of the Postgres storage layer to enable this separation of compute and storage. This allows them to scale their resources up and down with load, independently from the amount of data they house, creating a kind of “serverless” database that does not require constant dedicated resources.
Systems such as Elasticsearch and ClickHouse have also adopted a similar shared-storage architecture, often enabled by the availability of cheap and durable object storage from services such as Amazon S3. These storage services help solve the hard problem of providing distributed and consistent storage capable of handling concurrent reads and writes, by providing primitives such as read-after-write consistency and conditional writes. Open lakehouse formats such as Apache Iceberg and Delta Lake are manifestations of this shift: higher-level data semantics layered on top of now-commoditized shared storage.
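A conditional write behaves like a compare-and-swap on storage: of several writers racing for the same key, exactly one succeeds. The sketch below uses a toy in-memory class to show the semantics. `ToyObjectStore`, `put_if_absent`, `commit`, and the `log/` key layout are all hypothetical names for illustration, not the API of Amazon S3 or of any table format.

```python
from typing import Dict, Optional


class ToyObjectStore:
    """In-memory stand-in for an object store (hypothetical, not a real client)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put_if_absent(self, key: str, data: bytes) -> bool:
        """Conditional write: succeed only if the key does not exist yet."""
        if key in self._objects:
            return False
        self._objects[key] = data
        return True

    def get(self, key: str) -> Optional[bytes]:
        """Read-after-write: a successful put is visible to the next get."""
        return self._objects.get(key)


def commit(store: ToyObjectStore, version: int, manifest: bytes) -> bool:
    """Try to claim the next version slot in a hypothetical commit log."""
    return store.put_if_absent(f"log/{version:08d}.json", manifest)


store = ToyObjectStore()
print(commit(store, 1, b'{"writer": "A"}'))  # first writer claims version 1
print(commit(store, 1, b'{"writer": "B"}'))  # second writer loses and must retry
print(store.get("log/00000001.json"))        # only writer A's manifest is stored
```

The toy class is single-threaded. The point of the real primitive is that the storage service enforces this check atomically, so higher layers can build commit protocols without running their own coordination service.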
The sudden increase in demand for AI-related services utilizing vector embeddings has led to a surge in new database developments around vector search, tipping the scales in favor of this separation. Current state-of-the-art vector search algorithms include variations on HNSW (hierarchical navigable small world), chosen by many databases for its balance of dynamic insertions and recall performance without complex tuning. However, these graph-based algorithms are fairly memory-intensive due to the need to traverse data structures in-memory, amplifying the computation costs in these databases. By employing an architecture that can dynamically scale computational resources up and down independently from the storage volumes, databases can achieve higher utilization of the scarce hardware resources.
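A rough estimate shows why an in-memory graph index becomes expensive at scale. The sketch below rests on several assumptions that are not from the text: 32-bit float vectors, 4-byte neighbor identifiers, `2 * m` links per node at the base layer (a convention in some HNSW implementations), and the upper layers ignored. The corpus of 100 million 768-dimensional vectors is hypothetical.

```python
def hnsw_memory_gb(num_vectors: int, dims: int, m: int = 16,
                   bytes_per_float: int = 4, bytes_per_link: int = 4) -> float:
    """Rough lower bound: raw vectors plus base-layer neighbor lists."""
    vector_bytes = num_vectors * dims * bytes_per_float
    link_bytes = num_vectors * 2 * m * bytes_per_link  # assumes 2*m links per node
    return (vector_bytes + link_bytes) / 1e9


# Hypothetical corpus: 100 million embeddings with 768 dimensions each.
print(f"{hnsw_memory_gb(100_000_000, 768):.0f} GB")
```

Under these assumptions the index needs on the order of 320 GB of RAM, nearly all of it for the vectors themselves, before any replication or query-time overhead.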
As workload patterns shift and hardware evolves, database architectures are forced to surface the tradeoffs they embody. In today’s world, decisions about how tightly to couple the four pillars of compute, memory, storage, and networking determine not only performance characteristics, but also hardware utilization, cost, and scalability. Databases sit close to the physical limits of the system; the underlying dynamic, however, is not unique to them. Whenever software operates under strong binding constraints, architectural choices begin to shape not just technical outcomes, but economic viability. There is no clean answer here. Every architecture embeds a bet about which constraints will bind hardest. Get it wrong, and you've optimized for a world that no longer exists.
Abstractions under pressure#
Abstractions are not free. They are priced according to the assumptions of the environment in which they were created. For decades, rising hardware abundance made that price easy to ignore. As physical binding constraints strengthen, the hidden costs encoded in our abstractions are becoming visible again.
Some abstractions are effectively zero-cost. Take, for example, smart pointers and generics. These mechanisms reduce developer cognitive load without runtime performance tradeoffs. Others do have an associated cost. Garbage collection frees developers from keeping track of object lifetimes at the cost of runtime pauses. Virtual functions allow for dynamic dispatch at runtime in exchange for a performance penalty. In embedded systems and high-performance computing, we see a prevalence of languages such as C, C++, and Rust that trade development complexity and cognitive load in exchange for control of runtime behavior – a rational choice when every byte and every cycle carries a visible cost.
The Linux kernel, one of the most important pieces of software in the modern era, has allowed userspace programs to not worry about the specifics of hardware. Its syscalls abstract away details about hard drives and NICs, and expose stable interfaces to work with primitives such as memory, files, and networks. The kernel directly manages and talks to the hardware through drivers, and surfaces it to userspace as abstracted, generic resources. This provides not just an abstraction, but also a separation of concerns; the kernel ensures that the hardware resources are properly managed and secured, and userspace focuses more on use cases.
When operating at the limits of hardware, the performance overhead of these abstractions can begin to outweigh the benefits. As networking and storage technology have advanced significantly relative to CPU performance, software’s ability to keep up with rising demand becomes critical. Every memory copy and every context switch adds up when repeated millions of times a second. The current demand for LLM-based services has greatly accelerated these pressures. The shockwave now reverberates through abstractions that were already strained by uneven advancements in hardware, exposing costs that can no longer be ignored.
Take, for example, an application that sends tens of gigabytes of real-time high-resolution video over the network to a remote storage device with an SSD. The data gets copied from the video sender application’s userspace memory to the kernel, into the NIC, over the fiber optic cable to the receiving NIC, to the receiving kernel, to userspace, then to the kernel again, before being written to the SSD. At each step the data is shuffled around, sometimes encapsulated in a virtual digital envelope, and then taken out of it at the receiving end. The overhead of this virtual paper shuffling becomes apparent as we demand more performance out of our computers.
This shuffling is a consequence of the kernel-userspace abstraction layer: data enters and exits through the kernel. In the late 2010s, as data center networking exceeded 100 Gbps, the desire for a “shortcut” that bypasses this abstraction grew. DPDK (Data Plane Development Kit) is an industry-standard framework that completely bypasses the kernel’s abstractions, and allows userspace applications to talk directly to the NIC without the kernel playing telephone in between. It transfers the responsibility for the network stack to the application, and sacrifices some of the security guarantees and multiplexing capabilities of the kernel, in exchange for raw speed. By reducing memory copies and context switches, DPDK allows high-bandwidth, performance-sensitive applications to achieve 2-3x the efficiency of going through the kernel. DPDK has its tradeoffs and is not a universal solution, but it is one of the many ways the industry has been moving to rework these abstraction boundaries, as the opportunity-cost equation has shifted.
As the environment changes, technology advances, and the workloads change, abstractions must be renegotiated. Assumptions about cost, performance, and economics that were implicit become explicit when put under pressure. Failing to notice these macro movements, or refusing to adapt, accumulates a form of hidden debt – your systems and products appear sound, but break when pushed.
Hardware awareness#
Datacenter-grade GPUs released in 2025 and later have dedicated modules in their silicon that are specifically designed for operating on FP4 and FP8 – 4-bit and 8-bit floating-point representations that did not exist just a few years ago. These ultra-low-precision numerical formats were invented specifically to increase the memory and compute efficiency of LLMs, and the fact that they are now built into transistor-level hardware reflects a tight coupling between emerging workloads and hardware design. Assumptions about the software requirements are being baked into the silicon on compressed timescales, as the hardware adjusts to rapidly changing workload patterns.
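The memory argument for lower precision is straightforward arithmetic: the bytes needed to hold a model's weights scale linearly with bits per parameter. The sketch below assumes a hypothetical 70-billion-parameter model and counts weights only, ignoring quantization scale factors, activations, and caches.

```python
def weight_memory_gb(num_parameters: float, bits_per_parameter: int) -> float:
    """Memory needed to store the weights alone, in gigabytes."""
    return num_parameters * bits_per_parameter / 8 / 1e9


# Hypothetical 70-billion-parameter model at three precisions.
for name, bits in (("FP16", 16), ("FP8", 8), ("FP4", 4)):
    print(f"{name}: {weight_memory_gb(70e9, bits):.0f} GB")
```

Under these assumptions the weights shrink from 140 GB at FP16 to 70 GB at FP8 and 35 GB at FP4, which also reduces the bytes that must cross the memory bus for each pass over the weights.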
Similarly, FlashAttention-3, a specialized library in the CUDA ecosystem, accelerates transformer-based training and inference by exploiting LLM-optimized hardware primitives introduced in newer Nvidia architectures. It is a piece of software written with deep knowledge of the hardware it runs on, parallelizing memory operations alongside computational work. The pace of co-design between GPU manufacturers and the software ecosystem illustrates how tightly integrated hardware and software development has become in this space. Feedback loops between software needs and dedicated silicon that once spanned decades are now being compressed into years.
In the presence of strong physical constraints, hardware awareness shifts from an optimization lever to a strategic one. But when performance increasingly depends on exploiting hardware-specific features, then whoever controls those features holds leverage. Co-design creates value, but it also creates dependency. When the path to performance runs through vendor-specific silicon, the question of who controls the abstraction layer becomes a question of market power.
Silicon vendors have strong incentives to lock customers into their hardware. Every architectural choice about whether to use a vendor-specific feature is a bet – paying for performance today with flexibility tomorrow, or preserving optionality at the cost of leaving performance on the table. Abstractions face the lowest common denominator effect, where the only capabilities common across all hardware platforms are those that have been commoditized. Squeezing performance out of scarce hardware resources requires escape hatches to utilize the latest and greatest hardware features – features that live outside the portable abstraction, behind vendor-specific doors.
Within machine learning, XLA, TVM, and Triton exemplify this tension. These compiler technologies translate high-level models into hardware-specific executable code, allowing frameworks such as PyTorch to stay relatively hardware agnostic, preserving optionality in a world of diverging hardware. Google has invested heavily in XLA to run LLMs on their in-house chips called TPUs (tensor processing units), introduced years before LLMs became mainstream. They built both the silicon and the compiler. The abstraction layer itself became a strategic asset that leveraged proprietary hardware for performance while insulating their software stack from dependence on external chip suppliers.
Yet, despite the importance of hardware awareness to maximize performance under constraints, its impact is not uniform. Co-design carries significant costs in engineering effort, human capital, and long-term maintenance across multiple technical disciplines. Amortizing this cost is only possible when these architectural choices can be sustainably supported. As a result, the ability to fully capitalize on hardware awareness tends to concentrate where scale, capital, and long planning horizons already exist. And the entities best positioned to play this game are the public cloud providers.
Cloud after abundance#
Cloud computing emerged in an era where operating at scale required ownership of physical infrastructure. Getting started cost millions of dollars in servers, networking, and power contracts. Capacity planning was capital intensive, and companies had to take on the risks of resource underutilization.
One of cloud computing’s major value propositions was to make computational resources appear abundant and fungible. Elasticity became a default assumption for cloud customers, so much so that capacity planning could be deferred. Developers no longer needed to provision for peak demand; they could draw from the cloud provider’s seemingly infinite pool of servers, paying only for what they used.
This model rested on the implicit assumptions that infrastructure would remain sufficiently available when needed, that prices would be predictable, and that providers operating at economies of scale would absorb the utilization risk of idle capacity. Although no regulation enforced them, abundance from decades of continuous hardware improvements, coupled with competition between the hyperscale cloud providers, has largely upheld these assumptions.
In today’s world, where GPUs, memory, and power are physically constrained, that equilibrium is under strain. Elasticity is no longer unconditional. Supply cannot respond quickly, and demand shows little sign of abating. When both supply and demand are inelastic, availability and pricing become sources of uncertainty rather than assumed ambient conditions.
While cloud providers have largely avoided repricing their products, in reality they are no longer pricing usage. They are charging for allocation. Access to capacity increasingly depends on upfront commitments, minimum-spend agreements, and long-term reservations with cloud providers to keep your spot in the proverbial line. Elasticity is now conditional on availability, and on commercial agreements.
Under physical binding constraints, cloud providers resemble capital allocators more than infrastructure service providers. Their scale gives them privileged access to scarce resources at favorable prices, along with the option to develop them in-house. At a macro level, they centralize resource acquisition and redistribute access to downstream cloud customers, while amortizing risk across a large user base.
Cloud computing did not remove capital expenditure from the system. It concentrated it. During an era of abundance, that concentration was largely hidden, but under scarcity, it becomes visible. The cloud remains, but its role has changed: from a virtual buffer against physical constraint to a mechanism for allocating within such constraints.
Rigor of the real#
With the cloud no longer acting as a buffer against physical limits, those limits reappear inside software systems themselves. Just as we moved complexity from hardware up into software as Dennard scaling slowed down, we are now pushing the resource scarcity problem from cloud infrastructure into higher-level software. Decisions about what to abstract, what to build, and what to optimize now directly shape cost structures, margins, and competitive advantage. These decisions can no longer be deferred.
This is not theoretical. The pressure is already present across power, memory, and compute, where demand increasingly outpaces supply. Resource contention is no longer a transient anomaly; engineers increasingly encounter quotas and rate limits in day-to-day development. These are signals of an imbalance requiring tradeoffs that demand deliberate architectural decisions.
Not every product and business operates at these limits. Many operate in regimes where compute is cheap relative to revenue, performance is tolerable, and scale is determined more by people and processes than by physics. In such conditions, leaning into these abstractions and focusing on the higher-level bottlenecks remains the rational decision. But these conditions are not static. As workloads evolve, assumptions that once seemed obvious can fall apart. Architectural decisions serve as commitments – bets placed on future customer demand, technological advancements, and where structural advantage is expected to come from.
Architectural decisions are the interface between physical limits and organizational intent. Thinking from first principles allows us to reason about constraints and design systems that acknowledge what is real, rather than what we wish were true. Organizations that treat architecture as a technical decision delegated to engineering are implicitly making a bet that the environment which shaped their current abstractions will persist. Every system embeds assumptions about the world it was built for. The discipline is in examining those assumptions before the world moves on.