Memory no longer wants to live in each machine

For many of us, the memory shortage may at first sound like a consumer-level problem: RAM modules, components, and devices squeezed by increasingly stressed demand. But the phenomenon The Next Platform describes also points to the other end of the chain: the large technology companies that train, deploy, and serve artificial intelligence models in data centers. The cloud is not an abstraction, and its appetite for memory is forcing us to consider something that until recently seemed counterintuitive: perhaps each machine should not depend only on the RAM it has inside.

Memory changes places. The underlying idea is to apply to memory a logic we already know from storage. Today, data can live on the computer itself, on another machine on the network, or on a shared system accessed by several servers. The next generation of servers could treat RAM in a similar way: keep a portion local to each machine, but move a much larger portion into big external systems capable of distributing capacity according to the need of the moment. Hence what some call the “memory godbox”: a large box or cluster of memory that is no longer tied to a single machine.
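To make the analogy concrete, here is a minimal sketch of that two-tier idea in Python. The class, names, and capacities are hypothetical (real systems do this in hardware and fabric managers, not application code): an allocator serves requests from local RAM while it can, and spills to a shared external pool when it can't.

```python
class TieredMemory:
    """Toy model of local RAM plus a shared external pool (hypothetical)."""

    def __init__(self, local_gb: int, pool_gb: int):
        self.local_free = local_gb   # fast, fixed capacity inside the box
        self.pool_free = pool_gb     # larger, slower, shared across hosts

    def allocate(self, size_gb: int) -> str:
        # Prefer local RAM for latency; fall back to the shared pool.
        if size_gb <= self.local_free:
            self.local_free -= size_gb
            return "local"
        if size_gb <= self.pool_free:
            self.pool_free -= size_gb
            return "pool"
        raise MemoryError("neither tier can satisfy the request")

host = TieredMemory(local_gb=512, pool_gb=16_384)  # 0.5 TB local, 16 TB pooled
print(host.allocate(256))   # "local"
print(host.allocate(2048))  # "pool": too big for what's left locally
```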

The CXL moment. For years, Compute Express Link has advanced slowly, almost as a promise of more flexible architectures. The technology was introduced several years ago, but today's memory pressures are giving it a much more favorable context. CXL provides a coherent interface that connects processors, memory, accelerators, and other peripherals, building on PCIe. The end goal is simple to state, though complex to execute: disaggregating resources without breaking the impression that they work together.

CXL didn’t arrive all at once. It was first used to expand a server’s memory with modules plugged into compatible PCIe slots. Then, with CXL 2.0, came pooling: the ability to gather memory into a common pool and assign it to different machines as needed. The limitation was that this memory could be reallocated, but not truly shared between two systems working on the same data. CXL 3.0 is the point at which that frontier begins to move, because it introduces broader topologies and memory shared between machines, albeit with certain technical limitations.
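A rough way to picture the pooling stage is the sketch below, assuming CXL 2.0-style semantics in which every gigabyte is leased to exactly one host at a time. The class and names are hypothetical; in practice the assignment is handled by the CXL fabric and its manager, not by application code.

```python
# Toy model of CXL 2.0-style memory pooling: capacity moves between
# hosts, but each gigabyte belongs to exactly one host at a time.
class MemoryPool:
    def __init__(self, total_gb: int):
        self.free_gb = total_gb
        self.assigned: dict[str, int] = {}  # host -> GB currently leased

    def assign(self, host: str, gb: int) -> None:
        if gb > self.free_gb:
            raise MemoryError(f"pool has only {self.free_gb} GB free")
        self.free_gb -= gb
        self.assigned[host] = self.assigned.get(host, 0) + gb

    def release(self, host: str) -> None:
        # Returning capacity is what makes the pool reusable; sharing
        # the *same* bytes between hosts needs CXL 3.0 semantics.
        self.free_gb += self.assigned.pop(host, 0)

pool = MemoryPool(total_gb=4096)
pool.assign("host-a", 1024)
pool.assign("host-b", 512)
pool.release("host-a")        # host-a's slice goes back to the pool
print(pool.free_gb)           # 3584
```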

The underlying problem. According to The Next Platform, AI is not constrained only by a shortage of compute, but also by a shortage of memory. The HBM that accompanies the GPUs is very fast and designed to feed those chips at high speed, but its capacity is limited and its cost is high. In training, the big challenge is usually processing enormous amounts of data to build the model. Inference, however, is something else: using that already trained model to respond to a request.

The memory of the conversation. Each response from a language model is built little by little, token by token. To avoid recomputing everything that came before at each step, systems keep a kind of working memory called the KV cache. The Next Platform explains that the attention keys and values of previous tokens are preserved there, which lets the model keep the context in view while generating the response. The problem is that in services with many users, this cache can grow to occupy enormous amounts of memory, even more than the model itself.
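A back-of-the-envelope calculation shows why. Per generated token, each transformer layer stores one key and one value vector for every key-value head; the figures below are illustrative (roughly a Llama-70B-class model, not numbers from the article):

```python
# Rough KV-cache size estimate for a transformer serving many users.
# Per token, each layer stores one key and one value vector per KV head.
n_layers, n_kv_heads, head_dim = 80, 8, 128   # Llama-70B-class (assumed)
bytes_per_value = 2                           # fp16/bf16

bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
print(bytes_per_token)                        # 327680 bytes = 320 KiB

# One user with a 32k-token context, then a thousand concurrent users:
per_user_gib = bytes_per_token * 32_000 / 2**30
print(f"{per_user_gib:.1f} GiB per user")                       # ~9.8 GiB
print(f"{per_user_gib * 1000 / 1024:.1f} TiB for 1,000 users")  # ~9.5 TiB
```

At that scale, the cache alone dwarfs the tens of gigabytes the model weights occupy, which is exactly the pressure that pushes operators toward pooled DDR5 behind the GPUs.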

It’s not just theory anymore. This idea no longer lives only in technical documents and architectural promises. The Register mentions Panmnesia, Liqid and UnifabriX as companies working on systems that take memory out of the server and make it available to multiple machines. Some do it with CXL switches, others with large reserves of DDR5 that can be distributed among different hosts. The Next Platform adds the case of Enfabrica and its Emfasys system, designed for inference and capable, according to the outlet, of reaching 18 TB of DDR5 per memory server and 144 TB in a full rack. The conclusion is simple: the industry is not only looking for more memory, it is looking to place it differently so that AI can make better use of it.

Images | Xataka with Nano Banana

In Xataka | The ‘Chinese Netflix’ has designed a plan for AI to generate the majority of its content within five years. It sounds risky
