Language Models Need Sleep

I can't pretend to understand how LLMs work, but I can be sure that anthropomorphizing their functions is not helpful to an objective debate over their abilities.

Does a motor vehicle get "sleep" when it is serviced? When I reboot a computer, is that equivalent to a nap?

The authors do explain why they use the term "sleep":

> In animals, the transfer from short-term memory to long-term memory is thought to be supported by hippocampal replay [33], especially during sleep [41]; in this phase, short-term hippocampal memories are reactivated and consolidated into cortical synaptic weights. Sleep makes animals unable to respond to external stimuli, suggesting that it must provide enough cognitive benefit to justify this cost [41]. Inspired by these biological processes, we propose a method for transferring context-window memory into persistent weights. When the model’s context window becomes full during inference, the model enters a “sleep” in which it performs multiple forward passes over the accumulated context and recursively updates its fast weights via a learned local rule. As in animal sleep, the model receives no external input tokens during this phase. After consolidation, the context window is cleared, and the model resumes operation with updated fast weights. During training, the model is optimized end-to-end by backpropagating through the entire process to maximize task performance after sleep.
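To make the quoted loop concrete, here is a toy sketch of the fill, sleep, clear cycle. It is not the paper's method: the class name `ToyModel`, the parameters `capacity`, `sleep_passes`, and `lr`, and the fixed Hebbian outer-product update are all assumptions for illustration. The paper describes a learned local rule trained end-to-end, which this does not attempt.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    dim: int                 # size of each token vector
    capacity: int            # hypothetical context-window size, in tokens
    sleep_passes: int = 2    # hypothetical number of replay passes per sleep
    lr: float = 0.1          # hypothetical step size for the update rule
    fast: list = field(default_factory=list)     # fast weights, dim x dim
    context: list = field(default_factory=list)  # buffered token vectors
    sleeps: int = 0          # how many consolidation phases have run

    def __post_init__(self):
        # Fast weights start at zero; slow weights are omitted in this toy.
        self.fast = [[0.0] * self.dim for _ in range(self.dim)]

    def observe(self, vec):
        """Append one token vector; consolidate when the window is full."""
        self.context.append(vec)
        if len(self.context) >= self.capacity:
            self.sleep()

    def sleep(self):
        """Replay the buffered context into the fast weights, then clear it."""
        # No external input arrives here: only buffered context is replayed.
        for _ in range(self.sleep_passes):
            for vec in self.context:
                for i in range(self.dim):
                    for j in range(self.dim):
                        # Fixed Hebbian outer product, standing in for the
                        # learned local rule mentioned in the quote.
                        self.fast[i][j] += self.lr * vec[i] * vec[j]
        self.context.clear()
        self.sleeps += 1
```

Feeding three copies of `[1.0, 0.0]` into `ToyModel(dim=2, capacity=3)` triggers one sleep: the context ends up empty and only `fast[0][0]` has moved away from zero.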

This is why I object to sleep() from unistd.h. What an anthropomorphizing notion. Didn't early Unix programmers understand that a computer isn't a living creature and therefore can't sleep? They must be really stupid!

Anthropomorphization is not inherently wrong, and in some instances it actually lets you reason better about complex behavior than whatever convoluted (and often wrong, especially in the case of giant neural networks) mechanistic description one might conjure.

Here the analogy isn't without reason.

I assume compaction is the "sleep" here, so yes.

If it works, it's called bionics, not anthropomorphization ;)

This is the struggle of naming papers. You can stretch definitions and get a sexy headline, or you can be precise and have fewer people read it.

> we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache

There is a strong, non-trivial connection here between what your brain does in sleep and what they are studying.

You wouldn't object to referring to robot eyes or robot legs.

The analogy is helpful, but yes, we should be able to “intelligently design” something better than sleep analogues, since we’re not constrained by evolution the way humans are.

We are however constrained by the complexity of any purported solution. That's the bitter lesson, in a nutshell.

At the very least, we know that sleep and dreaming do exist in biological brains. (Doesn't mean any of it is applicable to artificial neural nets, doesn't mean it'll work for our specific architectures etc. etc., but at least the idea requires fewer assumptions than a completely untested novel theory.)

The idea of periodically stopping to write blocks of recent context into a fast-weight state is interesting, but I think I liked it better when E2E-TTT[1] did it. It's a more flexible and elegant continuous learning approach.

[1] https://arxiv.org/abs/2512.23675

Related preprint from the Letta team: https://arxiv.org/abs/2504.13171

Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks - Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by ~ 5x on Stateful GSM-Symbolic and Stateful AIME and that by scaling sleep-time compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, finding the predictability of the user query to be well correlated with the efficacy of sleep-time compute. Finally, we conduct a case-study of applying sleep-time compute to a realistic agentic SWE task.
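The amortization point in that abstract is simple arithmetic, and a rough sketch may help. The function name `avg_cost_per_query` and every number below are hypothetical, chosen only to show the shape of the trade-off; they are not the preprint's measurements.

```python
def avg_cost_per_query(sleep_cost, test_cost, n_queries):
    """Average compute per query when one offline pass serves n_queries."""
    if n_queries <= 0:
        raise ValueError("n_queries must be positive")
    # The one-off sleep-time cost is spread across all queries that share
    # the same context; the per-query test-time cost is paid every time.
    return sleep_cost / n_queries + test_cost


# Hypothetical units: a plain test-time-only answer costs 10 per query.
# With sleep-time compute, assume 10 units offline once, then 2 per query.
one_query = avg_cost_per_query(10, 2, 1)     # 12.0, worse than the baseline
ten_queries = avg_cost_per_query(10, 2, 10)  # 3.0, well below the baseline
```

Under these made-up numbers, the offline pass only pays off when several queries hit the same context. That fits the abstract's finding that the benefit tracks how predictable the user's queries are.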

Isn't this simply context pruning/optimization?

From the abstract, it looks like it's actually doing something deeper, updating weights in part of the model?

No, they're actually training weights based on the context before compaction. Context is context; this splits the model into persistent weights and malleable ones, and the malleable ones are periodically updated.

Wouldn’t that be extremely computationally expensive, considering how resource-intensive training is?
