
This repository demonstrates the "Sleep-time Compute" technique described in the paper "Sleep-time Compute: Beyond Inference Scaling at Test-time" using open-source LLMs.
Google Colab: https://colab.research.google.com/drive/12Itg_XOCP9sRezztBIIRY97QHli0Lpg8?usp=sharing
Explanation Article: https://medium.com/@ronantech/demo-how-to-run-sleep-time-compute-to-reduce-llm-latency-84c5626d0770
Sleep-time Compute is a technique that improves the efficiency and accuracy of language models by splitting computation into two phases (see the sketch after this list):
- Sleep-time Phase: Pre-compute useful inferences about a context when the model would otherwise be idle
- Test-time Phase: Use these pre-computed inferences to answer queries more efficiently
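At its core, the split is a pair of prompts. The sketch below assumes a generic `generate(prompt)` helper wrapping whatever backend is in use; the function names and prompt wording are illustrative, not the repository's exact code:

```python
def sleep_time_compute(context: str, generate) -> str:
    """Sleep-time phase: runs while the system is idle, before any query arrives.
    Asks the model to pre-compute useful inferences about the context."""
    prompt = (
        "Read the following context and write down key facts, entities, and "
        f"likely questions with their answers:\n\n{context}"
    )
    return generate(prompt)  # a string of pre-computed inferences


def test_time_compute(inferences: str, query: str, generate) -> str:
    """Test-time phase: answers a live query cheaply by conditioning on the
    pre-computed inferences instead of re-reasoning over the raw context."""
    prompt = (
        f"Notes prepared earlier about the context:\n{inferences}\n\n"
        f"Using these notes, answer concisely: {query}"
    )
    return generate(prompt)
```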
This approach offers several benefits:
- Reduced latency during query time
- Improved accuracy through deeper context understanding
- Cost efficiency through amortization across multiple queries
Based on the research findings, Sleep-time Compute is most effective in the following scenarios:
- Stateful Applications: Systems where context persists across multiple interactions, such as:
  - Document question-answering
  - Coding assistants operating on shared repositories
  - Conversational agents maintaining dialogue history
- Predictable Queries: Contexts where potential questions follow predictable patterns
  - Research shows the performance gap widens with more predictable queries
  - Less effective when queries are difficult to predict or unrelated to the context
- Multiple Related Queries: When users ask several questions about the same context
  - Cost efficiency improves as the number of queries increases (see the amortization sketch after this list)
  - Research demonstrates a 2.5× decrease in average cost per query with 10 queries per context
- High-Latency Constraints: Applications where reducing test-time compute is critical
  - Particularly valuable when test-time tokens are significantly more expensive
  - Can reduce the test-time compute needed to reach the same accuracy by ~5×
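The amortization argument is simple arithmetic: the one-time sleep-time cost is spread across every query against the same context. A minimal sketch, with hypothetical token counts chosen for illustration (not measurements from the paper):

```python
# Hypothetical token counts, for illustration only.
SLEEP_TOKENS = 2000       # one-time sleep-time pass over the context
TOKENS_WITH_SLEEP = 150   # test-time tokens per query once notes exist
TOKENS_WITHOUT = 900      # test-time tokens per query from the raw context

def avg_tokens_per_query(num_queries: int) -> tuple[float, float]:
    """Average tokens per query with and without sleep-time compute."""
    with_sleep = SLEEP_TOKENS / num_queries + TOKENS_WITH_SLEEP
    return with_sleep, float(TOKENS_WITHOUT)

for n in (1, 2, 5, 10):
    w, wo = avg_tokens_per_query(n)
    print(f"{n:>2} queries: {w:7.1f} tokens/query with sleep-time vs {wo:.1f} without")
# With a single query the sleep-time pass is pure overhead; by ~10 queries
# the amortized cost falls well below the no-sleep baseline.
```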
Sleep-time Compute may not be beneficial in these scenarios:
- Unpredictable Queries: When questions are difficult to anticipate from the context
  - The research shows diminishing returns for less predictable queries
  - Standard test-time compute may be more effective in these cases
- Single Query Scenarios: With only one question per context
  - The overhead of sleep-time compute isn't amortized
  - Cost efficiency drops significantly without multiple related queries
- High Test-Time Budget Settings: Applications where extensive test-time compute is already allocated
  - Research shows standard test-time compute can sometimes outperform sleep-time compute when sufficient test-time resources are available
- Non-Stateful Applications: Systems where context doesn't persist between interactions
  - Without a persistent context to analyze during idle time, the core benefit is lost
- Rapidly Changing Contexts: Environments where the context is frequently updated
  - Pre-computed inferences may quickly become outdated
This demo implements Sleep-time Compute using:
- Mistral-7B-Instruct-v0.1: A powerful open-source language model
- Hugging Face Transformers: For model loading and inference
- Custom prompting strategies: Specially designed for the Mistral instruction format (a setup sketch follows this list)
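Model setup follows the standard Transformers pattern. The sketch below is illustrative: the generation parameters are assumed defaults, not necessarily the notebook's exact configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half precision to fit a single Colab GPU
    device_map="auto",
)

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Wrap a prompt in Mistral's [INST] ... [/INST] format and generate."""
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated tokens, dropping the echoed prompt.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```

This `generate` helper slots directly into the two-phase sketch above.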
The code demonstrates:
- Setting up the model with appropriate configurations
- Implementing the sleep-time and test-time phases
- Visualizing the benefits through token usage and accuracy metrics
- Multi-query amortization to show efficiency gains (sketched below)
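End to end, the multi-query demo amounts to one sleep-time pass followed by several cheap test-time calls. A sketch reusing the `sleep_time_compute`, `test_time_compute`, and `generate` helpers from the earlier sketches (the context and queries are placeholders):

```python
context = open("document.txt").read()  # placeholder: any persistent context
queries = [
    "What is the main finding?",
    "Which method was used?",
    "What are the limitations?",
]

# One sleep-time pass while the system is idle...
inferences = sleep_time_compute(context, generate)

# ...then every query is answered from the pre-computed notes.
for query in queries:
    print(f"Q: {query}\nA: {test_time_compute(inferences, query, generate)}\n")
```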
Key features:
- Two-Phase Approach: Clear separation between sleep-time and test-time computation
- Variable Verbosity: Control the level of detail in responses (see the sketch after this list)
- Performance Comparison: Analysis of regular vs. sleep-time compute approaches
- Visualization: Graphs showing efficiency gains and amortization benefits
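Verbosity can be controlled by varying the instruction appended to the test-time prompt. A minimal sketch; the levels and wording are illustrative, not the repository's exact options:

```python
# Illustrative verbosity levels mapped to prompt instructions.
VERBOSITY = {
    "low": "Answer in a single sentence.",
    "medium": "Answer in a short paragraph.",
    "high": "Answer in detail, referring to the relevant pre-computed notes.",
}

def build_test_time_prompt(inferences: str, query: str, level: str = "low") -> str:
    """Build a test-time prompt whose level of detail is set by `level`."""
    return (
        f"Notes prepared earlier about the context:\n{inferences}\n\n"
        f"Question: {query}\n{VERBOSITY[level]}"
    )
```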
The implementation demonstrates:
- Test-time Efficiency: Significant reduction in tokens needed at query time
- Accuracy Improvements: More reliable answers through pre-computed inferences
- Cost Amortization: Greater efficiency as the number of queries increases (plotted in the sketch below)
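The notebook's amortization plot can be reproduced with ordinary matplotlib. A sketch using the same hypothetical token counts as the cost example above:

```python
import matplotlib.pyplot as plt

# Same hypothetical token counts as in the amortization example above.
SLEEP_TOKENS, TOKENS_WITH_SLEEP, TOKENS_WITHOUT = 2000, 150, 900

ns = list(range(1, 21))
with_sleep = [SLEEP_TOKENS / n + TOKENS_WITH_SLEEP for n in ns]
without = [TOKENS_WITHOUT] * len(ns)

plt.plot(ns, with_sleep, label="sleep-time compute (amortized)")
plt.plot(ns, without, linestyle="--", label="standard test-time compute")
plt.xlabel("Queries per context")
plt.ylabel("Average tokens per query")
plt.title("Amortization of sleep-time compute")
plt.legend()
plt.show()
```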

