
This repository demonstrates the "Sleep-time Compute" technique described in the paper "Sleep-time Compute: Beyond Inference Scaling at Test-time" using open-source LLMs.
Google Colab: https://colab.research.google.com/drive/12Itg_XOCP9sRezztBIIRY97QHli0Lpg8?usp=sharing
Explanation Article: https://medium.com/@ronantech/demo-how-to-run-sleep-time-compute-to-reduce-llm-latency-84c5626d0770
Sleep-time Compute is a technique that improves the efficiency and accuracy of language models by splitting computation into two phases (see the sketch after this list):
- Sleep-time Phase: Pre-compute useful inferences about a context when the model would otherwise be idle
- Test-time Phase: Use these pre-computed inferences to answer queries more efficiently
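At its core, the split is a pair of prompts. The sketch below assumes a generic `generate(prompt)` helper wrapping whatever backend is in use; the function names and prompt wording are illustrative, not the repository's exact code:

```python
def sleep_time_compute(context: str, generate) -> str:
    """Sleep-time phase: runs while the system is idle, before any query arrives.
    Asks the model to pre-compute useful inferences about the context."""
    prompt = (
        "Read the following context and write down key facts, entities, and "
        f"likely questions with their answers:\n\n{context}"
    )
    return generate(prompt)  # a string of pre-computed inferences


def test_time_compute(inferences: str, query: str, generate) -> str:
    """Test-time phase: answers a live query cheaply by conditioning on the
    pre-computed inferences instead of re-reasoning over the raw context."""
    prompt = (
        f"Notes prepared earlier about the context:\n{inferences}\n\n"
        f"Using these notes, answer concisely: {query}"
    )
    return generate(prompt)
```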
This approach offers several benefits:
- Reduced latency during query time
- Improved accuracy through deeper context understanding
- Cost efficiency through amortization across multiple queries
Based on the research findings, Sleep-time Compute is most effective in the following scenarios:
- Stateful Applications: Systems where context persists across multiple interactions, such as:
  - Document question-answering
  - Coding assistants operating on shared repositories
  - Conversational agents maintaining dialogue history
- Predictable Queries: Contexts where potential questions follow predictable patterns
  - Research shows the performance gap widens with more predictable queries
  - Less effective when queries are difficult to predict or unrelated to the context
- Multiple Related Queries: When users ask several questions about the same context
  - Cost efficiency improves as the number of queries increases (see the amortization sketch after this list)
  - Research demonstrates a 2.5× decrease in average cost per query with 10 queries per context
- High-Latency Constraints: Applications where reducing test-time compute is critical
  - Particularly valuable when test-time tokens are significantly more expensive
  - Can reduce the test-time compute needed to reach the same accuracy by ~5×
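The amortization argument is simple arithmetic: the one-time sleep-time cost is spread across every query against the same context. A minimal sketch, with hypothetical token counts chosen for illustration (not measurements from the paper):

```python
# Hypothetical token counts, for illustration only.
SLEEP_TOKENS = 2000       # one-time sleep-time pass over the context
TOKENS_WITH_SLEEP = 150   # test-time tokens per query once notes exist
TOKENS_WITHOUT = 900      # test-time tokens per query from the raw context

def avg_tokens_per_query(num_queries: int) -> tuple[float, float]:
    """Average tokens per query with and without sleep-time compute."""
    with_sleep = SLEEP_TOKENS / num_queries + TOKENS_WITH_SLEEP
    return with_sleep, float(TOKENS_WITHOUT)

for n in (1, 2, 5, 10):
    w, wo = avg_tokens_per_query(n)
    print(f"{n:>2} queries: {w:7.1f} tokens/query with sleep-time vs {wo:.1f} without")
# With a single query the sleep-time pass is pure overhead; by ~10 queries
# the amortized cost falls well below the no-sleep baseline.
```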
Sleep-time Compute may not be beneficial in these scenarios:
- Unpredictable Queries: When questions are difficult to anticipate from the context
  - The research shows diminishing returns for less predictable queries
  - Standard test-time compute may be more effective in these cases
- Single Query Scenarios: With only one question per context
  - The overhead of sleep-time compute isn't amortized
  - Cost efficiency drops significantly without multiple related queries
- High Test-Time Budget Settings: Applications where extensive test-time compute is already allocated
  - Research shows standard test-time compute can sometimes outperform sleep-time compute when sufficient test-time resources are available
- Non-Stateful Applications: Systems where context doesn't persist between interactions
  - Without a persistent context to analyze during idle time, the core benefit is lost
- Rapidly Changing Contexts: Environments where the context is frequently updated
  - Pre-computed inferences may quickly become outdated
This demo implements Sleep-time Compute using:
- Mistral-7B-Instruct-v0.1: A powerful open-source language model
- Hugging Face Transformers: For model loading and inference
- Custom prompting strategies: Specially designed for the Mistral instruction format (a setup sketch follows this list)
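Model setup follows the standard Transformers pattern. The sketch below is illustrative: the generation parameters are assumed defaults, not necessarily the notebook's exact configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half precision to fit a single Colab GPU
    device_map="auto",
)

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Wrap a prompt in Mistral's [INST] ... [/INST] format and generate."""
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated tokens, dropping the echoed prompt.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
```

This `generate` helper slots directly into the two-phase sketch above.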
The code demonstrates:
- Setting up the model with appropriate configurations
- Implementing the sleep-time and test-time phases
- Visualizing the benefits through token usage and accuracy metrics
- Multi-query amortization to show efficiency gains (sketched below)
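End to end, the multi-query demo amounts to one sleep-time pass followed by several cheap test-time calls. A sketch reusing the `sleep_time_compute`, `test_time_compute`, and `generate` helpers from the earlier sketches (the context and queries are placeholders):

```python
context = open("document.txt").read()  # placeholder: any persistent context
queries = [
    "What is the main finding?",
    "Which method was used?",
    "What are the limitations?",
]

# One sleep-time pass while the system is idle...
inferences = sleep_time_compute(context, generate)

# ...then every query is answered from the pre-computed notes.
for query in queries:
    print(f"Q: {query}\nA: {test_time_compute(inferences, query, generate)}\n")
```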
Key features:
- Two-Phase Approach: Clear separation between sleep-time and test-time computation
- Variable Verbosity: Control the level of detail in responses (see the sketch after this list)
- Performance Comparison: Analysis of regular vs. sleep-time compute approaches
- Visualization: Graphs showing efficiency gains and amortization benefits
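Verbosity can be controlled by varying the instruction appended to the test-time prompt. A minimal sketch; the levels and wording are illustrative, not the repository's exact options:

```python
# Illustrative verbosity levels mapped to prompt instructions.
VERBOSITY = {
    "low": "Answer in a single sentence.",
    "medium": "Answer in a short paragraph.",
    "high": "Answer in detail, referring to the relevant pre-computed notes.",
}

def build_test_time_prompt(inferences: str, query: str, level: str = "low") -> str:
    """Build a test-time prompt whose level of detail is set by `level`."""
    return (
        f"Notes prepared earlier about the context:\n{inferences}\n\n"
        f"Question: {query}\n{VERBOSITY[level]}"
    )
```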
The implementation demonstrates:
- Test-time Efficiency: Significant reduction in tokens needed at query time
- Accuracy Improvements: More reliable answers through pre-computed inferences
- Cost Amortization: Greater efficiency as the number of queries increases (plotted in the sketch below)
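The notebook's amortization plot can be reproduced with ordinary matplotlib. A sketch using the same hypothetical token counts as the cost example above:

```python
import matplotlib.pyplot as plt

# Same hypothetical token counts as in the amortization example above.
SLEEP_TOKENS, TOKENS_WITH_SLEEP, TOKENS_WITHOUT = 2000, 150, 900

ns = list(range(1, 21))
with_sleep = [SLEEP_TOKENS / n + TOKENS_WITH_SLEEP for n in ns]
without = [TOKENS_WITHOUT] * len(ns)

plt.plot(ns, with_sleep, label="sleep-time compute (amortized)")
plt.plot(ns, without, linestyle="--", label="standard test-time compute")
plt.xlabel("Queries per context")
plt.ylabel("Average tokens per query")
plt.title("Amortization of sleep-time compute")
plt.legend()
plt.show()
```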

