A lightweight stream summarization library that enables processing and analysis of massive data streams with minimal memory requirements. TinySift implements cutting-edge probabilistic data structures and algorithms designed for scenarios where conventional approaches would be prohibitively expensive or impossible due to memory constraints.
Modern data applications often need to analyze streams that are too large to store in memory or process exhaustively. Whether you're tracking unique visitors on a high-traffic website, monitoring network traffic patterns, or analyzing sensor data from IoT devices, handling that volume exhaustively typically means complex infrastructure or expensive cloud services.
TinySift solves this problem by providing implementations of advanced algorithms that:
- Process unbounded streams in a single pass, making them ideal for live data and files larger than available RAM
- Maintain constant memory usage regardless of stream size
- Offer configurable tradeoffs between accuracy and memory requirements
- Provide probabilistic guarantees on result quality
- Require only Python, with no external dependencies
- Include serialization for saving and loading processing state
- Support combining results from separately processed data chunks (see the sketch below)
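For example, chunks can be summarized independently and then combined. The sketch below assumes a `merge` method on the summary objects; that method name is a guess modeled on the quick-start API further down, so check the actual TinySift documentation:

```python
# Sketch of chunked processing followed by a combine step.
# ReservoirSampling, update(), and get_sample() come from the
# quick-start example; merge() is an ASSUMED method name.
from tiny_sift.algorithms.reservoir import ReservoirSampling

def summarize_chunk(chunk):
    reservoir = ReservoirSampling(size=100)
    for item in chunk:
        reservoir.update(item)
    return reservoir

# Summarize two chunks independently (e.g., on separate workers) ...
left = summarize_chunk(range(0, 50_000))
right = summarize_chunk(range(50_000, 100_000))

# ... then combine them into one summary of the full stream.
combined = left.merge(right)  # hypothetical method name
print(len(combined.get_sample()))
```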
TinySift is ideal for applications where:
- The data stream is too large to store completely
- You need approximate answers to specific questions about the data
- Memory efficiency is a critical constraint
- You can tolerate small, bounded errors in exchange for massive reductions in resource usage
- You need to process data in real-time as it arrives
```bash
pip install tinysift
```
```python
from tiny_sift.algorithms.reservoir import ReservoirSampling

# Create a reservoir that holds a sample of 100 items
reservoir = ReservoirSampling(size=100)

# Process a stream of data
for item in data_stream:
    reservoir.update(item)

# Get the current sample
sample = reservoir.get_sample()
```
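Because only the fixed-size sample lives in memory, the same loop handles inputs far larger than RAM. Continuing the example above (the log file name is just a placeholder):

```python
# Stream a large file through the same reservoir; memory stays at
# O(sample size) no matter how big the file is.
with open("access.log") as f:  # placeholder file name
    for line in f:
        reservoir.update(line)

sample = reservoir.get_sample()  # at most 100 uniformly sampled lines
```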
This guide helps you select the right algorithm based on your specific use case, memory constraints, and accuracy requirements.
| Algorithm | Primary Use Case | Memory Usage | Accuracy | Limitations |
|---|---|---|---|---|
| Reservoir Sampling | Maintain a random sample of fixed size | O(k), where k is the sample size | Exact sampling probability | Sample size cannot be changed dynamically |
| HyperLogLog | Count unique elements | O(2^p), where p is the precision parameter | Standard error ≈ 1.04/√(2^p) | Returns approximate counts only |
| Count-Min Sketch | Estimate item frequencies | O(width × depth) | Overestimates by at most 2N/width with probability at least 1 − 2^(−depth) | Never underestimates, but may overestimate |
| Bloom Filter | Set membership testing | O(m) bits, where m is set by capacity and target error rate | Configurable false positive rate; no false negatives | Cannot remove items or count occurrences |
| Counting Bloom Filter | Set membership with deletion support | O(m × c), where c is bits per counter | Same as Bloom Filter | Higher memory usage than a standard Bloom Filter |
| Exponential Histogram | Sliding-window statistics | O((1/ε) log N) for window size N | Relative error bounded by ε | Approximates statistics over a sliding window |
| Space-Saving | Find top-k frequent items | O(k), where k is the number of counters | Error at most N/k for any item | Tracks only a subset of items |
| T-Digest | Quantile estimation | O(k), where k is the compression factor | Highest accuracy at extreme quantiles | Precision varies across the distribution |
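To make the width × depth tradeoff in the table concrete, here is a minimal Count-Min Sketch written from scratch for illustration. It is not TinySift's implementation, and the salted-hash construction is just one reasonable choice:

```python
import hashlib

class MiniCountMin:
    """Educational Count-Min Sketch; independent of TinySift's code."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # Derive one column index per row from a salted hash of the item.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                repr(item).encode(), salt=row.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest[:8], "little") % self.width

    def update(self, item, count=1):
        for row, col in enumerate(self._indexes(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate cells, so the minimum across rows
        # is the tightest (over-)estimate of the true count.
        return min(self.table[row][col]
                   for row, col in enumerate(self._indexes(item)))

cms = MiniCountMin()
for word in ["a", "b", "a", "c", "a"]:
    cms.update(word)
print(cms.estimate("a"))  # at least 3; equals 3 unless a collision inflates it
```

A wider table lowers the error bound (N/width shrinks); more rows lower the probability of hitting that bound, exactly as in the table above.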
- Need to count unique items? → HyperLogLog
- Need to test set membership? → Bloom Filter (see the sketch after this list)
- Need frequency estimates? → Count-Min Sketch
- Need exact random sampling? → Reservoir Sampling
- Need sliding window accuracy? → Exponential Histogram with smaller error bound
- Need accurate percentiles? → T-Digest with higher compression factor
- Cache implementation with eviction? → Counting Bloom Filter
- Finding trending/popular items? → Space-Saving
- Monitoring metrics over time? → Exponential Histogram
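For the membership-testing branch above, a from-scratch Bloom filter shows why false positives are possible but false negatives are not. This is an educational sketch, not TinySift's implementation:

```python
import hashlib

class MiniBloom:
    """Educational Bloom filter; independent of TinySift's code."""

    def __init__(self, num_bits=8192, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # One bit position per hash function, via personalized hashing.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                repr(item).encode(), person=i.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # All k bits set => "probably present"; any clear bit => definitely absent.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bloom = MiniBloom()
bloom.add("alice")
print("alice" in bloom)  # True
print("bob" in bloom)    # False (or, rarely, a false positive)
```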
Web analytics:
- Unique visitors: HyperLogLog (sketched below)
- Popular pages: Space-Saving
- Session tracking: Bloom Filter (fast membership testing)
- Response time percentiles: T-Digest
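A sketch of the unique-visitors case follows. The module path, class name, `precision` parameter, and `estimate()` accessor are all assumptions modeled on the reservoir quick-start example, not confirmed TinySift API:

```python
from tiny_sift.algorithms.hyperloglog import HyperLogLog  # hypothetical path

hll = HyperLogLog(precision=14)  # hypothetical parameter; ~0.8% standard error

for visitor_id in ["u1", "u2", "u1", "u3"]:  # stand-in for a real visitor stream
    hll.update(visitor_id)

print(hll.estimate())  # hypothetical accessor; approximate distinct count
```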
Network monitoring:
- Unique IP addresses: HyperLogLog
- Heavy hitters (IPs with the most traffic): Space-Saving (sketched below)
- Packet size distribution: T-Digest
- Protocol statistics over time windows: Exponential Histogram
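The heavy-hitters case can be illustrated with a compact from-scratch Space-Saving sketch, shown here for intuition rather than as TinySift's implementation:

```python
def space_saving(stream, k):
    """Track approximate top-k items with only k counters."""
    counters = {}  # item -> (count, overestimation error)
    for item in stream:
        if item in counters:
            count, err = counters[item]
            counters[item] = (count + 1, err)
        elif len(counters) < k:
            counters[item] = (1, 0)
        else:
            # Evict the smallest counter; the newcomer inherits its
            # count, which bounds how much we may be overestimating.
            victim = min(counters, key=lambda key: counters[key][0])
            count, _ = counters.pop(victim)
            counters[item] = (count + 1, count)
    return counters

packets = ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1"]
print(space_saving(packets, k=2))  # 10.0.0.1 surfaces as a heavy hitter
```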
Databases:
- Query result caching: Counting Bloom Filter
- Approximate counts for query planning: Count-Min Sketch
- Sampling for approximate queries: Reservoir Sampling
- Cardinality estimation for joins: HyperLogLog
IoT and sensor data:
- Periodic sampling of sensor data: Reservoir Sampling (sketched below)
- Anomaly detection on sliding windows: Exponential Histogram
- Device tracking with minimal memory: Bloom Filter
- Estimating value distributions: T-Digest
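The sampling pattern behind the first bullet is the classic Algorithm R, shown here from scratch for illustration rather than through TinySift's API:

```python
import random

def reservoir_sample(stream, k):
    """Algorithm R: uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = random.randrange(i + 1)  # item i survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

readings = (x * 0.1 for x in range(1_000_000))  # stand-in for a sensor feed
print(reservoir_sample(readings, k=5))
```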
This project is licensed under the MIT License - see the LICENSE file for details.