# Hybrid Deduplication System

QuarryCore implements a production-grade, two-layer deduplication system designed for high-throughput web content processing with sub-2ms median latency.

## Architecture Overview

The hybrid deduplication system combines two complementary approaches:

1. **Exact Layer**: Canonical HTML → SHA-256 → SQLite WAL storage
2. **Near Layer**: MinHashLSH (datasketch) → Redis backend

```mermaid
graph TD
A[Document Input] --> B[HTML Canonicalization]
B --> C[SHA-256 Hash]
C --> D{Exact Match?}
D -->|Yes| E[Mark as Duplicate]
D -->|No| F[Store Hash in SQLite]
F --> G[7-Shingle Generation]
G --> H[MinHash Signature]
H --> I{Near Match in Redis?}
I -->|Yes| E
I -->|No| J[Store in Redis LSH]
J --> K[Mark as Unique]
```
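The near layer's signature step can be illustrated directly with `datasketch`, which the architecture above names. This is a minimal sketch, not QuarryCore's internal code; `shingles` and `minhash_signature` are hypothetical helper names:

```python
from datasketch import MinHash

def shingles(text: str, k: int = 7) -> set[str]:
    """Overlapping character shingles of length k (7 by default, as configured)."""
    return {text[i : i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for s in shingles(text):
        m.update(s.encode("utf-8"))
    return m

# Near-identical documents yield a high estimated Jaccard similarity,
# which the LSH index compares against the 0.85 threshold.
a = minhash_signature("The quick brown fox jumps over the lazy dog.")
b = minhash_signature("The quick brown fox jumped over the lazy dog.")
print(a.jaccard(b))
```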

## Key Features

- **Fast Exact Detection**: Immediate return on exact duplicates (< 1ms)
- **Near-Duplicate Detection**: 7-character shingles with 85% similarity threshold
- **Resilient Operation**: Graceful degradation when Redis unavailable
- **Prometheus Metrics**: Real-time monitoring and alerting
- **WAL SQLite**: High-concurrency exact hash storage
- **Canonical HTML**: Consistent hashing across equivalent content

## Configuration

### Environment Variables

The deduplication system supports environment variable overrides with the `QUARRY_DEDUP_*` prefix:

```bash
# SQLite exact hash layer
QUARRY_DEDUP_HYBRID_SQLITE_PATH="data/dedup.db"

# Redis MinHashLSH layer
QUARRY_DEDUP_HYBRID_REDIS_URL="redis://localhost:6379/0"

# MinHash configuration
QUARRY_DEDUP_HYBRID_MINHASH_SHINGLE_SIZE=7
QUARRY_DEDUP_HYBRID_MINHASH_NUM_PERM=128
QUARRY_DEDUP_HYBRID_MINHASH_THRESHOLD=0.85
QUARRY_DEDUP_HYBRID_MINHASH_ENABLED=true
```
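How the prefix maps to fields depends on QuarryCore's settings loader, which is not shown in this document. The following pydantic-v1-style sketch illustrates the usual pattern only; the class and field names are hypothetical, chosen to mirror the variables above:

```python
from pydantic import BaseSettings  # pydantic v1 API

class HybridDedupSettings(BaseSettings):
    hybrid_sqlite_path: str = "data/dedup.db"
    hybrid_redis_url: str = "redis://localhost:6379/0"
    hybrid_minhash_shingle_size: int = 7
    hybrid_minhash_num_perm: int = 128
    hybrid_minhash_threshold: float = 0.85
    hybrid_minhash_enabled: bool = True

    class Config:
        # e.g. QUARRY_DEDUP_HYBRID_MINHASH_THRESHOLD=0.9 overrides the default
        env_prefix = "QUARRY_DEDUP_"

settings = HybridDedupSettings()
```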

### YAML Configuration

```yaml
dedup:
  hybrid:
    sqlite_path: "data/dedup.db"
    redis_url: "redis://localhost:6379/0"
    minhash:
      shingle_size: 7
      num_perm: 128
      threshold: 0.85
      enabled: true
```

### Configuration Reference

| Parameter | Default | Description |
|-----------|---------|-------------|
| `sqlite_path` | `data/dedup.db` | SQLite database path for exact hash storage |
| `redis_url` | `redis://localhost:6379/0` | Redis URL for MinHashLSH backend |
| `minhash_shingle_size` | `7` | Character shingle size for MinHash |
| `minhash_num_perm` | `128` | Number of permutations for MinHash |
| `minhash_threshold` | `0.85` | Jaccard similarity threshold (0.0-1.0) |
| `minhash_enabled` | `true` | Enable/disable near-duplicate detection |

## Database Schema

### SQLite Exact Hash Table

```sql
CREATE TABLE hash_dedup (
    hash TEXT PRIMARY KEY,
    url TEXT NOT NULL,
    timestamp REAL NOT NULL,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_hash_dedup_timestamp ON hash_dedup(timestamp);
CREATE INDEX idx_hash_dedup_url ON hash_dedup(url);
```
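Against this schema, the exact-layer check reduces to a single atomic statement. A minimal sketch, not QuarryCore's actual implementation (`is_exact_duplicate` is a hypothetical helper):

```python
import sqlite3
import time

def is_exact_duplicate(conn: sqlite3.Connection, sha256_hex: str, url: str) -> bool:
    """Return True if the hash was already seen; otherwise record it.

    INSERT OR IGNORE is atomic, so concurrent writers cannot both "win".
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO hash_dedup (hash, url, timestamp) VALUES (?, ?, ?)",
        (sha256_hex, url, time.time()),
    )
    return cur.rowcount == 0  # 0 rows inserted -> hash already present
```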

### Redis Keys

The MinHashLSH system uses Redis keys with the `lsh:*` prefix:

- `lsh:b:{band_id}:{band_hash}` - LSH band buckets
- `lsh:sig:{doc_id}` - MinHash signatures
- `lsh:meta` - Index metadata
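The band-bucket keys are derived by splitting a signature into bands and hashing each chunk. A sketch of that mapping; the 8-band split and the `band_keys` helper are illustrative, since datasketch derives the actual band/row split from `threshold` and `num_perm`:

```python
import hashlib
import struct

def band_keys(signature: list[int], bands: int = 8) -> list[str]:
    """Map a MinHash signature to one bucket key per LSH band."""
    rows = len(signature) // bands
    keys = []
    for band_id in range(bands):
        chunk = signature[band_id * rows : (band_id + 1) * rows]
        # Hash the band's 64-bit values into a compact bucket identifier
        band_hash = hashlib.sha1(struct.pack(f"{rows}Q", *chunk)).hexdigest()[:16]
        keys.append(f"lsh:b:{band_id}:{band_hash}")
    return keys
```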

## API Reference

### Simple Boolean API

```python
from quarrycore.dedup.hybrid_dedup import HybridDeduplicator, DedupConfig, ExtractResult

# Initialize deduplicator
config = DedupConfig()
deduplicator = HybridDeduplicator(config)

# Check for duplicates
doc = ExtractResult(
    text="Document content",
    html="<html>...</html>",
    url="https://example.com",
)

is_duplicate = await deduplicator.is_duplicate(doc)
print(f"Document is duplicate: {is_duplicate}")
```

### Legacy Compatibility

For existing codebases using the rich DuplicationResult API:

```python
from quarrycore.dedup.hybrid_dedup import check_duplicates_legacy

# Use with existing ExtractedContent and ContentMetadata
result = await check_duplicates_legacy(deduplicator, content, metadata)
print(f"Is duplicate: {result.is_duplicate}")
print(f"Processing time: {result.processing_time_ms}ms")
```

## HTML Canonicalization

The exact layer uses deterministic HTML canonicalization to ensure consistent hashing across equivalent content; a code sketch follows the list of transformations below.

### Transformations Applied

1. **Script/Style Removal**: Complete removal of `<script>` and `<style>` tags and content
2. **Whitespace Normalization**: Collapse multiple spaces/newlines to single space
3. **Entity Decoding**: Convert common HTML entities (`&amp;`, `&lt;`, etc.)
4. **Tag Stripping**: Remove all HTML tags, keeping only text content
5. **Trim**: Remove leading and trailing whitespace
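A stdlib-only sketch of these five steps. The real canonicalizer may differ (for example, it may use an HTML parser); note that tags are stripped before entities are decoded, so decoded `&lt;script&gt;` text cannot be mistaken for markup:

```python
import html
import re

def canonicalize(raw_html: str) -> str:
    """Approximate the five canonicalization steps (comment numbers match the list)."""
    # 1. Remove <script>/<style> blocks, including their content
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", raw_html)
    # 4. Strip remaining tags, keeping only text content
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # 3. Decode HTML entities (&amp;, &lt;, ...)
    text = html.unescape(text)
    # 2. Collapse runs of whitespace to a single space
    text = re.sub(r"\s+", " ", text)
    # 5. Trim leading and trailing whitespace
    return text.strip()

print(canonicalize("<p>Test &amp;  Example</p><script>x()</script>"))
# -> "Test & Example"
```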

### Example

```html
<!-- Input -->
<html>
  <head>
    <title>Test &amp; Example</title>
    <script>alert('remove me');</script>
    <style>body { color: red; }</style>
  </head>
  <body>
    <h1>Hello World</h1>
    <p>Content here</p>
  </body>
</html>

<!-- Canonical Output -->
Test & Example Hello World Content here
```

## Performance Characteristics

### Latency Targets

- **Exact Duplicates**: < 1ms (immediate SQLite lookup)
- **Near Duplicates**: < 2ms (MinHash + Redis LSH)
- **Overall**: ≤ 2ms median across both layers

### Throughput

- **Exact Layer**: 10,000+ documents/second (SQLite WAL)
- **Near Layer**: 5,000+ documents/second (Redis-limited)
- **Combined**: 5,000+ documents/second

### Memory Usage

- **Canonical Processor**: ~10MB baseline
- **SQLite**: ~40MB cache (configurable)
- **Redis**: Depends on document volume (~1KB per document)

## Monitoring and Metrics

### Prometheus Metrics

```prometheus
# Exact duplicate hits
dedup_exact_hits_total

# Near duplicate hits
dedup_near_hits_total

# Processing latency by layer
dedup_latency_seconds{layer="exact|near|total"}

# Processing errors
dedup_errors_total
```
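If these series are registered with the standard `prometheus_client` package, the definitions would look roughly like this (a sketch; the `layer` label values follow the list above, everything else is an assumption, and counters gain the `_total` suffix on export):

```python
from prometheus_client import Counter, Histogram

DEDUP_EXACT_HITS = Counter("dedup_exact_hits", "Exact duplicate hits")
DEDUP_NEAR_HITS = Counter("dedup_near_hits", "Near duplicate hits")

DEDUP_LATENCY = Histogram(
    "dedup_latency_seconds",
    "Deduplication latency by layer",
    ["layer"],  # "exact", "near", or "total"
)

# Time the exact-layer lookup and record it under layer="exact"
with DEDUP_LATENCY.labels(layer="exact").time():
    ...  # exact-layer SQLite lookup goes here
```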

### Example Queries

```promql
# Duplicate detection rate
rate(dedup_exact_hits_total[5m]) + rate(dedup_near_hits_total[5m])

# 95th percentile latency
histogram_quantile(0.95, rate(dedup_latency_seconds_bucket[5m]))

# Error rate
rate(dedup_errors_total[5m])
```

### Logs

The system uses structured logging with JSON output:

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "event": "exact_duplicate_detected",
  "url": "https://example.com/page",
  "duration_ms": 0.8,
  "hash": "a1b2c3d4..."
}
```
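The `event`/`timestamp` key shape suggests `structlog`-style logging, though the document does not name the library. A minimal sketch, under that assumption, that produces records like the one above:

```python
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso", key="timestamp"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger()
log.info("exact_duplicate_detected", url="https://example.com/page", duration_ms=0.8)
```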

## Deployment

### Redis Requirements

- **Version**: Redis 5.0+ (supports streams)
- **Memory**: Plan for ~1KB per unique document
- **Persistence**: RDB snapshots recommended for LSH recovery
- **Networking**: Low latency connection to application

### SQLite Configuration

```sql
-- Production SQLite pragmas (applied automatically)
PRAGMA journal_mode=WAL; -- Write-Ahead Logging
PRAGMA synchronous=NORMAL; -- Safe with WAL
PRAGMA cache_size=10000; -- 40MB cache
PRAGMA temp_store=MEMORY; -- In-memory temp tables
```
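The document states these pragmas are applied automatically; for reference, the equivalent stdlib calls would be as follows (a sketch, with `open_dedup_db` as a hypothetical helper):

```python
import sqlite3

def open_dedup_db(path: str = "data/dedup.db") -> sqlite3.Connection:
    """Open the exact-hash store with the production pragmas above."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=NORMAL")
    conn.execute("PRAGMA cache_size=10000")  # ~40MB with 4KB pages
    conn.execute("PRAGMA temp_store=MEMORY")
    return conn
```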

### High Availability

1. **Redis Fallback**: Uses `fakeredis` when Redis is unavailable (see the sketch after this list)
2. **SQLite WAL**: Concurrent readers during writes
3. **Circuit Breakers**: Automatic failure detection and recovery
4. **Graceful Degradation**: Pipeline continues with reduced functionality
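The fallback pattern from item 1 can be sketched as follows. This is not QuarryCore's actual code; `connect_redis` is a hypothetical helper, and fakeredis state is process-local and lost on restart:

```python
import redis.asyncio as redis
import fakeredis.aioredis

async def connect_redis(url: str):
    """Connect to Redis, falling back to an in-process fakeredis instance."""
    client = redis.from_url(url)
    try:
        await client.ping()
        return client
    except (redis.ConnectionError, OSError):
        # Degraded mode: LSH state is kept in-process and is non-persistent
        return fakeredis.aioredis.FakeRedis()
```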

## Troubleshooting

### Common Issues

#### Redis Connection Failures

```bash
# Check Redis connectivity
redis-cli -u redis://localhost:6379/0 ping

# Monitor Redis logs
tail -f /var/log/redis/redis-server.log
```

**Solution**: The system automatically falls back to `fakeredis` and logs warnings once per minute.

#### SQLite Lock Errors

```
sqlite3.OperationalError: database is locked
```

**Solution**: Ensure WAL mode is enabled and no long-running transactions block writers.

#### High Memory Usage

Monitor Redis memory with:

```bash
redis-cli info memory
```

**Solution**: Implement TTL on LSH keys or periodic cleanup of old entries.
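A hypothetical maintenance helper using standard `redis-py` calls. Note that expiring band buckets also removes their members from candidate lookups, so expired documents can no longer be matched as near-duplicates:

```python
import redis

def expire_lsh_keys(client: redis.Redis, ttl_seconds: int = 30 * 24 * 3600) -> int:
    """Attach a TTL to every `lsh:*` key so Redis can evict stale entries."""
    touched = 0
    for key in client.scan_iter(match="lsh:*", count=1000):
        if client.expire(key, ttl_seconds):  # returns True if the TTL was set
            touched += 1
    return touched
```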

### Performance Tuning

#### SQLite Optimization

```python
# Only the database path needs to be set explicitly; the SQLite cache
# size is chosen automatically based on available memory.
config = DedupConfig(sqlite_path="data/dedup.db")
```

#### Redis Optimization

```conf
# redis.conf directives
maxmemory 2gb
maxmemory-policy allkeys-lru
```

#### MinHash Tuning

```yaml
dedup:
  hybrid:
    minhash:
      # Higher threshold = fewer false positives, more duplicates missed
      threshold: 0.90  # More strict

      # More permutations = better accuracy, slower processing
      num_perm: 256  # More accurate
```
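The trade-off behind these two knobs can be made concrete with the standard LSH collision curve: the probability that two documents with Jaccard similarity *s* become candidates is 1 − (1 − s^r)^b for *b* bands of *r* rows. A sketch under an assumed 8-bands × 16-rows split (datasketch derives the actual split from `threshold` and `num_perm`):

```python
def collision_probability(s: float, bands: int, rows: int) -> float:
    """P(two docs with Jaccard similarity s share at least one LSH band)."""
    return 1.0 - (1.0 - s ** rows) ** bands

# 128 permutations split into 8 bands of 16 rows (illustrative split)
for s in (0.5, 0.7, 0.85, 0.95):
    print(f"s={s:.2f}  P(candidate)={collision_probability(s, 8, 16):.4f}")
```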

## Migration Guide

### From Legacy Deduplicator

The hybrid system is backward compatible with existing APIs:

```python
# Old code (still works)
from quarrycore.deduplicator import MultiLevelDeduplicator
deduplicator = MultiLevelDeduplicator(config)
result = await deduplicator.check_duplicates(content, metadata)

# New code (recommended)
from quarrycore.dedup.hybrid_dedup import HybridDeduplicator
deduplicator = HybridDeduplicator(config)
is_duplicate = await deduplicator.is_duplicate(doc)
```

### Database Migration

If migrating from an existing system:

1. **Backup existing data**
2. **Run migration script** (optional): `scripts/migrate_dedup_v2.py`
3. **Update configuration** to use hybrid system
4. **Monitor metrics** during transition

## Testing

### Unit Tests

```bash
# Run deduplication tests
pytest tests/unit/test_canonical_html.py -v
pytest tests/unit/test_hash_db.py -v
pytest tests/unit/test_minhash_redis.py -v
```

### Integration Tests

```bash
# End-to-end deduplication flow
pytest tests/functional/test_dedup_flow.py -v
```

### Performance Tests

```bash
# Benchmark latency requirements
pytest tests/benchmark/test_dedup_perf.py -v
```

## Security Considerations

- **Input Validation**: All HTML content is sanitized during canonicalization
- **Resource Limits**: Redis and SQLite have configurable memory/disk limits
- **Access Control**: Redis should not be exposed to public networks
- **Data Privacy**: Hashes are one-way, original content not stored in dedup layer

## Future Enhancements

- **Semantic Deduplication**: Optional GPU-accelerated embedding similarity
- **Distributed Redis**: Redis Cluster support for horizontal scaling
- **Content-Type Specific**: Different thresholds per content type
- **Adaptive Thresholds**: Machine learning-based similarity tuning