Commit 42ede23

Merge pull request #4 from shua-ie/feat/quality-assessment-gpu
Feat/quality assessment gpu
2 parents 4bf1caf + 80181e3 commit 42ede23

28 files changed: +1683 / -108 lines

.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion

@@ -11,7 +11,7 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-latest, windows-latest]
-        python-version: ["3.11", "3.12"]
+        python-version: ["3.11"]

     runs-on: ${{ matrix.os }}

docs/gpu_quality_assessment.md

Lines changed: 90 additions & 0 deletions

# GPU-Accelerated Quality Assessment

QuarryCore now supports GPU acceleration for quality assessment, providing significant performance improvements when CUDA-capable hardware is available.

## Features

- **Automatic GPU Detection**: Automatically detects and uses CUDA when available
- **Transparent Fallback**: Seamlessly falls back to CPU when CUDA is not available
- **Configurable Device Selection**: Choose between `auto`, `cpu`, or `cuda` modes
- **Performance Monitoring**: Built-in metrics for tracking scorer latency and errors

## Configuration

Configure the device in your `config.yaml`:

```yaml
quality:
  device: auto  # Options: "auto", "cpu", "cuda"
  min_content_length: 50
  max_content_length: 50000
```

Or use environment variables:

```bash
export QUARRY_QUALITY_DEVICE=cuda
```

## Device Options

- **`auto`** (default): Automatically selects CUDA if available, otherwise CPU
- **`cpu`**: Forces CPU execution even if CUDA is available
- **`cuda`**: Uses CUDA if available, falls back to CPU with a warning
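
As an illustration of the selection rules above (an editorial sketch, not part of the committed file), the behaviour could look roughly like this, assuming PyTorch is the CUDA probe; `resolve_device` is a hypothetical helper name:

```python
# Minimal sketch of the auto/cpu/cuda selection described above.
import warnings

import torch


def resolve_device(setting: str = "auto") -> str:
    """Map the configured 'auto'/'cpu'/'cuda' value to a concrete device string."""
    if setting == "cpu":
        return "cpu"  # forced CPU, even when CUDA is present
    if torch.cuda.is_available():
        return "cuda"  # both "auto" and "cuda" prefer the GPU when it exists
    if setting == "cuda":
        warnings.warn("CUDA requested but not available; falling back to CPU")
    return "cpu"
```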

## Performance Metrics

The GPU-accelerated scorer tracks performance metrics:

- `quarrycore_quality_scorer_latency_seconds`: Latency histogram by scorer type
- `quarrycore_quality_scorer_errors_total`: Error counter by scorer type
- `quarrycore_quality_reject_total`: Documents rejected due to low quality
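
The metric names above come straight from this document; the snippet below is only a sketch of how such metrics could be declared with `prometheus_client` (already pinned in the requirements). The `scorer` label name and the usage at the end are assumptions, not the project's actual instrumentation.

```python
# Sketch only: declaring the metrics listed above with prometheus_client.
from prometheus_client import Counter, Histogram

SCORER_LATENCY = Histogram(
    "quarrycore_quality_scorer_latency_seconds",
    "Quality scorer latency by scorer type",
    ["scorer"],
)
# prometheus_client appends the _total suffix to Counter samples automatically.
SCORER_ERRORS = Counter(
    "quarrycore_quality_scorer_errors",
    "Quality scorer errors by scorer type",
    ["scorer"],
)
QUALITY_REJECTS = Counter(
    "quarrycore_quality_reject",
    "Documents rejected due to low quality",
)

# Example: time one scoring call and count a rejection.
with SCORER_LATENCY.labels(scorer="transformer").time():
    score = 0.42  # placeholder for the real scoring call
if score < 0.6:
    QUALITY_REJECTS.inc()
```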

## Requirements

### CPU Mode
- No additional requirements
- Works on all systems

### GPU Mode
- NVIDIA GPU with CUDA support
- CUDA toolkit installed
- PyTorch with CUDA support
- Sufficient GPU memory (typically < 1GB for small models)

## Performance Expectations

When CUDA is available:
- **Median latency**: ≤ 25ms for 1KB English text
- **Memory usage**: < 300MB RAM delta after model initialization
- **GPU memory**: < 1GB for sentence-transformers/all-MiniLM-L6-v2
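
One rough way to sanity-check the latency figure, timing only the embedding step (an editorial sketch, not the project's benchmark; it assumes the scorer embeds text with the sentence-transformers/all-MiniLM-L6-v2 model named above and that PyTorch with CUDA is installed):

```python
# Sketch: time repeated embeddings of ~1 KB of text and report the median.
import statistics
import time

import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=device)

text = "quality assessment " * 55  # roughly 1 KB of English text
model.encode(text)  # warm-up: model load, CUDA context, kernel launch

samples = []
for _ in range(50):
    start = time.perf_counter()
    model.encode(text)
    samples.append((time.perf_counter() - start) * 1000.0)

print(f"device={device} median latency={statistics.median(samples):.1f} ms")
```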

## Testing

### Unit Tests (No GPU Required)
```bash
export QUARRY_QUALITY_DEVICE=cpu
pytest tests/unit/quality/test_transformer_gpu.py -v
```

### Performance Tests (GPU Required)
```bash
export QUARRY_QUALITY_DEVICE=cuda
pytest tests/performance/test_quality_gpu_perf.py -v -m requires_cuda
```

## Troubleshooting

### CUDA Not Detected
- Verify CUDA installation: `nvidia-smi`
- Check PyTorch CUDA support: `python -c "import torch; print(torch.cuda.is_available())"`

### Out of Memory Errors
- Reduce batch size in configuration
- Use a smaller model
- Clear GPU cache periodically
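
A concrete form of the last suggestion, assuming PyTorch (sketch only):

```python
# Release cached GPU memory blocks back to the driver and report what remains.
import torch

if torch.cuda.is_available():
    torch.cuda.empty_cache()
    allocated_mb = torch.cuda.memory_allocated() / 1024**2
    print(f"GPU memory still held by live tensors: {allocated_mb:.0f} MiB")
```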

### Performance Issues
- Ensure GPU is not being used by other processes
- Check GPU utilization with `nvidia-smi`
- Consider using CPU mode for small workloads where GPU overhead isn't justified

pyproject.toml

Lines changed: 6 additions & 6 deletions

@@ -1,13 +1,13 @@
 [build-system]
-requires = ["setuptools>=65.0", "wheel"]
-build-backend = "setuptools.build_meta"
+requires = ["hatchling"]
+build-backend = "hatchling.build"

 [project]
 name = "quarrycore"
-version = "0.1.0"
-description = "Production-grade AI training data miner with adaptive hardware optimization"
+dynamic = ["version"]
+description = "A production-grade asynchronous web content extraction and processing framework"
 readme = "README.md"
-license = {text = "MIT"}
+license = "MIT"
 authors = [
     {name = "Your Name", email = "josh.mcd31@gmail.com"}
 ]
@@ -32,7 +32,7 @@ classifiers = [
     "Topic :: Internet :: WWW/HTTP :: Indexing/Search",
     "Topic :: Software Development :: Libraries :: Python Modules",
 ]
-requires-python = ">=3.11,<3.13"
+requires-python = ">=3.11,<3.12"
 dependencies = [
     # Core async and HTTP (following workflow specs)
     "httpx[http2]>=0.25.0",

requirements.txt

Lines changed: 2 additions & 2 deletions

@@ -97,9 +97,9 @@ multiprocess==0.70.16
 murmurhash==1.0.13
 mypy==1.16.1
 mypy_extensions==1.1.0
-networkx==3.5
+networkx==3.3
 nodeenv==1.9.1
-numpy==2.3.1
+numpy==1.24.4
 nvidia-cublas-cu12==12.6.4.1
 nvidia-cuda-cupti-cu12==12.6.80
 nvidia-cuda-nvrtc-cu12==12.6.77

requirements_py310.txt

Lines changed: 43 additions & 0 deletions

# Core dependencies compatible with Python 3.10
aiofiles==24.1.0
aiosqlite==0.20.0
httpx==0.29.0
structlog==24.5.0
pydantic==2.10.7
pydantic-settings==2.7.1
prometheus-client==0.22.0
PyYAML==6.0.2
lxml==5.3.0
selectolax==0.3.30
trafilatura==2.0.0
beautifulsoup4==4.12.3
ftfy==6.4.0
fasttext-wheel==0.9.2
sentence-transformers==3.0.0
psutil==6.1.1
tenacity==9.0.0
uvloop==0.21.0
async-lru==2.0.4
markdownify==0.14.1
humanize==4.11.0
polars==1.20.0
rich==14.0.0
python-dotenv==1.1.1
pytest==8.3.6
pytest-asyncio==0.24.0
pytest-cov==6.0.0
pytest-mock==3.14.0
pytest-benchmark==6.0.0
hypothesis==6.130.2
black==24.10.0
ruff==0.12.0
mypy==1.16.1
isort==5.13.2
flake8==7.1.1
pre-commit==4.0.1
# ML dependencies
torch==2.2.2
numpy==1.24.4
scipy==1.10.1
scikit-learn==1.3.2
networkx==3.1

src/quarrycore/cli.py

Lines changed: 22 additions & 18 deletions

@@ -184,7 +184,7 @@ async def run_pipeline() -> None:
                 )
             )
         else:
-            print(json.dumps(health, indent=2))
+            logger.info("Health check", health=health)
             return

         if console and Panel:
@@ -216,14 +216,16 @@ async def run_standard_pipeline(
     """Run pipeline with progress bars."""
     if not Progress or not console:
         # Fallback to simple text output
-        print(f"Processing {len(url_list)} URLs...")
+        logger.info("Processing URLs", count=len(url_list))
         result = await pipeline.run(
             urls=url_list,
             batch_size=batch_size,
             checkpoint_interval=checkpoint_interval,
             resume_from=resume_from,
         )
-        print(f"Completed: {result.get('processed_count', 0)} processed, {result.get('failed_count', 0)} failed")
+        logger.info(
+            "Pipeline completed", processed=result.get("processed_count", 0), failed=result.get("failed_count", 0)
+        )
         return

     progress = Progress(
@@ -311,11 +313,13 @@ def create_status_table() -> Any:

         while not pipeline_task.done():
             if pipeline.state:
-                print(f"Status: {pipeline.state.stage.value}, Processed: {pipeline.state.processed_count}")
+                logger.info(
+                    "Pipeline status", stage=pipeline.state.stage.value, processed=pipeline.state.processed_count
+                )
             await asyncio.sleep(1)

         await pipeline_task
-        print("Pipeline completed successfully!")
+        logger.info("Pipeline completed successfully")
         return

     with Live(create_status_table(), refresh_per_second=2, console=console) as live:
@@ -404,9 +408,9 @@ async def run_analysis() -> None:
     else:
         formatted_result = json.dumps(analysis_result, indent=2)
         if console:
-            console.print(formatted_result)
+            console.logger.info("Analysis result", result=formatted_result)
         else:
-            print(formatted_result)
+            logger.info("Analysis result", result=formatted_result)

     if output:
         Path(output).write_text(formatted_result)
@@ -462,28 +466,28 @@ async def validate() -> None:

             console.print(table)
         else:
-            print("Configuration Status:")
+            logger.info("Configuration Status")
             for key, value in health.items():
                 status = "✅ OK" if value else "❌ Error"
-                print(f"{key}: {status}")
+                logger.info("Config item", key=key, status=status)

         if all(health.values()):
             if console:
                 console.print("[green]✅ Configuration is valid![/green]")
             else:
-                print("✅ Configuration is valid!")
+                logger.info("Configuration is valid")
         else:
             if console:
                 console.print("[red]❌ Configuration has issues![/red]")
             else:
-                print("❌ Configuration has issues!")
+                logger.error("Configuration has issues")
             sys.exit(1)

     except Exception as e:
         if console:
             console.print(f"[red]❌ Configuration validation failed: {e}[/red]")
         else:
-            print(f"❌ Configuration validation failed: {e}")
+            logger.error("Configuration validation failed", error=str(e))
         sys.exit(1)

     asyncio.run(validate())
@@ -510,15 +514,15 @@ async def check_health() -> None:
             for key, value in health_status.items():
                 metric_value = 1 if value else 0
                 if console:
-                    console.print(f"quarrycore_health_{key} {metric_value}")
+                    console.logger.info("Health metric", metric=f"quarrycore_health_{key}", value=metric_value)
                 else:
-                    print(f"quarrycore_health_{key} {metric_value}")
+                    logger.info("Health metric", metric=f"quarrycore_health_{key}", value=metric_value)
         else:
            output = json.dumps(health_status, indent=2)
            if console:
-               console.print(output)
+               console.logger.info("Health output", output=output)
            else:
-               print(output)
+               logger.info("Health output", output=output)

        # Exit with non-zero if unhealthy
        if not all(health_status.values()):
@@ -527,9 +531,9 @@
     except Exception as e:
         error_output = json.dumps({"error": str(e)})
         if console:
-            console.print(error_output)
+            console.logger.error("Health check error", output=error_output)
         else:
-            print(error_output)
+            logger.error("Health check error", output=error_output)
         sys.exit(1)

     asyncio.run(check_health())
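
The hunks above replace `print()` calls with structured log events. A minimal sketch of the key-value logging pattern they rely on, assuming structlog is configured at application start-up (structlog is already a project dependency); the values passed are illustrative:

```python
# Sketch: structured, key-value logging in place of interpolated print() strings.
import structlog

logger = structlog.get_logger(__name__)

logger.info("Processing URLs", count=128)                  # illustrative values
logger.info("Pipeline completed", processed=120, failed=8)
logger.error("Configuration validation failed", error="missing storage path")
```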

src/quarrycore/config/config.py

Lines changed: 4 additions & 0 deletions

@@ -231,6 +231,10 @@ class QualityConfig(BaseModel):

     min_content_length: int = Field(default=50, description="Minimum word count to perform quality assessment.")
     max_content_length: int = Field(default=50000, description="Maximum word count to perform quality assessment.")
+    min_score: float = Field(default=0.6, ge=0, le=1, description="Minimum quality score to accept content.")
+    device: Literal["auto", "cpu", "cuda"] = Field(
+        default="auto", description="Device to use for neural scoring (auto, cpu, or cuda)"
+    )
     default: DomainQualityConfig = Field(default_factory=DomainQualityConfig)
     domains: dict[DomainType, DomainQualityConfig] = Field(
         default_factory=dict, description="Domain-specific quality thresholds."
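
For illustration (not part of the commit): how the two new `QualityConfig` fields behave, assuming the module path shown in the file header; pydantic enforces the `Literal` choices and the `ge`/`le` bounds at construction time.

```python
# Sketch: exercising the new device and min_score fields.
from quarrycore.config.config import QualityConfig

cfg = QualityConfig()  # defaults: device="auto", min_score=0.6
print(cfg.device, cfg.min_score)

gpu_cfg = QualityConfig(device="cuda", min_score=0.8)

# Both of these would raise pydantic.ValidationError:
#   QualityConfig(device="tpu")    # not one of "auto", "cpu", "cuda"
#   QualityConfig(min_score=1.5)   # violates le=1
```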

src/quarrycore/dataset/analytics.py

Lines changed: 6 additions & 3 deletions

@@ -8,11 +8,14 @@
 from typing import Any, Dict, List, Tuple

 import numpy as np
+import structlog
 from transformers import AutoTokenizer

 from quarrycore.config.config import DatasetConfig
 from quarrycore.protocols import ContentMetadata, QualityScore

+logger = structlog.get_logger(__name__)
+

 class Analytics:
     """Computes and presents analytics on a final dataset."""
@@ -101,6 +104,6 @@ def pretty_print_report(self, report: Dict[str, Any]) -> None:
         """Prints the analytics report in a readable format."""
         import json

-        print("\n--- Dataset Analytics Report ---")
-        print(json.dumps(report, indent=2))
-        print("------------------------------\n")
+        logger.info("\n--- Dataset Analytics Report ---")
+        logger.info(json.dumps(report, indent=2))
+        logger.info("------------------------------\n")
