v0.0.41
New Pre-Processing Module: Squirrel 🚀
Version 0.0.41 introduces Squirrel, a new pre-processing module for automated document pruning. Squirrel accepts or rejects documents by applying predefined rules and thresholds, so documents no longer need to be reviewed manually. It supports multiple pruning strategies; this release includes two:
- Embedding-Based Pruning: Filters documents by their distance from a reference centroid in embedding space. Documents within a specified distance threshold are retained; those farther away are rejected.
- LLM-Based Pruning: Uses a large language model to further refine pruning. Squirrel runs multiple voting trials, each an LLM evaluation of the document, and uses the votes to decide whether the document is accepted or rejected.
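
To make the two strategies concrete, here is a minimal sketch of how they might work. It does not use Squirrel's actual API: the function names (`prune_by_embedding`, `prune_by_vote`), the `judge` callable that stands in for an LLM call, the Euclidean distance metric, and the strict-majority voting rule are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty set of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one reference vector")
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance (an assumed metric; the real module may differ)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def prune_by_embedding(
    embeddings: Sequence[Vector],
    reference: Sequence[Vector],
    threshold: float,
) -> List[int]:
    """Return indices of documents within `threshold` of the reference centroid."""
    center = centroid(reference)
    return [i for i, e in enumerate(embeddings) if euclidean(e, center) <= threshold]


def prune_by_vote(document: str, judge: Callable[[str], bool], trials: int = 5) -> bool:
    """Accept a document if a strict majority of `trials` judge calls vote yes.

    `judge` is a hypothetical stand-in for one LLM evaluation of the document.
    """
    if trials < 1:
        raise ValueError("trials must be at least 1")
    yes_votes = sum(1 for _ in range(trials) if judge(document))
    return yes_votes > trials / 2


# Usage with toy data: the reference centroid is (1.0, 0.0).
reference = [[0.0, 0.0], [2.0, 0.0]]
docs = [[1.0, 0.5], [4.0, 4.0]]
kept = prune_by_embedding(docs, reference, threshold=1.0)  # keeps index 0 only

# A deterministic stub replaces the LLM so the sketch runs offline.
accepted = prune_by_vote("a long enough document", lambda d: len(d) > 10, trials=3)
```

In practice the judge would be non-deterministic, which is why running several trials and aggregating the votes is useful; with an odd number of trials a strict majority never ties.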