v0.0.40
🔧 Vulture Enhancements
- Expanded Standard Cleaning Functions:
  Added the following to Vulture’s standard text cleaning pipeline:
  - `remove_numbers`: Removes stand-alone numbers.
  - `remove_alphanumeric`: Removes mixed alphanumeric terms (e.g., `abc123`).
  - `remove_roman_numerals`: Removes Roman numeral listings.
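
To make the intent of these three cleaners concrete, here is a minimal sketch of what such token filters could look like. It is not Vulture's implementation: the function bodies, the regular expressions, the `clean` helper, and the assumption that input is already lowercased and whitespace-delimited are all illustrative.

```python
import re

# Illustrative stand-ins only; these are not Vulture's actual implementations.
# Assumption: text is already lowercased, punctuation-free, and whitespace-delimited.

NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
# Strict Roman numeral grammar (1 to 3999); the lookahead rejects the empty match.
ROMAN = re.compile(
    r"(?=[ivxlcdm])m{0,3}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3})",
    re.IGNORECASE,
)


def _keep(text, drop):
    """Rebuild the text from the tokens for which drop(token) is false."""
    return " ".join(tok for tok in text.split() if not drop(tok))


def remove_numbers(text):
    """Drop stand-alone numbers such as 12 or 3,400."""
    return _keep(text, lambda tok: NUMBER.fullmatch(tok) is not None)


def remove_alphanumeric(text):
    """Drop tokens that mix letters and digits, such as abc123."""
    def mixed(tok):
        return any(c.isalpha() for c in tok) and any(c.isdigit() for c in tok)
    return _keep(text, mixed)


def remove_roman_numerals(text):
    """Drop Roman numerals of two or more characters (a lone "i" is kept as a word)."""
    return _keep(text, lambda tok: len(tok) > 1 and ROMAN.fullmatch(tok) is not None)


def clean(text):
    """Apply the three steps in sequence (hypothetical helper, not a Vulture API)."""
    for step in (remove_alphanumeric, remove_numbers, remove_roman_numerals):
        text = step(text)
    return text
```

One design caveat of any pattern-based Roman numeral filter: ordinary words that happen to be valid numerals (for example `mix`) are removed too, so a real pipeline may want an allow-list.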
🐆 Cheetah Additions
- `term_generator`:
  - Extracts top keywords from cleaned text using TF-IDF.
  - Pairs keywords with nearby support terms based on a co-occurrence matrix.
  - Saves output as a structured markdown file of search terms.
- `CheetahTermFormatter`:
  - Parses markdown search term files into structured blocks with optional filters (e.g., `positives`, `negatives`).
  - Supports plain string output or category-based filtering.
  - Can generate substitution maps to convert multi-word phrases into underscored versions and back.
- `convert_txt_to_cheetah_markdown`:
  - Converts plain `.txt` files or structured term dictionaries into Cheetah-compatible markdown format.
  - Facilitates easier programmatic creation and editing of search term files.
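
The keyword-plus-support-term idea behind `term_generator` can be sketched in plain Python as follows. This is a rough illustration under stated assumptions, not the Cheetah implementation: the function names, the smoothed IDF formula, the sliding-window co-occurrence count, and the markdown layout produced by `to_markdown` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def top_keywords(docs, k=3):
    """Rank terms by TF-IDF summed over all documents (ties broken alphabetically)."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    scores = defaultdict(float)
    for toks in tokenized:
        for term, count in Counter(toks).items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1  # smoothed IDF
            scores[term] += (count / len(toks)) * idf
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def cooccurrence(docs, window=2):
    """Count how often two different terms appear within `window` tokens of each other."""
    counts = defaultdict(Counter)
    for d in docs:
        toks = d.split()
        for i, term in enumerate(toks):
            lo, hi = max(0, i - window), min(len(toks), i + window + 1)
            for j in range(lo, hi):
                if j != i and toks[j] != term:
                    counts[term][toks[j]] += 1
    return counts


def generate_terms(docs, k=3, n_support=2, window=2):
    """Map each top keyword to its most frequent neighbouring terms."""
    co = cooccurrence(docs, window)
    terms = {}
    for kw in top_keywords(docs, k):
        ranked = sorted(co[kw].items(), key=lambda kv: (-kv[1], kv[0]))
        terms[kw] = [t for t, _ in ranked[:n_support]]
    return terms


def to_markdown(terms):
    """Render the terms as markdown (hypothetical layout, not Cheetah's actual schema)."""
    lines = []
    for kw, support in terms.items():
        lines.append(f"# {kw}")
        lines.extend(f"- {s}" for s in support)
    return "\n".join(lines)
```

For a corpus of cleaned strings, `to_markdown(generate_terms(docs))` yields one heading per keyword followed by its support terms as bullets; the real file format should be taken from the Cheetah documentation.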
🧹 Refactoring and Fixes
- Code Refactoring:
  - Consolidated several duplicated functions across modules into shared helper utilities at a higher level.
- Bug Fixes:
  - Vulture:
    - Fixed path-saving logic in operator pipelines.
    - Fixed bugs in the NER and Vocabulary Consolidator operators.
  - Beaver:
    - Resolved a file-saving issue that also affected Wolf’s visualization routines.
  - Fixed the README under examples so that it has the correct module links.
- `.gitignore` Updates:
  - Added more output files and example notebook directories to `.gitignore`.
📁 New Example: NM Law Data Pipeline
Added the `NM Law Data/` folder, containing the data processing pipeline used in the paper “Legal Document Analysis with HNMFk” (arXiv:2502.20364):
- `00_data_collection/`: Scrapes and formats legal documents (statutes, constitution, court cases) from Justia.
- `01_hnmfk_operation/`: Constructs document-word matrices and runs Hierarchical Nonnegative Matrix Factorization (HNMFk).
- `02_benchmarking/`: Evaluates LLM-generated content using factual accuracy, entailment, and summarization metrics.
- `03_visualizations/`: Visualizes legal trends, knowledge graphs, and model evaluation results.
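
For readers unfamiliar with the document-word matrix mentioned in the `01_hnmfk_operation/` step, the sketch below shows one simple way to build a raw count matrix. It is only an assumption about the general idea: the function name is hypothetical, the pipeline itself may use a different weighting (such as TF-IDF) or a sparse format, and the HNMFk factorization is not shown.

```python
from collections import Counter


def document_word_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    column = {tok: j for j, tok in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in tokenized]
    for i, toks in enumerate(tokenized):
        for tok, count in Counter(toks).items():
            matrix[i][column[tok]] = count
    return vocab, matrix
```

Each row is one document and each column one vocabulary term, which is the nonnegative input shape that matrix factorization methods expect.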