Release v0.0.40 · lanl/T-ELF

🔧 Vulture Enhancements

Expanded Standard Cleaning Functions:
Added the following to Vulture’s standard text cleaning pipeline:
- remove_numbers: Removes stand-alone numbers.
- remove_alphanumeric: Removes mixed alphanumeric terms (e.g., abc123).
- remove_roman_numerals: Removes Roman numeral listings.
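
As a rough illustration of what these three cleaning steps do, the sketch below applies simple token-level rules. The function bodies and regular expressions are assumptions for illustration only; they are not Vulture's actual implementations and may differ in tokenization and edge cases.

```python
import re

# Hypothetical stand-ins for illustration; not Vulture's actual code.
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
_ROMAN = re.compile(
    r"(?=[IVXLCDM])M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
)


def remove_numbers(text):
    """Drop tokens made only of digits (optionally separated by . or ,)."""
    return " ".join(t for t in text.split() if not _NUMBER.fullmatch(t))


def remove_alphanumeric(text):
    """Drop tokens that mix letters and digits, such as abc123."""
    def is_mixed(token):
        return any(c.isalpha() for c in token) and any(c.isdigit() for c in token)
    return " ".join(t for t in text.split() if not is_mixed(t))


def remove_roman_numerals(text):
    """Drop tokens that form a valid uppercase Roman numeral."""
    return " ".join(t for t in text.split() if not _ROMAN.fullmatch(t))
```

A rule this simple also removes the pronoun "I" and uppercase words such as "MIX", so a production cleaner would likely need extra context checks.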

term_generator:
- Extracts top keywords from cleaned text using TF-IDF.
- Pairs keywords with nearby support terms based on a co-occurrence matrix.
- Saves output as a structured markdown file of search terms.
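
The keyword and support-term steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the function names, the smoothed IDF formula, and the fixed token window are all hypothetical choices, not the term_generator API.

```python
import math
from collections import Counter, defaultdict


def top_keywords(docs, k=3):
    """Rank terms by TF-IDF summed over tokenized documents (assumed scoring)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    scores = defaultdict(float)
    for doc in docs:
        if not doc:
            continue
        for term, count in Counter(doc).items():
            idf = math.log((1 + n) / (1 + df[term])) + 1  # smoothed IDF
            scores[term] += (count / len(doc)) * idf
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def cooccurrence(docs, window=2):
    """Count how often two distinct terms appear within `window` tokens."""
    counts = defaultdict(Counter)
    for doc in docs:
        for i, term in enumerate(doc):
            for other in doc[i + 1 : i + 1 + window]:
                if other != term:
                    counts[term][other] += 1
                    counts[other][term] += 1
    return counts


def support_terms(keyword, counts, n=2):
    """Return the `n` terms that co-occur most often with `keyword`."""
    ranked = sorted(counts[keyword].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]
```

Each keyword from `top_keywords` could then be paired with `support_terms(keyword, counts)` before the result is written out as search terms.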

CheetahTermFormatter:
- Parses markdown search term files into structured blocks with optional filters (e.g., positives, negatives).
- Supports plain string output or category-based filtering.
- Can generate substitution maps to convert multi-word phrases into underscored versions and back.
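
To make the substitution-map idea concrete, here is a small sketch of how multi-word phrases might be mapped to underscored tokens and restored afterwards. `build_substitutions` and `apply_map` are hypothetical helpers written for this example; they are not methods of CheetahTermFormatter.

```python
import re


def build_substitutions(phrases):
    """Map each multi-word phrase to an underscored token, plus the reverse map."""
    forward = {p: "_".join(p.split()) for p in phrases if len(p.split()) > 1}
    reverse = {token: phrase for phrase, token in forward.items()}
    return forward, reverse


def apply_map(text, mapping):
    """Replace every key of `mapping` found in `text`, trying longest keys first."""
    if not mapping:
        return text
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)
```

Applying the forward map before tokenization keeps a phrase such as "topic model" together as one token, and the reverse map restores the readable form afterwards.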

convert_txt_to_cheetah_markdown:
- Converts plain .txt files or structured term dictionaries into Cheetah-compatible markdown format.
- Facilitates easier programmatic creation and editing of search term files.

Code Refactoring:
- Consolidated several duplicated functions across modules into shared helper utilities at a higher level.

Bug Fixes:
- Vulture:
  - Fixed path-saving logic in operator pipelines.
  - Fixed bugs in the NER and Vocabulary Consolidator operators.
- Beaver:
  - Resolved a file-saving issue that also affected Wolf’s visualization routines.
- Fixed the README under examples so that its module links are correct.

.gitignore Updates:
- Added more output files and example notebook directories to .gitignore.

Added the NM Law Data/ folder, containing the data processing pipeline used in the paper:
“Legal Document Analysis with HNMFk” (arXiv:2502.20364)

00_data_collection/:
Scrapes and formats legal documents (statutes, constitution, court cases) from Justia.
01_hnmfk_operation/:
Constructs document-word matrices and runs Hierarchical Nonnegative Matrix Factorization (HNMFk).
02_benchmarking/:
Evaluates LLM-generated content using factual accuracy, entailment, and summarization metrics.
03_visualizations/:
Visualizes legal trends, knowledge graphs, and model evaluation results.
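
For readers unfamiliar with the input to the 01_hnmfk_operation/ step, the sketch below builds a small document-word count matrix from tokenized documents. It is an illustrative assumption about the matrix shape (documents as rows, vocabulary terms as columns); the pipeline's actual construction, weighting, and the HNMFk factorization itself are not shown here.

```python
def document_word_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts vocab[j] in docs[i]."""
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for term in doc:
            matrix[i][index[term]] += 1
    return vocab, matrix
```

A nonnegative matrix of this kind is what a factorization method such as HNMFk would then decompose into topic-like factors.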