v0.0.40

MaksimEkin released this 30 Apr 19:34 · commit ce17cfa

🔧 Vulture Enhancements

  • Expanded Standard Cleaning Functions:
    Added the following to Vulture’s standard text cleaning pipeline:
    • remove_numbers: Removes stand-alone numbers.
    • remove_alphanumeric: Removes mixed alphanumeric terms (e.g., abc123).
    • remove_roman_numerals: Removes Roman numeral listings.
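
To make the behavior of the three new cleaners concrete, here is a minimal sketch written with plain regular expressions. It is not Vulture's implementation: the function bodies, the whitespace tokenization, and the choice to treat only punctuated list markers such as "ii." or "(iv)" as Roman numeral listings are all assumptions made for illustration.

```python
import re

# Hypothetical stand-ins for the cleaners named above; NOT Vulture's actual code.
NUMBER = re.compile(r"^[+-]?\d+([.,]\d+)*$")
# Assumption: only numerals built from i, v, x (1 to 39), which covers typical list markers
# and avoids matching ordinary words such as "mix".
ROMAN = re.compile(r"^x{0,3}(ix|iv|v?i{0,3})$", re.IGNORECASE)


def remove_numbers(text: str) -> str:
    """Drop tokens that consist only of a number (e.g. 42, 3.14, 1,000)."""
    return " ".join(t for t in text.split() if not NUMBER.match(t))


def remove_alphanumeric(text: str) -> str:
    """Drop tokens that mix letters and digits (e.g. abc123)."""
    def mixed(tok: str) -> bool:
        return any(c.isdigit() for c in tok) and any(c.isalpha() for c in tok)
    return " ".join(t for t in text.split() if not mixed(t))


def remove_roman_numerals(text: str) -> str:
    """Drop Roman numeral list markers such as 'ii.', 'iv)' or '(x)'."""
    def is_marker(tok: str) -> bool:
        core = tok.strip("().")
        # Require list punctuation so that bare words are never removed.
        # Known limitation: a sentence-final "I." would also be dropped.
        return core != tok and bool(core) and bool(ROMAN.match(core))
    return " ".join(t for t in text.split() if not is_marker(t))


def clean(text: str) -> str:
    """Apply the three cleaners in sequence."""
    return remove_roman_numerals(remove_alphanumeric(remove_numbers(text)))


print(clean("section ii. covers 42 items and abc123 codes"))
```

Working on whitespace-separated tokens keeps each rule easy to reason about. A real pipeline would likely need to handle punctuation and casing more carefully.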

🐆 Cheetah Additions

  • term_generator:

    • Extracts top keywords from cleaned text using TF-IDF.
    • Pairs keywords with nearby support terms based on a co-occurrence matrix.
    • Saves output as a structured markdown file of search terms.
  • CheetahTermFormatter:

    • Parses markdown search term files into structured blocks with optional filters (e.g., positives, negatives).
    • Supports plain string output or category-based filtering.
    • Can generate substitution maps to convert multi-word phrases into underscored versions and back.
  • convert_txt_to_cheetah_markdown:

    • Converts plain .txt files or structured term dictionaries into Cheetah-compatible markdown format.
    • Facilitates easier programmatic creation and editing of search term files.
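
The term_generator steps listed above (TF-IDF keyword ranking, co-occurrence-based support terms, markdown output) can be sketched in pure Python as follows. This is an illustration of the general technique only. The function names, the window size, the smoothed IDF formula, and the markdown layout are assumptions and do not reflect Cheetah's actual API or file format.

```python
import math
from collections import Counter, defaultdict


def top_keywords(docs, k=3):
    """Rank terms by TF-IDF summed over all documents and return the top k."""
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    scores = defaultdict(float)
    for toks in tokenized:
        if not toks:
            continue
        for term, count in Counter(toks).items():
            tf = count / len(toks)
            idf = math.log((1 + n) / (1 + df[term])) + 1  # smoothed IDF (assumption)
            scores[term] += tf * idf
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def cooccurrence(docs, window=2):
    """Count how often two distinct terms appear within `window` tokens of each other."""
    counts = defaultdict(Counter)
    for d in docs:
        toks = d.split()
        for i, t in enumerate(toks):
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if j != i and toks[j] != t:
                    counts[t][toks[j]] += 1
    return counts


def generate_terms(docs, k=3, support=2, window=2):
    """Pair each top keyword with its most frequent neighbouring terms."""
    co = cooccurrence(docs, window)
    return {
        kw: [t for t, _ in sorted(co[kw].items(), key=lambda x: (-x[1], x[0]))[:support]]
        for kw in top_keywords(docs, k)
    }


def to_markdown(terms):
    """Render keywords as headings with support terms as bullets (hypothetical layout)."""
    lines = []
    for kw, sup in terms.items():
        lines.append(f"# {kw}")
        lines.extend(f"- {s}" for s in sup)
        lines.append("")
    return "\n".join(lines)


docs = [
    "tensor factorization finds latent topics",
    "tensor decomposition finds latent structure",
    "matrix factorization finds topics",
]
print(to_markdown(generate_terms(docs)))
```

The input is assumed to be already cleaned and whitespace-tokenized, which matches the release note's statement that keywords are extracted from cleaned text.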

🧹 Refactoring and Fixes

  • Code Refactoring:

    • Consolidated functions that were duplicated across several modules into shared helper utilities at a higher level.
  • Bug Fixes:

    • Vulture:
      • Fixed path-saving logic in operator pipelines.
      • Fixed bugs in the NER and Vocabulary Consolidator operators.
    • Beaver:
      • Resolved a file-saving issue that also affected Wolf’s visualization routines.
    • Fixed the README under examples so that its module links point to the correct modules.
  • .gitignore Updates:

    • Added more output files and example notebook directories to .gitignore.

📁 New Example: NM Law Data Pipeline

Added the NM Law Data/ folder, containing the data processing pipeline used in the paper:
“Legal Document Analysis with HNMFk” (arXiv:2502.20364)

  • 00_data_collection/:
    Scrapes and formats legal documents (statutes, constitution, court cases) from Justia.

  • 01_hnmfk_operation/:
    Constructs document-word matrices and runs Hierarchical Nonnegative Matrix Factorization (HNMFk).

  • 02_benchmarking/:
    Evaluates LLM-generated content using factual accuracy, entailment, and summarization metrics.

  • 03_visualizations/:
    Visualizes legal trends, knowledge graphs, and model evaluation results.
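
As a rough illustration of the document-word matrix that the 01_hnmfk_operation/ step constructs, the sketch below builds a raw count matrix from whitespace-tokenized text. It is not the pipeline's code: the function name and the use of plain counts are assumptions, and the HNMFk factorization itself is not shown.

```python
from collections import Counter


def document_word_matrix(docs):
    """Return (vocabulary, matrix), where matrix[i][j] counts word j in document i."""
    vocab = sorted({t for d in docs for t in d.split()})
    index = {t: j for j, t in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, d in enumerate(docs):
        for term, count in Counter(d.split()).items():
            matrix[i][index[term]] = count
    return vocab, matrix


vocab, X = document_word_matrix(["court rules court", "statute rules"])
print(vocab)
print(X)
```

All entries are nonnegative counts, which is the property nonnegative matrix factorization requires of its input.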