PredPatt Integration and Python 3.12+ Modernization #31

Merged: 30 commits into master from predpatt-integration, Jul 31, 2025

aaronstevenwhite (Contributor) commented on Jul 30, 2025

This PR modernizes the Decomp toolkit: it fully integrates PredPatt's predicate-argument structure extraction and updates the codebase for Python 3.12+.

Summary

This PR integrates the standalone PredPatt library directly into decomp as decomp.semantics.predpatt, modernizes the codebase for Python 3.12+, and adds comprehensive CI/CD infrastructure. The integration maintains complete compatibility with the original PredPatt implementation while providing seamless interoperability with the UDS framework.

Key Changes

1. PredPatt Integration (~7,000 lines)

  • Full module structure at decomp.semantics.predpatt:
    • core/: Core data structures (Token, Predicate, Argument, PredPattOpts)
    • extraction/: Main extraction engine with linguistic rule application
    • parsing/: Universal Dependencies parsing utilities
    • rules/: Modular linguistic rules for predicate/argument identification
    • filters/: Configurable filtering system
    • utils/: Visualization and debugging utilities
  • Algorithm fidelity: Byte-for-byte identical output with standalone PredPatt v1.0.1
  • Comprehensive testing: 100+ differential tests ensure compatibility
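
The differential-testing idea above can be sketched as follows. This is a minimal illustration and not the PR's actual test suite: `reference_extract` and `integrated_extract` are hypothetical stand-ins for the standalone and integrated implementations, and the toy "extraction" only lowercases alphabetic tokens.

```python
from collections.abc import Callable, Iterable


def reference_extract(sentence: str) -> list[str]:
    """Hypothetical stand-in for the standalone implementation."""
    return [tok.lower() for tok in sentence.split() if tok.isalpha()]


def integrated_extract(sentence: str) -> list[str]:
    """Hypothetical stand-in for the integrated implementation."""
    result = []
    for tok in sentence.split():
        if tok.isalpha():
            result.append(tok.lower())
    return result


def differential_check(
    reference: Callable[[str], list[str]],
    candidate: Callable[[str], list[str]],
    inputs: Iterable[str],
) -> list[tuple[str, list[str], list[str]]]:
    """Return every input on which the two implementations disagree."""
    mismatches = []
    for item in inputs:
        expected, actual = reference(item), candidate(item)
        if expected != actual:
            mismatches.append((item, expected, actual))
    return mismatches


# run both implementations over the same inputs and collect disagreements
corpus = ["The dog barked .", "She said that he left .", ""]
mismatches = differential_check(reference_extract, integrated_extract, corpus)
```

An empty `mismatches` list means the two implementations agree on every input. Returning the disagreeing cases, rather than a bare boolean, makes a failing test easier to diagnose.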

2. Python 3.12+ Modernization

  • Type system updates:
    • Modern union syntax (X | Y instead of Union[X, Y])
    • Built-in generics (list[str] instead of List[str])
    • type statement for aliases (type X = ... instead of X: TypeAlias = ...)
    • Protocol-based typing for better type safety
  • Packaging modernization:
    • Migration from setup.py to pyproject.toml
    • Structured dependency management with optional [dev] extras
    • Modern build system configuration

3. CI/CD Infrastructure

  • GitHub Actions workflow (.github/workflows/ci.yml):
    • Automated testing with pytest (including slow tests)
    • Linting with ruff (E/F errors only, style warnings suppressed)
    • Type checking with mypy
    • Documentation building with warnings as errors
  • Development tooling:
    • ruff.toml: Linting and formatting configuration
    • mypy.ini: Type checking configuration
    • Comprehensive test suite with 400+ tests

4. Documentation Enhancements

  • Comprehensive API documentation:
    • Detailed module and class docstrings throughout
    • Structured documentation for PredPatt components
    • Type hints for all public APIs
  • Updated tutorials:
    • Installation instructions for modern packaging
    • Quick-start guide corrections
    • Docker setup with jupyter/datascience-notebook
  • New content:
    • CHANGELOG.md with complete release history
    • releases.rst documentation page
    • CI/CD documentation

5. Bug Fixes

  • Fixed JSON serialization issue in UDS metadata (sets → lists)
  • Corrected document_ids → documentids in documentation
  • Fixed trailing whitespace in linearization docstrings
  • Resolved F841 unused variable warnings
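
The sets → lists fix above addresses a general limitation: the standard `json` module cannot encode `set` values. The sketch below shows the failure and one possible conversion. The `metadata` dictionary and the `to_jsonable` helper are hypothetical and do not reproduce the UDS metadata classes, and the helper assumes the elements of each set are mutually comparable so they can be sorted.

```python
import json

# hypothetical metadata containing a set-valued field
metadata = {"annotators": {"ann-2", "ann-1"}, "datatype": "ordinal"}

try:
    json.dumps(metadata)
    set_serialized = True
except TypeError:
    # json raises TypeError because set is not a JSON-serializable type
    set_serialized = False


def to_jsonable(value):
    """Recursively replace sets with sorted lists so json can encode them."""
    if isinstance(value, (set, frozenset)):
        return sorted(to_jsonable(v) for v in value)
    if isinstance(value, dict):
        return {k: to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(v) for v in value]
    return value


encoded = json.dumps(to_jsonable(metadata), sort_keys=True)
```

Sorting the converted sets keeps the serialized output deterministic across runs, which matters when serialized files are compared or checked into version control.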

Testing

All tests pass including the comprehensive PredPatt differential test suite:

pytest --runslow  # 417 tests pass
mypy decomp      # No type errors
ruff check . --select E,F  # No critical errors

Breaking Changes

No API-level breaking changes for existing users. The integration is additive, although the minimum supported Python version has been raised:

  • Original decomp API remains unchanged
  • PredPatt functionality is now available at decomp.semantics.predpatt
  • Python 3.12+ is now required (was 3.6+)

Migration Guide

For users of standalone PredPatt:

# before (standalone predpatt)
from predpatt import PredPatt, load_conllu, PredPattOpts

# after (integrated in decomp)
from decomp.semantics.predpatt import PredPatt, load_conllu, PredPattOpts
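
For code that has to run in environments with either package during a transition, one option is to detect which module path is importable before importing from it. The helper below is a hypothetical sketch and is not part of decomp or PredPatt; only the two module paths come from the migration guide above.

```python
import importlib.util

# the new location first, then the standalone package
CANDIDATES = ("decomp.semantics.predpatt", "predpatt")


def resolve_predpatt_module(candidates=CANDIDATES):
    """Return the first importable module name, or None if none is installed."""
    for name in candidates:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except (ImportError, ValueError):
            # find_spec imports parent packages and raises if one is missing
            continue
    return None


module_name = resolve_predpatt_module()
```

If `module_name` is not `None`, it can be passed to `importlib.import_module` to load the module. In projects that control their own dependencies, switching the import line directly is simpler.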

Example Usage

from decomp.semantics.predpatt import PredPatt, load_conllu

# load dependency parse
sentences = load_conllu('example.conllu')

# extract predicates and arguments
pp = PredPatt(sentences[0])

# access structures
for predicate in pp.predicates:
    print(f"Predicate: {predicate}")
    for arg in predicate.arguments:
        print(f"  Argument: {arg}")

Future Work

  • Standalone PredPatt contains a syntactic parser. This parser is very old and implemented in Java. A newer parser will be integrated in the future under a parsing feature. Currently this feature has concrete as a dependency, but this dependency is likely to be removed, since it is unlikely to be necessary and requires many old dependencies.
  • The style of the code is relatively homogeneous but could be improved.
  • Some unpythonic naming conventions from the standalone PredPatt remain for the sake of backward compatibility and should be removed.
  • The tests need substantial reorganization, style edits, and deduplication, especially the differential tests comparing standalone PredPatt to the integrated implementation.

Checklist

  • Tests pass locally and in CI
  • Documentation builds without warnings
  • Type checking passes
  • Linting passes (critical errors only)
  • Changelog updated
  • README updated with new badges and instructions

…`mypy.ini` for improved readability and added tests for argument filtering, predicate filtering, and integrated filtering to ensure consistent behavior with the original PredPatt implementation.
…ing configuration settings. Introduces new test files for differential testing of argument and predicate classes, ensuring compatibility with the original PredPatt implementation. Updates `pyproject.toml` for linting configurations and removes deprecated dependencies from `requirements.txt`.
…nization and consistency. Updates argument and predicate filtering functions to follow naming conventions. Enhances test files by ensuring compatibility with the original PredPatt implementation and improving readability. Additionally, minor formatting adjustments and code cleanups are applied throughout the codebase.
…prove code clarity and robustness. Updates the `pyproject.toml` to include new dependencies and removes deprecated ones. Enhances test coverage for argument and predicate classes, ensuring proper handling of edge cases and improving overall test reliability.
…s across various classes. Introduces a new typing module for shared type definitions, improves docstrings for clarity, and refines method signatures to ensure type safety. Updates the UDS corpus and document classes to better manage sentence and document-level graphs, including improved metadata handling and annotation methods. Additionally, refactors existing code for consistency and readability.
…and semantics modules to enhance type safety and clarity. Updates the UDS annotation system with new type definitions for better consistency. Improves error handling in various methods and enhances test coverage for the corpus and graph converters, ensuring robust functionality and compatibility with existing implementations.
…aset loading process. Enhances the `__init__.py` file with detailed module description and usage examples, improving overall code organization and readability.
…on for graph corpus management. Introduces detailed class and type alias documentation, improving clarity and usability for developers implementing corpus readers in the decomp framework.
…tions for the UDS corpus, annotation, and metadata classes. Refines type hints for improved clarity and consistency, ensuring better type safety throughout the UDS annotation system. Updates method signatures and docstrings to reflect changes, enhancing usability for developers working with UDS datasets.
… parameter to support both PredPattCorpus and a dictionary of UDSSentenceGraph. Updates the _validate_arguments method to reflect this change. Additionally, improves the get_ontologies function to prioritize loading metadata from annotation files, with fallback to the UDS corpus, enhancing the ontology collection process.
…hances class descriptions and method signatures for clarity, ensuring better type safety and usability. Updates type aliases to use `type` instead of `TypeAlias` for consistency, and improves error messages for better debugging. Additionally, restructures nested dictionary types for improved readability.
… graph modules. Updates type aliases to use `type` instead of `TypeAlias` for consistency, enhances method signatures for clarity, and improves error messages. Additionally, restructures docstrings for better readability and usability, ensuring a more robust and user-friendly API for developers working with UDS datasets.
…on files, enhancing type checking flexibility. Removes outdated test file for differential imports, streamlining the codebase. Updates type casting in PredPattCorpus for improved type safety and clarity, ensuring consistent handling of corpus data.
…ngs for classes and functions, improving clarity and usability. Refines type hints for better type safety and consistency, and restructures method signatures for improved readability. Updates the `get_ontologies` function to enhance metadata loading from annotation files, ensuring a more robust ontology collection process.
…rpus.py`, and `graph.py` files. Enhances documentation with detailed descriptions of classes and methods, improving clarity and usability. Introduces the `PredPattCorpus` and `PredPattGraphBuilder` classes for better management of semantic extractions and graph construction. Updates type hints for improved type safety and consistency across the module.
…dule. Updates the module docstring to provide clearer descriptions of key components, including the `HasPosition` protocol and `UDSchema` type alias. Refines type alias declaration for `UDSchema` to improve consistency and clarity across the PredPatt framework.
…ates the module and class docstrings to enhance clarity and detail regarding token representation and its attributes. Improves comments for better readability and understanding of the code structure.
…ew PredicateType enumeration for better type safety and clarity. Updates the documentation to reflect changes in predicate type handling, enhancing usability and consistency across the module. Modifies various components to utilize the new enumeration, ensuring a more robust implementation of predicate types.
… and usability. Updates class and function docstrings in various files, including `__init__.py`, `corpus.py`, `graph.py`, and `typing.py`, to provide detailed descriptions of components and their functionalities. Introduces structured sections for classes, functions, and constants, improving the overall organization of the documentation. Additionally, refines type hints and comments for better readability and consistency throughout the module.
…rules modules for improved clarity and consistency. Updates comment styles to lowercase and enhances readability in various files, including `__init__.py`, `argument_filters.py`, `predicate_filters.py`, and `base.py`. This change aims to standardize documentation practices and improve the overall usability of the codebase.
… for clarity and consistency. Updates comments in `corpus.py`, `__init__.py`, `nx.py`, and `graph.py` to standardize formatting and improve readability. This change aims to provide clearer descriptions of methods and properties, enhancing the overall usability of the codebase.
…ty and consistency. Updates `from_conll_and_annotations`, `from_json`, `add_annotation`, and various other methods in `corpus.py`, `document.py`, and `graph.py` to use a more structured format. This change enhances code clarity and maintains uniformity in method definitions throughout the codebase.
…ngine.py`, and `linearization.py`, by refining docstrings for clarity and consistency. Updates comments to standardize formatting and improve readability. This change aims to provide clearer descriptions of classes, methods, and their functionalities, enhancing the overall usability of the codebase.
- Introduced a new CHANGELOG.md to document notable changes and version history for the Decomp project.
- Added a CI workflow in .github/workflows/ci.yml for automated testing, linting, and type checking using Python 3.12.
- Updated README.md with badges for CI status, GitHub link, and license information.
- Enhanced documentation across various modules, including installation instructions, release notes, and detailed API references for the new PredPatt integration and Python 3.12+ compatibility.
…tegration

- Changed the base image in Dockerfile to jupyter/datascience-notebook with Python 3.12.
- Updated working directory and copy commands in Dockerfile for better ownership management.
- Modified installation commands to use editable mode and pre-build the UDS corpus.
- Enhanced README.md and install.rst with updated instructions for building and running the Docker image, including starting a Jupyter Lab server.
- Updated requirements.txt to reflect new package versions and added development dependencies for testing.
- Updated README.md and install.rst to clarify installation methods, including direct installation from GitHub and from source.
- Added requirements for Python 3.12 or higher and detailed steps for development installation with dependencies.
- Improved documentation structure and content in various files, including sentence-graphs.rst and predpatt.rst, for better clarity and usability.
- Refined comments and docstrings across multiple modules to enhance readability and consistency.
aaronstevenwhite force-pushed the predpatt-integration branch 3 times, most recently from bd88ca2 to c024247 (July 31, 2025 01:44)
aaronstevenwhite force-pushed the predpatt-integration branch 3 times, most recently from 07c3fc9 to 1c0e24d (July 31, 2025 02:14)
- Modifies the Dockerfile to install the toolkit in editable mode with visualization dependencies, removing the requirements.txt file.
- Updates the tests/README.md to clarify installation steps for running tests, emphasizing the use of editable mode for development dependencies.
- Removes tests/requirements.txt as its contents are now integrated into the main installation process.
aaronstevenwhite merged commit 5c10590 into master on Jul 31, 2025
4 checks passed
1 participant