AMZ-CodeFusion v.1

@adeism adeism released this 30 Jan 06:37
· 1 commit to main since this release
4b29adf

✨ Key Features for Code Documentation, Datasets & RAG

  • Human-in-the-Loop Documentation Focus: Prepare your codebase for enhanced documentation efforts by consolidating code into manageable, structured outputs, ready for human review and annotation.
  • Source Code Dataset Generation: Create clean, combined source code datasets perfect for training and evaluating RAG models designed for code understanding and generation.
  • Source Code Archiving: Efficiently archive entire projects or specific code sections into single files for better organization, searchability, and long-term storage.
  • RAG-Optimized Output: Generate output files specifically structured for optimal performance with Retrieval-Augmented Generation systems, enhancing code retrieval and context.
  • Intuitive GUI: User-friendly graphical interface powered by tkinter for effortless configuration and operation.
  • Flexible Input: Select any source directory to process your code files.
  • Customizable Output: Choose the output file name and location for your code dataset or archive.
  • File Extension Filtering: Include only specific code file types (e.g., .py, .java, .js, .c, .cpp, .html, .css).
  • Folder Exclusion: Exclude development-related folders (like .git, node_modules, venv) to focus on source code.
  • Regex Pattern Exclusion: Define regular expression patterns to exclude specific files or paths within your codebase.
  • File Size Limit: Manage dataset size by setting a maximum file size to skip processing very large code files.
  • Content Enhancements (Optional):
    • Include line numbers for referencing specific lines of code in documentation.
    • Add timestamps for tracking code versions or archival dates.
    • Display file sizes for dataset analysis.
    • Opt in to syntax highlighting in the output for improved readability in documentation and datasets.
  • Code-Focused Exclusion Options:
    • Exclude images and non-code assets.
    • Exclude executable files and build artifacts.
    • Exclude temporary and backup files commonly found in development environments.
    • Exclude hidden files and folders.
    • NEW! Exclude comments (/* ... */) to create cleaner code datasets, focusing on the core logic.
  • Detailed Logging: Comprehensive logging to track the code dataset generation and archiving process, including skipped files and folders.
  • Summary Reports: Adds a summary header and a combination summary to the output file, detailing the code files processed, the dataset size, and the skipped items.
  • Zip Archive Creation: Optionally create a .zip archive of the output code dataset or archive for easy sharing and distribution.
  • Multi-threaded Processing: Leverages multi-threading to accelerate the processing of large codebases.
  • Open Output File: Automatically opens the generated code dataset or archive file after processing.
  • Skipped Items Detail: Option to include detailed lists of skipped folders and files in the output summary for complete transparency in dataset creation.
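
To make the filtering and combining features above more concrete, here is a minimal sketch of how extension filtering, folder exclusion, regex exclusion, a file size limit, and line-numbered output could fit together. It is not AMZ-CodeFusion's actual implementation: the function names (`collect_files`, `combine`), their parameters, and the `=====` header format are all assumptions made for illustration.

```python
import os
import re
from pathlib import Path


def collect_files(root, extensions, exclude_dirs, exclude_patterns, max_bytes):
    """Walk `root` and return (selected_paths, skipped) for a hypothetical run.

    `skipped` is a list of (relative_path, reason) tuples, similar in spirit
    to a skipped-items report. All names here are illustrative assumptions.
    """
    root = Path(root)
    compiled = [re.compile(p) for p in exclude_patterns]
    selected, skipped = [], []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded folders in place so os.walk never descends into them.
        dirnames[:] = sorted(d for d in dirnames if d not in exclude_dirs)
        for name in sorted(filenames):
            path = Path(dirpath) / name
            rel = path.relative_to(root).as_posix()
            if path.suffix.lower() not in extensions:
                skipped.append((rel, "extension"))
            elif any(rx.search(rel) for rx in compiled):
                skipped.append((rel, "pattern"))
            elif path.stat().st_size > max_bytes:
                skipped.append((rel, "size"))
            else:
                selected.append(path)
    return selected, skipped


def combine(root, files, line_numbers=True):
    """Concatenate files into one string, each preceded by a header line."""
    root = Path(root)
    parts = []
    for path in files:
        rel = path.relative_to(root).as_posix()
        text = path.read_text(encoding="utf-8", errors="replace")
        parts.append(f"===== {rel} ({path.stat().st_size} bytes) =====")
        for number, line in enumerate(text.splitlines(), start=1):
            parts.append(f"{number:4d}: {line}" if line_numbers else line)
    return "\n".join(parts) + "\n"
```

Under these assumptions, a call such as `collect_files("my_project", {".py", ".js"}, {".git", "node_modules", "venv"}, [r"(^|/)test_"], 1_000_000)` would return the files to combine plus a list of skipped files with the reason each was skipped. Files inside excluded folders are pruned before they are visited, so in this sketch they do not appear in the skipped list.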