✨ Key Features for Code Documentation, Datasets & RAG
Human-in-the-Loop Documentation Focus: Prepare your codebase for enhanced documentation efforts by consolidating code into manageable, structured outputs, ready for human review and annotation.
Source Code Dataset Generation: Create clean, combined source code datasets perfect for training and evaluating RAG models designed for code understanding and generation.
Source Code Archiving: Efficiently archive entire projects or specific code sections into single files for better organization, searchability, and long-term storage.
RAG-Optimized Output: Generate output files specifically structured for optimal performance with Retrieval-Augmented Generation systems, enhancing code retrieval and context.
Intuitive GUI: User-friendly graphical interface powered by tkinter for effortless configuration and operation.
Flexible Input: Select any source directory to process your code files.
Customizable Output: Choose the output file name and location for your code dataset or archive.
File Extension Filtering: Include only specific code file types (e.g., .py, .java, .js, .c, .cpp, .html, .css).
Folder Exclusion: Exclude development-related folders (like .git, node_modules, venv) to focus on source code.
Regex Pattern Exclusion: Define regular expression patterns to exclude specific files or paths within your codebase.
File Size Limit: Manage dataset size by setting a maximum file size to skip processing very large code files.
Content Enhancements (Optional):
Include line numbers for referencing specific lines of code in documentation.
Add timestamps for tracking code versions or archival dates.
Display file sizes for dataset analysis.
Opt in to syntax highlighting in the output for improved readability in documentation and datasets.
Code-Focused Exclusion Options:
Exclude images and non-code assets.
Exclude executable files and build artifacts.
Exclude temporary and backup files commonly found in development environments.
Exclude hidden files and folders.
NEW! Exclude block comments (/* ... */) to create cleaner code datasets focused on the core logic.
Detailed Logging: Comprehensive logging to track the code dataset generation and archiving process, including skipped files and folders.
Summary Reports: Includes a summary header and a combination summary in the output file, detailing code files processed, dataset size, and skipped items.
Zip Archive Creation: Optionally create a .zip archive of the output code dataset or archive for easy sharing and distribution.
Multi-threaded Processing: Leverages multi-threading to accelerate the processing of large codebases.
Open Output File: Automatically opens the generated code dataset or archive file after processing.
Skipped Items Detail: Option to include detailed lists of skipped folders and files in the output summary for complete transparency in dataset creation.
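To make the filtering features above more concrete, here is a minimal sketch of how extension filtering, folder exclusion, regex exclusion, the file size limit, and optional line numbers might fit together. It is not the tool's actual implementation: the function names `passes_filters` and `combine`, their parameters, the default size limit, and the `=====` file header format are all hypothetical and chosen only for illustration.

```python
import os
import re
from pathlib import Path


def passes_filters(path, rel, extensions, patterns, max_bytes):
    """Hypothetical helper: True if a file passes every filter."""
    # File extension filtering: keep only the requested suffixes.
    if path.suffix.lower() not in extensions:
        return False
    # Regex pattern exclusion, matched against the path relative to the source.
    if any(p.search(rel) for p in patterns):
        return False
    # File size limit: skip very large files.
    return path.stat().st_size <= max_bytes


def combine(source_dir, extensions, exclude_folders=(), exclude_regexes=(),
            max_bytes=1_000_000, line_numbers=False):
    """Hypothetical helper: join all matching files into one string."""
    source_dir = Path(source_dir)
    patterns = [re.compile(r) for r in exclude_regexes]
    parts = []
    for root, dirs, files in os.walk(source_dir):
        # Folder exclusion: prune in place so os.walk never descends into them.
        dirs[:] = sorted(d for d in dirs if d not in exclude_folders)
        for name in sorted(files):
            path = Path(root) / name
            rel = path.relative_to(source_dir).as_posix()
            if not passes_filters(path, rel, extensions, patterns, max_bytes):
                continue
            text = path.read_text(encoding="utf-8", errors="replace")
            if line_numbers:
                # Prefix each line with a right-aligned, 1-based line number.
                text = "\n".join(
                    f"{i:>4}: {line}"
                    for i, line in enumerate(text.splitlines(), 1)
                )
            parts.append(f"===== {rel} =====\n{text}\n")
    return "\n".join(parts)
```

A call such as `combine("my_project", {".py", ".js"}, exclude_folders={".git", "node_modules", "venv"}, exclude_regexes=[r"_test\.py$"], line_numbers=True)` would return the combined text, which could then be written to the chosen output file. Pruning `dirs` in place is what keeps excluded folders cheap: their contents are never visited at all.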