NOTE: requirements.txt is missing for every module. The code is currently written in Python with the help of ChatGPT.
The main directory contains the GUI for the Libgen Librarian scraper.
The tgram_scraper connects to HegelBot on the ALS T-gram channel. HegelBot leaves a trophy reaction once an image, text message, or audio file has finished scraping.
- The channel IDs are hardcoded. You'll see it's just a list that we can continue expanding.
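If the list keeps growing, it may be easier to extend it without editing the source. The following is a minimal sketch under stated assumptions, not the current tgram_scraper code: the file name `channels.json`, the function `load_channel_ids`, and the ID shown are all placeholders.

```python
import json
from pathlib import Path

# Placeholder value only; the real IDs are the ones hardcoded in tgram_scraper.
DEFAULT_CHANNEL_IDS = [-1001234567890]


def load_channel_ids(config_path="channels.json", defaults=DEFAULT_CHANNEL_IDS):
    """Return the default IDs plus any extra IDs listed in an optional JSON file.

    The JSON file (hypothetical) is expected to hold a flat list of integers,
    e.g. [-1001111111111, -1002222222222].
    """
    path = Path(config_path)
    if not path.exists():
        # No config file: fall back to the hardcoded list.
        return list(defaults)
    extra = [int(value) for value in json.loads(path.read_text(encoding="utf-8"))]
    # dict.fromkeys removes duplicates while keeping the original order.
    return list(dict.fromkeys(list(defaults) + extra))
```

With this shape, the hardcoded list stays as the fallback, and new channels can be added by editing the JSON file alone.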
The cover_classifier separates covers and text-heavy images from the rest of the scrape.
What remains to be done:
1. Extract text from images and PDFs using DataLab Marker OCR
2. Fully translate PDFs by processing the OCR output through subscription AI
3. Partially format the translation, along with its images, as best we can before sending it off to Fiverr for manual workup
4. Extract bibliographies and citations using subscription AI
5. Connect the output back into the GUI for manual priority sorting
   5a. Separate tabs for separate workflows
   5b. Connection to a remote server; use a web interface instead of the Tkinter interface