Automatically extract outlines from software-generated PDF documents based on layout and text styles
This is a work in progress: the algorithm is still being improved and more test examples are being added.
# Requires Node 20+
npm install -g pdfoutliner
pdfoutliner -h
# Install globally and see options
# Alternatively run `npx pdfoutliner -h` without installation
pdfoutliner example.pdf
# outline will be added to new file example_outlined.pdf
pdfoutliner example.pdf -o txt
pdfoutliner example.pdf --fromtxt
# first save outline to example_outline.txt for manual edit
# then add outline from txt file to pdf
Some scientific papers (particularly preprints) don't include an outline in the PDF, which makes it inconvenient to jump between sections. This tool analyzes the document's layout and text styles and uses heuristics to select the text that should become outline entries. The result may not be perfect, but it can still be useful.
It works only on software-generated PDFs and does not support scanned ones. It has been tested primarily on papers (see the example folder for some open-access ones), but it may also work on longer documents such as books.
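To give a feel for what a style-based heuristic can look like, here is a minimal TypeScript sketch. It is not pdfoutliner's actual algorithm. It assumes the text lines have already been extracted together with their font sizes, and the `TextLine` and `OutlineEntry` types, the 1.1 size ratio, and the 100-character limit are all hypothetical choices made for illustration. The idea: treat the most common font size as body text, then treat short lines in a noticeably larger font as headings, with larger fonts mapping to higher outline levels.

```typescript
// Hypothetical shape of one extracted text line (not pdfoutliner's data model).
interface TextLine {
  text: string;
  fontSize: number; // in points
  page: number; // 1-based page number
}

// Hypothetical outline entry produced by the sketch.
interface OutlineEntry {
  title: string;
  level: number; // 1 = top-level heading
  page: number;
}

// Assumption: the font size that covers the most characters is the body text size.
function bodyFontSize(lines: TextLine[]): number {
  const charsBySize: Record<string, number> = {};
  for (const line of lines) {
    const key = String(line.fontSize);
    charsBySize[key] = (charsBySize[key] || 0) + line.text.length;
  }
  let best = 0;
  let bestChars = -1;
  for (const key of Object.keys(charsBySize)) {
    if (charsBySize[key] > bestChars) {
      best = Number(key);
      bestChars = charsBySize[key];
    }
  }
  return best;
}

function extractOutline(lines: TextLine[]): OutlineEntry[] {
  const body = bodyFontSize(lines);
  // Candidate headings: non-empty, short, and noticeably larger than body text.
  const candidates = lines.filter(
    (l) => l.fontSize > body * 1.1 && l.text.trim().length > 0 && l.text.length <= 100
  );
  // Distinct heading sizes sorted largest first; the rank of a size is its level.
  const sizes: number[] = [];
  for (const c of candidates) {
    if (sizes.indexOf(c.fontSize) === -1) sizes.push(c.fontSize);
  }
  sizes.sort((a, b) => b - a);
  return candidates.map((c) => ({
    title: c.text.trim(),
    level: sizes.indexOf(c.fontSize) + 1,
    page: c.page,
  }));
}

// Made-up sample input standing in for text extracted from a PDF.
const sample: TextLine[] = [
  { text: "1 Introduction", fontSize: 14, page: 1 },
  { text: "Preprints often ship without bookmarks, which makes navigation slow.", fontSize: 10, page: 1 },
  { text: "1.1 Motivation", fontSize: 12, page: 1 },
  { text: "Readers want to jump between sections quickly.", fontSize: 10, page: 1 },
  { text: "2 Method", fontSize: 14, page: 2 },
  { text: "We inspect the font size of every line.", fontSize: 10, page: 2 },
];

const outline = extractOutline(sample);
for (const entry of outline) {
  // new Array(n).join("  ") yields n - 1 indent units, so level 1 is not indented.
  console.log(new Array(entry.level).join("  ") + entry.title + " (p. " + entry.page + ")");
}
```

A heuristic this simple breaks down quickly on real documents (titles, captions, and headings set in bold at body size, for example), which is one reason the extracted outline may need manual correction through the txt round trip shown earlier.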
A Zotero plugin was originally planned, but a similar feature has been built into Zotero.
- Google Scholar PDF Reader (not written to file)
- Zotero 7 (not written to file)
- github.com/hueyy/pdf_scout (an inspiration)
- github.com/cdevereaux/automatic_pdf_outline (semi-automatic)
- Possibly some PDF software suites