|
1 | 1 | # PDFScraper
|
2 | 2 | CLI program for searching text and tables inside PDF documents and displaying the results in HTML. It combines [Pdfminer.six](https://github.com/pdfminer/pdfminer.six), [Camelot](https://github.com/camelot-dev/camelot) and [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) in a single, easy-to-use program.
|
3 | 3 |
|
4 |
| -# How to install |
5 |
| -### Using pip |
| 4 | +# How to use |
| 5 | +### Install using pip |
6 | 6 |
|
7 |
| -After installing the dependencies you can simply use pip to install PDFScraper: |
| 7 | +Use pip to install PDFScraper: |
8 | 8 |
|
9 | 9 | <pre>
|
10 | 10 | $ pip install PDFScraper
|
11 | 11 | </pre>
|
| 12 | + |
| 13 | +### Arguments |
| 14 | +<pre> |
| 15 | +optional arguments: |
| 16 | + -h, --help show this help message and exit |
| 17 | + --path PATH path to pdf folder or file |
| 18 | + --out OUT path to output file location |
| 19 | + --log_level {critical,error,warning,info,debug} |
| 20 | + logger level to use (default: info) |
| 21 | + --search SEARCH word to search for |
| 22 | + --tessdata TESSDATA location of tesseract data files |
| 23 | + --tables TABLES should tables be extracted and searched |
| 24 | +</pre> |
| 25 | + |
| 26 | + |
| 27 | + |
| 28 | +`path`, by default ".", specifies the location of the PDF file or folder to search.
| 29 | +
| 30 | +`out`, by default ".", specifies the output directory in which the `summary.html` file is created.
| 31 | +
| 32 | +`search` specifies the word or sentence to search for in the PDF documents.
| 33 | +
| 34 | +`tessdata` specifies a custom tessdata location for OCR analysis.
| 35 | +
| 36 | +`tables`, by default True, specifies whether tables are extracted and searched for the search term. Disabling table search improves speed significantly.
| 37 | + |
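The help text above looks like standard `argparse` output. As a rough illustration of how such an interface could be declared, here is a minimal sketch. It is not PDFScraper's actual source: `build_parser` and `str_to_bool` are hypothetical names, and the explicit boolean parsing for `--tables` is an assumption (a plain `type=bool` would treat the string "False" as true).

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Hypothetical helper: parse common true/false spellings."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes", "y"}:
        return True
    if lowered in {"false", "0", "no", "n"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Declare the arguments listed in the help text, with the documented defaults."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--path", default=".", help="path to pdf folder or file")
    parser.add_argument("--out", default=".", help="path to output file location")
    parser.add_argument(
        "--log_level",
        choices=["critical", "error", "warning", "info", "debug"],
        default="info",
        help="logger level to use (default: info)",
    )
    parser.add_argument("--search", help="word to search for")
    parser.add_argument("--tessdata", help="location of tesseract data files")
    parser.add_argument(
        "--tables",
        type=str_to_bool,
        default=True,
        help="should tables be extracted and searched",
    )
    return parser


if __name__ == "__main__":
    # Example invocation: search ./pdfs for "invoice" without table extraction.
    args = build_parser().parse_args(
        ["--path", "./pdfs", "--search", "invoice", "--tables", "false"]
    )
    print(args)
```

Arguments that are not passed keep the defaults described above, so `out` stays "." and `log_level` stays "info" in this invocation.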
| 38 | +### OCR |
| 39 | + |
| 40 | +**Pretrained tessdata language [files](https://github.com/tesseract-ocr/tessdata_best) must be added manually to the tessdata directory.** |
| 41 | + |
| 42 | + |
| 43 | +OCR analysis of PDF documents currently supports the English and Slovenian languages.
| 44 | +The language of the document is automatically detected using the [langdetect library](https://github.com/Mimino666/langdetect).
| 45 | + |
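To make the OCR setup more concrete, the sketch below shows one way the detected language could be mapped to a Tesseract language and how the tessdata directory could be checked for the required files. It is an illustration under assumptions, not PDFScraper's implementation: langdetect's `detect` returns two-letter codes such as `en` and `sl`, Tesseract names its data files `eng.traineddata` and `slv.traineddata`, and the helper names and the English fallback are hypothetical.

```python
from pathlib import Path
from typing import List

# Two-letter codes as reported by langdetect, mapped to Tesseract language names.
TESSERACT_LANG = {"en": "eng", "sl": "slv"}


def tesseract_lang(iso_code: str, fallback: str = "eng") -> str:
    """Return the Tesseract language for a detected code.

    Falling back to English for unsupported languages is an assumption.
    """
    return TESSERACT_LANG.get(iso_code, fallback)


def missing_traineddata(tessdata_dir: str) -> List[str]:
    """List the .traineddata files for supported languages that are absent."""
    directory = Path(tessdata_dir)
    wanted = [f"{lang}.traineddata" for lang in sorted(set(TESSERACT_LANG.values()))]
    return [name for name in wanted if not (directory / name).is_file()]


if __name__ == "__main__":
    print(tesseract_lang("sl"))
    print(missing_traineddata("."))
```

Running a check like `missing_traineddata` before starting OCR gives a clear message about which language files still need to be downloaded.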