Commit 2466110

README improvements
1 parent 3bddfcb commit 2466110

3 files changed: 39 additions, 5 deletions


PDFScraper/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-__version__ = "1.0.4"
+__version__ = "1.0.5"
 
 
 def version():

README.md

Lines changed: 37 additions & 3 deletions
@@ -1,11 +1,45 @@
 # PDFScraper
 CLI program for searching text and tables inside of PDF documents and displaying results in HTML. It combines [Pdfminer.six](https://github.com/pdfminer/pdfminer.six), [Camelot](https://github.com/camelot-dev/camelot) and [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) in a single program, which is simple to use.
 
-# How to install
-### Using pip
+# How to use
+### Install using pip
 
-After installing the dependencies you can simply use pip to install PDFScraper:
+Use pip to install PDFScraper:
 
 <pre>
 $ pip install PDFScraper
 </pre>
+
+### Arguments
+<pre>
+optional arguments:
+-h, --help show this help message and exit
+--path PATH path to pdf folder or file
+--out OUT path to output file location
+--log_level {critical,error,warning,info,debug}
+logger level to use (default: info)
+--search SEARCH word to search for
+--tessdata TESSDATA location of tesseract data files
+--tables TABLES should tables be extracted and searched
+</pre>
+
+
+
+`path`, by default ".", specifies the location of the PDF folder or directory.
+
+`out`, by default ".", specifies output directory in which `summary.html` file is created.
+
+`search` argument is used for specifying the word or sentence that will be searched for in the PDF documents.
+
+`tessdata` argument can be used to specify custom tessdata location for OCR analysis.
+
+`tables`, by default True, specifies whether to search for search word in tables. Disabling tables search improves speed significantly.
+
+### OCR
+
+**tessdata pretrained language [files](https://github.com/tesseract-ocr/tessdata_best) need to be manually added to the tessdata directory.**
+
+
+OCR analysis of PDF documents currently supports English and Slovenian language.
+Language of the document is automatically detected using [langdetect library](https://github.com/Mimino666/langdetect).
+
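For readers reviewing the new Arguments section, the following is a minimal sketch of an `argparse` parser that would accept the documented options with the documented defaults. It is not taken from the PDFScraper source: the function names `build_parser` and `str2bool` are hypothetical, and how the real program parses the `--tables` value is an assumption.

```python
import argparse


def str2bool(value: str) -> bool:
    """Hypothetical helper: accept common spellings of true/false."""
    lowered = value.lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options listed in the README."""
    parser = argparse.ArgumentParser(prog="PDFScraper")
    parser.add_argument("--path", default=".", help="path to pdf folder or file")
    parser.add_argument("--out", default=".", help="path to output file location")
    parser.add_argument(
        "--log_level",
        choices=["critical", "error", "warning", "info", "debug"],
        default="info",
        help="logger level to use (default: info)",
    )
    parser.add_argument("--search", help="word to search for")
    parser.add_argument("--tessdata", help="location of tesseract data files")
    # Assumption: the value is parsed as a boolean; the README only says the default is True.
    parser.add_argument(
        "--tables",
        type=str2bool,
        default=True,
        help="should tables be extracted and searched",
    )
    return parser
```

Calling `build_parser().parse_args(["--search", "invoice", "--tables", "false"])` would leave `path` and `out` at `"."` and turn table search off, which is the faster configuration according to the README text.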

setup.py

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@
 "yattag==1.14.0",
 ],
 name="PDFScraper",
-version="1.0.4",
+version="1.0.5",
 author="Erik Kastelec",
 author_email="erikkastelec@gmail.com",
 description="PDF text and table search",
