README improvements

erikkastelec · erikkastelec · commit 2466110462c1 · 2020-08-09T20:01:49.000+02:00
diff --git a/PDFScraper/__init__.py b/PDFScraper/__init__.py
@@ -1,4 +1,4 @@
-__version__ = "1.0.4"
+__version__ = "1.0.5"
 
 
 def version():
diff --git a/README.md b/README.md
@@ -1,11 +1,45 @@
 # PDFScraper
 CLI program for searching text and tables inside of PDF documents and displaying results in HTML. It combines [Pdfminer.six](https://github.com/pdfminer/pdfminer.six), [Camelot](https://github.com/camelot-dev/camelot) and [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) in a single program, which is simple to use.
 
-# How to install
-### Using pip
+# How to use
+### Install using pip
 
-After installing the dependencies you can simply use pip to install PDFScraper:
+Use pip to install PDFScraper:
 
 <pre>
 $ pip install PDFScraper
 </pre>
+
+### Arguments
+<pre>
+optional arguments:
+  -h, --help            show this help message and exit
+  --path PATH           path to pdf folder or file
+  --out OUT             path to output file location
+  --log_level {critical,error,warning,info,debug}
+                        logger level to use (default: info)
+  --search SEARCH       word to search for
+  --tessdata TESSDATA   location of tesseract data files
+  --tables TABLES       should tables be extracted and searched
+</pre>
+
+
+
+`path`, by default ".", specifies the location of the PDF folder or file.
+
+`out`, by default ".", specifies the output directory in which the `summary.html` file is created.
+
+`search` specifies the word or sentence to search for in the PDF documents.
+
+`tessdata` specifies a custom tessdata location for OCR analysis.
+
+`tables`, by default True, specifies whether tables are also searched for the search term. Disabling table search improves speed significantly.
+
+### OCR
+
+**Pretrained tessdata language [files](https://github.com/tesseract-ocr/tessdata_best) must be added manually to the tessdata directory.**
+
+
+OCR analysis of PDF documents currently supports English and Slovenian.
+The language of the document is detected automatically using the [langdetect library](https://github.com/Mimino666/langdetect).
+
diff --git a/setup.py b/setup.py
@@ -37,7 +37,7 @@
         "yattag==1.14.0",
     ],
     name="PDFScraper",
-    version="1.0.4",
+    version="1.0.5",
     author="Erik Kastelec",
     author_email="erikkastelec@gmail.com",
     description="PDF text and table search",
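
For reviewers, the argument list added to the README in this commit can be mirrored by a small `argparse` parser. The following is only a sketch under stated assumptions, not the project's actual entry point: the `prog` name, the `build_parser` and `str_to_bool` helpers, and the way `--tables` converts its value to a boolean are all hypothetical. Only the flag names, help strings, and documented defaults are taken from the README hunk above.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Hypothetical helper: interpret common true/false spellings."""
    lowered = value.lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags listed in the README (sketch only)."""
    # The prog name is an assumption made for this illustration.
    parser = argparse.ArgumentParser(prog="pdfscraper-sketch")
    parser.add_argument("--path", default=".",
                        help="path to pdf folder or file")
    parser.add_argument("--out", default=".",
                        help="path to output file location")
    parser.add_argument("--log_level", default="info",
                        choices=["critical", "error", "warning", "info", "debug"],
                        help="logger level to use (default: info)")
    parser.add_argument("--search", help="word to search for")
    parser.add_argument("--tessdata",
                        help="location of tesseract data files")
    # Assumption: the value is given as text (e.g. "false") and converted.
    parser.add_argument("--tables", type=str_to_bool, default=True,
                        help="should tables be extracted and searched")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this sketch, passing `--search invoice --tables false` leaves `path` and `out` at their documented default of "." and turns table search off.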
