@@ -5,7 +5,11 @@ workflows.
[Tesseract](https://github.com/tesseract-ocr/tesseract?tab=readme-ov-file#about) is an open source OCR (Optical
Character Recognition) engine that can recognize text (machine/typed/printed text, not handwritten) in images (e.g. PNG
- or JPEG).
+ or JPEG) or images embedded in PDF files.
+
+ To read text from PDFs directly (not from images), you may want to use
+ [the **Extract from PDF** operation of the built-in **Extract from file** node
+ ](https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.extractfromfile/#operations).
[n8n](https://n8n.io/) is a [fair-code licensed](https://docs.n8n.io/reference/license/) workflow automation platform.
@@ -25,6 +29,9 @@ You can quickly get started by importing [the sample playbook](./sample_workflow
## Operations
+ Note: both operations output a new binary field, called `ocr`, which contains the image that was actually OCR'd.
+ For input images, this is the input image itself. For input PDFs, each output item carries one of the images extracted from the PDF.
+
### Extract text
This operation reads the text of the entire image. It outputs a JSON item containing the entire recognized text, and a "
@@ -37,6 +44,9 @@ confidence value" indicating how likely the generated text is to match the sourc
}
```
+ If passed a PDF instead of an image, the node may output several items, one for each image in the PDF. Each item will
+ have the format described above.
+
### Extract boxes
This operation also reads text, but returns more information about the bounding box of each detected block, and the
@@ -86,6 +96,9 @@ Per-line statistics:
![an image of the same text with Tesseract per-line detections overlaid as one red box covering each line](imgs/lines.png)
+ If passed a PDF instead of an image, the node may output several items, one for each image in the PDF. Each item will
+ have the format described above.
+
## Compatibility
This node has been tested on n8n v1.68.0, but should also work on older versions. If you encounter an issue with an
@@ -100,7 +113,8 @@ provided:
![a screenshot of the node UI showing an input item with Binary data](imgs/iifn.png)
- The Binary file with that name will be read and processed.
+ The Binary file with that name will be read and processed. It should be an image or a PDF document. If it is a PDF, all images
+ inside the PDF will be extracted and processed separately.
### Detect on Entire Image?
@@ -178,8 +192,13 @@ Initial version, contains the **Extract text** and **Extract boxes** operations.
### v1.3.0
- * Add a Timeout option to control the max processing time (
- closes [#3](https://github.com/jreyesr/n8n-nodes-tesseractjs/issues/3))
+ * Add a Timeout option to control the max processing time
+ (closes [#3](https://github.com/jreyesr/n8n-nodes-tesseractjs/issues/3))
+
+ ### v1.4.0
+
+ * Add the ability to extract all images from a PDF and process them, in addition to single images
+ (closes [#4](https://github.com/jreyesr/n8n-nodes-tesseractjs/issues/4))

## Developer info