Commit 721908b

Prep for release v1.4.0
1 parent c999b56 commit 721908b

2 files changed, +24 -5 lines changed

README.md

Lines changed: 23 additions & 4 deletions
@@ -5,7 +5,11 @@ workflows.
 
 [Tesseract](https://github.com/tesseract-ocr/tesseract?tab=readme-ov-file#about) is an open source OCR (Optical
 Character Recognition) engine that can recognize text (machine/typed/printed text, not handwritten) in images (e.g. PNG
-or JPEG).
+or JPEG) or images embedded in PDF files.
+
+To read text from PDFs directly (not from images), you may want
+[the **Extract from PDF** operation of the built-in **Extract from file** node
+](https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.extractfromfile/#operations)
 
 [n8n](https://n8n.io/) is a [fair-code licensed](https://docs.n8n.io/reference/license/) workflow automation platform.
 
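
The PDF behaviour this commit documents in the README (an image input is OCR'd as-is, while a PDF input yields one output item per embedded image, each carrying an `ocr` binary field) can be illustrated with a minimal TypeScript sketch. Only the `ocr` field name comes from the README text; the type names, the `kind` discriminator, the `text` and `confidence` field names, and the injected `recognize` function are assumptions made for illustration. The node's actual implementation is not part of this diff.

```typescript
// Hypothetical shapes for illustration only; not the node's real types.
type InputFile =
  | { kind: "image"; data: string }             // a single image
  | { kind: "pdf"; embeddedImages: string[] };  // images found inside a PDF

interface OutputItem {
  json: { text: string; confidence: number };   // field names are assumed
  binary: { ocr: string };                      // the image that was actually OCR'd
}

// Stand-in for the OCR engine call, injected so the sketch runs without Tesseract.
type Recognizer = (image: string) => { text: string; confidence: number };

// An image is processed directly; a PDF contributes every embedded image.
function imagesToOcr(file: InputFile): string[] {
  return file.kind === "pdf" ? file.embeddedImages : [file.data];
}

// Produces one output item per image that was OCR'd.
function ocrFile(file: InputFile, recognize: Recognizer): OutputItem[] {
  return imagesToOcr(file).map((image) => ({
    json: recognize(image),
    binary: { ocr: image },
  }));
}

// Fake recognizer so the fan-out can be observed without a real OCR engine.
const fakeRecognize: Recognizer = (image) => ({
  text: `text of ${image}`,
  confidence: 90,
});

const pdfItems = ocrFile(
  { kind: "pdf", embeddedImages: ["page1.png", "page2.png"] },
  fakeRecognize,
);
const imageItems = ocrFile({ kind: "image", data: "scan.png" }, fakeRecognize);

console.log(pdfItems.length, imageItems.length); // prints: 2 1
```

In this sketch a PDF with no embedded images would produce no items at all, which is one reason the README's pointer to the built-in **Extract from file** node matters for text-only PDFs.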

@@ -25,6 +29,9 @@ You can quickly get started by importing [the sample playbook](./sample_workflow
 
 ## Operations
 
+Note: both operations will output a new binary field, called `ocr`, which contains the image that was actually OCR'd.
+For input images, this will be the same image. For input PDFs, this will be each image in the PDF.
+
 ### Extract text
 
 This operation reads the text of the entire image. It outputs a JSON item containing the entire recognized text, and a "
@@ -37,6 +44,9 @@ confidence value" indicating how likely the generated text is to match the sourc
 }
 ```
 
+If passed a PDF instead of an image, the node may output several items, one for each image in the PDF. Each item will
+have the format described above.
+
 ### Extract boxes
 
 This operation also reads text, but returns more information about the bounding box of each detected block, and the
@@ -86,6 +96,9 @@ Per-line statistics:
 
 ![an image of the same text with Tesseract per-line detections overlaid as one red box covering each line](imgs/lines.png)
 
+If passed a PDF instead of an image, the node may output several items, one for each image in the PDF. Each item will
+have the format described above.
+
 ## Compatibility
 
 This node has been tested on n8n v1.68.0, but should also work on older versions. If you encounter an issue with an
@@ -100,7 +113,8 @@ provided:
 
 ![a screenshot of the node UI showing an input item with Binary data](imgs/iifn.png)
 
-The Binary file with that name will be read and processed.
+The Binary file with that name will be read and processed. It should be an image or a PDF document. If a PDF, all images
+inside the PDF will be extracted and processed separately.
 
 ### Detect on Entire Image?
 
@@ -178,8 +192,13 @@ Initial version, contains the **Extract text** and **Extract boxes** operations.
 
 ### v1.3.0
 
-* Add a Timeout option to control the max processing time (
-closes [#3](https://github.com/jreyesr/n8n-nodes-tesseractjs/issues/3))
+* Add a Timeout option to control the max processing time
+(closes [#3](https://github.com/jreyesr/n8n-nodes-tesseractjs/issues/3))
+
+### v1.4.0
+
+* Add the ability to extract all images from a PDF and process them, in addition to single images
+(closes [#4](https://github.com/jreyesr/n8n-nodes-tesseractjs/issues/4))
 
 ## Developer info
 

package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
 "name": "n8n-nodes-tesseractjs",
-"version": "1.3.0",
+"version": "1.4.0",
 "description": "A n8n module that exposes Tesseract.js, an OCR library that can detect text on images",
 "keywords": [
 "n8n-community-node-package"
