[Feature Request] Getting EasyOCR to work with paperless-ngx #6056
Replies: 5 comments 3 replies
-
I am working on a (stupid) attempt to check whether EasyOCR can be used in paperless-ngx. The idea now is to remove the multiprocessing module from the plugin. The only drawback is that the plugin itself will no longer run anything in parallel, but since paperless-ngx already handles parallelism, there is a chance the ocrmypdf-easyocr plugin will then work.
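To make the idea concrete, here is a minimal sketch of that kind of fallback. It is not the plugin's actual code: `ocr_page` and `ocr_all_pages` are hypothetical names, and the per-page function is only a stand-in for the real EasyOCR call. It assumes the plugin is being run from inside a daemonic worker process, which is what the error message suggests.

```python
import multiprocessing


def in_daemonic_process() -> bool:
    """True when the current process is a daemon and so may not start children."""
    return multiprocessing.current_process().daemon


def ocr_page(page_number: int) -> str:
    # Hypothetical stand-in for the real per-page EasyOCR call.
    return f"text of page {page_number}"


def ocr_all_pages(page_numbers, workers: int = 1):
    """Process pages in order, using a pool only when that is allowed."""
    if workers <= 1 or in_daemonic_process():
        # Sequential path: no child processes, so it is safe inside a
        # daemonic worker. Parallelism is left to the calling application.
        return [ocr_page(n) for n in page_numbers]
    with multiprocessing.Pool(workers) as pool:
        return pool.map(ocr_page, page_numbers)
```

With the default `workers=1` the pool branch is never taken, which matches the "no parallelism inside the plugin" approach described above.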
-
Interestingly, I did some hackety-hack 'I have no idea what I am doing.gif' type stuff to get my paperless-ngx working with EasyOCR back in January. It was slow and borked itself a few times (and the model downloading was painful), but I got there in the end.

I hadn't touched my paperless instance since January, though. I had rebooted the server a few times since then, so docker-compose likely did some rebuilding? Today I get the "AssertionError: daemonic processes are not allowed to have children" message when trying to OCR a new document, which I did not encounter in the 200 previous files I had OCR'd back in January. So I googled the error and found this thread. :)

I'll attach my gubbins in the hope it proves useful to someone. /zdata/containers/ is a ZFS storage pool on a NAS, mounted locally on my server over NFS.

OS: Ubuntu 22.04.4 LTS x86_64

$ nvidia-smi
/zdata/containers/paperless-ngx/ $ ls /zdata/containers/paperless-ngx/docker-compose.yml
/zdata/containers/paperless-ngx/custom-cont-init.d/easyocrwithgpubackend.sh
My docker-compose env: I wasn't 100% clear on the effect of threads vs. workers on system RAM and GPU RAM, so... I iterated. Comments ended up in my file below.
Let me know if you need anything else. This was several late nights of stuffing around; I'm not any kind of actual dev.
-
I would love to see the possibility of using different OCR engines in paperless-ngx. Would a module-like approach be possible, so that we could simply add different or better models in the future?
-
This request is further proof that all built-in features should be replaced by workflows that permit replacing components. Not all ingested documents really need OCR. If OCR is made optional, the DMS becomes an asset management system. If specific documents can be "believed" without passing them to OCR (because they contain a known-good text layer), that should be possible too. The same goes for running them through several different OCR engines and comparing the results with each other (and with dictionaries), keeping the version with the fewest words that can't be found in the dictionaries.
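As a rough illustration of that last idea, here is a small sketch that picks the OCR output with the fewest out-of-dictionary words. Everything in it is hypothetical: the function names, the tiny word list, and the sample strings are made up for the example, and nothing like this exists in paperless-ngx as far as this thread shows.

```python
import re


def unknown_word_count(text: str, dictionary: set[str]) -> int:
    """Count alphabetic tokens in `text` that are not in `dictionary`."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return sum(1 for word in words if word not in dictionary)


def pick_best_ocr(candidates: dict[str, str], dictionary: set[str]) -> str:
    """Return the label of the candidate text with the fewest unknown words."""
    return min(
        candidates,
        key=lambda label: unknown_word_count(candidates[label], dictionary),
    )


# Made-up outputs from two engines for the same page.
sample_outputs = {
    "engine_a": "lnvoice t0tal due",
    "engine_b": "invoice total due",
}
sample_dictionary = {"invoice", "total", "due"}
best = pick_best_ocr(sample_outputs, sample_dictionary)
```

A raw count favours shorter outputs, so a real workflow would probably want to normalise by the number of words and use per-language dictionaries.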
-
So there's not much to do. I more or less share @noseshimself's view about OCR being part of a workflow.
-
Description
Tesseract gives some pretty terrible results compared to EasyOCR.
I would love to see the EasyOCR engine supported.
There's already a more generic discussion about alternative OCR plugins going on at #5128, but this one is EasyOCR specific.
Since EasyOCR is already supported by an OCRMyPDF plugin, I began diving into the code in the hope of getting the plugin to work.
Here's how far I got:
As long as the ocrmypdf-easyocr plugin is installed, it will be used by ocrmypdf
There are a couple of issues with using the ocrmypdf-easyocr plugin in paperless, of which a couple can be resolved:
There's one big problem that must be solved between the paperless-ngx and OCRMyPDF developers, I think.
Perhaps more on the OCRMyPDF side since tesseract integration already works.
When using paperless-ngx with the EasyOCR plugin, OCR fails with the
AssertionError: daemonic processes are not allowed to have children
error message, since both products use multiprocessing pools (and threading). I've opened an issue at ocrmypdf/OCRmyPDF-EasyOCR#7 in the hope that the OCRMyPDF developer will make the plugin compatible with paperless-ngx.
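For anyone who wants to see where that message comes from, it can be reproduced with the standard library alone. The sketch below is not paperless-ngx or plugin code, and `worker_body` and `reproduce_error` are hypothetical names. It starts a daemonic process (the workers of a `multiprocessing.Pool` are daemonic too) and has it try to start a child of its own. It assumes a platform with the `fork` start method, such as Linux.

```python
import multiprocessing


def _noop():
    """Target for the child process that never gets to run."""


def worker_body(conn):
    """Runs inside a daemonic process and tries to start a child of its own."""
    try:
        child = multiprocessing.Process(target=_noop)
        child.start()  # the assertion fires here, before anything is spawned
        child.join()
        conn.send("child started")
    except AssertionError as exc:
        conn.send(f"AssertionError: {exc}")


def reproduce_error() -> str:
    # Assumption: the "fork" start method is available (Linux, not Windows).
    ctx = multiprocessing.get_context("fork")
    receiver, sender = ctx.Pipe(duplex=False)
    worker = ctx.Process(target=worker_body, args=(sender,), daemon=True)
    worker.start()
    message = receiver.recv()
    worker.join()
    return message
```

Calling `reproduce_error()` returns the same AssertionError text as above, which is why a plugin that opens its own pool fails when it is itself run from a daemonic worker.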
What's next?
Hopefully the project developers can comment on this issue.
Other
No response