"Import Status: Running file import" stuck. #282

Open
ZeroCool940711 opened this issue Apr 20, 2020 · 19 comments

@ZeroCool940711

Seems like OpenSemanticSearch is stuck extracting and analyzing some files. It's been more than a few days and it's still showing the same message when searching; even after rebooting, it's still stuck on the same files. It doesn't seem to be indexing anything new, as the total document count is still the same as before, and OSS doesn't seem to be doing anything else.

@wAikAp

wAikAp commented Apr 27, 2020

Same here. I waited a long time but it doesn't seem to be working, and Flower shows no active session.
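
For anyone who wants to check the Flower side without the web UI, here is a minimal sketch that asks Flower's REST API which workers it knows about and how many tasks are in each state. The base URL is an assumption (Flower's default port on localhost); your install may expose it elsewhere, and newer Flower versions may require authentication for the API. The helper names are made up for illustration.

```python
import json
from collections import Counter
from urllib.request import urlopen

# Assumption: Flower listens on its default port; adjust for your install.
FLOWER_URL = "http://localhost:5555"


def fetch_json(path, base_url=FLOWER_URL, timeout=10):
    """GET a Flower REST endpoint and decode the JSON body."""
    with urlopen(base_url + path, timeout=timeout) as response:
        return json.load(response)


def summarize_tasks(tasks):
    """Count tasks per state.

    `tasks` is the decoded /api/tasks response: a dict keyed by task id
    whose values carry a "state" field such as STARTED or SUCCESS.
    """
    return Counter(info.get("state", "UNKNOWN") for info in tasks.values())


def report(base_url=FLOWER_URL):
    """Print the known workers and the number of tasks per state."""
    workers = fetch_json("/api/workers", base_url)
    tasks = fetch_json("/api/tasks", base_url)
    print("workers:", sorted(workers))
    print("task states:", dict(summarize_tasks(tasks)))
```

Calling `report()` on the server should list at least one worker if the queue is alive. An empty worker list together with a non-zero "documents to process" count would point at the worker rather than at the files.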

@ZeroCool940711
Author

I think after some time it just stops working. In my case, after 75 billion documents it doesn't index or process anything, even though the CPU and RAM on my server are not being used at all. It seems like there is some internal limit or something is broken; nothing is logged, so it's hard to tell what's going on.

@wAikAp

wAikAp commented Apr 28, 2020

But I am indexing just 6 files. It seems 1 .ppt file can't complete the OCR task; I have waited 2 days and the import status is still "Running file import (still 1 documents to process)".

@srich

srich commented May 1, 2020

I am also experiencing this issue testing out open-semantic-search 20.02.08. Is there a service that needs to be restarted, or how is this issue resolved?

Adding start/stop instructions for the services other than "solr" would be helpful, as well as the order of operations. https://www.opensemanticsearch.org/doc/admin/cmd

@DetlevCM

Is it slow or is it stuck? - I set up a new instance on a laptop (with really too little RAM, so there will be a lot of swapping), and it seemed stuck on 3 files. After maybe 2 days it was down to 2 files. So not stuck, but slow due to swapping...

Though I will also say that the User Interface is not ideal as it would be nice to know which files are missing...
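
One rough way to tell "slow" from "stuck" is to sample the Solr document count a few times and see whether it moves. The sketch below assumes Solr on its default port and a core named `opensemanticsearch`; both are assumptions, so check your own Solr admin page. Note that it counts all documents, so it only shows whether new documents are arriving, not whether text extraction of already listed files is progressing.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumptions: default Solr port and a core named "opensemanticsearch".
SOLR_CORE_URL = "http://localhost:8983/solr/opensemanticsearch"


def count_url(query="*:*", core_url=SOLR_CORE_URL):
    """Build a select URL that returns only the hit count (rows=0)."""
    params = urlencode({"q": query, "rows": 0, "wt": "json"})
    return f"{core_url}/select?{params}"


def parse_num_found(body):
    """Extract numFound from a Solr JSON select response."""
    return json.loads(body)["response"]["numFound"]


def fetch_count(query="*:*", core_url=SOLR_CORE_URL, timeout=30):
    """Ask Solr how many documents match the query."""
    with urlopen(count_url(query, core_url), timeout=timeout) as response:
        return parse_num_found(response.read())


def classify_progress(samples):
    """Compare the first and last sampled counts."""
    if len(samples) < 2:
        return "unknown"
    return "progressing" if samples[-1] != samples[0] else "no change"
```

Taking a sample with `fetch_count()` every hour or so and passing the list to `classify_progress` gives a crude answer. A different `query` can be passed in if you know a field that marks fully processed documents in your index.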

@ZeroCool940711
Author

In my case it's completely stuck. It's not a RAM problem, as the server I'm running it on has a lot of RAM. I think it might have something to do with images being deleted before it can process them. If I'm right, images are not downloaded to the server but are used directly from the website where they were indexed, so an image could have been deleted or moved before it could be processed, or OSS may not have access to it. It could be retrying the same files over and over, and because they are not accessible the process can never complete.

@olli0815

olli0815 commented Jun 14, 2020

I have the same problem: indexing a small folder via "opensemanticsearch-index-dir" leads to the message "Running file import (still 77 documents to process)".
The CLI shows "Indexing new file: ....", but index creation seems to be stuck.
The folder contains only simple text files without any images.

Any hints on finding the root cause? Logs?

Edit:
Indexing a single file from the same folder with opensemanticsearch-index-file runs fine.
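
Since single-file indexing works here, one way to narrow down which file hangs is to index the folder one file at a time with a time limit on each call. This is only a sketch: it assumes `opensemanticsearch-index-file` takes the file path as its argument and does its work before returning, and the function names and the 600 second limit are made up for illustration.

```python
import subprocess
from pathlib import Path

# CLI named in the comment above; the single path argument is an assumption.
INDEX_FILE_CMD = "opensemanticsearch-index-file"


def index_one(path, timeout_s=600, cmd=INDEX_FILE_CMD):
    """Index a single file and return 'ok', 'failed' or 'timeout'."""
    try:
        # subprocess.run kills the child process when the timeout expires.
        result = subprocess.run([cmd, str(path)], timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"


def index_folder(folder, index=index_one):
    """Index every regular file below `folder`, one at a time.

    Returns a dict mapping each problematic path to its status.
    """
    problems = {}
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            status = index(path)
            if status != "ok":
                problems[str(path)] = status
    return problems
```

Running `index_folder("/path/to/folder")` and printing the result would list the files that failed or exceeded the limit, which is at least a starting point for finding a root cause.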

@mbanks850

mbanks850 commented Jul 2, 2020

Mine is similar; it has looked this way since February, and there have been a bunch of reboots and crashes. I am running the 20.01.17 release. I was thinking of downloading 20.04.17 to see if it made any difference.

It would be nice if there were a timeout that skips the current document and moves on, letting it come back to that document on the next pass.

Import status: Running file import (still 5071601 documents to process)

Because of yet running and open tasks like text extraction and analysis maybe not all results were found yet, since at the moment of this search 5071601 file(s) could be only searched, overviewed and filtered by their file names only, not yet by their content and/or content based facets/filters!

 Previous Newest 10 of 5339085 documents 
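
To make the timeout idea above concrete, here is a minimal sketch of the scheduling logic only: documents that exceed their time budget are set aside and retried on the next pass, and given up on after a fixed number of passes. It is not how OSS works internally. `process` is a hypothetical callable that raises `TimeoutError` when a document takes too long; how the limit is enforced is left open.

```python
from collections import deque


def run_passes(documents, process, max_passes=3):
    """Process documents, deferring those that time out to the next pass.

    `process(doc)` is a hypothetical callable that raises TimeoutError
    when a document exceeds its time budget.
    Returns (done, given_up): documents that succeeded, in completion
    order, and documents that still timed out on the last pass.
    """
    pending = deque(documents)
    done = []
    for _ in range(max_passes):
        if not pending:
            break
        deferred = deque()
        while pending:
            doc = pending.popleft()
            try:
                process(doc)
            except TimeoutError:
                deferred.append(doc)  # skip for now, retry on the next pass
            else:
                done.append(doc)
        pending = deferred
    return done, list(pending)
```

With something like this, one unreadable file would delay only itself instead of blocking everything behind it, and the `given_up` list would tell the user which files are the problem.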

@DetlevCM

DetlevCM commented Aug 9, 2020

If anybody wants to do some testing, I wonder if the problem does not stem from an interaction of components (it might be too early to tell just now on my end):

I decided to "clean up" and start with a new freshly configured instance of OpenSemantic Search. (Side note: after updating Debian, I immediately had some corrupted files in /var/lib/dpkg/info ... - I wonder why and how.)

In order to reduce the computational cost and also because I am not sure it adds value in my specific use case, I disabled both the Named Entity Recognition (Spacy) and the Graph DB (neo4j).
So far it seems that the import is running fast without any problems. At present it is OCRing the documents. Add to that, significantly fewer files are written to /tmp (I had something daft like 200.000 files or so before...)
So far I see about 500 - the pages from the document.

I guess I will see in "a while" (whenever...) if this helps.
Incidentally, my previous installation of OpenSemantic Search never calmed down and seemed to continue working indefinitely...
(I am using it as a local search engine for my document library. I don't need more than a search engine, so all the machine learning and the semantics support are not important to me.)
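
For comparing the number of temporary files before and after a configuration change like this, a small sketch that counts the files below a directory and groups them by extension. The default of `/tmp` follows the comment above; the function name is made up.

```python
import os
from collections import Counter


def count_files(root="/tmp"):
    """Return (total_files, Counter of lower-cased extensions) below root.

    Directories that cannot be read are silently skipped by os.walk.
    """
    by_ext = Counter()
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += 1
            by_ext[os.path.splitext(name)[1].lower() or "(none)"] += 1
    return total, by_ext
```

`count_files()[1].most_common(5)` shows which file types dominate, which may hint at which component is leaving them behind.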

@mbanks850

What steps did you use to disable Named Entity Recognition and Graph DB? Do you know what we would lose by disabling those features?

@DetlevCM

@mbanks850 I only use the interface that is exposed to the user. -> Open Semantic Search interface -> config in the top menu -> "Named Entity Recognition" and "Graph DB (Neo4j)" options.

Both entries have some descriptions. The graph database deals with relationships between documents and the named entity recognition tries to understand the document based on machine learning principles.

Given that the project (based on the description) was developed to deal with data dumps for journalists, such tools may be very, very useful. Using it as a local document search engine, the relations (graph database) become less interesting. The Named Entity Recognition could be useful, but is possibly not well tuned to, for example, technical documents. It may also be that the Named Entity Recognition deals with the semantic aspects of the search, so turning it off may make OpenSemantic Search "dumber".
Given that I want to search a database of papers and technical documents that I create, this seems fair enough for me.

Now for some reason, this has led to Open Semantic Search not showing me how many files it still wants to OCR... - But Tesseract OCR is the only process hogging the CPU. (I don't think SOLR is particularly heavy for the straightforward searching. It is the part that tries to be clever which is CPU-intensive.)

@mbanks850

Thank you, looking at the descriptions, Graph DB is not something I will need. Named Entity maybe, but we are also just using it as a search in technical documents.

@RiteshSingh

Same issue in OSS 20.04.17 and 20.01.17

@rusty9283

rusty9283 commented Jan 13, 2021

Same issue here.

Ubuntu 20.04, OSS 20.11.01 and 21.01.03

Indexing via opensemanticsearch-index-dir -> ~210.000 files.
After about 16 hours only 2 documents have been extracted, but the CPU is at 100% with 8 tasks from "etl_tasks".

"NER" and "Neo4j" are disabled.

I tried resetting the file monitoring and deleting the index, but the CPU stays at 100% with "etl_tasks" and nothing gets indexed.

Only when I stop the service "opensemanticetl" does CPU usage return to normal.

Does anyone have news about this?

@rusty9283

After some testing I think my problem may be a different one: #341

@movanet

movanet commented Mar 11, 2021

Same issue here. It's been a few months since this post. Did you encounter any other file import issues after this?
Also, would it help to turn these options off after the fact (after it got stuck), or do we need to start clean and do another indexing run?

> @mbanks850 I only use the interface that is exposed to the user. -> Open Semantic Search interface -> config in the top menu -> "Named Entity Recognition" and "Graph DB (Neo4j)" options.

@nikhilbhalwankar

nikhilbhalwankar commented Jul 3, 2021

I had the same problem today. Around 480,000 documents had been indexed but were stuck at file import. I waited around 6 to 7 hours but it still looked stuck. I restarted the server and the import process started automatically. I am not sure, but the issue looks related to the Flower server worker. I am using the virtual machine appliance (21.01.17).

@HenryJones23

This issue is still unresolved. I have probably encountered the same problem (Open Semantic Search installation package from 22.10.08). It regularly hangs during the extraction of files (see issue #461 for details). Did you guys ever find any solution to this?

@Pooja1905

I am facing a similar issue. Can anyone guide me on this? I have checked the Solr error logs, syslog, etc. and there don't seem to be any errors. The CPU of my EC2 instance is quite busy, not idle. I have deleted and recreated the indexes a couple of times, but there is no change in the "Running file imports ..." count. I have 95-100 GB of data (mixed media: PDFs, images, videos, audio, PNGs, CSV, etc.).

I have left it alone for 2 days now and it hasn't made a dent in the numbers; however, CPU utilization is 80-95% on average.
