"Import Status: Running file import" stuck. #282
Comments
Same here. I waited a long time but it does not seem to be working, and Flower shows no active session. |
I think after some time it just stops working. In my case, after 75 billion documents it doesn't index or process anything, even though the CPU and RAM on my server are not being used at all. It seems like there is some internal limit or something is broken. Nothing is logged, so it's hard to tell what's going on. |
But I am indexing just 6 files. It seems the OCR task can't complete for one .ppt file; I waited for 2 days and the import status is still "Running file import (still 1 documents to process)". |
I am also experiencing this issue testing out open-semantic-search 20.02.08. Is there a service that needs to be restarted, or how is this issue resolved? Adding start/stop instructions for services in addition to "solr" would be helpful, as well as the order of operations. https://www.opensemanticsearch.org/doc/admin/cmd |
Is it slow or is it stuck? - I set up a new instance on a laptop (with really too little RAM, so there will be a lot of swapping), and it seemed stuck on 3 files. After maybe 2 days it was down to 2 files. So not stuck, but slow due to swapping... Though I will also say that the User Interface is not ideal as it would be nice to know which files are missing... |
In my case it's completely stuck. It's not a RAM problem, as the server I'm running it on has a lot of RAM. I think it might have something to do with images being deleted before it can process them. If I'm right, images are not downloaded to the server but are used directly from the website where they were indexed. So it could be that an image was deleted or moved before it could be processed, or that it doesn't have access to the image. It could be trying the same files over and over, and because they are not accessible the process cannot be completed. |
I have the same problem: indexing a small folder via "opensemanticsearch-index-dir" leads to the message "Running file import (still 77 documents to process)". Any hints on how to find the root cause? Logs? |
Mine is similar. It has looked this way since February, and there have been a bunch of reboots and crashes. I am running the 20.01.17 release. I was thinking of downloading 20.04.17 to see if it made any difference. It would be nice if there was a timeout: skip the current document, move on, and come back to it on the next pass.
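One way to test the "inaccessible files" hypothesis for files on local disk is to scan the indexed directory and list everything that cannot be opened. This is only a minimal sketch and not part of Open Semantic Search: the function name `find_unreadable` is hypothetical, and it does not cover the remote-image case described above.

```python
import os


def find_unreadable(root):
    """Return sorted (path, error name) pairs for files under root that cannot be read."""
    problems = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    handle.read(1)  # read one byte to surface I/O errors
            except OSError as exc:
                # Covers missing targets (dangling symlinks), permission errors, etc.
                problems.append((path, type(exc).__name__))
    return sorted(problems)
```

Calling `find_unreadable` on the folder that was passed to the indexer gives a list of candidate files to inspect; an empty list would suggest the problem lies elsewhere.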
|
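The "timeout, skip, and come back on the next pass" idea suggested above could look roughly like the following. This is a sketch under assumptions, not how Open Semantic Search actually schedules its work: `run_with_timeout`, `process_with_skips`, and the `make_cmd` callback are hypothetical names, and each document is assumed to be handled by one external command.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run one external command; return True on success, False on failure or timeout."""
    try:
        # subprocess.run kills the child process when the timeout expires.
        result = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def process_with_skips(paths, make_cmd, timeout_s):
    """Process each path once; return (done, deferred) lists."""
    done, deferred = [], []
    for path in paths:
        if run_with_timeout(make_cmd(path), timeout_s):
            done.append(path)
        else:
            deferred.append(path)  # skipped now, to be retried on the next pass
    return done, deferred


if __name__ == "__main__":
    # Stand-in command: a Python child process that finishes immediately.
    demo_cmd = lambda path: [sys.executable, "-c", "pass"]
    print(process_with_skips(["a.pdf", "b.pdf"], demo_cmd, timeout_s=10))
```

With this shape, a single document that hangs only costs `timeout_s` seconds per pass instead of blocking the whole queue, and the `deferred` list doubles as a report of which files are holding things up.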
If anybody wants to do some testing, I wonder if the problem does not stem from an interaction of components (it might be too early to tell just now on my end): I decided to "clean up" and start with a new freshly configured instance of OpenSemantic Search. (Side note: after updating Debian, I immediately had some corrupted files in /var/lib/dpkg/info ... - I wonder why and how.) In order to reduce the computational cost and also because I am not sure it adds value in my specific use case, I disabled both the Named Entity Recognition (Spacy) and the Graph DB (neo4j). I guess I will see in "a while" (whenever...) if this helps. |
What steps did you use to disable Named Entity Recognition and Graph DB? Do you know what we would lose by disabling those features? |
@mbanks850 I only use the interface that is exposed to the user. -> Open Semantic Search interface -> config in the top menu -> "Named Entity Recognition" and "Graph DB (Neo4j)" options. Both entries have some descriptions. The graph database deals with relationships between documents, and the named entity recognition tries to understand the document based on machine learning principles. Given that the project (based on the description) was developed to deal with data dumps for journalists, such tools may be very, very useful. When using it as a local document search engine, the relations (graph database) become less interesting. The Named Entity Recognition could be useful, but is possibly not well tuned for, say, technical documents. It may also be that the Named Entity Recognition deals with the semantic aspects of the search, so turning it off may make Open Semantic Search "dumber". Now for some reason, this has led to Open Semantic Search not showing me how many files it still wants to OCR... - But tesseract OCR is the only process hogging the CPU. (I don't think Solr is particularly heavy for straightforward searching. It is the part that tries to be clever which is CPU-intensive.) |
Thank you, looking at the descriptions, Graph DB is not something I will need. Named Entity maybe, but we are also just using it as a search in technical documents. |
Same issue in OSS 20.04.17 and 20.01.17 |
Same issue here. Ubuntu 20.04, OSS 20.11.01 and 21.01.03. Indexing via opensemanticsearch-index-dir -> ~210.000 files. "NER" and "Neo4j" are disabled. I tried to reset file monitoring and deleted the index, but the CPU is always at 100% with "etl_tasks" without anything being indexed. Only if I stop the service "opensemanticetl" does the CPU return to normal use. Does anyone have news about this? |
After some testing I think my problem is maybe another: #341 |
Same issue here. It's been a few months since this post. Did you encounter any other file import issues after this?
|
I had the same problem today. Around 4,80,000 documents got indexed but they were stuck at file import. I waited for around 6 to 7 hours but it still looked to be stuck. I restarted the server and the import process started automatically. I am not sure, but it looks like the issue is related to the Flower server worker. I am using the virtual machine appliance (21.01.17). |
This issue is still unresolved. I have probably encountered the same problem (Open Semantic Search installation package from 22.10.08). It regularly hangs during the extraction of files (see issue #461 for details). Did you guys ever find any solution to this? |
I am facing a similar issue. Can anyone guide me on this? I have checked the Solr error logs, syslogs etc. and there don't seem to be any errors as such. The CPU of my EC2 instance seems to be quite busy and not idle. I have deleted the indexes and recreated them a couple of times, but there is no change in the "Running file imports ..." count. I have 95-100 GB of data (mixed media: PDFs, images, videos, audio, PNGs, CSV etc.). I have left it alone for 2 days now and it hasn't made a dent in the numbers, yet the CPU utilization is 80-95% on average. |
Seems like OpenSemanticSearch is stuck extracting and analyzing some files. It's been more than a few days and it's still showing the same message when searching; even after rebooting it is still stuck on the same files. It doesn't seem to be indexing anything new, as the total document count is still the same as it was before, and there doesn't seem to be anything else OSS is doing.