Parsing UniProt data as quads succeeds, but fails to index. #1640

Open
JervenBolleman opened this issue Nov 25, 2024 · 3 comments

@JervenBolleman

Sorry for the limited information here.

2024-11-23 08:07:49.783 - INFO: Parsing triples from single input stream fifo.nq (parallel = true) ...
2024-11-23 08:07:49.784 - INFO: Parsing input triples and creating partial vocabularies, one per batch ...
2024-11-25 18:29:06.552 - INFO: Triples parsed: 241,428,823,831 [average speed 1.1 M/s, last batch 1.2 M/s, fastest 1.4 M/s, slowest 0.9 M/s] 
2024-11-25 18:29:07.074 - INFO: Number of triples created (including QLever-internal ones): 241,428,823,831 [may contain duplicates]
2024-11-25 18:29:07.075 - INFO: Merging partial vocabularies ...
2024-11-25 18:29:19.244 - INFO: Finished writing compressed internal vocabulary, size = 0 B [uncompressed = 0 B, ratio = 100%]
2024-11-25 18:29:20.198 - ERROR: Resource temporarily unavailable

and the Docker container is gone :(
The image was running:

cd /index; IndexBuilderMain -m 96GB -s uniprot-settings.json -F nq -f fifo.nq -i uniprot_2024_06 | tee uniprot_2024_06.index-log

uniprot-settings.json

{ "languages-internal": [], 
"prefixes-external": [""], 
"locale": { "language": "en", "country": "US", "ignore-punctuation": true }, 
"ascii-prefixes-only": true, 
"num-triples-per-batch": 5000000 }

Anything I can look at?

@hannahbast
Member

@JervenBolleman If you look at https://github.com/ad-freiburg/qlever-control/blob/main/src/qlever/Qleverfiles/Qleverfile.uniprot, you will see the line STXXL_MEMORY = 60G. The corresponding option for IndexBuilderMain is --stxxl-memory 60G.

The crash happened here: https://github.com/ad-freiburg/qlever/blob/master/src/index/IndexImpl.cpp#L563-L565, where memoryLimitIndexBuilding() is exactly the value of the --stxxl-memory option.

I have hit this several times myself, and it's always frustrating because one has waited so long for the triples to parse and then it crashes for this trivial reason :-( It's on our TODO list and until this is done, we should at least have a proper error message when this happens.

@hannahbast
Member

@JervenBolleman I see now that you have already set -m 96GB. But checking my uniprot.settings.json from when I last built UniProt, I see a batch size five times as large as yours:

{ "languages-internal": [],
"prefixes-external": [""],
"locale": { "language": "en", "country": "US", "ignore-punctuation": true },
"ascii-prefixes-only": true,
"num-triples-per-batch": 25000000 }

Some more background info on this: the merging is done using a recursive implementation of parallel merge sort, where each call gets the --stxxl-memory value divided by the number of batches: https://github.com/ad-freiburg/qlever/blob/master/src/util/ParallelMultiwayMerge.h#L236 . With around 250 B input triples and a batch size of 5 M, you get around 50 K batches, which then have to be merged. With your setting of 96 GB, each batch gets only around 2 MB, which is not enough. I am not sure exactly what the bottleneck there is; we really have to look into this, because it is so frustrating when this happens.
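
For reference, the arithmetic above can be reproduced with a few lines of Python. This is only a rough sketch of the estimate described in this comment, not of QLever's actual code: the function name `merge_budget` is made up for illustration, gigabytes are assumed to be decimal, and the memory is assumed to be split evenly across batches.

```python
import math


def merge_budget(num_triples: int, triples_per_batch: int, memory_bytes: int):
    """Return (number of batches, merge memory per batch in bytes).

    Hypothetical helper: assumes the memory limit is divided evenly by the
    number of batches, as described in the comment above.
    """
    num_batches = math.ceil(num_triples / triples_per_batch)
    return num_batches, memory_bytes / num_batches


GB = 10**9  # assumption: decimal gigabytes
triples = 241_428_823_831  # "Triples parsed" from the log in this issue

for batch_size in (5_000_000, 25_000_000):
    batches, per_batch = merge_budget(triples, batch_size, 96 * GB)
    print(f"num-triples-per-batch {batch_size:>10,}: "
          f"{batches:>6,} batches, {per_batch / 10**6:.1f} MB per batch")
```

With the numbers from this issue, the 5 M setting comes out at roughly 48 K batches with about 2 MB each, while the 25 M setting gives about a fifth as many batches with about five times the memory each.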

@hannahbast
Member

hannahbast commented Nov 25, 2024

PS: When I last built UniProt, the last line of the log regarding the parsing looked like this:

Triples parsed: 188,975,691,139 [average speed 2.9 M/s, last batch 4.8 M/s, fastest 9.0 M/s, slowest 0.0 M/s]

compared to your

Triples parsed: 241,428,823,831 [average speed 1.1 M/s, last batch 1.2 M/s, fastest 1.4 M/s, slowest 0.9 M/s]

Using MULTI_INPUT_JSON or the equivalent for IndexBuilderMain should help. Docker incurs a significant overhead, too (but more like 20-30%, not a factor of three).
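
As a sanity check on the two speeds, the average can be recomputed from the timestamps in the log at the top of this issue. The sketch below assumes the log timestamp format shown there; the helper name `average_speed` is hypothetical and not part of QLever.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"  # assumed from the log lines above


def average_speed(start: str, end: str, num_triples: int) -> float:
    """Average parsing speed in triples per second between two log timestamps."""
    elapsed = datetime.strptime(end, LOG_TIME_FORMAT) - datetime.strptime(start, LOG_TIME_FORMAT)
    return num_triples / elapsed.total_seconds()


# Timestamps and triple count taken from the log in this issue.
speed = average_speed("2024-11-23 08:07:49.784",
                      "2024-11-25 18:29:06.552",
                      241_428_823_831)
print(f"average speed: {speed / 1e6:.2f} M triples/s")

# Ratio of the two reported averages (2.9 M/s vs. 1.1 M/s).
print(f"slowdown factor: {2.9 / 1.1:.1f}")
```

The recomputed average agrees with the 1.1 M/s reported in the log, and the ratio of the two reported averages is between 2 and 3, which is more than the 20-30% attributed to Docker.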
