Poor multi-threaded indexing performance #935
Comments
I would say this is very much not expected behavior. Having your benchmark code would be very helpful! Thanks for the report.
This sounds like too much of a difference between Java and C#. A test program would be great for analyzing this.
Concurrent document writing support is currently broken due to a bug mentioned in #886 (comment). No objections if someone wants to contribute a solution to that issue, but this isn't considered a blocker for the Lucene.NET 4.8.0 release.
@NightOwl888 I remember this was an old bug relevant only to older .NET Core.
@NightOwl888 If you apply the changes to DocumentsWriterFlushControl from that branch on top of the latest, can you still reproduce it? I have had no luck on 5.0, 6.0 or 7.0 - but I'm asking because maybe I am doing something wrong >.< (I have not run the entire test suite yet; there are a lot of tests, and I have problems compiling the IKVM stuff.) @superkelvint Sharing the source code would certainly be relevant; Parallel.ForEach would not be my first choice here.
Yes, I just got approval to do so, and will be providing that shortly.
Here is the MultithreadedIndexer.cs file. I'm not a C# programmer, so perhaps I am doing something silly that's slowing things down.
Nope. This happens on all operating systems and .NET versions. I got it to fail on .NET 7 after that post. We cannot rule out that the problem lies in the test or the test framework, but it almost certainly is in our codebase. We have a configuration that never existed in Lucene. We are somewhere between 4.8.0 and 4.8.1 on the documents writer stuff. We should go through the JIRA issues in the Changes.txt to determine if we have all of the changes that are relevant to the documents writer. That may or may not be contributing to this.
IKVM can be tricky to get to compile the first time because it doesn't properly clean the build. I suggest closing the IDE and then doing the following:
If upon reopening you still have issues compiling, report the issue to ikvm-maven. The team is pretty responsive.

On Azure DevOps, I was able to reproduce the test failure, but this time it took 60 runs and it only happened once: https://dev.azure.com/lucene-net-temp4/Main/_build/results?buildId=266&view=ms.vss-test-web.build-test-results-tab&runId=191370&resultId=100202&paneView=debug

I was also able to get it to fail locally with the seed from #886; I just put the attribute in the test assembly. It also failed on the first run with the seed from the above Azure DevOps failure, using:

```csharp
[assembly: Lucene.Net.Util.RandomSeed("0x9f28125320b5e6c3")]
[assembly: NUnit.Framework.SetCulture("lrc-IR")]
```

I also got it to fail on
@jeme Thoughts on the included code in the MultithreadedIndexer.cs file?
@pc-LMS I have actually not looked further into it, as I have used the time I have for Lucene (here and there) to attempt to reproduce the problem linked by @NightOwl888 locally (as that would ease things hugely for correcting it), or just trying to unravel the code in an attempt to get a theoretical idea of what might be wrong. So far no luck on reproducing it, though.
@jeme - I am a bit perplexed why you cannot reproduce this. While it is not trivial to do, it also is not that hard, and I have reproduced it on multiple machines. It is possible that if the problem is a timing issue, perhaps the particular machine that you are experimenting with is somehow perfectly synchronized to miss the error, too fast/slow, or the timing accuracy of the underlying OS is sufficient to where it doesn't happen. Have you attempted to reproduce on more than one machine and/or OS?
@NightOwl888 No, I have not tried other machines or OSes yet; I have tried running the tests under various .NET versions with the seeds provided. I would have to install VMs to try other OSes (or dust off my old 2012 MacBook Pro and boot it into OSX; again, that would take some time). But I could perhaps find a few other machines to run it on; my machine is a bit dated (i9-7900X, 128GB RAM and M.2 SSDs). If you can reproduce it locally, what are you running on (OS etc.)? Also, are you running all the test suites, just the project containing that test, just that suite (file), or can you reproduce it by running that one test alone? Because if this is timing, maybe mine just runs slower per thread (10 cores means lower clock speed, after all, and IPC has increased a bit since then, I guess).
I am on Windows 10 x64 running with an i7-8850H, 2.60GHz, 6 cores, 12 logical processors, 32GB RAM, and M.2 SSDs. I am unable to reproduce the failure locally without the

FYI - I did a new manual revert that is up to date with master here: https://github.com/NightOwl888/lucenenet/tree/fix/GH-935-documentswriter-concurrency
Ok, I was actually thinking that this was because my CPU was a slow old one, but maybe it's the other way around. It passes just fine; it takes a while with repeat:1000, obviously, but since it at max starts 6 threads, I tried upping that, seeing as it would never occupy all cores (10) / logical processors (20) on my system. Thinking that it could also fail if the system was put under stress - but even with a random number of threads between 32 and 64, my system is just not stressed at all. I still wonder if stressing things could make it easier to reproduce, so my plan is to do two things after work today. A) Move the code to my old MacBook (with Windows) and run the tests there; that would run on a Core i7-3820QM. Thanks for the updated source/branch. So status is that the test passes for me with:

(what you can reproduce it with), and it also passes with, in addition to the above:
Interesting. I tried with

I think I worked out what is happening, though.

**Issue 1**

The error indicates that
Digging into the
We cannot pass this line unless
Therefore, to fail in this way,
I checked and there are 4 calls to
So, moving the

**Issue 2**

After that fix, we still got some failures. But it was also clear from the stack trace what was happening. If

The documentation for ReentrantLock.tryLock() states:
So, it can return false, but I don't think it ever does in practice since a

.NET has no such behavior, but I don't think it really matters which order the locks are taken in as long as

This probably won't perform as well because the method is supposed to return immediately and may end up deadlocking, but so far it seems to perform better and never deadlock. I am running the tests now to see if the above changes work on all platforms, but locally this seems promising.
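The tryLock() semantics discussed above can be exercised in a small self-contained Java sketch. This is illustrative only, not the Lucene.NET code; the "barging" itself is inherently timing-dependent, so the example only demonstrates the documented immediate-return behavior (success when free even on a fair lock, reentrancy, and false when another thread holds it):

```java
import java.util.concurrent.locks.ReentrantLock;

public class TryLockDemo {
    // Per the ReentrantLock.tryLock() Javadoc, the no-argument overload
    // acquires immediately if the lock is free, even when the lock is
    // fair and other threads are queued (it "barges").
    static String demo() throws InterruptedException {
        ReentrantLock lock = new ReentrantLock(true); // fair lock
        boolean first = lock.tryLock();   // free -> true, fairness ignored
        boolean again = lock.tryLock();   // reentrant -> true for the owner
        boolean[] other = new boolean[1];
        // A different thread sees the lock as held and gets false back
        // immediately, without queuing.
        Thread t = new Thread(() -> other[0] = lock.tryLock());
        t.start();
        t.join();
        lock.unlock();
        lock.unlock();
        return first + " " + again + " " + other[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo()); // prints "true true false"
    }
}
```

The subtle difference from .NET's Monitor.TryEnter is that a contended tryLock() can still succeed at the instant the lock is released, ahead of queued waiters, which Monitor's queue-honoring acquisition does not guarantee.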
…): The DocumentsWriterThreadPool.Reset() method must be called within the context of a lock because there is a possible race condition with other callers of DocumentsWriterThreadPool.Reset(). See apache#935.
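The hazard this commit describes - resetting shared state outside the lock that guards its other mutations - can be sketched generically in Java. Names like `checkOut` and `resetUnderLock` are illustrative placeholders, not the actual DocumentsWriterThreadPool code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ResetRaceDemo {
    private int active = 0;
    private final Object guard = new Object();

    // Every mutation of the shared counter happens under the same lock.
    void checkOut() { synchronized (guard) { active++; } }

    // Reset is also done under that lock, so no thread can interleave
    // between a check and the reset and observe a stale count.
    void resetUnderLock() { synchronized (guard) { active = 0; } }

    static int run(int tasks) throws InterruptedException {
        ResetRaceDemo d = new ResetRaceDemo();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> { d.checkOut(); d.resetUnderLock(); });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        synchronized (d.guard) { return d.active; }
    }

    public static void main(String[] args) throws InterruptedException {
        // Every task ends with a reset, so the final state is always 0.
        System.out.println(run(1000));
    }
}
```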
…ntLock.tryLock() method barges to the front of the queue instead of returning false like Monitor.TryEnter(). Use Monitor.Enter(object, ref bool) instead, which always returns true. We get locks in a different order, but I am not sure whether that matters. Fixes apache#935. Closes apache#886.
I have just submitted #940 with a patch. Well, there were 3 different attempts. While identifying the problem wasn't too difficult, finding a solution for Issue 2 turned out to be quite challenging due to how performance-sensitive this area is. If interested parties could run some benchmarks and/or review the PR to provide some feedback on how to improve this approach, suggestions would be welcomed. I have already attempted
I also gave
Not really, as I actually tested on the various .NET versions; other things came up, as we had a breakdown in one of our systems due to a Windows update, so I have sadly been busy elsewhere. But I am actually having another timing-related bug in another codebase, which I can easily reproduce at work with a single run of a test (it always fails after ~max 2 min; a run is 10 min), but it's far harder to reproduce at home on the same PC, just as I am having trouble reproducing this issue (it takes 2-3 runs). So I don't know. Running the Lucene tests at work has proven to be a challenge, though; I don't know if it's because I use ReSharper that the Lucene solution seems to be very sluggish for me to work with, or if it's something else. But I always have challenges getting started working on anything. Anyway, great that you found a fix.
Is there an existing issue for this?
Describe the bug
I wrote a multithreaded indexer in Java for Lucene 4.8 and one for Lucene.net 4.8. I tried to do pretty much a line-for-line port. However, there is a substantial difference in the performance characteristics of the two. In Java, I used a ThreadPoolExecutor with CallerRunsPolicy, and in C# I used Parallel.ForEach.
I can provide both Java and .NET files, but before doing so, I wanted to understand if this is expected behavior and therefore not a bug.
I am observing identical behavior in .NET 8 as well as .NET 6, so I don't think it's related to #933.
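For reference, the Java-side throttling pattern mentioned above - a bounded ThreadPoolExecutor with CallerRunsPolicy - looks roughly like this. The pool sizes, queue size, and task body are placeholders, not the reporter's actual indexer code:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CallerRunsDemo {
    // Submits `docs` no-op "index" tasks through a bounded pool. With
    // CallerRunsPolicy, a full queue makes the submitting thread run the
    // rejected task itself, throttling the producer instead of failing.
    static int index(int docs) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(8),
                new ThreadPoolExecutor.CallerRunsPolicy());
        AtomicInteger indexed = new AtomicInteger();
        for (int i = 0; i < docs; i++) {
            pool.execute(indexed::incrementAndGet); // stand-in for writer.addDocument(doc)
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return indexed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(index(100)); // prints 100: every task runs exactly once
    }
}
```

Parallel.ForEach partitions work differently (it chunks the input and adapts its degree of parallelism), so it is not an exact equivalent of this producer-throttling setup.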
Expected Behavior
Multi-threaded indexing performance scales with the number of threads used.
Steps To Reproduce
No response
Exceptions (if any)
No response
Lucene.NET Version
4.8
.NET Version
8.0.204 and 6.0.29
Operating System
Windows 10
Anything else?
No response