Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NPE at BytesToNameCanonicalizer #145

Closed
kimchy opened this issue Jul 8, 2014 · 24 comments
Closed

NPE at BytesToNameCanonicalizer #145

kimchy opened this issue Jul 8, 2014 · 24 comments

Comments

@kimchy
Copy link

kimchy commented Jul 8, 2014

Heya, updated to 2.4.1, and started to get the following exception:

java.lang.NullPointerException
    at com.fasterxml.jackson.core.sym.BytesToNameCanonicalizer$Bucket.access$000(BytesToNameCanonicalizer.java:1187)
    at com.fasterxml.jackson.core.sym.BytesToNameCanonicalizer.findBestBucket(BytesToNameCanonicalizer.java:1049)
    at com.fasterxml.jackson.core.sym.BytesToNameCanonicalizer._addSymbol(BytesToNameCanonicalizer.java:858)
    at com.fasterxml.jackson.core.sym.BytesToNameCanonicalizer.addName(BytesToNameCanonicalizer.java:667)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.addName(UTF8StreamJsonParser.java:2127)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.findName(UTF8StreamJsonParser.java:1998)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.parseMediumName(UTF8StreamJsonParser.java:1580)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._parseName(UTF8StreamJsonParser.java:1527)
    at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:693)
    at org.elasticsearch.common.xcontent.json.JsonXContentParser.nextToken(JsonXContentParser.java:50)
    at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:270)
    at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:308)
    at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:275)
    at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readValue(AbstractXContentParser.java:308)
    at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readMap(AbstractXContentParser.java:275)
    at org.elasticsearch.common.xcontent.support.AbstractXContentParser.readOrderedMap(AbstractXContentParser.java:258)
    at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrdered(AbstractXContentParser.java:213)
    at org.elasticsearch.common.xcontent.support.AbstractXContentParser.mapOrderedAndClose(AbstractXContentParser.java:228)
    at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:119)
    at org.elasticsearch.common.xcontent.XContentHelper.convertToMap(XContentHelper.java:101)
    at org.elasticsearch.index.mapper.DocumentMapperParser.parseCompressed(DocumentMapperParser.java:177)
    at org.elasticsearch.index.mapper.MapperService.parse(MapperService.java:400)
    at org.elasticsearch.index.mapper.MapperService.merge(MapperService.java:266)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.processMapping(IndicesClusterStateService.java:419)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyMappings(IndicesClusterStateService.java:360)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:179)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:425)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:153)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

This seems to happen with very large documents, with many different field names. I tried loading the document and recreating the failure, but it doesn't seem to fail... . I suspect maybe this is caused by many different threads parsing at the same time? I do not know.

For reference, here is a sample document that it fails on: https://dl.dropboxusercontent.com/u/2136051/jackson_npe.txt.

@cowtowncoder
Copy link
Member

Can you easily see if this was already in 2.4.0? I would guess it'd be regression from 2.3.3 to 2.4.0 (since there is fair bit of under-the-hood refactoring there); 2.4.0 -> 2.4.1 should have fewer diffs.

@kimchy
Copy link
Author

kimchy commented Jul 8, 2014

@cowtowncoder yea, it happens also in 2.4.0 ....

@cowtowncoder
Copy link
Member

@kimchy thanks. Unfortunate as it is, at least it makes sense. Code itself does suggest a race condition of some kind... I hope it is not, since single-threaded probs much easier to unit test.

@kimchy
Copy link
Author

kimchy commented Jul 8, 2014

@cowtowncoder aye, smells like a race condition..., can't seem to repro it on my end still....

@cowtowncoder
Copy link
Member

Hmmh. Actually... code looks bit funky. It'd seem like null check should be needed there, and how 2.3 had it.

@cowtowncoder
Copy link
Member

Nope. Code in that particular spot not changed.

@cowtowncoder
Copy link
Member

Looking at diffs, I think the problem is probably in new handleSpillOverflow which gets triggered if collision list grows beyond certain value. Possibly in combination with change to allow reuse even in case where total size may shrink (due to dropped spill lists).

If so, this could still be reproducible with single-thread, but require heavy production of distinct names to trigger the problem.
Alternatively it could require threaded-access problems. But either way, of all changes this is the only thing that looks like it could be the culprit.

@kimchy
Copy link
Author

kimchy commented Jul 8, 2014

@cowtowncoder I see..., interesting..., in any case, if you think of a fix without being to easily repro it in a test, I can easily recreate this in a benchmark type code we have consistently, so I can try it out.

@cowtowncoder
Copy link
Member

@kimchy Ok. Ideally it'd be reproducible on demand, but since I will try not to touch this code in foreseeable future (once fixed), I am ok with "manual" verification here. :)

@cowtowncoder
Copy link
Member

Ok, I think I may know what is going on here:

  • _collList may contain nulls, but before 2.4, only at or after _collEnd
  • with 2.4.0, it is now possible that one of earlier entries gets reset as null, triggering the problem.

and if so, fix is to add null check in code. I'll try that later on tonight (need to be going now, but will get back to it).

@cowtowncoder
Copy link
Member

@kimchy I was able to reproduce it locally, with bit of tweaking, although wasn't able to make automated test. However fix worked as expected, and problem does not seem to affect CharsToNameCanonicalizer. Please let me know if you still see this issue.

@kimchy
Copy link
Author

kimchy commented Jul 9, 2014

@cowtowncoder this is perfect, works, thank you!. Apologies in advance for the question, but do you have a rough estimate for a 2.4.2 release?

@cowtowncoder
Copy link
Member

@kimchy This is actually the first fix since 2.4.1, but given its critical nature it should get out quickly.

One short-term possibility would be to do a micro-release 2.4.1.1, until full 2.4.2 push. This to both give bit more time to address other issues found from 2.4.1, and to plug the immediate problem. It'd require additional entry for jackson-core (other components just using 2.4.1), but could be the best option.

Would this make sense? If so, I can push 2.4.1.1 right away.

@kimchy
Copy link
Author

kimchy commented Jul 9, 2014

@cowtowncoder I think 2.4.1.1 release will confuse users, it would be the first release that has another .v number..., so my vote is just to wait for 2.4.2, and if a few more weeks are needed, then its worth it compared to the confusing that will happen... .

@cowtowncoder
Copy link
Member

Users need not use 2.4.1.1, and in fact it'd be only for experienced users, or those with specific problems.

But I won't bother with 2.4.1.1 unless someone wants it; I think 2.4.2 will need to be released before end of July anyway. And the option of micro-patch is always available for critical patches, just wanted to mention that in case it made sense for particular issue.

@kimchy
Copy link
Author

kimchy commented Jul 9, 2014

@cowtowncoder make sense, we want to release 1.3 ES in a couple of weeks, if we can get 2.4.2 into this release, it would be awesome.

@kimchy
Copy link
Author

kimchy commented Jul 9, 2014

@cowtowncoder thinking about it, if you are up for 2.4.1.1, it would be great, as it will give jackson more time to bake in our (randomized) testing infra, and potentially flush other things

@cowtowncoder
Copy link
Member

@kimchy Ok. I'll do 2.4.1.1, more for testing / fallback-option purposes, since doing that is trivially easy.
And then we'll sync 2.4.2 so ES official release can use official patch with the fix.

@cowtowncoder
Copy link
Member

2.4.1.1 now pushed, should trickle to Maven central within an hour or two.

I also updated 2.4.2 release page:

https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.4.2

and noticed that there was actually already one earlier critical fix; for which there is jackson-databind 2.4.1.1. But it's probably not necessary for ES as it is releated to Object Id handling in databinding.

@kimchy
Copy link
Author

kimchy commented Jul 9, 2014

@cowtowncoder thank you!

@cowtowncoder
Copy link
Member

@kimchy thank you for reporting this. Also: once issue is dealt with, I am curious if performance difference between 2.3 and 2.4 is measurable. There are some minor improvements that show improvements in some of the tests, but it is hard to know for sure.
One exception is jackson-dataformat-cbor module, which should have significant improvements for reading of small/medium-sized documents. Otherwise improvements are more incremental.

@kimchy
Copy link
Author

kimchy commented Jul 9, 2014

FYI, I know this is a crazy use case, but I had to disable JsonFactory.Feature.FAIL_ON_SYMBOL_HASH_OVERFLOW since it was tripping. 2.4.1.1 fixes the problem, thanks!

@kimchy
Copy link
Author

kimchy commented Jul 9, 2014

@cowtowncoder missed your perf question, it does seem that for large documents, there is performance improvements, though nothing scientific yet, will properly test

@cowtowncoder
Copy link
Member

@kimchy I'd also be interested in overflow case -- I was bit hesitant wrt this; I know there are exploits out there so there should be some limit (plus performance will go down the drain way before). But at the same time it is theoretically possible to reach limits... and sounds like in practice as well.

So it is also possible that default max overflow limit should be increased, if modest increase would help solve use cases. If not, then feature works.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants