Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transport-test deadlocks intermittently #617

Open
AkihiroSuda opened this issue Oct 20, 2015 · 13 comments
Open

transport-test deadlocks intermittently #617

AkihiroSuda opened this issue Oct 20, 2015 · 13 comments
Assignees
Labels

Comments

@AkihiroSuda
Copy link

The bug has been observed on Travis #990, #979, and more.

jstack indicates that java.lang.Runtime.loadLibrary0 is in BLOCKED state,
but I'm still not sure why it's in such a weird state.

Note that no packet can be observed when we hit the bug.
So the problem seems not related to network IO.

Reproduction procedure

On Terminal 1:

$ COUNT=50
$ for f in $(seq 1 $COUNT); do
     echo "Experiment #${f} started at $(date)"
     lein test :only riemann.transport-test/sse-subscribe-events-test
     echo "Experiment #${f} done at $(date)"
   done
..
Experiment #11 started at Tue Oct 20 16:29:07 JST 2015
lein test riemann.transport-test
Ran 1 tests containing 1 assertions.
0 failures, 0 errors.
Experiment #11 done at Tue Oct 20 16:29:18 JST 2015
Experiment #12 started at Tue Oct 20 16:29:18 JST 2015
lein test riemann.transport-test
(DEADLOCKED)

On Terminal 2:

$ jstack $PID
..
"clojure-agent-send-off-pool-3" #17 prio=5 os_prio=0 tid=0x00007f7a79f07000 nid=0x3138 waiting for monitor entry [0x00007f7a26ccb000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.lang.Runtime.loadLibrary0(Runtime.java:862)
        - waiting to lock <0x00000000d4f52260> (a java.lang.Runtime)
        at java.lang.System.loadLibrary(System.java:1122)
        at sun.nio.ch.IOUtil$1.run(IOUtil.java:359)
        at sun.nio.ch.IOUtil$1.run(IOUtil.java:357)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.nio.ch.IOUtil.<clinit>(IOUtil.java:356)
        at sun.nio.ch.Util.<clinit>(Util.java:48)
        at sun.nio.ch.SelectorImpl.<init>(SelectorImpl.java:57)
        at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:64)
        at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)
        at java.nio.channels.Selector.open(Selector.java:227)
        at org.httpkit.server.HttpServer.<init>(HttpServer.java:59)
        at org.httpkit.server$run_server.invoke(server.clj:8)
        at riemann.transport.sse.SSEServer.start_BANG_(sse.clj:97)
        - locked <0x00000000cddc0340> (a riemann.transport.sse.SSEServer)
        at riemann.service$eval4126$fn__4127$G__4116__4129.invoke(service.clj:20)
        at riemann.service$eval4126$fn__4127$G__4115__4132.invoke(service.clj:20)
        at clojure.core$pmap$fn__6328$fn__6329.invoke(core.clj:6466)
        at clojure.core$binding_conveyor_fn$fn__4145.invoke(core.clj:1910)
        at clojure.lang.AFn.call(AFn.java:18)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
..

Environment

  • Riemann: 48e9f61 (Oct 16, 2016)
  • OS: Ubuntu 15.04 (Linux 3.19.0-30), AMD64
  • JDK: OpenJDK 8u45-b14-1
@AkihiroSuda
Copy link
Author

@AkihiroSuda
Copy link
Author

Finally I've got how the deadlock occurrs.

I consider this is a bug of JDK.
JDK-8022594 marks a similar issue as CLOSED, but actually they have not resolved the issue.

I'm considering reporting our issue to the JDK team unless you have a different opinion.

I'll also send a quick-fix PR, which replaces pmap to map in transition! so as to avoid concurrent call for tcp start! and sse start!.

@aphyr
Copy link
Collaborator

aphyr commented Oct 21, 2015

Wow AkihiroSuda, this is terrific detective work! Thank you! That'll save a whole ton of headache in the merge process.

@aphyr
Copy link
Collaborator

aphyr commented Oct 21, 2015

I want to try and preserve service startup parallelism if possible--what do you think about calling IOUtil/load to force the static initializer to run earlier?

aphyr added a commit that referenced this issue Oct 21, 2015
@aphyr
Copy link
Collaborator

aphyr commented Oct 21, 2015

So this seems to work on JDK8, but there's gotta be something different going on in JDK7 because it doesn't have IOUtil/load. I'm gonna try promoting the lock of Runtime to the server start functions.

@aphyr
Copy link
Collaborator

aphyr commented Oct 21, 2015

Ugh, so I introduced a lock mutexing sse-server and tcp-server's startup, and it still deadlocks in Travis. Maybe a different codepath this time?

@AkihiroSuda
Copy link
Author

Couldn't reproduce with my local OpenJDK 7u79. 😞
I will try to dump stack on the Travis with my new tool: https://github.com/AkihiroSuda/deadlockreporter

@aphyr
Copy link
Collaborator

aphyr commented Oct 22, 2015

I know, right? I tried too! Outta time to hack on this today but keep me posted, and thank you!
On Oct 21, 2015 11:18 PM, Akihiro Suda [email protected] wrote:Couldn't reproduce with my local OpenJDK 7u79.
I will try to dump stack on the Travis with my new tool: https://github.com/AkihiroSuda/deadlockreporter

—Reply to this email directly or view it on GitHub.

@AkihiroSuda
Copy link
Author

In JDK7, com.hypirion.io (called from leiningen rather than riemann itself) is deadlocking?
https://travis-ci.org/AkihiroSuda/riemann/jobs/86775336
https://travis-ci.org/AkihiroSuda/riemann/jobs/86775337

@AkihiroSuda
Copy link
Author

On second thought, riemann itself is deadlocking at transport_test.clj:76.
However, unfortunately jstack did not provide any useful information about clojure stacks.

@pyr pyr self-assigned this Feb 8, 2016
@pyr pyr added the bug label Feb 9, 2016
@AkihiroSuda
Copy link
Author

The bug seems still not resolved for JDK 8 (for the build run 14 days ago, at least).. https://travis-ci.org/riemann/riemann/jobs/107986718
Not sure whether the cause is identical to the case in JDK7 (#617 (comment))

AkihiroSuda added a commit to AkihiroSuda/riemann that referenced this issue Feb 23, 2016
AkihiroSuda added a commit to AkihiroSuda/riemann that referenced this issue Feb 23, 2016
@AkihiroSuda
Copy link
Author

The latest 0ed63b29 (Feb 14) still hanging with JDK8.
Stack trace: https://travis-ci.org/AkihiroSuda/riemann/jobs/111115077

@zolyfarkas
Copy link

Here is the JDK bug: https://bugs.openjdk.java.net/browse/JDK-8194653

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants