
Restarting pipeline "stalled" indefinitely and will not shut down #119

Open
mhunsber opened this issue Jul 20, 2018 · 5 comments
mhunsber commented Jul 20, 2018

The pipeline will not reload if the configuration changes.

  • Version: Logstash v6.3.0, logstash-input-tcp v5.0.9
  • Operating System: Windows Server 2012
  • Config File (if you have sensitive info, please remove it):
input {
    tcp {
        port => 50000
    }
}
filter {
}
output {
}
  • Sample Data: none required

  • Steps to Reproduce: start the pipeline with tcp input, then make a change to the configuration file so the pipeline reloads

  • logstash-plain.log

[2018-07-20T14:07:32,763][INFO ][logstash.pipeline        ] Starting pipeline {:pipeline_id=>"health-check", "pipeline.workers"=>1, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50}
[2018-07-20T14:07:34,857][INFO ][logstash.inputs.tcp      ] Starting tcp input listener {:address=>"0.0.0.0:50000", :ssl_enable=>"false"}
[2018-07-20T14:07:39,435][INFO ][logstash.pipeline        ] Pipeline started successfully {:pipeline_id=>"health-check", :thread=>"#<Thread:0x355e4b9f@C:/LS/logstash-6.3.0/logstash-core/lib/logstash/pipeline_action/create.rb:44 run>"}
[2018-07-20T14:08:13,573][INFO ][logstash.pipelineaction.reload] Reloading pipeline {"pipeline.id"=>:"health-check"}
[2018-07-20T14:08:18,698][WARN ][logstash.shutdownwatcher ] {"inflight_count"=>0, "stalling_thread_info"=>{"other"=>[{"thread_id"=>35, "name"=>"[health-check]<tcp", "current_call"=>"[...]/vendor/bundle/jruby/2.3.0/gems/logstash-input-tcp-5.0.9-java/lib/logstash/inputs/tcp.rb:180:in `close'"}, {"thread_id"=>31, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/pipeline.rb:418:in `read_batch'"}]}}
[2018-07-20T14:08:18,714][ERROR][logstash.shutdownwatcher ] The shutdown process appears to be stalled due to busy or blocked plugins. Check the logs for more information.
[2018-07-20T14:08:23,672][WARN ][logstash.shutdownwatcher ] {"inflight_count"=>0, "stalling_thread_info"=>{"other"=>[{"thread_id"=>35, "name"=>"[health-check]<tcp", "current_call"=>"[...]/vendor/bundle/jruby/2.3.0/gems/logstash-input-tcp-5.0.9-java/lib/logstash/inputs/tcp.rb:180:in `close'"}, {"thread_id"=>31, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/pipeline.rb:418:in `read_batch'"}]}}
[2018-07-20T14:08:28,662][WARN ][logstash.shutdownwatcher ] {"inflight_count"=>0, "stalling_thread_info"=>{"other"=>[{"thread_id"=>35, "name"=>"[health-check]<tcp", "current_call"=>"[...]/vendor/bundle/jruby/2.3.0/gems/logstash-input-tcp-5.0.9-java/lib/logstash/inputs/tcp.rb:180:in `close'"}, {"thread_id"=>31, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/pipeline.rb:418:in `read_batch'"}]}}

That [logstash.shutdownwatcher] warning repeats indefinitely until Logstash is terminated. I let it run for 4 minutes:

[2018-07-20T14:11:53,672][WARN ][logstash.shutdownwatcher ] {"inflight_count"=>0, "stalling_thread_info"=>{"other"=>[{"thread_id"=>35, "name"=>"[health-check]<tcp", "current_call"=>"[...]/vendor/bundle/jruby/2.3.0/gems/logstash-input-tcp-5.0.9-java/lib/logstash/inputs/tcp.rb:180:in `close'"}, {"thread_id"=>31, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/pipeline.rb:418:in `read_batch'"}]}}
[2018-07-20T14:11:58,669][WARN ][logstash.shutdownwatcher ] {"inflight_count"=>0, "stalling_thread_info"=>{"other"=>[{"thread_id"=>35, "name"=>"[health-check]<tcp", "current_call"=>"[...]/vendor/bundle/jruby/2.3.0/gems/logstash-input-tcp-5.0.9-java/lib/logstash/inputs/tcp.rb:180:in `close'"}, {"thread_id"=>31, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/pipeline.rb:418:in `read_batch'"}]}}
[2018-07-20T14:12:03,666][WARN ][logstash.shutdownwatcher ] {"inflight_count"=>0, "stalling_thread_info"=>{"other"=>[{"thread_id"=>35, "name"=>"[health-check]<tcp", "current_call"=>"[...]/vendor/bundle/jruby/2.3.0/gems/logstash-input-tcp-5.0.9-java/lib/logstash/inputs/tcp.rb:180:in `close'"}, {"thread_id"=>31, "name"=>nil, "current_call"=>"[...]/logstash-core/lib/logstash/pipeline.rb:418:in `read_batch'"}]}}

Then when I shut down Logstash, there is still a "java.exe" process that cannot be terminated through Task Manager. If I try to start Logstash again, it throws an exception and does not start:

[2018-07-20T14:30:58,510][INFO ][logstash.configmanagement.elasticsearchsource] Configuration Management License OK
[2018-07-20T14:30:58,635][FATAL][logstash.runner          ] Logstash could not be started because there is already another instance using the configured data directory.  If you wish to run multiple instances, you must change the "path.data" setting.
[2018-07-20T14:30:58,650][ERROR][org.logstash.Logstash    ] java.lang.IllegalStateException: Logstash stopped processing because of an error: (SystemExit) exit

yaauie commented Jul 20, 2018

  • Sample Data: none required

Do you mean that no data is required, or that it doesn't matter what data is given? The backtrace indicates that it is failing to shut down the event loop, which uses the default 15-second timeout for each of two phases (definitely less than your four minutes). To narrow down:

  • after the input starts, in what situations does it stall shutting down?
    • exactly zero connections established to the port
    • one or more connections established to the port and held open by client
    • one or more connections established to the port and all closed normally by client

Is the port you've chosen, 50000, available, and is the user running the process privileged enough to listen on it? A number of applications are known to listen on 50000. Is there a chance that it is stalling before the server has had a chance to start? That could narrow down my search:

netstat -na | find "50000"
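The same availability check can be sketched in Python, for anyone without netstat handy (the helper below is illustrative, not part of Logstash):

```python
import socket

def port_is_free(port, host="0.0.0.0"):
    """Return True if we can bind the TCP port, i.e. no other process owns it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # SO_REUSEADDR lets us bind past sockets lingering in TIME_WAIT,
        # but bind() still fails if another process is actively listening.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False
```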


mhunsber commented Jul 20, 2018

after the input starts, in what situations does it stall shutting down?

I have the tcp listener there for a firewall, which opens TCP connections to it as health checks for load-balancing purposes. So the situation would be one or more connections established to the port and all closed by the client.
To keep the firewall from opening connections, I started the tcp listener on port 50001 so that no connections would be established, and the pipeline was able to restart correctly. So the stall must only occur when connections have been opened to the listener.

Is there a chance that it is stalling before the server has had a chance to start?

No, running netstat -na | find "50000" before starting the logstash service shows no other applications listening on port 50000.

You say there is a 15-second timeout during the shutdown loop? The firewall makes connections every 10 seconds; is there a chance that could be interfering?

@mhunsber

Do you mean that no data is required, or that it doesn't matter what data is given?

Sorry, that was unclear. I meant that no data is sent to the tcp listener. The firewall simply sends a TCP SYN request to the configured port, and waits for a TCP SYN/ACK in return to mark the application as either up or down.
That was all the firewall's server load balancing documentation would tell me. From Wireshark I can tell that after the firewall receives the TCP SYN/ACK it sends a TCP RST, so the three-way handshake is never completed and no connection is established.
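For reference, the closest userspace approximation of such an abortive probe is a connect followed by an RST-on-close (SO_LINGER with a zero timeout). Unlike the firewall's behavior, the kernel still completes the handshake before the reset, so this is only a sketch of a resetting client, not an exact reproduction:

```python
import socket
import struct

def abortive_probe(host, port):
    """Connect to (host, port) and close abortively with RST instead of FIN.

    Note: a userspace socket can only reset *after* connect() succeeds,
    so the kernel still completes the three-way handshake; the firewall
    described above resets before that. Illustrative only."""
    s = socket.create_connection((host, port), timeout=5)
    # l_onoff=1, l_linger=0 makes close() abort the connection with RST
    s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    s.close()
```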


yaauie commented Jul 25, 2018

Ah. The firewall's TCP RST to terminate a not-yet-established connection that it initiated isn't particularly well-behaved and falls into a lovely grey area of the spec. No doubt it does so to save the round-trip chatter of finishing the handshake with a TCP ACK/FIN and waiting for a TCP ACK in response, but RFC 793 instructs implementations not to abuse RST like this:

As a general rule, reset (RST) must be sent whenever a segment arrives
which apparently is not intended for the current connection. A reset
must not be sent if it is not clear that this is the case.

Whatever blockage is happening in this plugin appears to be occurring because Netty (the non-blocking IO adapter the plugin uses for its low-level TCP implementation) is... blocking. I'm glad to help chase it down, but would appreciate your help in refining the reproduction steps.

There are cases documented in RFC 793 where receiving TCP RST on the connection should work to reset a connection in SYN-RECEIVED state, but the segment must first be deemed "acceptable":

   There are four cases for the acceptability test for an incoming
   segment:

   Segment Receive  Test
   Length  Window
   ------- -------  -------------------------------------------

      0       0     SEG.SEQ = RCV.NXT

      0      >0     RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND

     >0       0     not acceptable

     >0      >0     RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND
                 or RCV.NXT =< SEG.SEQ+SEG.LEN-1 < RCV.NXT+RCV.WND

   If the RCV.WND is zero, no segments will be acceptable, but
   special allowance should be made to accept valid ACKs, URGs and
   RSTs.

   If an incoming segment is not acceptable, an acknowledgment
   should be sent in reply (unless the RST bit is set, if so drop
   the segment and return):

     <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
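For illustration, that four-case acceptability test can be written out directly (the function name and the modular-arithmetic helper are mine, not from any TCP stack):

```python
def segment_acceptable(seg_seq, seg_len, rcv_nxt, rcv_wnd):
    """RFC 793 acceptability test for an incoming segment (sketch).

    Sequence-number arithmetic is modulo 2**32; the comparisons below are
    done in that modular space relative to RCV.NXT."""
    MOD = 2 ** 32

    def in_window(seq):
        # True iff RCV.NXT =< seq < RCV.NXT + RCV.WND (mod 2**32)
        return (seq - rcv_nxt) % MOD < rcv_wnd

    if seg_len == 0 and rcv_wnd == 0:
        return seg_seq == rcv_nxt            # case 1: SEG.SEQ = RCV.NXT
    if seg_len == 0:
        return in_window(seg_seq)            # case 2
    if rcv_wnd == 0:
        return False                         # case 3: not acceptable
    # case 4: first or last octet of the segment falls in the window
    return in_window(seg_seq) or in_window((seg_seq + seg_len - 1) % MOD)
```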

Can you provide a packet capture of the relevant bits from the connection, preferably filtered to only contain information about connections that exhibit the symptom? If it contains sensitive information and you're willing to trust me with it, feel free to encrypt with my GPG public key and provide your public key so I can decrypt.

@mhunsber

Sure. I captured with Wireshark and compressed it with 7-Zip; there isn't any sensitive information. Looks like the RST packet is sent with:

Segment Length: 0
Receive Window: 32768
Sequence: 1
Next Sequence number: 1

I also manually shut down Logstash during the packet capture so you can see what is happening when Logstash does not respond.

Thank you.

logstash-input-tcp-issue119.zip
