-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Close of already closed channel causes a panic #18853
Comments
Do you know how exactly the panicking was triggered? It would be great if you could create an e2e or integration test to reproduce this. It looks like the Line 390 in b125409
I agree using |
@ahrtr, thank you for responding quickly. I agree, it seems the only way this could happen is if the |
@ahrtr, I had a chat with my colleagues and they believe there is a chance that our internal service framework could have executed its own It is possible that this will happen again, but if the issue was indeed in our service code and it has been fixed, once that code is in production we may never see this again. And waiting for something that may never happen again is probably not a good strategy. I would still recommend moving the channel closure into the function invoked by |
Please feel free to deliver a PR and we will evaluate & review the change/impact. Thanks |
That sounds good to me, @ahrtr. Will do. |
Bug report criteria
What happened?
Organization I work for leverages etcd as a building block of a service, which complicates this matter a bit. There was a panic observed which from the looks of it does not appear to be caused by our service, rather it seems to be in etcd. Unfortunately, while I have no problem sharing the stack trace, the code is not publicly available. We do not modify any of the etcd code, thus I am confident that the problem is not with modified code, and from the looks of it appears to exist in this repository. I did not observe this directly, but my understanding is that this was observed when the service was asked to shutdown. This description seems to align well with the code.
We build an operating system on the illumos kernel, which is not well known. It is not Linux, and I understand that there should be no expectation of support. However, I think this issue can affect anyone, thus I am raising it.
From the looks of the stack the problem is really in how the channel closing is implemented. I am going to reference more recent version of the file, because this code remained the same for quite some time. We can see here:
etcd/server/embed/etcd.go
Line 458 in 995027f
nil
, in which case we issueclose(...)
. However, the channel is never set tonil
, which means it is closed, but theif e.errc != nil {...}
check will still evaluate totrue
and would then allow for theclose(...)
to happen again. It seems to me that minimallye.errc = nil
should be added inside of the block so that the check evaluated totrue
once only.Perhaps
e.errc
is not set tonil
intentionally. It could be that this is done as a defensive measure to make sure that some consumer does not end-up stuck on a now blocked channel. I can understand this, though debatablyselect
with a default case should be used. That being said, interestingly, a little up above there is this piece of code:etcd/server/embed/etcd.go
Line 409 in 995027f
sync.Once
variable. I am wondering if the function which is clearly meant to support cleanup and be executed once isn't a better place for this code. Since this is guaranteed to be called once only, there will be no need to reset the channel to anil
value.It feels to me like this is a good improvement opportunity that anyone, not just us, would benefit from. I am happy to put up a PR, but would like to first hear whether there is interest in making this change and if so, whether we should just set the channel to
nil
, or consider moving into the function at line 409.Here's the stack. My apologies as I cannot share anything to do with
go.racktop.io
.What did you expect to happen?
Did not expect to see this panic. Expected orderly shutdown.
How can we reproduce it (as minimally and precisely as possible)?
Unfortunately, this does not seem to be reliably reproducible. I could not find a way of triggering this in a reproducible way.
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
paste your configuration here
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response
The text was updated successfully, but these errors were encountered: