Authentication fails periodically and restart fixes it #12974
Comments
Thanks for reporting this. It looks like the timeout happens when our HTTP client times out while doing the request. Can you send your configuration for the HTTP authentication? In particular, it would be interesting to see what the value is for the |
I have seen such behaviour once before: if the HTTP server silently drops an HTTP request (e.g. due to a rate limit) without responding with an error code, and does not close the connection either, the HTTP client (at the HTTP layer) will wait indefinitely for an HTTP response or a socket close. This is just guesswork, however; it would be nice if you, @thehellmaker, could look at the server side (logs, maybe) to verify my guess. Nonetheless, we plan to do something at the application layer: reconnect if a timeout happens.
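To make the hypothesis above concrete, here is a minimal Python sketch (not EMQX code) of a server that accepts a connection and then never replies, together with a client that gives up after a bounded wait instead of blocking forever. The function names, the `/auth` path, and the timeout value are all illustrative assumptions.

```python
import http.client
import socket
import threading


def start_silent_server():
    """Listen on a free local port, accept connections, and never reply."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen()
    held = []  # keep accepted sockets referenced so they are never closed

    def accept_loop():
        while True:
            try:
                conn, _ = listener.accept()
            except OSError:
                return  # the listener was closed
            held.append(conn)

    threading.Thread(target=accept_loop, daemon=True).start()
    return listener, held


def auth_request(host, port, timeout_s):
    """POST to a hypothetical /auth path; return the status, or None on timeout."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout_s)
    try:
        conn.request("POST", "/auth", body=b"{}")
        return conn.getresponse().status
    except socket.timeout:
        # No response arrived within timeout_s: fail fast instead of hanging.
        return None
    finally:
        conn.close()  # always release the socket so it cannot stay wedged
```

Without the `timeout` argument, the `getresponse()` call against such a server would block until the socket is closed, which is the situation described in the comment above.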
@kjellwinblad Here is the authn config from emqx.conf. I presume request_timeout is the default of 15, as I do not set it. We have two listeners: one uses the authn below and the other uses mTLS certificate verification. Our mobile applications connect to the username/password HTTP-based listener shown below, and our devices connect to the mTLS-based listener. While the username/password HTTP-based auth suffers from this issue, the mTLS-based devices are absolutely fine and are able to connect. I can also confirm that both my application cluster and EMQX are running on the same machine, so network partitions or connectivity issues are not a possibility. I think what @zmstone describes might also be possible, as this issue builds up slowly: the count of timeout occurrences grows until it happens to all requests. So it seems the connections in the pool slowly get into an inconsistent state for some reason I don't know yet.
authentication = [
The default pool size is 8, so if more than 8 requests come in at the same time they should get pipelined. However, that can also time out requests if some of them are regularly starved. The mobile applications connecting to this listener retry infinitely on this failure. Initially, connection requests fail once in a while; after some time the first 2 attempts fail almost regularly before it connects; then it increases to 5 reconnects before it connects; and finally all reconnects start failing. I have now changed the config to the one below, which has an increased pool_size parameter and stricter timeouts, and I am trying it out.
authentication = [
@zmstone EMQX is only giving these logs, and the rest of my application is running just fine with other devices connecting to the mTLS listener.
@zmstone It looks like your hunch was right. This happens when the HTTP server is unable to respond. Our deployments are not blue-green right now, so the entire HTTP service is unavailable during a deployment, and the HTTP server cannot respond during that time. We have been able to verify that the more deployments we do, the progressively worse this issue gets.
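The progressive pattern reported here can be illustrated with a toy model. This is only a sketch under strong assumptions, not a description of how EMQX's HTTP pool actually works: it assumes a fixed pool of 8 workers, round-robin dispatch, and that a worker whose request never gets a reply stays stuck until restart. `ToyPool` and `simulate` are hypothetical names.

```python
from itertools import cycle


class ToyPool:
    """Fixed-size pool in which a worker that gets no reply never recovers."""

    def __init__(self, size=8):
        self.stuck = [False] * size
        self._next = cycle(range(size))  # round-robin dispatch (an assumption)

    def request(self, server_replies=True):
        """Return True if the request succeeded, False if it timed out."""
        worker = next(self._next)
        if self.stuck[worker]:
            return False  # queued behind a request that will never finish
        if not server_replies:
            self.stuck[worker] = True  # this worker now waits forever
            return False
        return True


def simulate(deployments=8, size=8, requests_between=80):
    """Failure rate observed after each deployment that loses one request."""
    pool = ToyPool(size)
    rates = []
    for _ in range(deployments):
        pool.request(server_replies=False)  # one request lost mid-deployment
        results = [pool.request() for _ in range(requests_between)]
        rates.append(results.count(False) / requests_between)
    return rates
```

Under these assumptions the failure rate climbs by one eighth per deployment until every request fails, which matches the shape of the reported symptom: occasional failures at first, then all of them, with a restart resetting the pool.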
Thank you for the confirmation.
I am not entirely sure, but what I can confirm is that there are always MQTT connections and new requests coming in consistently, so it is possible that a request came in right before the deployment started and the server was stopped after the handshake. Can we introduce client-side timeout configuration for the HTTP pool, so that users can configure it accordingly, and if the server does not return a response the request times out and the connection is returned to the pool?
Yeah, sure. I will work on a patch. It will be in 5.7.1 or 5.8.0.
Thanks. We also found another issue which coincides exactly with your hypothesis. Our HTTP API implementation has a bug: if any exception is thrown, our server does not return a response and the client waits forever. We are fixing this issue.
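The server-side fix described above amounts to making sure every request gets some response, even when the handler fails. Below is a minimal sketch using Python's standard `http.server`; it is not the reporter's actual service. `check_credentials`, the credential values, and the 500 error body are hypothetical, and the `result` field with `allow`/`deny` reflects my understanding of what EMQX's HTTP authenticator expects, so treat it as an assumption.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def check_credentials(payload):
    """Hypothetical credential check; raises KeyError on malformed input."""
    return payload["username"] == "alice" and payload["password"] == "secret"


class AuthHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        try:
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length))
            result = "allow" if check_credentials(payload) else "deny"
            self._reply(200, {"result": result})
        except Exception:
            # Never leave the client waiting: any failure becomes an explicit 500.
            self._reply(500, {"error": "internal error"})

    def _reply(self, status, body):
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):
        pass  # keep the example quiet


def start_auth_server():
    """Serve AuthHandler on a free local port in a background thread."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), AuthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

With a catch-all like this, a malformed or failing request produces an error status that the client can act on immediately, rather than a connection that stays open with no reply.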
This may be the only cause, or fixing it should at least buy some time before we release the enhancement.
What happened?
I have a listener configured where clients join with a new clientId every time, and it does username/password authentication using an HTTP backend on EMQX. The functionality works perfectly on a fresh start, but every 5-10 days the authentication starts failing intermittently, and it only gets worse until all requests start failing. The HTTP server itself has been running smoothly and handling other requests consistently for a few years now.
To fix this we do a
sudo systemctl restart emqx
After an EMQX restart, without restarting the HTTP auth server, it comes back to normal.
What did you expect to happen?
Run stably without any issues for a long time
How can we reproduce it (as minimally and precisely as possible)?
No response
Anything else we need to know?
No response
EMQX version
OS version
Log files