Authentication fails periodically and restart fixes it #12974

Open
thehellmaker opened this issue May 5, 2024 · 9 comments

@thehellmaker

What happened?

I have a listener configured where clients connect with a new clientId every time, and it performs username/password authentication using an HTTP backend on EMQX. The functionality works perfectly on a fresh start, but every 5-10 days authentication starts failing intermittently, and it only gets worse until all requests start failing. The HTTP server itself runs very smoothly and has been handling other requests consistently for a few years now.

To fix this we run:

sudo systemctl restart emqx

After the EMQX restart (without restarting the HTTP auth server) everything goes back to normal.

What did you expect to happen?

Run stably without any issues for a long time

How can we reproduce it (as minimally and precisely as possible)?

No response

Anything else we need to know?

No response

EMQX version

$ ./bin/emqx_ctl broker
sysdescr  : EMQX
version   : 5.5.0
datetime  : 2024-05-05T05:53:03.160776018+00:00
uptime    : 2 days, 19 hours, 4 minutes, 26 seconds

OS version

# On Linux:
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
$ uname -a
Linux ip-172-31-29-225 5.15.0-1051-aws #56~20.04.1-Ubuntu SMP Tue Nov 28 15:43:06 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

Log files

2024-04-20T03:34:54.488386+00:00 [error] msg: send_error, mfa: emqx_resource_buffer_worker:handle_query_result_pure/3(942), peername: 122.171.22.243:10156, clientid: app-qzTabuxVIxfPtesq8aiDfYLmKhS2-7805dfa0-8eac-437f-9ae4-f5c5a9d96d3d, id: <<"emqx_authn_http:9">>, reason: {recoverable_error,timeout}
2024-04-20T03:35:00.190401+00:00 [error] msg: send_error, mfa: emqx_resource_buffer_worker:handle_query_result_pure/3(942), peername: 122.171.22.243:31292, clientid: app-qzTabuxVIxfPtesq8aiDfYLmKhS2-7805dfa0-8eac-437f-9ae4-f5c5a9d96d3d, id: <<"emqx_authn_http:9">>, reason: {recoverable_error,timeout}
2024-04-20T03:40:11.464430+00:00 [error] msg: send_error, mfa: emqx_resource_buffer_worker:handle_query_result_pure/3(942), peername: 150.129.248.148:45924, clientid: app-vEGvEnILlfNn13I7GhqX3Lbctts2-b95c7984-b1ba-4caa-b1a1-de4d7efaa414, id: <<"emqx_authn_http:9">>, reason: {recoverable_error,timeout}
2024-04-20T03:45:01.138421+00:00 [error] msg: send_error, mfa: emqx_resource_buffer_worker:handle_query_result_pure/3(942), peername: 122.171.18.160:6965, clientid: app-t0vP9bLOKibPNS3dLM31xjA84qs1-ab37f436-8b88-48ac-a192-5e8b9d144469, id: <<"emqx_authn_http:9">>, reason: {recoverable_error,timeout}
2024-04-20T03:59:58.580418+00:00 [error] msg: send_error, mfa: emqx_resource_buffer_worker:handle_query_result_pure/3(942), peername: 49.205.145.253:51840, clientid: app-zLAIPHul2gPsOqlXOjvG2odMzB63-ec9284f2-d1bc-49a6-bf3b-025b32f6a163, id: <<"emqx_authn_http:9">>, reason: {recoverable_error,timeout}
2024-04-20T04:03:05.886405+00:00 [error] msg: send_error, mfa: emqx_resource_buffer_worker:handle_query_result_pure/3(942), peername: 49.205.145.253:52270, clientid: app-zLAIPHul2gPsOqlXOjvG2odMzB63-598584ba-47de-4ae5-8c79-6d3dba99bf01, id: <<"emqx_authn_http:9">>, reason: {recoverable_error,timeout}
2024-04-20T04:04:49.744418+00:00 [error] msg: send_error, mfa: emqx_resource_buffer_worker:handle_query_result_pure/3(942), peername: 122.162.146.184:2084, clientid: app-RxVTqo48gZYQxobFBq1w90ddErC2-c9cdafb5-6b51-4c66-92cb-473fc8795329, id: <<"emqx_authn_http:9">>, reason: {recoverable_error,timeout}
2024-04-20T04:07:04.994386+00:00 [error] msg: send_error, mfa: emqx_resource_buffer_worker:handle_query_result_pure/3(942), peername: 106.221.207.42:55080, clientid: app-ZOlqVvJXb3dCi2QJruKObag6E3B2-21f519c2-898e-4843-bb52-d9226db84653, id: <<"emqx_authn_http:9">>, reason: {recoverable_error,timeout}
2024-04-20T04:08:13.586428+00:00 [error] msg: send_error, mfa: emqx_resource_buffer_worker:handle_query_result_pure/3(942), peername: 106.221.200.42:55083, clientid: app-ZOlqVvJXb3dCi2QJruKObag6E3B2-9c8918a3-d22b-4d00-bd87-6e7f0b36cca8, id: <<"emqx_authn_http:9">>, reason: {recoverable_error,timeout}
2024-04-20T04:11:16.106396+00:00 [error] msg: send_error, mfa: emqx_resource_buffer_worker:handle_query_result_pure/3(942), peername: 106.221.207.42:55089, clientid: app-ZOlqVvJXb3dCi2QJruKObag6E3B2-de6f59c5-13d1-424a-b15b-ee32a26f7db0, id: <<"emqx_authn_http:9">>, reason: {recoverable_error,timeout}
2024-04-20T04:12:19.795414+00:00 [error] msg: send_error, mfa: emqx_resource_buffer_worker:handle_query_result_pure/3(942), peername: 106.221.199.42:55093, clientid: app-ZOlqVvJXb3dCi2QJruKObag6E3B2-7f60d226-19c5-4280-8eab-04ea24cd0c56, id: <<"emqx_authn_http:9">>, reason: {recoverable_error,timeout}
2024-04-20T04:14:25.077405+00:00 [error] msg: send_error, mfa: emqx_resource_buffer_worker:handle_query_result_pure/3(942), peername: 106.221.199.42:55098, clientid: app-ZOlqVvJXb3dCi2QJruKObag6E3B2-6b240019-dad7-4476-9bd4-32f3ba31bc5e, id: <<"emqx_authn_http:9">>, reason: {recoverable_error,timeout}
2024-04-20T04:24:14.720428+00:00 [error] msg: send_error, mfa: emqx_resource_buffer_worker:handle_query_result_pure/3(942), peername: 171.76.86.174:19846, clientid: app-ABLUKVxQq6Y9oV52vpXe5zB0mAz1-a9ba58a5-9a76-4d71-a90e-e81d1a4bab6a, id: <<"emqx_authn_http:9">>, reason: {recoverable_error,timeout}
2024-04-20T04:27:56.824421+00:00 [error] msg: send_error, mfa: emqx_resource_buffer_worker:handle_query_result_pure/3(942), peername: 49.15.187.19:35127, clientid: app-EQwau3L8KGVZob3oOiPCm37zoud2-4a17aa4f-a0f9-4b4e-b158-a41f13821c03, id: <<"emqx_authn_http:9">>, reason: {recoverable_error,timeout}
2024-04-20T04:28:03.388429+00:00 [error] msg: send_error, mfa: emqx_resource_buffer_worker:handle_query_result_pure/3(942), peername: 49.15.187.243:9315, clientid: app-EQwau3L8KGVZob3oOiPCm37zoud2-69a60aaf-8730-4c93-b5ce-e73f0ea400f4, id: <<"emqx_authn_http:9">>, reason: {recoverable_error,timeout}
thehellmaker changed the title from "Authentication fails periodically" to "Authentication fails periodically and restart fixes it" on May 5, 2024
@kjellwinblad
Contributor

Thanks for reporting this. It looks like the timeout happens when our HTTP client times out while making the request. Can you send your configuration for the HTTP authentication? In particular, it would be interesting to see the value of the request_timeout option. Is it possible that your HTTP server gets slower during certain time periods, and that could cause the timeouts?

@zmstone
Member

zmstone commented May 12, 2024

I have seen such behaviour once before: if the HTTP server silently drops an HTTP request (e.g. due to rate limiting) without responding with an error code, and does not close the connection either, the HTTP client (at the HTTP layer) will wait indefinitely for an HTTP response or a socket close. This is just guesswork, however; it would be nice if you, @thehellmaker, could look at the server side (logs maybe) to verify my guess.

Nonetheless, we plan to do something at the application layer: reconnect if a timeout happens.
NOTE: The application layer did not do retries or reconnects because in most cases the cause is server overload, and retrying or reconnecting is likely to increase the load even more, so the coming fix will have to be very careful when picking the defaults.
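
To make the failure mode concrete, here is a minimal, hypothetical sketch (plain Python, not EMQX's actual Erlang HTTP client) of why a client with no read timeout blocks forever when a server accepts a request but never writes a response, while a timeout makes it fail fast:

    # Illustrative only: a request with no timeout hangs indefinitely if the
    # server stays silent; a timeout turns the hang into a fast failure.
    import http.client
    import socket

    def probe(host, port, timeout=None):
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        try:
            conn.request("POST", "/auth", body=b"{}",
                         headers={"Content-Type": "application/json"})
            resp = conn.getresponse()   # blocks here if the server never replies
            return resp.status
        except socket.timeout:
            return "timed out"          # only reachable when a timeout is set
        finally:
            conn.close()

    # probe("127.0.0.1", 8080)              # may block forever
    # probe("127.0.0.1", 8080, timeout=5)   # returns "timed out" after 5 seconds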

zmstone self-assigned this May 12, 2024
@thehellmaker
Author

thehellmaker commented May 13, 2024

@kjellwinblad Here is the authn config in emqx.conf. request_timeout is at its default (15, I presume), as I do not set it.
And no, the load on the server is <20% CPU utilization at P99, so I do not think it is the server slowing down. However, the server is sometimes unavailable during deployments, since they are not blue-green at the moment.

We have two different listeners: one with the authn below, and the other with mTLS certificate verification. The username/password, HTTP-based listener shown below is what our mobile applications connect to, and the mTLS-based listener is what our devices connect to. While the username/password HTTP auth suffers from this issue, the mTLS-based devices are absolutely fine and are able to connect. I can also confirm that both my application cluster and EMQX run on the same machine, so network partitions/connectivity issues are not a possibility. I think what @zmstone describes might also be possible, since this issue builds up slowly: the number of timeout occurrences grows gradually until it starts happening to all requests. So it seems like the connections in the pool slowly get into an inconsistent state for some reason I don't know yet.

authentication = [
    {
        mechanism = password_based
        backend = http
        enable = true

        method = post
        url = "***"
        body {
            clientid = "${clientid}"
            username = "${username}"
            password = "${password}"
            peerhost = "${peerhost}"
            cert_subject = "${cert_subject}"
            cert_common_name = "${cert_common_name}"
        }
        headers {
            "Content-Type" = "application/json"
            "X-Request-Source" = "EMQX"
        }
    }
]

The default pool size is 8, so if more than 8 requests arrive at the same time they should get pipelined. However, that can also time out requests if some of them are starved regularly. Our mobile applications connecting to this listener retry indefinitely on this failure, so at first the occasional connection request fails; after some time the first 2 attempts fail almost every time before it connects, then it grows to 5 reconnects before it connects, and finally all reconnects start failing. I have now changed the config to the one below, with a larger pool_size and stricter timeouts, and am trying it out.

authentication = [
    {
        mechanism = password_based
        backend = http
        enable = true

        method = post
        url = "***"
        pool_size = 24
        enable_pipelining = 100
        connect_timeout = 10
        request_timeout = 5
        body {
            clientid = "${clientid}"
            username = "${username}"
            password = "${password}"
            peerhost = "${peerhost}"
            cert_subject = "${cert_subject}"
            cert_common_name = "${cert_common_name}"
        }
        headers {
            "Content-Type" = "application/json"
            "X-Request-Source" = "EMQX"
        }
    }
]

@zmstone Since EMQX is only emitting these logs, and my entire application is running just fine with other devices connecting to the mTLS listener, I have 2 options to confirm your hypothesis:

  1. Stress test with a lot of simultaneous connect requests so that the server is overloaded, to simulate this scenario.
  2. Simulate it by not responding to any authn requests from the HTTP auth API, for example with a stub server like the sketch below.
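
A minimal, hypothetical sketch of option 2 (plain Python, invented names): a stub listener that accepts the authn request and reads it, but never replies and never closes the socket, so EMQX's HTTP client is left waiting. Pointing the authentication url at this stub should reproduce the silent-drop scenario.

    # Stub "auth" endpoint that swallows requests without ever responding.
    import socket
    import threading

    def _swallow(conn):
        try:
            while conn.recv(4096):   # drain whatever the client sends
                pass                 # never reply, never close
        except OSError:
            pass

    def silent_auth_stub(host="0.0.0.0", port=8080):
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(64)
        print(f"silent auth stub listening on {host}:{port}")
        while True:
            conn, _peer = srv.accept()
            threading.Thread(target=_swallow, args=(conn,), daemon=True).start()

    if __name__ == "__main__":
        silent_auth_stub()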

@thehellmaker
Author

@zmstone it looks like your hunch was right. This happens when the HTTP server is unable to respond. Our deployments are not blue-green right now, and the entire HTTP service is unavailable during a deployment, during which time the HTTP server cannot respond. We have been able to verify that the more deployments we do, the progressively worse this issue gets.

@zmstone
Member

zmstone commented May 18, 2024

@zmstone it looks like your hunch was right. This happens when the HTTP server is unable to respond. Our deployments are not blue-green right now, and the entire HTTP service is unavailable during a deployment, during which time the HTTP server cannot respond. We have been able to verify that the more deployments we do, the progressively worse this issue gets.

Thank you for the confirmation.
I wonder why the server doesn't reply with an error code such as 503, or disconnect.

@thehellmaker
Author

I am not entirely sure, but what I can confirm is that there are always MQTT connections and new requests coming in consistently, so there is a possibility that a request arrived right before the deployment started and the server was stopped after the handshake.

Can we introduce client-side timeout configuration for the HTTP pool, so that clients can configure it as needed, and if the server doesn't return a response within that time the request is timed out and the connection is returned to the pool? (Roughly the behaviour sketched below.)
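
A hypothetical illustration of that behaviour in Python (TinyPool and its fields are invented names, not EMQX's connection pool): a pooled connection that times out waiting for a response is dropped and replaced before anything is handed back to the pool, so a stalled server cannot poison the pool.

    import http.client
    import queue
    import socket

    class TinyPool:
        def __init__(self, host, port, size=8, request_timeout=5.0):
            self.host, self.port, self.timeout = host, port, request_timeout
            self.idle = queue.Queue()
            for _ in range(size):
                self.idle.put(self._new_conn())

        def _new_conn(self):
            return http.client.HTTPConnection(self.host, self.port, timeout=self.timeout)

        def post(self, path, body, headers=None):
            conn = self.idle.get()
            try:
                conn.request("POST", path, body=body, headers=headers or {})
                resp = conn.getresponse()
                resp.read()               # drain the body so the connection can be reused
                return resp.status
            except (socket.timeout, OSError):
                conn.close()
                conn = self._new_conn()   # never hand a timed-out connection back to the pool
                raise
            finally:
                self.idle.put(conn)       # return a healthy (possibly fresh) connection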

@zmstone
Member

zmstone commented May 18, 2024

Yeah, sure. I will work on a patch. It will be in 5.7.1 or 5.8.0.

@thehellmaker
Author

Thanks. We also found another issue which lines up exactly with your hypothesis: our HTTP API implementation has a bug where, if any exception is thrown, the server does not return a response and the client waits forever. We are fixing this issue.
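
A hypothetical sketch of that kind of server-side fix (Python standard library only; the handler names and the JSON body shape are illustrative, not EMQX's required response format): a catch-all exception handler that always writes some response, here a 500, so the caller never waits forever.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def authenticate(payload):         # placeholder for the real auth logic
        raise RuntimeError("boom")     # simulate the unhandled exception

    class AuthHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            try:
                allowed = authenticate(payload)
                code, body = 200, {"result": "allow" if allowed else "deny"}
            except Exception:
                code, body = 500, {"result": "deny", "error": "internal error"}
            data = json.dumps(body).encode()
            self.send_response(code)        # a response is written on every path
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), AuthHandler).serve_forever()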

@zmstone
Member

zmstone commented May 18, 2024

Thanks. We also found another issue which lines up exactly with your hypothesis: our HTTP API implementation has a bug where, if any exception is thrown, the server does not return a response and the client waits forever. We are fixing this issue.

This may well be the only cause, or fixing it should at least buy some time before we release the enhancement.
