Connection reset by peer on connecting to CloudSQL Postgresql #1855

Closed
Kanav-7 opened this issue Jun 22, 2023 · 3 comments
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@Kanav-7

Kanav-7 commented Jun 22, 2023

Bug Description

We are running the Auth Proxy in a GKE cluster. We faced an issue where our application (which connects to a PostgreSQL instance) was dropping requests. On checking the Cloud SQL Proxy logs, we found multiple "connection reset by peer" errors.

Example code (or command)

No response

Stacktrace

2023/05/17 11:30:20 ephemeral certificate for instance <instance_string> will expire soon, refreshing now.
2023/05/17 11:30:20 refreshing ephemeral certificate for instance <instance_string>
2023/05/17 11:30:20 ephemeral certificate for instance <instance_string> will expire soon, refreshing now.
2023/05/17 11:30:20 refreshing ephemeral certificate for instance <instance_string>
2023/05/17 11:30:20 failed to refresh the ephemeral certificate for <instance_string>, returning previous cert instead: googleapi: Error 409: The instance or operation is not in an appropriate state to handle the request., invalidState
2023/05/17 11:30:20 failed to refresh the ephemeral certificate for <instance_string>, returning previous cert instead: googleapi: Error 409: The instance or operation is not in an appropriate state to handle the request., invalidState
2023/05/17 11:30:20 new ephemeral certificate expires sooner than expected (adjusting refresh time to compensate): current time: 2023-05-17 11:30:20.218349058 +0000 UTC m=+13211.321417063, certificate expires: 2023-05-17 11:35:20 +0000 UTC
2023/05/17 11:30:20 Scheduling refresh of ephemeral certificate in 4m54.781650942s
2023/05/17 11:30:20 new ephemeral certificate expires sooner than expected (adjusting refresh time to compensate): current time: 2023-05-17 11:30:20.218349058 +0000 UTC m=+13211.321417063, certificate expires: 2023-05-17 11:35:20 +0000 UTC
2023/05/17 11:30:20 Scheduling refresh of ephemeral certificate in 4m54.781650942s
2023/05/17 11:32:19 New connection for "<instance_string>"
2023/05/17 11:32:19 refreshing ephemeral certificate for instance <instance_string>
2023/05/17 11:32:19 New connection for "<instance_string>"
2023/05/17 11:32:19 refreshing ephemeral certificate for instance <instance_string>
2023/05/17 11:32:19 failed to refresh the ephemeral certificate for <instance_string>, returning previous cert instead: googleapi: Error 409: The instance or operation is not in an appropriate state to handle the request., invalidState
2023/05/17 11:32:19 new ephemeral certificate expires sooner than expected (adjusting refresh time to compensate): current time: 2023-05-17 11:32:19.389718108 +0000 UTC m=+13330.492786132, certificate expires: 2023-05-17 11:35:20 +0000 UTC
2023/05/17 11:32:19 failed to refresh the ephemeral certificate for <instance_string>, returning previous cert instead: googleapi: Error 409: The instance or operation is not in an appropriate state to handle the request., invalidState
2023/05/17 11:32:19 new ephemeral certificate expires sooner than expected (adjusting refresh time to compensate): current time: 2023-05-17 11:32:19.389718108 +0000 UTC m=+13330.492786132, certificate expires: 2023-05-17 11:35:20 +0000 UTC
2023/05/17 11:32:19 Scheduling refresh of ephemeral certificate in 2m55.610281892s
2023/05/17 11:32:19 Scheduling refresh of ephemeral certificate in 2m55.610281892s
2023/05/17 11:32:29 couldn't connect to "<instance_string>": read tcp 10.22.0.14:60914-><ip:3307>: read: connection reset by peer
2023/05/17 11:32:29 couldn't connect to "<instance_string>": read tcp 10.22.0.14:60914-><ip:3307>: read: connection reset by peer
2023/05/17 11:32:29 New connection for "<instance_string>"
2023/05/17 11:32:29 New connection for "<instance_string>"
2023/05/17 11:32:29 refresh operation throttled for <instance_string>: reusing config from last refresh (10.111788974s ago)
2023/05/17 11:32:29 couldn't connect to "<instance_string>": dial tcp: missing address
2023/05/17 11:32:29 refresh operation throttled for <instance_string>: reusing config from last refresh (10.111788974s ago)
2023/05/17 11:32:29 couldn't connect to "<instance_string>": dial tcp: missing address
2023/05/17 11:33:02 New connection for "<instance_string>"
2023/05/17 11:33:02 New connection for "<instance_string>"
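
The two "Scheduling refresh" lines above can be checked against the certificate expiry time printed next to them. The sketch below redoes that arithmetic in Python with values copied from the log and rounded to microseconds. The function name `margin` is made up for this example, and the 5-second safety margin it finds is only an observation from these two log lines, not something confirmed against the Proxy source.

```python
from datetime import datetime, timedelta

# Certificate expiry as printed in the log ("certificate expires: ...").
EXPIRES = datetime(2023, 5, 17, 11, 35, 20)


def margin(now: datetime, scheduled: timedelta) -> timedelta:
    """Time left between the scheduled refresh and certificate expiry."""
    return EXPIRES - now - scheduled


# Values from the 11:30:20 log lines, rounded to the nearest microsecond
# (Python's datetime does not carry nanoseconds).
first = margin(
    datetime(2023, 5, 17, 11, 30, 20, 218349),
    timedelta(minutes=4, seconds=54, microseconds=781651),
)

# Values from the 11:32:19 log lines, rounded the same way.
second = margin(
    datetime(2023, 5, 17, 11, 32, 19, 389718),
    timedelta(minutes=2, seconds=55, microseconds=610282),
)

print(first, second)
```

In both cases the refresh is scheduled a fixed interval before the old certificate expires, which suggests the Proxy kept retrying against the same certificate (expiring at 11:35:20) because each refresh attempt failed with the 409 error.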

Steps to reproduce?

We have not been able to reproduce the issue since it occurred.

Environment

We are running the Cloud SQL Proxy in a GKE cluster. The Proxy runs inside the application container itself (not as a sidecar).

  1. OS type and version: Container: gcr.io/distroless/java11-debian11:latest
  2. Cloud SQL Proxy version (./cloud-sql-proxy --version): gce-proxy:1.24.0
  3. Proxy invocation command (for example, ./cloud-sql-proxy --port 5432 INSTANCE_CONNECTION_NAME): ./cloud_sql_proxy -instances="${instance_name}"=tcp:3306

Additional Details

No response

@Kanav-7 Kanav-7 added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Jun 22, 2023
@jackwotherspoon jackwotherspoon added the priority: p2 Moderately-important priority. Fix may not be included in next release. label Jun 22, 2023
@jackwotherspoon
Collaborator

Hi @Kanav-7 thanks for raising the issue on the Cloud SQL Proxy.

The first thing that stands out to me is the Proxy version, v1.24.0, which is extremely old (almost 2 years old). I would recommend at the very least updating to the latest version of the v1 Proxy, v1.38.0. Better still, you may want to consider migrating to v2 of the Proxy (latest is v2.4.0). A lot of enhancements and new features have been added to v2 that make it more resilient.

Similar issues have been raised in the past for older v1 versions (#779, #343), but we have not yet seen the same issues arise in newer versions.

2023/05/17 11:32:29 couldn't connect to "<instance_string>": read tcp 10.22.0.14:60914-><ip:3307>: read: connection reset by peer

An error of this nature normally points to a networking/connectivity issue. The error message lists an external address/port (i.e. 10.22.0.14:60914), which makes me believe the problem is with the connection between the Proxy and the Cloud SQL instance; port 3307 in ip:3307 is the standard TCP port the Proxy connects to on the Cloud SQL instance (see the diagram below).

image

Sometimes the connection can fail if one side is under critical load. I would check whether your Cloud SQL instance (or Proxy container) had high CPU usage around the time of the error. The "refresh operation throttled" message might also point to too many active connections or the load on the server being too high.

How often is this error being seen? Is it rare, or does it happen fairly frequently? If it happens very infrequently, then the cause could be the Cloud SQL instance undergoing a maintenance window or an automatic upgrade at the time a refresh operation occurs. If these events overlap, the Proxy would be unable to reach the Cloud SQL instance, as we don't currently support seamless cutovers to read replicas, backups, etc. (#1831). This would also explain the message:

2023/05/17 11:32:29 couldn't connect to "<instance_string>": dial tcp: missing address

The refresh operation would be unable to get the IP address of the instance if one of these events were occurring.
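
To get a rough answer to the frequency question, the Proxy logs can be bucketed by minute and error type. The following is a minimal sketch, assuming the v1 log format shown in this issue (lines starting with `YYYY/MM/DD HH:MM:SS`). The category labels and the function name `count_errors_per_minute` are hypothetical, chosen only for this example. The substrings it matches are copied from the log lines quoted in this thread.

```python
import re
from collections import Counter

# Substrings taken from the log lines quoted in this issue; the labels on
# the left are hypothetical names chosen for this sketch.
PATTERNS = {
    "refresh_failed": "failed to refresh the ephemeral certificate",
    "connection_reset": "connection reset by peer",
    "missing_address": "dial tcp: missing address",
    "refresh_throttled": "refresh operation throttled",
}

# Capture the timestamp down to the minute and skip the seconds.
TIMESTAMP = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}):\d{2} ")


def count_errors_per_minute(lines):
    """Count matching log lines per (minute, label) pair."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # not a timestamped Proxy log line
        minute = match.group(1)
        for label, needle in PATTERNS.items():
            if needle in line:
                counts[(minute, label)] += 1
    return counts


sample = [
    '2023/05/17 11:32:29 couldn\'t connect to "<instance_string>": '
    'read tcp 10.22.0.14:60914-><ip:3307>: read: connection reset by peer',
    '2023/05/17 11:32:29 couldn\'t connect to "<instance_string>": '
    'dial tcp: missing address',
    '2023/05/17 11:33:02 New connection for "<instance_string>"',
]

for (minute, label), n in sorted(count_errors_per_minute(sample).items()):
    print(minute, label, n)
```

If the counts cluster in a single short window, that would be consistent with a one-off event such as maintenance. Counts spread across many days would point more towards load or networking.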

I would first check CPU usage at the time of the error and also try upgrading to the latest version of the Proxy (v2 is recommended). Let us know if you see the issue arise on a recent version of the Proxy and we can investigate further 😄

@enocom
Member

enocom commented Jun 29, 2023

+1 to upgrading to latest. V2 is best, but v1 is still supported.

Would you mind updating and reporting back if you still see this?

@enocom
Member

enocom commented Jul 25, 2023

Closing this as stale. Feel free to re-open with more information.

@enocom enocom closed this as completed Jul 25, 2023

3 participants