Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connection test occasionally causes TLS handshake failed errors #2081

Closed
mvhatch opened this issue Jan 9, 2024 · 16 comments
Closed

Connection test occasionally causes TLS handshake failed errors #2081

mvhatch opened this issue Jan 9, 2024 · 16 comments
Assignees
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@mvhatch
Copy link

mvhatch commented Jan 9, 2024

Bug Description

gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.0 and gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.1 takes 30s to establish a connection when using --run-connection-test. This delay intermittently introduces proxy connection errors.

Stacktrace

Version v2.8.1

2024/01/08 23:07:43 Authorizing with OAuth2 token
2024/01/08 23:07:44 [<project>:<region>:<instance>] Listening on [::]:5432
2024/01/08 23:07:44 The proxy has started successfully and is ready for new connections!
2024/01/08 23:07:44 Connection test started
2024/01/08 23:08:14 Connection test passed
2024/01/08 23:08:14 [<project>:<region>:<instance>] accepted connection from 192.168.192.3:35240
2024/01/08 23:08:44 [<project>:<region>:<instance>] failed to connect to instance: Dial error: handshake failed (connection name = "<project>:<region>:<instance>"): context deadline exceeded
pg_dump: error: connection to database "<databse>" failed: connection to server at "pg_dump--proxy" (192.168.192.2), port 5432 failed: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

Version v2.7.2

2024/01/09 19:24:00 Authorizing with OAuth2 token
2024/01/09 19:24:00 [<project>:<region>:<instance>] Listening on [::]:5432
2024/01/09 19:24:00 The proxy has started successfully and is ready for new connections!
2024/01/09 19:24:00 Connection test started
2024/01/09 19:24:00 Connection test passed
2024/01/09 19:24:01 [<project>:<region:<instance>] accepted connection from 172.23.0.3:60642

Steps to reproduce?

Start a proxy container pointing to an instance, run the connection test as part of startup

  1. docker run --rm gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.0 <project>:<region>:<instance> --run-connection-test

Environment

  1. OS type and version:
uname -a
Linux 92671f21852e 6.4.16-linuxkit #1 SMP PREEMPT Wed Oct 25 16:32:24 UTC 2023 x86_64 GNU/Linux 
  1. Cloud SQL Proxy version (./cloud-sql-proxy --version): 2.8.1+container
  2. Proxy invocation command (for example, ./cloud-sql-proxy --port 5432 INSTANCE_CONNECTION_NAME):

Additional Details

No response

@mvhatch mvhatch added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Jan 9, 2024
@enocom enocom added the priority: p2 Moderately-important priority. Fix may not be included in next release. label Jan 9, 2024
@enocom
Copy link
Member

enocom commented Jan 9, 2024

How are you connecting to the Proxy in the container?

@enocom
Copy link
Member

enocom commented Jan 9, 2024

FWIW I do this:

docker run --rm -p 5432:5432 gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.8.1 \
  --address 0.0.0.0 <INSTANCE_CONNECTION_NAME> \
  --run-connection-test --token $(gcloud auth print-access-token)

And then connect with psql from my host machine. This works without issue.

@enocom enocom assigned enocom and unassigned jackwotherspoon Jan 9, 2024
@mvhatch
Copy link
Author

mvhatch commented Jan 9, 2024

@enocom - I have no issue connecting my host machine to the proxy; I have an intermittent issue w/ the proxy connecting to cloud sql. However, I have no issue with the proxy connecting to cloud sql when using v2.7.2. The only thing I noticed was a timeout error in v2.8.1.

@enocom
Copy link
Member

enocom commented Jan 9, 2024

How often do you see this timeout error? Also I assume you're connecting to public IP?

@mvhatch
Copy link
Author

mvhatch commented Jan 9, 2024

How often do you see this timeout error? Also I assume you're connecting to public IP?

The timeout error has been happening ~50% of the time when running a workload that uses v2.8.1.
It is connecting to the public IP.

@enocom
Copy link
Member

enocom commented Jan 9, 2024

Thanks, @mvhatch. In that case, let me hammer this some more to see if I can force out the same problem. Have you tried reproducing outside of docker?

@enocom enocom changed the title v2.8.0 and v2.8.1 --run-connection-test takes 30s Connection test occasionally causes TLS handshake failed errors Jan 10, 2024
@enocom
Copy link
Member

enocom commented Jan 10, 2024

I haven't reproduced this yet, but realize I've seen this same problem when people use the readiness probe.

The key error is:

failed to connect to instance: Dial error: 
  handshake failed (connection name = "<project>:<region>:<instance>"): 
  context deadline exceeded

@mvhatch
Copy link
Author

mvhatch commented Jan 10, 2024

Idk if this is helpful, but I've also seen a timeout error in addition to the context deadline exceeded error:

2024/01/08 22:37:09 Authorizing with OAuth2 token
2024/01/08 22:37:09 <project>:<region>:<instance> Listening on [::]:5432
2024/01/08 22:37:09 The proxy has started successfully and is ready for new connections!
2024/01/08 22:37:09 Connection test started
2024/01/08 22:37:39 Connection test passed
2024/01/08 22:37:39 <project>:<region>:<instance> accepted connection from 192.168.96.3:38646
2024/01/08 22:38:09 <project>:<region>:<instance> failed to connect to instance: Dial error: failed to dial (connection name = "<project>:<region>:<instance>"): dial tcp 35.238.253.130: i/o timeout
pg_dump: error: connection to database "<db>" failed: connection to server at "pg_dump--proxy" (192.168.96.2), port 5432 failed: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

@mvhatch
Copy link
Author

mvhatch commented Jan 10, 2024

perhaps it is a red herring, but do you know the reason the connection tests now take 30s in version v2.8.x?

@enocom
Copy link
Member

enocom commented Jan 10, 2024

Yes that is helpful. It implies the remote end of the socket isn't responding, whether for the initial connection or the TLS handshake.

@enocom
Copy link
Member

enocom commented Jan 10, 2024

perhaps it is a red herring, but do you know the reason the connection tests now take 30s in version v2.8.x?

That's still unclear and entirely unexpected.

@enocom
Copy link
Member

enocom commented Jan 10, 2024

FWIW I'd expect the connection test to be only used when you're uncertain if the network path is open. Otherwise, I wouldn't use it regularly. In the meantime, we'll sort out this bug.

@akshetpandey
Copy link

akshetpandey commented Apr 22, 2024

I have a similar issue when running in dataflow. I am connecting using private ip. It's not very often, something like ~5%, but it causes my dataflow job to fail after spending an hour hung up on this.

curl -o /bin/cloud_sql_proxy https://storage.googleapis.com/cloud-sql-connectors/cloud-sql-proxy/v2.9.0/cloud-sql-proxy.linux.amd64

/bin/cloud_sql_proxy --private-ip -u /cloudsql ${CLOUD_SQL_PROXY_INSTANCES} &

2024/04/21 01:34:43 [$DB-STRING] could not resolve instance version: failed to get instance: Refresh error: failed to get instance metadata (connection name = "$DB-STRING"): Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/$PROJECT/instances/$DB/connectSettings?alt=json&prettyPrint=false": net/http: TLS handshake timeout

2024/04/21 01:34:43 The proxy has encountered a terminal error: unable to start: [$DB-STRING] Unable to mount socket: failed to get instance: Refresh error: failed to get instance metadata (connection name = "$DB-STRING"): Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/$PROJECT/instances/$DB/connectSettings?alt=json&prettyPrint=false": net/http: TLS handshake timeout

@enocom
Copy link
Member

enocom commented Apr 25, 2024

@akshetpandey Are you using the --run-connection-test flag? This is for what seems to be a race condition when that flag is enabled.

If you're getting TLS errors otherwise, feel free to open a new issue. At a glance, I'd be curious to know what kind of CPU usage you see when that TLS handshake timeout occurs. Sometimes that can be a symptom of the Proxy being resource starved.

@akshetpandey
Copy link

akshetpandey commented May 1, 2024

It isn't related to that flag, although I should use it to work around the infinite loop I am facing because of this! Will create a new issue

@enocom
Copy link
Member

enocom commented May 24, 2024

I think this is a duplicate of #2224 and #2212, the first of the two being the more relevant description.

We're going to cut a release next week with a fix. If I'm mistaken and this is still happening after the release next week, feel free to re-open. You can try this now by building against the latest cloudsqlconn (https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/releases/tag/v1.10.1).

@enocom enocom closed this as completed May 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

4 participants