[client-v2][discussion] Load-balancing on the client side. #1870

Open
chernser opened this issue Oct 17, 2024 · 7 comments

@chernser
Contributor

Topic

Today the client-v2 implementation can connect to only a single target host. The client uses the Apache HTTP client, which has a built-in connection pool. This is enough to handle many use cases, because we assume that the ClickHouse cluster is behind a load balancer.
Client-v1 has load balancing on the client side and can handle failover to a backup node. This mechanism is complex because it has to track many moving parts.

Handling load balancing on the client side has some challenges:

  • the client has to keep track of the liveness of all nodes.
  • load distribution requires very fine-grained connection control.
  • fair load distribution would require knowing the load of each pod.
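
To make the first challenge concrete, here is a minimal sketch of the liveness bookkeeping a client would need. `NodeHealthTracker`, its method names, and the cooldown approach are hypothetical and are not part of client-v1 or client-v2; a real implementation would also need active health probes.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Hypothetical helper, not part of any ClickHouse client API.
// A node that failed is skipped until its cooldown has elapsed.
class NodeHealthTracker {
    private final Duration cooldown;
    // node -> timestamp (ms) of its most recent recorded failure
    private final Map<String, Long> lastFailureMillis = new ConcurrentHashMap<>();

    NodeHealthTracker(Duration cooldown) {
        this.cooldown = cooldown;
    }

    // Record a failed request against a node.
    void markFailed(String node, long nowMillis) {
        lastFailureMillis.put(node, nowMillis);
    }

    // Record a successful request: the node is considered live again.
    void markHealthy(String node) {
        lastFailureMillis.remove(node);
    }

    // A node is live if it has no recorded failure or its cooldown has expired.
    boolean isLive(String node, long nowMillis) {
        Long failedAt = lastFailureMillis.get(node);
        return failedAt == null || nowMillis - failedAt >= cooldown.toMillis();
    }

    // Filter the configured nodes down to those currently considered live.
    List<String> liveNodes(List<String> nodes, long nowMillis) {
        return nodes.stream()
                .filter(n -> isLive(n, nowMillis))
                .collect(Collectors.toList());
    }
}
```

Even this small tracker has to decide how long to distrust a node and when to retry it, which is the kind of moving part an external load balancer already handles.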

An external HTTP load balancer would work better:

  • all known load balancers have a liveness check, so there is no need to do it in the client.
  • the proxy knows how many requests were sent to each node, so fair distribution is much easier to handle.
  • having a centralized proxy helps to control traffic to the DB. For example, if some application starts misbehaving, it can easily be disconnected from the cluster.
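
As an illustration of the second point, the sketch below shows the kind of per-node request accounting a proxy can do because every request passes through it. `LeastInFlightSelector` is a hypothetical class written for this example; it is not taken from any particular proxy.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of "least in-flight requests" routing.
class LeastInFlightSelector {
    private final List<String> nodes;
    private final Map<String, AtomicInteger> inFlight = new ConcurrentHashMap<>();

    LeastInFlightSelector(List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("at least one node is required");
        }
        this.nodes = List.copyOf(nodes);
        for (String n : this.nodes) {
            inFlight.put(n, new AtomicInteger());
        }
    }

    // Pick the node with the fewest in-flight requests (ties go to the
    // earliest configured node) and count the new request against it.
    // Selection and increment are not one atomic step, so under heavy
    // concurrency the choice is only approximately the least loaded node.
    String acquire() {
        String best = nodes.get(0);
        for (String n : nodes) {
            if (inFlight.get(n).get() < inFlight.get(best).get()) {
                best = n;
            }
        }
        inFlight.get(best).incrementAndGet();
        return best;
    }

    // Call when the request to the given node has completed.
    void release(String node) {
        inFlight.get(node).decrementAndGet();
    }

    int inFlightCount(String node) {
        return inFlight.get(node).get();
    }
}
```

A single client process only sees its own requests, so the same counters kept inside one client would not reflect the load generated by other applications.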

The proxy would become a single point of failure, but it is lightweight and easier to restart than a swarm of pods.

This issue is for discussion. Please share your thoughts on the pros and cons of both approaches. Thanks!

@ashwinsri1

Hey,
I have a few uncertainties regarding the load balancer (LB) solution:

  • If I provide the LB DNS name as the connection host and create a connection pool with it, wouldn't the pool connect to only one of the cluster nodes at creation time?
  • Since the client makes the connections to the ClickHouse cluster, some visibility into the connection pool for each node would be better: I could see the issues and performance of each thread in the client logs instead of going to the LB dashboard or logs.

Keeping in mind your points about the client-side challenges, I am not at all against using an external LB, but I need some certainty that all the resources will be used effectively when the client connects to the CH cluster.

@huddedar34

Hi @chernser ,
We have a similar use case: a single shard with 3 replicas, with ClickHouse hosted on 3 separate VMs. We are okay with using an external LB. But based on the docs, I understand that when we use a DNS name it resolves to one of the nodes and a connection pool is created with that node (Ref). We want to utilise all the replicas for processing. How can we achieve this using the ClickHouse Java client v2? Is there any plan to support this? Is there any short-term solution you can suggest? Thanks in advance!

@chernser
Contributor Author

Good day, @ashwinsri1!
Thank you for the great question!

  • DNS load balancing is based on changing the IP address for a known host. It usually takes longer for the client to switch to the new route. As far as I know, load balancers of this type do not work alone; their function is to route traffic between different regions, zones, and data centers. For example, it helps when a zone fails and traffic cannot be proxied. DNS load balancing is usually implemented by updating DNS server records, so it can be applied without client support. (I will check how we could improve the time it takes to switch between routes on the client side.)
  • I was suggesting a reverse-proxy load balancer, where requests are routed to cluster nodes by the proxy. In this setup, load distribution is controlled by the proxy and can be changed on the fly. With HTTP, it would be possible to send additional information to the proxy in headers (for example, to indicate a "heavy" request) so that the proxy can make a special routing decision. There are many options for building such a solution: nginx, Istio, k8s network, HAProxy, GCP Load Balancing.
  • Adding more metrics to the client is in our plans. This is a very good suggestion! However, proxy metrics should be present too, because only the proxy makes the final decision, and by using a proxy we actually centralize traffic monitoring.
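
The header idea from the second point could look roughly like the sketch below. It uses the JDK's `java.net.http` request builder (Java 11+) rather than the client-v2 API, and the header name `X-Query-Weight`, its values, and the `HeavyQueryRequests` class are all assumptions made up for this example. A proxy would have to be configured separately to act on the header.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: tag a query so a reverse proxy can route it specially.
class HeavyQueryRequests {
    // Assumed header name; nothing in ClickHouse or any proxy defines it.
    static final String WEIGHT_HEADER = "X-Query-Weight";

    // Build (but do not send) a POST request carrying the SQL text and a
    // routing hint for the proxy.
    static HttpRequest build(String proxyUrl, String sql, boolean heavy) {
        return HttpRequest.newBuilder(URI.create(proxyUrl))
                .header(WEIGHT_HEADER, heavy ? "heavy" : "normal")
                .POST(HttpRequest.BodyPublishers.ofString(sql, StandardCharsets.UTF_8))
                .build();
    }
}
```

For example, `HeavyQueryRequests.build("http://db-proxy.local:8123/", "SELECT 1", true)` produces a request whose `X-Query-Weight` header is `heavy`. The host name and port are placeholders.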

We would appreciate it if you could share a high-level data flow.

Thanks!

@chernser
Contributor Author

Good day, @huddedar34!
Thank you for the question.

  • if we use a proxy server that receives client requests and forwards them to some replica, the following happens:

    • the client uses the proxy's hostname (e.g. "db-proxy.local") and the DB replicas are on their own hosts (e.g. "db1.local", "db2.local")
    • all requests and connections go to "db-proxy.local" in this case
    • the proxy decides where to forward each client request ("db1.local" or "db2.local")
    • if a replica fails to respond in time, the proxy may re-send the same request to another replica without breaking the client's connection
    • if something indicates that one of the replicas is overloaded (e.g. its response time has increased), the proxy can route most requests to a less loaded replica
  • if DNS load balancing is used:

    • the client may use any host name (not an IP address). The host name can be for a proxy or for a replica.
    • the DNS load balancer updates DNS records when traffic should be routed to another shard.
    • the DNS load balancer solves load balancing at a higher level, for example to distribute load between regions and to send client traffic to the closest endpoint.
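
The proxy's re-send behaviour described in the first group of points can be sketched as follows. `FailoverSender` is a hypothetical helper, and the `attempt` function stands in for whatever actually sends the request. Re-sending is only safe for requests that can be repeated without side effects.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of trying replicas in order until one answers.
class FailoverSender {
    // Calls attempt for each replica in turn and returns the first result
    // that does not throw. If every replica fails, the last failure is
    // attached as the cause of the thrown exception.
    static <T> T send(List<String> replicas, Function<String, T> attempt) {
        RuntimeException last = null;
        for (String replica : replicas) {
            try {
                return attempt.apply(replica);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next replica
            }
        }
        throw new IllegalStateException("all replicas failed", last);
    }
}
```

The caller sees a single call and a single result, which mirrors how the client keeps one connection to "db-proxy.local" while the proxy retries behind it.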

Both types of load balancing can work together.
There is only minimal support for this on the client side. However, the client can be tuned to work with DNS load balancing.

Would you please share some details about your architecture? Are all replicas in the same network? Are there external clients connecting directly to the cluster?
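
One example of such tuning on the JVM side, independent of the client library, is shortening the JVM's DNS cache so that updated records are picked up sooner. The `DnsCacheTuning` class and the chosen values are assumptions for illustration; `networkaddress.cache.ttl` and `networkaddress.cache.negative.ttl` are standard Java security properties.

```java
import java.security.Security;

// Hypothetical helper: shorten JVM-level DNS caching.
class DnsCacheTuning {
    // Should be called early, before the JVM performs its first host name
    // lookup, because the cache policy is read when it is first needed.
    static void apply(int positiveTtlSeconds, int negativeTtlSeconds) {
        // How long successful lookups are cached.
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        // How long failed lookups are cached.
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }
}
```

For example, `DnsCacheTuning.apply(30, 5)` at application startup. Note that this only affects new lookups: connections that are already open in a pool keep using the address they were created with until they are closed.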

Thanks!

@chernser
Contributor Author

chernser commented Nov 4, 2024

Good day, @ashwinsri1 @huddedar34!

I think we will eventually implement it in client-v2. For now, we need to understand what is actually needed from these two features.

I have a question: where do you run the client application? Is it on separate VMs or on something like K8s?

Thanks!

@huddedar34

huddedar34 commented Nov 7, 2024

Hi @chernser,

Thanks for the reply.

We run the client application on K8s. We will try a POC of the proxy and DNS approaches for the ClickHouse cluster and get back to you if we have any questions.

Regarding the architecture: our client application service runs on K8s and connects to a ClickHouse cluster with a single shard and 3 replicas. The replicas are hosted on 3 separate VMs (1 on each). We want to utilise all 3 replicas for our query processing.

Let us know if you want to understand any more details. Happy to help!

@ahmedriza

ahmedriza commented Dec 25, 2024

Hi @chernser,

I am also interested in client-side (V2) load balancing.

As you pointed out, a global load balancer can be a single point of failure and can become overloaded. The MinIO solution to this is interesting; it is described in "Introducing Sidekick - A High Performance Load Balancer". It avoids both the single point of failure and the overloading of load balancers.

For our use case, we run a number of applications on k8s and they connect to a ClickHouse cluster - typically with one or two shards and multiple replicas.

How about a starting point where the client just does round-robin between the configured endpoints?
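
A minimal sketch of what that starting point might look like, assuming the endpoints are plain host names supplied in configuration. `RoundRobinEndpoints` is a hypothetical class, not an existing client-v2 API, and it deliberately ignores node liveness.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of round-robin over a fixed list of endpoints.
class RoundRobinEndpoints {
    private final List<String> endpoints;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinEndpoints(List<String> endpoints) {
        if (endpoints.isEmpty()) {
            throw new IllegalArgumentException("at least one endpoint is required");
        }
        this.endpoints = List.copyOf(endpoints);
    }

    // Return the next endpoint in rotation. Safe to call from many threads;
    // floorMod keeps the index valid even after the counter overflows.
    String next() {
        int index = Math.floorMod(counter.getAndIncrement(), endpoints.size());
        return endpoints.get(index);
    }
}
```

With three configured endpoints, successive calls to `next()` cycle through them in order and then wrap around to the first.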
