Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

circuit: introduce half_open_resource_timeout #188

Merged
merged 1 commit into from
Jul 31, 2018

Conversation

sirupsen
Copy link
Contributor

@sirupsen sirupsen commented Jul 30, 2018

When a circuit is busy (for those at Shopify, see https://github.com/Shopify/shopify/pull/164168) you may reduce a substantial amount of capacity if you have large timeouts (see #145). One way to circumvent this is to have very large error_threshold timeouts, as a multiple of your timeout. E.g. if your resource timeout is 30s, then an error_timeout of 30s will make sure you have 50% of capacity available, and 60s will make sure you have 25% available (assuming that the circuits trigger equally over the timeout period, which is not an unreasonable assumption in production). However, this has obvious drawbacks, making recovering from incidents take longer for the affected resource.

Instead, we can reduce the resource timeout in the half_open? state to achieve something similar. Normally, timeouts are to allow outlier queries to go through—in the face of an incident, we can eat that (we'd want to kill those anyway) and reduce the timeout dramatically while half_open? on the circuit.

The ugly thing about this, is that it couples the circuit with the resource. However, I think it makes more sense for the circuit to know about the resource, than for the resource to know about the circuit—that's why I chose to implement it this way.

This will need to be implemented per adapter, by implementing with_resource_timeout.

cc @Shopify/servcomm

@sirupsen sirupsen force-pushed the half-open-resource-timeout branch from 28d7c3c to ce2808f Compare July 30, 2018 10:50
@@ -122,5 +123,19 @@ def log_state_transition(new_state)
def disabled?
ENV['SEMIAN_CIRCUIT_BREAKER_DISABLED'] || ENV['SEMIAN_DISABLED']
end

def maybe_with_half_open_resource_timeout(resource)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the coupling I think is ugly; passing in the resource and figuring out whether it quacks correctly here. When I do a 2nd iteration here with comments, I'll likely refactor to pass the raw resource into the initializer of the bulkhead and circuit—which seems a bit more clean.

@sirupsen sirupsen force-pushed the half-open-resource-timeout branch from ce2808f to 500c47a Compare July 30, 2018 11:00
@sirupsen
Copy link
Contributor Author

sirupsen commented Jul 30, 2018

Ugh... for some reason this didn't surface in local tests, and only fails in one place on CI, but sometimes it won't allow reconfiguring the timeouts for an established client 😖

Found this... brianmario/mysql2#317 does not look promising... Kir worked on this in brianmario/mysql2#955, I'll talk to him and see if we can get this in.

Copy link
Contributor

@casperisfine casperisfine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea, and I think it makes sense to only give a small timeout when going into half-open to check if the resource is properly recovered.

However I'm afraid that queries that can't possibly be executed with this reduced timeout might cause the CB to stay open longer than it should.

@@ -83,6 +83,23 @@ def query(*args)
end
end

def with_resource_timeout(temp_timeout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we support different timeouts for read / write, or is it assumed this timeout is simply for the recovery phase so there is no need for it to be fine grained?

@@ -27,7 +28,7 @@ def acquire

result = nil
begin
result = yield
result = maybe_with_half_open_resource_timeout(resource) { yield }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/nitpick, pass &block to safe one stack frame and it's nicer.

@sirupsen
Copy link
Contributor Author

sirupsen commented Jul 30, 2018

I read the MySQL2 source, and we can get the read_timeout without https://github.com/brianmario/mysql2/pull/955—so let's do that for now.

This should be good for final review.

@sirupsen sirupsen force-pushed the half-open-resource-timeout branch 2 times, most recently from 8d7afb7 to 4834280 Compare July 30, 2018 15:42
@@ -34,7 +34,7 @@ def test_semian_can_be_disabled
end

def test_connection_errors_open_the_circuit
@proxy.downstream(:latency, latency: 1200).apply do
@proxy.downstream(:latency, latency: 2200).apply do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why you had to change this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I upped the default timeout to 2s, to show that it changes to 1s during the half open state.

Copy link
Contributor

@casperisfine casperisfine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation is 👍 to me.

Still unsure about the actual effect in prod, but I'm sure you'll look at those closely so 👍 as well.

yield
ensure
@read_timeout = prev_read_timeout
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this pattern makes me a bit uncomfortable when expressed like this. If someone were to add a line of code to the top of this method that might raise, the ensure could set a nil value. An alternative that more clearly maps to the intention is:

def
  prev = current
  begin
    current = temp
    # things
  ensure
    current = prev
  end
end

@sirupsen sirupsen force-pushed the half-open-resource-timeout branch from 4834280 to 0da0d7b Compare July 31, 2018 08:29
@sirupsen sirupsen merged commit 47bee87 into master Jul 31, 2018
@sirupsen sirupsen deleted the half-open-resource-timeout branch July 31, 2018 08:30
@kirs
Copy link
Contributor

kirs commented Jul 31, 2018

@sirupsen have you looked into how hard it would be to make other adapters (e.g. Redis) to support the same with_resource_timeout trick?

@sirupsen
Copy link
Contributor Author

sirupsen commented Jul 31, 2018

@kirs pretty easy, I just decided to prioritize MySQL here because those resources don't generally pose larger infra concerns in terms of occupying capacity because of their lower timeouts. For MySQL, it wouldn't be crazy to have this 25-50x as small. For something like Redis, you want to never really allow any query to take more than ~250ms. You're not going to get an order of magnitude there, but it might be worth it eventually.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants