
Distribute cache concurrency issue #83

Closed · Caskia opened this issue Jun 20, 2019 · 34 comments · Fixed by #180

Comments

@Caskia

Caskia commented Jun 20, 2019

Hi @stefanprodan,
When deploying many API services behind a load balancer (using the distributed cache to store counters), every instance reads the counter and uses an in-process lock to increment it, as in the code below:

using (await AsyncLock.WriterLockAsync(counterId).ConfigureAwait(false))
{
    var entry = await _counterStore.GetAsync(counterId, cancellationToken);

    if (entry.HasValue)
    {
        // entry has not expired
        if (entry.Value.Timestamp + rule.PeriodTimespan.Value >= DateTime.UtcNow)
        {
            // increment request count
            var totalCount = entry.Value.Count + _config.RateIncrementer?.Invoke() ?? 1;

            // deep copy
            counter = new RateLimitCounter
            {
                Timestamp = entry.Value.Timestamp,
                Count = totalCount
            };
        }
    }

    // stores: id (string) - timestamp (datetime) - total_requests (long)
    await _counterStore.SetAsync(counterId, counter, rule.PeriodTimespan.Value, cancellationToken);
}

I think this has a concurrency issue: the counter will not work correctly when a lot of requests come in. Each instance increments its own copy of the counter and then saves it to the cache store, so the last write overwrites the previous ones.

Why not use the Token Bucket algorithm to implement this feature?
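To illustrate what a token bucket looks like, here is a minimal in-process sketch (the class name, capacity and refill rate below are made up for this example and are not part of this library; a multi-instance deployment would still need the refill-and-consume step to happen atomically in the shared store):

public class TokenBucket
{
    private readonly object _sync = new object();
    private readonly double _capacity;        // maximum burst size
    private readonly double _refillPerSecond; // steady-state request rate
    private double _tokens;
    private DateTime _lastRefill = DateTime.UtcNow;

    public TokenBucket(double capacity, double refillPerSecond)
    {
        _capacity = capacity;
        _refillPerSecond = refillPerSecond;
        _tokens = capacity;
    }

    // returns true if the request is allowed, false if it should be rejected
    public bool TryConsume(double tokens = 1)
    {
        lock (_sync)
        {
            var now = DateTime.UtcNow;
            var elapsedSeconds = (now - _lastRefill).TotalSeconds;
            _lastRefill = now;

            // refill tokens for the elapsed time, capped at the bucket capacity
            _tokens = Math.Min(_capacity, _tokens + elapsedSeconds * _refillPerSecond);

            if (_tokens < tokens)
            {
                return false;
            }

            _tokens -= tokens;
            return true;
        }
    }
}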

@cristipufu
Collaborator

Hi @Caskia,

Your observation is correct. How would you implement the Token Bucket algorithm in this particular case?

Please feel free to contribute details or pseudo-code, or open a new pull request.

Thanks

@liguobao

liguobao commented Sep 5, 2019

I'm getting this error:

[2019-09-05T12:08:12.6390870+00:00  Microsoft.AspNetCore.Server.Kestrel ERR] Connection id "0HLPFG8HQJTUG", Request id "0HLPFG8HQJTUG:00000003": An unhandled exception was thrown by the application.
System.OperationCanceledException: The operation was canceled.
   at System.Threading.CancellationToken.ThrowOperationCanceledException()
   at Microsoft.Extensions.Caching.Redis.RedisCache.RefreshAsync(String key, Nullable`1 absExpr, Nullable`1 sldExpr, CancellationToken token)
   at Microsoft.Extensions.Caching.Redis.RedisCache.GetAndRefreshAsync(String key, Boolean getData, CancellationToken token)
   at Microsoft.Extensions.Caching.Redis.RedisCache.GetAsync(String key, CancellationToken token)
   at Microsoft.Extensions.Caching.Distributed.DistributedCacheExtensions.GetStringAsync(IDistributedCache cache, String key, CancellationToken token)
   at AspNetCoreRateLimit.DistributedCacheRateLimitStore`1.GetAsync(String id, CancellationToken cancellationToken)
   at AspNetCoreRateLimit.IpRateLimitProcessor.GetMatchingRulesAsync(ClientRequestIdentity identity, CancellationToken cancellationToken)
   at AspNetCoreRateLimit.RateLimitMiddleware`1.Invoke(HttpContext context)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication`1 application)

Is this error caused by this library?

Thanks.

@liguobao

liguobao commented Sep 8, 2019

@cristipufu Hi, could you take a look at the error message above?

@cristipufu
Collaborator

There's a workaround for the distributed cache issue, but it requires a little bit of coding (be aware that this is only a concept and it's not tested):

Override the rate limiting middleware:

DistributedClientRateLimitMiddleware.cs
public static IApplicationBuilder UseDistributedClientRateLimiting(this IApplicationBuilder builder)
{
  return builder.UseMiddleware<DistributedClientRateLimitMiddleware>();
}

public class DistributedClientRateLimitMiddleware : RateLimitMiddleware<DistributedClientRateLimitProcessor>
{
  private readonly ILogger<ClientRateLimitMiddleware> _logger;

  public DistributedClientRateLimitMiddleware(RequestDelegate next,
          IOptions<ClientRateLimitOptions> options,
          IRateLimitCounterStore counterStore,
          IClientPolicyStore policyStore,
          IRateLimitConfiguration config,
          RedLockFactory redLockFactory,
          ILogger<ClientRateLimitMiddleware> logger)
      : base(next, options?.Value, new DistributedClientRateLimitProcessor(options?.Value, counterStore, policyStore, config, redLockFactory), config)
  {
      _logger = logger;
  }

  protected override void LogBlockedRequest(HttpContext httpContext, ClientRequestIdentity identity, RateLimitCounter counter, RateLimitRule rule)
  {
      _logger.LogInformation($"Request {identity.HttpVerb}:{identity.Path} from ClientId {identity.ClientId} has been blocked, quota {rule.Limit}/{rule.Period} exceeded by {counter.Count}. Blocked by rule {rule.Endpoint}, TraceIdentifier {httpContext.TraceIdentifier}.");
  }
}

Now we just have to override the ClientRateLimitProcessor.ProcessRequestAsync() method, replacing the AsyncLock in-process lock with a Redis distributed lock:

DistributedClientRateLimitProcessor.cs
public class DistributedClientRateLimitProcessor : ClientRateLimitProcessor
    {
        private readonly IRateLimitCounterStore _counterStore;
        private readonly IRateLimitConfiguration _config;
        private readonly RedLockFactory _redLockFactory;

        public DistributedClientRateLimitProcessor(ClientRateLimitOptions options,
            IRateLimitCounterStore counterStore,
            IClientPolicyStore policyStore,
            IRateLimitConfiguration config,
            RedLockFactory redLockFactory)
            : base(options, counterStore, policyStore, config)
        {
            _counterStore = counterStore;
            _config = config;
            _redLockFactory = redLockFactory; // this should be registered as a singleton
        }

        public override async Task<RateLimitCounter> ProcessRequestAsync(ClientRequestIdentity requestIdentity, RateLimitRule rule, CancellationToken cancellationToken = default)
        {
            var counter = new RateLimitCounter
            {
                Timestamp = DateTime.UtcNow,
                Count = 1
            };

            var counterId = BuildCounterKey(requestIdentity, rule);
            var expiry = TimeSpan.FromSeconds(2);

            // serial reads and writes on same key
            using (var redLock = await _redLockFactory.CreateLockAsync(counterId, expiry)) // there are also non async Create() methods
            {
                // make sure we got the lock
                if (redLock.IsAcquired)
                {
                    var entry = await _counterStore.GetAsync(counterId, cancellationToken);

                    if (entry.HasValue)
                    {
                        // entry has not expired
                        if (entry.Value.Timestamp + rule.PeriodTimespan.Value >= DateTime.UtcNow)
                        {
                            // increment request count
                            var totalCount = entry.Value.Count + _config.RateIncrementer?.Invoke() ?? 1;

                            // deep copy
                            counter = new RateLimitCounter
                            {
                                Timestamp = entry.Value.Timestamp,
                                Count = totalCount
                            };
                        }
                    }

                    // stores: id (string) - timestamp (datetime) - total_requests (long)
                    await _counterStore.SetAsync(counterId, counter, rule.PeriodTimespan.Value, cancellationToken);
                }
            }
            // the lock is automatically released at the end of the using block

            return counter;
        }
    }

One can implement a distributed lock mechanism on top of Redis, and there are already implementations out there that we could use or at least draw inspiration from (the sketch above assumes RedLock.net's RedLockFactory).

Then we just have to replace the IRateLimitCounterStore implementation with the distributed one (available in the AspNetCoreRateLimit package) and also specify which distributed cache implementation we want to use:

services.AddSingleton<IRateLimitCounterStore, DistributedCacheRateLimitCounterStore>();
services.AddStackExchangeRedisCache(options =>
{
    options.Configuration = "localhost";
    options.InstanceName = "SampleInstance";
});

@liguobao

services.AddDistributedRedisCache(options =>
{
    // e.g. contoso5.redis.cache.windows.net,abortConnect=false,ssl=true,password=...
    options.Configuration = Configuration["IpRateLimitingRedisString"];
    options.InstanceName = "ip_rate_limit_";
});

// needed to store rate limit counters and ip rules
services.AddMemoryCache();

// load general configuration from appsettings.json
services.Configure<IpRateLimitOptions>(Configuration.GetSection("IpRateLimiting"));

// load ip rules from appsettings.json
services.Configure<IpRateLimitPolicies>(Configuration.GetSection("IpRateLimitPolicies"));

// inject counter and rules stores
services.AddSingleton<IIpPolicyStore, DistributedCacheIpPolicyStore>();
services.AddSingleton<IRateLimitCounterStore, DistributedCacheRateLimitCounterStore>();

@cristipufu does this look right?

It still doesn't work, though; the distributed cache concurrency problem is still there.

@cristipufu
Collaborator

There's also another possible solution (again overriding the ClientRateLimitProcessor.ProcessRequestAsync() method): one could implement the pattern described here: https://redis.io/commands/INCR#pattern-rate-limiter
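For reference, a rough sketch of that pattern using StackExchange.Redis (the key format, the 10-requests-per-second limit, and the class name are assumptions for illustration, not part of this library):

using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public class RedisIncrRateLimiter
{
    private readonly IDatabase _db;
    private const int LimitPerSecond = 10; // illustrative limit

    public RedisIncrRateLimiter(IConnectionMultiplexer redis)
    {
        _db = redis.GetDatabase();
    }

    public async Task<bool> IsAllowedAsync(string clientId)
    {
        // one counter key per client per second, as in the Redis docs pattern
        var key = $"rl:{clientId}:{DateTimeOffset.UtcNow.ToUnixTimeSeconds()}";

        // INCR is atomic on the server, so concurrent instances never lose updates
        var count = await _db.StringIncrementAsync(key);

        if (count == 1)
        {
            // first request in this second: give the key a short TTL so it cleans itself up
            // note: INCR and EXPIRE are still two separate calls here
            await _db.KeyExpireAsync(key, TimeSpan.FromSeconds(10));
        }

        return count <= LimitPerSecond;
    }
}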

@STRATZ-Ken

Just wanted to throw in some more information: I ran this plugin on an API that was getting about 1 million requests per 4 hours. After getting complaints about sluggishness during peak times, I removed this limiter and latency went down from about 0.6 sec per request to 0.15 sec per request. I used a Redis backend for the cache as well. So there is definitely something that needs to be done to help performance under high load. Otherwise, for small projects this plugin totally rocks.

@cristipufu
Collaborator

cristipufu commented Sep 13, 2019

@STRATZ-Ken were you using the 3+ version? (the previous versions had a simple lock, which meant that each request would have throttled all the others)

The issue probably is this block of code:

using (await AsyncLock.WriterLockAsync(counterId).ConfigureAwait(false))
{
    var entry = await _counterStore.GetAsync(counterId, cancellationToken);

    if (entry.HasValue)
    {
        // entry has not expired
        if (entry.Value.Timestamp + rule.PeriodTimespan.Value >= DateTime.UtcNow)
        {
            // increment request count
            var totalCount = entry.Value.Count + _config.RateIncrementer?.Invoke() ?? 1;

            // deep copy
            counter = new RateLimitCounter
            {
                Timestamp = entry.Value.Timestamp,
                Count = totalCount
            };
        }
    }

    // stores: id (string) - timestamp (datetime) - total_requests (long)
    await _counterStore.SetAsync(counterId, counter, rule.PeriodTimespan.Value, cancellationToken);
}

As you can see, this holds a lock per IP/client/etc. (depending on the plugin's configuration) and makes 2 Redis calls inside it. So if one implemented this operation atomically in Redis, replacing the lock plus the 2 calls with a single Redis call, I think performance would improve a lot.

@liguobao

@stefanprodan

cristipufu pinned this issue on Sep 15, 2019
cristipufu changed the title from "Distribute cache concurrent problem and why not use Token Bucket algorithm?" to "Distribute cache concurrency issue" on Sep 15, 2019
@STRATZ-Ken

Any update on this?

@remoba

remoba commented Jan 30, 2020

Any plans to fix this?

1 similar comment
@Daniel-Glucksman

Any plans to fix this?

@simonhaines

The IDistributedCache service used to implement the distributed rate-limiting counter does not provide enough concurrency guarantees to resolve this race condition, and it likely never will. An atomic increment operation is needed, such as Redis' INCR command.

We resolved this issue by refactoring the IRateLimitCounterStore and backing it with a Redis cache, see the repo here. This also reduces per-request latency by eliminating the read/update/write operations that are the core of this issue (see here).

For each rate limit rule, time is divided into intervals that are the length of the rule's period. Requests are resolved to an interval, and the interval's counter is incremented. This is a slight change in semantics to the original implementation, but works in our use-case.

This approach requires a dependency on StackExchange.Redis to access the INCR command and key expiry, and the IConnectionMultiplexer service needs to be injected at startup.
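As a rough sketch of that approach (assuming StackExchange.Redis, as stated above; the key format and helper names are illustrative, not the fork's actual code):

using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class IntervalCounterStore
{
    public static async Task<long> IncrementAsync(IDatabase db, string counterId, TimeSpan period)
    {
        // resolve the current request to an interval whose length is the rule's period
        var intervalIndex = DateTimeOffset.UtcNow.ToUnixTimeSeconds() / (long)period.TotalSeconds;
        var key = $"{counterId}:{intervalIndex}";

        // atomic server-side increment, no read/update/write round trip
        var count = await db.StringIncrementAsync(key);

        // only the first increment has to set the expiry; setting it again is harmless
        // because every instance computes the same interval and the same expiry
        if (count == 1)
        {
            await db.KeyExpireAsync(key, period);
        }

        return count;
    }
}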

@olegkv

olegkv commented Mar 9, 2020

We resolved this issue by refactoring the IRateLimitCounterStore and backing it with a Redis cache
This approach requires a dependency on StackExchange.Redis to access the INCR command and key expiry, and the IConnectionMultiplexer service needs to be injected at startup.

As far as I can see, there is still a race condition there, because the call to the Redis INCR command and the call to set the expiration are separate calls, and the counter may change in between. It looks like the only way to make this work correctly is to make it atomic using a Lua script, exactly as explained in the Redis documentation for the INCR command.

@simonhaines

As far as I see, there is still race condition there, because call to Redis INCR command and a call to set expiration are different calls and counter may change in between.

There is no race condition: it does not matter what the counter value is when the key expiry is set. The expiry only needs to be set once (but can be set any number of times), and the check for a counter value of 1 ensures this.

This is why the algorithm divides time into intervals of the rule's period. The expiry time will be the same for each interval, and any client may set it, as long as it is set at least once.

@olegkv

olegkv commented Mar 9, 2020

No race condition

Whatever is non-atomic (and your code is non-atomic) will produce race conditions.
All of these cases can happen:

- your code gets frozen by the garbage collector after creating the counter and before setting the expiration, so the counter expires too late
- your code crashes after creating the counter and before setting the expiration (because of an OS shutdown or some other reason), so the counter never expires
- a network issue prevents setting the expiration after creating the counter, so the counter never expires or expires too late

...and so on.

That's why the Redis documentation says it needs to be done with a Lua script, which is atomic; the Redis folks wrote it that way for a reason.
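For reference, a hedged sketch of that atomic variant (assuming StackExchange.Redis; the script mirrors the example in the Redis INCR documentation, and the class and method names are made up for illustration):

using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class AtomicRateCounter
{
    // INCR and PEXPIRE run inside one script, i.e. one atomic server-side call,
    // so a crash or pause between them can no longer leave a counter without a TTL
    private const string Script = @"
        local current = redis.call('INCR', KEYS[1])
        if current == 1 then
            redis.call('PEXPIRE', KEYS[1], ARGV[1])
        end
        return current";

    public static async Task<long> IncrementAsync(IDatabase db, string key, TimeSpan expiry)
    {
        var result = await db.ScriptEvaluateAsync(
            Script,
            new RedisKey[] { key },
            new RedisValue[] { (long)expiry.TotalMilliseconds });

        return (long)result;
    }
}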

@simonhaines

@olegkv the right place to discuss the fork is over at its repo, so I've created an issue there for this.

@nick-cromwell
Contributor

I’ve only recently found this repo and was wondering if there are known instances of using this in a production environment and how successful it has been.

Presumably the side effect of this issue is an incorrect request count. I think we can get by with that within a certain range.

Are there any other consequences of this that we should know?

@simonhaines

@nick-cromwell I maintain a fork whose master branch is used in production. It has a known issue in a distributed deployment where one node may set a Redis key and then leave the network (e.g. crash) before it sets the key to expire. In this case, access to the rate-limited resource is eventually blocked. This is an extremely rare case, but is serious enough that it has been addressed in the atomic-lua branch, which is being tested and has not yet been used in production.

@nick-cromwell
Contributor

Thanks for that, @simonhaines. I'll keep a lookout for a pull request from the atomic-lua branch. I'm not familiar with Lua scripting, but based on your discussion here with @olegkv it seems that will resolve this issue once and for all.

@nick-cromwell
Contributor

@simonhaines Are you good to make a PR from your atomic-lua branch? I think that would close out this issue, correct?

@simonhaines

@nick-cromwell I don't think so. The fork changes the IRateLimitCounterStore interface in an incompatible way. Merging would require a major version number increase. It also makes the StackExchange Redis client a hard dependency. I'm not sure which way @stefanprodan wants to take this library, but these are big changes.

@tomaustin700

Hey @stefanprodan, what do you think of the changes proposed by @simonhaines? I want to use this package but am holding off until the concurrency issues are sorted.

@validide

validide commented Apr 7, 2021

I was reading a recent article about how GitHub solved their rate limiting issues (link), with the actual Lua script at the end.

Would anyone consider this a valid approach that would be worth looking into and possibly creating a pull request for?

@simonhaines

@validide there is a pull request #180 that incorporates distributed rate limiting, and it looks like a good way forward. However, the proposed changes require a major version increase, and it looks like the project structure is going to be refactored, with the distributed rate limiting split out into a separate project.

@ByronAP

ByronAP commented Apr 20, 2021

Wish it would get merged and the version bumped already, though... it's been sitting for a while.

@ozonni

ozonni commented May 21, 2021

This should have high priority, since this issue is a blocker for high-load projects...

@validide

I realize this is an open-source project and everyone is contributing on their own time, so is there any way I could help?

@nick-cromwell
Contributor

Thanks @validide. We're just waiting on #180 to be merged and the major version to be increased. Once #180 is merged I can open a PR for the StackExchangeRedisProcessingStrategy which resolves this.

@cristipufu
Collaborator

We've pushed a new version, 4.0.0 (https://www.nuget.org/packages/AspNetCoreRateLimit/4.0.0), which is compatible with the Redis extension available on NuGet here: https://www.nuget.org/packages/AspNetCoreRateLimit.Redis/

@bkromhout

@cristipufu, sorry to revive this, but wanted to ask something. I've read through all of this thread, and the related PRs, and the general documentation, but something is still unclear to me: is the concurrency issue identified originally here still a problem if you're using SQL Server for the distributed store since it still uses the AsyncKeyLockProcessingStrategy? Or was this a problem unique to Redis?

If it is still an issue for SQL Server cases, is the expectation here basically that as you scale up, you eventually need to switch over from SQL Server to Redis (and the new implementation that uses the direct Redis calls)?

Apologies again for reviving this just to ask, but I'm trying to understand the risk factor here of using SQL Server so that I can make a recommendation to my team.

@cristipufu
Collaborator

@bkromhout

is the concurrency issue identified originally here still a problem if you're using SQL Server for the distributed store since it still uses the AsyncKeyLockProcessingStrategy

Yes

If it is still an issue for SQL Server cases, is the expectation here basically that as you scale up, you eventually need to switch over from SQL Server to Redis (and the new implementation that uses the direct Redis calls)?

I think you can follow the Redis processing strategy example and implement something similar in SQL.
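To make that suggestion concrete, here is one possible shape for an atomic SQL Server increment (a hedged sketch only: the dbo.RateLimitCounter table, its columns, and the class and method names are assumptions, not part of this library; it also assumes the counter key already encodes the current interval, as in the Redis examples above):

using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

public static class SqlAtomicCounter
{
    // single-statement upsert-and-increment; HOLDLOCK keeps the MERGE atomic
    private const string Sql = @"
        MERGE dbo.RateLimitCounter WITH (HOLDLOCK) AS t
        USING (VALUES (@Key)) AS s (CounterKey)
            ON t.CounterKey = s.CounterKey
        WHEN MATCHED THEN
            UPDATE SET [Count] = t.[Count] + 1
        WHEN NOT MATCHED THEN
            INSERT (CounterKey, [Count], ExpiresAt)
            VALUES (@Key, 1, @ExpiresAt)
        OUTPUT inserted.[Count];";

    public static async Task<int> IncrementAsync(string connectionString, string key, TimeSpan period)
    {
        // like the Redis interval approach, an existing row only ever needs its
        // count incremented; expired rows are cleaned up separately (e.g. a background job)
        using var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();

        using var command = new SqlCommand(Sql, connection);
        command.Parameters.AddWithValue("@Key", key);
        command.Parameters.AddWithValue("@ExpiresAt", DateTime.UtcNow.Add(period));

        return (int)await command.ExecuteScalarAsync();
    }
}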

@cristipufu
Collaborator

cristipufu commented Oct 31, 2022

@jamesfoster

https://github.com/stefanprodan/AspNetCoreRateLimit/wiki/Using-Redis-as-a-distributed-counter-store#solution-2---in-memory-cache-fallback

Can the docs be updated to provide an example of the in-memory fallback for the "New Version"? Thanks.
