
.NET 7 Minimal API running under the provider.al2 runtime times out on quick-fire successive requests #1543

Open
Simonl9l opened this issue Jul 7, 2023 · 16 comments
Labels
annotations bug This issue is a bug. needs-reproduction This issue needs reproduction. p2 This is a standard priority issue

Comments

@Simonl9l

Simonl9l commented Jul 7, 2023

Describe the bug

Related to discussion #1424, but we figure an issue might garner more attention, since the prior responses there were stale and we now have a reproducible sample.

In summary, we have a .NET 7 Minimal API setup (the serverless.AspNetCoreMinimalAPI template switched to .NET 7) and use Amazon.Lambda.Tools to build the package via the Docker image process with the -ucfb True option.

Our sense is that the .NET 7 host/middleware, or something about its interaction with the provider.al2 runtime that changed with the newer .NET version, is now possibly incompatible with the AddAWSLambdaHosting middleware.
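
For reference, a minimal sketch of the hosting setup described above (hedged: the event source and the inline calculator route are our assumptions from the template; our real sample maps the calculator routes via a controller):

using Amazon.Lambda.AspNetCoreServer.Hosting;

var builder = WebApplication.CreateBuilder(args);

// Swaps Kestrel for the Lambda runtime pipeline when running inside Lambda;
// the app runs as a normal ASP.NET Core app locally.
builder.Services.AddAWSLambdaHosting(LambdaEventSource.HttpApi);

var app = builder.Build();

// The default route: continuous quick-fire requests succeed here.
app.MapGet("/", () => "Welcome to running ASP.NET Core Minimal API on AWS Lambda");

// Routes like this initially succeed, then time out after ~3-5 rapid requests.
app.MapGet("/calculator/add/{x}/{y}", (int x, int y) => x + y);

app.Run();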

Expected Behavior

Successive and continuous requests should complete successfully.

Current Behavior

Once deployed we find:

  • The default route / can be quick-fired continuously without error.
  • Other requests, such as /calculator/add/1/2, initially succeed but then fail after ~3-5 rapid requests with a timeout, as set on the Lambda deployment (30 seconds in our case).
  • If the invocations are spaced out (at ~30-second intervals), successive requests continue to succeed.

Reproduction Steps

We have a GitHub-workflow-enabled, cloneable/deployable repo here, with instructions here.

We have used the default template samples so we can rule out any of our own code introducing a side effect; whilst we had added some code of our own, it is commented out.

Possible Solution

We need to migrate to .NET 7 so that we can use the latest C# features.

Additional Information/Context

We note from comments in RuntimeSupport.Client.InvocationResponse that there is special handling of the OutputStream for successive invocations; we are unsure of the implications in this scenario and assume it is an optimization.
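
For context, the shape of that handling as we read it (a paraphrased sketch, not the verbatim library source):

using System.IO;

// Paraphrased sketch of Amazon.Lambda.RuntimeSupport's InvocationResponse:
// a handler can opt to reuse one response stream across invocations
// instead of allocating a new one each time.
public class InvocationResponse
{
    // The stream returned as the result of the invocation.
    public Stream OutputStream { get; set; }

    // When false, the bootstrap does not dispose the stream after reading it,
    // so the same instance can be rewound and reused on the next invocation;
    // this is the optimization referred to above.
    public bool DisposeOutputStream { get; private set; } = true;

    public InvocationResponse(Stream outputStream, bool disposeOutputStream = true)
    {
        OutputStream = outputStream;
        DisposeOutputStream = disposeOutputStream;
    }
}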

Of note:

  • The / (default) route request in the Map does not leverage any default middleware, whereas the other routes, which map the request onto the controller, do.
  • Whilst .NET 7.0 is not explicitly supported for Minimal API, given the lambda.NativeAOT template we assume basic compatibility with the provider.al2 custom runtime.
  • We have set <PublishAot>false</PublishAot> in the csproj to eliminate any concern that this may be an issue.

AWS .NET SDK and/or Package version used

Amazon.Lambda.AspNetCoreServer.Hosting - 1.6.0
AWSSDK.Extensions.NETCore.Setup - 3.7.7
Amazon.Lambda.RuntimeSupport - 1.8.8
Amazon.Lambda.Core" Version - 2.1.0
Amazon.Lambda.Serialization.SystemTextJson - 2.3.1

Targeted .NET Platform

.NET 7

Operating System and version

AmazonLinux

@Simonl9l Simonl9l added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jul 7, 2023
@ashishdhingra
Contributor

Needs review with @normj, related discussion #1424.

@Simonl9l
Author

Simonl9l commented Jul 10, 2023

@ashishdhingra & @normj - Thanks - to confirm you were able to deploy and reproduce?

Not sure this is related to annotations, per the label?

@normj
Member

normj commented Jul 11, 2023

@Simonl9l I cloned your repo, and although I didn't run it through the GitHub Actions, I ran the same dotnet lambda package and dotnet lambda deploy-serverless commands and had no issues. Can you add the log output you are seeing from those commands in case I'm missing something?

Is there anything in the CloudWatch logs when you get the timeout?

@Simonl9l
Author

Simonl9l commented Jul 11, 2023

@normj thanks for taking a look!
image

You'll note that the actual Minimal API response succeeded per the top line of the log, but you will see the timeout as highlighted.

Our assumption is that the runtime is pulling the rug out for some reason on the basis of a timeout, but this is somehow erroneous, as the full elapsed time from the first call is only ~5 seconds.

Logging shows a 30-second gap here, but there is no reason for the code to stall, as it's exactly the same code that ran successfully on the two prior iterations.
image

Here is the full CloudWatch log of the three successive executions to where the third fails.
log-events-viewer-result-3.csv; line 370 of the file lines up with the screenshot.

Also, to confirm: are you using /calculator/add/1/2 or a similar route, not the / route?

@ashishdhingra ashishdhingra added needs-reproduction This issue needs reproduction. and removed needs-triage This issue or PR still needs to be triaged. labels Jul 11, 2023
@Simonl9l
Author

Simonl9l commented Jul 14, 2023

@normj & @ashishdhingra - sorry to bug, is there anything we can do to assist the reproduction effort?

I believe I sent Norm a direct email offline if we want to connect/screen share etc.

@Simonl9l
Author

@normj & @ashishdhingra

It's unfortunate that you and the team have not been able to recreate this issue, as far as we are aware. We do, though, have more information.

Our hypothesis is that there is something wrong with the raw AL2 image that only manifests when using the underpinnings of AddAWSLambdaHosting, which regular raw or Annotations-based Lambdas do not exercise.
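
To make that hypothesis concrete, a conceptual sketch of what we understand AddAWSLambdaHosting to do (simplified; the environment-variable check and the LambdaServer registration are our reading of the package, not its actual source):

using System;
using Microsoft.AspNetCore.Hosting.Server;
using Microsoft.Extensions.DependencyInjection;

public static class LambdaHostingSketch
{
    public static IServiceCollection AddAWSLambdaHostingSketch(this IServiceCollection services)
    {
        // Outside Lambda the call is a no-op, so the app runs on Kestrel locally.
        if (Environment.GetEnvironmentVariable("AWS_LAMBDA_FUNCTION_NAME") == null)
            return services;

        // Inside Lambda, the IServer is replaced so requests arrive from the
        // Lambda runtime API (via Amazon.Lambda.RuntimeSupport) rather than a
        // listening socket. It is this path, layered on provider.al2, that our
        // hypothesis points at; raw and Annotations-based Lambdas never take it.
        // Shown as a comment because the package's LambdaServer type is internal:
        // services.AddSingleton<IServer, LambdaServer>();
        return services;
    }
}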

We get to this because we can build a self-contained executable in both .NET 7 (and .NET 8 preview 6, just to see) and deploy them to a .NET 6 runtime Lambda, and both work as expected with no timeouts; this works on the basis that the executable is named bootstrap and, more importantly, that the bundled self-contained runtime overrides the runtime in the .NET 6 image.
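
For anyone following along, a sketch of the publish settings this approach implies (these are standard MSBuild properties, but the exact combination here is our assumption, not the repo's verbatim csproj):

<PropertyGroup>
  <TargetFramework>net7.0</TargetFramework>
  <!-- The custom runtime contract: Lambda invokes an executable named bootstrap. -->
  <AssemblyName>bootstrap</AssemblyName>
  <!-- Bundle the .NET runtime so the image's own runtime version is bypassed. -->
  <SelfContained>true</SelfContained>
  <RuntimeIdentifier>linux-x64</RuntimeIdentifier>
  <PublishAot>false</PublishAot>
</PropertyGroup>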

This means we have a way forward and can avoid the AL2 runtime entirely. It had not occurred to us before that one could do this.

However, if we switch the Lambda runtime to the AL2 runtime using the same packages, the function runs, but we get exactly the same failure observations as we did before.

Of note, if we build a self-contained .NET 6 package as above and deploy that to the AL2 runtime, it also fails as observed.

Our assumption is that when .NET 8 goes LTS, the existing .NET Lambda runtime will be updated to support it, and what we have identified will likely impact few people, if anyone.

However, as Microsoft supports AOT builds of the .NET runtime that, say, underpin Minimal API, this may surface for those trying to run one in the current AL2 image. Perhaps we'll even see Amazon Linux 2023 by then, right?

We'll also probably be fine in the future running Minimal API AOT images (when supported) in the current dotnet runtime too.

Is there anything y'all know about, in running any self-contained or AOT'd packages in the dotnet runtime, that we may be missing and that might present a future problem here?

@ashishdhingra ashishdhingra added the p2 This is a standard priority issue label Oct 16, 2023
@dguisinger

dguisinger commented Feb 7, 2024

@normj & @ashishdhingra

I appear to be having this issue as well. I have been spinning my wheels for 3 weeks on multiple .NET projects; no matter what I do, my .NET 8 Lambda functions time out (30 seconds at API Gateway, 60 seconds at the end of "execution") for the simplest of things.

I am not using AoT.
I am using AddAWSLambdaHosting(LambdaEventSource.RestApi) with API Controllers, not minimal API.
We have Swagger running in our projects; the Swagger UI responds fine most of the time, but we have a 0% success rate on WebAPI calls.

This is making Lambda a completely unusable platform for us. I had mentioned it in the Tracking .NET 8 support issue nearly a month ago.

Screenshot 2024-02-07 at 4 11 58 PM

@normj
Member

normj commented Feb 7, 2024

@dguisinger Is there a sample project you can share that demonstrates this behavior?

@dguisinger

@dguisinger Is there a sample project you can share that demonstrates this behavior?

I'll throw together a sample project tonight

@dguisinger

@normj I am at a complete loss. I created a new project that has identical Program.cs code, uses MediatR for CQRS, and contains DynamoDB transaction logic in a separate assembly... and it works just as expected.

I can't find any differences between this project and the two others on which I have spent hundreds of hours trying to figure out why every single call times out when run in Lambda (but works fine on localhost).

The biggest differences are in the CI/CD pipeline: my two other projects use CodeBuild/CodePipeline in a separate DevOps account to build and deploy the code, but use identical CDK code for building the Lambda image and deploying it. The only difference (which should not make a difference, in theory) is that for this project I'm running the CDK deploy and Docker build on a Windows machine, whereas with CodeBuild it would be on Linux.

As mentioned in my previous posts, the more logging I added, the more random the position where updates in the log stopped... it even improved greatly when I removed all logging from the app, but that didn't solve the problem.

I have absolutely no idea how to even tackle this at this point. I can't keep throwing hundreds of hours into code that performs perfectly under Kestrel and then doesn't execute in Lambda... and there appears to be no way to get the Lambda runtime to give useful debugging information.

@Simonl9l
Author

Simonl9l commented Feb 9, 2024

@dguisinger we found that building a single executable (either AoT or bundled runtime) seemed to solve whatever we were doing to cause the timeout symptoms.

It's unclear if your symptoms are the same as what we were seeing.

Given our example, the AWS team was unable to recreate it as far as we are aware, even though we could recreate it readily with our sample; that was also a default template Lambda, not using other assemblies as you indicate.

We'd surmised that something was not quite 100% right in the relationship between the Lambda runtime and the dotnet runtime, but we can't evidence it.

We've not had the wherewithal to experiment further, or really get into any preview .NET 8.0 or AL2023 support testing.

Our hope was that this would be solved with the AL2023 runtime, but we fear, given your experiences, that it has somehow carried over.

It's also interesting that you are using CDK, but it's unclear how that might be a factor.

@dguisinger

@Simonl9l I had started down the path of the AoT/bundled runtime and never got it working, but I didn't spend too much time on it (it's so painful to tweak one line at a time, push it back into the CI/CD pipeline, and wait for it to do its thing).

At the moment, Swagger pages respond 100% of the time... and occasionally I can get the WebAPI to respond (instantly), and when it does I can pound the API relentlessly and it stays working for hundreds of calls, until it idles for a few minutes and then it's back to timeouts. When I check the CloudWatch logs, it's the exact same stream for the instance that was timing out initially that suddenly starts responding, so it's not spinning up a new instance to handle the request. This leads me to guess it's a problem in the AspNet hosting layer, which last time I checked is still targeting 6.0.

As I had mentioned in the .NET 8 tracking issue, when I changed the logging level in .NET to a higher level, even Swagger stopped responding. I had also found that adding more inline logging, to see how far it was making it into my code, would actually sometimes change the point at which my code appeared to stop running.

@dguisinger

@Simonl9l @normj Interestingly, I tried publish ... --self-contained today on my company's APIs that are having this issue; it changed the behavior to make it work more often, but it still has the timeouts.

However, it allowed me to explore the pattern in the logs.
Based on the CloudWatch streams, Lambda spun up two instances. One timed out repeatedly; the other timed out the first time and then succeeded every time after. What got my attention is that, based on the logs, even though it's in the same log stream, it appears to be cold-starting over and over... Each failed attempt had a bunch of logging put out by AspNet's data protection and the AWS SDK credential searches, whereas on the log stream of the instance that was working, that logging only showed up on the initial hit (and timeout) and didn't appear any other time.

This would seem to indicate the container keeps restarting the code on every request if it has been timing out. Could the runtime be terminating on a failure, causing execution to stop, with no response back to Lambda itself telling it execution failed, resulting in Lambda believing it timed out?

Screenshot 2024-02-09 at 2 04 18 PM
Screenshot 2024-02-09 at 2 04 44 PM

@dguisinger

dguisinger commented Feb 9, 2024

Something else I've found: a single line item in the log from the AWS SDK seems to indicate whether or not the request is going to work.
Screenshot 2024-02-09 at 2 52 32 PM
Screenshot 2024-02-09 at 2 56 25 PM

Every time it says "[Information] AWSSDK: Found credentials using the AWS SDK's default credential search", all requests going forward work; if that line doesn't appear, responses time out.

I see the line two lines below always says it found credentials... so this could be what I've previously reported, where logging is random when things aren't working; it could be that the runtime is crashing before the log writes out to the stream.

I've tried wrapping my entire Program.cs in a try/catch and printing the exception to console, but there appears to be no exception within the application.
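
For reference, the shape of that wrapper (illustrative only; the real Program.cs is much longer):

using Amazon.Lambda.AspNetCoreServer.Hosting;

try
{
    var builder = WebApplication.CreateBuilder(args);
    builder.Services.AddControllers();
    builder.Services.AddAWSLambdaHosting(LambdaEventSource.RestApi);

    var app = builder.Build();
    app.MapControllers();
    app.Run();
}
catch (Exception ex)
{
    // Console output should reach CloudWatch, but no exception is ever printed here.
    Console.WriteLine($"Unhandled startup exception: {ex}");
    throw;
}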

Is there a way to get the .NET 8 lambda runtime to output verbose details upon a crash?

@normj
Member

normj commented Feb 12, 2024

@dguisinger I'm assuming this is an ASP.NET Core based Lambda function. Would it be possible to share at least a redacted version of the startup code for the function, like the Program.cs file?

@dguisinger

@normj sure, here you go

I did try putting my sample project into the CI/CD pipeline that the other projects go through, just in case building with CodeBuild was causing the issue, but the sample still worked. I also commented out all of the controllers/actions in one of my broken projects to get down to just one controller/action, and it was still broken, so the sample being simplified in that regard wasn't the issue.

The only difference I can think of at this point is that the broken projects have several more assemblies (for the DTOs, database repository/entities, etc.) and in-house NuGet packages for cross-API shared code that the sample project didn't have.

// Usings inferred from the APIs used below; the original paste omitted them.
// (Project-specific namespaces for DatabaseRepository, DatabaseConfiguration,
// and Database.AutoMapper are not shown.)
using System.Reflection;
using Amazon.DynamoDBv2;
using Amazon.XRay.Recorder.Handlers.AwsSdk;
using AutoMapper;
using Microsoft.AspNetCore.Authentication.JwtBearer;
using Microsoft.IdentityModel.Tokens;
using Microsoft.OpenApi.Models;

var builder = WebApplication.CreateBuilder(args);
builder.Logging.ClearProviders();
builder.Logging.AddConsole();
builder.Logging.AddLambdaLogger();


// Add services to the container.
builder.Services.AddMediatR(cfg => cfg.RegisterServicesFromAssemblies(Assembly.GetExecutingAssembly()));

builder.Services.AddControllers();
builder.Services.AddAWSLambdaHosting(LambdaEventSource.RestApi);

builder.Services.AddEndpointsApiExplorer();
builder.Services.AddSwaggerGen(cfg =>
{
    cfg.EnableAnnotations();
    cfg.AddSecurityDefinition("Bearer", new OpenApiSecurityScheme
    {
        In = ParameterLocation.Header,
        Description = "Please enter token",
        Name = "Authorization",
        Type = SecuritySchemeType.Http,
        BearerFormat = "JWT",
        Scheme = "bearer"
    });

    cfg.AddSecurityRequirement(new OpenApiSecurityRequirement
    {
        {
            new OpenApiSecurityScheme
            {
                Reference = new OpenApiReference
                {
                    Type = ReferenceType.SecurityScheme,
                    Id = "Bearer"
                }
            },
            new string[] { }
        }
    });
});

builder.Services.AddAWSService<IAmazonDynamoDB>();

var mapperConfig = new MapperConfiguration(mc =>
{
    mc.AddProfile(new Database.AutoMapper.Profile());
});
var mapper = mapperConfig.CreateMapper();
builder.Services.AddSingleton<IMapper>(mapper);
builder.Services.AddTransient<DatabaseRepository>();
builder.Services.AddSingleton<DatabaseConfiguration>(new DatabaseConfiguration(
    Environment.GetEnvironmentVariable("idempotency_table_name"),
    Environment.GetEnvironmentVariable("table_name")
));
builder.Services.AddCognitoIdentity();
builder.Services.AddAuthentication(options =>
{
    options.DefaultAuthenticateScheme = JwtBearerDefaults.AuthenticationScheme;
    options.DefaultChallengeScheme = JwtBearerDefaults.AuthenticationScheme;
})
.AddJwtBearer(options =>
{
    options.Authority = builder.Configuration["Cognito:Authority"];
    options.TokenValidationParameters = new TokenValidationParameters
    {
        ValidateIssuerSigningKey = true,
        ValidateAudience = false
    };
});

var app = builder.Build();

// Configure the HTTP request pipeline.
app.UseSwagger();
app.UseSwaggerUI();

app.UseXRay("ServiceName");
AWSSDKHandler.RegisterXRayForAllServices();

app.UseHttpsRedirection();

app.UseAuthorization();

app.MapControllers();

app.Run();
