Support SDK metrics for go v2 AWS SDK #1744

DanielBauman88 · 2022-06-29T01:15:10Z

Describe the feature

The java feature is documented here.
The functionality is described in this section.

The request is to support the same in the Go SDK, so that it is trivial to get latency, error, and retry metrics for the calls a customer application makes to its AWS dependencies.

Use Case

I want operational metrics (latency, errors, number of calls) for all my dependencies so that I can monitor the performance of my service, dig into problems, and investigate the impact of outages.

Proposed Solution

Implement this functionality as a simple option at SDK creation in the Go SDK v2.

Other Information

No response

Acknowledgements

I may be able to implement this feature request
This feature might incur a breaking change

AWS Go SDK V2 Module Versions Used

This is applicable to all SDKs

Go version used

This should be applicable to all Go versions

jeichenhofer · 2022-10-26T21:46:07Z

I'm also looking to integrate some metrics with the aws-sdk-go-v2 libraries, but I don't want to reinvent the wheel. Hopefully this will become an officially supported feature, but I need a solution in the meantime. Specifically, I want to record a tuple of service name, operation name, AWS region, latency, retry count, and response code for every request sent to AWS. I can envision doing this with the "middleware" API, but the only docs I can find (https://aws.github.io/aws-sdk-go-v2/docs/middleware/) don't do a great job of explaining what information about the request is available. For example, would we need to record the "sent time" in the Initialize step and then check it in the Deserialize step, or is latency already a populated metadata value?

While we wait for a response from the development team about incorporating this as an SDK feature, is there any guidance on implementing something ourselves?

jeichenhofer · 2022-10-27T02:58:13Z

Here's what I could come up with by stepping through the middleware stack code. It seems to work as intended, but I'd be curious to hear from people more familiar with the API.

Of course, this would need to be incorporated with some existing metrics system, replacing the ReportMetrics function with something that feeds into monitoring systems or log files. If there's a chance that the function might return an error, then I'd have to think a bit more about how to handle that.
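As a rough illustration of what "replacing the ReportMetrics function" could look like, here is a minimal sketch of an in-memory sink that aggregates attempts by service, operation, and status code. The Aggregator type and its Report and Snapshot methods are hypothetical names made up for this sketch, not part of the SDK; RequestMetricTuple mirrors the struct in the example further down. Report never returns an error, which sidesteps the error-handling question for the simple case.

```go
package main

import (
	"fmt"
	"sync"
)

// RequestMetricTuple mirrors the struct used in the example in this comment.
type RequestMetricTuple struct {
	ServiceName   string
	OperationName string
	Region        string
	LatencyMS     int64
	ResponseCode  int
}

// metricKey groups attempts by service, operation, and HTTP status code.
type metricKey struct {
	Service, Operation string
	Code               int
}

// metricStats holds the running totals for one key.
type metricStats struct {
	Count          int64
	TotalLatencyMS int64
}

// Aggregator is a hypothetical in-memory sink (not an SDK type).
// It is safe for concurrent use, since SDK calls may run on many goroutines.
type Aggregator struct {
	mu    sync.Mutex
	stats map[metricKey]metricStats
}

// NewAggregator returns an empty Aggregator.
func NewAggregator() *Aggregator {
	return &Aggregator{stats: make(map[metricKey]metricStats)}
}

// Report records one attempt. It cannot fail, so the middleware
// that calls it has no extra error path to handle.
func (a *Aggregator) Report(m *RequestMetricTuple) {
	a.mu.Lock()
	defer a.mu.Unlock()
	k := metricKey{m.ServiceName, m.OperationName, m.ResponseCode}
	s := a.stats[k]
	s.Count++
	s.TotalLatencyMS += m.LatencyMS
	a.stats[k] = s
}

// Snapshot returns the attempt count and mean latency for one key,
// or zeros if nothing has been recorded for it.
func (a *Aggregator) Snapshot(service, operation string, code int) (count int64, meanMS float64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	s, ok := a.stats[metricKey{service, operation, code}]
	if !ok || s.Count == 0 {
		return 0, 0
	}
	return s.Count, float64(s.TotalLatencyMS) / float64(s.Count)
}

func main() {
	agg := NewAggregator()
	agg.Report(&RequestMetricTuple{ServiceName: "S3", OperationName: "ListBuckets", Region: "us-east-1", LatencyMS: 40, ResponseCode: 403})
	agg.Report(&RequestMetricTuple{ServiceName: "S3", OperationName: "ListBuckets", Region: "us-east-1", LatencyMS: 60, ResponseCode: 403})
	count, mean := agg.Snapshot("S3", "ListBuckets", 403)
	fmt.Printf("count=%d mean_ms=%.1f\n", count, mean) // prints "count=2 mean_ms=50.0"
}
```

In the middleware, the call to ReportMetrics(&metrics) would become a call to agg.Report(&metrics) on a shared Aggregator, and a background goroutine could periodically flush snapshots to whatever monitoring system is in use.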

Also, because this is placed "after" all of the other deserializers, it will be executed per retry. That's why I left the "retry on access denied" code in there, to test out what happens when a retried operation fails. The output measures the latency of each individual retry request (by default that's three requests total). I thought replacing the smithymiddleware.After with smithymiddleware.Before would measure latency of the combined three round-trips, but that was not the case. Since I want the behavior to be per-retry, I didn't investigate further.

Here is the working code to test this out. Just replace the AKID and SKEY constants with IAM User credentials with no access, and you'll see the metrics spit out from the three requests with a 403 response code.

package main

import (
	"context"
	"fmt"
	"github.com/aws/aws-sdk-go-v2/aws"
	sdkmiddleware "github.com/aws/aws-sdk-go-v2/aws/middleware"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/credentials"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	smithymiddleware "github.com/aws/smithy-go/middleware"
	"github.com/aws/smithy-go/transport/http"
	"time"
)

const (
	AKID = "akid_here"
	SKEY = "secret_access_key_here"
	SESH = ""
)

type RequestMetricTuple struct {
	ServiceName   string
	OperationName string
	Region        string
	LatencyMS     int64
	ResponseCode  int
}

func ReportMetrics(metrics *RequestMetricTuple) {
	fmt.Printf("metrics: %+v\n", metrics)
}

func reportMetricsMiddleware() smithymiddleware.DeserializeMiddleware {
	reportRequestMetrics := smithymiddleware.DeserializeMiddlewareFunc("ReportRequestMetrics", func(
		ctx context.Context, in smithymiddleware.DeserializeInput, next smithymiddleware.DeserializeHandler,
	) (
		out smithymiddleware.DeserializeOutput, metadata smithymiddleware.Metadata, err error,
	) {
		requestMadeTime := time.Now()
		out, metadata, err = next.HandleDeserialize(ctx, in)
		if err != nil {
			return out, metadata, err
		}

		responseStatusCode := -1
		switch resp := out.RawResponse.(type) {
		case *http.Response:
			responseStatusCode = resp.StatusCode
		}

		latency := time.Since(requestMadeTime)
		metrics := RequestMetricTuple{
			ServiceName:   sdkmiddleware.GetServiceID(ctx),
			OperationName: sdkmiddleware.GetOperationName(ctx),
			Region:        sdkmiddleware.GetRegion(ctx),
			LatencyMS:     latency.Milliseconds(),
			ResponseCode:  responseStatusCode,
		}
		ReportMetrics(&metrics)

		return out, metadata, nil
	})

	return reportRequestMetrics
}

func getDefaultConfig(ctx context.Context) (*aws.Config, error) {
	cfg, err := config.LoadDefaultConfig(
		ctx,
		config.WithCredentialsProvider(credentials.NewStaticCredentialsProvider(AKID, SKEY, SESH)),
		config.WithRetryer(
			func() aws.Retryer {
				return retry.AddWithErrorCodes(retry.NewStandard(), "AccessDenied")
			},
		),
	)
	if err != nil {
		return nil, err
	}

	cfg.APIOptions = append(cfg.APIOptions, func(stack *smithymiddleware.Stack) error {
		return stack.Deserialize.Add(reportMetricsMiddleware(), smithymiddleware.After)
	})

	return &cfg, nil
}

func doStuff(ctx context.Context, client *s3.Client) {
	listBucketResults, err := client.ListBuckets(ctx, &s3.ListBucketsInput{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("num_buckets: %d\n", len(listBucketResults.Buckets))
}

func main() {
	ctx := context.Background()
	cfg, err := getDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}

	client := s3.NewFromConfig(*cfg)

	for {
		doStuff(ctx, client)
		time.Sleep(time.Second * 2)
	}
}

lucix-aws · 2023-11-28T16:59:59Z

related: #1142

We intend to implement this in terms of aws/smithy-go#470; the internal spec for this component of the smithy client reference architecture is being finalized.

Please upvote this issue if this functionality is important to you as an SDK user.

DanielBauman88 added the feature-request (a feature should be added or improved) and needs-triage (this issue or PR still needs to be triaged) labels Jun 29, 2022

vudh1 removed the needs-triage label Jul 19, 2022

RanVaknin added the p2 (standard priority) and l (effort estimation: large) labels Nov 14, 2022

lucix-aws mentioned this issue Nov 28, 2023

Add CSM Support #1142

Closed

RanVaknin added the queued (this issue is on the AWS team's backlog) label Feb 15, 2024

mschfh mentioned this issue May 4, 2024

Migrate to AWS Go SDK v2 brave/go-sync#181

Open

lucix-aws removed the l (effort estimation: large) label May 24, 2024

lucix-aws self-assigned this Jun 10, 2024
