Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

draft: measure e2e latency for amazon q requests #5566

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jpinkney-aws
Copy link
Contributor

@jpinkney-aws jpinkney-aws commented Sep 10, 2024

Problem

This is just a draft of what we could potentially do to measure e2e latency in amazon q chat message requests. Feel free to leave comments on the PR. This isn't trying to directly get to a solution, its trying to figure what works and what doesn't.

Solution

Since amazon q chat is fire and forget, this solution manually instruments all the checkpoints that happen in the code through a trace id. This trace id lets us know which message we are currently handling throughout the execution flow.

In the future I'd like to get to a point where we don't need to do as much manual work. I still think we can automate listening for incoming events (like the dropdown details show) but it needs a bit more thought

Previous PR description

Problem

This is just a draft of what we could potentially do to measure e2e latency in amazon q chat message requests. Feel free to leave comments on the PR

Main code level problems:

  • It's hard to connect a tab id to a conversation id. Tab ID's exist pretty much to send/receive messages but they don't really live along side conversation id's. Conversation id's are exclusively related to the session. The only place they meetup is in their chat session storages
  • Do we want to track individual events like webviewToDispatcher as metrics?
    • Though there wouldn't really be a way to "listen" on these events except for if we manually allow-listed metrics since theres no reason to attach a tabID to that metric
  • How do we listen to events emitted from telemetry so that we can find them later?
    • One solution is proposed in this PR where we listen to events being emitted and then check if we want to track them
  • Could we connect these with a parent/child mechanism like we want to introduce eventually?
    • We kind of do this because the "listener" event is the parent and the things it finds are the children. The problem is we can really have only one "listener" for these Q events, since we don't know what tabs are related to what conversation ID's until much later (after the first request). This is why we build up the map of conversation metrics and tab metrics and join them together before we emit

Solution

This PR does a couple things:

  • Moves up the chat session storage for codewhisperer chat. This allows us to eventually connect conversation ids to tabs
  • Introduces a chat storage facades that allows us to call getConversationId on any tab type. This is only implemented for cwc right now
  • Introduces a listener pattern that allows telemetry events to "listen" to metrics that are being emitted
  • emits a sample events that can you can trace from "pressing enter" to "receives message in ui"

Sample event when sending a request in amazon q chat:

2024-09-11 10:11:05.712 [debug] telemetry: chat_roundTrip {
  Metadata: {
    duration: '14585',
    name: 'onChatPrompt',
    child_metrics: 'initialRequestTime,webviewToDispatcher,dispatcherToFeature,amazonq_enterFocusConversation,amazonq_startConversation,amazonq_addMessage,finalRequestTime',
    child_metric_durations: '{"webviewToDispatcher":1,"dispatcherToFeature":3,"amazonq_enterFocusConversation":1542,"amazonq_startConversation":2,"amazonq_addMessage":13003,"finalRequestTime":34}',
    result: 'Succeeded',
    awsAccount: 'not-set',
    awsRegion: 'us-east-1'
  },
  Value: 1,
  Unit: 'None',
  Passive: false
}

You can also get events from the right click context menu items as well (explain, refactor, etc)

2024-09-11 10:11:34.384 [debug] telemetry: chat_roundTrip {
  Metadata: {
    duration: '12885',
    name: 'aws.amazonq.explainCode',
    child_metrics: 'initialRequestTime,webviewToDispatcher,amazonq_enterFocusConversation,amazonq_addMessage,finalRequestTime',
    child_metric_durations: '{"webviewToDispatcher":17,"amazonq_enterFocusConversation":1263,"amazonq_addMessage":11572,"finalRequestTime":33}',
    result: 'Succeeded',
    awsAccount: 'not-set',
    awsRegion: 'us-east-1'
  },
  Value: 1,
  Unit: 'None',
  Passive: false
}

License: I confirm that my contribution is made under the terms of the Apache 2.0 license.

Copy link

This pull request modifies files in src/ but no tests were added/updated. Confirm whether tests should be added or ensure the PR description explains why tests are not required.


// A common interface over existing chat session storages
// TODO align all the chatsessionstorages so we don't have to do this :/
export class ChatSessionStorageFacade {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

packages/core/src/amazonq/util/storageFacade.ts Outdated Show resolved Hide resolved
packages/core/src/amazonq/util/telemetryHelper.ts Outdated Show resolved Hide resolved
packages/core/src/amazonq/util/telemetryHelper.ts Outdated Show resolved Hide resolved
packages/core/src/amazonq/util/telemetryHelper.ts Outdated Show resolved Hide resolved
packages/core/src/amazonq/util/telemetryHelper.ts Outdated Show resolved Hide resolved
const allMetrics = [...tabUIMetrics, ...conversationMetrics].sort((a, b) => a[1] - b[1])

// Get the total duration by subtracting the last (finalRequestTime) and the first duration (initialRequestTime)
const totalDuration = allMetrics[allMetrics.length - 1][1] - allMetrics[0][1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way we can get a named type to get the time instead of [1]?

packages/core/src/shared/telemetry/spans.ts Outdated Show resolved Hide resolved
@jpinkney-aws
Copy link
Contributor Author

I think another way we could possible go is to just instrument everything via a traceId which is more traditional for distributed tracing (rather than connecting tabIds to conversation Ids at the end of a flow like we do here). I think this might be more scalable for codewhisperer inline as well.

on chat prompt we could create a trace id, plumb that through the entire system and store various different timings in the RecordMap.

This would remove:

  • listening on telemetry events
  • remove the chatStorageFacade

we would keep:

  • record map
  • start listening/stop listening

end result:

  • we still get the chat message duration (but we don't necessarily correlate telemetry events with chat_roundTrip (though we totally can))
  • less complicated to understand/the flow is easily observable e2e
  • predictable metrics appearing in the end result

Problem:
- We can't measure e2e how long a chat request takes

Solution:
- Take an initial approach, see what works and what doesn't

Original approach:
- The original approach was to "listen" to all events coming out of telemetry. I think this is something we probably want to do eventually because it means we won't need to manually instrument flows ourselves but I think the integration needed to be more refined

Current approach:
- Plumb the trace id through to the webview -> vscode -> vscode. This allows us to know exactly what is happening to a message and how long everything took to get there
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants