Add opt-in metrics to CloudWatch agent #988
"ImageId": "${aws:ImageId}",
"InstanceId": "${aws:InstanceId}",
"InstanceType": "${aws:InstanceType}",
"AutoScalingGroupName": "${aws:AutoScalingGroupName}"
for purposes of cloudwatch metric billing, each unique dimension combination is considered 1 metric, so including granular dimensions like InstanceId or even InstanceType could really blow up costs for those of us with a large number of ec2 instances.
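To put rough numbers on the billing concern, here is a small sketch of how the metric count scales with the choice of dimensions. The fleet size and ASG count are made up for illustration, and each measurement is counted once per instance or per ASG; the disk measurement in particular can produce more combinations than this, so real counts may be higher.

```shell
#!/usr/bin/env bash
# Hypothetical fleet: these numbers are illustrative, not taken from this PR.
instances=200     # EC2 instances running the CloudWatch agent
asgs=2            # Auto Scaling groups (assumed roughly one per queue)
measurements=2    # mem_used_percent and disk used_percent

# Each unique dimension combination counts as one metric.
per_instance=$((measurements * instances))   # InstanceId is part of the dimensions
per_asg=$((measurements * asgs))             # only AutoScalingGroupName is used

echo "per-instance dimensions: $per_instance metrics"
echo "per-ASG dimensions:      $per_asg metrics"
```

Under these assumptions the per-instance layout grows linearly with fleet size, while the per-ASG layout stays flat as instances are added.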
Good point. How about having just the AutoScalingGroupName dimension then?
Or would there be a way to have the queue as a dimension?
yeah i think that makes sense. not sure if it's possible to add the queue, but i think that'd be nice as well. It should be available as a tag on the instance, BuildkiteQueue.
I looked into this a bit, and one way to add BuildkiteQueue as a dimension would be to modify the cloudwatch config file on boot using jq (similar to docker), as part of the bk-install-elastic-stack.sh script. You should have access to BUILDKITE_QUEUE from there and then just enable and start the service on boot instead of at AMI creation time.
19278b8 to 5756f14
@freewil OK, I've updated the PR after much testing (I lost a bit of time to the issue mentioned in #995). I've made the following changes:
As per the documentation:
After this change I get the following metrics in CloudWatch (ignore the pollution due to various tests):
I reckon we don't need both the ASG name and the Queue, so we could drop one of them. Let me know what you think.
Maybe we can change the namespace too? Or make it more customisable?
cw_config="/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json"
cat <<<"$(jq \
  --arg queue "$BUILDKITE_QUEUE" \
  '. + {
    metrics: {
      metrics_collected: {
        mem: {measurement: ["mem_used_percent"], append_dimensions: {BuildkiteQueue: $queue}},
        disk: {measurement: ["used_percent"], resources: ["*"], append_dimensions: {BuildkiteQueue: $queue}}
      },
      append_dimensions: {
        AutoScalingGroupName: "${aws:AutoScalingGroupName}"
      }
    }
  }' $cw_config)" >$cw_config
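One detail of the + operator in the filter above may be worth keeping in mind: jq's object addition merges top-level keys only, so if the existing config already had a metrics key, the new block would replace it rather than merge into it. The sketch below illustrates the difference between + and the recursive * operator; the two JSON documents are hypothetical stand-ins, not the stack's real config.

```shell
#!/usr/bin/env bash
# Hypothetical existing config that already contains a "metrics" key.
existing='{"agent":{"run_as_user":"root"},"metrics":{"namespace":"Example"}}'
# Hypothetical block to add, similar in shape to the one in this PR.
addition='{"metrics":{"metrics_collected":{"mem":{"measurement":["mem_used_percent"]}}}}'

# "+" merges top-level keys only: the right-hand "metrics" replaces the left one.
shallow=$(jq -c --argjson add "$addition" '. + $add' <<<"$existing")

# "*" merges recursively, so nested keys from both sides are kept.
deep=$(jq -c --argjson add "$addition" '. * $add' <<<"$existing")

echo "shallow: $shallow"
echo "deep:    $deep"
```

If the base config has no metrics key, both operators give the same result, so this only matters when the base file starts carrying its own metrics settings.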
so looks like this will append to https://github.com/buildkite/elastic-ci-stack-for-aws/blob/509802db39718c8e0623535804237fed35f6f516/packer/linux/conf/cloudwatch-agent/config.json so that the existing file doesn't get clobbered?
Yes, it actually rewrites it. It's the same pattern used to configure docker, albeit more complicated.
- The jq command will read the $cw_config file, append the metrics configuration block and output it (jq [options] <jq filter> [file...]). Where the jq filter is:
  - . : identity filter
  - {} : object construction
  - + : addition operator between two objects. So it takes the original configuration and merges in the newly created object: "Objects are added by merging, that is, inserting all the key-value pairs from both objects into a single combined object."
- This output is passed to the standard input of cat with a here-string (<<<) using command substitution ($(..)). So the jq command runs in a subshell.
- cat is then overwriting the original config file.
It's equivalent to:
new_config=$(jq <filter> "$cw_config")
echo "$new_config" | cat > "$cw_config"
From what I understand, the combination of here-string and command substitution will use a temp file or memory to avoid any race condition while modifying the file in place.
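The ordering described here can be checked with a small bash experiment. This is only a sketch: it uses a scratch file from mktemp and tr in place of jq, so nothing touches the real config, and it only demonstrates the order in which the file is read and truncated.

```shell
#!/usr/bin/env bash
# Scratch file standing in for the config file.
f=$(mktemp)

# Naive in-place rewrite: the shell truncates $f for ">" before tr starts
# reading, so tr sees an empty file and the content is lost.
printf 'hello world\n' >"$f"
tr 'a-z' 'A-Z' <"$f" >"$f"
naive=$(cat "$f")

# Here-string pattern: the command substitution runs to completion (reading $f)
# before the ">" redirection truncates the file.
printf 'hello world\n' >"$f"
cat <<<"$(tr 'a-z' 'A-Z' <"$f")" >"$f"
safe=$(cat "$f")

echo "naive: '$naive'"
echo "safe:  '$safe'"
rm -f "$f"
```

Note that the pattern is not atomic: if the inner command fails, the substitution is empty and the file ends up containing only a newline, so a stricter variant might write to a temporary file and move it into place.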
cool - LGTM. I don't work for buildkite btw, but would love to see this get merged.
this is awesome @ouranos! thanks for submitting.
Maybe we can change the namespace too? Or make it more customisable?
i think adding a namespace is probably a good idea, it'll make filtering in the console significantly less of a pain. maybe something like BuildkiteElasticStack/Agent or something similar? very open to other ideas though, and agree that making it customizable (with a sensible default) is a good idea.
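A minimal sketch of what a customisable namespace with a default could look like, assuming the agent's metrics section accepts a namespace field. CLOUDWATCH_NAMESPACE is a hypothetical variable name used only for illustration, not an existing stack parameter, and the metrics block is trimmed to a single measurement.

```shell
#!/usr/bin/env bash
# CLOUDWATCH_NAMESPACE is a hypothetical variable; fall back to the default
# suggested in this thread when it is unset or empty.
ns="${CLOUDWATCH_NAMESPACE:-BuildkiteElasticStack/Agent}"

# Build a metrics block carrying the chosen namespace.
metrics_block=$(jq -n --arg ns "$ns" '{
  metrics: {
    namespace: $ns,
    metrics_collected: {
      mem: {measurement: ["mem_used_percent"]}
    }
  }
}')

echo "$metrics_block"
```

The resulting object could then be merged into the agent config with the same jq pattern used elsewhere in this PR.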
mem: {measurement: ["mem_used_percent"], append_dimensions: {BuildkiteQueue: $queue}},
disk: {measurement: ["used_percent"], resources: ["*"], append_dimensions: {BuildkiteQueue: $queue}}
if we're adding the queue as dimension for memory and disk, should we do it for CPU too?
Report on memory and disk usage
5756f14 to a1e83f6
Following #811, we've removed our custom patch to install the CloudWatch agent.
The default config (downloaded with amazon-cloudwatch-agent-ctl -m ec2 -a fetch-config -c default -s) had an extra config block to report on memory and disk usage. This is a nice feature to monitor buildkite instance utilisation (and assist with dimensioning).
For example, see the bottom right graph:
We could potentially add other dimensions to make it easier to differentiate by queue name. I'm currently using the ASG name which might change when updating the stack.