-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Adjust the alicloud metrics exporter and add RDS performance metrics #15563
base: master
Are you sure you want to change the base?
feat: Adjust the alicloud metrics exporter and add RDS performance metrics #15563
Conversation
9f75e88
to
d4bed9f
Compare
@ZPascal I see this is a draft, are you actually ready for a review once the lint issue is resolved? |
e2b3c16
to
d632ccc
Compare
@powersj I think so. I still have some rework to do, e.g. new tests or documentation, but I think we can do it in parallel. In general, the feature offers the possibility to manage performance metrics. These metrics are not part of the CMS. I'm currently thinking about renaming the plugin. What do you think of this idea? |
c0b4842
to
a25731c
Compare
Check my understanding: CMS is the Cloud Monitor service, while the RDS is the relational database service? A user might not want one or the other. Renaming it in the Telegraf config would be a breaking change and we do not want to do that. What I would suggest doing is to add a config option that sets what metrics are captured, something like: ## Aliyun Metrics
## Specified which metrics to capture from Aliyun, choose from:
## * cms - Cloud Monitor service
## * rds - Relational Database service
# metrics = ["cms"] A user can then add in In terms of docs, we would want to clarify at the top that this collects more than just CMS. |
Hi @powersj, thank you for the answer.
The default values are collected by CMS and as an optional feature, you can enable the advanced monitoring metrics like AWS to get more details like IOPS usage.
I fully agree with that.
I would like to build up the whole topic in a generic way and also offer the possibility to consume further extended metrics of the services in the future. For this reason, I have introduced a generic parameter that provides the basic function for the plugin and in the second step activates the function for the RDS service. I can only report on our current case, and we need both. The CMS metrics and the performance metrics. If someone doesn't need the CMS metrics, they don't need to forward or extract them.
I think we can handle it, in one generic solution to extract metrics from the Alicloud API. |
We discourage boolean parameters. As a plugin gets more and more features it means the user has more and more toggles that they need to switch on or off and complicates the configuration greatly. We prefer a single config option that is an array that specifies what features to toggle on or off. |
@powersj Thank you for the update. I'll adapt the code base. |
Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Forums or provide additional details in this issue and reqeust that it be re-opened. Thank you! |
I'm going to re-open as @ZPascal I think you've been going through and updating the PRs and we were nearly done with this one. |
f7f9386
to
bc31acc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update, some initial comments
## Specified which metric service to capture from Aliyun | ||
## * cms - Cloud Monitor service (default settings) | ||
## * rds - Relational Database service | ||
# service = 'rds' | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is there a service
+ metric_services
? Shouldn't this be removed in favor of the metric_services
above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metric_service
is the general trigger point to initialize the RDS client on a technical level and to activate the function in general. The service
parameter defines the corresponding extended
metrics and marks them. These metrics are not available at the standard level (CMS). This means at the API level that you will not get a result for the endpoint if you call it with the wrong metrics.
It is not marked as an error, but we log it as debug information. If this behavior is ok, I can also remove the service
parameter and we intentionally call the API endpoint with the wrong information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand the scenario at play with these settings then.
Why would I enable "cms" metric_services and not "cms" service? (This naming is a headache) :) Likewise, why would I enable "rds" metric_services and not "rds" service?
You mentioned extended metrics. We are not collecting these today for cms right?
Should this be a follow-on PR and this PR instead be focused only on adding the new RDS service?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @powersj,
I'm not sure I understand the scenario at play with these settings then.
Why would I enable "cms" metric_services and not "cms" service? (This naming is a headache) :) Likewise, why would I enable "rds" metric_services and not "rds" service?
The corresponding parameters are linked to each other.
The metric_service
parameter is the generic option and generally activates the option. It initiates the separate client that processes the request to the dedicated API endpoint.
The service
parameter at the metric level is the trigger point for calling the API. In the current implementation, it is the RDS service of Alicloud and extracts the separately specified values.
If we don't have a switch for the metric level, we are intentionally calling the RDS API in the wrong context to get metrics e.g. we expect CMS default metrics, thus burdening the Alicloud rate limit.
You mentioned extended metrics. We are not collecting these today for cms right?
Should this be a follow-on PR and this PR instead be focused only on adding the new RDS service?
The CMS service and the dedicated APIs of the services themselves are not compatible and are separate endpoints. When I talk about extended metrics, I'm talking about the dedicated RDS API endpoint that this PR is already using.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @powersj, is there any ambiguity regarding the functionality or can I proceed to write new tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ZPascal please do not expose technical details to users. As you said, the two parameters are linked and as such a nightmare to maintain and even communicate to the user. Please let the user define the metric and internally map to which client to initialize and use!
resp := new(rds.DescribeDBInstancePerformanceResponse) | ||
|
||
switch request.Key { | ||
//TODO Adapt the Tests and the Mock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's left here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is my quality aspect to write new unit tests for the delivered new function. I am currently working on the mocks and will further customize the tests today.
5bd146b
to
6f80f8e
Compare
Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Forums or provide additional details in this issue and reqeust that it be re-opened. Thank you! |
Sorry @ZPascal. I'll reopen and try to read into the discussion tomorrow! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry @ZPascal for taking so long to look at your great contribution! I do have some comments in the code, with the most important being to remove the redundant metric_service
/service
settings. Just use one (e.g. metric
) setting and internally map to which service you have to get this from.
Regarding renaming the plugin: Please don't do this in the present PR but split this out to a separate one. What you can do is to rename the plugin directory and struct to "aliyun" and then register the plugin twice, once with the new and once with the old name like here...
## Specified which metric service to capture from Aliyun | ||
## * cms - Cloud Monitor service (default settings) | ||
## * rds - Relational Database service | ||
# service = 'rds' | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ZPascal please do not expose technical details to users. As you said, the two parameters are linked and as such a nightmare to maintain and even communicate to the user. Please let the user define the metric and internally map to which client to initialize and use!
"github.com/jmespath/go-jmespath" | ||
"reflect" | ||
"slices" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep the grouping into built-in, 3rd party and telegraf internal imports which the groups being separated by an empty line!
"github.com/jmespath/go-jmespath" | ||
|
||
"github.com/aliyun/alibaba-cloud-sdk-go/services/rds" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, the SDK is not an internal import so please move it to the group above.
@@ -27,8 +29,8 @@ import ( | |||
var sampleConfig string | |||
|
|||
type ( | |||
// AliyunCMS is aliyun cms config info. | |||
AliyunCMS struct { | |||
// AliyunMetrics is aliyun cms config info. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment adds no info at all. Please remove instead of making it worse. ;-)
func (*AliyunCMS) SampleConfig() string { | ||
func (*AliyunMetrics) SampleConfig() string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please keep the struct name as changing it might make things more confusing...
//Check metric services | ||
if len(s.MetricServices) == 0 { | ||
s.MetricServices = []string{"cms"} | ||
s.Log.Info("'metric_services' is not set. Metrics will be queried from the cms service") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not log anything for filling in default values. This is confusion and will make users explicitly set the values in their configs to silence this with no benefit.
for _, instanceID := range metric.requestDimensions { | ||
req := rds.CreateDescribeDBInstancePerformanceRequest() | ||
req.DBInstanceId = instanceID["instanceId"] | ||
req.Key = metricName | ||
startTime := s.windowStart.UTC() | ||
req.StartTime = fmt.Sprintf("%d-%02d-%02dT%02d:%02dZ", startTime.Year(), startTime.Month(), | ||
startTime.Day(), startTime.Hour(), startTime.Minute()) | ||
endTime := s.windowEnd.UTC() | ||
req.EndTime = fmt.Sprintf("%d-%02d-%02dT%02d:%02dZ", endTime.Year(), endTime.Month(), | ||
endTime.Day(), endTime.Hour(), endTime.Minute()) | ||
req.RegionId = region | ||
|
||
resp, err := s.rdsClient.DescribeDBInstancePerformance(req) | ||
|
||
if err != nil { | ||
return fmt.Errorf("failed to get the database instance performance metrics: %w", err) | ||
} | ||
if resp.GetHttpStatus() != 200 { | ||
s.Log.Errorf("failed to get the database instance performance metrics: %v", resp.BaseResponse.GetHttpContentString()) | ||
break | ||
} | ||
|
||
for _, performanceKey := range resp.PerformanceKeys.PerformanceKey { | ||
for _, performanceValue := range performanceKey.Values.PerformanceValue { | ||
parsedTime, err := time.Parse(time.RFC3339, performanceValue.Date) | ||
if err != nil { | ||
return fmt.Errorf("failed to parse response performance time datapoints: %w", err) | ||
} | ||
|
||
if strings.Contains(performanceValue.Value, "&") { | ||
performanceKeys := strings.Split(performanceKey.ValueFormat, "&") | ||
performanceValues := strings.Split(performanceValue.Value, "&") | ||
|
||
for i, value := range performanceValues { | ||
valueAsFloat, err := strconv.ParseFloat(value, 32) | ||
if err != nil { | ||
return fmt.Errorf("failed to convert the performance value string to an float: %w", err) | ||
} | ||
datapoints = append(datapoints, | ||
map[string]interface{}{ | ||
"instanceId": instanceID["instanceId"], | ||
performanceKeys[i]: valueAsFloat, | ||
"timestamp": parsedTime.Unix(), | ||
}) | ||
} | ||
} else { | ||
valueAsFloat, err := strconv.ParseFloat(performanceValue.Value, 32) | ||
if err != nil { | ||
return fmt.Errorf("failed to convert the performance value string to an float: %w", err) | ||
} | ||
datapoints = append(datapoints, | ||
map[string]interface{}{ | ||
"instanceId": instanceID["instanceId"], | ||
performanceKey.ValueFormat: valueAsFloat, | ||
"timestamp": parsedTime.Unix(), | ||
}) | ||
} | ||
} | ||
} | ||
|
||
if len(datapoints) == 0 { | ||
s.Log.Debugf("No rds performance metrics returned from RDS, response msg: %s", resp.GetHttpContentString()) | ||
break | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OMG! Please move this to an own function and try to reduce nesting!
if reflect.TypeOf(value).String() == "int64" { | ||
datapointTime = value.(int64) | ||
} else { | ||
datapointTime = int64(value.(float64)) / 1000 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use type assertions here instead of reflection!
@@ -39,7 +39,7 @@ type discoveryTool struct { | |||
|
|||
respRootKey string //Root key in JSON response where to look for discovery data | |||
respObjectIDKey string //Key in element of array under root key, that stores object ID | |||
//for ,majority of cases it would be InstanceId, for OSS it is BucketName. This key is also used in dimension filtering// ) | |||
//for ,the majority of cases it would be InstanceId, for OSS it is BucketName. This key is also used in dimension filtering// ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get what this comment wants to tell me and where this relates to...
@@ -63,15 +63,13 @@ func getRPCReqFromDiscoveryRequest(req discoveryRequest) (*requests.RpcRequest, | |||
} | |||
|
|||
ptrV := reflect.Indirect(reflect.ValueOf(req)) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please move formatting cleanup to an own PR to ease reviews! It's hard if functional changes and pure style changes are mixed and some files are only style changes...
ea43c13
to
44b89fc
Compare
Download PR build artifacts for linux_amd64.tar.gz, darwin_arm64.tar.gz, and windows_amd64.zip. 📦 Click here to get additional PR build artifactsArtifact URLs |
Hello! I am closing this issue due to inactivity. I hope you were able to resolve your problem, if not please try posting this question in our Community Slack or Community Forums or provide additional details in this issue and reqeust that it be re-opened. Thank you! |
Unfortunately, there was no progress in the PR for some time. Please reopen the PR if you want to continue working on the matter. |
Hi @srebhan, I'm still working on the corresponding feature in general, but I need a bit more time to handle the corresponding case and rewrite the functionality in general. I'll reopen a new PR if the feature is ready. |
I reopened the PR to be sure we keep track... |
30ab988
to
ba29dd4
Compare
@ZPascal would you be ok to close this PR and reopen once your code is in a reviewable state? I'm asking because having too many open PRs makes keeping an overview harder. Don't get me wrong, this is not a judgement of your code and I'm really looking forward to reviewing this feature, it's more an administrative issue for us... |
Summary
The PR adds generic advanced and RDS performance metrics support
Checklist
Output
Related issues
superseeds #15238