Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix missing GPM metrics #122

Merged
merged 2 commits into from
May 24, 2024
Merged

Fix missing GPM metrics #122

merged 2 commits into from
May 24, 2024

Conversation

elezar
Copy link
Member

@elezar elezar commented May 16, 2024

These changes ensure that the collected GPM metrics are transferred to the original data structure passed to GpmMetricsGet.

Closes #121

@elezar elezar requested a review from klueska May 16, 2024 13:56
@elezar elezar self-assigned this May 16, 2024
pkg/nvml/gpm.go Outdated
Comment on lines 72 to 74
defer func() {
metricsGet.Metrics = nvmlMetricsGet.Metrics
}()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid doing a defer like this and do something more explicit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can store the incoming ret.

I was just wondering whether we should copy the metrics that are returned regardless of the ret value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to remember why we pass a pointer to the GpmMetricsGetType in here (and fill it out) rather than just returning (GpmMetricsGetType, Return). I have a vague memory of needing to do this but I don't know why.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember now -- because the API requires some fields to be set as INPUTs and other fields to be set as OUTPUTs in the struct: https://github.com/NVIDIA/go-nvml/blob/main/gen/nvml/nvml.h#L9401-L9409

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a matter of interest, if some fields are used as inputs, using the GpmMetricsGetV type seems quite akward. I suppose this is generated due to the versioned union though.

pkg/nvml/gpm.go Outdated
metricsGet.Metrics = nvmlMetricsGet.Metrics
}()

return nvmlGpmMetricsGetStub(nvmlMetricsGet)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the idea of adding a one-off stub like this. If we are going to do this, we should do it for everything.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, I added one like this for the topology functions that I also added tests for. I didn't see the value in adding all of them if we weren't going to use them immediately.

Would you rather I remove the tests and the stubs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's fine then. I think we can add a level of indirection pretty easily though. We just instruct nvidia.yml to rewrite all generated functions as nvml*Internal and then do something similar to how I generate the various interfaces to generate variables for all nvml* = nvml*Internal by default. Even better would be to make them all function pointers hanging off of the library type (so we don't have to do the weird update/reset in the tests), but that might be more trouble than it's worth.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Let's have a look at this at some point in the (near) future.

pkg/nvml/gpm.go Outdated
@@ -57,14 +57,23 @@ type GpmMetricsGetVType struct {
func (l *library) GpmMetricsGetV(metricsGet *GpmMetricsGetType) GpmMetricsGetVType {
return GpmMetricsGetVType{metricsGet.convert()}
Copy link
Contributor

@klueska klueska May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be updated somehow too? Otherwise, how do the assigned metrics make their way out of this call?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .convert() call does also copy the metrics meaning that any metrics in metricsGet will be in the result of convert().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should change this whole thing:

// nvml.GpmMetricsGet()
type GpmMetricsGetVType struct {
	metricsGet *nvmlGpmMetricsGetType
}

func (l *library) GpmMetricsGetV(metricsGet *GpmMetricsGetType) GpmMetricsGetVType {
	return GpmMetricsGetVType{metricsGet.convert()}
}
func (metricsGetV GpmMetricsGetVType) V1() Return {
	metricsGetV.metricsGet.Version = 1
	return nvmlGpmMetricsGet(metricsGetV.metricsGet)
}

func (l *library) GpmMetricsGet(metricsGet *GpmMetricsGetType) Return {
	metricsGet.Version = GPM_METRICS_GET_VERSION
	return nvmlGpmMetricsGet(metricsGet.convert())
}

to this:

// nvml.GpmMetricsGet()
type GpmMetricsGetVType struct {
	metricsGet *GpmMetricsGetType
}

func (l *library) GpmMetricsGetV(metricsGet *GpmMetricsGetType) GpmMetricsGetVType {
	return GpmMetricsGetVType{metricsGet}
}
func (metricsGetV GpmMetricsGetVType) V1() Return {
	nvmlMetricsGet := metricsGetV.metricsGet.convert()
	nvmlMetricsGet.Version = 1
	ret := nvmlGpmMetricsGet(&nvmlMetricsGet)
	*metricsGetV.metricsGet = nvmlMetricsGet.convert()
	return ret
}

func (l *library) GpmMetricsGet(metricsGet *GpmMetricsGetType) Return {
	nvmlMetricsGet := metricsGet.convert()
	nvmlMetricsGet.Version = GPM_METRICS_GET_VERSION
	ret := nvmlGpmMetricsGet(&nvmlMetricsGet)
	*metricsGetV.metricsGet = nvmlMetricsGet.convert()
	return ret
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see how your most recent changes will work for GpmMetricsGet() (even though I still think we should use an explicit convert) -- but I still don't understand how you believe they work for GpmMetricsGetV(&metricsGet).V1() without something like what I have above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right about V1(). I've updated the revision to add a test for that too and also update the implementation.

@klueska
Copy link
Contributor

klueska commented May 21, 2024

The .convert() call does also copy the metrics meaning that any metrics in metricsGet will be in the result of convert().

Yes, but we need the values set back in the original variable passed in.

@elezar elezar marked this pull request as ready for review May 21, 2024 18:46
pkg/dl/dl.go Outdated
@@ -65,6 +65,7 @@ func dlError() error {
}

func (dl *DynamicLibrary) Open() error {
fmt.Printf("%+v\n", dl)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Sorry, was editing in the wrong window here.

pkg/nvml/gpm.go Outdated
return nvmlGpmMetricsGet(metricsGet.convert())
nvmlMetricsGet := metricsGet.convert()
ret := nvmlGpmMetricsGetStub(nvmlMetricsGet)
metricsGet.Metrics = nvmlMetricsGet.Metrics
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@klueska we create a local instance of an nvml type before the function call and we set the return value explicitly here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, something changed since I made my previous comment. I was still seeing:

func (l *library) GpmMetricsGetV(metricsGet *GpmMetricsGetType) GpmMetricsGetVType {
	return GpmMetricsGetVType{metricsGet.convert()}
}

whereby we lose our reference to the original metricsGet variable as soon as we do metricsGet.convert(). The current code seems to basically do what I was suggesting now.

The one difference being that you do:

metricsGet.Metrics = nvmlMetricsGet.Metrics

instead of:

*metricsGet = nvmlMetricsGet.convert()

This is equivalent for now -- but your version would need to be updated if any other fields become OUT fields in the future, wheras using convert() avoids this.

pkg/nvml/gpm.go Outdated
Comment on lines 74 to 78
func gpmMetricsGet(metricsGet *GpmMetricsGetType) Return {
nvmlMetricsGet := metricsGet.convert()
ret := nvmlGpmMetricsGetStub(nvmlMetricsGet)
metricsGet.Metrics = nvmlMetricsGet.Metrics
return ret
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@klueska following your comments, I could update this whole thing to:

Suggested change
func gpmMetricsGet(metricsGet *GpmMetricsGetType) Return {
nvmlMetricsGet := metricsGet.convert()
ret := nvmlGpmMetricsGetStub(nvmlMetricsGet)
metricsGet.Metrics = nvmlMetricsGet.Metrics
return ret
func gpmMetricsGet(metricsGet *GpmMetricsGetType) Return {
nvmlMetricsGet := metricsGet.convert()
ret := nvmlGpmMetricsGet(nvmlMetricsGet)
*metricsGet = *nvmlMetricsGet.convert()
return ret

do you have a preference?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is equivalent for now -- but your version would need to be updated if any other fields become OUT fields in the future, wheras using convert() avoids this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Let's switch to the full assignmet to better handle future fields.

elezar added 2 commits May 23, 2024 12:33
@elezar elezar merged commit 3ff490b into NVIDIA:main May 24, 2024
4 checks passed
@elezar elezar deleted the gpm-metrics branch May 24, 2024 12:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Collected GPM metrics are not returned to caller
2 participants