-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bump NVML to 12.4 via 12.1, 12.2, and 12.3 #123
Conversation
3e215c6
to
e8f227c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we may be missing some System*
functions for the 12.3
bump.
A general question is whether we want to separate this into separate PRs to better track the version updates? (We need not release each of these, but we could)
@@ -78,18 +78,6 @@ const ( | |||
VGPU_NAME_BUFFER_SIZE = 64 | |||
// GRID_LICENSE_FEATURE_MAX_COUNT as defined in nvml/nvml.h | |||
GRID_LICENSE_FEATURE_MAX_COUNT = 3 | |||
// VGPU_SCHEDULER_POLICY_UNKNOWN as defined in nvml/nvml.h | |||
VGPU_SCHEDULER_POLICY_UNKNOWN = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although this is due to the nvml.h
change, this is breaking from the point of view of consumers -- especially if we consder that we may be loading and older libnvidia-ml.so
version that still has these defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NVML only guarantees backwards compatibility within a major release. Are you saying we want to have a stronger guarantee?
pkg/nvml/device.go
Outdated
@@ -2654,48 +2654,6 @@ func (device nvmlDevice) GetVgpuCapabilities(capability DeviceVgpuCapability) (b | |||
return (capResult != 0), ret | |||
} | |||
|
|||
// nvml.DeviceGetVgpuSchedulerLog() | |||
func (l *library) DeviceGetVgpuSchedulerLog(device Device) (VgpuSchedulerLog, Return) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do these changes mean that we should do each of the verison bumps as separate PRs? At least we then have the option to branch at the 12.1
update and apply fixes there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either that or just tag the appropriate commits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose they're still available even after a merge.
@@ -184,9 +184,6 @@ var ( | |||
DeviceGetVgpuCapabilities = libnvml.DeviceGetVgpuCapabilities | |||
DeviceGetVgpuMetadata = libnvml.DeviceGetVgpuMetadata | |||
DeviceGetVgpuProcessUtilization = libnvml.DeviceGetVgpuProcessUtilization | |||
DeviceGetVgpuSchedulerCapabilities = libnvml.DeviceGetVgpuSchedulerCapabilities |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're generating APIs now, could we call out the removal of these functions in a changelog / release notes? (more of a nice-to-have)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They were added back in. There were however, 2 legitimate removals:
- CcuGetStreamState() (int, Return)
- CcuSetStreamState(int) Return
Not sure if these were intentional...
@@ -78,6 +78,24 @@ const ( | |||
VGPU_NAME_BUFFER_SIZE = 64 | |||
// GRID_LICENSE_FEATURE_MAX_COUNT as defined in nvml/nvml.h | |||
GRID_LICENSE_FEATURE_MAX_COUNT = 3 | |||
// VGPU_SCHEDULER_POLICY_UNKNOWN as defined in nvml/nvml.h | |||
VGPU_SCHEDULER_POLICY_UNKNOWN = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I see, they re-added them here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, they were removed in 12.1 for some reason and then readded in 12.2
@@ -275,6 +275,15 @@ func nvmlDeviceGetSerial(nvmlDevice nvmlDevice, Serial *byte, Length uint32) Ret | |||
return __v | |||
} | |||
|
|||
// nvmlDeviceGetModuleId function as declared in nvml/nvml.h | |||
func nvmlDeviceGetModuleId(nvmlDevice nvmlDevice, ModuleId *uint32) Return { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self: Can we update c-for-go
to return moduleId
here instead of ModuleId
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. We could / should look into it.
@@ -102,6 +102,25 @@ func nvmlSystemGetProcessName(Pid uint32, Name *byte, Length uint32) Return { | |||
return __v | |||
} | |||
|
|||
// nvmlSystemGetHicVersion function as declared in nvml/nvml.h | |||
func nvmlSystemGetHicVersion(HwbcCount *uint32, HwbcEntries *HwbcEntry) Return { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a corresponding top-level or interface implementation for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its there already, actually:
https://github.com/NVIDIA/go-nvml/blob/main/pkg/nvml/system.go#L53
In each of these versions bumps a whole bunch of functions were moved around, so there are a bunch of +
s and -
s on functions that are really just moves within the file.
@@ -351,6 +351,15 @@ func nvmlDeviceClearCpuAffinity(nvmlDevice nvmlDevice) Return { | |||
return __v | |||
} | |||
|
|||
// nvmlDeviceGetNumaNodeId function as declared in nvml/nvml.h | |||
func nvmlDeviceGetNumaNodeId(nvmlDevice nvmlDevice, Node *uint32) Return { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Does this mean we don't need to use the heuristics that we currently use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say so. But we need to be aware that this is only available in later releases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we would have to support a fallback.
Do you have further comments here? Or more comments to add to my responses? |
No. Looks good. Thanks for the responses. |
Let's fix the GPM metrics before merging this so that the change appears in all of the versions 12.0-12.4 |
This is technically a breaking change, but I can't imagine there being anyone who creates a variable to one of these types. If they do, I imagine they use the `:=` syntax and not the explicit `var` syntax, so they won't be naming the tspe anyway. I'm OK breaking the 1 person who this might affect. Signed-off-by: Kevin Klues <[email protected]>
Signed-off-by: Kevin Klues <[email protected]>
Signed-off-by: Kevin Klues <[email protected]>
Signed-off-by: Kevin Klues <[email protected]>
Signed-off-by: Kevin Klues <[email protected]>
e8f227c
to
f7ceebd
Compare
No description provided.