Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add internal method to get device handle #119

Merged
merged 2 commits into from
May 14, 2024

Conversation

elezar
Copy link
Member

@elezar elezar commented May 8, 2024

This was done in go-nvlib in NVIDIA/go-nvlib#17 but not included when we implemented the interfaces here.

Without the explict nvmlDevice conversion we see:

=== RUN   TestGetTopologyCommonAncestor
--- FAIL: TestGetTopologyCommonAncestor (0.00s)
panic: interface conversion: nvml.Device is nvml.wrappedDevice, not nvml.nvmlDevice [recovered]
        panic: interface conversion: nvml.Device is nvml.wrappedDevice, not nvml.nvmlDevice

goroutine 19 [running]:
testing.tRunner.func1.2({0x1030fc500, 0xc0000b10b0})
        /usr/local/Cellar/go/1.22.2/libexec/src/testing/testing.go:1631 +0x24a
testing.tRunner.func1()
        /usr/local/Cellar/go/1.22.2/libexec/src/testing/testing.go:1634 +0x377
panic({0x1030fc500?, 0xc0000b10b0?})
        /usr/local/Cellar/go/1.22.2/libexec/src/runtime/panic.go:770 +0x132
github.com/NVIDIA/go-nvml/pkg/nvml.nvmlDevice.GetTopologyCommonAncestor(...)
        /Users/elezar/src/go-nvml/pkg/nvml/device.go:223
github.com/NVIDIA/go-nvml/pkg/nvml.TestGetTopologyCommonAncestor(0xc0000c29c0)
        /Users/elezar/src/go-nvml/pkg/nvml/device_test.go:38 +0x85
testing.tRunner(0xc0000c29c0, 0x10313b4c8)
        /usr/local/Cellar/go/1.22.2/libexec/src/testing/testing.go:1689 +0xfb
created by testing.(*T).Run in goroutine 1
        /usr/local/Cellar/go/1.22.2/libexec/src/testing/testing.go:1742 +0x390
FAIL    github.com/NVIDIA/go-nvml/pkg/nvml      0.612s

@elezar elezar force-pushed the fix-topology-common-ancestor branch from c4f6455 to 90d365f Compare May 8, 2024 19:59
@klueska
Copy link
Contributor

klueska commented May 9, 2024

I don't think this approach will work in general. It works for new types that embed the Device type, but not for those that want to implement the Device interface from scratch (e.g. our mock implementation) because they have no way of providing an implementation for the unexported method.

Would an alternative be to implement the following internal function directly? I realize it relies on reflection, but maybe this is OK since it won't actually be called very often...

func nvmlDeviceHandle(d Device) nvmlDevice {
	if nvmlDevice, ok := d.(nvmlDevice); ok {
		return nvmlDevice
	}
	
	val := reflect.ValueOf(d)
	typ := val.Type()
	for i := 0; i < typ.NumField(); i++ {
		if !typ.Field(i).Anonymous {
			continue
		}
		if _, ok := val.Field(i).Interface().(Device); !ok {
			continue
		}
		return nvmlDeviceHandle(val.Field(i).Interface().(Device))
	}
	panic("Unable to convert Device to underlying nvmlDevice")
}

Here it is in the go playground:
https://go.dev/play/p/AnJXQpCrzGZ?v=gotip.go?download=true.go?download=true

@elezar elezar force-pushed the fix-topology-common-ancestor branch from 4ed5f91 to ddf4a1b Compare May 10, 2024 08:50
@elezar
Copy link
Member Author

elezar commented May 10, 2024

There is definitely a concern that this would not work in general. I also considered reflect, but didn't have time to properly flesh this out.

Note that the issue here is not the concrete type to which the function is attached, but the fact that we're accepting an interface type for one of the functions and expect this argument to be convertable to the base type because we need to call an internal function with this type. It is this call to the internal function that causes a panic when the input cannot be converted to the concrete type.

I have incorportated your nvmlDeviceHandle function in the latest revision and we will still panic in cases where we cannot perform this conversion, but at least cover the wrapped cases. For users implementing their own Device types from scratch, they will have to update the relevant functions accordingly.

Another option would be to add a function that provides the underlying reference as a public function. This may still not make sense in mocks, but would allow consumers to implement this as they require.

@elezar elezar self-assigned this May 10, 2024
@elezar elezar marked this pull request as ready for review May 10, 2024 09:04
@elezar elezar force-pushed the fix-topology-common-ancestor branch 2 times, most recently from d5cd719 to a74badc Compare May 10, 2024 10:27
pkg/nvml/device.go Outdated Show resolved Hide resolved
elezar added a commit to elezar/k8s-device-plugin that referenced this pull request May 10, 2024
Apply a local fix to a go-nvml bug.

See NVIDIA/go-nvml#119

Signed-off-by: Evan Lezar <[email protected]>
@elezar elezar force-pushed the fix-topology-common-ancestor branch 4 times, most recently from bed0186 to cdfcbbb Compare May 13, 2024 15:54
pkg/nvml/device.go Outdated Show resolved Hide resolved
@elezar elezar force-pushed the fix-topology-common-ancestor branch from cdfcbbb to 339dc1a Compare May 14, 2024 08:28
@klueska
Copy link
Contributor

klueska commented May 14, 2024

I'm still sad there isn't a good way to do this without reflection, but I think the current implementation is pretty easy to understand and covers all our bases.

@elezar elezar merged commit a94486b into NVIDIA:main May 14, 2024
4 checks passed
@elezar elezar deleted the fix-topology-common-ancestor branch May 14, 2024 11:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants