Description of problem:
After a new SriovNetworkNodePolicy is applied and the worker reboots, the sriov-device-plugin cannot find the VFs.

Version-Release number of selected component (if applicable):
4.9.0-202108191042

How reproducible:
Not always

Steps to Reproduce:
1. Create the following policies:

# cat intel-netdevice.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: intel-netdevice
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  nicSelector:
    pfNames:
      - ens1f0
    rootDevices:
      - '0000:3b:00.0'
    vendor: '8086'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 2
  priority: 99
  resourceName: intelnetdevice

[root@dell-per740-36 rhcos]# cat mlx277-rdma
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx277-dpdk
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1500
  nicSelector:
    pfNames:
      - ens2f1
    vendor: '15b3'
    deviceID: '1015'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 2
  isRdma: true
  resourceName: mlx277dpdk

2. After the above policies are created, everything works and the VFs are found:

# oc describe node dell-per740-14.rhts.eng.pek2.redhat.com | grep "openshift.io"
  node.openshift.io/os_id=rhcos
  machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
  machineconfiguration.openshift.io/currentConfig: rendered-worker-eb1b1ade8c290e67fc77daa247a4ffbd
  machineconfiguration.openshift.io/desiredConfig: rendered-worker-eb1b1ade8c290e67fc77daa247a4ffbd
  machineconfiguration.openshift.io/reason:
  machineconfiguration.openshift.io/ssh: accessed
  machineconfiguration.openshift.io/state: Done
  sriovnetwork.openshift.io/state: Idle
  openshift.io/intelnetdevice: 2
  openshift.io/mlx277dpdk: 2
  openshift.io/intelnetdevice: 2
  openshift.io/mlx277dpdk: 2

3. Apply another policy, as follows:

# cat mlx278-rdma
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx278-dpdk
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1550
  nicSelector:
    pfNames:
      - ens3f1
    rootDevices:
      - '0000:5e:00.1'
    vendor: '15b3'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 2
  isRdma: true
  resourceName: mlx278dpdk

4. After the worker reboots, "openshift.io/mlx277dpdk" has changed to 1:

# oc describe node dell-per740-14.rhts.eng.pek2.redhat.com | grep "openshift.io"
  node.openshift.io/os_id=rhcos
  machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
  machineconfiguration.openshift.io/currentConfig: rendered-worker-eb1b1ade8c290e67fc77daa247a4ffbd
  machineconfiguration.openshift.io/desiredConfig: rendered-worker-eb1b1ade8c290e67fc77daa247a4ffbd
  machineconfiguration.openshift.io/reason:
  machineconfiguration.openshift.io/ssh: accessed
  machineconfiguration.openshift.io/state: Done
  sriovnetwork.openshift.io/state: Idle
  openshift.io/intelnetdevice: 2
  openshift.io/mlx277dpdk: 1
  openshift.io/mlx278dpdk: 2
  openshift.io/intelnetdevice: 2
  openshift.io/mlx277dpdk: 1
  openshift.io/mlx278dpdk: 2
  openshift.io/intelnetdevice 0 0
  openshift.io/mlx277dpdk 0 0
  openshift.io/mlx278dpdk 0 0

5. The VFs are actually initialized on the PF:

# ip link show ens2f1
11: ens2f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 98:03:9b:48:dd:99 brd ff:ff:ff:ff:ff:ff
    vf 0 link/ether 32:be:05:49:2f:d3 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 1 link/ether e6:80:d7:e7:b4:13 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off

6.
Check the sriov-device-plugin pod logs.

Actual results:

# oc logs sriov-device-plugin-9f62d
I0820 09:12:36.334062 1 manager.go:52] Using Kubelet Plugin Registry Mode
I0820 09:12:36.334298 1 main.go:44] resource manager reading configs
I0820 09:12:36.334377 1 manager.go:82] raw ResourceList: {"resourceList":[{"resourceName":"intelnetdevice","selectors":{"vendors":["8086"],"pfNames":["ens1f0"],"rootDevices":["0000:3b:00.0"],"linkTypes":["ether"],"IsRdma":false,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"mlx277dpdk","selectors":{"vendors":["15b3"],"pfNames":["ens2f1"],"linkTypes":["ether"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"mlx278dpdk","selectors":{"vendors":["15b3"],"pfNames":["ens3f1"],"rootDevices":["0000:5e:00.1"],"linkTypes":["ether"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null}]}
I0820 09:12:36.334705 1 factory.go:168] net device selector for resource intelnetdevice is &{DeviceSelectors:{Vendors:[8086] Devices:[] Drivers:[] PciAddresses:[]} PfNames:[ens1f0] RootDevices:[0000:3b:00.0] LinkTypes:[ether] DDPProfiles:[] IsRdma:false NeedVhostNet:false}
I0820 09:12:36.334787 1 factory.go:168] net device selector for resource mlx277dpdk is &{DeviceSelectors:{Vendors:[15b3] Devices:[] Drivers:[] PciAddresses:[]} PfNames:[ens2f1] RootDevices:[] LinkTypes:[ether] DDPProfiles:[] IsRdma:true NeedVhostNet:false}
I0820 09:12:36.334842 1 factory.go:168] net device selector for resource mlx278dpdk is &{DeviceSelectors:{Vendors:[15b3] Devices:[] Drivers:[] PciAddresses:[]} PfNames:[ens3f1] RootDevices:[0000:5e:00.1] LinkTypes:[ether] DDPProfiles:[] IsRdma:true NeedVhostNet:false}
I0820 09:12:36.334870 1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:intelnetdevice DeviceType:netDevice Selectors:0xc000130d08 SelectorObj:0xc00025ca90} {ResourcePrefix: ResourceName:mlx277dpdk DeviceType:netDevice Selectors:0xc000130d20 SelectorObj:0xc00025cea0} {ResourcePrefix: ResourceName:mlx278dpdk DeviceType:netDevice Selectors:0xc000130d38 SelectorObj:0xc00025d110}]
I0820 09:12:36.334980 1 manager.go:193] validating resource name "openshift.io/intelnetdevice"
I0820 09:12:36.335032 1 manager.go:193] validating resource name "openshift.io/mlx277dpdk"
I0820 09:12:36.335060 1 manager.go:193] validating resource name "openshift.io/mlx278dpdk"
I0820 09:12:36.335073 1 main.go:60] Discovering host devices
I0820 09:12:36.423991 1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:18:00.0 02 Intel Corporation I350 Gigabit Network Connection
I0820 09:12:36.424369 1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:18:00.1 02 Intel Corporation I350 Gigabit Network Connection
I0820 09:12:36.424611 1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:18:00.2 02 Intel Corporation I350 Gigabit Network Connection
I0820 09:12:36.424764 1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:18:00.3 02 Intel Corporation I350 Gigabit Network Connection
I0820 09:12:36.424927 1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:3b:00.0 02 Intel Corporation Ethernet Controller XXV710 for 25GbE ...
I0820 09:12:36.425118 1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:3b:00.1 02 Intel Corporation Ethernet Controller XXV710 for 25GbE ...
I0820 09:12:36.425268 1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:3b:02.0 02 Intel Corporation Ethernet Virtual Function 700 Series
I0820 09:12:36.425402 1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:3b:02.1 02 Intel Corporation Ethernet Virtual Function 700 Series
I0820 09:12:36.425541 1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:5e:00.0 02 Mellanox Technolo... MT27800 Family [ConnectX-5]
I0820 09:12:36.425700 1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:5e:00.1 02 Mellanox Technolo... MT27800 Family [ConnectX-5]
I0820 09:12:36.426753 1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:5e:00.4 02 Mellanox Technolo... MT27800 Family [ConnectX-5 Virtual Fu...
I0820 09:12:36.427180 1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:5e:00.5 02 Mellanox Technolo... MT27800 Family [ConnectX-5 Virtual Fu...
I0820 09:12:36.427588 1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:60:00.0 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
I0820 09:12:36.427985 1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:60:00.1 02 Mellanox Technolo... MT27710 Family [ConnectX-4 Lx]
I0820 09:12:36.428422 1 main.go:66] Initializing resource servers
I0820 09:12:36.428456 1 manager.go:112] number of config: 3
I0820 09:12:36.428469 1 manager.go:115]
I0820 09:12:36.428480 1 manager.go:116] Creating new ResourcePool: intelnetdevice
I0820 09:12:36.428504 1 manager.go:117] DeviceType: netDevice
I0820 09:12:36.439228 1 factory.go:108] device added: [pciAddr: 0000:3b:02.0, vendor: 8086, device: 154c, driver: iavf]
I0820 09:12:36.439242 1 factory.go:108] device added: [pciAddr: 0000:3b:02.1, vendor: 8086, device: 154c, driver: iavf]
I0820 09:12:36.439259 1 manager.go:145] New resource server is created for intelnetdevice ResourcePool
I0820 09:12:36.439265 1 manager.go:115]
I0820 09:12:36.439269 1 manager.go:116] Creating new ResourcePool: mlx277dpdk
I0820 09:12:36.439273 1 manager.go:117] DeviceType: netDevice
W0820 09:12:36.439303 1 pciNetDevice.go:55] RDMA resources for 0000:18:00.0 not found. Are RDMA modules loaded?
W0820 09:12:36.439557 1 pciNetDevice.go:55] RDMA resources for 0000:18:00.1 not found. Are RDMA modules loaded?
W0820 09:12:36.439800 1 pciNetDevice.go:55] RDMA resources for 0000:18:00.2 not found. Are RDMA modules loaded?
W0820 09:12:36.439986 1 pciNetDevice.go:55] RDMA resources for 0000:18:00.3 not found. Are RDMA modules loaded?
W0820 09:12:36.440158 1 pciNetDevice.go:55] RDMA resources for 0000:3b:00.1 not found. Are RDMA modules loaded?
W0820 09:12:36.440356 1 pciNetDevice.go:55] RDMA resources for 0000:3b:02.0 not found. Are RDMA modules loaded?
W0820 09:12:36.440683 1 pciNetDevice.go:55] RDMA resources for 0000:3b:02.1 not found. Are RDMA modules loaded?
I0820 09:12:36.444276 1 factory.go:108] device added: [pciAddr: 0000:60:00.1, vendor: 15b3, device: 1015, driver: mlx5_core]
I0820 09:12:36.444296 1 manager.go:145] New resource server is created for mlx277dpdk ResourcePool
I0820 09:12:36.444301 1 manager.go:115]
I0820 09:12:36.444305 1 manager.go:116] Creating new ResourcePool: mlx278dpdk
I0820 09:12:36.444310 1 manager.go:117] DeviceType: netDevice
W0820 09:12:36.444337 1 pciNetDevice.go:55] RDMA resources for 0000:18:00.0 not found. Are RDMA modules loaded?
W0820 09:12:36.444580 1 pciNetDevice.go:55] RDMA resources for 0000:18:00.1 not found. Are RDMA modules loaded?
W0820 09:12:36.444812 1 pciNetDevice.go:55] RDMA resources for 0000:18:00.2 not found. Are RDMA modules loaded?
W0820 09:12:36.444981 1 pciNetDevice.go:55] RDMA resources for 0000:18:00.3 not found. Are RDMA modules loaded?
W0820 09:12:36.445148 1 pciNetDevice.go:55] RDMA resources for 0000:3b:00.1 not found. Are RDMA modules loaded?
W0820 09:12:36.445317 1 pciNetDevice.go:55] RDMA resources for 0000:3b:02.0 not found. Are RDMA modules loaded?
W0820 09:12:36.445636 1 pciNetDevice.go:55] RDMA resources for 0000:3b:02.1 not found. Are RDMA modules loaded?
I0820 09:12:36.449421 1 factory.go:108] device added: [pciAddr: 0000:5e:00.4, vendor: 15b3, device: 1018, driver: mlx5_core]
I0820 09:12:36.449432 1 factory.go:108] device added: [pciAddr: 0000:5e:00.5, vendor: 15b3, device: 1018, driver: mlx5_core]
I0820 09:12:36.449468 1 manager.go:145] New resource server is created for mlx278dpdk ResourcePool
I0820 09:12:36.449473 1 main.go:72] Starting all servers...
I0820 09:12:36.449685 1 server.go:196] starting intelnetdevice device plugin endpoint at: openshift.io_intelnetdevice.sock
I0820 09:12:36.450379 1 server.go:222] intelnetdevice device plugin endpoint started serving
I0820 09:12:36.450483 1 server.go:196] starting mlx277dpdk device plugin endpoint at: openshift.io_mlx277dpdk.sock
I0820 09:12:36.451062 1 server.go:222] mlx277dpdk device plugin endpoint started serving
I0820 09:12:36.451385 1 server.go:196] starting mlx278dpdk device plugin endpoint at: openshift.io_mlx278dpdk.sock
I0820 09:12:36.451974 1 server.go:222] mlx278dpdk device plugin endpoint started serving
I0820 09:12:36.452014 1 main.go:77] All servers started.
I0820 09:12:36.452029 1 main.go:78] Listening for term signals
I0820 09:12:37.206376 1 server.go:106] Plugin: openshift.io_mlx277dpdk.sock gets registered successfully at Kubelet
I0820 09:12:37.206892 1 server.go:131] ListAndWatch(mlx277dpdk) invoked
I0820 09:12:37.207005 1 server.go:106] Plugin: openshift.io_mlx278dpdk.sock gets registered successfully at Kubelet
I0820 09:12:37.206931 1 server.go:139] ListAndWatch(mlx277dpdk): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:60:00.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},},}
I0820 09:12:37.207226 1 server.go:131] ListAndWatch(intelnetdevice) invoked
I0820 09:12:37.207284 1 server.go:131] ListAndWatch(mlx278dpdk) invoked
I0820 09:12:37.207276 1 server.go:139] ListAndWatch(intelnetdevice): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:3b:02.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:3b:02.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},},}
I0820 09:12:37.207394 1 server.go:106] Plugin: openshift.io_intelnetdevice.sock gets registered successfully at Kubelet
I0820 09:12:37.207311 1 server.go:139] ListAndWatch(mlx278dpdk): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:5e:00.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:5e:00.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},},}

Expected results:

Additional info:
Restarting the sriov-device-plugin pod fixes this issue.
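
When triaging a log like the one above, it can help to tally which PCI addresses the plugin placed in each resource pool and compare the counts with the numVfs values from the policies. The following is only a rough sketch: the function names (devices_per_pool, under_reported) are hypothetical helpers written for this comment, the regular expression is inferred from the log lines in this report rather than from any documented log format, and the sample is a trimmed copy of that log.

```python
import re

# Matches either a pool header or a "device added" entry, in log order.
# The pattern is inferred from the log lines in this report (an assumption).
EVENT_RE = re.compile(
    r"Creating new ResourcePool: (?P<pool>\S+)"
    r"|device added: \[pciAddr: (?P<addr>[0-9a-fA-F:.]+), vendor: (?P<vendor>\w+), "
    r"device: (?P<device>\w+), driver: (?P<driver>[^\]]+)\]"
)


def devices_per_pool(log_text):
    """Group 'device added' entries under the most recent ResourcePool header."""
    pools = {}
    current = None
    for match in EVENT_RE.finditer(log_text):
        if match.group("pool"):
            current = match.group("pool")
            pools.setdefault(current, [])
        elif current is not None:
            pools[current].append(
                {
                    "addr": match.group("addr"),
                    "vendor": match.group("vendor"),
                    "device": match.group("device"),
                    "driver": match.group("driver"),
                }
            )
    return pools


def under_reported(pools, expected):
    """Return {pool: found} for pools holding fewer devices than expected."""
    found = {name: len(pools.get(name, [])) for name in expected}
    return {name: n for name, n in found.items() if n < expected[name]}


# Trimmed copy of the device-plugin log from this report.
SAMPLE_LOG = """\
I0820 09:12:36.428480 1 manager.go:116] Creating new ResourcePool: intelnetdevice
I0820 09:12:36.439228 1 factory.go:108] device added: [pciAddr: 0000:3b:02.0, vendor: 8086, device: 154c, driver: iavf]
I0820 09:12:36.439242 1 factory.go:108] device added: [pciAddr: 0000:3b:02.1, vendor: 8086, device: 154c, driver: iavf]
I0820 09:12:36.439269 1 manager.go:116] Creating new ResourcePool: mlx277dpdk
I0820 09:12:36.444276 1 factory.go:108] device added: [pciAddr: 0000:60:00.1, vendor: 15b3, device: 1015, driver: mlx5_core]
I0820 09:12:36.444305 1 manager.go:116] Creating new ResourcePool: mlx278dpdk
I0820 09:12:36.449421 1 factory.go:108] device added: [pciAddr: 0000:5e:00.4, vendor: 15b3, device: 1018, driver: mlx5_core]
I0820 09:12:36.449432 1 factory.go:108] device added: [pciAddr: 0000:5e:00.5, vendor: 15b3, device: 1018, driver: mlx5_core]
"""

# numVfs values taken from the three policies in this report.
EXPECTED_VFS = {"intelnetdevice": 2, "mlx277dpdk": 2, "mlx278dpdk": 2}

pools = devices_per_pool(SAMPLE_LOG)
short = under_reported(pools, EXPECTED_VFS)
print(short)
print([d["addr"] for d in pools["mlx277dpdk"]])
```

On this sample the only pool flagged is mlx277dpdk, and its single entry is 0000:60:00.1 with device ID 1015. The discovery lines above list that address as "MT27710 Family [ConnectX-4 Lx]" with no Virtual Function entries under 0000:60, which suggests the plugin enumerated devices before the VFs on ens2f1 existed and picked up the PF instead.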
Balazs, if this fix needs to be pre-tested, could you provide an image for testing?
https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/309
After discussing with Sebastian, he assumes that the issue won't be reproducible anymore in 4.9 due to a patch that excludes PFs from pools. @zhaozhanqi Please take that into account when checking this.
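
For anyone verifying this by hand, the PF/VF distinction can be checked from sysfs: an SR-IOV capable PF exposes a sriov_totalvfs attribute in its PCI device directory, while a VF has a physfn link back to its parent. The sketch below illustrates filtering PFs out of a candidate list on that basis. It is not the actual patch and may not match its logic; the function names and the fake sysfs tree are assumptions made up for this illustration.

```python
import tempfile
from pathlib import Path

DEFAULT_SYSFS = "/sys/bus/pci/devices"


def is_physical_function(pci_addr, sysfs_root=DEFAULT_SYSFS):
    """True when the device directory has sriov_totalvfs (SR-IOV capable PF)."""
    return (Path(sysfs_root) / pci_addr / "sriov_totalvfs").exists()


def is_virtual_function(pci_addr, sysfs_root=DEFAULT_SYSFS):
    """True when the device directory has a physfn entry (it is a VF)."""
    return (Path(sysfs_root) / pci_addr / "physfn").exists()


def exclude_pfs(pci_addrs, sysfs_root=DEFAULT_SYSFS):
    """Keep only addresses that are not SR-IOV capable PFs (a simplification)."""
    return [a for a in pci_addrs if not is_physical_function(a, sysfs_root)]


def make_fake_sysfs(root, pfs, vfs):
    """Build a throwaway directory tree that mimics the two sysfs markers."""
    for addr in pfs:
        dev = Path(root) / addr
        dev.mkdir(parents=True)
        (dev / "sriov_totalvfs").write_text("8\n")
    for addr in vfs:
        dev = Path(root) / addr
        dev.mkdir(parents=True)
        # In real sysfs this is a symlink to the PF; a directory suffices here.
        (dev / "physfn").mkdir()


# Addresses taken from the log in this report; the tree itself is fake.
candidates = ["0000:60:00.1", "0000:5e:00.4", "0000:5e:00.5"]
with tempfile.TemporaryDirectory() as root:
    make_fake_sysfs(root, pfs=["0000:60:00.1"], vfs=["0000:5e:00.4", "0000:5e:00.5"])
    kept = exclude_pfs(candidates, root)
    vf_flags = [is_virtual_function(a, root) for a in candidates]

print(kept)
```

With a filter of this kind, the PF at 0000:60:00.1 would not have been advertised under mlx277dpdk. On a real node the default sysfs_root can be used instead of the temporary tree.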
(In reply to Balazs Nemeth from comment #6)
> After discussing with Sebastian, he assumes that the issue won't be
> reproducible anymore in 4.9 due to a patch that excludes PFs from pools.
> @zhaozhanqi Please take that into account when checking this.

Yes, it seems this issue has already been fixed and can no longer be reproduced.
Moving this bug to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.9.48 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:6317
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days