Bug 1996013

Summary: sriov-device-plugin cannot find the VF
Product: OpenShift Container Platform Reporter: zhaozhanqi <zzhao>
Component: Networking Assignee: Balazs Nemeth <bnemeth>
Networking sub component: SR-IOV QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: bnemeth, dosmith, sscheink, vlaad
Version: 4.9   
Target Milestone: ---   
Target Release: 4.9.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-09-12 12:22:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description zhaozhanqi 2021-08-20 10:16:08 UTC
Description of problem:

After applying a new SriovNetworkNodePolicy and rebooting the worker, sriov-device-plugin cannot find the VF.

Version-Release number of selected component (if applicable):
4.9.0-202108191042

How reproducible:
Not always

Steps to Reproduce:
1. Create the following policy
# cat intel-netdevice.yaml 
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: intel-netdevice
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  nicSelector:
    pfNames:
      - ens1f0
    rootDevices:
      - '0000:3b:00.0'
    vendor: '8086'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 2
  priority: 99
  resourceName: intelnetdevice
[root@dell-per740-36 rhcos]# cat mlx277-rdma 
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx277-dpdk
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1500
  nicSelector:
    pfNames:
      - ens2f1
    vendor: '15b3'
    deviceID: '1015'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 2
  isRdma: true
  resourceName: mlx277dpdk

2. After the above policies are created, all works well; the VFs can be found:

# oc describe node dell-per740-14.rhts.eng.pek2.redhat.com | grep "openshift.io"
                    node.openshift.io/os_id=rhcos
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-eb1b1ade8c290e67fc77daa247a4ffbd
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-eb1b1ade8c290e67fc77daa247a4ffbd
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/ssh: accessed
                    machineconfiguration.openshift.io/state: Done
                    sriovnetwork.openshift.io/state: Idle
  openshift.io/intelnetdevice:  2
  openshift.io/mlx277dpdk:      2
  openshift.io/intelnetdevice:  2
  openshift.io/mlx277dpdk:      2
 
3. Apply another policy as follows:

# cat mlx278-rdma 
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx278-dpdk
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1550
  nicSelector:
    pfNames:
      - ens3f1
    rootDevices:
      - '0000:5e:00.1'
    vendor: '15b3'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 2
  isRdma: true
  resourceName: mlx278dpdk

4. After the worker reboots, "openshift.io/mlx277dpdk" changes to 1:

# oc describe node dell-per740-14.rhts.eng.pek2.redhat.com | grep "openshift.io"
                    node.openshift.io/os_id=rhcos
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-eb1b1ade8c290e67fc77daa247a4ffbd
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-eb1b1ade8c290e67fc77daa247a4ffbd
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/ssh: accessed
                    machineconfiguration.openshift.io/state: Done
                    sriovnetwork.openshift.io/state: Idle
  openshift.io/intelnetdevice:  2
  openshift.io/mlx277dpdk:      1
  openshift.io/mlx278dpdk:      2
  openshift.io/intelnetdevice:  2
  openshift.io/mlx277dpdk:      1
  openshift.io/mlx278dpdk:      2
  openshift.io/intelnetdevice  0            0
  openshift.io/mlx277dpdk      0            0
  openshift.io/mlx278dpdk      0            0

5. The VFs are actually initialized on the host:

# ip link show ens2f1
11: ens2f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 98:03:9b:48:dd:99 brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 32:be:05:49:2f:d3 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 1     link/ether e6:80:d7:e7:b4:13 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off

6. Check the sriov-device-plugin pod logs:

Actual results:


# oc logs sriov-device-plugin-9f62d
I0820 09:12:36.334062       1 manager.go:52] Using Kubelet Plugin Registry Mode
I0820 09:12:36.334298       1 main.go:44] resource manager reading configs
I0820 09:12:36.334377       1 manager.go:82] raw ResourceList: {"resourceList":[{"resourceName":"intelnetdevice","selectors":{"vendors":["8086"],"pfNames":["ens1f0"],"rootDevices":["0000:3b:00.0"],"linkTypes":["ether"],"IsRdma":false,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"mlx277dpdk","selectors":{"vendors":["15b3"],"pfNames":["ens2f1"],"linkTypes":["ether"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"mlx278dpdk","selectors":{"vendors":["15b3"],"pfNames":["ens3f1"],"rootDevices":["0000:5e:00.1"],"linkTypes":["ether"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null}]}
I0820 09:12:36.334705       1 factory.go:168] net device selector for resource intelnetdevice is &{DeviceSelectors:{Vendors:[8086] Devices:[] Drivers:[] PciAddresses:[]} PfNames:[ens1f0] RootDevices:[0000:3b:00.0] LinkTypes:[ether] DDPProfiles:[] IsRdma:false NeedVhostNet:false}
I0820 09:12:36.334787       1 factory.go:168] net device selector for resource mlx277dpdk is &{DeviceSelectors:{Vendors:[15b3] Devices:[] Drivers:[] PciAddresses:[]} PfNames:[ens2f1] RootDevices:[] LinkTypes:[ether] DDPProfiles:[] IsRdma:true NeedVhostNet:false}
I0820 09:12:36.334842       1 factory.go:168] net device selector for resource mlx278dpdk is &{DeviceSelectors:{Vendors:[15b3] Devices:[] Drivers:[] PciAddresses:[]} PfNames:[ens3f1] RootDevices:[0000:5e:00.1] LinkTypes:[ether] DDPProfiles:[] IsRdma:true NeedVhostNet:false}
I0820 09:12:36.334870       1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:intelnetdevice DeviceType:netDevice Selectors:0xc000130d08 SelectorObj:0xc00025ca90} {ResourcePrefix: ResourceName:mlx277dpdk DeviceType:netDevice Selectors:0xc000130d20 SelectorObj:0xc00025cea0} {ResourcePrefix: ResourceName:mlx278dpdk DeviceType:netDevice Selectors:0xc000130d38 SelectorObj:0xc00025d110}]
I0820 09:12:36.334980       1 manager.go:193] validating resource name "openshift.io/intelnetdevice"
I0820 09:12:36.335032       1 manager.go:193] validating resource name "openshift.io/mlx277dpdk"
I0820 09:12:36.335060       1 manager.go:193] validating resource name "openshift.io/mlx278dpdk"
I0820 09:12:36.335073       1 main.go:60] Discovering host devices
I0820 09:12:36.423991       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:18:00.0	02          	Intel Corporation   	I350 Gigabit Network Connection         
I0820 09:12:36.424369       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:18:00.1	02          	Intel Corporation   	I350 Gigabit Network Connection         
I0820 09:12:36.424611       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:18:00.2	02          	Intel Corporation   	I350 Gigabit Network Connection         
I0820 09:12:36.424764       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:18:00.3	02          	Intel Corporation   	I350 Gigabit Network Connection         
I0820 09:12:36.424927       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:3b:00.0	02          	Intel Corporation   	Ethernet Controller XXV710 for 25GbE ...
I0820 09:12:36.425118       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:3b:00.1	02          	Intel Corporation   	Ethernet Controller XXV710 for 25GbE ...
I0820 09:12:36.425268       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:3b:02.0	02          	Intel Corporation   	Ethernet Virtual Function 700 Series    
I0820 09:12:36.425402       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:3b:02.1	02          	Intel Corporation   	Ethernet Virtual Function 700 Series    
I0820 09:12:36.425541       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:5e:00.0	02          	Mellanox Technolo...	MT27800 Family [ConnectX-5]             
I0820 09:12:36.425700       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:5e:00.1	02          	Mellanox Technolo...	MT27800 Family [ConnectX-5]             
I0820 09:12:36.426753       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:5e:00.4	02          	Mellanox Technolo...	MT27800 Family [ConnectX-5 Virtual Fu...
I0820 09:12:36.427180       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:5e:00.5	02          	Mellanox Technolo...	MT27800 Family [ConnectX-5 Virtual Fu...
I0820 09:12:36.427588       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:60:00.0	02          	Mellanox Technolo...	MT27710 Family [ConnectX-4 Lx]          
I0820 09:12:36.427985       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:60:00.1	02          	Mellanox Technolo...	MT27710 Family [ConnectX-4 Lx]          
I0820 09:12:36.428422       1 main.go:66] Initializing resource servers
I0820 09:12:36.428456       1 manager.go:112] number of config: 3
I0820 09:12:36.428469       1 manager.go:115] 
I0820 09:12:36.428480       1 manager.go:116] Creating new ResourcePool: intelnetdevice
I0820 09:12:36.428504       1 manager.go:117] DeviceType: netDevice
I0820 09:12:36.439228       1 factory.go:108] device added: [pciAddr: 0000:3b:02.0, vendor: 8086, device: 154c, driver: iavf]
I0820 09:12:36.439242       1 factory.go:108] device added: [pciAddr: 0000:3b:02.1, vendor: 8086, device: 154c, driver: iavf]
I0820 09:12:36.439259       1 manager.go:145] New resource server is created for intelnetdevice ResourcePool
I0820 09:12:36.439265       1 manager.go:115] 
I0820 09:12:36.439269       1 manager.go:116] Creating new ResourcePool: mlx277dpdk
I0820 09:12:36.439273       1 manager.go:117] DeviceType: netDevice
W0820 09:12:36.439303       1 pciNetDevice.go:55] RDMA resources for 0000:18:00.0 not found. Are RDMA modules loaded?
W0820 09:12:36.439557       1 pciNetDevice.go:55] RDMA resources for 0000:18:00.1 not found. Are RDMA modules loaded?
W0820 09:12:36.439800       1 pciNetDevice.go:55] RDMA resources for 0000:18:00.2 not found. Are RDMA modules loaded?
W0820 09:12:36.439986       1 pciNetDevice.go:55] RDMA resources for 0000:18:00.3 not found. Are RDMA modules loaded?
W0820 09:12:36.440158       1 pciNetDevice.go:55] RDMA resources for 0000:3b:00.1 not found. Are RDMA modules loaded?
W0820 09:12:36.440356       1 pciNetDevice.go:55] RDMA resources for 0000:3b:02.0 not found. Are RDMA modules loaded?
W0820 09:12:36.440683       1 pciNetDevice.go:55] RDMA resources for 0000:3b:02.1 not found. Are RDMA modules loaded?
I0820 09:12:36.444276       1 factory.go:108] device added: [pciAddr: 0000:60:00.1, vendor: 15b3, device: 1015, driver: mlx5_core]
I0820 09:12:36.444296       1 manager.go:145] New resource server is created for mlx277dpdk ResourcePool
I0820 09:12:36.444301       1 manager.go:115] 
I0820 09:12:36.444305       1 manager.go:116] Creating new ResourcePool: mlx278dpdk
I0820 09:12:36.444310       1 manager.go:117] DeviceType: netDevice
W0820 09:12:36.444337       1 pciNetDevice.go:55] RDMA resources for 0000:18:00.0 not found. Are RDMA modules loaded?
W0820 09:12:36.444580       1 pciNetDevice.go:55] RDMA resources for 0000:18:00.1 not found. Are RDMA modules loaded?
W0820 09:12:36.444812       1 pciNetDevice.go:55] RDMA resources for 0000:18:00.2 not found. Are RDMA modules loaded?
W0820 09:12:36.444981       1 pciNetDevice.go:55] RDMA resources for 0000:18:00.3 not found. Are RDMA modules loaded?
W0820 09:12:36.445148       1 pciNetDevice.go:55] RDMA resources for 0000:3b:00.1 not found. Are RDMA modules loaded?
W0820 09:12:36.445317       1 pciNetDevice.go:55] RDMA resources for 0000:3b:02.0 not found. Are RDMA modules loaded?
W0820 09:12:36.445636       1 pciNetDevice.go:55] RDMA resources for 0000:3b:02.1 not found. Are RDMA modules loaded?
I0820 09:12:36.449421       1 factory.go:108] device added: [pciAddr: 0000:5e:00.4, vendor: 15b3, device: 1018, driver: mlx5_core]
I0820 09:12:36.449432       1 factory.go:108] device added: [pciAddr: 0000:5e:00.5, vendor: 15b3, device: 1018, driver: mlx5_core]
I0820 09:12:36.449468       1 manager.go:145] New resource server is created for mlx278dpdk ResourcePool
I0820 09:12:36.449473       1 main.go:72] Starting all servers...
I0820 09:12:36.449685       1 server.go:196] starting intelnetdevice device plugin endpoint at: openshift.io_intelnetdevice.sock
I0820 09:12:36.450379       1 server.go:222] intelnetdevice device plugin endpoint started serving
I0820 09:12:36.450483       1 server.go:196] starting mlx277dpdk device plugin endpoint at: openshift.io_mlx277dpdk.sock
I0820 09:12:36.451062       1 server.go:222] mlx277dpdk device plugin endpoint started serving
I0820 09:12:36.451385       1 server.go:196] starting mlx278dpdk device plugin endpoint at: openshift.io_mlx278dpdk.sock
I0820 09:12:36.451974       1 server.go:222] mlx278dpdk device plugin endpoint started serving
I0820 09:12:36.452014       1 main.go:77] All servers started.
I0820 09:12:36.452029       1 main.go:78] Listening for term signals
I0820 09:12:37.206376       1 server.go:106] Plugin: openshift.io_mlx277dpdk.sock gets registered successfully at Kubelet
I0820 09:12:37.206892       1 server.go:131] ListAndWatch(mlx277dpdk) invoked
I0820 09:12:37.207005       1 server.go:106] Plugin: openshift.io_mlx278dpdk.sock gets registered successfully at Kubelet
I0820 09:12:37.206931       1 server.go:139] ListAndWatch(mlx277dpdk): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:60:00.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},},}
I0820 09:12:37.207226       1 server.go:131] ListAndWatch(intelnetdevice) invoked
I0820 09:12:37.207284       1 server.go:131] ListAndWatch(mlx278dpdk) invoked
I0820 09:12:37.207276       1 server.go:139] ListAndWatch(intelnetdevice): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:3b:02.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:3b:02.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},},}
I0820 09:12:37.207394       1 server.go:106] Plugin: openshift.io_intelnetdevice.sock gets registered successfully at Kubelet
I0820 09:12:37.207311       1 server.go:139] ListAndWatch(mlx278dpdk): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:5e:00.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:5e:00.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},},}
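
To make the discrepancy easier to see, the "Creating new ResourcePool" and "device added" lines in the log above can be tallied with a short script (illustrative only; the log is abbreviated to the relevant lines):

```python
import re

# Abbreviated device-plugin log from above: pool headers and device additions.
log = """\
manager.go:116] Creating new ResourcePool: intelnetdevice
factory.go:108] device added: [pciAddr: 0000:3b:02.0, vendor: 8086, device: 154c, driver: iavf]
factory.go:108] device added: [pciAddr: 0000:3b:02.1, vendor: 8086, device: 154c, driver: iavf]
manager.go:116] Creating new ResourcePool: mlx277dpdk
factory.go:108] device added: [pciAddr: 0000:60:00.1, vendor: 15b3, device: 1015, driver: mlx5_core]
manager.go:116] Creating new ResourcePool: mlx278dpdk
factory.go:108] device added: [pciAddr: 0000:5e:00.4, vendor: 15b3, device: 1018, driver: mlx5_core]
factory.go:108] device added: [pciAddr: 0000:5e:00.5, vendor: 15b3, device: 1018, driver: mlx5_core]
"""

counts = {}
pool = None
for line in log.splitlines():
    m = re.search(r"Creating new ResourcePool: (\S+)", line)
    if m:
        pool = m.group(1)
        counts[pool] = 0
    elif "device added:" in line and pool:
        counts[pool] += 1

# mlx277dpdk ends up with a single device even though numVfs: 2 and
# `ip link show ens2f1` reports two VFs.
print(counts)  # -> {'intelnetdevice': 2, 'mlx277dpdk': 1, 'mlx278dpdk': 2}
```

This matches the node's allocatable resources after the reboot: intelnetdevice 2, mlx277dpdk 1, mlx278dpdk 2.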

Expected results:


Additional info:

Restarting the sriov-device-plugin pod fixes this issue.
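
The workaround can be sketched as follows: delete the plugin pod on the affected node so its DaemonSet recreates it and device discovery re-runs. The `app=sriov-device-plugin` label is an assumption based on the operator's defaults; adjust for your cluster.

```shell
# Delete the device-plugin pod on the affected node; the DaemonSet recreates it.
# The pod label and node name below are assumptions, adjust for your cluster.
oc -n openshift-sriov-network-operator delete pod \
  -l app=sriov-device-plugin \
  --field-selector spec.nodeName=dell-per740-14.rhts.eng.pek2.redhat.com
```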

Comment 3 zhaozhanqi 2022-07-08 02:37:38 UTC
Balazs, if this fix needs pre-testing, can you help provide an image for testing?

Comment 6 Balazs Nemeth 2022-07-11 14:58:45 UTC
After discussing with Sebastian, he assumes that the issue won't be reproducible anymore in 4.9 due to a patch that excludes PFs from pools. @zhaozhanqi Please take that into account when checking this.

Comment 8 zhaozhanqi 2022-08-08 07:26:08 UTC
(In reply to Balazs Nemeth from comment #6)
> After discussing with Sebastian, he assumes that the issue won't be
> reproducible anymore in 4.9 due to a patch that excludes PFs from pools.
> @zhaozhanqi Please take that into account when checking this.

Yes, it seems this issue has already been fixed and can no longer be reproduced.

Moving this bug to VERIFIED.

Comment 12 errata-xmlrpc 2022-09-12 12:22:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.9.48 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6317

Comment 13 Red Hat Bugzilla 2023-09-15 01:35:38 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days