Bug 1996013 - sriov-device-plugin cannot find the VF
Summary: sriov-device-plugin cannot find the VF
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.z
Assignee: Balazs Nemeth
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-20 10:16 UTC by zhaozhanqi
Modified: 2023-09-15 01:35 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-12 12:22:21 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHSA-2022:6317 (last updated 2022-09-12 12:22:35 UTC)

Description zhaozhanqi 2021-08-20 10:16:08 UTC
Description of problem:

After applying a new SriovNetworkNodePolicy and rebooting the worker, the sriov-device-plugin cannot find the VF.

Version-Release number of selected component (if applicable):
4.9.0-202108191042

How reproducible:
Not always

Steps to Reproduce:
1. Create the following two policies (apply commands shown below):
# cat intel-netdevice.yaml 
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: intel-netdevice
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  nicSelector:
    pfNames:
      - ens1f0
    rootDevices:
      - '0000:3b:00.0'
    vendor: '8086'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 2
  priority: 99
  resourceName: intelnetdevice
[root@dell-per740-36 rhcos]# cat mlx277-rdma 
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx277-dpdk
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1500
  nicSelector:
    pfNames:
      - ens2f1
    vendor: '15b3'
    deviceID: '1015'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 2
  isRdma: true
  resourceName: mlx277dpdk
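
Both policy files are then applied to the cluster; a minimal sketch of the apply step, assuming the file names shown above:

# oc apply -f intel-netdevice.yaml
# oc apply -f mlx277-rdma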

2. After the above policies are created, everything works as expected and both resources report 2 VFs:

# oc describe node dell-per740-14.rhts.eng.pek2.redhat.com | grep "openshift.io"
                    node.openshift.io/os_id=rhcos
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-eb1b1ade8c290e67fc77daa247a4ffbd
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-eb1b1ade8c290e67fc77daa247a4ffbd
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/ssh: accessed
                    machineconfiguration.openshift.io/state: Done
                    sriovnetwork.openshift.io/state: Idle
  openshift.io/intelnetdevice:  2
  openshift.io/mlx277dpdk:      2
  openshift.io/intelnetdevice:  2
  openshift.io/mlx277dpdk:      2
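
The same counts can be read straight off the node object; a hedged alternative to the describe/grep above, assuming jq is available on the admin host:

# oc get node dell-per740-14.rhts.eng.pek2.redhat.com -o json | jq '.status.allocatable'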
 
3. Apply another policy as follows:

# cat mlx278-rdma 
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx278-dpdk
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1550
  nicSelector:
    pfNames:
      - ens3f1
    rootDevices:
      - '0000:5e:00.1'
    vendor: '15b3'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 2
  isRdma: true
  resourceName: mlx278dpdk
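
After a policy is applied, the operator drains and reboots the node; progress can be followed on the node's SriovNetworkNodeState object, which is named after the node. A sketch of the check:

# oc get sriovnetworknodestates -n openshift-sriov-network-operator dell-per740-14.rhts.eng.pek2.redhat.com -o jsonpath='{.status.syncStatus}{"\n"}'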

4. After the worker reboots, "openshift.io/mlx277dpdk" drops to 1 (in the grep output below, the first block of resource lines is the node Capacity, the second is Allocatable, and the third, with the "0 0" columns, is Allocated resources):

# oc describe node dell-per740-14.rhts.eng.pek2.redhat.com | grep "openshift.io"
                    node.openshift.io/os_id=rhcos
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-eb1b1ade8c290e67fc77daa247a4ffbd
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-eb1b1ade8c290e67fc77daa247a4ffbd
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/ssh: accessed
                    machineconfiguration.openshift.io/state: Done
                    sriovnetwork.openshift.io/state: Idle
  openshift.io/intelnetdevice:  2
  openshift.io/mlx277dpdk:      1
  openshift.io/mlx278dpdk:      2
  openshift.io/intelnetdevice:  2
  openshift.io/mlx277dpdk:      1
  openshift.io/mlx278dpdk:      2
  openshift.io/intelnetdevice  0            0
  openshift.io/mlx277dpdk      0            0
  openshift.io/mlx278dpdk      0            0
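
To watch just the affected resource, capacity and allocatable can be read side by side; a sketch using jsonpath key escaping as supported by oc:

# oc get node dell-per740-14.rhts.eng.pek2.redhat.com -o jsonpath='{.status.capacity.openshift\.io/mlx277dpdk}{" "}{.status.allocatable.openshift\.io/mlx277dpdk}{"\n"}'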

5. The VFs are actually initialized on the PF:

# ip link show ens2f1
11: ens2f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 98:03:9b:48:dd:99 brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 32:be:05:49:2f:d3 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 1     link/ether e6:80:d7:e7:b4:13 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
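
The VF PCI functions behind ens2f1 can also be confirmed from sysfs, where the PF's virtfn symlinks point at the VF PCI addresses:

# ls -l /sys/class/net/ens2f1/device/virtfn*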

6. Check the sriov-device-plugin pod logs; see Actual results below.

Actual results:


# oc logs sriov-device-plugin-9f62d
I0820 09:12:36.334062       1 manager.go:52] Using Kubelet Plugin Registry Mode
I0820 09:12:36.334298       1 main.go:44] resource manager reading configs
I0820 09:12:36.334377       1 manager.go:82] raw ResourceList: {"resourceList":[{"resourceName":"intelnetdevice","selectors":{"vendors":["8086"],"pfNames":["ens1f0"],"rootDevices":["0000:3b:00.0"],"linkTypes":["ether"],"IsRdma":false,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"mlx277dpdk","selectors":{"vendors":["15b3"],"pfNames":["ens2f1"],"linkTypes":["ether"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null},{"resourceName":"mlx278dpdk","selectors":{"vendors":["15b3"],"pfNames":["ens3f1"],"rootDevices":["0000:5e:00.1"],"linkTypes":["ether"],"IsRdma":true,"NeedVhostNet":false},"SelectorObj":null}]}
I0820 09:12:36.334705       1 factory.go:168] net device selector for resource intelnetdevice is &{DeviceSelectors:{Vendors:[8086] Devices:[] Drivers:[] PciAddresses:[]} PfNames:[ens1f0] RootDevices:[0000:3b:00.0] LinkTypes:[ether] DDPProfiles:[] IsRdma:false NeedVhostNet:false}
I0820 09:12:36.334787       1 factory.go:168] net device selector for resource mlx277dpdk is &{DeviceSelectors:{Vendors:[15b3] Devices:[] Drivers:[] PciAddresses:[]} PfNames:[ens2f1] RootDevices:[] LinkTypes:[ether] DDPProfiles:[] IsRdma:true NeedVhostNet:false}
I0820 09:12:36.334842       1 factory.go:168] net device selector for resource mlx278dpdk is &{DeviceSelectors:{Vendors:[15b3] Devices:[] Drivers:[] PciAddresses:[]} PfNames:[ens3f1] RootDevices:[0000:5e:00.1] LinkTypes:[ether] DDPProfiles:[] IsRdma:true NeedVhostNet:false}
I0820 09:12:36.334870       1 manager.go:106] unmarshalled ResourceList: [{ResourcePrefix: ResourceName:intelnetdevice DeviceType:netDevice Selectors:0xc000130d08 SelectorObj:0xc00025ca90} {ResourcePrefix: ResourceName:mlx277dpdk DeviceType:netDevice Selectors:0xc000130d20 SelectorObj:0xc00025cea0} {ResourcePrefix: ResourceName:mlx278dpdk DeviceType:netDevice Selectors:0xc000130d38 SelectorObj:0xc00025d110}]
I0820 09:12:36.334980       1 manager.go:193] validating resource name "openshift.io/intelnetdevice"
I0820 09:12:36.335032       1 manager.go:193] validating resource name "openshift.io/mlx277dpdk"
I0820 09:12:36.335060       1 manager.go:193] validating resource name "openshift.io/mlx278dpdk"
I0820 09:12:36.335073       1 main.go:60] Discovering host devices
I0820 09:12:36.423991       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:18:00.0	02          	Intel Corporation   	I350 Gigabit Network Connection         
I0820 09:12:36.424369       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:18:00.1	02          	Intel Corporation   	I350 Gigabit Network Connection         
I0820 09:12:36.424611       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:18:00.2	02          	Intel Corporation   	I350 Gigabit Network Connection         
I0820 09:12:36.424764       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:18:00.3	02          	Intel Corporation   	I350 Gigabit Network Connection         
I0820 09:12:36.424927       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:3b:00.0	02          	Intel Corporation   	Ethernet Controller XXV710 for 25GbE ...
I0820 09:12:36.425118       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:3b:00.1	02          	Intel Corporation   	Ethernet Controller XXV710 for 25GbE ...
I0820 09:12:36.425268       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:3b:02.0	02          	Intel Corporation   	Ethernet Virtual Function 700 Series    
I0820 09:12:36.425402       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:3b:02.1	02          	Intel Corporation   	Ethernet Virtual Function 700 Series    
I0820 09:12:36.425541       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:5e:00.0	02          	Mellanox Technolo...	MT27800 Family [ConnectX-5]             
I0820 09:12:36.425700       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:5e:00.1	02          	Mellanox Technolo...	MT27800 Family [ConnectX-5]             
I0820 09:12:36.426753       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:5e:00.4	02          	Mellanox Technolo...	MT27800 Family [ConnectX-5 Virtual Fu...
I0820 09:12:36.427180       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:5e:00.5	02          	Mellanox Technolo...	MT27800 Family [ConnectX-5 Virtual Fu...
I0820 09:12:36.427588       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:60:00.0	02          	Mellanox Technolo...	MT27710 Family [ConnectX-4 Lx]          
I0820 09:12:36.427985       1 netDeviceProvider.go:78] netdevice AddTargetDevices(): device found: 0000:60:00.1	02          	Mellanox Technolo...	MT27710 Family [ConnectX-4 Lx]          
I0820 09:12:36.428422       1 main.go:66] Initializing resource servers
I0820 09:12:36.428456       1 manager.go:112] number of config: 3
I0820 09:12:36.428469       1 manager.go:115] 
I0820 09:12:36.428480       1 manager.go:116] Creating new ResourcePool: intelnetdevice
I0820 09:12:36.428504       1 manager.go:117] DeviceType: netDevice
I0820 09:12:36.439228       1 factory.go:108] device added: [pciAddr: 0000:3b:02.0, vendor: 8086, device: 154c, driver: iavf]
I0820 09:12:36.439242       1 factory.go:108] device added: [pciAddr: 0000:3b:02.1, vendor: 8086, device: 154c, driver: iavf]
I0820 09:12:36.439259       1 manager.go:145] New resource server is created for intelnetdevice ResourcePool
I0820 09:12:36.439265       1 manager.go:115] 
I0820 09:12:36.439269       1 manager.go:116] Creating new ResourcePool: mlx277dpdk
I0820 09:12:36.439273       1 manager.go:117] DeviceType: netDevice
W0820 09:12:36.439303       1 pciNetDevice.go:55] RDMA resources for 0000:18:00.0 not found. Are RDMA modules loaded?
W0820 09:12:36.439557       1 pciNetDevice.go:55] RDMA resources for 0000:18:00.1 not found. Are RDMA modules loaded?
W0820 09:12:36.439800       1 pciNetDevice.go:55] RDMA resources for 0000:18:00.2 not found. Are RDMA modules loaded?
W0820 09:12:36.439986       1 pciNetDevice.go:55] RDMA resources for 0000:18:00.3 not found. Are RDMA modules loaded?
W0820 09:12:36.440158       1 pciNetDevice.go:55] RDMA resources for 0000:3b:00.1 not found. Are RDMA modules loaded?
W0820 09:12:36.440356       1 pciNetDevice.go:55] RDMA resources for 0000:3b:02.0 not found. Are RDMA modules loaded?
W0820 09:12:36.440683       1 pciNetDevice.go:55] RDMA resources for 0000:3b:02.1 not found. Are RDMA modules loaded?
I0820 09:12:36.444276       1 factory.go:108] device added: [pciAddr: 0000:60:00.1, vendor: 15b3, device: 1015, driver: mlx5_core]
I0820 09:12:36.444296       1 manager.go:145] New resource server is created for mlx277dpdk ResourcePool
I0820 09:12:36.444301       1 manager.go:115] 
I0820 09:12:36.444305       1 manager.go:116] Creating new ResourcePool: mlx278dpdk
I0820 09:12:36.444310       1 manager.go:117] DeviceType: netDevice
W0820 09:12:36.444337       1 pciNetDevice.go:55] RDMA resources for 0000:18:00.0 not found. Are RDMA modules loaded?
W0820 09:12:36.444580       1 pciNetDevice.go:55] RDMA resources for 0000:18:00.1 not found. Are RDMA modules loaded?
W0820 09:12:36.444812       1 pciNetDevice.go:55] RDMA resources for 0000:18:00.2 not found. Are RDMA modules loaded?
W0820 09:12:36.444981       1 pciNetDevice.go:55] RDMA resources for 0000:18:00.3 not found. Are RDMA modules loaded?
W0820 09:12:36.445148       1 pciNetDevice.go:55] RDMA resources for 0000:3b:00.1 not found. Are RDMA modules loaded?
W0820 09:12:36.445317       1 pciNetDevice.go:55] RDMA resources for 0000:3b:02.0 not found. Are RDMA modules loaded?
W0820 09:12:36.445636       1 pciNetDevice.go:55] RDMA resources for 0000:3b:02.1 not found. Are RDMA modules loaded?
I0820 09:12:36.449421       1 factory.go:108] device added: [pciAddr: 0000:5e:00.4, vendor: 15b3, device: 1018, driver: mlx5_core]
I0820 09:12:36.449432       1 factory.go:108] device added: [pciAddr: 0000:5e:00.5, vendor: 15b3, device: 1018, driver: mlx5_core]
I0820 09:12:36.449468       1 manager.go:145] New resource server is created for mlx278dpdk ResourcePool
I0820 09:12:36.449473       1 main.go:72] Starting all servers...
I0820 09:12:36.449685       1 server.go:196] starting intelnetdevice device plugin endpoint at: openshift.io_intelnetdevice.sock
I0820 09:12:36.450379       1 server.go:222] intelnetdevice device plugin endpoint started serving
I0820 09:12:36.450483       1 server.go:196] starting mlx277dpdk device plugin endpoint at: openshift.io_mlx277dpdk.sock
I0820 09:12:36.451062       1 server.go:222] mlx277dpdk device plugin endpoint started serving
I0820 09:12:36.451385       1 server.go:196] starting mlx278dpdk device plugin endpoint at: openshift.io_mlx278dpdk.sock
I0820 09:12:36.451974       1 server.go:222] mlx278dpdk device plugin endpoint started serving
I0820 09:12:36.452014       1 main.go:77] All servers started.
I0820 09:12:36.452029       1 main.go:78] Listening for term signals
I0820 09:12:37.206376       1 server.go:106] Plugin: openshift.io_mlx277dpdk.sock gets registered successfully at Kubelet
I0820 09:12:37.206892       1 server.go:131] ListAndWatch(mlx277dpdk) invoked
I0820 09:12:37.207005       1 server.go:106] Plugin: openshift.io_mlx278dpdk.sock gets registered successfully at Kubelet
I0820 09:12:37.206931       1 server.go:139] ListAndWatch(mlx277dpdk): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:60:00.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},},}
I0820 09:12:37.207226       1 server.go:131] ListAndWatch(intelnetdevice) invoked
I0820 09:12:37.207284       1 server.go:131] ListAndWatch(mlx278dpdk) invoked
I0820 09:12:37.207276       1 server.go:139] ListAndWatch(intelnetdevice): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:3b:02.0,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:3b:02.1,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},},}
I0820 09:12:37.207394       1 server.go:106] Plugin: openshift.io_intelnetdevice.sock gets registered successfully at Kubelet
I0820 09:12:37.207311       1 server.go:139] ListAndWatch(mlx278dpdk): send devices &ListAndWatchResponse{Devices:[]*Device{&Device{ID:0000:5e:00.4,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},&Device{ID:0000:5e:00.5,Health:Healthy,Topology:&TopologyInfo{Nodes:[]*NUMANode{&NUMANode{ID:0,},},},},},}
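
Note that in this run the mlx277dpdk pool only picked up 0000:60:00.1, which the discovery list above identifies as a ConnectX-4 Lx PF (likely ens2f1 itself), and the VFs of ens2f1 never appear in the discovered device list. A quick filter to pull the pool membership out of the log:

# oc logs sriov-device-plugin-9f62d -n openshift-sriov-network-operator | grep -E 'Creating new ResourcePool|device added'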

Expected results:

After the reboot, the device plugin should rediscover both VFs of ens2f1, and openshift.io/mlx277dpdk should stay at 2.
Additional info:

Restarting the sriov-device-plugin pod works around this issue.
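
For reference, a sketch of the workaround using the pod name from this report (the actual pod name varies per node); the DaemonSet recreates the pod, which re-runs device discovery and re-registers the resource pools with the kubelet:

# oc delete pod sriov-device-plugin-9f62d -n openshift-sriov-network-operator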

Comment 3 zhaozhanqi 2022-07-08 02:37:38 UTC
Balazs, if this fix needs pre-testing, can you provide an image for testing?

Comment 6 Balazs Nemeth 2022-07-11 14:58:45 UTC
After discussing with Sebastian, he assumes that the issue won't be reproducible anymore in 4.9 due to a patch that excludes PFs from pools. @zhaozhanqi Please take that into account when checking this.
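
For context, a VF can be told apart from a PF in sysfs: only a VF has a physfn symlink pointing back at its parent PF. A minimal sketch of the distinction such a patch would rely on (illustrative only, not the operator's actual code), using the PCI address that ended up in the mlx277dpdk pool above:

# test -e /sys/bus/pci/devices/0000:60:00.1/physfn && echo VF || echo PF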

Comment 8 zhaozhanqi 2022-08-08 07:26:08 UTC
(In reply to Balazs Nemeth from comment #6)
> After discussing with Sebastian, he assumes that the issue won't be
> reproducible anymore in 4.9 due to a patch that excludes PFs from pools.
> @zhaozhanqi Please take that into account when checking this.

Yes, it seems this issue has already been fixed and cannot be reproduced now.

Moving this bug to VERIFIED.

Comment 12 errata-xmlrpc 2022-09-12 12:22:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.9.48 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:6317

Comment 13 Red Hat Bugzilla 2023-09-15 01:35:38 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

