Bug 1955874

Summary: Webscale: sriov vfs are not created and sriovnetworknodestate indicates sync succeeded - state is not correct
Product: OpenShift Container Platform Reporter: Nabeel Cocker <ncocker>
Component: Networking Assignee: zenghui.shi <zshi>
Networking sub component: SR-IOV QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: anbhat, bbennett, skanakal, zshi
Version: 4.8   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1958467 (view as bug list) Environment:
Last Closed: 2021-07-27 23:05:33 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1958467    
Attachments:
Description Flags
dmeseg, none

Description Nabeel Cocker 2021-05-01 05:46:52 UTC
Created attachment 1777999
dmeseg,

Description of problem:

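The state reported in the sriovnetworknodestate does not match what the config daemon has actually configured on the node: the node state indicates that sync succeeded, but the VFs have not been created.

The VFs are not created until the config-daemon pod is deleted and, in some cases, the sriovnetworknodestate is deleted as well.

This happens when a SriovNetworkNodePolicy (nnp) is first applied to the node.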
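One way to confirm the mismatch independently of the operator is to read the VF count straight from the kernel on the affected node. The sketch below is only an illustration and is not part of the operator. It assumes it runs on the node itself, that the PF names are `ens3f0` and `ens3f1` as in the policies in this report, and it relies on the standard `sriov_numvfs` sysfs attribute. The function names are hypothetical.

```python
from pathlib import Path
from typing import Dict, Tuple


def read_vf_count(pf_name: str, sys_root: str = "/sys/class/net") -> int:
    """Return the number of VFs the kernel currently exposes for a PF."""
    numvfs_file = Path(sys_root) / pf_name / "device" / "sriov_numvfs"
    return int(numvfs_file.read_text().strip())


def find_vf_mismatches(
    expected: Dict[str, int], sys_root: str = "/sys/class/net"
) -> Dict[str, Tuple[int, int]]:
    """Map each PF whose VF count is wrong to (expected, actual)."""
    mismatches = {}
    for pf_name, want in expected.items():
        try:
            have = read_vf_count(pf_name, sys_root)
        except (FileNotFoundError, ValueError):
            # Assumption: a missing or unreadable attribute counts as "no VFs".
            have = 0
        if have != want:
            mismatches[pf_name] = (want, have)
    return mismatches
```

Calling `find_vf_mismatches({"ens3f0": 16, "ens3f1": 16})` on the node returns an empty dict when both PFs have the 16 VFs requested by `numVfs`. A non-empty result while the node state reports a successful sync would match the symptom described above.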



Version-Release number of selected component (if applicable):
OCP 4.6.17



How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
[corona@stablcurco ~]$ 
[corona@stablcurco ~]$ cat sriov-nnp.yaml 
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: ens3f0vf1115
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  mtu: 9100
  nicSelector:
    deviceID: "1017"
    pfNames:
    - ens3f0#11-14
    vendor: 15b3
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 16
  priority: 99
  resourceName: ens3f0vf1115
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: ens3f1vf1115
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  mtu: 9100
  nicSelector:
    deviceID: "1017"
    pfNames:
    - ens3f1#11-14
    vendor: 15b3
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 16
  priority: 99
  resourceName: ens3f1vf1115
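Both policies use the `pfNames` form `<pf>#<first>-<last>`, which, as far as I understand the operator's documented syntax, limits the policy to a zero-based range of VF indices on that PF. The sketch below is a hypothetical parser written for illustration, not the operator's own code. It shows how an entry such as `ens3f0#11-14` splits into a PF name and a VF range, and how that range can be checked against `numVfs`.

```python
import re
from typing import NamedTuple, Optional


class PfSelector(NamedTuple):
    pf_name: str
    first_vf: Optional[int]  # None means the entry names the whole PF
    last_vf: Optional[int]


# "<pf>" or "<pf>#<first>-<last>"
_SELECTOR_RE = re.compile(r"^(?P<pf>[^#]+)(?:#(?P<first>\d+)-(?P<last>\d+))?$")


def parse_pf_selector(entry: str) -> PfSelector:
    """Split a pfNames entry into the PF name and optional VF index range."""
    match = _SELECTOR_RE.match(entry)
    if match is None:
        raise ValueError(f"unrecognised pfNames entry: {entry!r}")
    if match.group("first") is None:
        return PfSelector(match.group("pf"), None, None)
    first, last = int(match.group("first")), int(match.group("last"))
    if first > last:
        raise ValueError(f"VF range is reversed in {entry!r}")
    return PfSelector(match.group("pf"), first, last)


def range_fits(selector: PfSelector, num_vfs: int) -> bool:
    """True if every selected VF index exists when num_vfs VFs are created."""
    if selector.last_vf is None:
        return True
    # Assumption: VF indices are zero-based, so valid indices are 0..num_vfs-1.
    return selector.last_vf < num_vfs
```

With the values from this report, `parse_pf_selector("ens3f0#11-14")` yields PF `ens3f0` with VFs 11 through 14, which fits inside `numVfs: 16`.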





[core@worker-149 ~]$ cat /sys/bus/pci
pci/         pci_express/ 
[core@worker-149 ~]$ cat /sys/bus/pci/devices/0000\:d
0000:d7:00.0/ 0000:d7:0e.1/ 0000:d7:12.0/ 0000:d7:15.0/ 0000:d8:00.1/ 0000:d8:00.6/ 0000:d8:01.3/ 0000:d8:02.0/ 0000:d8:02.5/ 0000:d8:03.2/ 0000:d8:03.7/ 
0000:d7:05.0/ 0000:d7:0f.0/ 0000:d7:12.1/ 0000:d7:16.0/ 0000:d8:00.2/ 0000:d8:00.7/ 0000:d8:01.4/ 0000:d8:02.1/ 0000:d8:02.6/ 0000:d8:03.3/ 0000:d8:04.0/ 
0000:d7:05.2/ 0000:d7:0f.1/ 0000:d7:12.2/ 0000:d7:16.4/ 0000:d8:00.3/ 0000:d8:01.0/ 0000:d8:01.5/ 0000:d8:02.2/ 0000:d8:02.7/ 0000:d8:03.4/ 0000:d8:04.1/ 
0000:d7:05.4/ 0000:d7:10.0/ 0000:d7:12.4/ 0000:d7:17.0/ 0000:d8:00.4/ 0000:d8:01.1/ 0000:d8:01.6/ 0000:d8:02.3/ 0000:d8:03.0/ 0000:d8:03.5/ 
0000:d7:0e.0/ 0000:d7:10.1/ 0000:d7:12.5/ 0000:d8:00.0/ 0000:d8:00.5/ 0000:d8:01.2/ 0000:d8:01.7/ 0000:d8:02.4/ 0000:d8:03.1/ 0000:d8:03.6/ 
[core@worker-149 ~]$ cat /sys/bus/pci/devices/0000\:d8\:0
0000:d8:00.0/ 0000:d8:00.3/ 0000:d8:00.6/ 0000:d8:01.1/ 0000:d8:01.4/ 0000:d8:01.7/ 0000:d8:02.2/ 0000:d8:02.5/ 0000:d8:03.0/ 0000:d8:03.3/ 0000:d8:03.6/ 0000:d8:04.1/ 
0000:d8:00.1/ 0000:d8:00.4/ 0000:d8:00.7/ 0000:d8:01.2/ 0000:d8:01.5/ 0000:d8:02.0/ 0000:d8:02.3/ 0000:d8:02.6/ 0000:d8:03.1/ 0000:d8:03.4/ 0000:d8:03.7/ 
0000:d8:00.2/ 0000:d8:00.5/ 0000:d8:01.0/ 0000:d8:01.3/ 0000:d8:01.6/ 0000:d8:02.1/ 0000:d8:02.4/ 0000:d8:02.7/ 0000:d8:03.2/ 0000:d8:03.5/ 0000:d8:04.0/ 
[core@worker-149 ~]$ cat /sys/bus/pci/devices/0000\:d8\:00.6/
ari_enabled               d3cold_allowed            infiniband_mad/           local_cpus                numa_node                 resource0                 vendor
broken_parity_status      device                    infiniband_srp/           max_link_speed            physfn/                   resource0_wc              
class                     dma_mask_bits             infiniband_verbs/         max_link_width            pools                     revision                  
config                    driver/                   iommu/                    modalias                  power/                    subsystem/                
consistent_dma_mask_bits  driver_override           iommu_group/              msi_bus                   ptp/                      subsystem_device          
current_link_speed        enable                    irq                       msi_irqs/                 reset                     subsystem_vendor          
current_link_width        infiniband/               local_cpulist             net/                      resource                  uevent                    
[core@worker-149 ~]$ cat /sys/bus/pci/devices/0000\:d8\:00.6/n
net/       numa_node  
[core@worker-149 ~]$ cat /sys/bus/pci/devices/0000\:d8\:00.6/net/
cat: '/sys/bus/pci/devices/0000:d8:00.6/net/': Is a directory
[core@worker-149 ~]$ cd /sys/bus/pci/devices/0000\:d8\:00.6/net/ 
[core@worker-149 net]$

Comment 2 Nabeel Cocker 2021-05-05 20:05:39 UTC
Hello,

Could you please let me know what info is needed?

thank you

Nabeel

Comment 3 zenghui.shi 2021-05-07 13:26:21 UTC
upstream fix for config daemon panic (index out of range when getting VF interface name): https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/127
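To illustrate the class of failure named here, the sketch below looks up a VF's interface name from the `net/` directory under its PCI device, the same path explored in the description. It is a Python illustration under my own assumptions, not the upstream Go patch, and the function name is hypothetical. The point is the guard: a VF whose `net/` directory is empty or missing yields `None` instead of an unchecked index into an empty list.

```python
import os
from typing import Optional


def vf_interface_name(
    pci_addr: str, sys_root: str = "/sys/bus/pci/devices"
) -> Optional[str]:
    """Return the netdev name for a VF PCI address, or None if it has none."""
    net_dir = os.path.join(sys_root, pci_addr, "net")
    try:
        names = sorted(os.listdir(net_dir))
    except FileNotFoundError:
        # No net/ directory, for example when no network driver is bound.
        return None
    if not names:
        # Directory exists but is empty: indexing names[0] here would fail.
        return None
    return names[0]
```

A caller can then treat `None` as "interface not available yet" and retry or skip that VF, rather than crashing while building the node state.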

Comment 5 zhaozhanqi 2021-05-10 08:04:59 UTC
@ncocker Hi, I tried the policy above many times on an older version, but I did not hit this issue. Do you have steps to reproduce it that would help verify this bug?

Comment 6 zhaozhanqi 2021-05-11 07:52:42 UTC
Tried many times; this issue cannot be reproduced on 4.8.0-202105100942.p0.

Moving this to verified.

Comment 11 errata-xmlrpc 2021-07-27 23:05:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

Comment 12 Red Hat Bugzilla 2023-09-15 01:06:00 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days