Bug 2043336 - Creating multiple SriovNetworkNodePolicy objects leaves the worker stuck in the draining state
Summary: Creating multiple SriovNetworkNodePolicy objects leaves the worker stuck in the draining state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Peng Liu
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-21 03:48 UTC by zhaozhanqi
Modified: 2022-08-10 10:43 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:43:12 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 10:43:28 UTC)

Description zhaozhanqi 2022-01-21 03:48:24 UTC
Description of problem:

When multiple SriovNetworkNodePolicy objects are created with the following steps, one worker node stays marked as draining and VF initialization is blocked:

Create two YAML files with policy definitions:
file1.yaml: policies for worker1 nic1 and worker2 nic1
file2.yaml: policies for worker1 nic2 and worker2 nic2
Apply file1.yaml.
Wait until worker1 starts rebooting.
Apply file2.yaml.
Wait until worker1 has come back up.

Version-Release number of selected component (if applicable):
OCP: 4.10.0-fc.1
RHCOS: Red Hat Enterprise Linux CoreOS 410.84.202201122058-0
Kernel: 4.18.0-305.30.1.el8_4.x86_64
CRI-O: cri-o://1.23.0-100.rhaos4.10.git77d20b2.el8

SR-IOV operator version: 4.10.0-202201181018

How reproducible:


Steps to Reproduce:
1. Set up a cluster with the SR-IOV operator installed.
2. Make sure two worker nodes have supported SR-IOV NICs.
3. Create the following files:

# cat mlx277-rdma
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx277-dpdk
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1500
  nicSelector:
    pfNames:
      - ens2f1
    vendor: '15b3'
    deviceID: '1015'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 3
  isRdma: true
  resourceName: mlx277dpdk



# cat mlx278-rdma 
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx278-dpdk
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1550
  nicSelector:
    pfNames:
      - ens3f1
    rootDevices:
      - '0000:5e:00.1'
    vendor: '15b3'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 3
  isRdma: true
  resourceName: mlx278dpdk


4. Apply the first YAML file:
oc create -f mlx277-rdma

5. Watch the node status; when the node starts rebooting, apply the second YAML file.

# oc get node
NAME                                      STATUS                        ROLES    AGE   VERSION
dell-per740-13.rhts.eng.pek2.redhat.com   Ready                         master   57d   v1.23.0+50f645e
dell-per740-14.rhts.eng.pek2.redhat.com   NotReady,SchedulingDisabled   worker   56d   v1.23.0+50f645e
dell-per740-31.rhts.eng.pek2.redhat.com   Ready                         master   57d   v1.23.0+50f645e
dell-per740-32.rhts.eng.pek2.redhat.com   Ready                         master   57d   v1.23.0+50f645e
dell-per740-35.rhts.eng.pek2.redhat.com   Ready                         worker   56d   v1.23.0+50f645e


#### Check that the worker above is rebooting

### Then apply the second YAML file:

# oc create -f mlx278-rdma
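The timing between the two `oc create` commands is the trigger here. The sequence can be sketched as a script (the wait loop and its jsonpath readiness check are illustrative additions, not commands from the original report, and require a live cluster):

```shell
#!/bin/sh
# Apply the first policy.
oc create -f mlx277-rdma

# Wait until the worker goes NotReady, i.e. it has started rebooting.
NODE=dell-per740-14.rhts.eng.pek2.redhat.com
until oc get node "$NODE" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' \
    | grep -qv True; do
  sleep 5
done

# While the worker is still rebooting, apply the second policy.
oc create -f mlx278-rdma
```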


After that, one node was found marked as `SchedulingDisabled`:


# oc get node
NAME                                      STATUS                     ROLES    AGE   VERSION
dell-per740-13.rhts.eng.pek2.redhat.com   Ready,SchedulingDisabled   master   57d   v1.23.0+50f645e
dell-per740-14.rhts.eng.pek2.redhat.com   Ready                      worker   57d   v1.23.0+50f645e
dell-per740-31.rhts.eng.pek2.redhat.com   Ready                      master   57d   v1.23.0+50f645e
dell-per740-32.rhts.eng.pek2.redhat.com   Ready                      master   57d   v1.23.0+50f645e
dell-per740-35.rhts.eng.pek2.redhat.com   Ready                      worker   57d   v1.23.0+50f645e



Actual results:

The sriov-config daemon stays stuck in progress; see detailed logs in Additional info:

# oc get sriovnetworknodestates.sriovnetwork.openshift.io  dell-per740-14.rhts.eng.pek2.redhat.com -o yaml

...
  - mtu: 1500
    name: ens2f1
    numVfs: 2
    pciAddress: 0000:60:00.1
    totalvfs: 2
    vendor: 15b3
  syncStatus: InProgress
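The stuck state can be confirmed by polling syncStatus directly (this one-liner is an assumed convenience, not a command from the original report; the field path matches the output above and it requires a live cluster):

```shell
oc get sriovnetworknodestates.sriovnetwork.openshift.io \
  dell-per740-14.rhts.eng.pek2.redhat.com \
  -n openshift-sriov-network-operator \
  -o jsonpath='{.status.syncStatus}'
# Stays "InProgress" indefinitely instead of reaching "Succeeded".
```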


Expected results:

All policies are applied and the node's syncStatus reaches Succeeded.

Additional info:

Detailed SR-IOV logs:

http://file.apac.redhat.com/~zzhao/sriovlog.tar.gz

Comment 7 zhaozhanqi 2022-02-15 06:32:54 UTC
Moving this to VERIFIED so that the fix can be backported.

Comment 9 errata-xmlrpc 2022-08-10 10:43:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

