Bug 2043336 - Creating multiple SriovNetworkNodePolicy objects leaves the worker stuck in the draining state
Summary: Creating multiple SriovNetworkNodePolicy objects leaves the worker stuck in the draining state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Peng Liu
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-01-21 03:48 UTC by zhaozhanqi
Modified: 2022-08-10 10:43 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:43:12 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 10:43:28 UTC)

Description zhaozhanqi 2022-01-21 03:48:24 UTC
Description of problem:

When multiple SriovNetworkNodePolicy objects are created with the following steps, one worker node stays marked as draining and VF initialization is blocked:

Create two YAML files with policy definitions:
file1.yaml: policies for worker1 nic1 and worker2 nic1
file2.yaml: policies for worker1 nic2 and worker2 nic2
Apply file1.yaml.
Wait until worker1 starts rebooting.
Apply file2.yaml.
Wait until worker1 has come back up.

Version-Release number of selected component (if applicable):
OCP: 4.10.0-fc.1
RHCOS: Red Hat Enterprise Linux CoreOS 410.84.202201122058-0
Kernel: 4.18.0-305.30.1.el8_4.x86_64
CRI-O: cri-o://1.23.0-100.rhaos4.10.git77d20b2.el8

SR-IOV operator version: 4.10.0-202201181018

How reproducible:


Steps to Reproduce:
1. Set up a cluster with the SR-IOV operator installed.
2. Make sure two worker nodes have supported SR-IOV NICs.
3. Create the following files:

# cat mlx277-rdma
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx277-dpdk
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1500
  nicSelector:
    pfNames:
      - ens2f1
    vendor: '15b3'
    deviceID: '1015'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 3
  isRdma: true
  resourceName: mlx277dpdk



# cat mlx278-rdma 
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx278-dpdk
  namespace: openshift-sriov-network-operator
spec:
  mtu: 1550
  nicSelector:
    pfNames:
      - ens3f1
    rootDevices:
      - '0000:5e:00.1'
    vendor: '15b3'
  nodeSelector:
    feature.node.kubernetes.io/sriov-capable: 'true'
  numVfs: 3
  isRdma: true
  resourceName: mlx278dpdk


4. Apply the first YAML file:
oc create -f mlx277-rdma

5. Watch the node status; when the node starts rebooting, apply the second YAML file.

# oc get node
NAME                                      STATUS                        ROLES    AGE   VERSION
dell-per740-13.rhts.eng.pek2.redhat.com   Ready                         master   57d   v1.23.0+50f645e
dell-per740-14.rhts.eng.pek2.redhat.com   NotReady,SchedulingDisabled   worker   56d   v1.23.0+50f645e
dell-per740-31.rhts.eng.pek2.redhat.com   Ready                         master   57d   v1.23.0+50f645e
dell-per740-32.rhts.eng.pek2.redhat.com   Ready                         master   57d   v1.23.0+50f645e
dell-per740-35.rhts.eng.pek2.redhat.com   Ready                         worker   56d   v1.23.0+50f645e


#### Check that the worker above is rebooting

### Then apply the second YAML file:

# oc create -f mlx278-rdma
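The timing between the two `oc create` commands is the trigger here. The sequence can be sketched as a script (the wait loop and its jsonpath readiness check are illustrative additions, not commands from the original report, and require a live cluster):

```shell
#!/bin/sh
# Apply the first policy.
oc create -f mlx277-rdma

# Wait until the worker goes NotReady, i.e. it has started rebooting.
NODE=dell-per740-14.rhts.eng.pek2.redhat.com
until oc get node "$NODE" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' \
    | grep -qv True; do
  sleep 5
done

# While the worker is still rebooting, apply the second policy.
oc create -f mlx278-rdma
```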


After that, one node was found marked as `SchedulingDisabled`:


# oc get node
NAME                                      STATUS                     ROLES    AGE   VERSION
dell-per740-13.rhts.eng.pek2.redhat.com   Ready,SchedulingDisabled   master   57d   v1.23.0+50f645e
dell-per740-14.rhts.eng.pek2.redhat.com   Ready                      worker   57d   v1.23.0+50f645e
dell-per740-31.rhts.eng.pek2.redhat.com   Ready                      master   57d   v1.23.0+50f645e
dell-per740-32.rhts.eng.pek2.redhat.com   Ready                      master   57d   v1.23.0+50f645e
dell-per740-35.rhts.eng.pek2.redhat.com   Ready                      worker   57d   v1.23.0+50f645e



Actual results:

The sriov-config daemon stays stuck in progress; see detailed logs in Additional info:

# oc get sriovnetworknodestates.sriovnetwork.openshift.io  dell-per740-14.rhts.eng.pek2.redhat.com -o yaml

...
  - mtu: 1500
    name: ens2f1
    numVfs: 2
    pciAddress: 0000:60:00.1
    totalvfs: 2
    vendor: 15b3
  syncStatus: InProgress
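The stuck state can be confirmed by polling syncStatus directly (this one-liner is an assumed convenience, not a command from the original report; the field path matches the output above and it requires a live cluster):

```shell
oc get sriovnetworknodestates.sriovnetwork.openshift.io \
  dell-per740-14.rhts.eng.pek2.redhat.com \
  -n openshift-sriov-network-operator \
  -o jsonpath='{.status.syncStatus}'
# Stays "InProgress" indefinitely instead of reaching "Succeeded".
```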


Expected results:

All policies are applied and the node's syncStatus reaches Succeeded.

Additional info:

Detailed SR-IOV logs:

http://file.apac.redhat.com/~zzhao/sriovlog.tar.gz

Comment 7 zhaozhanqi 2022-02-15 06:32:54 UTC
Moving this to VERIFIED so that the fix can be backported.

Comment 9 errata-xmlrpc 2022-08-10 10:43:12 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

