Bug 1857581

Summary: [4.5][sriov] sriov-device-plugin pods not scheduled to node with taints
Product: OpenShift Container Platform Reporter: OpenShift BugZilla Robot <openshift-bugzilla-robot>
Component: Networking Assignee: zenghui.shi <zshi>
Networking sub component: SR-IOV QA Contact: zhaozhanqi <zzhao>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: unspecified CC: zshi
Version: 4.5   
Target Milestone: ---   
Target Release: 4.5.z   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-22 12:21:26 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1857507    
Bug Blocks: 1857510, 1858668    

Description OpenShift BugZilla Robot 2020-07-16 08:03:42 UTC
+++ This bug was initially created as a clone of Bug #1857507 +++

Description of problem:

When a node enabled for SR-IOV is tainted, sriov-device-plugin pods cannot be scheduled onto the node, which in turn prevents pods that require SR-IOV from being scheduled.



Version-Release number of selected component (if applicable):


How reproducible:
always

Steps to Reproduce:
1. Taint a node that has SR-IOV enabled, e.g. oc adm taint node worker-21 worker=load-balancer:NoSchedule
2. Reboot the node.
3. The sriov-device-plugin pod does not get scheduled onto the node.

Actual results:


Expected results:
sriov-device-plugin gets scheduled onto the node.

Additional info:

sriov-device-plugin tolerations:

 tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists

None of these tolerations matches the custom taint (worker=load-balancer:NoSchedule), so the pod cannot be scheduled onto the tainted node. The other sriov pods have an additional toleration:

tolerations:
  - operator: Exists

With no key or effect specified, this toleration matches every taint, which allows them to get scheduled.
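
To illustrate why the keyed tolerations above miss the custom taint while a bare "operator: Exists" matches it, here is a minimal Python sketch of the matching rules as described in the Kubernetes taints and tolerations documentation. It is a simplified model, not the scheduler's actual code; the function names tolerates and schedulable are hypothetical, and only NoSchedule/NoExecute taints are treated as blocking.

```python
# Simplified model of Kubernetes taint/toleration matching.
# Assumption: this mirrors the documented rules only; it is not scheduler code.

def tolerates(toleration, taint):
    """Return True if one toleration matches one taint."""
    # An empty effect on the toleration matches every taint effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    # An empty key on the toleration matches every taint key.
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False
    # "Exists" ignores the value; the default operator "Equal" compares values.
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(tolerations, taints):
    """A pod fits only if every NoSchedule/NoExecute taint is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint["effect"] in ("NoSchedule", "NoExecute")
    )


# Taint applied in the reproduction steps.
taint = {"key": "worker", "value": "load-balancer", "effect": "NoSchedule"}

# Tolerations of the sriov-device-plugin pod before the fix (from this report).
device_plugin_tolerations = [
    {"effect": "NoSchedule", "key": "node-role.kubernetes.io/master", "operator": "Exists"},
    {"effect": "NoExecute", "key": "node.kubernetes.io/not-ready", "operator": "Exists"},
    {"effect": "NoExecute", "key": "node.kubernetes.io/unreachable", "operator": "Exists"},
    {"effect": "NoSchedule", "key": "node.kubernetes.io/disk-pressure", "operator": "Exists"},
    {"effect": "NoSchedule", "key": "node.kubernetes.io/memory-pressure", "operator": "Exists"},
    {"effect": "NoSchedule", "key": "node.kubernetes.io/pid-pressure", "operator": "Exists"},
    {"effect": "NoSchedule", "key": "node.kubernetes.io/unschedulable", "operator": "Exists"},
    {"effect": "NoSchedule", "key": "node.kubernetes.io/network-unavailable", "operator": "Exists"},
]

# Blanket toleration used by the other sriov pods.
blanket = [{"operator": "Exists"}]

print(schedulable(device_plugin_tolerations, [taint]))  # False
print(schedulable(blanket, [taint]))                    # True
```

Every keyed toleration fails on the key comparison because none of them names the "worker" key, whereas the blanket toleration has no key or effect to compare and so accepts any taint.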

Comment 2 zenghui.shi 2020-07-17 02:31:18 UTC
*** Bug 1857509 has been marked as a duplicate of this bug. ***

Comment 5 zhaozhanqi 2020-07-20 01:39:46 UTC
Verified this bug on 4.5.0-202007172106.p0

 oc rsh sriov-network-operator-54df58fd7b-hdv4g
sh-4.2#cat bindata/manifests/plugins/sriov-device-plugin.yaml | grep toler -A 3
      tolerations:
      - operator: Exists
      serviceAccountName: sriov-device-plugin
      containers:

Comment 7 errata-xmlrpc 2020-07-22 12:21:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2956