Description of problem:
After deleting the OVS HW offload policy, the sriov device plugin (sriov dp) pod crashes.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Apply one OVS HW offload policy, then delete it.
2. Check that the sriov dp pod has crashed.

# oc get pod
NAME                                      READY   STATUS             RESTARTS   AGE
network-resources-injector-9dm4x          1/1     Running            0          42h
network-resources-injector-ffvrp          1/1     Running            0          42h
network-resources-injector-qq5sk          1/1     Running            0          42h
operator-webhook-7k4sq                    1/1     Running            0          42h
operator-webhook-kb72c                    1/1     Running            0          42h
operator-webhook-rl2hx                    1/1     Running            0          42h
sriov-cni-82tn2                           2/2     Running            0          15h
sriov-cni-rg8vn                           2/2     Running            0          15h
sriov-device-plugin-4hpwm                 0/1     CrashLoopBackOff   7          15m
sriov-device-plugin-4mqtt                 1/1     Running            0          4m33s
sriov-network-config-daemon-gx8fm         1/1     Running            0          15h
sriov-network-config-daemon-vdrwd         1/1     Running            0          4m53s
sriov-network-operator-5955546847-sh8st   1/1     Running            0          39h

Actual results:

# oc logs sriov-device-plugin-4hpwm
I0108 04:05:10.690644 1 manager.go:52] Using Kubelet Plugin Registry Mode
I0108 04:05:10.690724 1 main.go:44] resource manager reading configs
I0108 04:05:10.690778 1 manager.go:86] raw ResourceList: {"resourceList":null}
I0108 04:05:10.690783 1 manager.go:106] unmarshalled ResourceList: []
E0108 04:05:10.690789 1 main.go:51] no resource configuration; exiting

# oc get cm device-plugin-config -o yaml
apiVersion: v1
data:
  sriov-worker-0: '{"resourceList":[{"resourceName":"mlxnics0","selectors":{"vendors":["15b3"],"devices":["1018"],"pfNames":["ens801f1"],"rootDevices":["0000:b0:00.1"],"IsRdma":false,"NeedVhostNet":false},"SelectorObj":null}]}'
  sriov-worker-1: '{"resourceList":null}'
kind: ConfigMap
metadata:
  creationTimestamp: "2021-01-06T09:27:04Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        .: {}
        f:sriov-worker-0: {}
        f:sriov-worker-1: {}
    manager: sriov-network-operator
    operation: Update
    time: "2021-01-06T09:27:04Z"
  name: device-plugin-config
  namespace: openshift-sriov-network-operator
  resourceVersion: "898768"
  uid: 7c3622c3-0624-44d3-9f0b-7486b4f7d746

Expected results:

Additional info:
The problem is that when an sriov policy is deleted, the nodeAffinity of the sriov-device-plugin daemonset is not updated accordingly, which results in the device plugin being wrongly scheduled on a node that doesn't have any sriov resource configured.
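
To make the expected behaviour concrete, here is a minimal sketch (not the operator's actual code, which is written in Go) of how the set of eligible nodes could be derived from the device-plugin-config data shown in the report. The function names `nodes_with_resources` and `build_node_affinity` are hypothetical, and the use of the `kubernetes.io/hostname` label as the affinity key is an assumption made for illustration only.

```python
import json

# Data taken from the ConfigMap in this report (sriov-worker-0 entry shortened).
CONFIG_DATA = {
    "sriov-worker-0": '{"resourceList":[{"resourceName":"mlxnics0"}]}',
    "sriov-worker-1": '{"resourceList":null}',
}


def nodes_with_resources(config_data):
    """Return the sorted node names whose config lists at least one resource."""
    eligible = []
    for node, raw in config_data.items():
        resource_list = json.loads(raw).get("resourceList")
        # Both null and an empty list mean the plugin has nothing to advertise.
        if resource_list:
            eligible.append(node)
    return sorted(eligible)


def build_node_affinity(nodes):
    """Build a nodeAffinity stanza limited to the given nodes (hypothetical helper).

    Assumption: nodes are matched by the kubernetes.io/hostname label.
    Returns None for an empty list, because an "In" expression with no
    values is not a valid node selector requirement.
    """
    if not nodes:
        return None
    return {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
                {
                    "matchExpressions": [
                        {
                            "key": "kubernetes.io/hostname",
                            "operator": "In",
                            "values": list(nodes),
                        }
                    ]
                }
            ]
        }
    }


if __name__ == "__main__":
    eligible = nodes_with_resources(CONFIG_DATA)
    print(eligible)
    print(json.dumps(build_node_affinity(eligible), indent=2))
```

With the data from this report, only sriov-worker-0 is eligible, so a device plugin pod restricted this way would not land on sriov-worker-1 and exit with "no resource configuration".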
The issue also happened in a non-offload environment.
Verified this bug on 4.7.0-202101300133.p0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633