Bug 1914066
| Summary: | [sriov] sriov dp pod crash when delete ovs HW offload policy | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | zhaozhanqi <zzhao> |
| Component: | Networking | Assignee: | zenghui.shi <zshi> |
| Networking sub component: | SR-IOV | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | ||
| Priority: | unspecified | CC: | dosmith, pliu |
| Version: | 4.7 | Keywords: | UpcomingSprint |
| Target Milestone: | --- | ||
| Target Release: | 4.7.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-24 15:51:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
The problem is that when an SR-IOV policy is deleted, the nodeAffinity of the sriov-device-plugin daemonset is not updated accordingly, so the device plugin is wrongly scheduled on a node that has no SR-IOV resources configured. The issue also occurs in a non-offload environment.

Verified this bug on 4.7.0-202101300133.p0.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
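The scheduling problem described above can be illustrated with a minimal sketch: given the per-node entries of the device-plugin-config ConfigMap, work out which nodes still have SR-IOV resources and build a nodeAffinity term restricted to them. The function names `nodes_with_resources` and `build_node_affinity` are hypothetical, the sample data is abbreviated from this report, and matching on the `kubernetes.io/hostname` label is an assumption made for illustration. The operator's actual fix may select nodes differently.

```python
import json


def nodes_with_resources(config_data):
    """Return the sorted node names whose resourceList is non-empty.

    config_data maps node name -> JSON string, as in the `data` field of
    the device-plugin-config ConfigMap.
    """
    nodes = []
    for node, raw in config_data.items():
        # {"resourceList": null} parses to None; treat it as an empty list.
        resource_list = json.loads(raw).get("resourceList") or []
        if resource_list:
            nodes.append(node)
    return sorted(nodes)


def build_node_affinity(nodes):
    """Build a nodeAffinity dict limited to the given nodes.

    Returns None when no node qualifies, because an `In` expression with
    an empty value list is not valid; what to do in that case is left to
    the caller (assumption).
    """
    if not nodes:
        return None
    return {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
                {
                    "matchExpressions": [
                        {
                            # Assumption: nodes are matched by hostname label.
                            "key": "kubernetes.io/hostname",
                            "operator": "In",
                            "values": list(nodes),
                        }
                    ]
                }
            ]
        }
    }


# Abbreviated version of the ConfigMap data shown in this report.
sample_data = {
    "sriov-worker-0": '{"resourceList":[{"resourceName":"mlxnics0"}]}',
    "sriov-worker-1": '{"resourceList":null}',
}

eligible = nodes_with_resources(sample_data)
affinity = build_node_affinity(eligible)
```

With the data from this report, only sriov-worker-0 would remain eligible, so a daemonset constrained this way would not place a device-plugin pod on sriov-worker-1.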
Description of problem:
After the OVS HW offload policy is deleted, the sriov dp (device plugin) pod crashes.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Apply an OVS HW offload policy, then delete it.
2. Check that the sriov dp pod has crashed:

    oc get pod
    NAME                                      READY   STATUS             RESTARTS   AGE
    network-resources-injector-9dm4x          1/1     Running            0          42h
    network-resources-injector-ffvrp          1/1     Running            0          42h
    network-resources-injector-qq5sk          1/1     Running            0          42h
    operator-webhook-7k4sq                    1/1     Running            0          42h
    operator-webhook-kb72c                    1/1     Running            0          42h
    operator-webhook-rl2hx                    1/1     Running            0          42h
    sriov-cni-82tn2                           2/2     Running            0          15h
    sriov-cni-rg8vn                           2/2     Running            0          15h
    sriov-device-plugin-4hpwm                 0/1     CrashLoopBackOff   7          15m
    sriov-device-plugin-4mqtt                 1/1     Running            0          4m33s
    sriov-network-config-daemon-gx8fm         1/1     Running            0          15h
    sriov-network-config-daemon-vdrwd         1/1     Running            0          4m53s
    sriov-network-operator-5955546847-sh8st   1/1     Running            0          39h

Actual results:

    # oc logs sriov-device-plugin-4hpwm
    I0108 04:05:10.690644 1 manager.go:52] Using Kubelet Plugin Registry Mode
    I0108 04:05:10.690724 1 main.go:44] resource manager reading configs
    I0108 04:05:10.690778 1 manager.go:86] raw ResourceList: {"resourceList":null}
    I0108 04:05:10.690783 1 manager.go:106] unmarshalled ResourceList: []
    E0108 04:05:10.690789 1 main.go:51] no resource configuration; exiting

    # oc get cm device-plugin-config -o yaml
    apiVersion: v1
    data:
      sriov-worker-0: '{"resourceList":[{"resourceName":"mlxnics0","selectors":{"vendors":["15b3"],"devices":["1018"],"pfNames":["ens801f1"],"rootDevices":["0000:b0:00.1"],"IsRdma":false,"NeedVhostNet":false},"SelectorObj":null}]}'
      sriov-worker-1: '{"resourceList":null}'
    kind: ConfigMap
    metadata:
      creationTimestamp: "2021-01-06T09:27:04Z"
      managedFields:
      - apiVersion: v1
        fieldsType: FieldsV1
        fieldsV1:
          f:data:
            .: {}
            f:sriov-worker-0: {}
            f:sriov-worker-1: {}
        manager: sriov-network-operator
        operation: Update
        time: "2021-01-06T09:27:04Z"
      name: device-plugin-config
      namespace: openshift-sriov-network-operator
      resourceVersion: "898768"
      uid: 7c3622c3-0624-44d3-9f0b-7486b4f7d746

Expected results:

Additional info:
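The log under Actual results suggests why the pod crash-loops: the raw config it reads is `{"resourceList":null}`, which unmarshals to an empty list, and the plugin then exits. The following is a minimal Python sketch of that startup check, written only to illustrate the behaviour seen in the log. It is not the device plugin's actual Go code, and `load_resource_list` and `NoResourceConfigError` are hypothetical names.

```python
import json


class NoResourceConfigError(Exception):
    """Raised when a node's config contains no SR-IOV resources."""


def load_resource_list(raw):
    """Parse a raw per-node config string and return its resource list.

    Mirrors the sequence in the log: read the raw ResourceList, unmarshal
    it, and give up if the result is empty.
    """
    parsed = json.loads(raw)
    # A JSON null becomes None; normalise it to an empty list.
    resource_list = parsed.get("resourceList") or []
    if not resource_list:
        raise NoResourceConfigError("no resource configuration; exiting")
    return resource_list


# The config that the crashing pod read, as shown in its log.
try:
    load_resource_list('{"resourceList":null}')
    exited = False
except NoResourceConfigError:
    exited = True
```

Because the container exits on every start, the kubelet keeps restarting it, which matches the CrashLoopBackOff status in the pod listing.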