Bug 1914066 - [sriov] sriov dp pod crash when delete ovs HW offload policy
Summary: [sriov] sriov dp pod crash when delete ovs HW offload policy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: zenghui.shi
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-01-08 04:13 UTC by zhaozhanqi
Modified: 2021-02-24 15:51 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:51:24 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift sriov-network-operator pull 464 0 None Closed [RFE][RHV 4.5] Auto-pinning of vCPUs and NUMA nodes 2022-06-01 13:25:19 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:51:39 UTC

Description zhaozhanqi 2021-01-08 04:13:58 UTC
Description of problem:
After the OVS HW offload policy is deleted, the sriov dp (device plugin) pod crashes.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Apply an OVS HW offload policy, then delete it.
2. Check the pods: the sriov dp pod is in CrashLoopBackOff.

# oc get pod
NAME                                      READY   STATUS             RESTARTS   AGE
network-resources-injector-9dm4x          1/1     Running            0          42h
network-resources-injector-ffvrp          1/1     Running            0          42h
network-resources-injector-qq5sk          1/1     Running            0          42h
operator-webhook-7k4sq                    1/1     Running            0          42h
operator-webhook-kb72c                    1/1     Running            0          42h
operator-webhook-rl2hx                    1/1     Running            0          42h
sriov-cni-82tn2                           2/2     Running            0          15h
sriov-cni-rg8vn                           2/2     Running            0          15h
sriov-device-plugin-4hpwm                 0/1     CrashLoopBackOff   7          15m
sriov-device-plugin-4mqtt                 1/1     Running            0          4m33s
sriov-network-config-daemon-gx8fm         1/1     Running            0          15h
sriov-network-config-daemon-vdrwd         1/1     Running            0          4m53s
sriov-network-operator-5955546847-sh8st   1/1     Running            0          39h

Actual results:


# oc logs sriov-device-plugin-4hpwm
I0108 04:05:10.690644       1 manager.go:52] Using Kubelet Plugin Registry Mode
I0108 04:05:10.690724       1 main.go:44] resource manager reading configs
I0108 04:05:10.690778       1 manager.go:86] raw ResourceList: {"resourceList":null}
I0108 04:05:10.690783       1 manager.go:106] unmarshalled ResourceList: []
E0108 04:05:10.690789       1 main.go:51] no resource configuration; exiting


# oc get cm device-plugin-config -o yaml
apiVersion: v1
data:
  sriov-worker-0: '{"resourceList":[{"resourceName":"mlxnics0","selectors":{"vendors":["15b3"],"devices":["1018"],"pfNames":["ens801f1"],"rootDevices":["0000:b0:00.1"],"IsRdma":false,"NeedVhostNet":false},"SelectorObj":null}]}'
  sriov-worker-1: '{"resourceList":null}'
kind: ConfigMap
metadata:
  creationTimestamp: "2021-01-06T09:27:04Z"
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:data:
        .: {}
        f:sriov-worker-0: {}
        f:sriov-worker-1: {}
    manager: sriov-network-operator
    operation: Update
    time: "2021-01-06T09:27:04Z"
  name: device-plugin-config
  namespace: openshift-sriov-network-operator
  resourceVersion: "898768"
  uid: 7c3622c3-0624-44d3-9f0b-7486b4f7d746
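
The link between the log and the ConfigMap can be checked mechanically: the crashing pod reads an entry whose resourceList is null. The following is a minimal sketch, not part of the operator or the device plugin; the helper name `nodes_without_resources` is hypothetical and the data is copied from the ConfigMap output above.

```python
import json

# Per-node entries copied from the `data` section of the ConfigMap above.
config_data = {
    "sriov-worker-0": (
        '{"resourceList":[{"resourceName":"mlxnics0","selectors":'
        '{"vendors":["15b3"],"devices":["1018"],"pfNames":["ens801f1"],'
        '"rootDevices":["0000:b0:00.1"],"IsRdma":false,"NeedVhostNet":false},'
        '"SelectorObj":null}]}'
    ),
    "sriov-worker-1": '{"resourceList":null}',
}


def nodes_without_resources(data):
    """Return the node names whose resourceList is null or empty.

    Hypothetical helper for illustration only.
    """
    empty = []
    for node, raw in sorted(data.items()):
        resource_list = json.loads(raw).get("resourceList")
        if not resource_list:  # covers both null and []
            empty.append(node)
    return empty


print(nodes_without_resources(config_data))  # ['sriov-worker-1']
```

A device plugin pod scheduled on any node returned here would be expected to log "no resource configuration; exiting", as seen above.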

Expected results:


Additional info:

Comment 1 zenghui.shi 2021-01-18 10:33:58 UTC
The problem is that when an sriov policy is deleted, the nodeAffinity of the sriov-device-plugin daemonset is not updated accordingly. As a result, the device plugin is wrongly scheduled on a node that has no sriov resources configured.
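
To illustrate the idea (not the actual fix, which this report does not show), the sketch below recomputes a nodeAffinity term from the per-node configuration so that only nodes with a non-empty resourceList are selected. The function name `build_node_affinity` is hypothetical, and matching on the `kubernetes.io/hostname` label is an assumption; the operator may select nodes differently.

```python
import json


def build_node_affinity(data):
    """Build a nodeAffinity dict that selects only nodes with resources.

    `data` maps node name -> JSON string, as in the device-plugin-config
    ConfigMap. Hypothetical helper; the hostname label is an assumption.
    """
    hosts = sorted(
        node for node, raw in data.items()
        if json.loads(raw).get("resourceList")
    )
    if not hosts:
        # No node has resources configured, so there is nothing to select.
        return None
    return {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [{
                "matchExpressions": [{
                    "key": "kubernetes.io/hostname",
                    "operator": "In",
                    "values": hosts,
                }]
            }]
        }
    }


# State after the policy was deleted: only sriov-worker-0 has resources.
after_delete = {
    "sriov-worker-0": '{"resourceList":[{"resourceName":"mlxnics0"}]}',
    "sriov-worker-1": '{"resourceList":null}',
}
print(json.dumps(build_node_affinity(after_delete), indent=2))
```

If the affinity were recomputed like this on every policy change, sriov-worker-1 would drop out of the selector once its resourceList became null, and no device plugin pod would be scheduled there.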

Comment 2 zenghui.shi 2021-01-18 10:34:49 UTC
The issue also happened in a non-offload environment.

Comment 4 zhaozhanqi 2021-02-01 02:21:32 UTC
Verified this bug on 4.7.0-202101300133.p0

Comment 7 errata-xmlrpc 2021-02-24 15:51:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

