Description of problem:

Error looks like:

```
Nov 22 20:37:32.217248 build0-gstfj-w-b-vbct9.c.openshift-ci-build-farm.internal hyperkube[1761]: E1122 20:37:32.217183 1761 remote_runtime.go:140] StopPodSandbox "188fa63d60a136dc917b6aa0570aefa5519d66eaff3e353b605d96f1bde96260" from runtime service failed: rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_network-metrics-daemon-2xrxz_openshift-multus_76cd56c9-b986-47f2-b82e-866649cc81bc_0(188fa63d60a136dc917b6aa0570aefa5519d66eaff3e353b605d96f1bde96260): Multus: [openshift-multus/network-metrics-daemon-2xrxz]: error reading the delegates: open /var/lib/cni/multus/188fa63d60a136dc917b6aa0570aefa5519d66eaff3e353b605d96f1bde96260: no such file or directory
```

Version-Release number of selected component (if applicable): 4.6.4

How reproducible: Intermittent.

Steps to Reproduce: (unknown)

Actual results: Should be handled instead of erroring out.

Additional info: Upstream fix created by Tomofumi Hayashi @ https://github.com/s1061123/multus-cni/commit/4eb6ae1553a59771eeefc7a64e1213e2b8bd789b#diff-ddd9d724bf025ac4f6049453e91075e263918c229c8511cf60ee081aacacd88cR740
Turns out that Tomo's commit is present in 4.6. However, Ryan Phillips advised that the problem is that the `IsNotExists` block is only entered if `pod != nil`. That's why we're seeing the error from here @ https://github.com/intel/multus-cni/blob/master/multus/multus.go#L768
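To make the control flow easier to follow, here is a minimal Go sketch of the gating described above. It is not the Multus source: `Pod`, `missingCacheErr`, `delGated` and `delTolerant` are hypothetical names made up for illustration, and the "tolerant" variant is only one possible way to handle the case, assuming a missing cache file on DEL means there is nothing left to clean up.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
)

// Pod is a hypothetical stand-in for the pod object fetched from the API.
// A nil *Pod means the pod could not be found.
type Pod struct{ Name string }

// missingCacheErr builds the kind of error returned when the delegates
// cache file does not exist (placeholder path, not a real container ID).
func missingCacheErr() error {
	return &fs.PathError{
		Op:   "open",
		Path: "/var/lib/cni/multus/<container-id>",
		Err:  fs.ErrNotExist,
	}
}

// delGated mirrors the behaviour described in this comment: the
// not-exist case is only tolerated when the pod is known.
func delGated(readErr error, pod *Pod) error {
	if readErr == nil {
		return nil
	}
	if pod != nil && errors.Is(readErr, fs.ErrNotExist) {
		return nil
	}
	return fmt.Errorf("error reading the delegates: %w", readErr)
}

// delTolerant is an assumed alternative: a missing cache file is
// tolerated whether or not the pod is still present in the API.
func delTolerant(readErr error, pod *Pod) error {
	if readErr == nil || errors.Is(readErr, fs.ErrNotExist) {
		return nil
	}
	return fmt.Errorf("error reading the delegates: %w", readErr)
}

func main() {
	err := missingCacheErr()
	// Cache file missing and pod missing: the gated variant reports an error.
	fmt.Println("gated, pod missing:   ", delGated(err, nil))
	// Cache file missing but pod present: the gated variant tolerates it.
	fmt.Println("gated, pod present:   ", delGated(err, &Pod{Name: "example"}))
	// The tolerant variant does not depend on the pod lookup.
	fmt.Println("tolerant, pod missing:", delTolerant(err, nil))
}
```

With the gated variant, the combination "cache file missing and pod not found" falls through to the error return, which matches the "error reading the delegates" message in the log from the report.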
I have proposed a change upstream, after consulting to double-check the solution from Tomofumi Hayashi. A downstream PR is ready, pending upstream review.
The situation that causes this condition is very difficult to reproduce. Generally, it would happen when a system is thrashing: somehow the cache files are gone (possibly because of a reboot), and the pod is also not present in the API at the time CNI DEL happens. I think we'll need to validate this by just checking that the commit containing the fix is present on the machine. I will follow up with a recipe to determine whether the fix is present.
To validate that the code is present, first get the image used by Multus, using any pod as an example...

```
oc get pods -n multus
oc get pods -n openshift-multus
oc describe pod multus-gg7vj -n openshift-multus | grep -i -A10 kube-multus | grep Image:
```

Then take the image ID, which may look something like:

"registry.svc.ci.openshift.org/ocp/4.7-2020-12-04-133047@sha256:dbbcb26948470f12bee716c59a7583e46370b48eb28e9f7f485c3bc73242575b"

Debug any worker node, and inspect the image using podman...

```
oc debug node/ip-10-0-137-96.us-west-1.compute.internal --image=busybox
chroot /host
podman inspect registry.svc.ci.openshift.org/ocp/4.7-2020-12-04-133047@sha256:dbbcb26948470f12bee716c59a7583e46370b48eb28e9f7f485c3bc73242575b | grep commit.id
```

Using the commit ID from that command, visit GitHub for Multus and put the commit in the URL like so:

https://github.com/openshift/multus-cni/commits/7aa53b3cb5ae0ae4b03906803c7aff37447d4357

In the list of commits there should be a commit with the ID cfa0d64925526ccf1e01707d5c5eadda810ec9f7 and the description "Multus should exit zero on DEL when cache file is missing and pod can…"

If that commit is there, the code is present.
Douglas, thanks for the detailed information. The bug is verified in 4.7.0-0.nightly-2020-12-07-163758.

https://github.com/openshift/multus-cni/commits/7aa53b3cb5ae0ae4b03906803c7aff37447d4357 shows the fixing PR with the description "Multus should exit zero on DEL when cache file is missing and pod can…"
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633