Description of problem:
Error looks like:
Nov 22 20:37:32.217248 build0-gstfj-w-b-vbct9.c.openshift-ci-build-farm.internal hyperkube: E1122 20:37:32.217183 1761 remote_runtime.go:140] StopPodSandbox "188fa63d60a136dc917b6aa0570aefa5519d66eaff3e353b605d96f1bde96260" from runtime service failed: rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_network-metrics-daemon-2xrxz_openshift-multus_76cd56c9-b986-47f2-b82e-866649cc81bc_0(188fa63d60a136dc917b6aa0570aefa5519d66eaff3e353b605d96f1bde96260): Multus: [openshift-multus/network-metrics-daemon-2xrxz]: error reading the delegates: open /var/lib/cni/multus/188fa63d60a136dc917b6aa0570aefa5519d66eaff3e353b605d96f1bde96260: no such file or directory
Version-Release number of selected component (if applicable): 4.6.4
How reproducible: Intermittent.
Steps to Reproduce: (unknown)
Actual results: The CNI DEL fails with the error above.
Expected results: The missing cache file should be handled instead of erroring out.
Upstream fix created by Tomofumi Hayashi @ https://github.com/s1061123/multus-cni/commit/4eb6ae1553a59771eeefc7a64e1213e2b8bd789b#diff-ddd9d724bf025ac4f6049453e91075e263918c229c8511cf60ee081aacacd88cR740
Turns out that Tomo's commit is present in 4.6. However, Ryan Phillips advised that the problem is that the `IsNotExists` block is only entered if `pod != nil`.
That's why we're seeing the error from here @ https://github.com/intel/multus-cni/blob/master/multus/multus.go#L768
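To illustrate the condition Ryan describes, here is a minimal Go sketch. It is not the Multus source: `tolerateBefore`, `tolerateAfter` and `readMissingCache` are hypothetical names, and the logic is simplified on the assumption that the only difference is whether the missing-file check depends on the pod having been found.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// tolerateBefore mirrors the behaviour described in this thread: a missing
// cache file is tolerated only when the pod was found in the API.
// (Hypothetical helper, simplified for illustration.)
func tolerateBefore(readErr error, podFound bool) bool {
	return podFound && errors.Is(readErr, fs.ErrNotExist)
}

// tolerateAfter sketches the proposed behaviour: a missing cache file is
// tolerated on DEL whether or not the pod was found, so podFound is ignored.
func tolerateAfter(readErr error, podFound bool) bool {
	return errors.Is(readErr, fs.ErrNotExist)
}

// readMissingCache tries to read a cache file that does not exist inside a
// fresh temporary directory and returns the resulting error.
func readMissingCache() error {
	dir, err := os.MkdirTemp("", "multus-cache-sketch")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	_, err = os.ReadFile(filepath.Join(dir, "sandbox-id"))
	return err
}

func main() {
	err := readMissingCache()
	fmt.Println("before, pod not found:", tolerateBefore(err, false))
	fmt.Println("after, pod not found: ", tolerateAfter(err, false))
}
```

With the cache file gone and the pod absent from the API, only the second variant lets the DEL exit zero, which matches the case seen in the log above.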
Proposed a change upstream, after consulting Tomofumi Hayashi to double-check the solution.
Have downstream PR, pending upstream review.
The situation that causes this condition is very difficult to reproduce. Generally, it would happen when a system is thrashing: somehow the cache files are gone (possibly because of a reboot), and the pod is also not present in the API at the time the CNI DEL happens.
I think we'll need to validate this by just checking that the commit where the fix is made is present on the machine.
I will follow up with a recipe to determine whether the fix is present.
To validate that the code is present, first get the image used from Multus using any pod as an example...
oc get pods -n multus
oc get pods -n openshift-multus
oc describe pod multus-gg7vj -n openshift-multus | grep -i -A10 kube-multus | grep Image:
Then using the image ID, which may look something like: "registry.svc.ci.openshift.org/ocp/4.7-2020-12-04-133047@sha256:dbbcb26948470f12bee716c59a7583e46370b48eb28e9f7f485c3bc73242575b"
Debug any worker node, and inspect the image using podman...
oc debug node/ip-10-0-137-96.us-west-1.compute.internal --image=busybox
podman inspect registry.svc.ci.openshift.org/ocp/4.7-2020-12-04-133047@sha256:dbbcb26948470f12bee716c59a7583e46370b48eb28e9f7f485c3bc73242575b | grep commit.id
Using the commit ID from that command, visit GitHub for Multus and put the commit in the URL like so:
In the list of commits there should be a commit with the ID cfa0d64925526ccf1e01707d5c5eadda810ec9f7 and the description "Multus should exit zero on DEL when cache file is missing and pod can…"
If that commit is there, the code is present.
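As a rough alternative to piping the inspect output through grep, the commit ID could be pulled out of the JSON programmatically. The Go sketch below assumes that `podman inspect` prints a JSON array whose entries carry a `Labels` map (top-level or under `Config`), and that the commit is stored in a label whose key contains "commit.id". `findCommitID` and `sampleInspect` are hypothetical names, and the sample JSON is a hand-written stand-in, not real command output.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"sort"
	"strings"
)

// inspectedImage models only the fields this sketch needs.
// Assumption: labels appear either at the top level or under Config.
type inspectedImage struct {
	Labels map[string]string `json:"Labels"`
	Config struct {
		Labels map[string]string `json:"Labels"`
	} `json:"Config"`
}

// sampleInspect is illustrative input only; the label key is an assumption.
const sampleInspect = `[{"Labels": {
  "io.openshift.build.commit.id": "7aa53b3cb5ae0ae4b03906803c7aff37447d4357",
  "name": "example"
}}]`

// findCommitID returns the value of the first label (in sorted key order)
// whose key contains "commit.id", similar to `grep commit.id`.
func findCommitID(inspectJSON []byte) (string, error) {
	var images []inspectedImage
	if err := json.Unmarshal(inspectJSON, &images); err != nil {
		return "", err
	}
	for _, img := range images {
		for _, labels := range []map[string]string{img.Labels, img.Config.Labels} {
			keys := make([]string, 0, len(labels))
			for k := range labels {
				keys = append(keys, k)
			}
			sort.Strings(keys) // deterministic order across runs
			for _, k := range keys {
				if strings.Contains(k, "commit.id") {
					return labels[k], nil
				}
			}
		}
	}
	return "", errors.New("no label containing commit.id found")
}

func main() {
	id, err := findCommitID([]byte(sampleInspect))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("commit:", id)
}
```

The returned ID can then be used in the GitHub commits URL as described above.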
Douglas, thanks for the detailed information.
The bug is verified in 4.7.0-0.nightly-2020-12-07-163758
https://github.com/openshift/multus-cni/commits/7aa53b3cb5ae0ae4b03906803c7aff37447d4357 shows the fix commit with the description "Multus should exit zero on DEL when cache file is missing and pod can…"
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.