Bug 1900835

Summary:	Multus errors when cachefile is not found
Product:	OpenShift Container Platform	Reporter:	Douglas Smith <dosmith>
Component:	Networking	Assignee:	Douglas Smith <dosmith>
Networking sub component:	multus	QA Contact:	Weibin Liang <weliang>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	medium	CC:	aaleman
Version:	4.6
Target Milestone:	---
Target Release:	4.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: When Multus cache files are deleted by an external process, Multus could not find them, and would exit non zero. Consequence: This would cause pods to not delete in a timely fashion. Fix: Allow Multus to give up by exiting zero when cache files are not found. Result: Pods are deleted successfully when cachefiles have been deleted.	Story Points:	---
Clone Of:
Clones:	1905230 (view as bug list)		Environment:
Last Closed:	2021-02-24 15:35:25 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1905230

Description Douglas Smith 2020-11-23 20:01:22 UTC

Description of problem:

Error looks like:

```
Nov 22 20:37:32.217248 build0-gstfj-w-b-vbct9.c.openshift-ci-build-farm.internal hyperkube[1761]: E1122 20:37:32.217183    1761 remote_runtime.go:140] StopPodSandbox "188fa63d60a136dc917b6aa0570aefa5519d66eaff3e353b605d96f1bde96260" from runtime service failed: rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_network-metrics-daemon-2xrxz_openshift-multus_76cd56c9-b986-47f2-b82e-866649cc81bc_0(188fa63d60a136dc917b6aa0570aefa5519d66eaff3e353b605d96f1bde96260): Multus: [openshift-multus/network-metrics-daemon-2xrxz]: error reading the delegates: open /var/lib/cni/multus/188fa63d60a136dc917b6aa0570aefa5519d66eaff3e353b605d96f1bde96260: no such file or directory
```


Version-Release number of selected component (if applicable): 4.6.4


How reproducible: Intermittent.


Steps to Reproduce: (unknown)

Actual results: Multus errors out. The missing cache file should be handled instead.

Additional info:

Upstream fix created by Tomofumi Hayashi @ https://github.com/s1061123/multus-cni/commit/4eb6ae1553a59771eeefc7a64e1213e2b8bd789b#diff-ddd9d724bf025ac4f6049453e91075e263918c229c8511cf60ee081aacacd88cR740

Comment 2 Douglas Smith 2020-11-24 14:16:17 UTC

Turns out that Tomo's commit is present in 4.6. However, Ryan Phillips advised that the problem is that the `IsNotExists` block is only entered if `pod != nil`.

That's why we're seeing the error from here @ https://github.com/intel/multus-cni/blob/master/multus/multus.go#L768

Comment 3 Douglas Smith 2020-11-24 18:28:36 UTC

Proposed a change upstream after consulting with Tomofumi Hayashi to double-check the solution.

There is a downstream PR, pending upstream review.

Comment 5 Douglas Smith 2020-12-04 16:42:00 UTC

The situation that causes this condition is very difficult to reproduce. Generally, it would happen when a system is thrashing. Somehow the cache files are gone (possibly because of a reboot), and the pod is also not present in the API at the time CNI DEL happens.

I think we'll need to validate this by just checking that the commit where the fix is made is present on the machine.

I will follow up with a recipe to determine the fix is present.

Comment 6 Douglas Smith 2020-12-04 17:59:38 UTC

To validate that the code is present, first get the image used by Multus, using any Multus pod as an example...

```
oc get pods -n openshift-multus
oc describe pod multus-gg7vj -n openshift-multus | grep -i -A10 kube-multus | grep Image:
```

Then using the image ID, which may look something like: "registry.svc.ci.openshift.org/ocp/4.7-2020-12-04-133047@sha256:dbbcb26948470f12bee716c59a7583e46370b48eb28e9f7f485c3bc73242575b"

Debug any worker node, and inspect the image using podman...


```
oc debug node/ip-10-0-137-96.us-west-1.compute.internal --image=busybox
chroot /host
podman inspect registry.svc.ci.openshift.org/ocp/4.7-2020-12-04-133047@sha256:dbbcb26948470f12bee716c59a7583e46370b48eb28e9f7f485c3bc73242575b | grep commit.id
```

Using the commit ID from that command, visit GitHub for Multus and put the commit in the URL like so:

https://github.com/openshift/multus-cni/commits/7aa53b3cb5ae0ae4b03906803c7aff37447d4357

In the list of commits there should be a commit with the ID cfa0d64925526ccf1e01707d5c5eadda810ec9f7 and the description "Multus should exit zero on DEL when cache file is missing and pod can…"

If that commit is there, the code is present.
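As an alternative to grepping the `podman inspect` output by eye, the commit label could be pulled out programmatically. The following is a rough sketch that assumes the inspect output is a JSON array whose entries carry image labels under `Config.Labels`. The label key in the sample data (`io.openshift.build.commit.id`) is an assumption for illustration, and `commitLabels` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// inspectedImage models only the part of the inspect output used here.
// Assumption: labels are available under Config.Labels.
type inspectedImage struct {
	Config struct {
		Labels map[string]string `json:"Labels"`
	} `json:"Config"`
}

// commitLabels returns every label whose key contains "commit.id",
// mirroring the `grep commit.id` step in the recipe.
func commitLabels(inspectJSON []byte) (map[string]string, error) {
	var images []inspectedImage
	if err := json.Unmarshal(inspectJSON, &images); err != nil {
		return nil, fmt.Errorf("parsing inspect output: %w", err)
	}
	found := map[string]string{}
	for _, img := range images {
		for key, value := range img.Config.Labels {
			if strings.Contains(key, "commit.id") {
				found[key] = value
			}
		}
	}
	return found, nil
}

func main() {
	// Sample data only; the label key is an assumed example.
	sample := []byte(`[{"Config":{"Labels":{
		"io.openshift.build.commit.id":"7aa53b3cb5ae0ae4b03906803c7aff37447d4357",
		"name":"example"}}}]`)

	labels, err := commitLabels(sample)
	if err != nil {
		panic(err)
	}
	for key, value := range labels {
		fmt.Printf("%s=%s\n", key, value)
	}
}
```

The extracted commit ID would still need to be checked against the GitHub commit list as described above.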

Comment 7 Weibin Liang 2020-12-07 18:39:40 UTC

Douglas, thanks for the detailed information.

The bug is verified in 4.7.0-0.nightly-2020-12-07-163758

https://github.com/openshift/multus-cni/commits/7aa53b3cb5ae0ae4b03906803c7aff37447d4357 shows the fix commit with the description "Multus should exit zero on DEL when cache file is missing and pod can…"

Comment 10 errata-xmlrpc 2021-02-24 15:35:25 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633