Bug 1900835
| Summary: | Multus errors when cachefile is not found | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Douglas Smith <dosmith> | |
| Component: | Networking | Assignee: | Douglas Smith <dosmith> | |
| Networking sub component: | multus | QA Contact: | Weibin Liang <weliang> | |
| Status: | CLOSED ERRATA | Docs Contact: | ||
| Severity: | high | |||
| Priority: | medium | CC: | aaleman | |
| Version: | 4.6 | |||
| Target Milestone: | --- | |||
| Target Release: | 4.7.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | Bug Fix | ||
| Doc Text: | Cause: When Multus cache files were deleted by an external process, Multus could not find them and exited non-zero.
Consequence: Pods were not deleted in a timely fashion.
Fix: Multus now gives up and exits zero when cache files are not found.
Result: Pods are deleted successfully when cache files have been deleted. | Story Points: | --- | |
| Clone Of: | ||||
| : | 1905230 (view as bug list) | Environment: | ||
| Last Closed: | 2021-02-24 15:35:25 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1905230 | |||

Description: Douglas Smith, 2020-11-23 20:01:22 UTC

Turns out that Tomo's commit is present in 4.6. However, Ryan Phillips advised that the problem is that the `IsNotExists block is only entered if pod != nil`. That's why we're seeing the error from here: https://github.com/intel/multus-cni/blob/master/multus/multus.go#L768

Proposed the change upstream after consulting with Tomofumi Hayashi to double-check the solution. A downstream PR exists, pending upstream review.

The situation that causes this condition is generally very difficult to reproduce; it would happen when a system is thrashing. Somehow the cache files are gone (possibly because of a reboot), and the pod is also not present in the API at the time when CNI DEL happens.

I think we'll need to validate this by just checking that the commit where the fix is made is present on the machine. I will follow up with a recipe to determine that the fix is present.

To validate that the code is present, first get the image used by Multus, using any pod as an example:

```
oc get pods -n multus
oc get pods -n openshift-multus
oc describe pod multus-gg7vj -n openshift-multus | grep -i -A10 kube-multus | grep Image:
```

Then, using the image ID, which may look something like "registry.svc.ci.openshift.org/ocp/4.7-2020-12-04-133047@sha256:dbbcb26948470f12bee716c59a7583e46370b48eb28e9f7f485c3bc73242575b", debug any worker node and inspect the image using podman:

```
oc debug node/ip-10-0-137-96.us-west-1.compute.internal --image=busybox
chroot /host
podman inspect registry.svc.ci.openshift.org/ocp/4.7-2020-12-04-133047@sha256:dbbcb26948470f12bee716c59a7583e46370b48eb28e9f7f485c3bc73242575b | grep commit.id
```

Using the commit ID from that command, visit GitHub for Multus and put the commit in the URL like so: https://github.com/openshift/multus-cni/commits/7aa53b3cb5ae0ae4b03906803c7aff37447d4357

In the list of commits there should be a commit with the ID cfa0d64925526ccf1e01707d5c5eadda810ec9f7 and the description "Multus should exit zero on DEL when cache file is missing and pod can…". If that commit is there, the code is present.

Douglas, thanks for the detailed information. The bug is verified in 4.7.0-0.nightly-2020-12-07-163758. https://github.com/openshift/multus-cni/commits/7aa53b3cb5ae0ae4b03906803c7aff37447d4357 shows the fixing PR description as "Multus should exit zero on DEL when cache file is missing and pod can…".

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633