Bug 1948137

Summary: CNI DEL not called on node reboot - OCP 4 CRI-O.
Product: OpenShift Container Platform Reporter: David Hernández Fernández <dahernan>
Component: Node    Assignee: Peter Hunt <pehunt>
Node sub component: CRI-O QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: aamirian, aos-bugs, mfiedler, mwhitehe, openshift-bugs-escalate, rmanes
Version: 4.6   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: A reboot when using a CNI plugin that does not clean up its resources on reboot.
Consequence: Some CNI resources may be leaked.
Fix: CRI-O attempts to call CNI DEL on all containers that were running before the reboot.
Result: CNI resources are cleaned up after reboot.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:58:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1977432    

Description David Hernández Fernández 2021-04-10 10:22:35 UTC
Description of problem:

This bug is actively being worked here: https://github.com/cri-o/cri-o/issues/4727 

Upon rebooting an OpenShift node using CRI-O, the CNI plugin does not appear to be called to tear down resources for the previously running pods on that node.
Making note of the running pod sandbox IDs, I reboot the node with sudo reboot now (without draining the pods, effectively simulating a failure scenario) and let the node come back up. When it does, I see the pods launched again with new sandboxes. However, the old sandboxes never seem to get a CNI DEL.

Similarly, I can see cache files in /var/lib/cni/results for both the new and old sandboxes. Does CRI-O make any attempt to release these resources on node reboot?
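To illustrate the leak described above, here is a minimal sketch that distinguishes stale from live cache entries. It is purely illustrative: a scratch directory stands in for /var/lib/cni/results, and the file names and sandbox IDs are made up (real cache file names are produced by libcni).

```shell
# Scratch directory standing in for /var/lib/cni/results (illustrative only).
cache=$(mktemp -d)
touch "$cache/k8s-pod-network-aaa111" "$cache/k8s-pod-network-bbb222"

# Pretend this sandbox ID belongs to a pod that was re-created after reboot.
live="bbb222"

# Cache entries with no matching live sandbox are candidates for leaked
# CNI resources that never received a DEL.
stale=0
for f in "$cache"/*; do
  case "$f" in
    *"$live"*) echo "live:  $(basename "$f")" ;;
    *)         echo "stale: $(basename "$f")"; stale=$((stale + 1)) ;;
  esac
done
echo "stale cache entries: $stale"
rm -rf "$cache"
```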

Version-Release number of selected component (if applicable): OCP 4.6

How reproducible:

1. Run some pods on a node; note the pod sandbox container IDs.
2. Reboot the node; see that the pods are re-launched with new pod sandboxes.
3. See no evidence in the logs that the old sandbox IDs were released.
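Those steps can be sketched as node-side commands. This is hedged: it assumes shell access to the node and the crictl CLI, the file path is illustrative, and the guard makes it a no-op where crictl is absent.

```shell
# Reproduction sketch (assumes shell access to the node and crictl).
if command -v crictl >/dev/null 2>&1; then
  # 1. Record the running pod sandbox IDs before the reboot.
  crictl pods -q > /tmp/sandboxes-before.txt
  # 2. Reboot without draining (run manually): sudo reboot now
  # 3. After the node is back, check whether the old IDs were ever torn down.
  for id in $(cat /tmp/sandboxes-before.txt); do
    journalctl -u crio -b | grep -q "$id" || echo "no teardown logged for $id"
  done
  status=ran
else
  echo "crictl not available here; treat this as a sketch only"
  status=skipped
fi
```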

Expected results:

I expect CNI DEL to be called on the old sandbox IDs, giving CNI plugins a chance to release any associated resources.

Actual results:

CNI DEL is not called.

Additional info:

Output of crio --version:

crio version 1.18.2-18.rhaos4.5.git754d46b.el8
Version:    1.18.2-18.rhaos4.5.git754d46b.el8
GoVersion:  go1.13.4
Compiler:   gc
Platform:   linux/amd64
Linkmode:   dynamic
Additional environment details (AWS, VirtualBox, physical, etc.):

OpenShift, IPI in AWS using Calico CNI.

Comment 1 mfiedler 2021-04-15 16:43:55 UTC
Customer escalation flag set due to impact. This can prevent us from deploying Red Hat OpenShift Container Platform at Morgan Stanley. Full stop.

Comment 2 Peter Hunt 2021-04-15 20:32:12 UTC
One piece of this is solved by the attached PR.

For Calico and other pod-based plugins, we'll need a bit more to actually make it all work end-to-end

Comment 3 Peter Hunt 2021-04-16 21:36:57 UTC
Alright, most of the remaining pieces are in place. We should now be calling CNI DEL in the majority of situations.

Comment 4 Arvin Amirian 2021-04-18 18:42:59 UTC
Can this fix be backported to 4.7?

Comment 5 Peter Hunt 2021-04-23 20:05:47 UTC
The PR to fix this is quite large. Some changes can be dropped, but this is a pretty big structural change. I would like the 4.8 version to have some time to soak and catch any issues before we evaluate backporting, so I cannot promise a timeline for it being backported.

Comment 6 Matthew Whitehead 2021-05-05 17:09:21 UTC
Now that an upstream fix has been released for at least some of the problem cases, the customer is requesting a backport to OpenShift 4.6 (their production) and 4.7 (their development).

Comment 7 Peter Hunt 2021-05-10 20:18:46 UTC
4.8 variant has merged (https://github.com/cri-o/cri-o/pull/4884)
Backport request received. I don't think this should be backported yet, but if it proves itself reasonably stable, I can see it happening. No promises though :)

Comment 9 Peter Hunt 2021-05-11 13:00:20 UTC
oops, this is not finished, we need an MCO PR first

I don't personally have a reproducer; David, do you have one?

Comment 10 David Hernández Fernández 2021-05-12 08:38:45 UTC
Hi Peter.

I can see that there is indeed some work in machine-config/CRI-O here: https://github.com/openshift/machine-config-operator/pull/2574 (calling CNI DEL on node reboot via the internal_wipe feature).
The reproducer should focus on the old sandbox IDs, checking that they are both re-used and released. QA validation should focus on getting log evidence that CNI DEL is called for the old sandbox IDs; is that enough for you?
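For context, the internal_wipe behavior referenced above is controlled by a CRI-O configuration option. A minimal drop-in might look like the following; the file path is illustrative, and the default value depends on the CRI-O version in use.

```toml
# /etc/crio/crio.conf.d/01-internal-wipe.conf (illustrative path)
[crio]
# When true, CRI-O itself wipes containers and pods after a reboot
# (and, with this fix, calls CNI DEL for pre-reboot sandboxes),
# instead of relying on the external crio-wipe service.
internal_wipe = true
```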

Comment 11 Peter Hunt 2021-05-12 13:26:53 UTC
Something like this test case could be used:

Have pods running with a CNI plugin that saves state somewhere non-volatile (not a tmpfs), reboot the node, and verify the resources are cleaned up after the reboot.

Comment 12 Peter Hunt 2021-05-14 15:56:36 UTC
Turns out I wasn't as thorough as I should have been testing the reboot scenario. The attached patch is also needed to actually call CNI DEL on reboot.

For QA: one can reboot the node and look for log messages saying "Successfully cleaned up network for pod $podID" for all pods that existed before the reboot.
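That check can be sketched as a small script. Hedged: the log message text is quoted from this comment, while the pods-before-reboot.txt file and its one-pod-ID-per-line format are assumptions for illustration; the guard makes it a no-op where the journal or the list is unavailable.

```shell
# Hypothetical QA verification: pods-before-reboot.txt is assumed to contain
# one pod ID per line, recorded before the reboot.
if command -v journalctl >/dev/null 2>&1 && [ -f pods-before-reboot.txt ]; then
  while read -r pod; do
    if journalctl -u crio -b | grep -q "Successfully cleaned up network for pod $pod"; then
      echo "cleaned:        $pod"
    else
      echo "NOT cleaned up: $pod"
    fi
  done < pods-before-reboot.txt
  result=checked
else
  echo "journalctl or pod list missing; sketch only"
  result=skipped
fi
```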

Comment 13 Peter Hunt 2021-05-21 12:55:47 UTC
As an update: one more set of patches was needed, but I've tested on an OpenShift node and confirmed that the CNI DELs are called once the plugin comes up.
I will attempt to move forward with the MCO PR today

Comment 18 errata-xmlrpc 2021-07-27 22:58:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.