1948137 – CNI DEL not called on node reboot - OCP 4 CRI-O.

Summary: CNI DEL not called on node reboot - OCP 4 CRI-O.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Peter Hunt
QA Contact:	Sunil Choudhary
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1977432

Reported:	2021-04-10 10:22 UTC by David Hernández Fernández
Modified:	2024-10-01 17:53 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: A node is rebooted while using a CNI plugin that does not clean up its resources on reboot. Consequence: Some CNI resources may be leaked. Fix: CRI-O attempts to call CNI DEL on all containers that were running before the reboot. Result: CNI resources are cleaned up after reboot.
Clone Of:
Environment:
Last Closed:	2021-07-27 22:58:59 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	cri-o cri-o pull 4767	None	closed	Introduce InternalWipe	2021-05-21 14:50:35 UTC
Github	cri-o cri-o pull 4900	None	closed	server: rework internal wipe to play better with reboots	2021-05-21 14:50:58 UTC
Github	cri-o cri-o pull 4934	None	closed	[release-1.21] server: reduce log verbosity and speed up startup on restore	2021-05-21 12:55:43 UTC
Github	openshift machine-config-operator pull 2574	None	open	Bug 1948137: crio: enable internal_wipe option	2021-05-21 14:50:51 UTC
Red Hat Product Errata	RHSA-2021:2438	None	None	None	2021-07-27 22:59:18 UTC

Description David Hernández Fernández 2021-04-10 10:22:35 UTC

Description of problem:

This bug is actively being worked here: https://github.com/cri-o/cri-o/issues/4727

Upon rebooting an OpenShift node using CRI-O, the CNI plugin does not appear to be called to tear down resources for the previously running pods on that node.
To check this, I noted the running pod sandbox IDs, rebooted the node with "sudo reboot now" (without draining the pods, effectively simulating a failure scenario), and let the node come back up. When it does, I see the pods get launched again with new sandboxes. However, the old sandboxes never seem to get a CNI DEL.

Similarly, I can see cache files in /var/lib/cni/results for both the new and old sandboxes. Does CRI-O make any attempt to release these resources on node reboot?

Version-Release number of selected component (if applicable): OCP 4.6

How reproducible:

Run some pods on a node and note the pod sandbox container IDs.
Reboot the node and see that the pods are re-launched with new pod sandboxes.
See no evidence in the logs that the old sandbox IDs were released.
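
The check described in these steps could be scripted roughly as follows. This is only a minimal sketch, not part of CRI-O or any CNI plugin: the function name find_stale_cni_cache is hypothetical, and it assumes that each cache file name under /var/lib/cni/results embeds the full sandbox ID, which may not hold for every plugin.

```python
from pathlib import Path


def find_stale_cni_cache(results_dir, old_ids, new_ids):
    """Map each pre-reboot sandbox ID to the cache files still present for it."""
    # Sandboxes that existed before the reboot and were not re-used afterwards.
    gone = set(old_ids) - set(new_ids)
    stale = {}
    root = Path(results_dir)
    if not root.is_dir():
        # Nothing to inspect, for example when the plugin keeps no result cache.
        return stale
    for entry in sorted(root.iterdir()):
        if not entry.is_file():
            continue
        for sandbox_id in gone:
            # Assumption: the cache file name contains the full sandbox ID.
            if sandbox_id in entry.name:
                stale.setdefault(sandbox_id, []).append(entry.name)
    return stale
```

Calling find_stale_cni_cache("/var/lib/cni/results", ids_before, ids_after) with the sandbox IDs noted before and after the reboot would list any old sandboxes that still have cache files, which is the leak described above.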

Expected results:

I expect CNI DEL to be called on the old sandbox IDs, providing a chance for CNI plugins to release any associated resources. Currently, CNI DEL is not called.

Additional info:

Output of crio --version:

crio version 1.18.2-18.rhaos4.5.git754d46b.el8
Version: 1.18.2-18.rhaos4.5.git754d46b.el8
GoVersion: go1.13.4
Compiler: gc
Platform: linux/amd64
Linkmode: dynamic

Additional environment details (AWS, VirtualBox, physical, etc.):

OpenShift, IPI in AWS using Calico CNI.

Comment 1 mfiedler 2021-04-15 16:43:55 UTC

Customer escalation flag set due to impact. This can prevent us from deploying Red Hat OpenShift Container Platform at Morgan Stanley. Full Stop.

Comment 2 Peter Hunt 2021-04-15 20:32:12 UTC

One piece of this is solved by the attached PR.

For Calico and other pod-based plugins, we'll need a bit more to actually make it all work end-to-end.

Comment 3 Peter Hunt 2021-04-16 21:36:57 UTC

Alright, most of the remaining pieces are in place. We should now be calling CNI DEL in the majority of situations.

Comment 4 Arvin Amirian 2021-04-18 18:42:59 UTC

Can this fix be backported to 4.7?

Comment 5 Peter Hunt 2021-04-23 20:05:47 UTC

The PR with the fix is quite large. Some changes can be dropped, but this is a pretty big structural change. I would like the 4.8 version to have some time to soak and catch any issues before we evaluate backporting, so I cannot promise a timeline for it being backported.

Comment 6 Matthew Whitehead 2021-05-05 17:09:21 UTC

Now that an upstream fix has been released for at least some of the problem cases, the customer is requesting a backport to OpenShift 4.6 (their production) and 4.7 (their development).

Comment 7 Peter Hunt 2021-05-10 20:18:46 UTC

The 4.8 variant has merged (https://github.com/cri-o/cri-o/pull/4884).
Request received. I don't think this should be backported yet, but if it proves itself reasonably stable, I can see it happening. No promises though :)

Comment 9 Peter Hunt 2021-05-11 13:00:20 UTC

Oops, this is not finished; we need an MCO PR first.

I don't personally have a reproducer; David, do you have one?

Comment 10 David Hernández Fernández 2021-05-12 08:38:45 UTC

Hi Peter.

I can see that there is indeed some work from machine-config/CRI-O here: https://github.com/openshift/machine-config-operator/pull/2574 (calling CNI DEL on a node reboot with the internal_wipe feature).
The reproducer should focus on the old sandbox IDs, so that they can be re-used but also released. Validation from QA should focus on getting log evidence that CNI DEL is called for each old sandbox ID. Is that enough for you?

Comment 11 Peter Hunt 2021-05-12 13:26:53 UTC

Something like this test case could be used:
https://github.com/cri-o/cri-o/blob/ce6527b9be5c6fafbac212e9bb84471d0ad63d88/test/crio-wipe.bats#L266..L297

That is, have pods running with a CNI plugin that saves state somewhere non-volatile (not a tmpfs), reboot the node, and verify that the resources are cleaned up after the reboot.

Comment 12 Peter Hunt 2021-05-14 15:56:36 UTC

Turns out I wasn't as thorough as I should have been testing the reboot scenario. The attached patch is needed as well to actually call CNI DEL on reboot.

For QA, one can reboot the node and look for log messages saying "Successfully cleaned up network for pod $podID" for all pods that existed before reboot.
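
The QA check above could be automated along these lines. This is a hedged sketch only: the helper name pods_missing_cleanup is hypothetical, the log lines are assumed to come from the CRI-O journal after the reboot (for example via journalctl -u crio -b), and the only part of the log format taken from this bug is the quoted message text; everything around it is an assumption.

```python
import re

# Message text quoted in this bug; the pod ID is assumed to follow it directly.
CLEANUP_RE = re.compile(r"Successfully cleaned up network for pod (\S+)")


def pods_missing_cleanup(log_lines, pod_ids_before_reboot):
    """Return the pre-reboot pod IDs that have no cleanup message in the log."""
    cleaned = set()
    for line in log_lines:
        match = CLEANUP_RE.search(line)
        if match:
            # Drop trailing quotes or punctuation left over from the log framing.
            cleaned.add(match.group(1).strip("\"'.,"))
    return sorted(set(pod_ids_before_reboot) - cleaned)
```

An empty result would mean every pod that existed before the reboot has a matching cleanup message; any IDs returned are pods for which no CNI DEL evidence was found.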

Comment 13 Peter Hunt 2021-05-21 12:55:47 UTC

As an update to this one, there was one more set of patches that was needed, but I've tested on an OpenShift node and have confirmed that the CNI DEL calls are made once the plugin comes up.
I will attempt to move forward with the MCO PR today.

Comment 18 errata-xmlrpc 2021-07-27 22:58:59 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
