Description of problem:
This bug is actively being worked here: https://github.com/cri-o/cri-o/issues/4727
Upon rebooting an OpenShift node using CRI-O, the CNI plugin does not appear to be called to tear down resources for the previously running pods on that node.
I make note of the running pod sandbox IDs, reboot the node with `sudo reboot now` (without draining the pods, effectively simulating a failure scenario), and let the node come back up. When it does, I see the pods get launched again with new sandboxes. However, the old sandboxes never seem to get a CNI DEL.
Similarly, I can see cache files in /var/lib/cni/results for both the new and old sandboxes. Does CRI-O make any attempt to release these resources on node reboot?
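A sketch of what the leak looks like on disk. A temp directory stands in here for /var/lib/cni/results, and the "<network>-<containerID>-<ifname>" file-name pattern is an assumption based on libcni's cache layout; the sandbox IDs are illustrative:

```shell
# Simulate the cache dir state after a reboot: entries for both the old
# and the relaunched sandbox are present, because nothing released the old one.
cache=$(mktemp -d)
touch "$cache/k8s-pod-network-OLDSANDBOXID-eth0"   # left over from before the reboot
touch "$cache/k8s-pod-network-NEWSANDBOXID-eth0"   # created for the relaunched pod
ls "$cache" | wc -l    # both entries remain
```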
Version-Release number of selected component (if applicable): OCP 4.6
Steps to Reproduce:
1. Run some pods on a node. Note the pod sandbox container IDs.
2. Reboot the node; see that the pods are re-launched with new pod sandboxes.
3. See no evidence in the logs that the old sandbox IDs were released.

Expected results: CNI DEL is called on the old sandbox IDs, giving CNI plugins a chance to release any associated resources.
Actual results: CNI DEL is not called.
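The sandbox-ID comparison from the steps above can be sketched as follows. On a real node the ID lists would come from `crictl pods -q` before and after the reboot; the IDs below are illustrative stand-ins:

```shell
# Record sandbox IDs before and after the reboot (stand-in data here).
printf 'aaa111\nbbb222\n' > /tmp/sandboxes.before   # noted before the reboot
printf 'ccc333\nddd444\n' > /tmp/sandboxes.after    # after the node came back
# IDs present in both lists; empty output means every pod got a brand-new sandbox:
comm -12 /tmp/sandboxes.before /tmp/sandboxes.after
```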
Output of crio --version:
crio version 1.18.2-18.rhaos4.5.git754d46b.el8
Additional environment details (AWS, VirtualBox, physical, etc.):
OpenShift, IPI in AWS using Calico CNI.
Customer escalation flag set due to impact. This can prevent us from deploying Red Hat OpenShift Container Platform at Morgan Stanley. Full stop.
One piece of this is solved by the attached PR.
For Calico and other pod-based plugins, we'll need a bit more to actually make it all work end-to-end.
Alright, most of the remaining pieces are in place. We should now be calling CNI DEL in the majority of situations.
Can this fix be backported to 4.7?
The PR to fix this is quite large. Some changes can be dropped, but it is a pretty big structural change. I would like the 4.8 version to have some time to soak and catch any issues before we evaluate backporting, so I cannot promise a timeline for a backport.
Now that an upstream fix has been released for at least some of the problem cases, the customer is requesting a backport to OpenShift 4.6 (their production) and 4.7 (their development).
4.8 variant has merged (https://github.com/cri-o/cri-o/pull/4884)
Request received. I don't think this should be backported yet, but if it proves itself reasonably stable, I can see it happening. No promises though :)
Oops, this is not finished; we need an MCO PR first.
I don't personally have a reproducer; David, do you have one?
I can see that there is indeed some work from machine-config/CRI-O here: https://github.com/openshift/machine-config-operator/pull/2574 (calling CNI del on a node reboot with internal_wipe feature)
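For reference, `internal_wipe` is a CRI-O configuration option; a drop-in fragment along the following lines would enable it (the drop-in path and the option's availability depend on the CRI-O version, so treat this as a sketch rather than a verified config):

```toml
# e.g. a file under /etc/crio/crio.conf.d/
[crio]
internal_wipe = true
```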
The reproducer should focus on the old sandbox IDs, verifying that they are both re-used and released. A validation from QA should focus on finding log evidence that CNI DEL is called for the old sandbox IDs. Is that enough for you?
Something like this test case could be used: have pods running with a CNI plugin that saves state somewhere non-volatile (not a tmpfs), reboot the node, and verify the resources are cleaned up upon reboot.
Turns out I wasn't as thorough as I should have been testing the reboot scenario. The attached patch is needed as well to actually call CNI DEL on reboot.
For QA: one can reboot the node and look for log messages saying "Successfully cleaned up network for pod $podID" for all pods that existed before the reboot.
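A sketch of that QA check. On a real node one would search the CRI-O journal, e.g. `journalctl -u crio | grep 'Successfully cleaned up network for pod'`; it is demonstrated below against a sample log line, with an illustrative pod ID and log formatting:

```shell
# Count occurrences of the cleanup message (wording taken from this report).
sample='level=info msg="Successfully cleaned up network for pod 0a1b2c3d"'
echo "$sample" | grep -c 'Successfully cleaned up network for pod'
```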
As an update to this one: there was one more set of patches needed, but I've tested on an OpenShift node and have confirmed the CNI DELs are called once the plugin comes up.
I will attempt to move forward with the MCO PR today
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.