Bug 1948137
Summary: | CNI DEL not called on node reboot - OCP 4 CRI-O. | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | David Hernández Fernández <dahernan> |
Component: | Node | Assignee: | Peter Hunt <pehunt> |
Node sub component: | CRI-O | QA Contact: | Sunil Choudhary <schoudha> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | medium | ||
Priority: | medium | CC: | aamirian, aos-bugs, mfiedler, mwhitehe, openshift-bugs-escalate, rmanes |
Version: | 4.6 | ||
Target Milestone: | --- | ||
Target Release: | 4.8.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause:
A reboot when using a CNI plugin that does not clean up its resources on reboot
Consequence:
Some CNI resources may be leaked
Fix:
CRI-O attempts to call CNI DEL on all containers that were running before reboot
Result:
CNI resources are cleaned up after reboot
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2021-07-27 22:58:59 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1977432 |
Description
David Hernández Fernández
2021-04-10 10:22:35 UTC
Customer escalation flag set due to impact. This can prevent us from deploying Red Hat OpenShift Cluster Platform at Morgan Stanley. Full Stop.

One piece of this is solved by the attached PR. For Calico and other pod-based plugins, we'll need a bit more to actually make it all work end-to-end.

Alright, most of the remaining pieces are in place. We should now be calling CNI DEL in the majority of situations.

Can this fix be backported to 4.7?

The PR to fix is quite large. Some changes can be dropped, but this is a pretty big structural change. I would like the 4.8 version to have some time to soak and catch any issues before we evaluate backporting, so I cannot promise a timeline for it being backported.

Now that an upstream fix has been released for at least some of the problem cases, the customer is requesting a backport to OpenShift 4.6 (their production) and 4.7 (their development).

4.8 variant has merged (https://github.com/cri-o/cri-o/pull/4884).

Request received. I don't think this should be backported yet, but if it proves itself reasonably stable, I can see it happening. No promises though :)

Oops, this is not finished, we need an MCO PR first.

I don't personally have a reproducer; David, do you have one?

Hi Peter. I can see that there is indeed some work from machine-config/CRI-O here: https://github.com/openshift/machine-config-operator/pull/2574 (calling CNI DEL on a node reboot with the internal_wipe feature). The reproducer should focus on reusing old sandbox IDs, so that they can be reused but also released. A validation from QA should focus on getting log evidence that CNI DEL is called for the old sandbox ID. Is that enough for you?

Something like this test case could be used: https://github.com/cri-o/cri-o/blob/ce6527b9be5c6fafbac212e9bb84471d0ad63d88/test/crio-wipe.bats#L266..L297
i.e. have pods running with a CNI plugin that saves state somewhere non-volatile (not a tmpfs), reboot the node, and verify the resources are cleaned up upon reboot.
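To illustrate the reproducer idea above (a CNI plugin that keeps state on non-volatile storage, followed by a node reboot), here is a minimal Python sketch of a leak check. It assumes a hypothetical plugin that writes one regular file per sandbox, named after the sandbox ID, into a state directory. The function name `leaked_entries`, the directory layout, and the way live sandbox IDs are obtained are all assumptions for illustration, not anything CRI-O or a specific plugin provides.

```python
from pathlib import Path
from typing import Iterable, List


def leaked_entries(state_dir: str, live_sandbox_ids: Iterable[str]) -> List[str]:
    """Return the names of state files that belong to no live sandbox.

    Assumption (hypothetical layout): the plugin keeps one regular file
    per sandbox in state_dir, and the file name is the sandbox ID.
    """
    live = set(live_sandbox_ids)
    root = Path(state_dir)
    if not root.is_dir():
        # No state directory at all means there is nothing to leak.
        return []
    # Any file whose name is not a live sandbox ID was left behind.
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_file() and entry.name not in live
    )
```

After the reboot, an empty result for the set of sandboxes that are currently running would suggest the old entries were released; a non-empty result lists the sandbox IDs whose state survived.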
Turns out I wasn't as thorough as I should have been testing the reboot scenario. The attached patch is needed as well to actually call CNI DEL on reboot. For QA, one can reboot the node and look for log messages saying "Successfully cleaned up network for pod $podID" for all pods that existed before reboot.

As an update to this one, there was one more set of patches that was needed, but I've tested on an OpenShift node and have confirmed the CNI DELs are called once the plugin comes up. I will attempt to move forward with the MCO PR today.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:2438
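For the QA check described in this thread (after a reboot, look for "Successfully cleaned up network for pod $podID" for every pod that existed before the reboot), a small Python sketch could automate the comparison. Only the message text comes from this bug; the function name `pods_missing_cleanup`, the assumption that the pod ID is the token right after the message, and the way the pre-reboot pod IDs are collected are all hypothetical.

```python
import re
from typing import Iterable, Set

# Message text quoted in this bug; the pod ID is assumed to be the next token.
CLEANUP_RE = re.compile(r"Successfully cleaned up network for pod (\S+)")


def pods_missing_cleanup(
    log_lines: Iterable[str], pre_reboot_pod_ids: Iterable[str]
) -> Set[str]:
    """Return the pre-reboot pod IDs that have no cleanup message in the log."""
    cleaned = set()
    for line in log_lines:
        match = CLEANUP_RE.search(line)
        if match:
            # Drop quotes or punctuation that may trail the ID in a log line.
            cleaned.add(match.group(1).strip("\"'.,"))
    return set(pre_reboot_pod_ids) - cleaned
```

The log lines could come, for example, from the CRI-O journal for the current boot, with the pod IDs recorded before the reboot. An empty result would mean a cleanup message was found for every pod in the list.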