Bug 1820507
| Summary: | pod/project stuck at terminating status: The container could not be located when the pod was terminated (Exit Code: 137) |
|---|---|
| Product: | OpenShift Container Platform |
| Reporter: | Maciej Szulik <maszulik> |
| Component: | Node |
| Assignee: | Mrunal Patel <mpatel> |
| Status: | CLOSED ERRATA |
| QA Contact: | Sunil Choudhary <schoudha> |
| Severity: | urgent |
| Priority: | urgent |
| Docs Contact: | |
| Version: | 4.3.0 |
| CC: | aaleman, aos-bugs, hongkliu, jokerman, mfojtik, mpatel, rphillips, skuznets, wking, yinzhou |
| Target Milestone: | --- |
| Keywords: | Reopened |
| Target Release: | 4.5.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | If docs needed, set a value |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | 1819954 |
| : | 1822268 |
| Environment: | |
| Last Closed: | 2020-08-04 18:03:57 UTC |
| Type: | --- |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | |
| Bug Blocks: | 1822268 |
Description
Maciej Szulik
2020-04-03 08:56:37 UTC
I split this part off from https://bugzilla.redhat.com/show_bug.cgi?id=1819954 because there are two problems: one is oc panicking, which is handled in the other BZ, and the other is the terminating pods. The latter is still very unclear. Hongkai, can you provide more concrete information on what failed and where? I'm asking for kubelet logs, which parts are failing, and what exactly is stuck; the description above is not enough to debug the problem.

The cluster-autoscaler logs showed that it failed to scale down the cluster because draining the nodes failed. I then tried to run `oc adm drain <node>` manually and it failed with similar output:

There are pending pods in node "ip-10-0-131-192.ec2.internal" when an error occurred: [error when evicting pod "baremetal-installer-build": pods "baremetal-installer-build" is forbidden: unable to create new content in namespace ci-op-0s7gqtnf because it is being terminated, error when evicting pod "openshift-acme-exposer-build": pods "openshift-acme-exposer-build" is forbidden: unable to create new content in namespace ci-op-i7xcfyv9 because it is being terminated, error when evicting pod "kubevirt-test-build": pods "kubevirt-test-build" is forbidden: unable to create new content in namespace ci-op-rmyyz9vc because it is being terminated]

It panicked after some time (see above). I also tried `oc adm must-gather`, because I thought you might need the data later for debugging, and that failed too. I collected the node logs with `oc adm node-logs` (only for ip-10-0-131-192.ec2.internal) and will upload them later. Yesterday we deleted the nodes that the autoscaler failed to scale down. The cluster is the production CI infrastructure, so we had to fix it. must-gather works now (I guess it does not help much, because the broken nodes have been removed from the cluster).

To clarify, the dentries are leaking on curl liveness calls. This is technically a kernel bug, but we are fixing it in CRI-O by adding default_env support (https://github.com/cri-o/cri-o/pull/3611) and setting NSS_SDB_USE_CACHE=no within containers. If a container sets a different value, the container's value overrides the variable CRI-O injects. There are other fixes going into CRI-O that correct the error handling in deferred functions:

https://github.com/cri-o/cri-o/pull/3600
https://github.com/cri-o/cri-o/pull/3608
https://github.com/cri-o/cri-o/pull/3597
https://github.com/cri-o/cri-o/pull/3592
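For reference, the injected variable would end up looking roughly like this in the CRI-O configuration once default_env support is available. This is a sketch only: the drop-in path is illustrative, and the exact option name and table should be checked against the crio.conf shipped with the fix.

```
# Hypothetical drop-in, e.g. /etc/crio/crio.conf.d/01-default-env.conf
[crio.runtime]
# Injected into every container by CRI-O; a value set by the container
# itself takes precedence over this default.
default_env = [
    "NSS_SDB_USE_CACHE=no",
]
```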
Also related to this bug series are the two post-fix mitigation bugs: bug 1829664 and bug 1829999.

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to bug 1822269. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
Example: Customers running 4.3 (under some conditions?) who try to drain nodes. Also customers running 4.4 before bug 1822268 landed (in some specific 4.4 RC?). We don't (usually?) pull edges into RCs or for non-regressions, so this would probably mean pulling 4.2 -> 4.3 edges for 4.3 releases that do not carry the fix from this bug's 4.3 clone (bug 1822269).

What is the impact? Is it serious enough to warrant blocking edges?
Example: Nodes get stuck draining forever and possibly need manual intervention to unstick them.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
Example: Updating to a fixed version keeps *new* nodes from getting stuck like this, but does not automatically resolve existing stuck nodes. Bug 1829999 is about alerting admins impacted by this issue and pointing them at mitigation.

Also, if anyone has ideas about how we can think about the remediation question more proactively, so we aren't surprised by the "fix avoids the lock for new attempts but does not resolve resources that are already locked up" situation weeks after diagnosing the issue and floating a PR, we'd love to improve that part of our bugfix/impact flow.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.