Bug 1810722 - Node should not delete pods until all container status is available
Summary: Node should not delete pods until all container status is available
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: Clayton Coleman
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 1734524 1821576
Depends On: 1810652 1926546
Blocks: 1821341
Reported: 2020-03-05 19:16 UTC by Clayton Coleman
Modified: 2021-03-31 04:13 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: RestartNever pods would not report statuses correctly. Consequence: Fix: Bugfix upstream. Result:
Clone Of: 1810652
Clones: 1821341
Environment:
Last Closed: 2020-05-04 11:44:56 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 24649 0 None closed [release-4.4] Bug 1810722: Kubelet should not remove restart never pods until all status is reported 2021-02-08 07:33:52 UTC
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:45:26 UTC

Description Clayton Coleman 2020-03-05 19:16:30 UTC
+++ This bug was initially created as a clone of Bug #1810652 +++

The kubelet does not properly terminate pods that are RestartNever: upstream it reports success (even if the pod actually failed), and in OpenShift since 4.1 we provide a synthetic status (a fake 137 exit code). Now that we have fixed the issue upstream, we should backport it to 4.4 at least, and possibly 4.3.

The upstream e2e reproduces the issue by:

1. Creating a RestartNever pod that should always exit with status code 1
2. Waiting 0-4s
3. Deleting the pod
4. Observing the status written by the kubelet - no container should report exit code 0
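
The check in step 4 can be sketched roughly as follows. This is not the upstream e2e test, only a minimal illustration that assumes the pod has already been fetched as a dict in the shape of the Kubernetes Pod JSON (for example from `kubectl get pod -o json`). The function name `containers_reporting_success` and the sample pod are hypothetical.

```python
def containers_reporting_success(pod):
    """Return the names of containers whose terminated state reports exit code 0.

    For a RestartNever pod whose command always exits 1, this list should be
    empty; any entry means the kubelet wrote a misleading success status.
    """
    offenders = []
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for cs in statuses:
        terminated = (cs.get("state") or {}).get("terminated")
        if terminated is not None and terminated.get("exitCode") == 0:
            offenders.append(cs.get("name"))
    return offenders


# Hypothetical sample: one container that exited with status code 1, as expected.
sample_pod = {
    "metadata": {"name": "fail-pod", "namespace": "e2e-test"},
    "spec": {"restartPolicy": "Never"},
    "status": {
        "containerStatuses": [
            {"name": "fail", "state": {"terminated": {"exitCode": 1, "reason": "Error"}}}
        ]
    },
}

print(containers_reporting_success(sample_pod))
```

A non-empty result for a pod that is designed to fail would indicate the behaviour described in this bug.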

To test this in Origin, the e2e test is sufficient, and we can verify in upgrade jobs (which terminate lots of pods) that no openshift-* namespace pod exits with code 137 and reason ContainerStatusUnknown.
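
The upgrade-job verification could look something like the sketch below. It assumes a pod list already loaded as a dict in the shape of `kubectl get pods --all-namespaces -o json` output (an `items` array of Pod objects). The helper name `synthetic_terminations` and the sample data are hypothetical, and checking `lastState` and init containers in addition to `state` is an assumption of this sketch, not something the report specifies.

```python
def synthetic_terminations(pod_list):
    """Find containers in openshift-* namespaces that terminated with
    exit code 137 and reason ContainerStatusUnknown.

    Returns a list of (namespace, pod name, container name) tuples.
    """
    hits = []
    for pod in pod_list.get("items") or []:
        meta = pod.get("metadata") or {}
        namespace = meta.get("namespace", "")
        if not namespace.startswith("openshift-"):
            continue
        status = pod.get("status") or {}
        statuses = (status.get("initContainerStatuses") or []) + (
            status.get("containerStatuses") or []
        )
        for cs in statuses:
            # Check both the current and the previous container state.
            for key in ("state", "lastState"):
                term = (cs.get(key) or {}).get("terminated")
                if (
                    term
                    and term.get("exitCode") == 137
                    and term.get("reason") == "ContainerStatusUnknown"
                ):
                    hits.append((namespace, meta.get("name"), cs.get("name")))
                    break  # report each container at most once
    return hits


# Hypothetical sample data for illustration only.
sample_pods = {
    "items": [
        {
            "metadata": {"name": "a", "namespace": "openshift-example"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "main",
                        "state": {
                            "terminated": {
                                "exitCode": 137,
                                "reason": "ContainerStatusUnknown",
                            }
                        },
                    }
                ]
            },
        },
        {
            "metadata": {"name": "b", "namespace": "default"},
            "status": {
                "containerStatuses": [
                    {
                        "name": "main",
                        "state": {
                            "terminated": {
                                "exitCode": 137,
                                "reason": "ContainerStatusUnknown",
                            }
                        },
                    }
                ]
            },
        },
    ]
}

for hit in synthetic_terminations(sample_pods):
    print(hit)
```

With the fix in place, a scan like this over a completed upgrade job would be expected to return no entries.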

Comment 4 Scott Dodson 2020-04-07 18:48:55 UTC
*** Bug 1734524 has been marked as a duplicate of this bug. ***

Comment 5 Scott Dodson 2020-04-07 18:49:11 UTC
*** Bug 1821576 has been marked as a duplicate of this bug. ***

Comment 7 errata-xmlrpc 2020-05-04 11:44:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

Comment 8 W. Trevor King 2021-03-31 04:13:36 UTC
Scott had added UpgradeBlocker to this bug way back, but I don't think we ever ended up blocking update recommendations on this series, and the fix has been out for almost a year, and 4.4 is now end-of-life.  Removing the keyword to get it out of our suspect queue [1].

[1]: https://github.com/openshift/enhancements/pull/475

