Bug 1884035 - Pods are illegally transitioning back to pending
Summary: Pods are illegally transitioning back to pending
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 1884697 (view as bug list)
Depends On:
Blocks: 1886247 1887501 1891539
 
Reported: 2020-09-30 20:01 UTC by David Eads
Modified: 2021-02-24 15:22 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1891539 (view as bug list)
Environment:
Last Closed: 2021-02-24 15:21:54 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift kubernetes pull 393 0 None closed bug 1884035: set lastterminationstate for container status even when CRI fails to return termination (or any) data 2021-02-17 09:25:11 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:22:21 UTC

Description David Eads 2020-09-30 20:01:56 UTC
We see frequent evidence of this in CI search: https://search.ci.openshift.org/?search=pod+should+not+transition+Running-%3EPending+even+when+terminated&maxAge=48h&context=1&type=build-log&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job

I'm adding a test (https://github.com/openshift/origin/pull/25572) to make the problem more obvious in the future, but the immediate problem is determining why this happens.  As with the previous illegal state transition bug, this has consequences for code that relies on state transitions.

Comment 1 Seth Jennings 2020-10-01 14:03:26 UTC
While I do acknowledge the severity of the bug, it doesn't result in any user-visible error in the product.  Code freeze for 4.6 is tomorrow and we have been looking at this bug for months (years?) at this point.  We've even merged a few PRs that seem to mitigate it somewhat, but it is obviously not fully resolved.

Deferring to 4.7.

Comment 2 David Eads 2020-10-01 20:36:26 UTC
We now have a clear junit signal that this is a widespread problem.  I'm updating sippy to provide this in a unique view.

The last time this happened, we identified a cause before deferring a bug: https://bugzilla.redhat.com/show_bug.cgi?id=1694087#c5

Quoting inline
  > Clayton Coleman 2019-05-09 14:50:07 UTC
  > I do not agree.  I want to know why it’s happening before it gets kicked out.  I see no evidence that this isn’t failing because of a serious bug in the platform.  Show me this is a trivial flake, and I’ll be ok bumping it out.  But if this is a bug in how Kubelet works that violates the guarantee stateful applications rely on then it is most definitely not acceptable to defer it.

Moving back into 4.6.0 for diagnosis of why it has cropped up again.  Before pushing out on the basis of "not a regression", I would encourage porting the tests back to 4.5 and 4.4.

Comment 3 David Eads 2020-10-01 20:38:31 UTC
In addition to causing deploymentconfig pod lifecycle problems, this invalidates the assumption made in the eviction API that a pending pod has never counted towards the PDB budget and can therefore be evicted without consulting PDBs.
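
As a concrete illustration of that assumption (hypothetical helper, not the actual apiserver eviction code), the shortcut amounts to:

package eviction

import (
	corev1 "k8s.io/api/core/v1"
)

// canSkipPDBCheck illustrates the assumption described above: a Pending pod is
// treated as if it never ran, so evicting it does not need to consult any
// PodDisruptionBudget. A Running pod that illegally reports Pending again
// would slip through this check. (Hypothetical helper, not apiserver code.)
func canSkipPDBCheck(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodPending
}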

Comment 4 Michael Gugino 2020-10-01 21:15:37 UTC
We're seeing an intermittent race condition on machine objects after successful patch events in the machine-api:

https://github.com/openshift/machine-api-operator/pull/711

If we're seeing races here, then it's possible we're getting race conditions all over the place and invalidating assumptions in various controllers, particularly if a controller observes a state in object X and performs an action against object Y.

One hypothesis is that the API is sending an event to the informers with the stale version of the object even though it successfully accepted the patch.
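
If that hypothesis is right, every consumer would need a guard along these lines (purely illustrative; the names are made up, not machine-api-operator code):

package machinecontroller

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// staleAfterPatch reports whether the copy delivered by the informer still
// carries the resourceVersion we patched *from*; if so, the patch result has
// not been observed yet and acting on this copy would re-apply stale state.
// resourceVersion is opaque, so only equality is checked.
func staleAfterPatch(observed metav1.Object, patchedFromResourceVersion string) bool {
	return observed.GetResourceVersion() == patchedFromResourceVersion
}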

Comment 5 Seth Jennings 2020-10-02 21:22:12 UTC
Deferring this to 4.7 again.

We know of no controller that would take action based on a Running->Pending transition and, while it definitely is a bug, it is not one likely to impact a customer.

I am planning to open a PR to check for the bad phase transition on the kubelet side here
https://github.com/kubernetes/kubernetes/blob/112dbd55860e600af525cedc255f2664e3f286aa/pkg/kubelet/kubelet_pods.go#L1513-L1520

The check would not correct/block the bad phase transition as we still want to surface the failure when it happens.

That should help us catch if/when this happens on the kubelet side and we can inspect the old and new pod status at that point.
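
Roughly the shape I have in mind (sketch only; the helper name and exact call site are placeholders, not the actual PR):

package kubelet

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// warnOnPhaseRegression only surfaces the bad transition; it does not block or
// correct it, so the failure still shows up where it happens.
func warnOnPhaseRegression(pod *v1.Pod, oldPhase, newPhase v1.PodPhase) {
	if newPhase != v1.PodPending {
		return
	}
	switch oldPhase {
	case v1.PodRunning, v1.PodSucceeded, v1.PodFailed:
		klog.Errorf("pod %s/%s: illegal phase regression %s -> %s; dump old and new status for inspection",
			pod.Namespace, pod.Name, oldPhase, newPhase)
	}
}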

We also plan to backport the test that surfaces this issue back to 4.5 and 4.4 so we can observe if this trend is changing across releases.

Comment 8 Seth Jennings 2020-10-08 14:57:07 UTC
*** Bug 1884697 has been marked as a duplicate of this bug. ***

Comment 9 Michael Gugino 2020-10-08 15:20:02 UTC
Seeing this in the wild upstream: https://github.com/kubernetes/kubernetes/pull/94958

Comment 10 Michael Gugino 2020-10-08 15:39:18 UTC
Seems we could also enforce the logic here: https://github.com/kubernetes/kubernetes/blob/90c9f7b3e198e82a756a68ffeac978a00d606e55/pkg/kubelet/kubelet_pods.go#L1513

If the kubelet calculates that the pod should be Pending, but we know we're currently in a phase that is post-Pending, we should stop the transition, roughly as sketched below.  Having a similar error message would allow easier regression testing in the future, WDYT?
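
A minimal sketch of that enforcement, assuming a hypothetical helper around the same getPhase call site (not a real patch):

package kubelet

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// clampPhase keeps the previously observed phase when the newly computed one
// would regress to Pending, and logs a recognizable error message so
// regressions are easy to grep for in CI.
func clampPhase(podName string, oldPhase, newPhase v1.PodPhase) v1.PodPhase {
	startedOrFinished := oldPhase == v1.PodRunning || oldPhase == v1.PodSucceeded || oldPhase == v1.PodFailed
	if newPhase == v1.PodPending && startedOrFinished {
		klog.Errorf("pod %s: refusing illegal phase transition %s -> Pending", podName, oldPhase)
		return oldPhase
	}
	return newPhase
}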

Comment 12 Lukasz Szaszkiewicz 2020-10-14 06:47:58 UTC
Not sure why this bug was moved to ON_QA. It looks like the issue still persists on 4.7. Please take a look: https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-node%5C%5D+pods+should+never+transition+back+to+pending

Comment 13 Tom Sweeney 2020-10-14 13:29:19 UTC
Seth any thoughts on the past few comments on this one?

Comment 14 Seth Jennings 2020-10-14 13:39:48 UTC
This is a different issue than the one originally reported and fixed.  The test now includes a number of different situations in which pods transitioned to Pending.

https://github.com/openshift/openshift-tests/blob/292dfd1dc2d170bd8b5f2d4dfb2414ef657ff22b/pkg/monitor/pod.go#L87 is the case that was fixed

https://github.com/openshift/openshift-tests/blob/292dfd1dc2d170bd8b5f2d4dfb2414ef657ff22b/pkg/monitor/pod.go#L93 is the case we see now
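
For context, the invariant the monitor enforces amounts to roughly this (illustrative sketch, not the actual openshift-tests code):

package monitor

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// checkPhaseInvariant flags a pod that has been seen Running and is later
// observed Pending again as an invariant violation.
func checkPhaseInvariant(oldPod, newPod *v1.Pod) error {
	if oldPod.Status.Phase == v1.PodRunning && newPod.Status.Phase == v1.PodPending {
		return fmt.Errorf("invariant violation: pod %s/%s transitioned Running -> Pending",
			newPod.Namespace, newPod.Name)
	}
	return nil
}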

I would say let's limit the scope of this bug to the first case, since it impacts normal pods, and (re)open a new bug for the second:
https://bugzilla.redhat.com/show_bug.cgi?id=1886920

Comment 17 Michael Gugino 2020-10-27 15:53:59 UTC
I still see this failing in 4.7-specific jobs with static pods: https://search.ci.openshift.org/?search=invariant+violation&maxAge=336h&context=1&type=junit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 18 Ryan Phillips 2020-10-27 16:13:55 UTC
This PR does not address static pod transitions.

Moving back to ON_QA...

Comment 19 Ryan Phillips 2020-10-27 16:14:40 UTC
Setting Verified, since the last state set by Sunil was Verified.

Comment 22 errata-xmlrpc 2021-02-24 15:21:54 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

