Bug 1884035 - Pods are illegally transitioning back to pending
Summary: Pods are illegally transitioning back to pending
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 1884697 (view as bug list)
Depends On:
Blocks: 1886247 1887501 1891539
 
Reported: 2020-09-30 20:01 UTC by David Eads
Modified: 2021-02-24 15:22 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1891539 (view as bug list)
Environment:
Last Closed: 2021-02-24 15:21:54 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift kubernetes pull 393 0 None closed bug 1884035: set lastterminationstate for container status even when CRI fails to return termination (or any) data 2021-02-17 09:25:11 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:22:21 UTC

Description David Eads 2020-09-30 20:01:56 UTC
We see frequent evidence of this in CI search: https://search.ci.openshift.org/?search=pod+should+not+transition+Running-%3EPending+even+when+terminated&maxAge=48h&context=1&type=build-log&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job

I'm adding a test (https://github.com/openshift/origin/pull/25572) to make the problem more obvious in the future, but the immediate problem is determining why this happens.  As with the previous illegal state transition bug, this has consequences for code that relies on state transitions.

Comment 1 Seth Jennings 2020-10-01 14:03:26 UTC
While I do acknowledge the severity of the bug, it doesn't result in any user-visible error in the product.  Code freeze for 4.6 is tomorrow and we have been looking at this bug for months (years?) at this point.  We've even merged a few PRs that seem to mitigate it somewhat, but it is obviously not fully resolved.

Deferring to 4.7.

Comment 2 David Eads 2020-10-01 20:36:26 UTC
We now have a clear junit signal that this is a widespread problem.  I'm updating sippy to provide this in a unique view.

The last time this happened, we identified a cause before deferring a bug: https://bugzilla.redhat.com/show_bug.cgi?id=1694087#c5

Quoting inline
  > Clayton Coleman 2019-05-09 14:50:07 UTC
  > I do not agree.  I want to know why it’s happening before it gets kicked out.  I see no evidence that this isn’t failing because of a serious bug in the platform.  Show me this is a trivial flake, and I’ll be ok bumping it out.  But if this is a bug in how Kubelet works that violates the guarantee stateful applications rely on then it is most definitely not acceptable to defer it.

Moving back into 4.6.0 for diagnosis of why it has cropped up again.  Before pushing out on the basis of "not a regression", I would encourage porting the tests back to 4.5 and 4.4.

Comment 3 David Eads 2020-10-01 20:38:31 UTC
In addition to causing deploymentconfig pod lifecycle problems, this invalidates the assumption made in the eviction API that a pending pod has never counted towards the PDB budget and can therefore be evicted without consulting PDBs.
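
As a concrete illustration of that assumption (hypothetical helper, not the actual apiserver eviction code), the shortcut amounts to:

package eviction

import (
	corev1 "k8s.io/api/core/v1"
)

// canSkipPDBCheck illustrates the assumption described above: a Pending pod is
// treated as if it never ran, so evicting it does not need to consult any
// PodDisruptionBudget. A Running pod that illegally reports Pending again
// would slip through this check. (Hypothetical helper, not apiserver code.)
func canSkipPDBCheck(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodPending
}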

Comment 4 Michael Gugino 2020-10-01 21:15:37 UTC
We're seeing an intermittent race condition on machine objects after successful patch events in the machine-api:

https://github.com/openshift/machine-api-operator/pull/711

If we're seeing races here, then it's possible we're getting race conditions all over the place and invalidating assumptions in various controllers, particularly if a controller observes a state in object X and performs an action against object Y.

One hypothesis is that the API is sending an event to the informers with the stale version of the object even though it successfully accepted the patch.
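
If that hypothesis is right, every consumer would need a guard along these lines (purely illustrative; the names are made up, not machine-api-operator code):

package machinecontroller

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// staleAfterPatch reports whether the copy delivered by the informer still
// carries the resourceVersion we patched *from*; if so, the patch result has
// not been observed yet and acting on this copy would re-apply stale state.
// resourceVersion is opaque, so only equality is checked.
func staleAfterPatch(observed metav1.Object, patchedFromResourceVersion string) bool {
	return observed.GetResourceVersion() == patchedFromResourceVersion
}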

Comment 5 Seth Jennings 2020-10-02 21:22:12 UTC
Deferring this to 4.7 again.

We know of no controller that would take action based on a Running->Pending transition and, while it definitely is a bug, it is not one likely to impact a customer.

I am planning to open a PR to check for the bad phase transition on the kubelet side here
https://github.com/kubernetes/kubernetes/blob/112dbd55860e600af525cedc255f2664e3f286aa/pkg/kubelet/kubelet_pods.go#L1513-L1520

The check would not correct/block the bad phase transition as we still want to surface the failure when it happens.

That should help us catch if/when this happens on the kubelet side and we can inspect the old and new pod status at that point.
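
Roughly the shape I have in mind (sketch only; the helper name and exact call site are placeholders, not the actual PR):

package kubelet

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// warnOnPhaseRegression only surfaces the bad transition; it does not block or
// correct it, so the failure still shows up where it happens.
func warnOnPhaseRegression(pod *v1.Pod, oldPhase, newPhase v1.PodPhase) {
	if newPhase != v1.PodPending {
		return
	}
	switch oldPhase {
	case v1.PodRunning, v1.PodSucceeded, v1.PodFailed:
		klog.Errorf("pod %s/%s: illegal phase regression %s -> %s; dump old and new status for inspection",
			pod.Namespace, pod.Name, oldPhase, newPhase)
	}
}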

We also plan to backport the test that surfaces this issue back to 4.5 and 4.4 so we can observe if this trend is changing across releases.

Comment 8 Seth Jennings 2020-10-08 14:57:07 UTC
*** Bug 1884697 has been marked as a duplicate of this bug. ***

Comment 9 Michael Gugino 2020-10-08 15:20:02 UTC
Seeing this in the wild upstream: https://github.com/kubernetes/kubernetes/pull/94958

Comment 10 Michael Gugino 2020-10-08 15:39:18 UTC
Seems we could also enforce the logic here: https://github.com/kubernetes/kubernetes/blob/90c9f7b3e198e82a756a68ffeac978a00d606e55/pkg/kubelet/kubelet_pods.go#L1513

If the kubelet calculates that the pod should be Pending, but we know we're currently in a phase that is post-Pending, we should stop the transition, roughly as sketched below.  Having a similar error message would allow easier regression testing in the future, WDYT?
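
A minimal sketch of that enforcement, assuming a hypothetical helper around the same getPhase call site (not a real patch):

package kubelet

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

// clampPhase keeps the previously observed phase when the newly computed one
// would regress to Pending, and logs a recognizable error message so
// regressions are easy to grep for in CI.
func clampPhase(podName string, oldPhase, newPhase v1.PodPhase) v1.PodPhase {
	startedOrFinished := oldPhase == v1.PodRunning || oldPhase == v1.PodSucceeded || oldPhase == v1.PodFailed
	if newPhase == v1.PodPending && startedOrFinished {
		klog.Errorf("pod %s: refusing illegal phase transition %s -> Pending", podName, oldPhase)
		return oldPhase
	}
	return newPhase
}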

Comment 12 Lukasz Szaszkiewicz 2020-10-14 06:47:58 UTC
Not sure why this bug was moved to ON_QA. It looks like the issue still persists on 4.7. Please take a look: https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-node%5C%5D+pods+should+never+transition+back+to+pending

Comment 13 Tom Sweeney 2020-10-14 13:29:19 UTC
Seth any thoughts on the past few comments on this one?

Comment 14 Seth Jennings 2020-10-14 13:39:48 UTC
This is a different issue than the one originally reported and fixed.  The test now includes a number of different situations in which pods transitioned to Pending.

https://github.com/openshift/openshift-tests/blob/292dfd1dc2d170bd8b5f2d4dfb2414ef657ff22b/pkg/monitor/pod.go#L87 is the case that was fixed

https://github.com/openshift/openshift-tests/blob/292dfd1dc2d170bd8b5f2d4dfb2414ef657ff22b/pkg/monitor/pod.go#L93 is the case we see now
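
For context, the invariant the monitor enforces amounts to roughly this (illustrative sketch, not the actual openshift-tests code):

package monitor

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// checkPhaseInvariant flags a pod that has been seen Running and is later
// observed Pending again as an invariant violation.
func checkPhaseInvariant(oldPod, newPod *v1.Pod) error {
	if oldPod.Status.Phase == v1.PodRunning && newPod.Status.Phase == v1.PodPending {
		return fmt.Errorf("invariant violation: pod %s/%s transitioned Running -> Pending",
			newPod.Namespace, newPod.Name)
	}
	return nil
}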

I would say let's limit the scope of this bug to the first case, since it impacts normal pods, and (re)open a new bug for the second:
https://bugzilla.redhat.com/show_bug.cgi?id=1886920

Comment 17 Michael Gugino 2020-10-27 15:53:59 UTC
I still see this failing in 4.7-specific jobs with static pods: https://search.ci.openshift.org/?search=invariant+violation&maxAge=336h&context=1&type=junit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 18 Ryan Phillips 2020-10-27 16:13:55 UTC
This PR does not address static pod transitions.

Moving back to ON_QA...

Comment 19 Ryan Phillips 2020-10-27 16:14:40 UTC
Setting Verified, since the last state set by Sunil was Verified.

Comment 22 errata-xmlrpc 2021-02-24 15:21:54 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

