Bug 1939294 - OLM may not delete pods with grace period zero (force delete)
Summary: OLM may not delete pods with grace period zero (force delete)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: tflannag
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-16 00:03 UTC by Clayton Coleman
Modified: 2021-07-27 22:54 UTC
2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:53:48 UTC
Target Upstream Version:
Embargoed:




Links
Github: operator-framework/operator-lifecycle-manager pull 2047 (open): Bug 1939294: Avoid setting metadata.GracePeriodSeconds to zero seconds (last updated 2021-03-16 22:30:14 UTC)
Red Hat Product Errata: RHSA-2021:2438 (last updated 2021-07-27 22:54:05 UTC)

Description Clayton Coleman 2021-03-16 00:03:03 UTC
Force deleting a pod is not allowed for automated processes within Kube or OpenShift; that action is reserved for humans. It effectively bypasses the safety mechanisms of a cluster that ensure only one instance of a pod is running at any one time, and it leaves the state of the system inconsistent between the apiserver and the node (the node may still run that old process indefinitely).

OLM is force deleting (grace period zero) the community and marketplace operator pods. It must not do so, and should instead delete with a grace period of 1 if it wants "the pod to be deleted ASAP".
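
[Editorial note: the linked PR is titled "Avoid setting metadata.GracePeriodSeconds to zero seconds". A minimal sketch of the intended behavior using client-go is shown below; it is illustrative only, not the actual OLM code, and the function name deletePodASAP is hypothetical.]

package podcleanup

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// deletePodASAP deletes a catalog pod with a one-second grace period rather
// than forcing an immediate (grace period zero) delete, so the kubelet still
// gets a chance to tear the container down and the apiserver and node stay
// consistent.
func deletePodASAP(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
    gracePeriod := int64(1)
    return client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
        GracePeriodSeconds: &gracePeriod,
    })
}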

This was caught by debug code we added while looking at another bug where pods were force deleted when they should not have been (in https://github.com/openshift/kubernetes/pull/613)

I0314 00:01:24.469711      18 store.go:926] DEBUG: Consumer that is not node system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount requested delete of pods openshift-marketplace/community-operators-sdjjv with explicit grace period zero (deletionTimestamp=<nil>)
I0314 00:01:25.069608      18 store.go:926] DEBUG: Consumer that is not node system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount requested delete of pods openshift-marketplace/redhat-operators-px42b with explicit grace period zero (deletionTimestamp=<nil>)

This must not be deferred out of 4.8.
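
[Editorial note: the DEBUG lines above come from temporary instrumentation added in openshift/kubernetes pull 613. A rough, hypothetical sketch of the shape of that check follows; names such as warnOnForceDelete are invented here, see the PR for the real code.]

package podauditsketch

import (
    "strings"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/klog/v2"
)

// warnOnForceDelete logs when a client that is not a kubelet asks for an
// explicit grace period of zero on a pod that has not already started
// graceful deletion, mirroring the DEBUG output quoted above.
func warnOnForceDelete(username string, pod *corev1.Pod, options *metav1.DeleteOptions) {
    isNode := strings.HasPrefix(username, "system:node:")
    explicitZero := options != nil && options.GracePeriodSeconds != nil && *options.GracePeriodSeconds == 0
    if !isNode && explicitZero && pod.DeletionTimestamp == nil {
        klog.Infof("DEBUG: Consumer that is not node %s requested delete of pods %s/%s with explicit grace period zero (deletionTimestamp=%v)",
            username, pod.Namespace, pod.Name, pod.DeletionTimestamp)
    }
}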

Comment 4 Jian Zhang 2021-03-31 09:14:35 UTC
For OCP 4.8, OLM is built from the https://github.com/openshift/operator-framework-olm repo, not from the https://github.com/operator-framework/operator-lifecycle-manager repo anymore.
I don't see that the fix PR https://github.com/operator-framework/operator-lifecycle-manager/pull/2047 has been cherry-picked into the https://github.com/openshift/operator-framework-olm repo, so I'm changing the status back to ASSIGNED for now.

Comment 13 Clayton Coleman 2021-04-22 19:13:28 UTC
I think I'm still seeing this:

Apr 22 15:21:57.646 W ns/openshift-marketplace pod/redhat-operators-4lscq node/ip-10-0-193-52.ec2.internal reason/DeleteWithoutGracePeriod
Apr 22 15:21:57.646 I ns/openshift-marketplace pod/redhat-operators-4lscq node/ip-10-0-193-52.ec2.internal reason/Deleted
Apr 22 15:21:58.362 W ns/openshift-marketplace pod/community-operators-w4lhz node/ip-10-0-188-231.ec2.internal reason/DeleteWithoutGracePeriod
Apr 22 15:21:58.362 I ns/openshift-marketplace pod/community-operators-w4lhz node/ip-10-0-188-231.ec2.internal reason/Deleted

This is a new test condition I was checking, but the test condition may be wrong. Are we positive that the operator code inside CI is up to date?

Comment 14 tflannag 2021-04-22 20:33:30 UTC
I just double-checked that the downstream repository still contains the bug fixes that were introduced in the PR(s) linked in this bug. It sounds like either those fixes weren't enough, the test condition needs to be updated, or our CI wasn't set up properly when we built out the downstream Prow configuration. I'd like to rule out the last possibility as quickly as possible, but we've been promoting images from the downstream repository for a couple of weeks now and have already verified a couple of other bugs at this point.

Comment 15 Clayton Coleman 2021-04-22 20:59:32 UTC
Based on the logs, I think my test condition just may not be accurate, because I do see this:

INFO[2021-04-22T20:41:23Z] Apr 22 20:33:16.081 W ns/openshift-marketplace pod/community-operators-ql69j node/ip-10-0-166-178.us-east-2.compute.internal reason/GracefulDelete in 1s
INFO[2021-04-22T20:41:23Z] Apr 22 20:34:02.593 W ns/openshift-marketplace pod/redhat-operators-wjcfm node/ip-10-0-166-178.us-east-2.compute.internal reason/GracefulDelete in 1s
INFO[2021-04-22T20:41:23Z] Apr 22 20:40:07.831 W ns/openshift-marketplace pod/redhat-marketplace-2vjpt node/ip-10-0-166-178.us-east-2.compute.internal reason/GracefulDelete in 1s

So I also consider this verified.
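
[Editorial note: for context on the verification above, a tiny illustrative condition follows. It is hypothetical, not the actual CI monitor code, and only captures the distinction being drawn: DeleteWithoutGracePeriod would count as a violation, while GracefulDelete, even with a 1s period, is the intended behavior.]

package monitorsketch

// violatesForceDeleteRule flags catalog pod deletions that happened without
// any grace period; graceful deletes (e.g. "GracefulDelete in 1s") pass.
func violatesForceDeleteRule(namespace, reason string) bool {
    return namespace == "openshift-marketplace" && reason == "DeleteWithoutGracePeriod"
}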

Comment 18 errata-xmlrpc 2021-07-27 22:53:48 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

