Bug 1939294

Summary: OLM may not delete pods with grace period zero (force delete)
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: OLM
Assignee: tflannag
OLM sub component: OLM
QA Contact: Jian Zhang <jiazha>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: nhale, tflannag
Version: 4.8
Keywords: Triaged
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-07-27 22:53:48 UTC
Type: Bug

Description Clayton Coleman 2021-03-16 00:03:03 UTC
Force deleting a pod is not allowed for automated processes within Kube or OpenShift unless done by a human. That action is reserved because it effectively bypasses the safety mechanisms of a cluster that ensure at most one instance of a given pod is running on any node at a time, and it leaves the state of the system inconsistent between the apiserver and the node (the node may still run the old process indefinitely).

OLM is force deleting (grace period zero) the community and marketplace operator pods. It may not do so; instead, it should delete with a grace period of 1 if it wants "the pod to be deleted ASAP".
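For reference, a minimal client-go sketch of the distinction being asked for. The package, function, and variable names below are illustrative assumptions, not the actual OLM code:

// Sketch: delete a catalog pod with a short grace period instead of a
// force delete. A grace period of 0 bypasses graceful termination entirely
// (the kubelet may keep running the old container); a grace period of 1
// still deletes the pod "ASAP" but keeps apiserver and node consistent.
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func deleteCatalogPod(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	gracePeriod := int64(1) // not 0
	return client.CoreV1().Pods(ns).Delete(ctx, name, metav1.DeleteOptions{
		GracePeriodSeconds: &gracePeriod,
	})
}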

This was caught by debug code we added while looking at another bug where pods were force deleted when they should not have been (in https://github.com/openshift/kubernetes/pull/613)

I0314 00:01:24.469711      18 store.go:926] DEBUG: Consumer that is not node system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount requested delete of pods openshift-marketplace/community-operators-sdjjv with explicit grace period zero (deletionTimestamp=<nil>)
I0314 00:01:25.069608      18 store.go:926] DEBUG: Consumer that is not node system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount requested delete of pods openshift-marketplace/redhat-operators-px42b with explicit grace period zero (deletionTimestamp=<nil>)

This may not be deferred from 4.8.

Comment 4 Jian Zhang 2021-03-31 09:14:35 UTC
For OCP 4.8, OLM is built from the https://github.com/openshift/operator-framework-olm repo, no longer from the https://github.com/operator-framework/operator-lifecycle-manager repo.
I don't see that the fix PR https://github.com/operator-framework/operator-lifecycle-manager/pull/2047 was cherry-picked into the https://github.com/openshift/operator-framework-olm repo, so I'm changing the status to ASSIGNED for now.

Comment 13 Clayton Coleman 2021-04-22 19:13:28 UTC
I think I'm still seeing this:

Apr 22 15:21:57.646 W ns/openshift-marketplace pod/redhat-operators-4lscq node/ip-10-0-193-52.ec2.internal reason/DeleteWithoutGracePeriod
Apr 22 15:21:57.646 I ns/openshift-marketplace pod/redhat-operators-4lscq node/ip-10-0-193-52.ec2.internal reason/Deleted
Apr 22 15:21:58.362 W ns/openshift-marketplace pod/community-operators-w4lhz node/ip-10-0-188-231.ec2.internal reason/DeleteWithoutGracePeriod
Apr 22 15:21:58.362 I ns/openshift-marketplace pod/community-operators-w4lhz node/ip-10-0-188-231.ec2.internal reason/Deleted

This is a new test condition I was checking, but the test condition may be wrong. Are we positive that the operator code inside CI is up to date?

Comment 14 tflannag 2021-04-22 20:33:30 UTC
I just double-checked that the downstream repository still contains the bug fixes introduced in the PR(s) linked in this bug. It sounds like either those fixes weren't enough, the test condition needs to be updated, or our CI wasn't properly set up when building the downstream Prow configuration. I'd like to rule out the latter as quickly as possible, but we've been promoting images from the downstream repository for a couple of weeks now and have already verified a couple of other bugs at this point.

Comment 15 Clayton Coleman 2021-04-22 20:59:32 UTC
Based on the logs, I think my test condition just may not be accurate, because I do see this:

INFO[2021-04-22T20:41:23Z] Apr 22 20:33:16.081 W ns/openshift-marketplace pod/community-operators-ql69j node/ip-10-0-166-178.us-east-2.compute.internal reason/GracefulDelete in 1s
INFO[2021-04-22T20:41:23Z] Apr 22 20:34:02.593 W ns/openshift-marketplace pod/redhat-operators-wjcfm node/ip-10-0-166-178.us-east-2.compute.internal reason/GracefulDelete in 1s
INFO[2021-04-22T20:41:23Z] Apr 22 20:40:07.831 W ns/openshift-marketplace pod/redhat-marketplace-2vjpt node/ip-10-0-166-178.us-east-2.compute.internal reason/GracefulDelete in 1s

So I also consider this verified.
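For what it's worth, a rough sketch of the kind of event check being discussed, assuming the monitor output has already been collected into simple records; the PodEvent type and function are hypothetical, not the actual CI test condition:

// Hypothetical illustration: flag force-deleted openshift-marketplace pods
// by scanning collected event records for the DeleteWithoutGracePeriod reason.
package example

type PodEvent struct {
	Namespace string
	Pod       string
	Reason    string
}

func forceDeletedMarketplacePods(events []PodEvent) []string {
	var offenders []string
	for _, e := range events {
		if e.Namespace == "openshift-marketplace" && e.Reason == "DeleteWithoutGracePeriod" {
			offenders = append(offenders, e.Pod)
		}
	}
	return offenders
}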

Comment 18 errata-xmlrpc 2021-07-27 22:53:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438