Bug 1958888

Summary:	4.7.6 -> 4.7.9 upgrade: leader election stuck
Product:	OpenShift Container Platform	Reporter:	Vadim Rutkovsky <vrutkovs>
Component:	OLM	Assignee:	tflannag
OLM sub component:	OLM	QA Contact:	xzha
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	ankithom, davegord, dhellmann, htariq, krizza, lmohanty, scuppett, sdodson, tflannag, wking
Version:	4.7	Keywords:	Triaged, Upgrades
Target Milestone:	---	Flags:	davegord: needinfo-
Target Release:	4.9.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: The marketplace-operator used the leader-for-life election implementation, in which the ConfigMap that records the lock holder's identity carries an owner reference to the controller's pod. If the node that pod was scheduled on became unavailable and the pod could not be terminated, the ConfigMap was never garbage collected, so no new leader could be elected. Consequence: Minor version OCP upgrades were blocked because the newer marketplace-operator version could not win leader election. Completing the upgrade of the marketplace component required manually deleting the ConfigMap that held the leader election lock. Fix: Switch to the leader-with-lease leader election implementation. Result:	Story Points:	---
Clone Of:
Clones:	1965113		Environment:
Last Closed:	2021-10-18 17:31:03 UTC	Type:	Bug
Bug Blocks:	1998938

Description Vadim Rutkovsky 2021-05-10 11:32:47 UTC

Description of problem:
The vSphere build cluster was updated from 4.7.6 to 4.7.9 and is stuck because the marketplace operator is not progressing to 4.7.9. The new pod got stuck on leader election:

```
time="2021-05-08T13:54:51Z" level=info msg="Go Version: go1.15.7"
time="2021-05-08T13:54:51Z" level=info msg="Go OS/Arch: linux/amd64"
time="2021-05-08T13:54:51Z" level=info msg="operator-sdk Version: v0.8.0"
time="2021-05-08T13:54:51Z" level=info msg="TLS keys set, using https for metrics"
time="2021-05-08T13:54:51Z" level=info msg="[metrics] Registering marketplace metrics"
time="2021-05-08T13:54:51Z" level=info msg="[metrics] Serving marketplace metrics"
time="2021-05-08T13:54:51Z" level=info msg="Config API is available"
time="2021-05-08T13:54:52Z" level=info msg="Registering Components."
time="2021-05-08T13:54:52Z" level=info msg="Waiting to become leader."
```

Version-Release number of selected component (if applicable):
4.7.6

Comment 3 Stephen Cuppett 2021-05-10 23:02:20 UTC

Deleted the marketplace-operator-lock ConfigMap (created April 20) in the openshift-marketplace namespace. The waiting pod was then elected leader:

time="2021-05-08T13:54:52Z" level=info msg="Waiting to become leader."
time="2021-05-10T22:57:40Z" level=info msg="Elected leader."
time="2021-05-10T22:57:40Z" level=info msg="Starting the Cmd."

Comment 11 Kevin Rizza 2021-05-18 19:20:22 UTC

>>Any word/investigation on the openshift-network-operator also?

I don't know that there is a generic solution here. It looks like the network-operator had a similar problem, but it came from a different client implementation (which potentially shares similar roots?), and it was actually resolved in a 4.8 BZ:

https://github.com/openshift/cluster-network-operator/pull/1052
https://bugzilla.redhat.com/show_bug.cgi?id=1936515

It seems like it'll be a problem for non-4.8 clusters, so I replied in the verified 4.8.0 BZ and asked why it wasn't backported: https://bugzilla.redhat.com/show_bug.cgi?id=1936515#c6

Comment 13 Dave Gordon 2021-05-26 20:15:32 UTC

We don't currently have a way to get to this type of information via a PromQL query.

Comment 15 Haseeb Tariq 2021-05-26 22:53:15 UTC

*** Bug 1965113 has been marked as a duplicate of this bug. ***

Comment 19 xzha 2021-08-23 09:52:09 UTC

Verification:

Checked the latest 4.9 upgrade CI runs; the marketplace issue does not occur.

http://virt-openshift-05.lab.eng.nay.redhat.com/ci-logs/upgrade_CI/16860/log
http://virt-openshift-05.lab.eng.nay.redhat.com/ci-logs/upgrade_CI/16859/log

LGTM, verified.

Comment 22 errata-xmlrpc 2021-10-18 17:31:03 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759