Bug 1958888 - 4.7.6 -> 4.7.9 upgrade: leader election stuck
Summary: 4.7.6 -> 4.7.9 upgrade: leader election stuck
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: tflannag
QA Contact: xzha
URL:
Whiteboard:
Duplicates: 1965113
Depends On:
Blocks: 1998938
Reported: 2021-05-10 11:32 UTC by Vadim Rutkovsky
Modified: 2021-10-18 17:31 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The marketplace-operator was using the leader-for-life implementation, where a ConfigMap holding the lease owner's identity has owner references placed by the controller's pod. This is problematic when the node the pod was scheduled on becomes unavailable and the pod cannot be terminated: the ConfigMap cannot be properly garbage collected, so a new leader cannot be elected.
Consequence: Minor version OCP upgrades were blocked because the newer marketplace operator version could not win leader election. Manual cleanup of the ConfigMap holding the leader election lease was required to release the lock and complete the upgrade of the marketplace component.
Fix: Switch to the leader-with-lease leader election implementation.
Result:
Clone Of:
Environment:
Last Closed: 2021-10-18 17:31:03 UTC
Target Upstream Version:
Embargoed:
davegord: needinfo-


Links
| System | ID | Private | Priority | Status | Summary | Last Updated |
|---|---|---|---|---|---|---|
| Github | operator-framework operator-marketplace pull 414 | 0 | None | None | None | 2021-08-12 21:54:05 UTC |
| Red Hat Product Errata | RHSA-2021:3759 | 0 | None | None | None | 2021-10-18 17:31:28 UTC |

Description Vadim Rutkovsky 2021-05-10 11:32:47 UTC
Description of problem:
The vSphere build cluster was updated from 4.7.6 to 4.7.9 and got stuck because the marketplace operator did not progress to 4.7.9. The new pod is stuck on leader election:

```
time="2021-05-08T13:54:51Z" level=info msg="Go Version: go1.15.7"
time="2021-05-08T13:54:51Z" level=info msg="Go OS/Arch: linux/amd64"
time="2021-05-08T13:54:51Z" level=info msg="operator-sdk Version: v0.8.0"
time="2021-05-08T13:54:51Z" level=info msg="TLS keys set, using https for metrics"
time="2021-05-08T13:54:51Z" level=info msg="[metrics] Registering marketplace metrics"
time="2021-05-08T13:54:51Z" level=info msg="[metrics] Serving marketplace metrics"
time="2021-05-08T13:54:51Z" level=info msg="Config API is available"
time="2021-05-08T13:54:52Z" level=info msg="Registering Components."
time="2021-05-08T13:54:52Z" level=info msg="Waiting to become leader."
```

Version-Release number of selected component (if applicable):
4.7.6

Comment 3 Stephen Cuppett 2021-05-10 23:02:20 UTC
Deleted the marketplace-operator-lock ConfigMap (created April 20) in the openshift-marketplace namespace. The waiting pod then became leader:

time="2021-05-08T13:54:52Z" level=info msg="Waiting to become leader."
time="2021-05-10T22:57:40Z" level=info msg="Elected leader."
time="2021-05-10T22:57:40Z" level=info msg="Starting the Cmd."
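
The two timestamps above bound how long the new pod waited for leadership. A minimal Go sketch for computing that interval from the log lines follows; the helper names `logTime` and `electionWait` are hypothetical, and the sketch assumes the `time="..."` field is always an RFC 3339 timestamp, as it is in the excerpts in this report.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// timeField matches the time="..." field of a log line in the format shown
// in this report.
var timeField = regexp.MustCompile(`time="([^"]+)"`)

// logTime extracts and parses the RFC 3339 timestamp from one log line.
func logTime(line string) (time.Time, error) {
	m := timeField.FindStringSubmatch(line)
	if m == nil {
		return time.Time{}, fmt.Errorf("no time field in %q", line)
	}
	return time.Parse(time.RFC3339, m[1])
}

// electionWait returns the time elapsed between the "waiting" line and the
// "elected" line.
func electionWait(waiting, elected string) (time.Duration, error) {
	t0, err := logTime(waiting)
	if err != nil {
		return 0, err
	}
	t1, err := logTime(elected)
	if err != nil {
		return 0, err
	}
	return t1.Sub(t0), nil
}

func main() {
	waiting := `time="2021-05-08T13:54:52Z" level=info msg="Waiting to become leader."`
	elected := `time="2021-05-10T22:57:40Z" level=info msg="Elected leader."`

	d, err := electionWait(waiting, elected)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("waited for leadership:", d) // prints: waited for leadership: 57h2m48s
}
```

For the lines in this comment, the pod waited a little over 57 hours, and leadership was only gained after the lock ConfigMap was deleted by hand.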

Comment 11 Kevin Rizza 2021-05-18 19:20:22 UTC
>>Any word/investigation on the openshift-network-operator also?

I don't know that there is a generic solution here. It looks like the network-operator had a similar problem, but it came from a different client implementation (which potentially shares similar roots?), and it was actually resolved in a 4.8 bz:

https://github.com/openshift/cluster-network-operator/pull/1052
https://bugzilla.redhat.com/show_bug.cgi?id=1936515

It seems like it'll be a problem for non-4.8 clusters, so I replied in the verified 4.8.0 bz and asked why it wasn't backported: https://bugzilla.redhat.com/show_bug.cgi?id=1936515#c6

Comment 13 Dave Gordon 2021-05-26 20:15:32 UTC
We don't currently have a way to get to this type of information via a PromQL query.

Comment 15 Haseeb Tariq 2021-05-26 22:53:15 UTC
*** Bug 1965113 has been marked as a duplicate of this bug. ***

Comment 19 xzha 2021-08-23 09:52:09 UTC
verify

Checked the latest 4.9 upgrade CI runs; there is no such marketplace issue.

http://virt-openshift-05.lab.eng.nay.redhat.com/ci-logs/upgrade_CI/16860/log
http://virt-openshift-05.lab.eng.nay.redhat.com/ci-logs/upgrade_CI/16859/log

LGTM, verified.

Comment 22 errata-xmlrpc 2021-10-18 17:31:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

