Bug 1958888
Summary: | 4.7.6 -> 4.7.9 upgrade: leader election stuck | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Vadim Rutkovsky <vrutkovs> | |
Component: | OLM | Assignee: | tflannag | |
OLM sub component: | OLM | QA Contact: | xzha | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | high | |||
Priority: | high | CC: | ankithom, davegord, dhellmann, htariq, krizza, lmohanty, scuppett, sdodson, tflannag, wking | |
Version: | 4.7 | Keywords: | Triaged, Upgrades | |
Target Milestone: | --- | Flags: | davegord: needinfo- |
Target Release: | 4.9.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | Bug Fix | ||
Doc Text: |
Cause:
The marketplace-operator used the leader-for-life implementation, in which the ConfigMap holding the lease owner's identity carries an owner reference to the controller's pod. This is problematic when the node the pod was scheduled on becomes unavailable: the pod cannot be terminated, so the ConfigMap is never garbage collected and a new leader cannot be elected.
Consequence:
Minor version OCP upgrades were blocked because the newer marketplace-operator version could not win leader election. The ConfigMap holding the leader election lease had to be deleted manually to release the lock and complete the upgrade of the marketplace component.
Fix:
Switch to using the leader-with-lease leader election implementation.
Result:
|
Story Points: | --- | |
Clone Of: | ||||
: | 1965113 (view as bug list) | Environment: | ||
Last Closed: | 2021-10-18 17:31:03 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1998938 |
Description
Vadim Rutkovsky
2021-05-10 11:32:47 UTC
Deleted marketplace-operator-lock ConfigMap created April 20 in the openshift-marketplace Namespace

time="2021-05-08T13:54:52Z" level=info msg="Waiting to become leader."
time="2021-05-10T22:57:40Z" level=info msg="Elected leader."
time="2021-05-10T22:57:40Z" level=info msg="Starting the Cmd."

>>Any word/investigation on the openshift-network-operator also?

I don't know that there is a generic solution here. It looks like the network-operator had a similar problem but it came from a different client implementation (that potentially share similar roots?) and it was actually resolved in a 4.8 bz:
https://github.com/openshift/cluster-network-operator/pull/1052
https://bugzilla.redhat.com/show_bug.cgi?id=1936515

It seems like it'll be a problem for non 4.8 clusters, so I replied in the verified 4.8.0 bz and asked why it wasn't backported:
https://bugzilla.redhat.com/show_bug.cgi?id=1936515#c6

We don't currently have a way to get to this type of information via a PromQL query.

*** Bug 1965113 has been marked as a duplicate of this bug. ***

verify check latest 4.9 upgrade ci, there is no such marketplace issue.
http://virt-openshift-05.lab.eng.nay.redhat.com/ci-logs/upgrade_CI/16860/log
http://virt-openshift-05.lab.eng.nay.redhat.com/ci-logs/upgrade_CI/16859/log
LGTM, verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:3759