Bug 1813343

Summary: Cloud Credential Operator pod crashlooping with golang segfault

Field | Value
---|---
Product | OpenShift Container Platform
Reporter | Ben Browning <bbrownin>
Component | Cloud Credential Operator
Assignee | Joel Diaz <jdiaz>
Status | CLOSED ERRATA
QA Contact | wang lin <lwan>
Severity | high
Docs Contact |
Priority | high
Version | 4.3.z
CC | cblecker, jdiaz, jeder, lmohanty, lwan, mwoodson, nmalik, vrutkovs, wking
Target Milestone | ---
Keywords | Upgrades
Target Release | 4.5.0
Hardware | Unspecified
OS | Unspecified
Whiteboard |
Fixed In Version |
Doc Type | Bug Fix
Story Points | ---
Clone Of |
 | 1813998 (view as bug list)
Environment |
Last Closed | 2020-07-13 17:19:58 UTC
Type | Bug
Regression | ---
Mount Type | ---
Documentation | ---
CRM |
Verified Versions |
Category | ---
oVirt Team | ---
RHEL 7.3 requirements from Atomic Host |
Cloudforms Team | ---
Target Upstream Version |
Embargoed |
Bug Depends On |
Bug Blocks | 1813998

Doc Text:

Cause: The Cloud Credential Operator could crash loop when the original cluster was installed with OpenShift 4.1.
Consequence: The CCO would be unable to reconcile the permissions requests found in the CredentialsRequest objects.
Fix: Do not assume that parts of the Infrastructure fields are available.
Result: The CCO can work with clusters that were originally installed with OpenShift 4.1.
Description
Ben Browning
2020-03-13 14:51:07 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context. The UpgradeBlocker flag has been added to this bug; it will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
  Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non-standard admin activities
Is this a regression?
  No, it's always been like this, we just never noticed
  Yes, from 4.2.z and 4.3.1

(In reply to Scott Dodson from comment #1)

> Who is impacted?

Anyone who has a cluster old enough that the Infrastructure CR doesn't include the Status.PlatformStatus fields (introduced in 4.2). In other words, AWS clusters originally installed with 4.1.

> What is the impact?

CCO cannot function properly on AWS.

> How involved is remediation?

It hasn't been tried, but an example remediation would involve manually updating the Infrastructure CRD and manually putting in the new Status.PlatformStatus field information.

> Is this a regression?

A regression for all AWS clusters originally installed with 4.1 that have been upgraded, up until these changes were merged:
https://github.com/openshift/cloud-credential-operator/pull/158 (in master)
https://github.com/openshift/cloud-credential-operator/pull/160 (in release-4.4)

Can you please confirm that your cluster fits the assessment in comment 2? Clearing UpgradeBlocker as this appears to be another instance of a previously known defect in updates to default fields between 4.1 and 4.2.

Tracking down the PlatformStatus existence assumption:

$ for Y in 2 3 4; do git --no-pager log -G PlatformStatus --oneline --decorate "origin/release-4.${Y}" -- pkg/controller/utils/utils.go; done
ed98561 (origin/pr/157) improve permissions simulation by adding region info
4c3a383 (origin/pr/155) improve permissions simulation by adding region info
79661cd (origin/pr/124) improve permissions simulation by adding region info
$ git log --all --decorate --grep '#124\|#155\|#157' | grep Bug
[release-4.2] Bug 1803221: improve permissions simulation by adding region info
[release-4.3] Bug 1757244: improve permissions simulation by adding region info
Bug 1750338: improve permissions simulation by adding region info

so we'll need backport fixes for all of those.
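A quick way to check whether a particular cluster fits that assessment is to inspect the Infrastructure CR directly. This is a hedged example, not a command from the thread: on a cluster originally installed with 4.1 and hitting this bug, the output is expected to be empty because the platformStatus stanza was never populated.

$ oc get infrastructure cluster -o jsonpath='{.status.platformStatus}'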
Following up on the referenced bugs:

$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1803221' | grep /errata/ | head -n1
<a href="https://access.redhat.com/errata/RHBA-2020:0614">RHBA-2020:0614</a>
$ curl -sH 'Accept:application/json' 'https://api.openshift.com/api/upgrades_info/v1/graph?channel=candidate-4.2' | jq -r '.nodes[] | select(.metadata.url == "https://access.redhat.com/errata/RHBA-2020:0614").version'
4.2.21
$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1757244' | grep /errata/ | head -n1
<a href="https://access.redhat.com/errata/RHBA-2020:0528">RHBA-2020:0528</a>
4.3.3
$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1750338' | grep /errata/
...no hits, but presumably 4.4.0-rc.0 is impacted...

That means that 4.1 -> 4.2.20 would be fine, but 4.1 -> 4.2.20 -> 4.2.21 would hit this bug. From an update-recommendation flow, we currently have a 4.1.21 -> 4.2.2 edge in candidate-4.2, and many more connections after that which could get us into trouble. We probably don't want to drop 4.2 -> 4.2 edges, which leaves us pruning 4.y-jumping edges. But there are also not that many 4.1 clusters left. We probably want to leave the graph alone for now, but once we get a fix out in 4.2 we should remove all 4.1 -> 4.2 edges to 4.2 releases older than the fixed version. And similarly, once we get a fixed 4.3, we should remove all 4.2 -> 4.3 edges older than the fixed version.
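Once a fixed 4.2 release is out, the 4.1 -> 4.2 edges that would need pruning could be enumerated from the same update-graph endpoint. This is a hedged sketch, not a command from the original comment; it assumes the Cincinnati graph format, where edges are pairs of indexes into the nodes array.

$ curl -sH 'Accept:application/json' 'https://api.openshift.com/api/upgrades_info/v1/graph?channel=candidate-4.2' | jq -r '. as $g | .edges[] | select($g.nodes[.[0]].version | startswith("4.1.")) | "\($g.nodes[.[0]].version) -> \($g.nodes[.[1]].version)"'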
> once we get a fix out in 4.2 we should remove all 4.1 -> 4.2 edges to 4.2 older than the fixed version. And similarly once we get a fixed 4.3, we should remove all 4.2 -> 4.3 edges older than the fixed version.

That seems like the safest way forward now, given that we've already fixed an important MCO issue in 4.2.20.
> Can you please confirm that your cluster fits the assessment in comment 2?

Yes, my cluster seems to fit the assessment in comment 2. It was originally installed with 4.1.16 on AWS and has followed this update path to get to this point: 4.1.16 -> 4.1.17 -> 4.1.18 -> 4.1.20 -> 4.1.21 -> 4.2.2 -> 4.2.9 -> 4.2.14 -> 4.2.16 -> 4.3.0 -> 4.3.1 -> 4.3.5.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
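For anyone else trying to confirm whether their cluster was originally installed with 4.1 before following a similar update path, the install history recorded in the ClusterVersion object can be checked. This is a hedged example, not a command from the thread; it assumes status.history is ordered newest-first, so the last entry is the original install.

$ oc get clusterversion version -o json | jq -r '.status.history[-1].version'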