+++ This bug was initially created as a clone of Bug #1813998 +++

+++ This bug was initially created as a clone of Bug #1813343 +++

Description of problem:

After upgrading from 4.3.1 to 4.3.5, my cloud-credential-operator pod is constantly crashlooping with a segfault of:

time="2020-03-13T14:23:36Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="found secret namespace" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry secret=openshift-image-registry/installer-cloud-credentials
time="2020-03-13T14:23:36Z" level=debug msg="running Exists" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="target secret exists" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="running sync" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="Loading infrastructure name: devint-6sjvm" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="syncing cluster operator status" controller=credreq_status
time="2020-03-13T14:23:36Z" level=debug msg="6 cred requests" controller=credreq_status
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message="No credentials requests reporting errors." reason=NoCredentialsFailing status=False type=Degraded
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message="6 of 6 credentials requests provisioned and reconciled." reason=ReconcilingComplete status=False type=Progressing
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message= reason= status=True type=Available
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message= reason= status=True type=Upgradeable
E0313 14:23:36.871042       1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:82
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/signal_unix.go:390
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/utils/utils.go:73
/go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:272
/go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:257
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/credentialsrequest/credentialsrequest_controller.go:452
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:213
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:163
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/asm_amd64.s:1337
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x118d149]

goroutine 1114 [running]:
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x17ae140, 0x2e10050)
	/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/cloud-credential-operator/pkg/controller/utils.LoadInfrastructureRegion(0x1d18020, 0xc0005c9500, 0x1d3d620, 0xc000714e40, 0xc0006f55e0, 0xc, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/utils/utils.go:73 +0xf9
github.com/openshift/cloud-credential-operator/pkg/aws/actuator.(*AWSActuator).sync(0xc000a3d770, 0x1d02620, 0xc0000ae078, 0xc001db9860, 0xc000714840, 0x3ddab2f20c70)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:272 +0x18b
github.com/openshift/cloud-credential-operator/pkg/aws/actuator.(*AWSActuator).Update(0xc000a3d770, 0x1d02620, 0xc0000ae078, 0xc001db9860, 0x4cd708701, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:257 +0x49
github.com/openshift/cloud-credential-operator/pkg/controller/credentialsrequest.(*ReconcileCredentialsRequest).Reconcile(0xc000a3d7d0, 0xc00003e840, 0x23, 0xc00073cc40, 0x18, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/credentialsrequest/credentialsrequest_controller.go:452 +0x26ea
github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller.(*Controller).processNextWorkItem(0xc0004643c0, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:213 +0x17d
github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller.(*Controller).Start.func1()
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:163 +0x36
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc001c7f6c0)
	/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001c7f6c0, 0x3b9aca00, 0x0, 0xc00003d401, 0xc00009a180)
	/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc001c7f6c0, 0x3b9aca00, 0xc00009a180)
	/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller.(*Controller).Start
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:162 +0x33e

Version-Release number of selected component (if applicable):

OCP 4.3.5

How reproducible:

It's segfaulting constantly. However, I suspect:

Steps to Reproduce:
1. Install a 4.1.x OCP cluster.
2. Upgrade via supported paths up to 4.3.5.

Actual results:

The cloud-credential-operator pod crashloops.

Expected results:

The cloud-credential-operator pod actually stays up.

Additional info:

There are multiple Alerts firing for this error. However, the cloud-credential CVO operator itself is not reporting a degraded status, which is suspicious.
The code at https://github.com/openshift/cloud-credential-operator/blob/88c7f1b1dfe63bb61e874acffa37bb69ff334a7d/pkg/controller/utils/utils.go#L73 expects a PlatformStatus block to exist in the status field, but mine does not have one. It looks like this PlatformStatus field was added sometime during 4.2, but only for newly-created clusters.

Here's my Infrastructure object with no PlatformStatus field:

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2019-09-24T15:57:33Z"
  generation: 1
  name: cluster
  resourceVersion: "404"
  selfLink: /apis/config.openshift.io/v1/infrastructures/cluster
  uid: 057b479e-dee4-11e9-89b3-026c9616ccc0
spec:
  cloudConfig:
    name: ""
status:
  apiServerInternalURI: https://api-int.devint.openshiftknativedemo.org:6443
  apiServerURL: https://api.devint.openshiftknativedemo.org:6443
  etcdDiscoveryDomain: devint.openshiftknativedemo.org
  infrastructureName: devint-6sjvm
  platform: AWS

--- Additional comment from W. Trevor King on 2020-03-13 18:07:27 UTC ---

Tracking down the PlatformStatus existence assumption:

$ for Y in 2 3 4; do git --no-pager log -G PlatformStatus --oneline --decorate "origin/release-4.${Y}" -- pkg/controller/utils/utils.go; done
ed98561 (origin/pr/157) improve permissions simulation by adding region info
4c3a383 (origin/pr/155) improve permissions simulation by adding region info
79661cd (origin/pr/124) improve permissions simulation by adding region info
$ git log --all --decorate --grep '#124\|#155\|#157' | grep Bug
    [release-4.2] Bug 1803221: improve permissions simulation by adding region info
    [release-4.3] Bug 1757244: improve permissions simulation by adding region info
    Bug 1750338: improve permissions simulation by adding region info

so we'll need backport fixes for all of those.
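For illustration, the class of panic above can be avoided by guarding the PlatformStatus dereference before reading the region. This is a minimal sketch with stand-in types modeled on the openshift/api Infrastructure shape, not the operator's actual code; `regionFromStatus` is a hypothetical helper:

```go
package main

import "fmt"

// Stand-ins for the config.openshift.io/v1 Infrastructure status types.
// Field names follow the real API shape, but these are illustrative only.
type AWSPlatformStatus struct {
	Region string
}

type PlatformStatus struct {
	AWS *AWSPlatformStatus
}

type InfrastructureStatus struct {
	// Nil on clusters born before the field existed (e.g. installed on 4.1).
	PlatformStatus *PlatformStatus
}

// regionFromStatus returns the AWS region, but, unlike an unconditional
// dereference (the panic at utils.go:73), it reports an error when the
// PlatformStatus block was never populated so the caller can fall back.
func regionFromStatus(status InfrastructureStatus) (string, error) {
	if status.PlatformStatus == nil || status.PlatformStatus.AWS == nil {
		return "", fmt.Errorf("infrastructure status has no AWS platform status")
	}
	return status.PlatformStatus.AWS.Region, nil
}

func main() {
	// A cluster born in 4.1: status.platformStatus was never populated.
	old := InfrastructureStatus{}
	if _, err := regionFromStatus(old); err != nil {
		fmt.Println("pre-4.2 cluster:", err)
	}

	// A cluster installed with a newer release has the field set.
	fresh := InfrastructureStatus{
		PlatformStatus: &PlatformStatus{AWS: &AWSPlatformStatus{Region: "us-east-1"}},
	}
	region, _ := regionFromStatus(fresh)
	fmt.Println("region:", region) // prints "region: us-east-1"
}
```

The key point is that the error path replaces the SIGSEGV: on a 4.1-born cluster the operator would degrade gracefully (or fall back to another region source) instead of crashlooping.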
Following up on the referenced bugs:

$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1803221' | grep /errata/ | head -n1
          <a href="https://access.redhat.com/errata/RHBA-2020:0614">RHBA-2020:0614</a>
$ curl -sH 'Accept:application/json' 'https://api.openshift.com/api/upgrades_info/v1/graph?channel=candidate-4.2' | jq -r '.nodes[] | select(.metadata.url == "https://access.redhat.com/errata/RHBA-2020:0614").version'
4.2.21
$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1757244' | grep /errata/ | head -n1
          <a href="https://access.redhat.com/errata/RHBA-2020:0528">RHBA-2020:0528</a>
4.3.3
$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1750338' | grep /errata/
...no hits, but presumably 4.4.0-rc.0 is impacted...

That means that 4.1 -> 4.2.20 would be fine, but 4.1 -> 4.2.20 -> 4.2.21 would hit this bug. From an update-recommendation flow, we currently have a 4.1.21 -> 4.2.2 edge in candidate-4.2, and many more connections after that which could get us into trouble. We probably don't want to drop 4.2 -> 4.2 edges, which leaves us pruning 4.y-jumping edges. But there are also not that many 4.1 clusters left. We probably want to leave the graph alone for now, but once we get a fix out in 4.2 we should remove all 4.1 -> 4.2 edges to 4.2 older than the fixed version. And similarly once we get a fixed 4.3, we should remove all 4.2 -> 4.3 edges older than the fixed version.

--- Additional comment from W. Trevor King on 2020-03-16 13:54:47 EDT ---

Based on the impacted edges above, the verification plan for the 4.4 fix is going to be:

1. Install a 4.1 cluster. Looks like 4.1.34 is a good choice because we recommend 4.1.24 -> 4.2.20 in stable-4.2.
2. Update to 4.2.20 (avoiding 4.2.21+ having the broken assumption that PlatformStatus is populated).
3. Update to 4.3.2 (avoid 4.3.3+ having the broken assumption that PlatformStatus is populated).
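The edge-pruning rule described above (drop 4.y-jumping edges into releases that predate the fix for that minor, while keeping same-minor edges) can be expressed as a small predicate. This is an illustrative sketch only; `shouldBlock` is a hypothetical helper, and the fixed patch levels in the example map are placeholders, not the actual releases that carried the fix:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parse splits a version like "4.2.21" into numeric components.
// Illustrative only; pre-release suffixes are not handled.
func parse(v string) (major, minor, patch int) {
	parts := strings.SplitN(v, ".", 3)
	major, _ = strconv.Atoi(parts[0])
	minor, _ = strconv.Atoi(parts[1])
	patch, _ = strconv.Atoi(parts[2])
	return
}

// shouldBlock applies the rule from the comment above: block a
// minor-version-jumping edge when its target predates the patch release
// that carries the fix for that minor; never block same-minor edges.
func shouldBlock(from, to string, fixedPatch map[int]int) bool {
	_, fromMinor, _ := parse(from)
	_, toMinor, toPatch := parse(to)
	if fromMinor >= toMinor {
		return false // not a 4.y-jumping edge; keep it
	}
	fixed, ok := fixedPatch[toMinor]
	if !ok {
		return false // no known fix level for this minor yet
	}
	return toPatch < fixed
}

func main() {
	// Placeholder fix levels: pretend 4.2.26 and 4.3.9 carry the fix.
	fixed := map[int]int{2: 26, 3: 9}
	fmt.Println(shouldBlock("4.1.24", "4.2.20", fixed)) // minor jump into pre-fix 4.2
	fmt.Println(shouldBlock("4.2.20", "4.2.21", fixed)) // same-minor edge is kept
}
```

In the real graph this pruning is done declaratively via blocked-edges entries in openshift/cincinnati-graph-data rather than in code; the predicate just makes the stated rule concrete.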
   We don't have any recommended edges into 4.3.2; [1] lists the motivating bugs, but none of them should actually break the update.
4. Force an update to the 4.4 nightly with the candidate fix.
5. Ensure the cred operator is not panicking.

[1]: https://github.com/openshift/cincinnati-graph-data/blob/dc2a3c4cd879b0eeb7153c50f8706ad45166e8ac/blocked-edges/4.3.2.yaml#L1

--- Additional comment from W. Trevor King on 2020-03-16 15:25:41 EDT ---

Similar verification plan for the related registry-operator issue in [1]. If this fix lands first, and you attempt verification before the registry operator also lands a fix, you can ignore a panicking registry operator (or otherwise complaining; I'm not actually clear on how gracefully it accepts the lack of install-config fallback) for the purpose of verifying this cred-operator bug.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1808425#c5

--- Additional comment from Scott Dodson on 2020-03-17 16:44:09 EDT ---
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
  Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non-standard admin activities
Is this a regression?
  No, it's always been like this, we just never noticed
  Yes, from 4.y.z to 4.y+1.z, or 4.y.z to 4.y.z+1
(In reply to Scott Dodson from comment #2)

> Who is impacted?

Clusters originally installed with 4.1 and upgraded to affected versions.

> What is the impact?

Cloud-credential-operator is unable to process any CredentialsRequests. If the permissions requested in a CredentialsRequest changed along with the introduction of the bug, then that CredentialsRequest could not be processed. Alerts would also potentially be firing since the CCO is unhealthy.

> How involved is remediation?

Manual patching of the Infrastructure CR to put in the new/updated Status fields.

> Is this a regression?

Yes, since the backport of the enhanced permissions simulation: https://github.com/openshift/cloud-credential-operator/pull/157
The upgrade path tested was: 4.1.24 -> 4.2.20 -> 4.3.0-0.nightly-2020-04-06-093556

The bug has been fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1393