Description of problem:

After upgrading from 4.3.1 to 4.3.5, my cloud-credential-operator pod is constantly crashlooping with a segfault of:

time="2020-03-13T14:23:36Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="found secret namespace" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry secret=openshift-image-registry/installer-cloud-credentials
time="2020-03-13T14:23:36Z" level=debug msg="running Exists" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="target secret exists" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="running sync" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="Loading infrastructure name: devint-6sjvm" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="syncing cluster operator status" controller=credreq_status
time="2020-03-13T14:23:36Z" level=debug msg="6 cred requests" controller=credreq_status
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message="No credentials requests reporting errors." reason=NoCredentialsFailing status=False type=Degraded
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message="6 of 6 credentials requests provisioned and reconciled." reason=ReconcilingComplete status=False type=Progressing
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message= reason= status=True type=Available
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message= reason= status=True type=Upgradeable
E0313 14:23:36.871042 1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:82
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/signal_unix.go:390
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/utils/utils.go:73
/go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:272
/go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:257
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/credentialsrequest/credentialsrequest_controller.go:452
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:213
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:163
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/asm_amd64.s:1337
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x118d149]

goroutine 1114 [running]:
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x17ae140, 0x2e10050)
    /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/cloud-credential-operator/pkg/controller/utils.LoadInfrastructureRegion(0x1d18020, 0xc0005c9500, 0x1d3d620, 0xc000714e40, 0xc0006f55e0, 0xc, 0x0, 0x0)
    /go/src/github.com/openshift/cloud-credential-operator/pkg/controller/utils/utils.go:73 +0xf9
github.com/openshift/cloud-credential-operator/pkg/aws/actuator.(*AWSActuator).sync(0xc000a3d770, 0x1d02620, 0xc0000ae078, 0xc001db9860, 0xc000714840, 0x3ddab2f20c70)
    /go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:272 +0x18b
github.com/openshift/cloud-credential-operator/pkg/aws/actuator.(*AWSActuator).Update(0xc000a3d770, 0x1d02620, 0xc0000ae078, 0xc001db9860, 0x4cd708701, 0x0)
    /go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:257 +0x49
github.com/openshift/cloud-credential-operator/pkg/controller/credentialsrequest.(*ReconcileCredentialsRequest).Reconcile(0xc000a3d7d0, 0xc00003e840, 0x23, 0xc00073cc40, 0x18, 0x0, 0x0, 0x0, 0x0)
    /go/src/github.com/openshift/cloud-credential-operator/pkg/controller/credentialsrequest/credentialsrequest_controller.go:452 +0x26ea
github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller.(*Controller).processNextWorkItem(0xc0004643c0, 0x0)
    /go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:213 +0x17d
github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller.(*Controller).Start.func1()
    /go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:163 +0x36
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc001c7f6c0)
    /go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001c7f6c0, 0x3b9aca00, 0x0, 0xc00003d401, 0xc00009a180)
    /go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc001c7f6c0, 0x3b9aca00, 0xc00009a180)
    /go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller.(*Controller).Start
    /go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:162 +0x33e

Version-Release number of selected component (if applicable):
OCP 4.3.5

How reproducible:
It's segfaulting constantly, though I suspect it depends on the cluster's upgrade history.

Steps to Reproduce:
1. Install a 4.1.x OCP cluster.
2. Upgrade via supported paths up to 4.3.5.

Actual results:
The cloud-credential-operator pod crashloops.

Expected results:
The cloud-credential-operator pod actually stays up.

Additional info:
There are multiple Alerts firing for this error. However, the cloud-credential CVO operator itself is not reporting a degraded status, which is suspicious.

The code at https://github.com/openshift/cloud-credential-operator/blob/88c7f1b1dfe63bb61e874acffa37bb69ff334a7d/pkg/controller/utils/utils.go#L73 expects a PlatformStatus block to exist in the status field, but mine does not have it. It looks like this PlatformStatus field was added sometime during 4.2, but only populated for newly-created clusters.

Here's my Infrastructure object with no PlatformStatus field:

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2019-09-24T15:57:33Z"
  generation: 1
  name: cluster
  resourceVersion: "404"
  selfLink: /apis/config.openshift.io/v1/infrastructures/cluster
  uid: 057b479e-dee4-11e9-89b3-026c9616ccc0
spec:
  cloudConfig:
    name: ""
status:
  apiServerInternalURI: https://api-int.devint.openshiftknativedemo.org:6443
  apiServerURL: https://api.devint.openshiftknativedemo.org:6443
  etcdDiscoveryDomain: devint.openshiftknativedemo.org
  infrastructureName: devint-6sjvm
  platform: AWS
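For reference, a quick way to check whether a given cluster is missing the block the code expects (just a jsonpath sketch, assuming a logged-in oc session with permission to read the cluster Infrastructure object):

$ oc get infrastructure cluster -o jsonpath='{.status.platformStatus}'

On my cluster this prints nothing, since the field is absent in the object above; on a cluster installed with 4.2 or later it should print the platformStatus block, including the AWS region.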
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context. The UpgradeBlocker flag has been added to this bug; it will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
  Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non-standard admin activities
Is this a regression?
  No, it's always been like this we just never noticed
  Yes, from 4.2.z and 4.3.1
(In reply to Scott Dodson from comment #1)

> Who is impacted?

Anyone who has a cluster old enough that the Infrastructure CR doesn't include the Status.PlatformStatus fields (introduced in 4.2). In other words, AWS clusters originally installed with 4.1.

> What is the impact?

CCO cannot function properly on AWS.

> How involved is remediation?

Hasn't been tried, but an example remediation would involve manually updating the Infrastructure CR and manually putting in the new Status.PlatformStatus field information.

> Is this a regression?

A regression for all AWS clusters originally installed with 4.1 that have been upgraded, up until these changes were merged:

https://github.com/openshift/cloud-credential-operator/pull/158 (in master)
https://github.com/openshift/cloud-credential-operator/pull/160 (in release-4.4)
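To make the "manually putting in the new Status.PlatformStatus field information" idea concrete, here is an untested sketch of what that could look like. The region value is only a placeholder and would have to match the cluster's real AWS region, and writing to the status subresource this way assumes an oc/kubectl new enough to support the --subresource flag; this is illustration only, not a verified remediation:

$ oc patch infrastructure cluster --subresource=status --type=merge \
    -p '{"status":{"platformStatus":{"type":"AWS","aws":{"region":"us-east-1"}}}}'

Whether anything else later reconciles or overwrites that status has not been checked.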
Can you please confirm that your cluster fits the assessment in comment 2?
Clearing UpgradeBlocker as this appears to be another instance of a previously known defect in updates to default fields between 4.1 and 4.2.
Tracking down the PlatformStatus existence assumption:

$ for Y in 2 3 4; do git --no-pager log -G PlatformStatus --oneline --decorate "origin/release-4.${Y}" -- pkg/controller/utils/utils.go; done
ed98561 (origin/pr/157) improve permissions simulation by adding region info
4c3a383 (origin/pr/155) improve permissions simulation by adding region info
79661cd (origin/pr/124) improve permissions simulation by adding region info
$ git log --all --decorate --grep '#124\|#155\|#157' | grep Bug
    [release-4.2] Bug 1803221: improve permissions simulation by adding region info
    [release-4.3] Bug 1757244: improve permissions simulation by adding region info
    Bug 1750338: improve permissions simulation by adding region info

so we'll need backport fixes for all of those. Following up on the referenced bugs:

$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1803221' | grep /errata/ | head -n1
  <a href="https://access.redhat.com/errata/RHBA-2020:0614">RHBA-2020:0614</a>
$ curl -sH 'Accept:application/json' 'https://api.openshift.com/api/upgrades_info/v1/graph?channel=candidate-4.2' | jq -r '.nodes[] | select(.metadata.url == "https://access.redhat.com/errata/RHBA-2020:0614").version'
4.2.21
$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1757244' | grep /errata/ | head -n1
  <a href="https://access.redhat.com/errata/RHBA-2020:0528">RHBA-2020:0528</a>
4.3.3
$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1750338' | grep /errata/
...no hits, but presumably 4.4.0-rc.0 is impacted...

That means that 4.1 -> 4.2.20 would be fine, but 4.1 -> 4.2.20 -> 4.2.21 would hit this bug. From an update-recommendation flow, we currently have a 4.1.21 -> 4.2.2 edge in candidate-4.2, and many more connections after that which could get us into trouble. We probably don't want to drop 4.2 -> 4.2 edges, which leaves us pruning 4.y-jumping edges. But there are also not that many 4.1 clusters left. We probably want to leave the graph alone for now, but once we get a fix out in 4.2 we should remove all 4.1 -> 4.2 edges to 4.2 releases older than the fixed version. And similarly, once we get a fixed 4.3, we should remove all 4.2 -> 4.3 edges to 4.3 releases older than the fixed version.
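For completeness, a rough way to enumerate the 4.1 -> 4.2 edges that would need pruning once a fixed 4.2 ships, assuming the graph response keeps the usual shape of a nodes array plus edges expressed as [from, to] index pairs into it:

$ curl -sH 'Accept:application/json' 'https://api.openshift.com/api/upgrades_info/v1/graph?channel=candidate-4.2' \
    | jq -r '.nodes as $n | .edges[] | "\($n[.[0]].version) -> \($n[.[1]].version)"' \
    | grep '^4\.1\.'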
> once we get a fix out in 4.2 we should remove all 4.1 -> 4.2 edges to 4.2 older than the fixed version. And similarly once we get a fixed 4.3, we should remove all 4.2 -> 4.3 edges older than the fixed version.

That seems to be the safest way now, given that we've already fixed an important MCO issue in 4.2.20.
> Can you please confirm that your cluster fits the assessment in comment 2?

Yes, my cluster seems to fit the assessment in comment 2. It was originally installed with 4.1.16 on AWS and has followed this update path to get to this point: 4.1.16 -> 4.1.17 -> 4.1.18 -> 4.1.20 -> 4.1.21 -> 4.2.2 -> 4.2.9 -> 4.2.14 -> 4.2.16 -> 4.3.0 -> 4.3.1 -> 4.3.5.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409