Bug 1813343

Summary: Cloud Credential Operator pod crashlooping with golang segfault

Field | Value
---|---
Product | OpenShift Container Platform
Reporter | Ben Browning <bbrownin>
Component | Cloud Credential Operator
Assignee | Joel Diaz <jdiaz>
Status | CLOSED ERRATA
QA Contact | wang lin <lwan>
Severity | high
Docs Contact |
Priority | high
Version | 4.3.z
CC | cblecker, jdiaz, jeder, lmohanty, lwan, mwoodson, nmalik, vrutkovs, wking
Target Milestone | ---
Keywords | Upgrades
Target Release | 4.5.0
Hardware | Unspecified
OS | Unspecified
Whiteboard |
Fixed In Version |
Doc Type | Bug Fix
Story Points | ---
Clone Of |
 | 1813998 (view as bug list)
Environment |
Last Closed | 2020-07-13 17:19:58 UTC
Type | Bug
Regression | ---
Mount Type | ---
Documentation | ---
CRM |
Verified Versions |
Category | ---
oVirt Team | ---
RHEL 7.3 requirements from Atomic Host |
Cloudforms Team | ---
Target Upstream Version |
Embargoed |
Bug Depends On |
Bug Blocks | 1813998

Doc Text:

Cause: The Cloud Credential Operator could crash loop when the original cluster was installed with OpenShift 4.1.
Consequence: The CCO would be unable to reconcile the permissions requests found in the CredentialsRequest objects.
Fix: Do not assume that parts of the Infrastructure fields are available.
Result: The CCO can work with clusters that were originally installed with OpenShift 4.1.
Description
Ben Browning
2020-03-13 14:51:07 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context. The UpgradeBlocker flag has been added to this bug; it will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
  Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non-standard admin activities
Is this a regression?
  No, it's always been like this, we just never noticed
  Yes, from 4.2.z and 4.3.1

(In reply to Scott Dodson from comment #1)

> Who is impacted?

Anyone who has a cluster old enough that the Infrastructure CR doesn't include the Status.PlatformStatus fields (introduced in 4.2). In other words, AWS clusters originally installed with 4.1.

> What is the impact?

CCO cannot function properly on AWS.

> How involved is remediation?

It hasn't been tried, but an example remediation would involve manually updating the Infrastructure CRD and manually putting in the new Status.PlatformStatus field information.

> Is this a regression?

A regression for all AWS clusters originally installed with 4.1 that have been upgraded, up until these changes were merged:
https://github.com/openshift/cloud-credential-operator/pull/158 (in master)
https://github.com/openshift/cloud-credential-operator/pull/160 (in release-4.4)

Can you please confirm that your cluster fits the assessment in comment 2? Clearing UpgradeBlocker as this appears to be another instance of a previously known defect in updates to default fields between 4.1 and 4.2.

Tracking down the PlatformStatus existence assumption:

$ for Y in 2 3 4; do git --no-pager log -G PlatformStatus --oneline --decorate "origin/release-4.${Y}" -- pkg/controller/utils/utils.go; done
ed98561 (origin/pr/157) improve permissions simulation by adding region info
4c3a383 (origin/pr/155) improve permissions simulation by adding region info
79661cd (origin/pr/124) improve permissions simulation by adding region info
$ git log --all --decorate --grep '#124\|#155\|#157' | grep Bug
[release-4.2] Bug 1803221: improve permissions simulation by adding region info
[release-4.3] Bug 1757244: improve permissions simulation by adding region info
Bug 1750338: improve permissions simulation by adding region info

so we'll need backport fixes for all of those.
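A quick way to check whether a particular cluster fits that assessment is to inspect the Infrastructure CR directly. This is a hedged example, not a command from the thread: on a cluster originally installed with 4.1 and hitting this bug, the output is expected to be empty because the platformStatus stanza was never populated.

$ oc get infrastructure cluster -o jsonpath='{.status.platformStatus}'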
Following up on the referenced bugs:

$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1803221' | grep /errata/ | head -n1
<a href="https://access.redhat.com/errata/RHBA-2020:0614">RHBA-2020:0614</a>
$ curl -sH 'Accept:application/json' 'https://api.openshift.com/api/upgrades_info/v1/graph?channel=candidate-4.2' | jq -r '.nodes[] | select(.metadata.url == "https://access.redhat.com/errata/RHBA-2020:0614").version'
4.2.21
$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1757244' | grep /errata/ | head -n1
<a href="https://access.redhat.com/errata/RHBA-2020:0528">RHBA-2020:0528</a>
4.3.3
$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1750338' | grep /errata/
...no hits, but presumably 4.4.0-rc.0 is impacted...

That means that 4.1 -> 4.2.20 would be fine, but 4.1 -> 4.2.20 -> 4.2.21 would hit this bug. From an update-recommendation flow, we currently have a 4.1.21 -> 4.2.2 edge in candidate-4.2, and many more connections after that which could get us into trouble. We probably don't want to drop 4.2 -> 4.2 edges, which leaves us pruning 4.y-jumping edges. But there are also not that many 4.1 clusters left. We probably want to leave the graph alone for now, but once we get a fix out in 4.2 we should remove all 4.1 -> 4.2 edges to 4.2 releases older than the fixed version. And similarly, once we get a fixed 4.3, we should remove all 4.2 -> 4.3 edges older than the fixed version.
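Once a fixed 4.2 release is out, the 4.1 -> 4.2 edges that would need pruning could be enumerated from the same update-graph endpoint. This is a hedged sketch, not a command from the original comment; it assumes the Cincinnati graph format, where edges are pairs of indexes into the nodes array.

$ curl -sH 'Accept:application/json' 'https://api.openshift.com/api/upgrades_info/v1/graph?channel=candidate-4.2' | jq -r '. as $g | .edges[] | select($g.nodes[.[0]].version | startswith("4.1.")) | "\($g.nodes[.[0]].version) -> \($g.nodes[.[1]].version)"'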
> once we get a fix out in 4.2 we should remove all 4.1 -> 4.2 edges to 4.2 older than the fixed version. And similarly once we get a fixed 4.3, we should remove all 4.2 -> 4.3 edges older than the fixed version.

That seems like the safest way forward now, given that we've already fixed an important MCO issue in 4.2.20.
> Can you please confirm that your cluster fits the assessment in comment 2?

Yes, my cluster seems to fit the assessment in comment 2. It was originally installed with 4.1.16 on AWS and has followed this update path to get to this point: 4.1.16 -> 4.1.17 -> 4.1.18 -> 4.1.20 -> 4.1.21 -> 4.2.2 -> 4.2.9 -> 4.2.14 -> 4.2.16 -> 4.3.0 -> 4.3.1 -> 4.3.5.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
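For anyone else trying to confirm whether their cluster was originally installed with 4.1 before following a similar update path, the install history recorded in the ClusterVersion object can be checked. This is a hedged example, not a command from the thread; it assumes status.history is ordered newest-first, so the last entry is the original install.

$ oc get clusterversion version -o json | jq -r '.status.history[-1].version'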