Bug 1813998

Summary: Cloud Credential Operator pod crashlooping with golang segfault
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Cloud Credential Operator
Assignee: Joel Diaz <jdiaz>
Status: CLOSED ERRATA
QA Contact: wang lin <lwan>
Severity: high
Priority: high
Version: 4.4
CC: bbrownin, cblecker, ccoleman, jdiaz, lmohanty, lwan, nmalik, pkanthal, suchaudh, vrutkovs, wking
Target Milestone: ---
Keywords: Upgrades
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: The cloud credential operator could crash loop when the cluster was originally installed with OpenShift 4.1.
Consequence: The CCO was unable to reconcile the permissions requests found in the CredentialsRequest objects.
Fix: Do not assume that all parts of the Infrastructure object's status fields are populated.
Result: The CCO works with clusters that were originally installed with OpenShift 4.1.
Story Points: ---
Clone Of: 1813343
Last Closed: 2020-05-04 11:46:27 UTC
Bug Depends On: 1813343
Bug Blocks: 1816704

Description W. Trevor King 2020-03-16 17:48:44 UTC
+++ This bug was initially created as a clone of Bug #1813343 +++

Description of problem:

After upgrading from 4.3.1 to 4.3.5, my cloud-credential-operator pod is constantly crashlooping with this segfault:

time="2020-03-13T14:23:36Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="found secret namespace" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry secret=openshift-image-registry/installer-cloud-credentials
time="2020-03-13T14:23:36Z" level=debug msg="running Exists" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="target secret exists" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="running sync" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="Loading infrastructure name: devint-6sjvm" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="syncing cluster operator status" controller=credreq_status
time="2020-03-13T14:23:36Z" level=debug msg="6 cred requests" controller=credreq_status
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message="No credentials requests reporting errors." reason=NoCredentialsFailing status=False type=Degraded
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message="6 of 6 credentials requests provisioned and reconciled." reason=ReconcilingComplete status=False type=Progressing
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message= reason= status=True type=Available
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message= reason= status=True type=Upgradeable
E0313 14:23:36.871042       1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:82
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/signal_unix.go:390
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/utils/utils.go:73
/go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:272
/go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:257
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/credentialsrequest/credentialsrequest_controller.go:452
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:213
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:163
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/asm_amd64.s:1337
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x118d149]

goroutine 1114 [running]:
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x17ae140, 0x2e10050)
	/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/cloud-credential-operator/pkg/controller/utils.LoadInfrastructureRegion(0x1d18020, 0xc0005c9500, 0x1d3d620, 0xc000714e40, 0xc0006f55e0, 0xc, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/utils/utils.go:73 +0xf9
github.com/openshift/cloud-credential-operator/pkg/aws/actuator.(*AWSActuator).sync(0xc000a3d770, 0x1d02620, 0xc0000ae078, 0xc001db9860, 0xc000714840, 0x3ddab2f20c70)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:272 +0x18b
github.com/openshift/cloud-credential-operator/pkg/aws/actuator.(*AWSActuator).Update(0xc000a3d770, 0x1d02620, 0xc0000ae078, 0xc001db9860, 0x4cd708701, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:257 +0x49
github.com/openshift/cloud-credential-operator/pkg/controller/credentialsrequest.(*ReconcileCredentialsRequest).Reconcile(0xc000a3d7d0, 0xc00003e840, 0x23, 0xc00073cc40, 0x18, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/credentialsrequest/credentialsrequest_controller.go:452 +0x26ea
github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller.(*Controller).processNextWorkItem(0xc0004643c0, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:213 +0x17d
github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller.(*Controller).Start.func1()
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:163 +0x36
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc001c7f6c0)
	/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001c7f6c0, 0x3b9aca00, 0x0, 0xc00003d401, 0xc00009a180)
	/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc001c7f6c0, 0x3b9aca00, 0xc00009a180)
	/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller.(*Controller).Start
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:162 +0x33e



Version-Release number of selected component (if applicable):

OCP 4.3.5

How reproducible:

It's segfaulting constantly. However, I suspect it only affects clusters that were originally installed with 4.1, whose Infrastructure object never had status.platformStatus populated (see below).


Steps to Reproduce:
1. Install a 4.1.x OCP cluster.
2. Upgrade via supported paths up to 4.3.5.

Actual results:

The cloud-credential-operator pod crashloops.


Expected results:

The cloud-credential-operator pod actually stays up.


Additional info:

There are multiple alerts firing for this error. However, the cloud-credential ClusterOperator itself is not reporting a Degraded status, which is suspicious.

The code at https://github.com/openshift/cloud-credential-operator/blob/88c7f1b1dfe63bb61e874acffa37bb69ff334a7d/pkg/controller/utils/utils.go#L73 expects a PlatformStatus block to exist in the status field, but mine does not have one. It looks like this PlatformStatus field was added sometime during 4.2, but it only gets populated on newly-installed clusters, not on clusters upgraded from earlier releases.

Here's my Infrastructure object with no PlatformStatus field:

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2019-09-24T15:57:33Z"
  generation: 1
  name: cluster
  resourceVersion: "404"
  selfLink: /apis/config.openshift.io/v1/infrastructures/cluster
  uid: 057b479e-dee4-11e9-89b3-026c9616ccc0
spec:
  cloudConfig:
    name: ""
status:
  apiServerInternalURI: https://api-int.devint.openshiftknativedemo.org:6443
  apiServerURL: https://api.devint.openshiftknativedemo.org:6443
  etcdDiscoveryDomain: devint.openshiftknativedemo.org
  infrastructureName: devint-6sjvm
  platform: AWS
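
For illustration only, here is a minimal Go sketch of the kind of nil guard utils.go needs before reading region data from an object shaped like the one above. The function name and fallback behavior are hypothetical, not the actual cloud-credential-operator patch:

package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

// regionFromInfrastructure returns the AWS region recorded in the Infrastructure
// status, or "" when status.platformStatus was never populated (as on clusters
// originally installed with 4.1). Callers must handle the empty result, e.g. by
// falling back to another source or reporting a reconcile error.
func regionFromInfrastructure(infra *configv1.Infrastructure) string {
	// Dereferencing infra.Status.PlatformStatus.AWS.Region without these checks
	// is exactly the nil-pointer panic shown in the stack trace above.
	if infra == nil || infra.Status.PlatformStatus == nil || infra.Status.PlatformStatus.AWS == nil {
		return ""
	}
	return infra.Status.PlatformStatus.AWS.Region
}

func main() {
	// An Infrastructure shaped like the pasted object: status.platform is set,
	// status.platformStatus is absent.
	oldCluster := &configv1.Infrastructure{
		Status: configv1.InfrastructureStatus{
			InfrastructureName: "devint-6sjvm",
			Platform:           configv1.AWSPlatformType,
		},
	}
	fmt.Printf("region: %q\n", regionFromInfrastructure(oldCluster))
}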

--- Additional comment from W. Trevor King on 2020-03-13 18:07:27 UTC ---

Tracking down the PlatformStatus existence assumption:

$ for Y in 2 3 4; do git --no-pager log -G PlatformStatus --oneline --decorate "origin/release-4.${Y}" -- pkg/controller/utils/utils.go; done
ed98561 (origin/pr/157) improve permissions simulation by adding region info
4c3a383 (origin/pr/155) improve permissions simulation by adding region info
79661cd (origin/pr/124) improve permissions simulation by adding region info
$ git log --all --decorate --grep '#124\|#155\|#157' | grep Bug
    [release-4.2] Bug 1803221: improve permissions simulation by adding region info
    [release-4.3] Bug 1757244: improve permissions simulation by adding region info
    Bug 1750338: improve permissions simulation by adding region info

so we'll need backport fixes for all of those.  Following up on the referenced bugs:

$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1803221' | grep /errata/ | head -n1
        <a href="https://access.redhat.com/errata/RHBA-2020:0614">RHBA-2020:0614</a>
$ curl -sH 'Accept:application/json' 'https://api.openshift.com/api/upgrades_info/v1/graph?channel=candidate-4.2' | jq -r '.nodes[] | select(.metadata.url == "https://access.redhat.com/errata/RHBA-2020:0614").version'
4.2.21
$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1757244' | grep /errata/ | head -n1
        <a href="https://access.redhat.com/errata/RHBA-2020:0528">RHBA-2020:0528</a>
4.3.3
$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1750338' | grep /errata/
...no hits, but presumably 4.4.0-rc.0 is impacted...

That means that 4.1 -> 4.2.20 would be fine, but 4.1 -> 4.2.20 -> 4.2.21 would hit this bug.  From an update-recommendation flow, we currently have a 4.1.21 -> 4.2.2 edge in candidate-4.2, and many more connections after that which could get us into trouble.  We probably don't want to drop 4.2->4.2 edges, which leaves us pruning 4.y-jumping edges.  But there are also not that many 4.1 clusters left.  We probably want to leave the graph alone for now, but once we get a fix out in 4.2 we should remove all 4.1 -> 4.2 edges to 4.2 older than the fixed version.  And similarly once we get a fixed 4.3, we should remove all 4.2 -> 4.3 edges older than the fixed version.

Comment 1 W. Trevor King 2020-03-16 17:54:47 UTC
Based on the impacted edges above, the verification plan for the 4.4 fix is going to be:

1. Install a 4.1 cluster.  Looks like 4.1.24 is a good choice because we recommend 4.1.24 -> 4.2.20 in stable-4.2.
2. Update to 4.2.20 (avoiding 4.2.21+, which has the broken assumption that PlatformStatus is populated).
3. Update to 4.3.2 (avoiding 4.3.3+, which has the broken assumption that PlatformStatus is populated).
   We don't have any recommended edges into 4.3.2; [1] lists the motivating bugs, but none of them should actually break the update.
4. Force an update to the 4.4 nightly with the candidate fix.
5. Ensure the cred operator is not panicking.

[1]: https://github.com/openshift/cincinnati-graph-data/blob/dc2a3c4cd879b0eeb7153c50f8706ad45166e8ac/blocked-edges/4.3.2.yaml#L1

Comment 2 W. Trevor King 2020-03-16 19:25:41 UTC
Similar verification plan for the related registry-operator issue in [1].  If this fix lands first and you attempt verification before the registry operator also lands a fix, you can ignore a panicking registry operator (or an otherwise complaining one; I'm not actually clear on how gracefully it handles the missing install-config fallback) for the purpose of verifying this cred-operator bug.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1808425#c5

Comment 3 Scott Dodson 2020-03-17 20:44:09 UTC
*** Bug 1814328 has been marked as a duplicate of this bug. ***

Comment 6 wang lin 2020-03-25 05:09:50 UTC
The upgrade path was:
4.1.24  -> 4.2.20  -> 4.3.0  -> 4.4.0-0.nightly-2020-03-24-225110

Although some other cluster operators had not yet finished rolling out when the cluster was upgraded from 4.3.0 to 4.4.0-0.nightly-2020-03-24-225110, the CCO upgraded successfully. The CCO does not panic when the Infrastructure .status has only the platform field.

Comment 10 errata-xmlrpc 2020-05-04 11:46:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581