Bug 1819183 - Cloud Credential Operator pod crashlooping with golang segfault
Summary: Cloud Credential Operator pod crashlooping with golang segfault
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Credential Operator
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 4.2.z
Assignee: Joel Diaz
QA Contact: wang lin
URL:
Whiteboard:
Depends On: 1816704
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-03-31 12:18 UTC by Scott Dodson
Modified: 2020-04-14 11:58 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1816704
Environment:
Last Closed: 2020-04-14 11:58:24 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift cloud-credential-operator pull 173 None closed Bug 1819183: handle old Infrastructure objects without PlatformStatus 2020-04-07 08:36:31 UTC
Red Hat Product Errata RHBA-2020:1398 None None None 2020-04-14 11:58:26 UTC

Description Scott Dodson 2020-03-31 12:18:16 UTC
+++ This bug was initially created as a clone of Bug #1816704 +++

+++ This bug was initially created as a clone of Bug #1813998 +++

+++ This bug was initially created as a clone of Bug #1813343 +++

Description of problem:

After upgrading from 4.3.1 to 4.3.5, my cloud-credential-operator pod is constantly crashlooping with a segfault of:

time="2020-03-13T14:23:36Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="found secret namespace" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry secret=openshift-image-registry/installer-cloud-credentials
time="2020-03-13T14:23:36Z" level=debug msg="running Exists" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="target secret exists" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="running sync" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="Loading infrastructure name: devint-6sjvm" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="syncing cluster operator status" controller=credreq_status
time="2020-03-13T14:23:36Z" level=debug msg="6 cred requests" controller=credreq_status
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message="No credentials requests reporting errors." reason=NoCredentialsFailing status=False type=Degraded
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message="6 of 6 credentials requests provisioned and reconciled." reason=ReconcilingComplete status=False type=Progressing
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message= reason= status=True type=Available
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message= reason= status=True type=Upgradeable
E0313 14:23:36.871042       1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:82
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/signal_unix.go:390
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/utils/utils.go:73
/go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:272
/go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:257
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/credentialsrequest/credentialsrequest_controller.go:452
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:213
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:163
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/asm_amd64.s:1337
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x118d149]

goroutine 1114 [running]:
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x17ae140, 0x2e10050)
	/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/cloud-credential-operator/pkg/controller/utils.LoadInfrastructureRegion(0x1d18020, 0xc0005c9500, 0x1d3d620, 0xc000714e40, 0xc0006f55e0, 0xc, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/utils/utils.go:73 +0xf9
github.com/openshift/cloud-credential-operator/pkg/aws/actuator.(*AWSActuator).sync(0xc000a3d770, 0x1d02620, 0xc0000ae078, 0xc001db9860, 0xc000714840, 0x3ddab2f20c70)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:272 +0x18b
github.com/openshift/cloud-credential-operator/pkg/aws/actuator.(*AWSActuator).Update(0xc000a3d770, 0x1d02620, 0xc0000ae078, 0xc001db9860, 0x4cd708701, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:257 +0x49
github.com/openshift/cloud-credential-operator/pkg/controller/credentialsrequest.(*ReconcileCredentialsRequest).Reconcile(0xc000a3d7d0, 0xc00003e840, 0x23, 0xc00073cc40, 0x18, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/credentialsrequest/credentialsrequest_controller.go:452 +0x26ea
github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller.(*Controller).processNextWorkItem(0xc0004643c0, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:213 +0x17d
github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller.(*Controller).Start.func1()
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:163 +0x36
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc001c7f6c0)
	/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001c7f6c0, 0x3b9aca00, 0x0, 0xc00003d401, 0xc00009a180)
	/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc001c7f6c0, 0x3b9aca00, 0xc00009a180)
	/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller.(*Controller).Start
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:162 +0x33e



Version-Release number of selected component (if applicable):

OCP 4.3.5

How reproducible:

It's segfaulting constantly. However, I suspect 


Steps to Reproduce:
1. Install a 4.1.x OCP cluster.
2. Upgrade via supported paths up to 4.3.5.

Actual results:

The cloud-credential-operator pod crashloops.


Expected results:

The cloud-credential-operator pod actually stays up.


Additional info:

There are multiple Alerts firing for this error. However, the cloud-credential CVO operator itself is not reporting a degraded status which is suspicious.

The code at https://github.com/openshift/cloud-credential-operator/blob/88c7f1b1dfe63bb61e874acffa37bb69ff334a7d/pkg/controller/utils/utils.go#L73 expects a PlatformStatus block to exist in the status field but mine does not have this. It looks like this PlatformStatus field was added sometime during 4.2 but not only for newly-created clusters.

Here's my Infrastructure object with no PlatformStatus field:

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2019-09-24T15:57:33Z"
  generation: 1
  name: cluster
  resourceVersion: "404"
  selfLink: /apis/config.openshift.io/v1/infrastructures/cluster
  uid: 057b479e-dee4-11e9-89b3-026c9616ccc0
spec:
  cloudConfig:
    name: ""
status:
  apiServerInternalURI: https://api-int.devint.openshiftknativedemo.org:6443
  apiServerURL: https://api.devint.openshiftknativedemo.org:6443
  etcdDiscoveryDomain: devint.openshiftknativedemo.org
  infrastructureName: devint-6sjvm
  platform: AWS

--- Additional comment from W. Trevor King on 2020-03-13 18:07:27 UTC ---

Tracking down the PlatformStatus existence assumption:

$ for Y in 2 3 4; do git --no-pager log -G PlatformStatus --oneline --decorate "origin/release-4.${Y}" -- pkg/controller/utils/utils.go; done
ed98561 (origin/pr/157) improve permissions simulation by adding region info
4c3a383 (origin/pr/155) improve permissions simulation by adding region info
79661cd (origin/pr/124) improve permissions simulation by adding region info
$ git log --all --decorate --grep '#124\|#155\|#157' | grep Bug
    [release-4.2] Bug 1803221: improve permissions simulation by adding region info
    [release-4.3] Bug 1757244: improve permissions simulation by adding region info
    Bug 1750338: improve permissions simulation by adding region info

so we'll need backport fixes for all of those.  Following up on the referenced bugs:

$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1803221' | grep /errata/ | head -n1
        <a href="https://access.redhat.com/errata/RHBA-2020:0614">RHBA-2020:0614</a>
$ curl -sH 'Accept:application/json' 'https://api.openshift.com/api/upgrades_info/v1/graph?channel=candidate-4.2' | jq -r '.nodes[] | select(.metadata.url == "https://access.redhat.com/errata/RHBA-2020:0614").version'
4.2.21
$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1757244' | grep /errata/ | head -n1
        <a href="https://access.redhat.com/errata/RHBA-2020:0528">RHBA-2020:0528</a>
4.3.3
$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1750338' | grep /errata/
...no hits, but presumably 4.4.0-rc.0 is impacted...

That means that 4.1 -> 4.2.20 would be fine, but 4.1 -> 4.2.20 -> 4.2.21 would hit this bug.  From an update-recommendation flow, we currently have a 4.1.21 -> 4.2.2 edge in candidate-4.2, and many more connections after that which could get us into trouble.  We probably don't want to drop 4.2->4.2 edges, which leaves us pruning 4.y-jumping edges.  But there are also not that many 4.1 clusters left.  We probably want to leave the graph alone for now, but once we get a fix out in 4.2 we should remove all 4.1 -> 4.2 edges to 4.2 older than the fixed version.  And similarly once we get a fixed 4.3, we should remove all 4.2 -> 4.3 edges older than the fixed version.

--- Additional comment from W. Trevor King on 2020-03-16 13:54:47 EDT ---

Based on the impacted edges above, the verification plan for the 4.4 fix is going to be:

1. Install a 4.1 cluster.  Looks like 4.1.34 is a good choice because we recommend 4.1.24 -> 4.2.20 in stable-4.2.
2. Update to 4.2.20 (avoiding 4.2.21+ having the broken assumption that PlatformStatus is populated).
3. Update to 4.3.2 (avoid 4.3.3+ having the broken assumption that PlatformStatus is populated).
   We don't have any recommended edges into 4.3.2, [1] lists the motivating bugs, but none of them should actually break the update.
4. Force an update to the 4.4 nightly with the candidate fix.
5. Ensure the cred operator is not panicking.

[1]: https://github.com/openshift/cincinnati-graph-data/blob/dc2a3c4cd879b0eeb7153c50f8706ad45166e8ac/blocked-edges/4.3.2.yaml#L1

--- Additional comment from W. Trevor King on 2020-03-16 15:25:41 EDT ---

Similar verification plan for the related registry-operator issue in [1].  If this fix lands first, and you attempt verification before the registry operator also lands a fix, you can ignore a panicking registry operator (or otherwise complaining, I'm not actually clear on how graciously it accepts the lack of install-config fallback) for the purpose of verifying this cred-operator bug.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1808425#c5

--- Additional comment from Scott Dodson on 2020-03-17 16:44:09 EDT ---

Comment 3 wang lin 2020-04-07 03:32:44 UTC
The upgrading process is :
4.1.24 -> 4.2.0-0.nightly-2020-04-06-113246

The bug has fixed.

Comment 5 errata-xmlrpc 2020-04-14 11:58:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1398


Note You need to log in before you can comment on or make changes to this bug.