Bug 1816704 - Cloud Credential Operator pod crashlooping with golang segfault
Summary: Cloud Credential Operator pod crashlooping with golang segfault
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Credential Operator
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.z
Assignee: Joel Diaz
QA Contact: wang lin
URL:
Whiteboard:
Depends On: 1813998
Blocks: 1819183
 
Reported: 2020-03-24 15:12 UTC by Scott Dodson
Modified: 2020-04-14 16:19 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1813998
Clones: 1819183
Environment:
Last Closed: 2020-04-14 16:18:53 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cloud-credential-operator pull 169 0 None closed Bug 1816704: handle old Infrastructure objects without PlatformStatus 2020-04-22 09:08:48 UTC
Red Hat Product Errata RHBA-2020:1393 0 None None None 2020-04-14 16:19:07 UTC

Description Scott Dodson 2020-03-24 15:12:58 UTC
+++ This bug was initially created as a clone of Bug #1813998 +++

+++ This bug was initially created as a clone of Bug #1813343 +++

Description of problem:

After upgrading from 4.3.1 to 4.3.5, my cloud-credential-operator pod is constantly crashlooping with a segfault of:

time="2020-03-13T14:23:36Z" level=info msg="syncing credentials request" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="found secret namespace" controller=credreq cr=openshift-cloud-credential-operator/openshift-image-registry secret=openshift-image-registry/installer-cloud-credentials
time="2020-03-13T14:23:36Z" level=debug msg="running Exists" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="target secret exists" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="running sync" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="Loading infrastructure name: devint-6sjvm" actuator=aws cr=openshift-cloud-credential-operator/openshift-image-registry
time="2020-03-13T14:23:36Z" level=debug msg="syncing cluster operator status" controller=credreq_status
time="2020-03-13T14:23:36Z" level=debug msg="6 cred requests" controller=credreq_status
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message="No credentials requests reporting errors." reason=NoCredentialsFailing status=False type=Degraded
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message="6 of 6 credentials requests provisioned and reconciled." reason=ReconcilingComplete status=False type=Progressing
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message= reason= status=True type=Available
time="2020-03-13T14:23:36Z" level=debug msg="set ClusterOperator condition" controller=credreq_status message= reason= status=True type=Upgradeable
E0313 14:23:36.871042       1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:82
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/signal_unix.go:390
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/utils/utils.go:73
/go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:272
/go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:257
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/credentialsrequest/credentialsrequest_controller.go:452
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:213
/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:163
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153
/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/asm_amd64.s:1337
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x118d149]

goroutine 1114 [running]:
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x17ae140, 0x2e10050)
	/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/cloud-credential-operator/pkg/controller/utils.LoadInfrastructureRegion(0x1d18020, 0xc0005c9500, 0x1d3d620, 0xc000714e40, 0xc0006f55e0, 0xc, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/utils/utils.go:73 +0xf9
github.com/openshift/cloud-credential-operator/pkg/aws/actuator.(*AWSActuator).sync(0xc000a3d770, 0x1d02620, 0xc0000ae078, 0xc001db9860, 0xc000714840, 0x3ddab2f20c70)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:272 +0x18b
github.com/openshift/cloud-credential-operator/pkg/aws/actuator.(*AWSActuator).Update(0xc000a3d770, 0x1d02620, 0xc0000ae078, 0xc001db9860, 0x4cd708701, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/aws/actuator/actuator.go:257 +0x49
github.com/openshift/cloud-credential-operator/pkg/controller/credentialsrequest.(*ReconcileCredentialsRequest).Reconcile(0xc000a3d7d0, 0xc00003e840, 0x23, 0xc00073cc40, 0x18, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/credentialsrequest/credentialsrequest_controller.go:452 +0x26ea
github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller.(*Controller).processNextWorkItem(0xc0004643c0, 0x0)
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:213 +0x17d
github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller.(*Controller).Start.func1()
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:163 +0x36
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc001c7f6c0)
	/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x54
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001c7f6c0, 0x3b9aca00, 0x0, 0xc00003d401, 0xc00009a180)
	/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc001c7f6c0, 0x3b9aca00, 0xc00009a180)
	/go/src/github.com/openshift/cloud-credential-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller.(*Controller).Start
	/go/src/github.com/openshift/cloud-credential-operator/pkg/controller/internalcontroller/controller.go:162 +0x33e



Version-Release number of selected component (if applicable):

OCP 4.3.5

How reproducible:

It's segfaulting constantly. However, I suspect it only affects clusters whose Infrastructure object predates the PlatformStatus field (see Additional info below).


Steps to Reproduce:
1. Install a 4.1.x OCP cluster.
2. Upgrade via supported paths up to 4.3.5.

Actual results:

The cloud-credential-operator pod crashloops.


Expected results:

The cloud-credential-operator pod actually stays up.


Additional info:

There are multiple alerts firing for this error. However, the cloud-credential ClusterOperator itself is not reporting a Degraded status, which is suspicious.

The code at https://github.com/openshift/cloud-credential-operator/blob/88c7f1b1dfe63bb61e874acffa37bb69ff334a7d/pkg/controller/utils/utils.go#L73 expects a PlatformStatus block to exist in the status field, but mine does not have one. It looks like this PlatformStatus field was added sometime during 4.2, but it is only populated for newly-created clusters (a quick way to check for it on any cluster is sketched after the object below).

Here's my Infrastructure object with no PlatformStatus field:

apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2019-09-24T15:57:33Z"
  generation: 1
  name: cluster
  resourceVersion: "404"
  selfLink: /apis/config.openshift.io/v1/infrastructures/cluster
  uid: 057b479e-dee4-11e9-89b3-026c9616ccc0
spec:
  cloudConfig:
    name: ""
status:
  apiServerInternalURI: https://api-int.devint.openshiftknativedemo.org:6443
  apiServerURL: https://api.devint.openshiftknativedemo.org:6443
  etcdDiscoveryDomain: devint.openshiftknativedemo.org
  infrastructureName: devint-6sjvm
  platform: AWS
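
For anyone checking whether their own cluster is exposed, here is a minimal sketch of a check; the jsonpath query and the interpretation of its output are mine, not something from the operator's docs:

$ oc get infrastructure cluster -o jsonpath='{.status.platformStatus}'
# Empty output means the Infrastructure object predates PlatformStatus (the
# situation shown above), so the lookup at utils.go:73 dereferences nil.
# Non-empty output (platform type plus, on AWS, the region) means the
# operator's region lookup can succeed.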

--- Additional comment from W. Trevor King on 2020-03-13 18:07:27 UTC ---

Tracking down the PlatformStatus existence assumption:

$ for Y in 2 3 4; do git --no-pager log -G PlatformStatus --oneline --decorate "origin/release-4.${Y}" -- pkg/controller/utils/utils.go; done
ed98561 (origin/pr/157) improve permissions simulation by adding region info
4c3a383 (origin/pr/155) improve permissions simulation by adding region info
79661cd (origin/pr/124) improve permissions simulation by adding region info
$ git log --all --decorate --grep '#124\|#155\|#157' | grep Bug
    [release-4.2] Bug 1803221: improve permissions simulation by adding region info
    [release-4.3] Bug 1757244: improve permissions simulation by adding region info
    Bug 1750338: improve permissions simulation by adding region info

so we'll need backport fixes for all of those.  Following up on the referenced bugs:

$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1803221' | grep /errata/ | head -n1
        <a href="https://access.redhat.com/errata/RHBA-2020:0614">RHBA-2020:0614</a>
$ curl -sH 'Accept:application/json' 'https://api.openshift.com/api/upgrades_info/v1/graph?channel=candidate-4.2' | jq -r '.nodes[] | select(.metadata.url == "https://access.redhat.com/errata/RHBA-2020:0614").version'
4.2.21
$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1757244' | grep /errata/ | head -n1
        <a href="https://access.redhat.com/errata/RHBA-2020:0528">RHBA-2020:0528</a>
4.3.3
$ curl -s 'https://bugzilla.redhat.com/show_bug.cgi?id=1750338' | grep /errata/
...no hits, but presumably 4.4.0-rc.0 is impacted...

That means that 4.1 -> 4.2.20 would be fine, but 4.1 -> 4.2.20 -> 4.2.21 would hit this bug.  From an update-recommendation flow, we currently have a 4.1.21 -> 4.2.2 edge in candidate-4.2, and many more connections after that which could get us into trouble.  We probably don't want to drop 4.2->4.2 edges, which leaves us pruning 4.y-jumping edges.  But there are also not that many 4.1 clusters left.  We probably want to leave the graph alone for now, but once we get a fix out in 4.2 we should remove all 4.1 -> 4.2 edges to 4.2 older than the fixed version.  And similarly once we get a fixed 4.3, we should remove all 4.2 -> 4.3 edges older than the fixed version.

--- Additional comment from W. Trevor King on 2020-03-16 13:54:47 EDT ---

Based on the impacted edges above, the verification plan for the 4.4 fix is going to be:

1. Install a 4.1 cluster.  Looks like 4.1.34 is a good choice because we recommend 4.1.24 -> 4.2.20 in stable-4.2.
2. Update to 4.2.20 (avoiding 4.2.21+, which carries the broken assumption that PlatformStatus is populated).
3. Update to 4.3.2 (avoiding 4.3.3+, which carries the same broken assumption).
   We don't have any recommended edges into 4.3.2; [1] lists the motivating bugs, but none of them should actually break the update.
4. Force an update to the 4.4 nightly with the candidate fix.
5. Ensure the cred operator is not panicking (a sketch of a quick check follows the footnote below).

[1]: https://github.com/openshift/cincinnati-graph-data/blob/dc2a3c4cd879b0eeb7153c50f8706ad45166e8ac/blocked-edges/4.3.2.yaml#L1
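
For step 5, one quick check (a sketch; it assumes the deployment in the openshift-cloud-credential-operator namespace is named cloud-credential-operator, matching the log lines in the description):

$ oc -n openshift-cloud-credential-operator get pods
# the cloud-credential-operator pod should be Running, not CrashLoopBackOff
$ oc -n openshift-cloud-credential-operator logs deployment/cloud-credential-operator | grep -i 'panic\|SIGSEGV'
# expect no matches once the fixed build is running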

--- Additional comment from W. Trevor King on 2020-03-16 15:25:41 EDT ---

Similar verification plan for the related registry-operator issue in [1]. If this fix lands first and you attempt verification before the registry operator also lands a fix, you can ignore a panicking registry operator (or one otherwise complaining; I'm not actually clear on how gracefully it handles the lack of install-config fallback) for the purpose of verifying this cred-operator bug.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1808425#c5

--- Additional comment from Scott Dodson on 2020-03-17 16:44:09 EDT ---

Comment 2 Scott Dodson 2020-04-02 23:44:05 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug; it will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
  Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression?
  No, it's always been like this; we just never noticed
  Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1

Comment 7 Joel Diaz 2020-04-03 13:10:48 UTC
(In reply to Scott Dodson from comment #2)
> Who is impacted?
Clusters originally installed with 4.1 and upgraded to affected versions.

> What is the impact?
The cloud-credential-operator is unable to process any CredentialsRequests. If the permissions requested in a CredentialsRequest changed along with the introduction of the bug, that CredentialsRequest could not be processed.
Alerts would also potentially fire since the CCO is unhealthy.

> How involved is remediation?
Manual patching of the Infrastructure CR to add the new/updated status fields (a rough sketch of what that patch might look like is at the end of this comment).
> Is this a regression?
Yes, since the backport of the enhanced permissions simulation https://github.com/openshift/cloud-credential-operator/pull/157
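
For reference, a rough sketch of such a patch (hedged: the region value is a placeholder, and because status is a subresource, the --subresource flag needs a recent oc/kubectl client; older clients would have to update the status through the API directly):

$ oc patch infrastructure cluster --type=merge --subresource=status \
    -p '{"status":{"platformStatus":{"type":"AWS","aws":{"region":"us-east-1"}}}}'
# us-east-1 is a placeholder; use the cluster's actual region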

Comment 8 wang lin 2020-04-07 04:33:24 UTC
The upgrade path was:
4.1.24 -> 4.2.20 -> 4.3.0-0.nightly-2020-04-06-093556

The bug has been fixed.

Comment 10 errata-xmlrpc 2020-04-14 16:18:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1393

