1919778 – Upgrade is stuck in insights operator Degraded with "Source clusterconfig could not be retrieved" until insights operator pod is manually deleted

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Insights Operator
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Marcell Sevcsik
QA Contact:	Pavel Šimovec
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-01-25 06:39 UTC by Xingxing Xia
Modified:	2021-02-24 15:56 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: An error from any gather function caused the insights operator to go degraded. Consequence: After upgrades, gather functions tend to fail because they might start executing before the gathered resource is ready, which causes an error. Fix: Introduce a retry using ExponentialBackOff so that a few gathers are attempted before going degraded. Result: During an upgrade, the insights operator doesn't go degraded.
Clone Of:
Environment:
Last Closed:	2021-02-24 15:55:50 UTC
Target Upstream Version:
Embargoed:

Attachments
[xxia@pres 2021-01-25 13:58:40 CST my]$ oc logs insights-operator-6b45495cb8-nzqcc -n openshift-insights (245.88 KB, text/plain) 2021-01-25 06:45 UTC, Xingxing Xia
Pre-workaround log (281.21 KB, text/plain) 2021-01-26 21:19 UTC, brad.williams

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift insights-operator pull 320	0	None	closed	Bug 1919778: Monitors how many gatherings failed in a row, and applies degraded status accordingly	2021-02-08 12:26:21 UTC
Red Hat Product Errata	RHSA-2020:5633	0	None	None	None	2021-02-24 15:56:05 UTC

Description Xingxing Xia 2021-01-25 06:39:10 UTC

Description of problem:
A 4.6 -> 4.7 upgrade is stuck with the insights operator Degraded ("Source clusterconfig could not be retrieved") until the insights operator pod is manually deleted.

Version-Release number of selected component (if applicable):
4.6.12 to 4.7.0-0.nightly-2021-01-22-134922

How reproducible:
Not sure. I have hit it once so far.

Steps to Reproduce:
1. Launch a 4.6.12 cluster (IPI_on_AWS_Multitenant)
2. Upgrade to 4.7.0-0.nightly-2021-01-22-134922
3. Watch the upgrade

Actual results:
3. Watching oc get clusterversion shows "the cluster operator insights is degraded".

All nodes and pods are ready. In oc get co, all other COs show "4.7.0-0.nightly-2021-01-22-134922   True        False         False", except insights, which shows "4.7.0-0.nightly-2021-01-22-134922   True   False   True".

More observations:
[xxia@pres 2021-01-25 13:24:44 CST my]$ oc get co
...
insights                                   4.7.0-0.nightly-2021-01-22-134922   True        False         True       100m
...

[xxia@pres 2021-01-25 13:32:35 CST my]$ oc describe co insights
...
  Conditions:
    Last Transition Time:  2021-01-25T05:23:59Z
    Message:               Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating
    Reason:                PeriodicGatherFailed
    Status:                True
    Type:                  Degraded
...

But the referenced pod is actually running and ready:
[xxia@pres 2021-01-25 13:32:43 CST my]$ oc get po -n openshift-apiserver-operator
NAME                                            READY   STATUS    RESTARTS   AGE
openshift-apiserver-operator-5d9f9f75bc-tgxff   1/1     Running   0          9m53s

[xxia@pres 2021-01-25 13:53:40 CST my]$ oc get co
...
insights                                   4.7.0-0.nightly-2021-01-22-134922   True        False         True       127m
...

[xxia@pres 2021-01-25 13:58:05 CST my]$ oc get po -n openshift-insights
NAME                                 READY   STATUS    RESTARTS   AGE
insights-operator-6b45495cb8-nzqcc   1/1     Running   0          34m
[xxia@pres 2021-01-25 13:58:40 CST my]$ oc logs insights-operator-6b45495cb8-nzqcc -n openshift-insights > logs/insights-operator-6b45495cb8-nzqcc.log # see attachment

Trying the workaround:
[xxia@pres 2021-01-25 14:03:56 CST my]$ oc delete po insights-operator-6b45495cb8-nzqcc -n openshift-insights
pod "insights-operator-6b45495cb8-nzqcc" deleted
[xxia@pres 2021-01-25 14:04:18 CST my]$ oc get po -n openshift-insights
NAME                                 READY   STATUS    RESTARTS   AGE
insights-operator-6b45495cb8-qqz9j   1/1     Running   0          55s

After that, "the cluster operator insights is degraded" is no longer reported and the upgrade completes:
[xxia@pres 2021-01-25 14:05:01 CST my]$ ogcv
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-22-134922   True        False         26s     Cluster version is 4.7.0-0.nightly-2021-01-22-134922

[xxia@pres 2021-01-25 14:05:08 CST my]$ oc get co
...
insights                                   4.7.0-0.nightly-2021-01-22-134922   True        False         False      134m
...


Expected results:
3. The upgrade does not get stuck.

Additional info:

Comment 1 Xingxing Xia 2021-01-25 06:45:02 UTC

Created attachment 1750400 [details]
[xxia@pres 2021-01-25 13:58:40 CST my]$ oc logs insights-operator-6b45495cb8-nzqcc -n openshift-insights

The logs were collected before the insights-operator-6b45495cb8-nzqcc pod was manually deleted:
$ grep "Source clusterconfig could not be retrieved" logs/insights-operator-6b45495cb8-nzqcc.log | wc -l
37

$ grep "Source clusterconfig could not be retrieved" logs/insights-operator-6b45495cb8-nzqcc.log | head -n 2
I0125 05:23:54.717915       1 controllerstatus.go:59] name=periodic-clusterconfig healthy=false reason=PeriodicGatherFailed message=Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating
I0125 05:23:59.348688       1 status.go:235] The operator has some internal errors: Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating

$ grep "Source clusterconfig could not be retrieved" logs/insights-operator-6b45495cb8-nzqcc.log | tail -n 2
I0125 05:57:44.119853       1 status.go:235] The operator has some internal errors: Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating
I0125 05:57:44.119877       1 status.go:287] The operator has some internal errors: Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating

Comment 2 Tomas Remes 2021-01-25 08:17:40 UTC

Xingxing, thanks for the report. I think this is a good catch, and it should be reproducible whenever any of the gatherers in the IO (insights operator) fails.

Comment 3 brad.williams 2021-01-26 21:19:09 UTC

Created attachment 1751017 [details]
Pre-workaround log

We ran into this same problem while upgrading from 4.7.0-fc.3 to 4.7.0-fc.4 this afternoon. Attached is the log file (insights-operator-58b4dbdd6f-rv5j4.log) from the pod, taken prior to performing the workaround.

Comment 4 Marcell Sevcsik 2021-01-27 08:37:06 UTC

The PR that should fix it: https://github.com/openshift/insights-operator/pull/320

Comment 6 Xingxing Xia 2021-01-28 14:22:30 UTC

Retested following the steps in comment 0, upgrading from 4.6.12 to 4.7.0-0.nightly-2021-01-28-102244, and did not hit the issue again (bug 1920027 is still hit, though):
$ oc get co insights
insights   4.7.0-0.nightly-2021-01-28-102244   True        False         False      167m

Comment 9 errata-xmlrpc 2021-02-24 15:55:50 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633