Bug 1919778

Summary: Upgrade is stuck in insights operator Degraded with "Source clusterconfig could not be retrieved" until insights operator pod is manually deleted
Product: OpenShift Container Platform
Reporter: Xingxing Xia <xxia>
Component: Insights Operator
Assignee: Marcell Sevcsik <msevcsik>
Status: CLOSED ERRATA
QA Contact: Pavel Šimovec <psimovec>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.7
CC: aos-bugs, brad.williams, fdeutsch, inecas, mklika, msevcsik, rvokal, sdodson, tremes, wking
Target Milestone: ---
Keywords: Upgrades
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: An error in any gather function caused the insights operator to go Degraded.
Consequence: After an upgrade, gather functions tend to fail because they can start executing before the gathered resource is ready, which causes an error.
Fix: Introduce a retry using ExponentialBackOff so that a gather is attempted a few times before the operator goes Degraded.
Result: During an upgrade, the insights operator no longer goes Degraded.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-24 15:55:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description: [xxia@pres 2021-01-25 13:58:40 CST my]$ oc logs insights-operator-6b45495cb8-nzqcc -n openshift-insights; Flags: none
Description: Pre-workaround log; Flags: none

Description Xingxing Xia 2021-01-25 06:39:10 UTC
Description of problem:
4.6 -> 4.7 upgrade is stuck in insights operator Degraded with "Source clusterconfig could not be retrieved" until insights operator pod is manually deleted

Version-Release number of selected component (if applicable):
4.6.12 to 4.7.0-0.nightly-2021-01-22-134922

How reproducible:
Not sure. I have hit it once so far.

Steps to Reproduce:
1. Launch a 4.6.12 cluster (IPI_on_AWS_Multitenant)
2. Upgrade to 4.7.0-0.nightly-2021-01-22-134922
3. Watch the upgrade

Actual results:
3. Watching oc get clusterversion shows "the cluster operator insights is degraded".

All nodes and pods are ready. In oc get co, every other CO shows "4.7.0-0.nightly-2021-01-22-134922   True        False         False"; only insights shows "4.7.0-0.nightly-2021-01-22-134922   True   False   True".

More observations:
[xxia@pres 2021-01-25 13:24:44 CST my]$ oc get co
...
insights                                   4.7.0-0.nightly-2021-01-22-134922   True        False         True       100m
...

[xxia@pres 2021-01-25 13:32:35 CST my]$ oc describe co insights
...
  Conditions:
    Last Transition Time:  2021-01-25T05:23:59Z
    Message:               Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating
    Reason:                PeriodicGatherFailed
    Status:                True
    Type:                  Degraded
...

But the pod named in the message is actually running and ready:
[xxia@pres 2021-01-25 13:32:43 CST my]$ oc get po -n openshift-apiserver-operator
NAME                                            READY   STATUS    RESTARTS   AGE
openshift-apiserver-operator-5d9f9f75bc-tgxff   1/1     Running   0          9m53s

[xxia@pres 2021-01-25 13:53:40 CST my]$ oc get co
...
insights                                   4.7.0-0.nightly-2021-01-22-134922   True        False         True       127m
...

[xxia@pres 2021-01-25 13:58:05 CST my]$ oc get po -n openshift-insights
NAME                                 READY   STATUS    RESTARTS   AGE
insights-operator-6b45495cb8-nzqcc   1/1     Running   0          34m
[xxia@pres 2021-01-25 13:58:40 CST my]$ oc logs insights-operator-6b45495cb8-nzqcc -n openshift-insights > logs/insights-operator-6b45495cb8-nzqcc.log # see attachment

Try workaround:
[xxia@pres 2021-01-25 14:03:56 CST my]$ oc delete po insights-operator-6b45495cb8-nzqcc -n openshift-insights
pod "insights-operator-6b45495cb8-nzqcc" deleted
[xxia@pres 2021-01-25 14:04:18 CST my]$ oc get po -n openshift-insights
NAME                                 READY   STATUS    RESTARTS   AGE
insights-operator-6b45495cb8-qqz9j   1/1     Running   0          55s

After that, "the cluster operator insights is degraded" is no longer reported and the upgrade completes:
[xxia@pres 2021-01-25 14:05:01 CST my]$ ogcv
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-22-134922   True        False         26s     Cluster version is 4.7.0-0.nightly-2021-01-22-134922

[xxia@pres 2021-01-25 14:05:08 CST my]$ oc get co
...
insights                                   4.7.0-0.nightly-2021-01-22-134922   True        False         False      134m
...


Expected results:
3. The upgrade does not get stuck.

Additional info:

Comment 1 Xingxing Xia 2021-01-25 06:45:02 UTC
Created attachment 1750400 [details]
[xxia@pres 2021-01-25 13:58:40 CST my]$ oc logs insights-operator-6b45495cb8-nzqcc -n openshift-insights

The logs were collected before the insights-operator-6b45495cb8-nzqcc pod was manually deleted:
$ grep "Source clusterconfig could not be retrieved" logs/insights-operator-6b45495cb8-nzqcc.log | wc -l
37

$ grep "Source clusterconfig could not be retrieved" logs/insights-operator-6b45495cb8-nzqcc.log | head -n 2
I0125 05:23:54.717915       1 controllerstatus.go:59] name=periodic-clusterconfig healthy=false reason=PeriodicGatherFailed message=Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating
I0125 05:23:59.348688       1 status.go:235] The operator has some internal errors: Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating

$ grep "Source clusterconfig could not be retrieved" logs/insights-operator-6b45495cb8-nzqcc.log | tail -n 2
I0125 05:57:44.119853       1 status.go:235] The operator has some internal errors: Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating
I0125 05:57:44.119877       1 status.go:287] The operator has some internal errors: Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating
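
The grep output above shows the same message from 05:23:54 through 05:57:44, long after the referenced pod was Running. For anyone wanting to check another log the same way, here is a rough Go sketch that counts the matching lines and reports the time between the first and last match. It assumes klog-style lines where the second whitespace-separated field is the time of day and that all matches fall on the same day; `span` and `needle` are names made up for this sketch, not anything from the operator.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// needle is the message quoted in this bug report.
const needle = "Source clusterconfig could not be retrieved"

// span counts the lines containing needle and returns the time elapsed
// between the first and the last match. It assumes the second field of
// each line is a time of day such as 05:23:54.717915 and that the log
// does not cross midnight.
func span(lines []string) (int, time.Duration, error) {
	var count int
	var first, last time.Time
	for _, line := range lines {
		if !strings.Contains(line, needle) {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		ts, err := time.Parse("15:04:05.000000", fields[1])
		if err != nil {
			return 0, 0, fmt.Errorf("parse %q: %w", fields[1], err)
		}
		if count == 0 {
			first = ts
		}
		last = ts
		count++
	}
	return count, last.Sub(first), nil
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	// Allow long log lines (up to 1 MiB).
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	var lines []string
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	n, d, err := span(lines)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("matches=%d span=%s\n", n, d)
}
```

The program reads the log on standard input, so it can be fed the file saved with oc logs.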

Comment 2 Tomas Remes 2021-01-25 08:17:40 UTC
Xingxing, thanks for the report. I think this is a good catch, and it should be reproducible whenever any of the gatherers in IO fails.

Comment 3 brad.williams 2021-01-26 21:19:09 UTC
Created attachment 1751017 [details]
Pre-workaround log

We ran into this same problem while upgrading from 4.7.0-fc.3 to 4.7.0-fc.4 this afternoon. The attached log file (insights-operator-58b4dbdd6f-rv5j4.log) is from the pod before the workaround was performed.

Comment 4 Marcell Sevcsik 2021-01-27 08:37:06 UTC
The PR that should fix it: https://github.com/openshift/insights-operator/pull/320
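
For readers who do not want to open the PR: the Doc Text on this bug describes the fix as retrying a gather with exponential backoff before reporting Degraded. The following is only a minimal standard-library sketch of that idea, not the code from the PR. The `backoff` type, `retryGather`, `flakySource`, and the chosen delays are all hypothetical names and values picked for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// backoff is a hypothetical retry schedule; the field names are illustrative only.
type backoff struct {
	initial time.Duration // wait before the second attempt
	factor  float64       // multiplier applied to the wait after each failure
	steps   int           // maximum number of attempts
}

// retryGather calls gather until it succeeds or b.steps attempts are used up.
// It returns the number of attempts made and, on failure, the last error wrapped.
func retryGather(b backoff, gather func() error) (int, error) {
	if b.steps < 1 {
		b.steps = 1
	}
	delay := b.initial
	var err error
	for attempt := 1; attempt <= b.steps; attempt++ {
		if err = gather(); err == nil {
			return attempt, nil
		}
		if attempt < b.steps {
			// Wait, then grow the delay for the next failure.
			time.Sleep(delay)
			delay = time.Duration(float64(delay) * b.factor)
		}
	}
	return b.steps, fmt.Errorf("gather failed after %d attempts: %w", b.steps, err)
}

// flakySource returns a stand-in gather function that fails its first n calls,
// mimicking a resource that is not ready yet right after an upgrade.
func flakySource(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New("container is waiting to start: ContainerCreating")
		}
		return nil
	}
}

func main() {
	b := backoff{initial: 100 * time.Millisecond, factor: 2, steps: 5}
	attempts, err := retryGather(b, flakySource(2))
	fmt.Printf("attempts=%d err=%v\n", attempts, err)
}
```

With a schedule like this, only an error that survives every attempt would be surfaced to the status controller, so a pod that is briefly in ContainerCreating would not by itself flip the operator to Degraded.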

Comment 6 Xingxing Xia 2021-01-28 14:22:30 UTC
Retested following the steps in comment 0, upgrading from 4.6.12 to 4.7.0-0.nightly-2021-01-28-102244, and did not hit it again (bug 1920027 still reproduces, though):
$ oc get co insights
insights   4.7.0-0.nightly-2021-01-28-102244   True        False         False      167m

Comment 9 errata-xmlrpc 2021-02-24 15:55:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633