Bug 1919778

Summary:

Upgrade is stuck in insights operator Degraded with "Source clusterconfig could not be retrieved" until insights operator pod is manually deleted

Product:

OpenShift Container Platform

Reporter:

Xingxing Xia <xxia>

Component:

Insights Operator

Assignee:

Marcell Sevcsik <msevcsik>

Status:

CLOSED ERRATA

QA Contact:

Pavel Šimovec <psimovec>

Severity:

medium

Priority:

unspecified

Version:

4.7

CC:

aos-bugs, brad.williams, fdeutsch, inecas, mklika, msevcsik, rvokal, sdodson, tremes, wking

Keywords:

Upgrades

Target Release:

4.7.0

Hardware:

Unspecified

OS:

Unspecified

Doc Type:

Bug Fix

Doc Text:

Cause: An error from any gather function caused the insights operator to go degraded. Consequence: After an upgrade, gather functions tended to fail because they could start executing before the gathered resource was ready, which produced an error. Fix: Introduce a retry using ExponentialBackOff so that a few gathers are attempted before the operator goes degraded. Result: During an upgrade, the insights operator no longer goes degraded.
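
The retry described in the Doc Text can be illustrated with a small sketch. This is not the code from the fix, and it is written in Python rather than the operator's own language; the names gather_with_backoff and degraded_after_gather, the step count, the initial delay, and the factor are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Optional


def gather_with_backoff(
    gather: Callable[[], None],
    steps: int = 4,              # assumed number of attempts
    initial_delay: float = 1.0,  # assumed first wait, in seconds
    factor: float = 2.0,         # assumed multiplier applied after each failure
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Exception]:
    """Run `gather` up to `steps` times with exponentially growing waits.

    Returns None on success, or the last error once every attempt has failed.
    """
    delay = initial_delay
    last_error: Optional[Exception] = None
    for attempt in range(steps):
        try:
            gather()
            return None
        except Exception as err:  # a real gatherer would catch narrower errors
            last_error = err
            if attempt < steps - 1:
                # Wait before the next attempt, then lengthen the wait.
                sleep(delay)
                delay *= factor
    return last_error


def degraded_after_gather(gather: Callable[[], None], **kwargs) -> bool:
    """Report degraded only if the gather still fails after all retries."""
    return gather_with_backoff(gather, **kwargs) is not None
```

With this shape, a gather that fails only while a pod is still in ContainerCreating gets a few more chances to succeed before the failure is surfaced as a Degraded condition.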

Last Closed:

2021-02-24 15:55:50 UTC

Type:

Bug

Attachments:

Description	Flags
[xxia@pres 2021-01-25 13:58:40 CST my]$ oc logs insights-operator-6b45495cb8-nzqcc -n openshift-insights	none
Pre-workaround log	none

Description Xingxing Xia 2021-01-25 06:39:10 UTC

Description of problem:
A 4.6 -> 4.7 upgrade gets stuck with the insights operator Degraded ("Source clusterconfig could not be retrieved") until the insights operator pod is manually deleted.

Version-Release number of selected component (if applicable):
4.6.12 to 4.7.0-0.nightly-2021-01-22-134922

How reproducible:
Not sure. I have hit it only once so far.

Steps to Reproduce:
1. Launch a 4.6.12 cluster (IPI_on_AWS_Multitenant)
2. Upgrade to 4.7.0-0.nightly-2021-01-22-134922
3. Watch the upgrade

Actual results:
3. oc get clusterversion shows "the cluster operator insights is degraded".

All nodes and pods are ready. In oc get co, all other COs show "4.7.0-0.nightly-2021-01-22-134922   True        False         False"; only insights shows "4.7.0-0.nightly-2021-01-22-134922   True   False   True".
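
The check for degraded cluster operators can also be scripted. A minimal sketch, assuming the input is the JSON printed by oc get co -o json (a list whose items carry metadata.name and status.conditions); the function name degraded_operators is hypothetical.

```python
import json
from typing import Dict


def degraded_operators(co_list_json: str) -> Dict[str, str]:
    """Map each operator with Degraded=True to its condition message."""
    degraded: Dict[str, str] = {}
    for item in json.loads(co_list_json).get("items", []):
        for cond in item.get("status", {}).get("conditions", []):
            # Condition status values are the strings "True"/"False", not booleans.
            if cond.get("type") == "Degraded" and cond.get("status") == "True":
                degraded[item["metadata"]["name"]] = cond.get("message", "")
    return degraded
```

Feeding it the saved output of oc get co -o json would return a single entry for insights in the situation described here, together with the message of its Degraded condition.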

More observations:
[xxia@pres 2021-01-25 13:24:44 CST my]$ oc get co
...
insights                                   4.7.0-0.nightly-2021-01-22-134922   True        False         True       100m
...

[xxia@pres 2021-01-25 13:32:35 CST my]$ oc describe co insights
...
  Conditions:
    Last Transition Time:  2021-01-25T05:23:59Z
    Message:               Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating
    Reason:                PeriodicGatherFailed
    Status:                True
    Type:                  Degraded
...

But the pod named in the message is actually healthy:
[xxia@pres 2021-01-25 13:32:43 CST my]$ oc get po -n openshift-apiserver-operator
NAME                                            READY   STATUS    RESTARTS   AGE
openshift-apiserver-operator-5d9f9f75bc-tgxff   1/1     Running   0          9m53s

[xxia@pres 2021-01-25 13:53:40 CST my]$ oc get co
...
insights                                   4.7.0-0.nightly-2021-01-22-134922   True        False         True       127m
...

[xxia@pres 2021-01-25 13:58:05 CST my]$ oc get po -n openshift-insights
NAME                                 READY   STATUS    RESTARTS   AGE
insights-operator-6b45495cb8-nzqcc   1/1     Running   0          34m
[xxia@pres 2021-01-25 13:58:40 CST my]$ oc logs insights-operator-6b45495cb8-nzqcc -n openshift-insights > logs/insights-operator-6b45495cb8-nzqcc.log # see attachment

Tried a workaround:
[xxia@pres 2021-01-25 14:03:56 CST my]$ oc delete po insights-operator-6b45495cb8-nzqcc -n openshift-insights
pod "insights-operator-6b45495cb8-nzqcc" deleted
[xxia@pres 2021-01-25 14:04:18 CST my]$ oc get po -n openshift-insights
NAME                                 READY   STATUS    RESTARTS   AGE
insights-operator-6b45495cb8-qqz9j   1/1     Running   0          55s

After that, "the cluster operator insights is degraded" is no longer reported, and the upgrade completes:
[xxia@pres 2021-01-25 14:05:01 CST my]$ ogcv
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-22-134922   True        False         26s     Cluster version is 4.7.0-0.nightly-2021-01-22-134922

[xxia@pres 2021-01-25 14:05:08 CST my]$ oc get co
...
insights                                   4.7.0-0.nightly-2021-01-22-134922   True        False         False      134m
...


Expected results:
3. The upgrade does not get stuck.

Additional info:

Comment 1 Xingxing Xia 2021-01-25 06:45:02 UTC

Created attachment 1750400 [details]
[xxia@pres 2021-01-25 13:58:40 CST my]$ oc logs insights-operator-6b45495cb8-nzqcc -n openshift-insights

Logs collected before the insights-operator-6b45495cb8-nzqcc pod was manually deleted:
$ grep "Source clusterconfig could not be retrieved" logs/insights-operator-6b45495cb8-nzqcc.log | wc -l
37

$ grep "Source clusterconfig could not be retrieved" logs/insights-operator-6b45495cb8-nzqcc.log | head -n 2
I0125 05:23:54.717915       1 controllerstatus.go:59] name=periodic-clusterconfig healthy=false reason=PeriodicGatherFailed message=Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating
I0125 05:23:59.348688       1 status.go:235] The operator has some internal errors: Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating

$ grep "Source clusterconfig could not be retrieved" logs/insights-operator-6b45495cb8-nzqcc.log | tail -n 2
I0125 05:57:44.119853       1 status.go:235] The operator has some internal errors: Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating
I0125 05:57:44.119877       1 status.go:287] The operator has some internal errors: Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating
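
The grep commands above give the count and the first and last occurrences; the time between them can be computed as well. A minimal sketch, assuming the klog-style line prefix seen in these logs (severity letter, month and day, then hh:mm:ss.microseconds) and assuming all lines fall on the same day; the name failure_window is hypothetical.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

# Matches prefixes such as "I0125 05:23:54.717915" and captures the time of day.
KLOG_TIME = re.compile(r"^[IWEF]\d{4} (\d{2}:\d{2}:\d{2}\.\d{6})")


def failure_window(lines: Iterable[str], needle: str) -> Tuple[int, Optional[timedelta]]:
    """Count lines containing `needle` and return the time from first to last match.

    Only the time of day is parsed, so a log that crosses midnight is not handled.
    """
    count = 0
    times = []
    for line in lines:
        if needle not in line:
            continue
        count += 1
        match = KLOG_TIME.match(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%H:%M:%S.%f"))
    if not times:
        return count, None
    return count, times[-1] - times[0]
```

For the first and last lines quoted above (05:23:54.717915 and 05:57:44.119877), the window is about 34 minutes, during which the same ContainerCreating message kept being reported.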

Comment 2 Tomas Remes 2021-01-25 08:17:40 UTC

Xingxing, thanks for the report. I think this is a good catch, and it should be reproducible whenever any of the gatherers in IO (Insights Operator) fails.

Comment 3 brad.williams 2021-01-26 21:19:09 UTC

Created attachment 1751017 [details]
Pre-workaround log

We ran into this same problem while upgrading from 4.7.0-fc.3 to 4.7.0-fc.4 this afternoon. Attached is the log file (insights-operator-58b4dbdd6f-rv5j4.log) from the pod, captured before performing the workaround.

Comment 4 Marcell Sevcsik 2021-01-27 08:37:06 UTC

The PR that should fix it: https://github.com/openshift/insights-operator/pull/320

Comment 6 Xingxing Xia 2021-01-28 14:22:30 UTC

Retested following the steps in comment 0, upgrading from 4.6.12 to 4.7.0-0.nightly-2021-01-28-102244, and did not hit it again (bug 1920027 is still hit, though):
$ oc get co insights
insights   4.7.0-0.nightly-2021-01-28-102244   True        False         False      167m

Comment 9 errata-xmlrpc 2021-02-24 15:55:50 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633