Description of problem:
4.6 -> 4.7 upgrade is stuck with the insights operator Degraded with "Source clusterconfig could not be retrieved" until the insights operator pod is manually deleted.

Version-Release number of selected component (if applicable):
4.6.12 to 4.7.0-0.nightly-2021-01-22-134922

How reproducible:
Not sure. I have hit it once so far.

Steps to Reproduce:
1. Launch a 4.6.12 cluster (IPI_on_AWS_Multitenant)
2. Upgrade to 4.7.0-0.nightly-2021-01-22-134922
3. Watch the upgrade

Actual results:
3. oc get clusterversion shows "the cluster operator insights is degraded". All nodes and pods are ready. In oc get co, all other COs are "4.7.0-0.nightly-2021-01-22-134922 True False False", except insights, which is "4.7.0-0.nightly-2021-01-22-134922 True False True".

More observations:
[xxia@pres 2021-01-25 13:24:44 CST my]$ oc get co
...
insights   4.7.0-0.nightly-2021-01-22-134922   True   False   True   100m
...

[xxia@pres 2021-01-25 13:32:35 CST my]$ oc describe co insights
...
Conditions:
  Last Transition Time: 2021-01-25T05:23:59Z
  Message: Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating
  Reason: PeriodicGatherFailed
  Status: True
  Type: Degraded
...

But the referenced pod is actually in good status:
[xxia@pres 2021-01-25 13:32:43 CST my]$ oc get po -n openshift-apiserver-operator
NAME                                            READY   STATUS    RESTARTS   AGE
openshift-apiserver-operator-5d9f9f75bc-tgxff   1/1     Running   0          9m53s

[xxia@pres 2021-01-25 13:53:40 CST my]$ oc get co
...
insights   4.7.0-0.nightly-2021-01-22-134922   True   False   True   127m
...
[xxia@pres 2021-01-25 13:58:05 CST my]$ oc get po -n openshift-insights
NAME                                 READY   STATUS    RESTARTS   AGE
insights-operator-6b45495cb8-nzqcc   1/1     Running   0          34m

[xxia@pres 2021-01-25 13:58:40 CST my]$ oc logs insights-operator-6b45495cb8-nzqcc -n openshift-insights > logs/insights-operator-6b45495cb8-nzqcc.log # see attachment

Try workaround:
[xxia@pres 2021-01-25 14:03:56 CST my]$ oc delete po insights-operator-6b45495cb8-nzqcc -n openshift-insights
pod "insights-operator-6b45495cb8-nzqcc" deleted

[xxia@pres 2021-01-25 14:04:18 CST my]$ oc get po -n openshift-insights
NAME                                 READY   STATUS    RESTARTS   AGE
insights-operator-6b45495cb8-qqz9j   1/1     Running   0          55s

After that, "the cluster operator insights is degraded" is no longer reported and the upgrade completes:
[xxia@pres 2021-01-25 14:05:01 CST my]$ ogcv
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-22-134922   True        False         26s     Cluster version is 4.7.0-0.nightly-2021-01-22-134922

[xxia@pres 2021-01-25 14:05:08 CST my]$ oc get co
...
insights   4.7.0-0.nightly-2021-01-22-134922   True   False   False   134m
...

Expected results:
3. The upgrade does not get stuck.

Additional info:
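The manual workaround above can be scripted. The following is only a minimal sketch, not an official procedure. It assumes that `oc` is logged in to the affected cluster and that the operator pod name starts with `insights-operator-`, as in the output above. The function names `insights_operator_pods` and `restart_insights_operator` are hypothetical.

```shell
#!/bin/sh
# Sketch of the manual workaround from this report.
# Assumption: operator pods in openshift-insights are named insights-operator-*.

# Print the operator pods in "pod/<name>" form, one per line.
insights_operator_pods() {
    oc get po -n openshift-insights -o name | grep '^pod/insights-operator-'
}

# Delete each operator pod so that the deployment recreates it.
# In this report, the recreated pod no longer reported Degraded.
restart_insights_operator() {
    pods=$(insights_operator_pods) || {
        echo "no insights-operator pod found" >&2
        return 1
    }
    for pod in $pods; do
        oc delete -n openshift-insights "$pod"
    done
}
```

Calling `restart_insights_operator` is equivalent to the `oc delete po ... -n openshift-insights` step above, without having to look up the generated pod name first.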
Created attachment 1750400
[xxia@pres 2021-01-25 13:58:40 CST my]$ oc logs insights-operator-6b45495cb8-nzqcc -n openshift-insights

The logs were collected before the insights-operator-6b45495cb8-nzqcc pod was manually deleted:

$ grep "Source clusterconfig could not be retrieved" logs/insights-operator-6b45495cb8-nzqcc.log | wc -l
37

$ grep "Source clusterconfig could not be retrieved" logs/insights-operator-6b45495cb8-nzqcc.log | head -n 2
I0125 05:23:54.717915 1 controllerstatus.go:59] name=periodic-clusterconfig healthy=false reason=PeriodicGatherFailed message=Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating
I0125 05:23:59.348688 1 status.go:235] The operator has some internal errors: Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating

$ grep "Source clusterconfig could not be retrieved" logs/insights-operator-6b45495cb8-nzqcc.log | tail -n 2
I0125 05:57:44.119853 1 status.go:235] The operator has some internal errors: Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating
I0125 05:57:44.119877 1 status.go:287] The operator has some internal errors: Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating
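The grep checks above (match count, first and last occurrence) can be folded into one helper. This is a rough sketch under the assumption that the log uses the layout shown above, where the second whitespace-separated field of each line is the timestamp. The function name `summarize_message` is hypothetical.

```shell
#!/bin/sh
# Summarize how often, and over what time span, a message repeats in a log.
# Usage: summarize_message <logfile> <fixed-string>
summarize_message() {
    logfile=$1
    pattern=$2
    # -F treats the pattern as a fixed string, -c counts matching lines.
    count=$(grep -cF -- "$pattern" "$logfile")
    # Assumption: field 2 of each line (e.g. "I0125 05:23:54.717915 ...") is the timestamp.
    first=$(grep -F -- "$pattern" "$logfile" | head -n 1 | awk '{print $2}')
    last=$(grep -F -- "$pattern" "$logfile" | tail -n 1 | awk '{print $2}')
    echo "count=$count first=$first last=$last"
}
```

For example, `summarize_message logs/insights-operator-6b45495cb8-nzqcc.log "Source clusterconfig could not be retrieved"` would condense the three grep invocations above into a single line.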
Xingxing, thanks for the report. I think this is a good catch, and it should be reproducible whenever any of the gatherers in the Insights Operator (IO) fails.
Created attachment 1751017
Pre-workaround log

We ran into this same problem while upgrading from 4.7.0-fc.3 to 4.7.0-fc.4 this afternoon. Attached is the log file (insights-operator-58b4dbdd6f-rv5j4.log) taken from the pod before performing the workaround.
The PR that should fix it: https://github.com/openshift/insights-operator/pull/320
Retested per the steps in comment 0, upgrading from 4.6.12 to 4.7.0-0.nightly-2021-01-28-102244. I didn't hit it again (bug 1920027 is still hit, though):

$ oc get co insights
insights   4.7.0-0.nightly-2021-01-28-102244   True   False   False   167m
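For repeated retests, the Degraded condition can be read directly and not picked out of the table output. The snippet below is a sketch only. It assumes `oc` is logged in to the cluster under test, and the function names `degraded_status` and `wait_not_degraded` are hypothetical.

```shell
#!/bin/sh
# Print the status ("True", "False" or "Unknown") of the Degraded condition
# of a cluster operator (insights by default).
degraded_status() {
    oc get clusteroperator "${1:-insights}" \
        -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}'
}

# Poll until the operator is no longer Degraded, or give up.
# Usage: wait_not_degraded [operator] [attempts] [sleep-seconds]
wait_not_degraded() {
    name=${1:-insights}
    attempts=${2:-30}
    delay=${3:-10}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        [ "$(degraded_status "$name")" = "False" ] && return 0
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

`wait_not_degraded insights 60 30` would poll for up to about 30 minutes and return non-zero if the operator is still Degraded at the end, which is the symptom described in comment 0.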
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633