Bug 1919778
Summary: | Upgrade is stuck in insights operator Degraded with "Source clusterconfig could not be retrieved" until insights operator pod is manually deleted
---|---
Product: | OpenShift Container Platform
Component: | Insights Operator
Version: | 4.7
Status: | CLOSED ERRATA
Severity: | medium
Priority: | unspecified
Keywords: | Upgrades
Target Release: | 4.7.0
Hardware: | Unspecified
OS: | Unspecified
Reporter: | Xingxing Xia <xxia>
Assignee: | Marcell Sevcsik <msevcsik>
QA Contact: | Pavel Šimovec <psimovec>
CC: | aos-bugs, brad.williams, fdeutsch, inecas, mklika, msevcsik, rvokal, sdodson, tremes, wking
Type: | Bug
Doc Type: | Bug Fix
Last Closed: | 2021-02-24 15:55:50 UTC
Doc Text:
Cause: An error from any single gather function caused the Insights Operator to go Degraded.
Consequence: After an upgrade, gather functions tend to fail because they may start executing before the gathered resource is ready. That error was enough to mark the operator Degraded.
Fix: Introduce a retry using ExponentialBackOff, so that a gather is attempted a few times before the operator goes Degraded.
Result: During an upgrade, the Insights Operator no longer goes Degraded.
Description
Xingxing Xia
2021-01-25 06:39:10 UTC
Created attachment 1750400 [details]
[xxia@pres 2021-01-25 13:58:40 CST my]$ oc logs insights-operator-6b45495cb8-nzqcc -n openshift-insights
Logs collected before the insights-operator-6b45495cb8-nzqcc pod was manually deleted:
$ grep "Source clusterconfig could not be retrieved" logs/insights-operator-6b45495cb8-nzqcc.log | wc -l
37
$ grep "Source clusterconfig could not be retrieved" logs/insights-operator-6b45495cb8-nzqcc.log | head -n 2
I0125 05:23:54.717915 1 controllerstatus.go:59] name=periodic-clusterconfig healthy=false reason=PeriodicGatherFailed message=Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating
I0125 05:23:59.348688 1 status.go:235] The operator has some internal errors: Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating
$ grep "Source clusterconfig could not be retrieved" logs/insights-operator-6b45495cb8-nzqcc.log | tail -n 2
I0125 05:57:44.119853 1 status.go:235] The operator has some internal errors: Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating
I0125 05:57:44.119877 1 status.go:287] The operator has some internal errors: Source clusterconfig could not be retrieved: container "openshift-apiserver-operator" in pod "openshift-apiserver-operator-5d9f9f75bc-tgxff" is waiting to start: ContainerCreating
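The first and last matches in the excerpt above are roughly 34 minutes apart, which shows how long the operator kept reporting the same failure. The following is a minimal Python sketch for measuring that span from a saved log. It assumes the standard klog header layout (`Lmmdd hh:mm:ss.uuuuuu`); the helper names `klog_time` and `failure_span` are hypothetical, and the year is an assumption taken from the report date because klog headers do not carry one.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, Tuple

# klog header: severity letter, month, day, then time with microseconds.
KLOG_HEADER = re.compile(r"^[IWEF](\d{2})(\d{2}) (\d{2}):(\d{2}):(\d{2})\.(\d{6})")

NEEDLE = "Source clusterconfig could not be retrieved"


def klog_time(line: str, year: int = 2021) -> datetime:
    """Parse the timestamp from a klog line (year is assumed, not logged)."""
    match = KLOG_HEADER.match(line)
    if match is None:
        raise ValueError("not a klog line: %r" % line[:40])
    month, day, hour, minute, second, micro = (int(g) for g in match.groups())
    return datetime(year, month, day, hour, minute, second, micro)


def failure_span(lines: Iterable[str], needle: str = NEEDLE) -> Tuple[int, timedelta]:
    """Return (number of matching lines, time between first and last match)."""
    matches = [line for line in lines if needle in line]
    if not matches:
        return 0, timedelta(0)
    return len(matches), klog_time(matches[-1]) - klog_time(matches[0])
```

For a saved log, `failure_span(open("logs/insights-operator-6b45495cb8-nzqcc.log"))` would return the match count alongside the elapsed time.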
Xingxing, thanks for the report. I think this is a good catch, and it should be reproducible whenever any of the gatherers in the Insights Operator (IO) fails.
Created attachment 1751017 [details]
Pre-workaround log
We ran into the same problem while upgrading from 4.7.0-fc.3 to 4.7.0-fc.4 this afternoon. The attached log file (insights-operator-58b4dbdd6f-rv5j4.log) was taken from the pod before we performed the workaround.
The PR that should fix it: https://github.com/openshift/insights-operator/pull/320
Retested per the comment 0 steps, upgrading from 4.6.12 to 4.7.0-0.nightly-2021-01-28-102244. Didn't hit it again (bug 1920027 is still hit, though):
$ oc get co insights
insights 4.7.0-0.nightly-2021-01-28-102244 True False False 167m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2020:5633
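The fix described in the Doc Text retries a failing gather with exponential backoff before the operator reports Degraded. The sketch below illustrates that idea in Python. It is not the operator's Go code from the PR; the function names, the number of steps, the initial delay, and the factor are all hypothetical values chosen for illustration.

```python
import time
from typing import Callable, List, Optional, Tuple


def gather_with_backoff(
    gather: Callable[[], str],
    steps: int = 4,               # hypothetical: attempts before giving up
    initial_delay: float = 0.5,   # hypothetical: seconds before the first retry
    factor: float = 2.0,          # hypothetical: delay multiplier per retry
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, Optional[str], List[float]]:
    """Call gather() up to `steps` times, waiting longer after each failure.

    Returns (healthy, result_or_last_error, delays_waited). Only when every
    attempt fails is the caller told the gather is unhealthy.
    """
    delay = initial_delay
    delays: List[float] = []
    last_error: Optional[str] = None
    for attempt in range(steps):
        try:
            return True, gather(), delays
        except Exception as err:  # a real gatherer would catch narrower errors
            last_error = str(err)
            if attempt < steps - 1:  # no point sleeping after the last attempt
                sleep(delay)
                delays.append(delay)
                delay *= factor
    return False, last_error, delays


def make_flaky_gather(failures: int) -> Callable[[], str]:
    """Hypothetical gatherer that fails `failures` times, then succeeds."""
    remaining = {"n": failures}

    def gather() -> str:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise RuntimeError("container is waiting to start: ContainerCreating")
        return "clusterconfig gathered"

    return gather


if __name__ == "__main__":
    # Use a no-op sleep so the demonstration finishes immediately.
    healthy, outcome, waited = gather_with_backoff(
        make_flaky_gather(2), sleep=lambda seconds: None
    )
    print(healthy, outcome, waited)
```

With this shape, a resource that is briefly unavailable after an upgrade (for example a pod still in ContainerCreating) costs a few retries instead of flipping the operator to Degraded on the first error.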