1926082 – Insights operator should not go degraded during upgrade

Summary: Insights operator should not go degraded during upgrade

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Insights Operator
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Tomas Remes
QA Contact:	Pavel Šimovec
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1925659

Reported:	2021-02-08 07:37 UTC by Tomas Remes
Modified:	2021-07-27 22:42 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:	1925659
Environment:
Last Closed:	2021-07-27 22:42:10 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift insights-operator pull 332	0	None	open	Bug 1926082: Relax the recent log gatherers to avoid degrading during…	2021-02-08 07:40:41 UTC
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 22:42:37 UTC

Description Tomas Remes 2021-02-08 07:37:50 UTC

+++ This bug was initially created as a clone of Bug #1925659 +++

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25861/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1357748792526376960

    ClusterOperators did not settle: 
    clusteroperator/insights is Degraded for 17m19.717543653s because "Source clusterconfig could not be retrieved: Get \"https://10.0.0.5:10250/containerLogs/openshift-sdn/sdn-controller-j59d2/sdn-controller?limitBytes=65536&sinceSeconds=86400\": dial tcp 10.0.0.5:10250: connect: connection refused, Get \"https://10.0.0.5:10250/containerLogs/openshift-sdn/sdn-z5nx8/sdn?limitBytes=65536&sinceSeconds=86400\": dial tcp 10.0.0.5:10250: connect: connection refused"

The insights operator is not allowed to go degraded and block a cluster upgrade because of failures in gathering data from other components.

The failure that caused this will be debugged independently; insights is not allowed to go degraded because of a failure to gather data from one specific subsystem.

--- Additional comment from Clayton Coleman on 2021-02-05 20:05:43 UTC ---

Especially transient errors.

--- Additional comment from Clayton Coleman on 2021-02-05 20:39:04 UTC ---

This is happening in about 1% of upgrades:

https://search.ci.openshift.org/?search=clusteroperator%2Finsights+is+Degraded&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

It looks like the symptom is fairly similar, but the correct behavior is not "hide this one failure from causing degraded"; it is "don't degrade at all if not all data can be gathered".

--- Additional comment from Tomas Remes on 2021-02-08 07:23:21 UTC ---

OK. I didn't know this. Yes, we parse some logs (sdn, sdn-controller, and some more) for interesting messages, and these are very likely not ready yet. I think the fix should be pretty easy. I am about to open a PR.
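
As a rough illustration of the behavior requested in the comments above ("don't degrade at all if not all data can be gathered"), here is a minimal Python sketch. It is not the Insights operator's actual code (the operator is not written in Python) and not the change made in the linked pull request. Every name in it (`GatherReport`, `gather_all`, `operator_conditions`, the `operator_healthy` flag) is hypothetical. The idea is only that each gatherer failure is recorded alongside the partial results, while the Degraded condition is derived from the operator's own health and never from gather errors.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class GatherReport:
    """Partial results plus the errors hit while gathering (hypothetical type)."""
    records: Dict[str, str] = field(default_factory=dict)
    errors: List[str] = field(default_factory=list)


def gather_all(gatherers: Dict[str, Callable[[], str]]) -> GatherReport:
    """Run every gatherer; a failure in one never aborts the others."""
    report = GatherReport()
    for name, gather in gatherers.items():
        try:
            report.records[name] = gather()
        except Exception as exc:  # e.g. a kubelet that refuses connections mid-upgrade
            report.errors.append(f"{name}: {exc}")
    return report


def operator_conditions(report: GatherReport, operator_healthy: bool = True) -> Dict[str, object]:
    """Derive status conditions (assumed shape, for illustration only).

    Gather errors are surfaced for debugging but do not influence Degraded.
    """
    return {
        "Degraded": not operator_healthy,
        "GatherErrors": list(report.errors),
    }
```

With one gatherer that succeeds and one that raises a connection error, `gather_all` returns the successful record plus a single error entry, and `operator_conditions` still reports `Degraded` as `False`.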

Comment 3 errata-xmlrpc 2021-07-27 22:42:10 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
