https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25861/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1357748792526376960

ClusterOperators did not settle: clusteroperator/insights is Degraded for 17m19.717543653s because "Source clusterconfig could not be retrieved: Get \"https://10.0.0.5:10250/containerLogs/openshift-sdn/sdn-controller-j59d2/sdn-controller?limitBytes=65536&sinceSeconds=86400\": dial tcp 10.0.0.5:10250: connect: connection refused, Get \"https://10.0.0.5:10250/containerLogs/openshift-sdn/sdn-z5nx8/sdn?limitBytes=65536&sinceSeconds=86400\": dial tcp 10.0.0.5:10250: connect: connection refused"

The insights operator must not go Degraded and block a cluster upgrade because of failures while gathering data from other components. The failure that caused this will be debugged independently; insights is not allowed to go Degraded because it could not gather data from one specific subsystem.
Especially transient errors.
This is happening in about 1% of upgrades: https://search.ci.openshift.org/?search=clusteroperator%2Finsights+is+Degraded&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job The symptom looks fairly similar across runs, but the correct behavior is not "hide this one failure from causing Degraded"; it is "don't go Degraded at all if not all data can be gathered".
OK, I didn't know this. Yes, we parse some logs (sdn, sdn-controller, and a few more) for interesting messages, and those pods are very likely not ready yet. I think the fix should be pretty easy; I am about to open a PR.
Let me ask once again here, please. If I am not mistaken, the original behaviour (4.6 and lower) was as follows: if there was any error during gathering (and most of the time there probably was none, because there were few gatherers), the state was set to Degraded, but a new gathering was triggered immediately, which probably means the Degraded state lasted only a very short time. We really only changed the second part: the new gathering is no longer triggered immediately; there is now a delay before the next run. So we don't actually need the Degraded state caused by a gatherer error at all, right?
The part we changed is that we now retry five times, using ExponentialBackoff, before setting the Degraded status. We implemented this because the original implementation created an infinite loop when a gather was failing consistently.
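For illustration only, here is a minimal sketch of that retry pattern using wait.ExponentialBackoff from k8s.io/apimachinery; the gather() helper and the backoff parameters are hypothetical stand-ins, not the operator's actual code or values.

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// gather is a hypothetical stand-in for one periodic gathering run.
func gather() error {
	return errors.New("containerLogs endpoint not reachable yet")
}

func main() {
	// Hypothetical backoff: five attempts with exponentially growing delays.
	backoff := wait.Backoff{
		Duration: 2 * time.Second, // initial delay
		Factor:   2.0,             // double the delay each attempt
		Steps:    5,               // give up after 5 attempts
	}

	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if gErr := gather(); gErr != nil {
			// Returning (false, nil) means "not done yet, retry".
			return false, nil
		}
		return true, nil // gathering succeeded, stop retrying
	})

	if err != nil {
		// Only after all retries fail would the operator report a condition at all.
		fmt.Println("gathering kept failing, would report a condition:", err)
	}
}
```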
Here's another failure: { s: "query failed: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1: promQL query: ALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards\",alertstate=\"firing\",severity!=\"info\"} >= 1 had reported incorrect results:\n[{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"ClusterOperatorDegraded\",\"alertstate\":\"firing\",\"condition\":\"Degraded\",\"endpoint\":\"metrics\",\"instance\":\"10.0.239.234:9099\",\"job\":\"cluster-version-operator\",\"name\":\"insights\",\"namespace\":\"openshift-cluster-version\",\"pod\":\"cluster-version-operator-64d68cd48d-r5h7f\",\"reason\":\"PeriodicGatherFailed\",\"service\":\"cluster-version-operator\",\"severity\":\"critical\"},\"value\":[1612980990.956,\"1\"]},{\"metric\":{\"__name__\":\"ALERTS\",\"alertname\":\"ClusterOperatorDown\",\"alertstate\":\"firing\",\"endpoint\":\"metrics\",\"instance\":\"10.0.239.234:9099\",\"job\":\"cluster-version-operator\",\"name\":\"insights\",\"namespace\":\"openshift-cluster-version\",\"pod\":\"cluster-version-operator-64d68cd48d-r5h7f\",\"service\":\"cluster-version-operator\",\"severity\":\"critical\",\"version\":\"4.7.0-0.ci.test-2021-02-10-172859-ci-ln-xb3dpyb\"},\"value\":[1612980990.956,\"1\"]}]",

Tomas' last question is what I consider fundamental. The insights operator may not go Degraded due to gather failures, but it should signal them via the Disabled condition (after enough consistent failures, maybe 3-4).
Put another way, the insights operator may not operationally bring down a cluster, which is what going Degraded on a purely cosmetic failure causes. Telemetry failing doesn't cause monitoring to report Degraded.
Thanks, Clayton. We updated the PR (the 4.7 one is waiting for group lead approval) so that the IO reports Disabled when the gathering-failure threshold is exceeded. It looks like it would be good to backport it to 4.6 as well.
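To make the intended behavior concrete, here is a rough sketch of counting consecutive gather failures and flipping to Disabled only after a threshold; the type, names, and threshold value are hypothetical and not the operator's actual implementation.

```go
package main

import "fmt"

// gatherStatus tracks consecutive gathering failures; the names and the
// threshold value below are hypothetical, for illustration only.
type gatherStatus struct {
	consecutiveFailures int
	disabledThreshold   int
}

// record updates the failure counter and reports whether the operator
// should flip the Disabled condition instead of ever going Degraded.
func (s *gatherStatus) record(gatherErrs []error) (disabled bool) {
	if len(gatherErrs) == 0 {
		s.consecutiveFailures = 0 // any clean run resets the counter
		return false
	}
	s.consecutiveFailures++
	return s.consecutiveFailures >= s.disabledThreshold
}

func main() {
	status := gatherStatus{disabledThreshold: 4}
	for i := 1; i <= 5; i++ {
		disabled := status.record([]error{fmt.Errorf("gather run %d failed", i)})
		fmt.Printf("run %d: disabled=%v\n", i, disabled)
	}
}
```

The point of the design is that a single transient failure (a kubelet that is briefly unreachable during an upgrade, for example) never changes any condition; only a sustained run of failures surfaces as Disabled.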
(In reply to Tomas Remes from comment #9)
> It looks like it would be good to backport it to 4.6 as well.

"Issue exists in 4.6" sounds like "not a 4.6 -> 4.7 regression", so I'm going to set blocker- on this to avoid delaying the 4.7 GA.
This would be challenging to verify directly. Steps to verify (a sketch of the final check follows below):
1) Change a gatherer of the IO to always return an error (modify the IO code).
2) Build the modified image and replace the IO on a cluster.
3) Check that the insights ClusterOperator does not go Degraded.
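A minimal sketch of how step 3 could be checked programmatically, assuming the openshift/client-go config clientset and a KUBECONFIG environment variable; in practice inspecting `oc get clusteroperator insights` accomplishes the same thing.

```go
package main

import (
	"context"
	"fmt"
	"os"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (path taken from KUBECONFIG for illustration).
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	co, err := client.ConfigV1().ClusterOperators().Get(context.TODO(), "insights", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// With the fix, Degraded should stay False even while gathering fails;
	// the failures should instead surface via the Disabled condition.
	for _, cond := range co.Status.Conditions {
		if cond.Type == configv1.OperatorDegraded && cond.Status == configv1.ConditionTrue {
			fmt.Println("FAIL: insights went Degraded:", cond.Message)
			return
		}
	}
	fmt.Println("OK: insights is not Degraded")
}
```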
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.1 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0678