From CI runs like [1]:

[bz-Monitoring] clusteroperator/monitoring should not change condition/Available
Run #0: Failed (0s)
2 unexpected clusteroperator state transitions during e2e test run

Apr 09 13:16:50.662 - 329s E clusteroperator/monitoring condition/Available status/False reason/
Apr 09 13:25:31.238 - 212s E clusteroperator/monitoring condition/Available status/False reason/

No reason or message, which is really not a great user experience:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/4831/pull-ci-openshift-installer-master-e2e-aws-upgrade/1380486185595441152/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/monitoring condition/Available'
Apr 09 13:16:50.662 E clusteroperator/monitoring condition/Available status/False changed:
Apr 09 13:16:50.662 - 329s E clusteroperator/monitoring condition/Available status/False reason/
Apr 09 13:22:19.749 W clusteroperator/monitoring condition/Available status/True reason/RollOutDone changed: Successfully rolled out the stack.
Apr 09 13:25:31.238 E clusteroperator/monitoring condition/Available status/False changed:
Apr 09 13:25:31.238 - 212s E clusteroperator/monitoring condition/Available status/False reason/
Apr 09 13:29:03.744 W clusteroperator/monitoring condition/Available status/True reason/RollOutDone changed: Successfully rolled out the stack.

If you're going to claim to be completely dead, at least give folks a reason ;). Very popular:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&name=^periodic.*upgrade&type=junit&search=clusteroperator/monitoring+should+not+change+condition/Available' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade (all) - 16 runs, 100% failed, 63% of failures match = 63% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 20 runs, 100% failed, 95% of failures match = 95% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 17 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-ovn-upgrade (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-azure-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-ovirt-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-gcp-upgrade (all) - 19 runs, 100% failed, 74% of failures match = 74% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade (all) - 10 runs, 80% failed, 25% of failures match = 20% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade (all) - 9 runs, 56% failed, 60% of failures match = 33% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade (all) - 9 runs, 100% failed, 89% of failures match = 89% impact
periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade (all) - 10 runs, 100% failed, 90% of failures match = 90% impact

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/4831/pull-ci-openshift-installer-master-e2e-aws-upgrade/1380486185595441152
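For illustration, here is a minimal sketch of what a fully populated Available=False condition looks like when written through the openshift/api config/v1 types. This is not the actual cluster-monitoring-operator code, and the reason/message strings are hypothetical placeholders; the point is simply that both fields should be set so CI and users see more than "status/False reason/".

package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	cond := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorAvailable,
		Status:             configv1.ConditionFalse,
		LastTransitionTime: metav1.Now(),
		// Without Reason and Message, consumers only see
		// "status/False reason/" with nothing actionable.
		Reason:  "RolloutFailed", // hypothetical example reason
		Message: "Rollout of the monitoring stack failed; see the Degraded condition for details.",
	}
	fmt.Printf("%s=%s reason=%s message=%q\n", cond.Type, cond.Status, cond.Reason, cond.Message)
}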
cmo#1112 is going to recycle the current Degraded=True reason when setting Available=False and Progressing=False. Makes sense to me. But ideally monitoring won't go Available=False at all during CI updates. Checking the Degraded information from the job in comment 0:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/4831/pull-ci-openshift-installer-master-e2e-aws-upgrade/1380486185595441152/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/monitoring condition/Degraded'
Apr 09 13:16:50.662 E clusteroperator/monitoring condition/Degraded status/True reason/UpdatingPrometheusOperatorFailed changed: Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling prometheus-operator rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": x509: certificate signed by unknown authority
Apr 09 13:16:50.662 - 329s E clusteroperator/monitoring condition/Degraded status/True reason/Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling prometheus-operator rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": x509: certificate signed by unknown authority
Apr 09 13:22:19.749 W clusteroperator/monitoring condition/Degraded status/False changed:
Apr 09 13:25:31.238 E clusteroperator/monitoring condition/Degraded status/True reason/UpdatingControlPlanecomponentsFailed changed: Failed to rollout the stack. Error: running task Updating Control Plane components failed: reconciling etcd rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": dial tcp 10.129.0.56:8080: connect: no route to host
Apr 09 13:25:31.238 - 212s E clusteroperator/monitoring condition/Degraded status/True reason/Failed to rollout the stack. Error: running task Updating Control Plane components failed: reconciling etcd rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": dial tcp 10.129.0.56:8080: connect: no route to host
Apr 09 13:29:03.744 W clusteroperator/monitoring condition/Degraded status/False changed:

So ~5m with UpdatingPrometheusOperatorFailed on an admission webhook x509 error, and then ~3m with UpdatingControlPlanecomponentsFailed on an admission webhook "no route to host" error. Should I spin out another bug for sorting out what's going on with those, or did you want to address it as part of this bug as well? Definitely spin it out if the issues are with a non-monitoring component that you can't directly fix.
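A rough sketch of the idea behind cmo#1112 (not the actual implementation in that PR): when the operator has to report Available=False, reuse the reason and message already carried by the Degraded=True condition instead of leaving them empty. The function and fallback strings below are hypothetical.

package status

import (
	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// availableFalseFrom builds an Available=False condition, recycling the reason
// and message of an existing Degraded=True condition when one is present.
func availableFalseFrom(conditions []configv1.ClusterOperatorStatusCondition) configv1.ClusterOperatorStatusCondition {
	out := configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorAvailable,
		Status:             configv1.ConditionFalse,
		LastTransitionTime: metav1.Now(),
		Reason:             "RolloutFailed", // hypothetical fallback reason
		Message:            "Rollout of the monitoring stack failed.",
	}
	for _, c := range conditions {
		if c.Type == configv1.OperatorDegraded && c.Status == configv1.ConditionTrue {
			out.Reason = c.Reason
			out.Message = c.Message
			break
		}
	}
	return out
}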
My vote is to spin this out into a new bug.
I'd say that we should have another bug for the admission webhook errors. @Jan, can you create it from the information that Trevor already highlighted?
Added https://bugzilla.redhat.com/show_bug.cgi?id=1949840 for the upgrade availability issue.
From today's CI run, we can see that a reason is now displayed when the monitoring ClusterOperator goes Available=False:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1382871934601007104/artifacts/e2e-aws-upgrade/openshift-e2e-test/build-log.txt | grep 'clusteroperator/monitoring condition/Available'
Apr 16 02:56:57.578 E clusteroperator/monitoring condition/Available status/False reason/UpdatingkubeStateMetricsFailed changed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
Apr 16 02:56:57.578 - 155s E clusteroperator/monitoring condition/Available status/False reason/Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
Apr 16 02:59:33.179 W clusteroperator/monitoring condition/Available status/True reason/RollOutDone changed: Successfully rolled out the stack.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438