1925517 – Monitoring operator flaps Progressing status multiple times when serving CAs are updated

Keywords:
Status:	CLOSED DUPLICATE of bug 1949840
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Brad Ison
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-02-05 12:27 UTC by Vadim Rutkovsky
Modified:	2021-05-27 10:04 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-05-27 10:04:10 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1949840	1	medium	CLOSED	CMO reports unavailable during upgrades	2021-10-18 17:30:35 UTC
Red Hat Bugzilla	1953264	1	medium	CLOSED	"remote error: tls: bad certificate" logs in prometheus-operator container	2022-12-30 08:56:25 UTC

Description Vadim Rutkovsky 2021-02-05 12:27:39 UTC

During a 4.6 -> 4.7 upgrade, the monitoring operator switches its Progressing status between True and False multiple times, because it watches several ConfigMaps and Secrets:
https://github.com/openshift/cluster-monitoring-operator/blob/master/pkg/operator/operator.go#L44-L50

All of these are controlled by `service-ca`, which updates them sequentially. As a result, the monitoring operator sets the status several times instead of once per update.

Comment 1 Sergiusz Urbaniak 2021-02-05 12:43:49 UTC

@damien: not super urgent, but maybe something we should look into. One idea is to consolidate and have one central sync point in CMO for the service-CA.

Comment 6 Brad Ison 2021-05-26 14:02:16 UTC

Unfortunately, I've found it pretty hard to reproduce the exact interactions that lead to this on a running cluster, so looking at upgrade jobs in CI has been the best bet so far. This definitely seems to be happening during most upgrades in CI.

One concrete example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1175/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1396962340552839168

Here it looks like the service-ca-operator started rolling new certs, which triggered a series of syncs, some of which failed because the CA bundle in the webhook for PrometheusRule resources was either not yet injected or was out of date:

> operator.go:474] Updating ClusterOperator status to failed. Err: running task Updating Prometheus Operator failed: reconciling prometheus-operator rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": x509: certificate signed by unknown authority
> E0525 00:32:31.452318       1 operator.go:399] Syncing "openshift-monitoring/cluster-monitoring-config" failed

Though, I definitely think that is just one of multiple possible interactions that can lead to the status flapping.

Ultimately, I guess this is due to the fact that we copy the data from the resources managed by the service-ca-operator into versions with a hash appended to the name, then template that into our deployments. That lets us automatically roll the pods when these change, but given that we run a bunch of tasks in parallel that have some amount of interdependence, things often fail when these certs or the CA change and the ConfigMaps and Secrets are updated in sequence.

I'm not really sure how best to work around this without a major change. The only thing that comes to mind is having a dedicated reconciliation loop for these, and instead of copying the data, we could just patch the pod specs in the deployments with a new annotation to cause them to restart. Does that sound reasonable?
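
For illustration only, the annotation idea could be sketched roughly as below. This is not CMO code: the annotation key `example.invalid/serving-ca-hash` and the helpers `hashData` and `stampAnnotation` are hypothetical, and plain maps stand in for the ConfigMap/Secret data and the pod template annotations. The assumption is that changing a pod template annotation is enough to make the Deployment roll its pods.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// caHashAnnotation is a hypothetical annotation key, not one used by CMO.
const caHashAnnotation = "example.invalid/serving-ca-hash"

// hashData returns a stable hex digest over the keys and values of a
// ConfigMap- or Secret-like data map. Keys are sorted so the digest does
// not depend on Go's randomized map iteration order.
func hashData(data map[string]string) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		// Length-prefix each field so {"ab": "c"} and {"a": "bc"} differ.
		fmt.Fprintf(h, "%d:%s%d:%s", len(k), k, len(data[k]), data[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// stampAnnotation writes the digest of data into the pod template
// annotations and reports whether the stored value changed. A caller would
// only need to patch the Deployment when this returns true.
func stampAnnotation(podAnnotations, data map[string]string) bool {
	sum := hashData(data)
	if podAnnotations[caHashAnnotation] == sum {
		return false
	}
	podAnnotations[caHashAnnotation] = sum
	return true
}

func main() {
	annotations := map[string]string{}
	bundle := map[string]string{"service-ca.crt": "old bundle"}

	fmt.Println(stampAnnotation(annotations, bundle)) // true: first stamp
	fmt.Println(stampAnnotation(annotations, bundle)) // false: unchanged data

	bundle["service-ca.crt"] = "rotated bundle"
	fmt.Println(stampAnnotation(annotations, bundle)) // true: CA was rotated
}
```

Because the function reports whether anything changed, a dedicated loop could skip the patch (and the rollout) when the CA data is re-delivered without modification.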

Comment 7 W. Trevor King 2021-05-27 01:27:28 UTC

There's no way to have the logic that's about to put you Progressing=False notice that you still have some config changes queued up that you're about to push out?  Or are the subsequent changes only queued after the previous change stops progressing?
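
A rough sketch of what such a check could look like, under the assumption that all pending changes are already known before the first sync finishes (which is exactly what the second question asks). The `progressTracker` type and its methods are hypothetical and not part of CMO:

```go
package main

import (
	"fmt"
	"sync"
)

// progressTracker is a hypothetical helper that counts config changes
// which have been observed but not yet fully rolled out.
type progressTracker struct {
	mu      sync.Mutex
	pending int
}

// Enqueue records that a new change has been observed.
func (p *progressTracker) Enqueue() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.pending++
}

// Done records that one change has been rolled out.
func (p *progressTracker) Done() {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.pending > 0 {
		p.pending--
	}
}

// Progressing stays true for as long as any change is still queued, so the
// reported status does not flip to False between consecutive changes.
func (p *progressTracker) Progressing() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.pending > 0
}

func main() {
	tracker := &progressTracker{}

	// Three updates arrive back to back before any of them is rolled out.
	for i := 0; i < 3; i++ {
		tracker.Enqueue()
	}

	for i := 0; i < 3; i++ {
		tracker.Done()
		fmt.Println(tracker.Progressing()) // true, true, then false
	}
}
```

If later changes are only observed after the earlier one has finished, the counter drops to zero in between and the status would still flap; the sketch does not address that case.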

Comment 8 Jan Fajerski 2021-05-27 08:46:14 UTC

We're tracking the service-ca update issues in https://bugzilla.redhat.com/show_bug.cgi?id=1953264 and implementing saner upgrade status reporting in https://bugzilla.redhat.com/show_bug.cgi?id=1949840.

It seems to me this bug here duplicates both those issues to a degree. Is there something in this issue that isn't covered by the two linked issues? If not we can probably close this one here?

Comment 9 Simon Pasquier 2021-05-27 09:26:18 UTC

From what Brad shared, they indeed look the same so +1 for me to close this one as a duplicate.

Comment 10 Brad Ison 2021-05-27 10:04:10 UTC


*** This bug has been marked as a duplicate of bug 1949840 ***
