Bug 1925517 - Monitoring operator flaps Progressing status multiple times when serving CAs are updated
Summary: Monitoring operator flaps Progressing status multiple times when serving CAs...
Status: CLOSED DUPLICATE of bug 1949840
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.8.0
Assignee: Brad Ison
QA Contact: Junqi Zhao
Depends On:
Reported: 2021-02-05 12:27 UTC by Vadim Rutkovsky
Modified: 2021-05-27 10:04 UTC (History)
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2021-05-27 10:04:10 UTC
Target Upstream Version:


System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1949840 1 medium CLOSED CMO reports unavailable during upgrades 2021-10-18 17:30:35 UTC
Red Hat Bugzilla 1953264 1 medium CLOSED "remote error: tls: bad certificate" logs in prometheus-operator container 2022-03-10 16:03:35 UTC

Description Vadim Rutkovsky 2021-02-05 12:27:39 UTC
During the 4.6 -> 4.7 upgrade, the monitoring operator switches its Progressing status from True to False multiple times, as it watches several ConfigMaps and Secrets:

All of these are controlled by `service-ca`, which updates them sequentially. As a result, the monitoring operator sets the status several times instead of once per update.

Comment 1 Sergiusz Urbaniak 2021-02-05 12:43:49 UTC
@damien: not super urgent, but maybe something we should look into. One idea is to consolidate and have one central sync point in CMO for the service-CA.

Comment 6 Brad Ison 2021-05-26 14:02:16 UTC
Unfortunately, I've found it pretty hard to reproduce the exact interactions that lead to this on a running cluster, so looking at upgrade jobs in CI has been the best bet so far. This definitely seems to be happening during most upgrades in CI.

One concrete example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1175/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1396962340552839168

Here it looks like the service-ca-operator started rolling new certs, which triggered a series of syncs, some of which failed because the CA bundle in the webhook for PrometheusRule resources was either not yet injected or was out of date:

> operator.go:474] Updating ClusterOperator status to failed. Err: running task Updating Prometheus Operator failed: reconciling prometheus-operator rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": x509: certificate signed by unknown authority
> E0525 00:32:31.452318       1 operator.go:399] Syncing "openshift-monitoring/cluster-monitoring-config" failed

Though, I definitely think that is just one of multiple possible interactions that can lead to the status flapping.

Ultimately, I guess this is due to the fact that we copy the data from the service-ca-operator managed resources into versions with a hash appended to the name, then template that into our deployments. That lets us automatically roll the pods when the data changes, but given that we run a bunch of tasks in parallel with some amount of interdependence, things often fail when the certs or the CA change and the ConfigMaps and Secrets are updated in sequence.
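The copy-with-hash mechanism described above can be sketched roughly as follows. This is a hypothetical illustration, not CMO's actual code; the resource name and data keys are made up, but the idea is the same: the hash suffix changes whenever the service-ca-managed data changes, so a Deployment that references the hashed name rolls its pods automatically.

```python
import hashlib
import json

def hashed_name(base_name: str, data: dict) -> str:
    """Append a short, deterministic hash of the payload to a resource name.

    Sketch of the copy-with-hash approach: the operator copies the
    service-ca-managed data into a new ConfigMap/Secret whose name embeds
    a hash of the contents, then templates that name into its deployments.
    """
    digest = hashlib.sha256(
        json.dumps(data, sort_keys=True).encode()
    ).hexdigest()[:10]
    return f"{base_name}-{digest}"

# A change to the CA bundle yields a new name, forcing a new rollout:
old = hashed_name("serving-certs-ca-bundle", {"ca-bundle.crt": "CERT-V1"})
new = hashed_name("serving-certs-ca-bundle", {"ca-bundle.crt": "CERT-V2"})
```

The downside, as noted above, is that each of the sequential service-ca updates produces a new hashed copy and hence a separate rollout, which is what drives the status flapping.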

I'm not really sure how best to work around this without a major change. The only thing that comes to mind is having a dedicated reconciliation loop for these, and instead of copying the data, we could just patch the pod specs in the deployments with a new annotation to cause them to restart. Does that sound reasonable?
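The annotation-patch alternative proposed above could look something like the following sketch, which builds a strategic-merge patch that bumps a pod-template annotation (the same trick `kubectl rollout restart` uses with `kubectl.kubernetes.io/restartedAt`). The annotation name and the idea of using the CA hash as its value are assumptions for illustration, not an agreed design.

```python
from datetime import datetime, timezone
from typing import Optional

def restart_patch(annotation: str = "monitoring.openshift.io/serving-ca-hash",
                  value: Optional[str] = None) -> dict:
    """Build a strategic-merge patch that bumps a pod-template annotation.

    Changing any field under spec.template causes the Deployment
    controller to roll the pods, so a single reconciliation loop could
    apply one patch per CA change instead of re-templating hashed copies.
    The annotation key here is hypothetical.
    """
    if value is None:
        value = datetime.now(timezone.utc).isoformat()
    return {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {annotation: value}
                }
            }
        }
    }

# The dict could then be passed to a PATCH call, e.g. (kubernetes client,
# names hypothetical):
#   apps_v1.patch_namespaced_deployment("prometheus-operator",
#                                       "openshift-monitoring",
#                                       restart_patch(value=ca_hash))
```

Setting the value to the CA bundle's hash rather than a timestamp would also make the patch idempotent: re-syncing with an unchanged CA produces no rollout.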

Comment 7 W. Trevor King 2021-05-27 01:27:28 UTC
There's no way to have the logic that's about to put you Progressing=False notice that you still have some config changes queued up that you're about to push out?  Or are the subsequent changes only queued after the previous change stops progressing?

Comment 8 Jan Fajerski 2021-05-27 08:46:14 UTC
We're tracking the service-ca update issues in https://bugzilla.redhat.com/show_bug.cgi?id=1953264 and implementing saner upgrade status reporting in https://bugzilla.redhat.com/show_bug.cgi?id=1949840.

It seems to me this bug duplicates both of those issues to a degree. Is there something in this issue that isn't covered by the two linked bugs? If not, we can probably close this one?

Comment 9 Simon Pasquier 2021-05-27 09:26:18 UTC
From what Brad shared, they indeed look the same so +1 for me to close this one as a duplicate.

Comment 10 Brad Ison 2021-05-27 10:04:10 UTC

*** This bug has been marked as a duplicate of bug 1949840 ***
