During a 4.6 -> 4.7 upgrade, the monitoring operator switches its Progressing status from True to False multiple times, as it watches several ConfigMaps and Secrets: https://github.com/openshift/cluster-monitoring-operator/blob/master/pkg/operator/operator.go#L44-L50 All of these are controlled by `service-ca`, which updates them sequentially. As a result, the monitoring operator sets the status several times instead of once per update.
@damien: not super urgent, but maybe something we should look into. One idea is to consolidate and have one central sync point in CMO for the service-CA.
Unfortunately, I've found it pretty hard to reproduce the exact interactions that lead to this on a running cluster, so looking at upgrade jobs in CI has been the best bet so far. This definitely seems to happen during most upgrades in CI. One concrete example: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/1175/pull-ci-openshift-cluster-monitoring-operator-master-e2e-agnostic-upgrade/1396962340552839168

Here it looks like the service-ca-operator started rolling new certs, which triggered a series of syncs, some of which failed because the CA bundle in the webhook for PrometheusRule resources was either not yet injected or was out of date:

> operator.go:474] Updating ClusterOperator status to failed. Err: running task Updating Prometheus Operator failed: reconciling prometheus-operator rules PrometheusRule failed: updating PrometheusRule object failed: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": x509: certificate signed by unknown authority
> E0525 00:32:31.452318 1 operator.go:399] Syncing "openshift-monitoring/cluster-monitoring-config" failed

That said, I think this is just one of multiple possible interactions that can lead to the status flapping. Ultimately, this comes down to the fact that we copy the data from the service-ca-operator managed resources into versions with a hash appended to the name, then template that into our deployments. That lets us automatically roll the pods when these change, but given that we run a bunch of tasks in parallel that have some amount of interdependence, things often fail when these certs or the CA change and the ConfigMaps and Secrets are updated in sequence. I'm not really sure how best to work around this without a major change.
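For context, the hash-suffixed copies mentioned above work roughly like this: the name of the copied resource embeds a digest of its data, so any change to the underlying Secret/ConfigMap produces a new name, which in turn changes the pod template and rolls the deployment. A minimal sketch (not CMO's actual code; the function and resource names here are illustrative):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// hashedName appends a short, deterministic content hash to a resource
// name. A deployment that references the copy by this name gets a new
// pod template (and therefore a rollout) whenever the data changes.
func hashedName(name string, data map[string]string) string {
	// Iterate keys in sorted order so the digest is stable across runs.
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(data[k]))
	}
	// First 4 bytes of the digest are enough for a readable suffix.
	return fmt.Sprintf("%s-%x", name, h.Sum(nil)[:4])
}

func main() {
	caBundle := map[string]string{"service-ca.crt": "-----BEGIN CERTIFICATE-----..."}
	fmt.Println(hashedName("serving-certs-ca-bundle", caBundle))
}
```

The downside, as described above, is that every sequential update by service-ca produces a new name, and each rename ripples through the parallel sync tasks.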
The only thing that comes to mind is having a dedicated reconciliation loop for these, and instead of copying the data, we could just patch the pod specs in the deployments with a new annotation to cause them to restart. Does that sound reasonable?
There's no way to have the logic that's about to put you Progressing=False notice that you still have some config changes queued up that you're about to push out? Or are the subsequent changes only queued after the previous change stops progressing?
We're tracking the service-ca update issues in https://bugzilla.redhat.com/show_bug.cgi?id=1953264 and implementing saner upgrade status reporting in https://bugzilla.redhat.com/show_bug.cgi?id=1949840. It seems to me this bug duplicates both of those issues to a degree. Is there anything in this issue that isn't covered by the two linked ones? If not, we can probably close this one.
From what Brad shared, they indeed look the same so +1 for me to close this one as a duplicate.
*** This bug has been marked as a duplicate of bug 1949840 ***