Description of problem:
The monitoring cluster operator is in a degraded state for about 64 seconds during the kube-apiserver rollout on an SNO cluster. It appears to trigger updates regularly to reconcile control plane components, prometheus-k8s, prometheus-adapter, openshift-state-metrics, prometheus-user-workload, node-exporter and others, as noted in the logs, on changes to ConfigMaps and Secrets: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/api-rollout/monitoring_operator.log. This happens frequently, and the operator fails to update the resources when its GET requests to the API server fail. Is it possible to poll the API server less aggressively and reserve Degraded=True for longer-lasting issues?

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-06-24-222938

How reproducible:
Always

Steps to Reproduce:
1. Install an SNO cluster using one of the latest 4.8 nightly payloads.
2. Roll out the kube-apiserver.
3. Observe the status of the Monitoring operator (see the sketch after this report).

Actual results:
The Monitoring operator is in a degraded state for ~64 seconds during the API server downtime.

Expected results:
The Monitoring cluster operator handles the API downtime gracefully.
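Not part of the original report, but a minimal sketch of steps 2 and 3, assuming cluster-admin access via `oc`; patching forceRedeploymentReason on the kubeapiserver operator config is one common way to force a rollout:

# Step 2: force a kube-apiserver rollout (the reason string is arbitrary; adjust as needed).
$ oc patch kubeapiserver cluster --type merge \
    -p "{\"spec\":{\"forceRedeploymentReason\":\"api-rollout-$(date +%s)\"}}"

# Step 3: watch the monitoring ClusterOperator; the DEGRADED column should flip to True
# while the API server is unavailable on the single node and back to False once it recovers.
$ oc get clusteroperator monitoring -w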
Possibly a dup of bug 1949840? I'm not sure how broad a change bug 1949840 is aiming for.

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=96h&type=junit&search=clusteroperator/monitoring+should+not+change+condition' | grep 'single-node.*failures match' | grep -v 'pull-ci-\|rehearse-' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-single-node (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node (all) - 4 runs, 75% failed, 133% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-single-node (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
Yes, I think it's the same issue at the core. The fix for https://bugzilla.redhat.com/show_bug.cgi?id=1949840 should improve the SNO situation quite a bit, though I think there is still room for improvement. Let's keep this open for now to track the impact on SNO.
The fix for the related bug was merged last week (PR https://github.com/openshift/cluster-monitoring-operator/pull/1193). I'd be interested in whether and how much this improves the situation for SNO.
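A rough way to quantify the improvement (not from this report; a sketch assuming the same cluster-admin `oc` access as above) is to sample the monitoring operator's Degraded condition once a second across a kube-apiserver rollout and print how long it stays True:

# Sketch only: report how long the monitoring ClusterOperator remains Degraded.
start=""
while true; do
  s=$(oc get clusteroperator monitoring \
    -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}' 2>/dev/null)
  if [ "$s" = "True" ] && [ -z "$start" ]; then
    start=$(date +%s)
    echo "Degraded went True at $(date -u)"
  elif [ "$s" != "True" ] && [ -n "$start" ]; then
    echo "Degraded cleared after $(( $(date +%s) - start ))s"
    start=""
  fi
  sleep 1
done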
Jan, we no longer see the monitoring operator in a degraded state during the API rollout on Single Node OpenShift; the rollout has been tuned to last around 60 seconds in the latest builds.
*** This bug has been marked as a duplicate of bug 1949840 ***