Bug 1977470

Summary: Monitoring operator is in degraded state for ~64 sec during the API server rollout in SNO
Product: OpenShift Container Platform
Reporter: Naga Ravi Chaitanya Elluri <nelluri>
Component: Monitoring
Assignee: Jan Fajerski <jfajersk>
Status: CLOSED DUPLICATE
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Priority: medium
Version: 4.8
CC: alegrand, anpicker, aos-bugs, erooth, kakkoyun, nelluri, pkrupa, pnair, spasquie, wking
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard: chaos
Last Closed: 2021-07-26 06:34:55 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 1984730    

Description Naga Ravi Chaitanya Elluri 2021-06-29 20:23:01 UTC
Description of problem:
The monitoring cluster operator is in a degraded state for about 64 seconds during the kube-apiserver rollout on a SNO cluster. The operator regularly triggers updates of the control plane components, prometheus-k8s, prometheus-adapter, openshift-state-metrics, prometheus-user-workload, node-exporter and others whenever ConfigMaps and Secrets change, as noted in the logs: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/api-rollout/monitoring_operator.log. This happens frequently, and the operator runs into errors updating those resources when its GET requests to the API server fail.

Is it possible to poll the API server less aggressively and reserve Degraded=True for longer-lasting issues?
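
For reference, a quick way to time the Degraded window from the CLI (a rough sketch, assuming cluster-admin access with oc; it just polls the monitoring ClusterOperator's Degraded condition and its lastTransitionTime once per second):

$ while true; do oc get clusteroperator monitoring -o jsonpath='{range .status.conditions[?(@.type=="Degraded")]}{.status}{" "}{.lastTransitionTime}{"\n"}{end}'; sleep 1; done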

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-06-24-222938

How reproducible:
Always

Steps to Reproduce:
1. Install a SNO cluster using one of the latest 4.8 nightly payloads.
2. Roll out the kube-apiserver (example commands below).
3. Observe the status of the Monitoring operator.
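
For steps 2 and 3, the commands below are one way to do this (a sketch, assuming cluster-admin access; the forceRedeploymentReason value is arbitrary and only needs to change to trigger a new rollout):

# Step 2: force a kube-apiserver redeployment by bumping forceRedeploymentReason.
$ oc patch kubeapiserver cluster --type=merge -p '{"spec":{"forceRedeploymentReason":"chaos-test-'"$(date +%s)"'"}}'
# Step 3: watch the monitoring ClusterOperator's conditions while the API server restarts.
$ oc get clusteroperator monitoring -w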

Actual results:
The monitoring operator is in a degraded state for ~64 seconds during the API server downtime.

Expected results:
Monitoring cluster operator handles the API downtime gracefully.

Comment 1 W. Trevor King 2021-06-29 20:40:54 UTC
Possibly a dup of bug 1949840?  I'm not sure how broad a change bug 1949840 is aiming for.

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=96h&type=junit&search=clusteroperator/monitoring+should+not+change+condition' | grep 'single-node.*failures match' | grep -v 'pull-ci-\|rehearse-' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-single-node (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node (all) - 4 runs, 75% failed, 133% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-single-node (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node (all) - 4 runs, 100% failed, 75% of failures match = 75% impact

Comment 2 Jan Fajerski 2021-06-30 07:45:02 UTC
Yes, I think it's the same issue at the core. The fix for https://bugzilla.redhat.com/show_bug.cgi?id=1949840 should improve the SNO situation quite a bit, though I think there is still room for improvement. Let's keep this open for now to track the impact on SNO.

Comment 3 Jan Fajerski 2021-07-12 07:43:46 UTC
The fix for the related bug was merged last week (PR https://github.com/openshift/cluster-monitoring-operator/pull/1193).

I'd be interested in whether and how this improves the situation for SNO.

Comment 4 Naga Ravi Chaitanya Elluri 2021-07-24 23:01:32 UTC
Jan, we no longer see the monitoring operator in a degraded state during the API rollout in Single Node OpenShift, which has been tuned to last around 60 seconds in the latest builds.

Comment 5 Simon Pasquier 2021-07-26 06:34:55 UTC

*** This bug has been marked as a duplicate of bug 1949840 ***