Bug 1977470 - Monitoring operator is in degraded state for ~64 sec during the API server rollout in SNO
Summary: Monitoring operator is in degraded state for ~64 sec during the API server rollout in SNO
Keywords:
Status: CLOSED DUPLICATE of bug 1949840
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Jan Fajerski
QA Contact: Junqi Zhao
URL:
Whiteboard: chaos
Depends On:
Blocks: 1984730
 
Reported: 2021-06-29 20:23 UTC by Naga Ravi Chaitanya Elluri
Modified: 2021-07-26 06:34 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-26 06:34:55 UTC
Target Upstream Version:
Embargoed:



Description Naga Ravi Chaitanya Elluri 2021-06-29 20:23:01 UTC
Description of problem:
The monitoring cluster operator is in a degraded state for about 64 seconds during the kube-apiserver rollout on an SNO cluster. The operator regularly triggers updates of the control plane components, prometheus-k8s, prometheus-adapter, openshift-state-metrics, prometheus-user-workload, node-exporter and others in response to changes to ConfigMaps and Secrets, as noted in the logs: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/sno/api-rollout/monitoring_operator.log. This happens frequently, and the operator fails to update those resources when its GET requests to the API server fail.
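
For illustration, a minimal sketch of retrying a failed GET with backoff so a brief apiserver outage does not immediately fail the sync, assuming client-go; the ConfigMap name, backoff values, and error predicates here are illustrative, not what cluster-monitoring-operator actually does:

    // retry_get.go: retry a GET through a short apiserver outage instead of
    // failing (and going Degraded) on the first error.
    package main

    import (
    	"context"
    	"fmt"
    	"time"

    	apierrors "k8s.io/apimachinery/pkg/api/errors"
    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	utilnet "k8s.io/apimachinery/pkg/util/net"
    	"k8s.io/apimachinery/pkg/util/wait"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/tools/clientcmd"
    	"k8s.io/client-go/util/retry"
    )

    func main() {
    	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    	if err != nil {
    		panic(err)
    	}
    	client := kubernetes.NewForConfigOrDie(cfg)

    	// Roughly a minute of retries, which would cover a ~60s rollout.
    	backoff := wait.Backoff{Steps: 6, Duration: 2 * time.Second, Factor: 2.0}
    	err = retry.OnError(backoff,
    		func(err error) bool {
    			// Treat "apiserver briefly unreachable" style errors as retriable.
    			return utilnet.IsConnectionRefused(err) ||
    				apierrors.IsServerTimeout(err) ||
    				apierrors.IsServiceUnavailable(err) ||
    				apierrors.IsTooManyRequests(err)
    		},
    		func() error {
    			_, err := client.CoreV1().ConfigMaps("openshift-monitoring").
    				Get(context.TODO(), "cluster-monitoring-config", metav1.GetOptions{})
    			return err
    		})
    	if err != nil {
    		fmt.Println("giving up after retries:", err)
    	}
    }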

Is it possible to poll the API server less aggressively and to reserve Degraded=True for longer-lasting issues?
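
As an illustration of what reserving Degraded=True for longer-lasting issues could look like, a hypothetical sketch (not the operator's actual status code): the condition only flips once reconcile errors outlast a grace period, so a ~60-second apiserver rollout never surfaces as Degraded.

    // degraded_debounce.go: report Degraded only after errors persist
    // past a grace period.
    package main

    import (
    	"fmt"
    	"time"
    )

    // degradedDebouncer tracks how long reconcile attempts have been failing.
    type degradedDebouncer struct {
    	grace      time.Duration
    	firstError time.Time // zero while healthy
    }

    // observe records one reconcile outcome and returns whether the
    // operator should currently report Degraded=True.
    func (d *degradedDebouncer) observe(err error, now time.Time) bool {
    	if err == nil {
    		d.firstError = time.Time{} // recovered; reset the window
    		return false
    	}
    	if d.firstError.IsZero() {
    		d.firstError = now // first failure starts the grace window
    	}
    	return now.Sub(d.firstError) >= d.grace
    }

    func main() {
    	d := &degradedDebouncer{grace: 2 * time.Minute}
    	now := time.Now()
    	// Inside the grace window: not Degraded yet.
    	fmt.Println(d.observe(fmt.Errorf("apiserver unreachable"), now))
    	// Still failing after the window: now Degraded.
    	fmt.Println(d.observe(fmt.Errorf("apiserver unreachable"), now.Add(3*time.Minute)))
    }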

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-06-24-222938

How reproducible:
Always

Steps to Reproduce:
1. Install an SNO cluster using one of the latest 4.8 nightly payloads.
2. Roll out the kube-apiserver (one way to force a rollout is sketched after these steps).
3. Observe the status of the monitoring operator.
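
For step 2, a hedged sketch of forcing a kube-apiserver rollout, assuming client-go's dynamic client: bumping spec.forceRedeploymentReason on the KubeAPIServer operator resource makes the operator redeploy its static pods. The reason string is arbitrary, and the chaos tooling may well trigger the rollout differently.

    // force_rollout.go: patch kubeapiservers/cluster with a new
    // forceRedeploymentReason to trigger a static pod rollout.
    package main

    import (
    	"context"
    	"fmt"
    	"time"

    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/apimachinery/pkg/runtime/schema"
    	"k8s.io/apimachinery/pkg/types"
    	"k8s.io/client-go/dynamic"
    	"k8s.io/client-go/tools/clientcmd"
    )

    func main() {
    	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    	if err != nil {
    		panic(err)
    	}
    	client := dynamic.NewForConfigOrDie(cfg)

    	// KubeAPIServer is a cluster-scoped operator resource named "cluster".
    	gvr := schema.GroupVersionResource{
    		Group: "operator.openshift.io", Version: "v1", Resource: "kubeapiservers",
    	}

    	// Any change to forceRedeploymentReason triggers a new rollout.
    	patch := []byte(fmt.Sprintf(
    		`{"spec":{"forceRedeploymentReason":"chaos-%d"}}`, time.Now().Unix()))
    	if _, err := client.Resource(gvr).Patch(context.TODO(), "cluster",
    		types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
    		panic(err)
    	}
    	fmt.Println("kube-apiserver rollout triggered")
    }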

Actual results:
The monitoring operator is in a degraded state for ~64 seconds during the API server downtime.

Expected results:
Monitoring cluster operator handles the API downtime gracefully.

Comment 1 W. Trevor King 2021-06-29 20:40:54 UTC
Possibly a dup of bug 1949840?  I'm not sure how broad a change bug 1949840 is aiming for.

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=96h&type=junit&search=clusteroperator/monitoring+should+not+change+condition' | grep 'single-node.*failures match' | grep -v 'pull-ci-\|rehearse-' | sort
periodic-ci-openshift-release-master-ci-4.8-e2e-aws-upgrade-single-node (all) - 4 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.8-e2e-azure-upgrade-single-node (all) - 4 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-aws-upgrade-single-node (all) - 3 runs, 100% failed, 67% of failures match = 67% impact
periodic-ci-openshift-release-master-ci-4.9-e2e-azure-upgrade-single-node (all) - 4 runs, 75% failed, 133% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-single-node (all) - 4 runs, 100% failed, 75% of failures match = 75% impact
periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-single-node (all) - 4 runs, 100% failed, 75% of failures match = 75% impact

Comment 2 Jan Fajerski 2021-06-30 07:45:02 UTC
Yes, I think it's the same issue at the core. The fix for https://bugzilla.redhat.com/show_bug.cgi?id=1949840 should improve the SNO situation quite a bit, though I think there is still room for improvement. Let's keep this open for now to track the impact on SNO.

Comment 3 Jan Fajerski 2021-07-12 07:43:46 UTC
The fix for the related bug was merged last week (PR https://github.com/openshift/cluster-monitoring-operator/pull/1193).

I'd be interested to know if and how this improves the situation for SNO.

Comment 4 Naga Ravi Chaitanya Elluri 2021-07-24 23:01:32 UTC
Jan, we no longer see the monitoring operator in a degraded state during the API rollout on Single Node OpenShift; in the latest builds the rollout has been tuned to last around 60 seconds.

Comment 5 Simon Pasquier 2021-07-26 06:34:55 UTC

*** This bug has been marked as a duplicate of bug 1949840 ***

