Description of problem:
On OSD we recently upgraded clusters to 4.4.11 and are now seeing cases where the prometheus-k8s-rulefiles-0 ConfigMap in openshift-monitoring is recreated very frequently. This can leave prometheus without any rules in the cluster, which raises alerts. Those alerts are only "warning" severity, so I think there is also an alert gap to close, since misconfigured monitoring is a critical issue for operations teams.
Version-Release number of selected component (if applicable):
4.4.11
How reproducible:
Infrequent.
Steps to Reproduce:
1. Upgrade cluster to 4.4.11 from 4.3.25
2. Wait..
Actual results:
prometheus-k8s-rulefiles-0 is recreated very frequently if this problem happens.
$ oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -w
NAME                         DATA   AGE
prometheus-k8s-rulefiles-0   40     39s
prometheus-k8s-rulefiles-0   40     7m12s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     3m37s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     1s
Expected results:
prometheus-k8s-rulefiles-0 is not recreated unless needed.
Additional info:
I will attach a must-gather.
SRE steps for OSD to remediate:
# scale down prometheus operator
oc -n openshift-monitoring scale deployment.apps/prometheus-operator --replicas=0
# give time for pod to terminate
sleep 10
# scale up prometheus operator
oc -n openshift-monitoring scale deployment.apps/prometheus-operator --replicas=1
# restart prometheus by scaling it down so it picks up the CM (the operator will immediately scale it back up, so scaling up manually isn't necessary)
oc -n openshift-monitoring scale statefulset.apps/prometheus-k8s --replicas=0
# watch to see if the CM is being recreated for a few minutes
oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -w
(captured in https://github.com/openshift/ops-sop/blob/master/v4/alerts/PrometheusNotConnectedToAlertmanagers.md#troubleshooting)
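The remediation steps above could be bundled into a small helper script. This is only a sketch: it assumes `oc` is on PATH and the current context is logged in to the affected cluster with sufficient privileges.

```shell
#!/usr/bin/env bash
# Sketch of the SRE remediation steps above (assumes `oc` is installed and
# the current context points at the affected cluster).
set -euo pipefail

NS=openshift-monitoring

restart_prometheus_operator() {
  # Scale down the prometheus operator.
  oc -n "$NS" scale deployment.apps/prometheus-operator --replicas=0
  # Give the pod time to terminate.
  sleep 10
  # Scale the operator back up.
  oc -n "$NS" scale deployment.apps/prometheus-operator --replicas=1
  # Scale prometheus down so it picks up the CM on restart; the operator
  # will immediately scale it back up.
  oc -n "$NS" scale statefulset.apps/prometheus-k8s --replicas=0
}

watch_rulefiles() {
  # Watch for a few minutes to see whether the CM keeps being recreated.
  oc -n "$NS" get cm prometheus-k8s-rulefiles-0 -w
}
```

Run `restart_prometheus_operator`, then `watch_rulefiles`, mirroring the step order above.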
Can you provide the count of PrometheusRule objects in your cluster, as well as the number of ServiceMonitors and PodMonitors? If I remember correctly, your stack is modified to include untested, custom scraping of metrics and custom alerting? This might result in the incorrect behaviour.
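One way to gather those counts, as a sketch (assumes `oc` access to the cluster; wrapped in a function so nothing runs until invoked):

```shell
# Sketch: count the monitoring custom resources the operator reconciles
# (assumes `oc` is installed and logged in to the cluster).
count_monitoring_resources() {
  for kind in prometheusrules servicemonitors podmonitors; do
    # --no-headers keeps wc -l from counting the header row.
    printf '%s: %s\n' "$kind" \
      "$(oc get "$kind" --all-namespaces --no-headers 2>/dev/null | wc -l)"
  done
}
```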
To clarify: this is not happening on every cluster after upgrade; it seems to require specific conditions, which are not entirely clear yet.
We can provide a long lived OSD cluster to test and verify this if that is needed and helpful.