Bug 1861543 - Prometheus rulesfiles CM recreated very frequently on 4.4.11
Summary: Prometheus rulesfiles CM recreated very frequently on 4.4.11
Keywords:
Status: CLOSED DUPLICATE of bug 1845561
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.6.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-28 22:43 UTC by Naveen Malik
Modified: 2021-05-10 02:24 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-11 12:43:46 UTC
Target Upstream Version:
Embargoed:
Flags: spasquie: needinfo?


Links
GitHub prometheus-operator/prometheus-operator pull 3457 (status: closed): *: use a single reloader for Prometheus (last updated 2021-02-17 11:29:08 UTC)

Description Naveen Malik 2020-07-28 22:43:08 UTC
Description of problem:
On OSD we recently upgraded clusters to 4.4.11 and are now seeing cases of the prometheus-k8s-rulefiles-0 CM in openshift-monitoring being recreated very frequently. It can result in Prometheus having no rules loaded in the cluster, which raises alerts. These alerts are only "warning" severity, so I think there is also an alert gap to close, since misconfigured monitoring is a critical issue for operations teams.
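
To confirm whether Prometheus actually lost its rules while the CM churns, one quick check (a sketch, assuming the default prometheus-k8s-0 pod name and that the prometheus container image provides ls) is to list the rule files mounted from the CM inside the pod:

$ oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- ls /etc/prometheus/rules/prometheus-k8s-rulefiles-0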

Version-Release number of selected component (if applicable):
4.4.11


How reproducible:
Infrequent.

Steps to Reproduce:
1. Upgrade cluster to 4.4.11 from 4.3.25
2. Wait..

Actual results:
prometheus-k8s-rulefiles-0 is recreated very frequently if this problem happens.

$ oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -w
NAME                         DATA   AGE
prometheus-k8s-rulefiles-0   40     39s
prometheus-k8s-rulefiles-0   40     7m12s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     3m37s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     1s
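
The AGE column resetting to 1s in the watch above already implies a delete/recreate rather than an in-place update; a minimal cross-check (standard oc jsonpath output, nothing cluster-specific assumed) is to poll the creationTimestamp, which only changes when the object is recreated:

$ oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -o jsonpath='{.metadata.creationTimestamp}{"\n"}'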


Expected results:
prometheus-k8s-rulefiles-0 is not recreated unless needed.

Additional info:

I will attach a must-gather.

SRE steps for OSD to remediate:

# scale down prometheus operator
oc -n openshift-monitoring scale deployment.apps/prometheus-operator --replicas=0

# give time for pod to terminate
sleep 10 

# scale up prometheus operator
oc -n openshift-monitoring scale deployment.apps/prometheus-operator --replicas=1

# restart prometheus by scaling it down so it picks up the CM (the operator will immediately scale it back up, so scaling up manually isn't necessary)
oc -n openshift-monitoring scale statefulset.apps/prometheus-k8s --replicas=0

# watch to see if the CM is being recreated for a few minutes
oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -w
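
# optionally confirm the operator scaled prometheus-k8s back up (a hedged extra check; assumes the default app=prometheus pod label on this release)
oc -n openshift-monitoring get pods -l app=prometheus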


(captured in https://github.com/openshift/ops-sop/blob/master/v4/alerts/PrometheusNotConnectedToAlertmanagers.md#troubleshooting)

Comment 3 Lili Cosic 2020-07-29 07:44:01 UTC
Can you provide the count of PrometheusRule objects in your cluster, and the number of ServiceMonitors and PodMonitors? If I remember correctly, your stack is modified to include untested, custom scraping of metrics and custom alerting? This might result in the incorrect behaviour.
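
A sketch for gathering those counts (plain oc list operations over the standard monitoring.coreos.com kinds):

$ oc get prometheusrules --all-namespaces --no-headers | wc -l
$ oc get servicemonitors --all-namespaces --no-headers | wc -l
$ oc get podmonitors --all-namespaces --no-headers | wc -l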

Comment 7 Rick Rackow 2020-07-29 14:33:44 UTC
To clarify: this is not happening on all clusters after the upgrade; it seems to depend on specific conditions, which are not entirely clear yet.
We can provide a long-lived OSD cluster to test and verify this if that is needed and helpful.

Comment 8 Naveen Malik 2020-07-29 22:03:56 UTC
Rick provided the requested info, clearing needinfo.

Comment 9 Lili Cosic 2020-07-31 11:16:18 UTC
Too many higher-priority 4.6 release-blocking bugzillas to have time to look into this one; moving to next sprint.

Comment 22 Sergiusz Urbaniak 2020-08-31 13:47:15 UTC
Lowering severity to low as there seems to be no indicator of a software defect as of today.

