Description of problem:
On OSD we recently upgraded clusters to 4.4.11 and are now seeing cases where the prometheus-k8s-rulefiles-0 ConfigMap in openshift-monitoring is recreated very frequently. This can leave prometheus without any rules in the cluster, which raises alerts. Those alerts are only "warning" severity, so I think there is also an alert gap to close, since misconfigured monitoring is a critical issue for operations teams.
Version-Release number of selected component (if applicable):
4.4.11
How reproducible:
Infrequent.
Steps to Reproduce:
1. Upgrade cluster to 4.4.11 from 4.3.25
2. Wait..
Actual results:
prometheus-k8s-rulefiles-0 is recreated very frequently if this problem happens.
$ oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -w
NAME                         DATA   AGE
prometheus-k8s-rulefiles-0   40     39s
prometheus-k8s-rulefiles-0   40     7m12s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     3m37s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     1s
Expected results:
prometheus-k8s-rulefiles-0 is not recreated unless needed.
Additional info:
I will attach a must-gather.
SRE steps for OSD to remediate:
# scale down prometheus operator
oc -n openshift-monitoring scale deployment.apps/prometheus-operator --replicas=0
# give time for pod to terminate
sleep 10
# scale up prometheus operator
oc -n openshift-monitoring scale deployment.apps/prometheus-operator --replicas=1
# restart prometheus by scaling it down so it picks up the CM (the operator will immediately scale it back up, so scaling up manually isn't necessary)
oc -n openshift-monitoring scale statefulset.apps/prometheus-k8s --replicas=0
# watch to see if the CM is being recreated for a few minutes
oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -w
(captured in https://github.com/openshift/ops-sop/blob/master/v4/alerts/PrometheusNotConnectedToAlertmanagers.md#troubleshooting)
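The remediation steps above could be bundled into a small helper script. This is only a sketch: it assumes `oc` is on PATH and the current context is logged in to the affected cluster with sufficient privileges.

```shell
#!/usr/bin/env bash
# Sketch of the SRE remediation steps above (assumes `oc` is installed and
# the current context points at the affected cluster).
set -euo pipefail

NS=openshift-monitoring

restart_prometheus_operator() {
  # Scale down the prometheus operator.
  oc -n "$NS" scale deployment.apps/prometheus-operator --replicas=0
  # Give the pod time to terminate.
  sleep 10
  # Scale the operator back up.
  oc -n "$NS" scale deployment.apps/prometheus-operator --replicas=1
  # Scale prometheus down so it picks up the CM on restart; the operator
  # will immediately scale it back up.
  oc -n "$NS" scale statefulset.apps/prometheus-k8s --replicas=0
}

watch_rulefiles() {
  # Watch for a few minutes to see whether the CM keeps being recreated.
  oc -n "$NS" get cm prometheus-k8s-rulefiles-0 -w
}
```

Run `restart_prometheus_operator`, then `watch_rulefiles`, mirroring the step order above.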
Can you provide the count of PrometheusRule objects in your cluster, as well as the number of ServiceMonitors and PodMonitors? If I remember correctly, your stack is modified to include untested, custom scraping of metrics and custom alerting? This might result in the incorrect behaviour.
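One way to gather those counts, as a sketch (assumes `oc` access to the cluster; wrapped in a function so nothing runs until invoked):

```shell
# Sketch: count the monitoring custom resources the operator reconciles
# (assumes `oc` is installed and logged in to the cluster).
count_monitoring_resources() {
  for kind in prometheusrules servicemonitors podmonitors; do
    # --no-headers keeps wc -l from counting the header row.
    printf '%s: %s\n' "$kind" \
      "$(oc get "$kind" --all-namespaces --no-headers 2>/dev/null | wc -l)"
  done
}
```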
To clarify: this is not happening on every cluster after upgrade; it seems to require specific conditions, which are not entirely clear yet.
We can provide a long lived OSD cluster to test and verify this if that is needed and helpful.