Description of problem:

On OSD we recently upgraded clusters to 4.4.11 and are now seeing cases of the prometheus-k8s-rulefiles-0 ConfigMap in openshift-monitoring being recreated very frequently. It can result in Prometheus not having any rules in the cluster, which raises alerts. These alerts are only "warning" severity, so I think there is also an alert gap to close, since misconfigured monitoring is a critical issue for operations teams.

Version-Release number of selected component (if applicable):

4.4.11

How reproducible:

Infrequent.

Steps to Reproduce:
1. Upgrade cluster to 4.4.11 from 4.3.25
2. Wait.

Actual results:

prometheus-k8s-rulefiles-0 is recreated very frequently if this problem happens.

$ oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -w
NAME                         DATA   AGE
prometheus-k8s-rulefiles-0   40     39s
prometheus-k8s-rulefiles-0   40     7m12s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     3m37s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     1s
prometheus-k8s-rulefiles-0   40     1s

Expected results:

prometheus-k8s-rulefiles-0 is not recreated unless needed.

Additional info:

I will attach a must-gather.

SRE steps for OSD to remediate:

# scale down prometheus operator
oc -n openshift-monitoring scale deployment.apps/prometheus-operator --replicas=0

# give the pod time to terminate
sleep 10

# scale up prometheus operator
oc -n openshift-monitoring scale deployment.apps/prometheus-operator --replicas=1

# restart prometheus by scaling it down so it picks up the ConfigMap (the operator will immediately scale it back up, so scaling it back up manually isn't necessary)
oc -n openshift-monitoring scale statefulset.apps/prometheus-k8s --replicas=0

# watch for a few minutes to see if the ConfigMap is being recreated
oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -w

(Captured in https://github.com/openshift/ops-sop/blob/master/v4/alerts/PrometheusNotConnectedToAlertmanagers.md#troubleshooting)
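Not part of the original report, but the "watch for a few minutes" step could be automated with something like the rough sketch below. It assumes GNU date on the workstation, oc access to openshift-monitoring, and the 60-second "recently recreated" threshold and 30-second poll interval are arbitrary choices.

# rough sketch: poll the ConfigMap's creationTimestamp and log whenever it is
# younger than 60s, i.e. it appears to have just been recreated
while true; do
  created=$(oc -n openshift-monitoring get cm prometheus-k8s-rulefiles-0 -o jsonpath='{.metadata.creationTimestamp}')
  age=$(( $(date +%s) - $(date -d "$created" +%s) ))
  if [ "$age" -lt 60 ]; then
    echo "$(date -Is) prometheus-k8s-rulefiles-0 was recreated ${age}s ago"
  fi
  sleep 30
done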
Can you provide the count of PrometheusRules in your cluster, as well as the number of ServiceMonitors and PodMonitors? If I remember correctly, your stack is modified to include untested, custom scraping of metrics and custom alerting? This might result in the incorrect behaviour.
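For reference (not part of the original comment), the requested counts could be gathered cluster-wide with something along these lines, assuming the monitoring.coreos.com CRDs are installed and the user can list them across namespaces:

# count PrometheusRules, ServiceMonitors and PodMonitors across all namespaces
oc get prometheusrules --all-namespaces --no-headers | wc -l
oc get servicemonitors --all-namespaces --no-headers | wc -l
oc get podmonitors --all-namespaces --no-headers | wc -l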
To clarify: this is not happening on all clusters after the upgrade; it seems to depend on specific conditions, which are not entirely clear yet. We can provide a long-lived OSD cluster to test and verify this if that is needed and helpful.
Rick provided the requested info, clearing needinfo.
Too many higher-priority 4.6 release-blocking Bugzillas to have time to look into this one; moving to next sprint.
Lowering severity to low, as there seems to be no indicator of a defect in the software as of today.