Description of problem:
A large number of ingestion errors sporadically fire the PrometheusDuplicateTimestamps alert.

Version-Release number of selected component (if applicable):
4.8.0-fc.0 <- strictly

How reproducible:
Appears on all clusters of this version.

Steps to Reproduce:
1. Deploy a cluster of the 4.8.0-fc.0 version with user-workload monitoring enabled
2. Review the openshift-monitoring/prometheus logs for ingestion errors

Actual results:
```
level=warn ts=2021-04-23T02:29:32.351Z caller=scrape.go:1375 component="scrape manager" scrape_pool=openshift-user-workload-monitoring/prometheus-user-workload/0 target=https://10.129.4.15:9091/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=7
level=warn ts=2021-04-23T02:33:02.343Z caller=scrape.go:1375 component="scrape manager" scrape_pool=openshift-user-workload-monitoring/prometheus-user-workload/0 target=https://10.129.4.15:9091/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=6
level=warn ts=2021-04-23T02:41:09.605Z caller=scrape.go:1375 component="scrape manager" scrape_pool=openshift-monitoring/prometheus-k8s/0 target=https://10.129.2.10:9091/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=12
level=warn ts=2021-04-23T02:51:03.446Z caller=scrape.go:1375 component="scrape manager" scrape_pool=openshift-user-workload-monitoring/prometheus-user-workload/0 target=https://10.130.6.24:9091/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=7
```

```
$ oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | grep 'Error on' | awk '{print $6}' | sort | uniq -c
  47 scrape_pool=openshift-monitoring/prometheus-k8s/0
  43 scrape_pool=openshift-monitoring/prometheus/0
 150 scrape_pool=openshift-user-workload-monitoring/prometheus-user-workload/0
 155 scrape_pool=openshift-user-workload-monitoring/prometheus/0
```

Expected results:
Nil/low ingestion errors

Additional info:
I ran this query on two 4.8.0-fc.0 clusters. Both showed counts of errors.
```
$ oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | grep 'Error on' | awk '{print $6}' | sort | uniq -c
  47 scrape_pool=openshift-monitoring/prometheus-k8s/0
  43 scrape_pool=openshift-monitoring/prometheus/0
 150 scrape_pool=openshift-user-workload-monitoring/prometheus-user-workload/0
 155 scrape_pool=openshift-user-workload-monitoring/prometheus/0
```

```
$ oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | grep 'Error on' | awk '{print $6}' | sort | uniq -c
  88 scrape_pool=openshift-monitoring/prometheus-k8s/0
 121 scrape_pool=openshift-monitoring/prometheus/0
 270 scrape_pool=openshift-user-workload-monitoring/prometheus-user-workload/0
 259 scrape_pool=openshift-user-workload-monitoring/prometheus/0
```

The same query on 4.7/4.6 clusters returns nothing. Maybe something has changed with relabeling between the monitoring and user-workload monitoring configs?
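The counting pipeline above (note the corrected awk quoting, `'{print $6}'` rather than `'{print $6'}`) can be exercised locally against a sample log line instead of live pod logs; the duplicated sample line below is just a stand-in for real `oc logs` output:

```shell
# One of the warning lines from the report, used as test input.
log='level=warn ts=2021-04-23T02:29:32.351Z caller=scrape.go:1375 component="scrape manager" scrape_pool=openshift-user-workload-monitoring/prometheus-user-workload/0 target=https://10.129.4.15:9091/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=7'

# Count warnings per scrape pool. Field 6 is scrape_pool=... because
# component="scrape manager" splits into two whitespace-separated fields.
printf '%s\n' "$log" "$log" \
  | grep 'Error on' \
  | awk '{print $6}' \
  | sort \
  | uniq -c
```

This prints a count of 2 for the `openshift-user-workload-monitoring/prometheus-user-workload/0` scrape pool, mirroring the per-pool counts in the report.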
Created attachment 1774665 [details] debug prom log
Created attachment 1774666 [details] debug prom log
This sounds like a relabeling issue at first, but since it concerns Prometheus health metrics and we didn't change anything about them in 4.8, there might be something else going on. Perhaps there are 2 ServiceMonitors for the same Prometheus instance, or a second Prometheus that is monitored by the platform one. I tried launching a cluster on 4.8.0-fc.0 with UWM enabled and couldn't reproduce the issue, so it might be specific to the OSD clusters. Would you mind sharing a must-gather from one of these clusters?
Thank you all for the information you provided. The root cause seems to be that the old ServiceMonitor isn't deleted during the 4.8 upgrade because it was renamed in 4.8 from `prometheus` to `prometheus-k8s`. We'll work on a fix to ensure that the old ServiceMonitor is deleted properly during the upgrade, but in the meantime you can resolve the issue by deleting the `prometheus` ServiceMonitor manually. I'm also increasing the severity/priority of this bug since it impacts every cluster upgraded to 4.8.
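A minimal sketch of the manual workaround described above, assuming cluster-admin access and an `oc` session logged in to the affected cluster:

```shell
# Remove the stale ServiceMonitor left behind by the pre-4.8 name;
# the renamed prometheus-k8s ServiceMonitor continues to scrape Prometheus itself.
oc -n openshift-monitoring delete servicemonitor prometheus

# Confirm that only the new name remains.
oc -n openshift-monitoring get servicemonitor prometheus-k8s
```

Once the duplicate ServiceMonitor is gone, each Prometheus target is scraped only once, so the "same timestamp" ingestion warnings should stop and the PrometheusDuplicateTimestamps alert should resolve.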
Hit the same issue in PD Incident #257980; the alert resolved after scaling statefulset.apps/prometheus-k8s to --replicas=0 and back to --replicas=2. A permanent fix would be greatly appreciated.
Upgraded from 4.7.10 to 4.8.0-0.nightly-2021-05-10-225140: no "Error on ingesting samples with different value but same timestamp" logs in Prometheus. The `prometheus` ServiceMonitor is removed after the upgrade to 4.8.0-0.nightly-2021-05-10-225140, and the new ServiceMonitor name is `prometheus-k8s`. The `prometheus` ServiceMonitor still exists in the openshift-user-workload-monitoring project, which is tracked in bug 1959278.

Before upgrade:
```
# oc -n openshift-monitoring get servicemonitor
NAME                          AGE
alertmanager                  12m
cluster-monitoring-operator   23m
etcd                          12m
grafana                       12m
kube-state-metrics            12m
kubelet                       12m
node-exporter                 22m
openshift-state-metrics       12m
prometheus                    12m
prometheus-adapter            12m
prometheus-operator           22m
telemeter-client              13m
thanos-querier                12m
thanos-sidecar                12m

# oc -n openshift-user-workload-monitoring get servicemonitor
NAME                  AGE
prometheus            9m2s
prometheus-operator   9m18s
thanos-sidecar        9m2s
```

After upgrade to 4.8.0-0.nightly-2021-05-10-225140:
```
# oc -n openshift-monitoring get servicemonitor
NAME                          AGE
alertmanager                  101m
cluster-monitoring-operator   112m
etcd                          101m
grafana                       101m
kube-state-metrics            101m
kubelet                       101m
node-exporter                 111m
openshift-state-metrics       101m
prometheus-adapter            101m
prometheus-k8s                40m
prometheus-operator           111m
telemeter-client              102m
thanos-querier                101m
thanos-sidecar                101m

# oc -n openshift-user-workload-monitoring get servicemonitor
NAME                       AGE
prometheus                 89m
prometheus-operator        89m
prometheus-user-workload   47m
thanos-ruler               41m
thanos-sidecar             89m

# oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | grep "Error on"
no result
# oc -n openshift-monitoring logs prometheus-k8s-1 -c prometheus | grep "Error on"
no result
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438