Bug 1952744

| Summary: | PrometheusDuplicateTimestamps with user workload monitoring enabled | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | dofinn |
| Component: | Monitoring | Assignee: | Brad Ison <brad.ison> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Priority: | medium |
| Version: | 4.8 | Target Release: | 4.8.0 |
| Target Milestone: | --- | Keywords: | ServiceDeliveryBlocker |
| Hardware: | Unspecified | OS: | Unspecified |
| Doc Type: | No Doc Update | Type: | Bug |
| Last Closed: | 2021-07-27 23:03:27 UTC | | |
| CC: | aabhishe, alegrand, anpicker, dgrisonn, erooth, gsleeman, hongyli, kakkoyun, lcosic, pbergene, pkrupa, spasquie, todabasi, travi | | |
Description
dofinn
2021-04-23 03:14:26 UTC
Created attachment 1774665: debug prom log
Created attachment 1774666: debug prom log
This sounds like a relabeling issue at first, but since it concerns Prometheus health metrics and we didn't change anything about them in 4.8, there might be something else going on. Maybe there are 2 ServiceMonitors for the same Prometheus instance, or a second Prometheus that is monitored by the platform one. I tried launching a cluster on 4.8.0-fc.0 with UWM enabled and couldn't reproduce the issue, so it might be specific to the OSD clusters. Would you mind sharing a must-gather of one of these clusters?

Thank you all for all the information you provided. The root cause seems to be that the old ServiceMonitor isn't being deleted during the 4.8 upgrade because of a name change in 4.8 from `prometheus` to `prometheus-k8s`. We'll work on a fix to ensure that the old ServiceMonitor is deleted properly during the upgrade, but in the meantime you can resolve the issue by deleting the `prometheus` ServiceMonitor manually. I'm also increasing the severity/priority of this bug since it is impacting every cluster upgraded to 4.8.

Had the same issue in PD Incident #257980; the alert resolved after scaling statefulset.apps/prometheus-k8s to --replicas=0 and back to --replicas=2. A permanent fix would be greatly appreciated.

Upgraded from 4.7.10 to 4.8.0-0.nightly-2021-05-10-225140: there are no "Error on ingesting samples with different value but same timestamp" logs in Prometheus, the `prometheus` ServiceMonitor is removed after the upgrade, and the new ServiceMonitor name is `prometheus-k8s`. However, the `prometheus` ServiceMonitor still exists in the openshift-user-workload-monitoring project, which is tracked in bug 1959278.

Before upgrade:

```
# oc -n openshift-monitoring get servicemonitor
NAME                          AGE
alertmanager                  12m
cluster-monitoring-operator   23m
etcd                          12m
grafana                       12m
kube-state-metrics            12m
kubelet                       12m
node-exporter                 22m
openshift-state-metrics       12m
prometheus                    12m
prometheus-adapter            12m
prometheus-operator           22m
telemeter-client              13m
thanos-querier                12m
thanos-sidecar                12m

# oc -n openshift-user-workload-monitoring get servicemonitor
NAME                  AGE
prometheus            9m2s
prometheus-operator   9m18s
thanos-sidecar        9m2s
```

After upgrade to 4.8.0-0.nightly-2021-05-10-225140:

```
# oc -n openshift-monitoring get servicemonitor
NAME                          AGE
alertmanager                  101m
cluster-monitoring-operator   112m
etcd                          101m
grafana                       101m
kube-state-metrics            101m
kubelet                       101m
node-exporter                 111m
openshift-state-metrics       101m
prometheus-adapter            101m
prometheus-k8s                40m
prometheus-operator           111m
telemeter-client              102m
thanos-querier                101m
thanos-sidecar                101m

# oc -n openshift-user-workload-monitoring get servicemonitor
NAME                       AGE
prometheus                 89m
prometheus-operator        89m
prometheus-user-workload   47m
thanos-ruler               41m
thanos-sidecar             89m
```

```
# oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | grep "Error on"
no result

# oc -n openshift-monitoring logs prometheus-k8s-1 -c prometheus | grep "Error on"
no result
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
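
For reference, a minimal sketch of the manual workaround mentioned earlier in this thread (deleting the stale `prometheus` ServiceMonitor left behind by the 4.7 to 4.8 name change). It assumes cluster-admin access and the default `openshift-monitoring` namespace:

```
# Confirm that the old and the new ServiceMonitor exist side by side.
oc -n openshift-monitoring get servicemonitor prometheus prometheus-k8s

# Delete the stale ServiceMonitor; the Prometheus Operator then regenerates the
# scrape configuration and the duplicate scrapes of the same targets stop.
oc -n openshift-monitoring delete servicemonitor prometheus
```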
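
The temporary mitigation reported in the PD incident comment can be sketched as below. Note this is only what the commenter did to clear the alert; it does not remove the stale ServiceMonitor, so it is not a permanent fix:

```
# Scale the platform Prometheus statefulset down and back up to restart both replicas.
oc -n openshift-monitoring scale statefulset prometheus-k8s --replicas=0
oc -n openshift-monitoring scale statefulset prometheus-k8s --replicas=2
```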