Bug 1952744 - PrometheusDuplicateTimestamps with user workload monitoring enabled

Product: OpenShift Container Platform
Component: Monitoring
Version: 4.8
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Reporter: dofinn
Assignee: Brad Ison <brad.ison>
QA Contact: Junqi Zhao <juzhao>
CC: aabhishe, alegrand, anpicker, dgrisonn, erooth, gsleeman, hongyli, kakkoyun, lcosic, pbergene, pkrupa, spasquie, todabasi, travi
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Keywords: ServiceDeliveryBlocker
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-07-27 23:03:27 UTC

Attachments: debug prom log, debug prom log
Description dofinn 2021-04-23 03:14:26 UTC
Description of problem:
Large amount of ingesting errors sporadically generating the alert of PrometheusDuplicateTimestamps
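As a cross-check, the drops can also be seen via the counter that the PrometheusDuplicateTimestamps alert is normally built on upstream (a minimal PromQL sketch, assuming the standard prometheus_target_scrapes_sample_duplicate_timestamp_total metric):

```
# Per-target rate of samples dropped for duplicate timestamps (confirmation query, not part of the report)
rate(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
```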

Version-Release number of selected component (if applicable):
4.8.0-fc.0 (strictly this version)

How reproducible:
Appears on all clusters running this version.


Steps to Reproduce:
1. Deploy a cluster at 4.8.0-fc.0 with user-workload monitoring enabled (see the example ConfigMap after these steps)
2. Review the openshift-monitoring Prometheus logs for ingestion errors
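For step 1, a minimal sketch of enabling user-workload monitoring via the documented cluster-monitoring-config mechanism (apply with `oc apply -f`):

```
# ConfigMap that turns on user-workload monitoring for the cluster
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
```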

Actual results:

```
level=warn ts=2021-04-23T02:29:32.351Z caller=scrape.go:1375 component="scrape manager" scrape_pool=openshift-user-workload-monitoring/prometheus-user-workload/0 target=https://10.129.4.15:9091/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=7
level=warn ts=2021-04-23T02:33:02.343Z caller=scrape.go:1375 component="scrape manager" scrape_pool=openshift-user-workload-monitoring/prometheus-user-workload/0 target=https://10.129.4.15:9091/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=6
level=warn ts=2021-04-23T02:41:09.605Z caller=scrape.go:1375 component="scrape manager" scrape_pool=openshift-monitoring/prometheus-k8s/0 target=https://10.129.2.10:9091/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=12
level=warn ts=2021-04-23T02:51:03.446Z caller=scrape.go:1375 component="scrape manager" scrape_pool=openshift-user-workload-monitoring/prometheus-user-workload/0 target=https://10.130.6.24:9091/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=7
```

```
$ oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | grep 'Error on' | awk '{print $6}' | sort | uniq -c
     47 scrape_pool=openshift-monitoring/prometheus-k8s/0
     43 scrape_pool=openshift-monitoring/prometheus/0
    150 scrape_pool=openshift-user-workload-monitoring/prometheus-user-workload/0
    155 scrape_pool=openshift-user-workload-monitoring/prometheus/0
```


Expected results:

No or very few ingestion errors


Additional info:

I ran this query on two 4.8.0-fc.0 clusters. Both showed counts of errors.

```
oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | grep 'Error on' | awk '{print $6}' | sort | uniq -c
     47 scrape_pool=openshift-monitoring/prometheus-k8s/0
     43 scrape_pool=openshift-monitoring/prometheus/0
    150 scrape_pool=openshift-user-workload-monitoring/prometheus-user-workload/0
    155 scrape_pool=openshift-user-workload-monitoring/prometheus/0
```

```
oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | grep 'Error on' | awk '{print $6}' | sort | uniq -c
     88 scrape_pool=openshift-monitoring/prometheus-k8s/0
    121 scrape_pool=openshift-monitoring/prometheus/0
    270 scrape_pool=openshift-user-workload-monitoring/prometheus-user-workload/0
    259 scrape_pool=openshift-user-workload-monitoring/prometheus/0
```

The same query on 4.7/4.6 clusters returns no results.

Maybe something has changed with relabeling in the monitoring / user-workload monitoring configs?

Comment 1 dofinn 2021-04-23 04:12:07 UTC
Created attachment 1774665 [details]
debug prom log

Comment 2 dofinn 2021-04-23 04:12:18 UTC
Created attachment 1774666 [details]
debug prom log

Comment 4 Damien Grisonnet 2021-04-28 10:44:27 UTC
This sounds like a relabeling issue at first, but since it concerns Prometheus health metrics and we didn't change anything there in 4.8, there might be something else going on. It could be something like two ServiceMonitors for the same Prometheus instance, or a second Prometheus being monitored by the platform one.
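One quick way to check the duplicate-ServiceMonitor hypothesis is to list the ServiceMonitors in both monitoring namespaces and look for two objects covering the same Prometheus pods (a diagnostic sketch, not a fix):

```
# Look for overlapping ServiceMonitors targeting the same Prometheus instance
oc -n openshift-monitoring get servicemonitor
oc -n openshift-user-workload-monitoring get servicemonitor
```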

I tried launching a cluster on 4.8.0-fc.0 with UWM enabled and couldn't reproduce the issue, so it might be specific to the OSD clusters.

Would you mind sharing a must-gather of one of these clusters?
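For reference, a must-gather can be collected along these lines (the destination directory name here is just an example):

```
# Collect diagnostic data from the affected cluster
oc adm must-gather --dest-dir=./must-gather-bz1952744
```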

Comment 8 Damien Grisonnet 2021-05-03 12:00:28 UTC
Thank you all for all the information you provided. The root cause seems to be that the old ServiceMonitor isn't being deleted during the 4.8 upgrade because of a name change in 4.8 from `prometheus` to `prometheus-k8s`.

We'll work on a fix to ensure that the old ServiceMonitor is deleted properly during the upgrade, but in the meantime you can resolve the issue by deleting the `prometheus` ServiceMonitor manually.
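The manual workaround would look roughly like:

```
# Delete the stale pre-4.8 ServiceMonitor that was not cleaned up by the upgrade
oc -n openshift-monitoring delete servicemonitor prometheus
```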

I'm also increasing the severity/priority of this bug since it is impacting every cluster upgraded to 4.8.

Comment 9 Tomas DabaĊĦinskas 2021-05-04 02:21:35 UTC
Had the same issue in PD Incident #257980; the alert resolved after scaling statefulset.apps/prometheus-k8s to --replicas=0 and back to --replicas=2. A permanent fix would be greatly appreciated.
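The scaling workaround described above would look roughly like:

```
# Scale the platform Prometheus StatefulSet down and back up to clear the alert
oc -n openshift-monitoring scale statefulset.apps/prometheus-k8s --replicas=0
oc -n openshift-monitoring scale statefulset.apps/prometheus-k8s --replicas=2
```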

Comment 14 Junqi Zhao 2021-05-11 08:00:12 UTC
Upgraded from 4.7.10 to 4.8.0-0.nightly-2021-05-10-225140: there are no "Error on ingesting samples with different value but same timestamp" logs in Prometheus, and the prometheus ServiceMonitor is removed after the upgrade (the new ServiceMonitor name is prometheus-k8s). However, the prometheus ServiceMonitor still exists in the openshift-user-workload-monitoring project, which is tracked in bug 1959278.
before upgrade
*********************************
# oc -n openshift-monitoring get servicemonitor
NAME                          AGE
alertmanager                  12m
cluster-monitoring-operator   23m
etcd                          12m
grafana                       12m
kube-state-metrics            12m
kubelet                       12m
node-exporter                 22m
openshift-state-metrics       12m
prometheus                    12m
prometheus-adapter            12m
prometheus-operator           22m
telemeter-client              13m
thanos-querier                12m
thanos-sidecar                12m

# oc -n openshift-user-workload-monitoring get servicemonitor
NAME                  AGE
prometheus            9m2s
prometheus-operator   9m18s
thanos-sidecar        9m2s
*********************************

upgrade to 4.8.0-0.nightly-2021-05-10-225140
*********************************
# oc -n openshift-monitoring get servicemonitor
NAME                          AGE
alertmanager                  101m
cluster-monitoring-operator   112m
etcd                          101m
grafana                       101m
kube-state-metrics            101m
kubelet                       101m
node-exporter                 111m
openshift-state-metrics       101m
prometheus-adapter            101m
prometheus-k8s                40m
prometheus-operator           111m
telemeter-client              102m
thanos-querier                101m
thanos-sidecar                101m


# oc -n openshift-user-workload-monitoring get servicemonitor
NAME                       AGE
prometheus                 89m
prometheus-operator        89m
prometheus-user-workload   47m
thanos-ruler               41m
thanos-sidecar             89m
*********************************
# oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | grep "Error on"
no result

# oc -n openshift-monitoring logs prometheus-k8s-1 -c prometheus | grep "Error on"
no result

Comment 17 errata-xmlrpc 2021-07-27 23:03:27 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438