Bug 1952744 - PrometheusDuplicateTimestamps with user workload monitoring enabled
Summary: PrometheusDuplicateTimestamps with user workload monitoring enabled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Brad Ison
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-04-23 03:14 UTC by dofinn
Modified: 2021-07-27 23:03 UTC
CC: 14 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:03:27 UTC
Target Upstream Version:
Embargoed:


Attachments
debug prom log (58.48 KB, text/plain), 2021-04-23 04:12 UTC, dofinn
debug prom log (58.48 KB, text/plain), 2021-04-23 04:12 UTC, dofinn


Links
Github openshift cluster-monitoring-operator pull 1146 (open): WIP: Bug 1952744: Remove obsolete prometheus service monitor (last updated 2021-05-04 13:47:27 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 23:03:53 UTC)

Description dofinn 2021-04-23 03:14:26 UTC
Description of problem:
A large number of sample ingestion errors sporadically trigger the PrometheusDuplicateTimestamps alert.

Version-Release number of selected component (if applicable):
4.8.0-fc.0 (this version specifically)

How reproducible:
Appears on all clusters running this version.


Steps to Reproduce:
1. Deploy a cluster at version 4.8.0-fc.0 with user-workload monitoring enabled.
2. Review the openshift-monitoring Prometheus logs for ingestion errors.

Actual results:

```
level=warn ts=2021-04-23T02:29:32.351Z caller=scrape.go:1375 component="scrape manager" scrape_pool=openshift-user-workload-monitoring/prometheus-user-workload/0 target=https://10.129.4.15:9091/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=7
level=warn ts=2021-04-23T02:33:02.343Z caller=scrape.go:1375 component="scrape manager" scrape_pool=openshift-user-workload-monitoring/prometheus-user-workload/0 target=https://10.129.4.15:9091/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=6
level=warn ts=2021-04-23T02:41:09.605Z caller=scrape.go:1375 component="scrape manager" scrape_pool=openshift-monitoring/prometheus-k8s/0 target=https://10.129.2.10:9091/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=12
level=warn ts=2021-04-23T02:51:03.446Z caller=scrape.go:1375 component="scrape manager" scrape_pool=openshift-user-workload-monitoring/prometheus-user-workload/0 target=https://10.130.6.24:9091/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=7
```
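For reference, the condition the alert fires on can also be inspected from the Prometheus or Thanos Querier UI via the counter that typically backs PrometheusDuplicateTimestamps (metric name assumed from the upstream prometheus mixin, not verified against this cluster):

```
# Non-zero rates indicate scrapes that dropped samples because of duplicate timestamps
rate(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
```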

```
$ oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | grep 'Error on' | awk '{print $6}' | sort | uniq -c
     47 scrape_pool=openshift-monitoring/prometheus-k8s/0
     43 scrape_pool=openshift-monitoring/prometheus/0
    150 scrape_pool=openshift-user-workload-monitoring/prometheus-user-workload/0
    155 scrape_pool=openshift-user-workload-monitoring/prometheus/0
```


Expected results:

No or very few ingestion errors.


Additional info:

I ran this query on two 4.8.0-fc.0 clusters. Both showed counts of errors.

```
oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | grep 'Error on' | awk '{print $6}' | sort | uniq -c
     47 scrape_pool=openshift-monitoring/prometheus-k8s/0
     43 scrape_pool=openshift-monitoring/prometheus/0
    150 scrape_pool=openshift-user-workload-monitoring/prometheus-user-workload/0
    155 scrape_pool=openshift-user-workload-monitoring/prometheus/0
```

```
oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | grep 'Error on' | awk '{print $6}' | sort | uniq -c
     88 scrape_pool=openshift-monitoring/prometheus-k8s/0
    121 scrape_pool=openshift-monitoring/prometheus/0
    270 scrape_pool=openshift-user-workload-monitoring/prometheus-user-workload/0
    259 scrape_pool=openshift-user-workload-monitoring/prometheus/0
```

The same query on 4.7/4.6 clusters returns no matches.

Maybe something has changed with relabeling between the platform monitoring and user-workload monitoring configs?

Comment 1 dofinn 2021-04-23 04:12:07 UTC
Created attachment 1774665 [details]
debug prom log

Comment 2 dofinn 2021-04-23 04:12:18 UTC
Created attachment 1774666 [details]
debug prom log

Comment 4 Damien Grisonnet 2021-04-28 10:44:27 UTC
This sounds like a relabeling issue at first, but since it concerns Prometheus health metrics and we didn't change anything there in 4.8, there might be something else going on. For example, there could be two ServiceMonitors for the same Prometheus instance, or a second Prometheus that is monitored by the platform one.

I tried launching a cluster on 4.8.0-fc.0 with UWM enabled and couldn't reproduce the issue, so it might be specific to the OSD clusters.

Would you mind sharing a must-gather from one of these clusters?
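In the meantime, a quick check for the duplicate-ServiceMonitor theory (sketch only, namespaces taken from the scrape pools in the logs above) is to list the ServiceMonitors in both monitoring namespaces and look for two objects selecting the same Prometheus service:

```
# Look for more than one ServiceMonitor pointing at the same Prometheus pods
oc -n openshift-monitoring get servicemonitor
oc -n openshift-user-workload-monitoring get servicemonitor
```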

Comment 8 Damien Grisonnet 2021-05-03 12:00:28 UTC
Thank you all for all the information you provided. The root cause seems to be that the old ServiceMonitor isn't being deleted during the 4.8 upgrade because of a name change in 4.8 from `prometheus` to `prometheus-k8s`.

We'll work on a fix to ensure that the old ServiceMonitor is deleted properly during the upgrade, but in the meantime you can resolve the issue by deleting the `prometheus` ServiceMonitor manually.
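A sketch of that manual workaround, assuming the leftover object is the `prometheus` ServiceMonitor in the openshift-monitoring namespace (as the listings in comment 14 later show):

```
# Delete the obsolete ServiceMonitor left behind by the pre-4.8 naming
oc -n openshift-monitoring delete servicemonitor prometheus
```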

I'm also increasing the severity/priority of this bug since it impacts every cluster upgraded to 4.8.

Comment 9 Tomas Dabašinskas 2021-05-04 02:21:35 UTC
Had the same issue in PD Incident #257980. The alert resolved after scaling statefulset.apps/prometheus-k8s to --replicas=0 and back to --replicas=2. A permanent fix would be greatly appreciated.
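For reference, that scale-down/scale-up sequence would look roughly like this (namespace assumed to be openshift-monitoring):

```
# Restart Prometheus by scaling the statefulset down and back up (temporary mitigation only)
oc -n openshift-monitoring scale statefulset.apps/prometheus-k8s --replicas=0
oc -n openshift-monitoring scale statefulset.apps/prometheus-k8s --replicas=2
```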

Comment 14 Junqi Zhao 2021-05-11 08:00:12 UTC
Upgraded from 4.7.10 to 4.8.0-0.nightly-2021-05-10-225140: there are no "Error on ingesting samples with different value but same timestamp" logs in Prometheus, the prometheus ServiceMonitor is removed after the upgrade, and the new ServiceMonitor name is prometheus-k8s. However, the prometheus ServiceMonitor still exists in the openshift-user-workload-monitoring project, which is tracked in bug 1959278.
before upgrade
*********************************
# oc -n openshift-monitoring get servicemonitor
NAME                          AGE
alertmanager                  12m
cluster-monitoring-operator   23m
etcd                          12m
grafana                       12m
kube-state-metrics            12m
kubelet                       12m
node-exporter                 22m
openshift-state-metrics       12m
prometheus                    12m
prometheus-adapter            12m
prometheus-operator           22m
telemeter-client              13m
thanos-querier                12m
thanos-sidecar                12m

# oc -n openshift-user-workload-monitoring get servicemonitor
NAME                  AGE
prometheus            9m2s
prometheus-operator   9m18s
thanos-sidecar        9m2s
*********************************

upgrade to 4.8.0-0.nightly-2021-05-10-225140
*********************************
# oc -n openshift-monitoring get servicemonitor
NAME                          AGE
alertmanager                  101m
cluster-monitoring-operator   112m
etcd                          101m
grafana                       101m
kube-state-metrics            101m
kubelet                       101m
node-exporter                 111m
openshift-state-metrics       101m
prometheus-adapter            101m
prometheus-k8s                40m
prometheus-operator           111m
telemeter-client              102m
thanos-querier                101m
thanos-sidecar                101m


# oc -n openshift-user-workload-monitoring get servicemonitor
NAME                       AGE
prometheus                 89m
prometheus-operator        89m
prometheus-user-workload   47m
thanos-ruler               41m
thanos-sidecar             89m
*********************************
# oc -n openshift-monitoring logs prometheus-k8s-0 -c prometheus | grep "Error on"
no result

# oc -n openshift-monitoring logs prometheus-k8s-1 -c prometheus | grep "Error on"
no result

Comment 17 errata-xmlrpc 2021-07-27 23:03:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

