Bug 1697295 - Prometheus shows different monitoring history with Grafana dashboard refresh
Summary: Prometheus shows different monitoring history with Grafana dashboard refresh
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.11.z
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-08 10:08 UTC by Robert Sandu
Modified: 2019-06-06 02:00 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-06 02:00:29 UTC
Target Upstream Version:
Embargoed:


Attachments:


Links
System ID: Red Hat Product Errata RHBA-2019:0794
Private: 0   Priority: None   Status: None   Summary: None
Last Updated: 2019-06-06 02:00:33 UTC

Description Robert Sandu 2019-04-08 10:08:01 UTC
Description of problem:

- Prometheus shows different monitoring history when refreshing the Grafana dashboard.
- Also, Prometheus does not seem to honor storage.tsdb.retention: it stores less than 12h of monitoring data instead of the configured 15d (a way to measure the actually retained window is sketched after the output below):

# curl -sk -H "Authorization: Bearer $(oc whoami -t)" https://prometheus-k8s.openshift-monitoring.svc.cluster.local:9091/api/v1/status/flags | python -m json.tool | grep storage
        "storage.remote.flush-deadline": "1m",
        "storage.tsdb.max-block-duration": "36h",
        "storage.tsdb.min-block-duration": "2h",
        "storage.tsdb.no-lockfile": "true",
        "storage.tsdb.path": "/prometheus",
        "storage.tsdb.retention": "15d",

Version-Release number of selected component (if applicable): OCP v3.11.69


How reproducible: not always. I haven't been able to reproduce this issue in a lab environment.


Steps to Reproduce:
1. N/A

Actual results: storage.tsdb.retention is not honored: less than 12h of monitoring data is stored instead of the configured 15d, and different retention frames are shown on refresh.


Expected results: storage.tsdb.retention is honored and the same retention frames are shown in Prometheus.


Additional info:

- The monitoring stack does not use persistent storage (see the check sketched after this list).
- The Prometheus pods have been deleted; the same issue appears after the pods were recreated.
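
A quick way to confirm whether the Prometheus pods are backed by persistent volumes (standard oc commands; prometheus-k8s-0 and the prometheus-k8s-db volume name are the defaults of the 3.11 monitoring stack and are assumptions here):

# oc -n openshift-monitoring get pvc
# oc -n openshift-monitoring get pod prometheus-k8s-0 -o jsonpath='{range .spec.volumes[*]}{.name}{"\n"}{end}'

If no PVC is listed and the prometheus-k8s-db volume resolves to an emptyDir, all TSDB data is lost whenever a pod is deleted, which on its own explains missing history after pod recreation (separately from the refresh inconsistency).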

Comment 2 Frederic Branczyk 2019-04-08 13:39:43 UTC
That Prometheus setup doesn't have persistent storage configured, so deleting the Prometheus pods deletes the "historic" data; that part doesn't seem to be an issue (it would also be the first time we hear of this, both upstream and in OpenShift). What is the case, however, is that this stack currently does not set session affinity appropriately, so the HA model of Prometheus causes inconsistent data to be shown (see the HA model documentation for further insight: https://github.com/coreos/prometheus-operator/blob/master/Documentation/high-availability.md#prometheus).

We have opened https://github.com/openshift/cluster-monitoring-operator/pull/313 to fix the session affinity issue to get consistent graphs when looking at Grafana.
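
To illustrate the kind of change involved: the graphs become consistent once requests from the same client stick to a single Prometheus replica, for example via session affinity on the prometheus-k8s service. A minimal way to inspect (and, as a temporary workaround, set) that with standard oc commands is sketched below; whether the merged PR uses exactly this mechanism is recorded in the PR itself, so treat this as an assumption:

# oc -n openshift-monitoring get svc prometheus-k8s -o jsonpath='{.spec.sessionAffinity}{"\n"}'
# oc -n openshift-monitoring patch svc prometheus-k8s -p '{"spec":{"sessionAffinity":"ClientIP"}}'

Note that the cluster-monitoring-operator may reconcile manual changes to the service, so the patch is only useful for a quick test.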

Comment 3 Frederic Branczyk 2019-04-09 07:56:50 UTC
The PR is merged, so moving to MODIFIED.

Comment 5 Junqi Zhao 2019-04-16 03:13:02 UTC
When refreshing the Grafana UI there is no longer a big difference; the issue is fixed.

ose-cluster-monitoring-operator-v3.11.105-1
Firefox 52.0.2 (64-bit)
Chrome 58.0.3029.81 (64-bit)
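
For a CLI spot-check of the same behavior, the curl pattern from the description can be repeated a few times; with session affinity in place, consecutive requests from one client should land on the same replica and return consistent data (the query below is illustrative):

# for i in 1 2 3; do curl -sk -H "Authorization: Bearer $(oc whoami -t)" "https://prometheus-k8s.openshift-monitoring.svc.cluster.local:9091/api/v1/query?query=up" | python -m json.tool | head -n 5; done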

Comment 10 errata-xmlrpc 2019-06-06 02:00:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0794

