Bug 1697295

Summary: Prometheus shows different monitoring history with Grafana dashboard refresh
Product: OpenShift Container Platform
Reporter: Robert Sandu <rsandu>
Component: Monitoring
Assignee: Frederic Branczyk <fbranczy>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: unspecified
Priority: unspecified
Version: 3.11.0
CC: anpicker, erooth, info-sistemi, mloibl, pkrupa, rsandu, surbania
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-06-06 02:00:29 UTC
Type: Bug

Description Robert Sandu 2019-04-08 10:08:01 UTC
Description of problem:

- Prometheus shows different monitoring history when refreshing the Grafana dashboard.
- Also, Prometheus does not seem to honor storage.tsdb.retention: it stores less than 12h of monitoring data instead of the configured 15d:

# curl -sk -H "Authorization: Bearer $(oc whoami -t)" https://prometheus-k8s.openshift-monitoring.svc.cluster.local:9091/api/v1/status/flags | python -m json.tool | grep storage
        "storage.remote.flush-deadline": "1m",
        "storage.tsdb.max-block-duration": "36h",
        "storage.tsdb.min-block-duration": "2h",
        "storage.tsdb.no-lockfile": "true",
        "storage.tsdb.path": "/prometheus",
        "storage.tsdb.retention": "15d",

Version-Release number of selected component (if applicable): OCP v3.11.69


How reproducible: not always. I haven't been able to reproduce this issue in a lab environment.


Steps to Reproduce:
1. N/A

Actual results: storage.tsdb.retention is not honored (less than 12h of monitoring data is stored instead of the configured 15d), and different retention frames are shown on refresh.


Expected results: storage.tsdb.retention is honored and the same retention frames are shown in Prometheus.


Additional info:

- The monitoring stack does not use persistent storage (a configuration sketch for enabling it follows below).
- The Prometheus pods have been deleted; the same issue is seen after the pods are recreated.
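
For context, a minimal sketch of how persistent storage could be enabled for the monitoring Prometheus instances so that history survives pod restarts; the ConfigMap name and field layout are assumed from cluster-monitoring-operator conventions and should be checked against the OCP 3.11 monitoring documentation before use:

# oc -n openshift-monitoring apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 40Gi
EOF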

Comment 2 Frederic Branczyk 2019-04-08 13:39:43 UTC
That Prometheus setup doesn't have persistent storage configured, so deleting the Prometheus pods deletes the "historic" data; that in itself is not an issue (this would also be the first time we have heard of such a problem, both upstream and in OpenShift). What is the case, however, is that this stack currently does not set session affinity appropriately, so Prometheus's HA model causes inconsistent data to be shown (see the HA model documentation for further insight: https://github.com/coreos/prometheus-operator/blob/master/Documentation/high-availability.md#prometheus).

We have opened https://github.com/openshift/cluster-monitoring-operator/pull/313 to fix the session affinity issue to get consistent graphs when looking at Grafana.
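
A quick way to check whether session affinity is in effect on a given cluster (assuming, as the PR above suggests, that the fix sets sessionAffinity on the prometheus-k8s service):

# oc -n openshift-monitoring get svc prometheus-k8s -o jsonpath='{.spec.sessionAffinity}'

An output of ClientIP would indicate affinity is set; None would mean consecutive requests may land on different Prometheus replicas, which matches the inconsistent graphs described above.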

Comment 3 Frederic Branczyk 2019-04-09 07:56:50 UTC
The PR is merged, so moving to MODIFIED.

Comment 5 Junqi Zhao 2019-04-16 03:13:02 UTC
When refreshing the Grafana UI there is no big difference between refreshes; the issue is fixed.

ose-cluster-monitoring-operator-v3.11.105-1
Firefox 52.0.2 (64-bit)
Chrome 58.0.3029.81 (64-bit)

Comment 10 errata-xmlrpc 2019-06-06 02:00:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0794