Description of problem:
- Prometheus shows different monitoring history when refreshing the Grafana dashboard.
- Prometheus also does not appear to honor storage.tsdb.retention: it stores less than 12h of monitoring data instead of the configured 15d:

# curl -sk -H "Authorization: Bearer $(oc whoami -t)" https://prometheus-k8s.openshift-monitoring.svc.cluster.local:9091/api/v1/status/flags | python -m json.tool | grep storage
    "storage.remote.flush-deadline": "1m",
    "storage.tsdb.max-block-duration": "36h",
    "storage.tsdb.min-block-duration": "2h",
    "storage.tsdb.no-lockfile": "true",
    "storage.tsdb.path": "/prometheus",
    "storage.tsdb.retention": "15d",

Version-Release number of selected component (if applicable):
OCP v3.11.69

How reproducible:
Not always. I haven't been able to reproduce this issue in a lab environment.

Steps to Reproduce:
1. N/A

Actual results:
storage.tsdb.retention is not honored: less than 12h of monitoring data is stored instead of the configured 15d, and different retention frames are shown.

Expected results:
storage.tsdb.retention is honored and the same retention frames are shown in Prometheus.

Additional info:
- The monitoring stack does not use persistent storage.
- The Prometheus pods have been deleted. The same issue is seen after the pods have been recreated.
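The curl-plus-grep check above can also be scripted. A minimal sketch, assuming the same /api/v1/status/flags endpoint and the standard Prometheus v1 API envelope (the sample payload below is copied from the flag values in the report, not fetched live):

```python
import json

# Sample response in the shape returned by Prometheus' /api/v1/status/flags
# endpoint; the flag values are the ones quoted in this report.
response = json.loads("""
{
  "status": "success",
  "data": {
    "storage.remote.flush-deadline": "1m",
    "storage.tsdb.max-block-duration": "36h",
    "storage.tsdb.min-block-duration": "2h",
    "storage.tsdb.no-lockfile": "true",
    "storage.tsdb.path": "/prometheus",
    "storage.tsdb.retention": "15d"
  }
}
""")

# Filter to the storage-related flags, mirroring the `grep storage` step.
storage_flags = {k: v for k, v in response["data"].items()
                 if k.startswith("storage.")}

print(storage_flags["storage.tsdb.retention"])  # → 15d
```

This confirms the configured retention flag is 15d; the actual on-disk history can still be shorter, e.g. when the pods run without persistent storage and are restarted.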
That Prometheus setup doesn't have persistent storage configured, so deleting the Prometheus pods deletes the "historic" data. The short retention observed after pod deletion is therefore expected rather than a bug (this would also be the first time we have heard of such an issue, both upstream and in OpenShift). What is the case, however, is that this stack currently does not set session affinity appropriately, so Prometheus's HA model causes inconsistent data to be shown (see the HA model documentation here for further insight: https://github.com/coreos/prometheus-operator/blob/master/Documentation/high-availability.md#prometheus). We have opened https://github.com/openshift/cluster-monitoring-operator/pull/313 to fix the session affinity issue so that graphs are consistent when viewed in Grafana.
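In Kubernetes terms, the fix amounts to enabling ClientIP session affinity on the Service in front of the Prometheus replicas, so a given client keeps being routed to the same replica instead of alternating between replicas with independent TSDBs. A minimal illustrative sketch of such a Service (field values are assumptions for illustration, not the exact manifest changed by the PR):

```yaml
# Illustrative Service with ClientIP session affinity (not the exact
# change from the PR). With sessionAffinity: ClientIP, kube-proxy sends
# all traffic from a given client IP to the same backing pod, so repeated
# Grafana refreshes query one Prometheus replica consistently.
apiVersion: v1
kind: Service
metadata:
  name: prometheus-k8s
  namespace: openshift-monitoring
spec:
  selector:
    app: prometheus
    prometheus: k8s
  ports:
    - name: web
      port: 9091
      targetPort: web
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800  # Kubernetes default: keep affinity for 3h
```

Note that session affinity only makes the *displayed* data consistent per client; the replicas' datasets can still differ, which is inherent to the Prometheus HA model described in the linked documentation.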
The PR has been merged, so moving to MODIFIED.
Verified: when refreshing the Grafana UI, there is no longer a significant difference between refreshes; the issue is fixed.

Tested with:
ose-cluster-monitoring-operator-v3.11.105-1
Firefox 52.0.2 (64-bit)
Chrome Version 58.0.3029.81 (64-bit)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0794