Bug 1674270

Summary: PVs for Prometheus pods show different usage

Product: OpenShift Container Platform
Component: Monitoring
Version: 3.11.0
Status: CLOSED NOTABUG
Severity: medium
Priority: unspecified
Reporter: hgomes
Assignee: Frederic Branczyk <fbranczy>
QA Contact: Junqi Zhao <juzhao>
CC: hgomes, kgeorgie, minden, mloibl, spasquie, surbania
Flags: kgeorgie: needinfo-
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-02-27 14:59:02 UTC
Type: Bug
Regression: ---

Comment 1 minden 2019-02-12 12:46:51 UTC
That indeed looks like an inconsistency. Off the top of my head, I can see two scenarios causing this:

1. One of the Prometheus instances was down for some period of time.

2. One Prometheus instance compacts its data differently than the other.

Given that Prometheus scrapes itself, we could verify the former by looking at the `scrape_samples_scraped` metric. Would you mind querying this metric in the Prometheus UI and looking for anomalies over a long time range?
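A minimal sketch of that check over the HTTP API (the localhost URL assumes a port-forward to the pod, e.g. `oc port-forward prometheus-k8s-0 9090 -n openshift-monitoring`; the `job` label is an assumption and may be named differently in your deployment):

```python
import json
import urllib.parse
import urllib.request
from datetime import datetime, timedelta

# Assumed: Prometheus reachable locally via a port-forward.
PROM_URL = "http://localhost:9090"

end = datetime.utcnow()
start = end - timedelta(days=7)
step = 300  # seconds

params = urllib.parse.urlencode({
    # Assumed job label; adjust to whatever /targets shows for the self-scrape.
    "query": 'scrape_samples_scraped{job="prometheus"}',
    "start": start.timestamp(),
    "end": end.timestamp(),
    "step": step,
})
with urllib.request.urlopen(f"{PROM_URL}/api/v1/query_range?{params}") as resp:
    series_list = json.load(resp)["data"]["result"]

# A hole in the returned samples larger than the step means the instance
# produced no data for that window, i.e. it was likely down (scenario 1).
for series in series_list:
    values = series["values"]
    for (t1, _), (t2, _) in zip(values, values[1:]):
        if t2 - t1 > 2 * step:
            print(series["metric"], "gap:",
                  datetime.utcfromtimestamp(t1), "->",
                  datetime.utcfromtimestamp(t2))
```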

I am also adding Krasi here. He is our Prometheus time series database expert. Krasi: Have you seen similar reports?

Comment 2 Krasi 2019-02-12 12:56:33 UTC
Nope, I haven't had such reports so far.

I would say there must be some difference somewhere: retention policy, compaction, recording rules, etc.

Compare the /config, /rules, /targets, and /flags HTTP endpoints on both instances and let us know if you see any difference there.
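A minimal sketch of that comparison using the JSON counterparts of those pages (pod names and local ports are illustrative and assume an `oc port-forward` to each replica; `/api/v1/rules` needs a reasonably recent Prometheus 2.x, and `/api/v1/targets` output embeds scrape timestamps, so some diff noise there is expected):

```python
import difflib
import urllib.request

# Assumed: each Prometheus replica port-forwarded to its own local port.
PODS = {
    "prometheus-k8s-0": "http://localhost:9090",
    "prometheus-k8s-1": "http://localhost:9091",
}

# JSON API counterparts of the /config, /flags, /rules and /targets UI pages.
ENDPOINTS = [
    "/api/v1/status/config",
    "/api/v1/status/flags",
    "/api/v1/rules",
    "/api/v1/targets",
]

for endpoint in ENDPOINTS:
    bodies = {}
    for pod, base in PODS.items():
        with urllib.request.urlopen(base + endpoint) as resp:
            bodies[pod] = resp.read().decode().splitlines()
    a, b = bodies.values()
    diff = list(difflib.unified_diff(a, b, *PODS.keys(), lineterm=""))
    print(endpoint, "identical" if not diff else "DIFFERS")
    print("\n".join(diff[:40]))  # first few differing lines, if any
```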