Bug 1989128
Summary: | Occasional shelves in node_cpu_seconds_total where CPU is reported but not increasing | ||||||
---|---|---|---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> | ||||
Component: | Monitoring | Assignee: | Philip Gough <pgough> | ||||
Status: | CLOSED NOTABUG | QA Contact: | Junqi Zhao <juzhao> | ||||
Severity: | medium | Docs Contact: | |||||
Priority: | medium | ||||||
Version: | 4.8 | CC: | alegrand, amuller, anpicker, aos-bugs, erooth, kakkoyun, pkrupa, pnair | ||||
Target Milestone: | --- | ||||||
Target Release: | --- | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2021-08-09 07:59:36 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Description
W. Trevor King
2021-08-02 13:53:22 UTC
This does not appear to be a bug, but rather a consequence of how Prometheus handles stale data [1].

The first shelf, after the hole in the graph at around 13:03:40, likely has staleness markers (hidden from the end user), in which case the time series would be returned as empty and not evaluated by functions such as rate, avg_over_time, etc.

The second shelf, roughly 5 minutes long at around 13:28:52, suggests that there may have been a period of downtime for Prometheus, during which staleness markers are missing and Prometheus falls back to the old 1.x behaviour of dropping a metric once it has not been seen for five minutes. We also know from the must-gather that prometheus-k8s-1 started at 13:34:38, just after the metric drops, so there was certainly downtime beforehand and the timing aligns.

We don't have any data for the shelf at 13:47:04, since we have no new pod after that time, but given the shelf length of 5 minutes we can assume the same cause.

[1]: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness
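To illustrate the two behaviours described above, here is a minimal toy model in Python. It is not Prometheus code: the function name instant_value, the STALE stand-in, and the sample values are all hypothetical, and the only real-world figure assumed is the default 5-minute lookback window. It shows why a series whose scrapes stop without a staleness marker keeps reporting its last value for about five minutes (a "shelf"), while a series with a marker disappears straight away.

```python
from typing import Optional, Sequence, Tuple

LOOKBACK_SECONDS = 300.0  # assumed default lookback window (5 minutes)
STALE = None              # toy stand-in for a staleness marker

# (timestamp in seconds, value or STALE); assumed sorted by timestamp
Sample = Tuple[float, Optional[float]]


def instant_value(samples: Sequence[Sample], t: float,
                  lookback: float = LOOKBACK_SECONDS) -> Optional[float]:
    """Return the value a query at time t would see, or None if the series is empty."""
    latest = None
    for ts, value in samples:
        if ts > t:
            break
        latest = (ts, value)
    if latest is None:
        return None          # no sample at or before t
    ts, value = latest
    if value is STALE:
        return None          # marker present: series vanishes immediately
    if t - ts > lookback:
        return None          # no marker: series vanishes once the window expires
    return value             # otherwise the last value is repeated (the shelf)


# Scrapes every 30s, then they stop with no marker (e.g. the scraper was down).
no_marker = [(0.0, 100.0), (30.0, 103.0), (60.0, 106.0)]
# Same series, but the target disappeared cleanly and a marker was written.
with_marker = no_marker + [(90.0, STALE)]

for t in (90.0, 200.0, 359.0, 361.0):
    print(t, instant_value(no_marker, t), instant_value(with_marker, t))
```

In this sketch the series without a marker stays flat at its last value until the lookback window expires, which matches the roughly 5-minute shelves in the graph; the series with a marker returns nothing from the marker onwards, which matches the hole.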