Bug 1989128
| Summary: | Occasional shelves in node_cpu_seconds_total where CPU is reported but not increasing | ||||||
|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> | ||||
| Component: | Monitoring | Assignee: | Philip Gough <pgough> | ||||
| Status: | CLOSED NOTABUG | QA Contact: | Junqi Zhao <juzhao> | ||||
| Severity: | medium | Docs Contact: | |||||
| Priority: | medium | ||||||
| Version: | 4.8 | CC: | alegrand, amuller, anpicker, aos-bugs, erooth, kakkoyun, pkrupa, pnair | ||||
| Target Milestone: | --- | ||||||
| Target Release: | --- | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2021-08-09 07:59:36 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
Description
W. Trevor King
2021-08-02 13:53:22 UTC
This does not appear to be a bug, but rather a consequence of how Prometheus handles stale data [1].

The first shelf after the hole in the graph at around 13:03:40 likely has staleness markers (hidden from the end user), in which case the time series is returned as empty and is not evaluated by functions such as rate, avg_over_time, etc.

The second ~5 minute shelf, around 13:28:52, suggests that Prometheus itself may have had a period of downtime. In that case the staleness markers are missing, and Prometheus falls back to the old 1.x behaviour of dropping a metric once it has not seen a sample for five minutes. We also know from the must-gather that prometheus-k8s-1 started at 13:34:38, which is just after the metric drops, so there was downtime beforehand and the timing aligns.

We don't have any data for the shelf at 13:47:04, since there is no new pod after that time, but given the shelf length of 5 minutes we can assume the same cause.

[1]: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness
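The two cases described above can be illustrated with a small simulation. This is only a rough sketch of the lookup rule, not Prometheus's actual implementation: the function names, the `STALE` sentinel, and the sample data are all made up for illustration. The only real value is the 5 minute window, which is Prometheus's default lookback delta.

```python
from bisect import bisect_right

LOOKBACK_SECONDS = 300  # Prometheus's default lookback delta (5 minutes)
STALE = None            # hypothetical stand-in for the internal staleness marker


def instant_value(samples, t, lookback=LOOKBACK_SECONDS):
    """Return the value a query evaluated at time t would see, or None.

    samples is a list of (timestamp_seconds, value) sorted by timestamp.
    Simplified model: take the newest sample at or before t, and return
    nothing if it is a staleness marker or older than the lookback window.
    """
    timestamps = [ts for ts, _ in samples]
    i = bisect_right(timestamps, t) - 1
    if i < 0:
        return None  # no sample at or before t
    ts, value = samples[i]
    if value is STALE:
        return None  # series was explicitly marked stale: a hole in the graph
    if t - ts >= lookback:
        return None  # last sample is too old: the series finally drops
    return value     # last sample is reused: a flat shelf in the graph


def scrape_series(start, end, step=30, base=100.0, per_second=1.0):
    """Build a hypothetical steadily increasing counter, one sample per scrape."""
    return [(ts, base + (ts - start) * per_second)
            for ts in range(start, end + 1, step)]


# Case 1: the target disappears while Prometheus is running, so a
# staleness marker is written after the last successful scrape.
with_marker = scrape_series(0, 120) + [(150, STALE)]

# Case 2: Prometheus itself is down, so no marker is written and the
# last scraped value keeps being returned until the lookback expires.
without_marker = scrape_series(0, 120)
```

In this model, querying `with_marker` at any time after the marker returns nothing, while querying `without_marker` keeps returning the last scraped value (220.0 here) for just under five minutes after the final scrape at t=120 and then returns nothing. That matches a counter that is reported but not increasing for roughly five minutes.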