Bug 1989128 - Occasional shelves in node_cpu_seconds_total where CPU is reported but not increasing
Summary: Occasional shelves in node_cpu_seconds_total where CPU is reported but not increasing
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Philip Gough
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-02 13:53 UTC by W. Trevor King
Modified: 2021-09-13 15:13 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-09 07:59:36 UTC
Target Upstream Version:
Embargoed:


Attachments
sum(node_cpu_seconds_total{instance="ci-op-b74rpbq7-3b3f8-2mxs5-master-0"}) (45.07 KB, image/png)
2021-08-02 13:53 UTC, W. Trevor King

Description W. Trevor King 2021-08-02 13:53:22 UTC
Created attachment 1810132 [details]
sum(node_cpu_seconds_total{instance="ci-op-b74rpbq7-3b3f8-2mxs5-master-0"})

Spinning off from bug 1985073, [1] is a 4.7 to 4.8 update job where:

  node_cpu_seconds_total{instance="ci-op-b74rpbq7-3b3f8-2mxs5-master-0"}

has some unexpected flat shelves.  These seem to occur in the vicinity of metric outages.  For example (a cross-checking query sketch follows this timeline):

* 13:02:37, node_cpu_seconds_total goes away [2].
* 13:03:01, node_cpu_seconds_total comes back [2].
* 13:03:31, node_cpu_seconds_total stops increasing [2].
* 13:03:44, Killing: Stopping container prometheus [3].
* 13:04:54, 2 Started: Started container prometheus [3].
* 13:05:14, node_cpu_seconds_total starts increasing again, with a large step that seems like a catch-up for the shelf where it wasn't increasing [2].
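
As a cross-check, a rough query sketch (assuming this cluster's node-exporter targets carry job="node-exporter" and the node name as the instance label, matching the series above):

  # gaps in the target's own `up` series should line up with the
  # prometheus container restart in the timeline above
  up{job="node-exporter", instance="ci-op-b74rpbq7-3b3f8-2mxs5-master-0"}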

So it seems like prometheus-k8s-1 (the one we gathered into artifacts [3]) was down for a bit while it was being updated, but instead of reporting a hole in node_cpu_seconds_total during that range, it's extrapolating a flat line from the last node-exporter scrape.  And then the new container comes back up, successfully scrapes the node exporter, and says "oh, heh, I guess I was a bit off with that extrapolation, and the value is really up here".  That's not terrible, but as bug 1985073 shows, it can be hard for alerts to avoid false positives when they can't distinguish between these three phases (a query sketch follows the list):

1. Accurate data (extrapolated) from a recent scrape.
2. Hole in the metric, where we know there is no data.
3. Long extrapolation, because a Prom reboot or whatever kept us from successfully scraping on our usual poll period.
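
For what it's worth, a rough PromQL sketch of one way to tell these phases apart when poking at the gathered data (just a sketch; the series is the one from the attachment, and nothing below is wired into the job or any alert):

  # phase 2: the instant vector is empty, i.e. a real hole in the metric
  absent(node_cpu_seconds_total{instance="ci-op-b74rpbq7-3b3f8-2mxs5-master-0"})

  # phase 1 vs 3: age of the sample an instant query is carrying forward;
  # a few tens of seconds means a recent scrape, while a value creeping toward
  # ~300s means we're riding the lookback window rather than fresh data
  time() - timestamp(node_cpu_seconds_total{instance="ci-op-b74rpbq7-3b3f8-2mxs5-master-0"})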

Possibly this whole thing is just an artifact of us gathering a single Prom's data, and if we were able to gather aggregated data [4] like Alertmanager would see in-cluster, we wouldn't see these shelves.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1418177253686120448
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1418177253686120448
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1985073#c2
[4]: https://issues.redhat.com/browse/MON-1771

Comment 1 Philip Gough 2021-08-09 07:59:36 UTC
This does not appear to be a bug, but rather a case of how Prometheus handles stale data [1].

The first shelf after the hole in the graph, at around 13:03:40, likely has staleness markers (hidden from the end user), in which case the time series would be returned as empty and not evaluated by functions such as rate, avg_over_time, etc.
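
One way to see that in the data we do have (just a sketch over the same series; count_over_time only counts samples that were actually ingested, so stale-marked gaps and genuine holes both show up here):

  # number of real samples per 2m window; it shrinks across the marked gap and
  # returns nothing once the whole window sits inside it, which is why rate()
  # and avg_over_time() have no points to evaluate rather than a flat value
  count_over_time(node_cpu_seconds_total{instance="ci-op-b74rpbq7-3b3f8-2mxs5-master-0"}[2m])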

The second ~5min shelf, around 13:28:52, suggests that there may have been a period of downtime for Prometheus, at which point there are no staleness markers and Prometheus falls back to the old 1.x behaviour of dropping a metric it has not seen for five minutes. We also know from the must-gather that prometheus-k8s-1 started at 13:34:38, which is just after the metric drops, so it is certain there was downtime beforehand and the timing aligns.
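
For reference, that five-minute figure is Prometheus's query lookback delta (5m by default, tunable with --query.lookback-delta); a sketch of the behaviour against the same series:

  # without staleness markers, this instant selector keeps returning the last
  # ingested sample for up to the lookback delta, hence the ~5 minute shelf,
  # after which the series drops out of query results entirely
  node_cpu_seconds_total{instance="ci-op-b74rpbq7-3b3f8-2mxs5-master-0"}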

We don't have any data for the shelf at 13:47:04, since there is no new pod after that time, but given the five-minute shelf length we can assume the same cause.


[1]: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness

