GCP minor update jobs frequently fail on this alert firing briefly: $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=alert+.*High.*ControlPlaneCPU+fired+for' | grep -v 'rehearse-' | grep 'failures match' periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade (all) - 8 runs, 88% failed, 14% of failures match = 13% impact periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact Example job [1] has: disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success 1h13m54s Jul 22 13:42:42.464: Unexpected alerts fired or pending during the upgrade: alert ExtremelyHighIndividualControlPlaneCPU fired for 60 seconds with labels: {instance="ci-op-b74rpbq7-3b3f8-2mxs5-master-0", severity="critical"} Looking in PromeCIeus, this alert fires from 13:40:38 through 13:41:37Z. Weirdly: sum by (mode) (rate(node_cpu_seconds_total{instance="ci-op-b74rpbq7-3b3f8-2mxs5-master-0"}[5m])) adds up to ~4 vCPU for most of the run, but when the alert is going off, the total sum across all modes for that instance is only 0.5912. Where did all the other CPUs go? What mode are they actually in? There are lots of options for getting this alert-noise to go away, including "bump 'for' on the alert to 10m", "fix whatever issues the mode scraper is having", etc., etc. This bug tracks the fact that the alert is firing in (and therefore failing) CI updates, which makes it hard to see other issues that CI should be bringing to our attention. [1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1418177253686120448
Poking around more in PromeCIeus, dropping the rate range to 1m: sum by (mode) (rate(node_cpu_seconds_total{instance="ci-op-b74rpbq7-3b3f8-2mxs5-master-0"}[1m])) is getting the total stacked sum up to ~4 vCPUs over most of the job. But because CI is only gathering a single Prom's data, and not Thanos-aggregated data pulling from both Prom's, we have a few holes when the Prom we end up pulling from is down. There's one hole from 13:02 to ~13:06Z, while monitoring is updating to the new version. And another hole from 13:29 to 13:40Z, while the machine-config rolls compute. But both of those holes have a spike of node_cpu_seconds_total hits inside the hol, one at 13:03, and another at 13:35. I'm fuzzy on what's going on there. Maybe, besides the Prom we gathered being occasionally down, node exporter [1] is also occasionally down? It's probably worth adding some sort of 'absent' guard to the alert rule to avoid firing when rates are unreliable due to holes during the rate-measured window. [1]: https://github.com/openshift/node_exporter/blob/1ab97f35f07cf1ed02a2b4ea3f3c3c9d09bf94b5/collector/cpu_linux.go#L284-L298
Trying to pin down those Prom gaps. [1] is up asking about gathering Thanos-aggregated data, so we get anything that either Prom caught. But that's probably a ways off. Back to PromeCIeus: ALERTS{alertname="Watchdog"} hits for the whole run, except for a single gap from 13:33:16 to 13:36:17Z. That's during the machine-config, node-rolling phase, and I'm not sure why there is no gap there when monitoring bumped Prom images earlier in the run. Checking to see which Prom we gathered: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1418177253686120448/artifacts/e2e-gcp-upgrade/gather-extra/build-log.txt | grep 'pod/prometheus-k8s.* tar ' oc --insecure-skip-tls-verify exec -n openshift-monitoring pod/prometheus-k8s-1 -- tar cvzf - -C /prometheus . Looking at events for that pod: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1418177253686120448/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-monitoring" and .involvedObject.name == "prometheus-k8s-1")] | sort_by(.metadata.creationTimestamp)[] | (.firstTimestamp // .metadata.creationTimestamp) + " " + (.count | tostring) + " " + .reason + ": " + .message' 2021-07-22T12:17:04Z null Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ci-op-b74rpbq7-3b3f8-2mxs5-worker-b-hzsrj 2021-07-22T12:17:10Z 1 SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-b0bad821-660c-4c05-8b88-3bce78698044" ... 2021-07-22T12:17:20Z 1 Started: Started container prometheus ... 2021-07-22T13:03:44Z 1 Killing: Stopping container prometheus ... 2021-07-22T13:03:58Z null Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ci-op-b74rpbq7-3b3f8-2mxs5-worker-b-hzsrj 2021-07-22T13:04:04Z 1 SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-b0bad821-660c-4c05-8b88-3bce78698044" ... 2021-07-22T13:04:54Z 2 Started: Started container prometheus ... 2021-07-22T13:28:45Z 1 Killing: Stopping container prometheus ... 2021-07-22T13:34:08Z null Scheduled: Successfully assigned openshift-monitoring/prometheus-k8s-1 to ci-op-b74rpbq7-3b3f8-2mxs5-worker-b-hzsrj 2021-07-22T13:34:14Z 1 SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-b0bad821-660c-4c05-8b88-3bce78698044" ... 2021-07-22T13:34:33Z 2 Started: Started container prometheus ... Hmm, counts of 2 for starting the Prom containers seems suspicious. And indeed, looks like the first attempt post-reschedule crashed [2]: level=error ts=2021-07-22T13:34:33.873Z caller=main.go:347 msg="Error loading config (--config.file=/etc/prometheus/config_out/prometheus.env.yaml)" err="open /etc/prometheus/config_out/prometheus.env.yaml: no such file or directory" But the replacement came up, and was shortly serving: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1418177253686120448/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-1_prometheus.log | grep 'Server is ready to receive web requests' level=info ts=2021-07-22T13:34:52.293Z caller=main.go:767 msg="Server is ready to receive web requests." Aaand, now I'm realizing that: group by (instance) (node_cpu_seconds_total) in Prom is a good way to distinguish between node-exporter-scrape-failure and this-Prom-not-scraping-anyone. Indeed, we have healthy scraping of all nodes besides master-0, which we are failing to scrape, during the 13:35:43 to 13:40:14Z window. So timeline: * 12:17:04: prometheus-k8s-1 up on worker-b-hzsrj. * 13:02:35: beginning of 'group by (instance) (node_cpu_seconds_total)' hole in gathered Prom data. * 13:03:23: end of 'group by (instance) (node_cpu_seconds_total)' hole in gathered Prom data. Magical time travel while this Prom isn't even down yet (per event data). * 13:03:44: prometheus-k8s-1 down for an image bump. * 13:04:54: prometheus-k8s-1 takes one attempt at coming back up, fails, and shortly afterward a second attempt succeeds. * 13:28:45: prometheus-k8s-1 down for a node drain. * 13:34:33: prometheus-k8s-1 attempts to come back up, but fails on 'prometheus.env.yaml: no such file'. * 13:34:52: prometheus-k8s-1 back up and serving requests. * 13:35:12: 'group by (instance) (node_cpu_seconds_total)' is picking up all 6 nodes again. * 13:35:30: master-0 events that it's about to reboot. * 13:35:43: 'group by (instance) (node_cpu_seconds_total)' loses master-0. * 13:38:15: master-0 events that its kubelet is starting back up. * 13:39:33: master-0 events NodeDone and bumps its currentConfig. * 13:40:14: 'group by (instance) (node_cpu_seconds_total)' picks master-0 back up. * 13:40:38: ExtremelyHighIndividualControlPlaneCPU starts firing for master-0. * 13:41:37: ExtremelyHighIndividualControlPlaneCPU stops firing. So yeah, this is definitely just "the alert needs to learn to ignore periods when the node is being rebooted, and count "I have no idea what the CPUs are up to; node exporter, and maybe the whole node, is down" as "CPU must be high". [1]: https://issues.redhat.com/browse/MON-1771 [2]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-upgrade/1418177253686120448/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/pods/openshift-monitoring_prometheus-k8s-1_prometheus_previous.log
[1] landed 9 days ago with the temporary exception. Checking the last week of CI: $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=168h&type=junit&search=alert+ .*High.*ControlPlaneCPU+fired+for' | grep -v 'rehearse-' | grep 'failures match' periodic-ci-openshift-release-master-okd-4.8-upgrade-from-4.7-e2e-upgrade-gcp (all) - 3 runs, 100% failed, 33% of failures match = 33% impact periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade (all) - 7 runs, 71% failed, 20% of failures match = 14% impact pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade (all) - 14 runs, 57% failed, 13% of failures match = 7% impact pull-ci-openshift-ovn-kubernetes-release-4.8-e2e-gcp-ovn-upgrade (all) - 33 runs, 82% failed, 7% of failures match = 6% impact periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-gcp-ovn-upgrade (all) - 4 runs, 100% failed, 25% of failures match = 25% impact pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade (all) - 36 runs, 81% failed, 7% of failures match = 6% impact periodic-ci-openshift-release-master-ci-4.8-e2e-aws-compact-upgrade (all) - 4 runs, 25% failed, 100% of failures match = 25% impact That's mostly lands-in-4.8 stuff, and the temporary exception is just for lands-in-4.9 jobs unless we backport [1] to 4.8. However, there are still some lands-in-4.9 hits like [2]. Ah, but that's: alert ExtremelyHighIndividualControlPlaneCPU fired for 90 seconds with labels: {instance="ci-op-3gqvl0g2-1ff06-gh969-master-2", severity="critical"} and in [1] for some reason I'd used HighOverallControlPlaneCPU <facepalm>. I've opened [3] to pivot the exception to cover ExtremelyHighIndividualControlPlaneCPU, which I should have done from the start. [1]: https://github.com/openshift/origin/pull/26341 [2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-gcp-upgrade/1420824398541099008 [3]: https://github.com/openshift/origin/pull/26365
I've spun off bug 1989128 about shelves in node_cpu_seconds_total that we see in PromeCIeus while Prom containers are rebooting. I'm not sure if that's an artifact of our gather, or PromeCIeus serving the data we gather, or if it is actually something that Alertmanager would see in-cluster. But something like: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) + on (instance) 0.1 (1 - sum by (instance) (count_over_time(( group by (instance) (node_cpu_seconds_total) )[5m:10s])) / 30.5) < 0.1 Would match if idle cycles dropped below 10% without bias from sections where the metric was missing entirely. And maybe that would be enough to keep the alert from firing false positives? Returning to the default assignee, because the kube-API-server folks would be in charge of maintaining any altered alert expression, and maybe they'd prefer just kicking 'for' up higher or something.
origin/26365 is a temporary softening of CI constraints. It is not a fix for the noisy-alert issue, which is the main thing that this bug should fix.
moving it to 4.10, openshift/cluster-kube-apiserver-operator/pull/1201 merged on Aug 9.
rate does extrapolation based on the slope of the first and the last sample under the window. rate also avoids extrapolating too far, extrapolation extends to half the sample interval when the first or the last sample is too far away from the window so prometheus does not handle the missing data/gap as above. since rate already extrapolates for us i don't see it's necessary for the alert to take into account any gap in its calculation. slack thread where we discussed it: https://coreos.slack.com/archives/C01CQA76KMX/p1626752724386400 so openshift/cluster-kube-apiserver-operator/pull/1201 is a candidate for fix. let's test it.
As checked in search CI not found any cpu error in last 48 hours for 4.10 upgrade. https://search.ci.openshift.org/?search=alert+.*High.*ControlPlaneCPU+fired+for&maxAge=48h&context=1&type=junit&name=4%5C.10&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job No results found.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056