While adding the e2e test that prevents alerts from firing during upgrade, the TargetDown alert fired on the machine-config-daemon. That led to the realization that, as currently written, a properly functioning upgrade will always trigger TargetDown, which is not desirable (no alerts should fire during normal upgrades).

TargetDown looks for jobs whose targets report up == 0 (can't be scraped), but daemonset pods aren't drained during a normal upgrade, so while a node is rebooting its targets are listed as down. The alert has a 10m `for` clause, but upgrading every node in the cluster always takes longer than that, and a proper upgrade proceeds quickly and serially so that at least one node is being upgraded and rebooted at any given time. As a result the alert is very likely to fire because one pod is always being updated. During e2e runs the control plane and workers upgrade in parallel, which makes it even less likely that the alert resets.

Therefore TargetDown is not correct as an alert in the presence of upgrades and we need to redefine it. In other fixes we've identified the node being unschedulable as a potential signal for suppressing alerts for daemonsets: we expect up == 0 from daemonset pods on nodes that are being drained. Ideally we would have an MCO-driven metric that indicates intent ("node X is being upgraded") from the time the MCO decides a node should be updated until the update is done, and we would time-limit how long we suppress the condition so we start flagging nodes that remain down abnormally long (a normal node upgrade should have an SLO of completing within X time after drain). In the short term, unschedulable is an acceptable signal for suppressing the alert for a down pod (admins marking all their nodes unschedulable is on them).
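For illustration only, here is a minimal PromQL sketch of that short-term suppression idea, assuming the kube-state-metrics metric kube_node_spec_unschedulable and a `node` label on the targets' up series; the expression actually shipped may differ:

```
# Sketch (not necessarily the shipped rule): count a target as "down" only if
# the node it runs on is still schedulable, so targets on cordoned/draining
# nodes don't push TargetDown over its threshold.
100 * (
  count by (job, namespace, service) (
    (up == 0)
    * on (node) group_left ()
    (max by (node) (kube_node_spec_unschedulable) == 0)
  )
  /
  count by (job, namespace, service) (up)
) > 10
# Caveat: targets whose up series carry no "node" label fall out of the join
# above; a production rule would need to keep counting those separately.
```

The join only works if the up series actually expose which node the pod runs on, which is what the ServiceMonitor relabeling verified below provides for the machine-config-daemon.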
Created attachment 1764135 [details] TargetDown alert rule definition in console UI
Sanity verification done with 4.8.0-0.nightly-2021-03-17-123640 on AWS. Confirmed that the TargetDown alerting rule was properly updated (see attached screenshot) and that the `relabelings` rule on the MCD ServiceMonitor was added.

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-17-123640   True        False         14m     Cluster version is 4.8.0-0.nightly-2021-03-17-123640

$ oc -n openshift-machine-config-operator get servicemonitor/machine-config-daemon -o json | jq .spec.endpoints
[
  {
    "bearerTokenFile": "/var/run/secrets/kubernetes.io/serviceaccount/token",
    "interval": "30s",
    "path": "/metrics",
    "port": "metrics",
    "relabelings": [
      {
        "action": "replace",
        "regex": ";(.*)",
        "replacement": "$1",
        "separator": ";",
        "sourceLabels": [
          "node",
          "__meta_kubernetes_pod_node_name"
        ],
        "targetLabel": "node"
      }
    ],
    "scheme": "https",
    "tlsConfig": {
      "caFile": "/etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt",
      "serverName": "machine-config-daemon.openshift-machine-config-operator.svc"
    }
  }
]
```
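That relabeling joins the existing `node` label and `__meta_kubernetes_pod_node_name` with a `;` separator; when `node` is empty the anchored regex `;(.*)` matches and copies the pod's node name into the `node` label, so the MCD's up series can be joined against kube_node_spec_unschedulable. A hypothetical spot check from the console query page (label values are illustrative, not confirmed from this cluster):

```
# Each machine-config-daemon target should now expose a non-empty "node" label.
up{namespace="openshift-machine-config-operator", node!=""}

# A down MCD target on a cordoned (unschedulable) node should drop out of the
# suppressed-down count, per the sketch in the earlier comment.
(up{namespace="openshift-machine-config-operator"} == 0)
* on (node) group_left ()
(max by (node) (kube_node_spec_unschedulable) == 0)
```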
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438