While adding the e2e test that prevents alerts from firing during upgrade, the TargetDown alert fired on the machine-config-daemon. That led to the realization that, as currently written, a properly functioning upgrade will always trigger TargetDown, which is not desirable (no alerts should fire during normal upgrades).

TargetDown looks for jobs whose targets report up == 0 (can't be scraped), but daemonset pods aren't drained during a normal upgrade, so while a node is rebooting its targets are listed as down. The alert has a 10m `for` clause, but upgrading every node in the cluster always takes longer than that, and a proper upgrade proceeds quickly and serially so that at least one node is being upgraded and rebooted at any given time. As a result the alert is very likely to fire because one pod is always being updated. During e2e runs the control plane and workers upgrade in parallel, which makes it even less likely that the alert resets.

Therefore TargetDown is not correct as an alert in the presence of upgrades and we need to redefine it. In other fixes we've identified the node being unschedulable as a potential signal for suppressing alerts for daemonsets: we expect up == 0 from daemonset pods on nodes that are being drained. Ideally we would have an MCO-driven metric that indicates intent ("node X is being upgraded") from the time the MCO decides a node should be updated until the update is done, and we would time-limit how long we suppress the condition so we start flagging nodes that remain down abnormally long (a normal node upgrade should have an SLO of completing within X time after drain). In the short term, unschedulable is an acceptable signal for suppressing the alert for a down pod (admins marking all their nodes unschedulable is on them).
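For illustration only, here is a minimal PromQL sketch of that short-term suppression idea, assuming the kube-state-metrics metric kube_node_spec_unschedulable and a `node` label on the targets' up series; the expression actually shipped may differ:

```
# Sketch (not necessarily the shipped rule): count a target as "down" only if
# the node it runs on is still schedulable, so targets on cordoned/draining
# nodes don't push TargetDown over its threshold.
100 * (
  count by (job, namespace, service) (
    (up == 0)
    * on (node) group_left ()
    (max by (node) (kube_node_spec_unschedulable) == 0)
  )
  /
  count by (job, namespace, service) (up)
) > 10
# Caveat: targets whose up series carry no "node" label fall out of the join
# above; a production rule would need to keep counting those separately.
```

The join only works if the up series actually expose which node the pod runs on, which is what the ServiceMonitor relabeling verified below provides for the machine-config-daemon.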
Created attachment 1764135 [details] TargetDown alert rule definition in console UI
Sanity verification done with 4.8.0-0.nightly-2021-03-17-123640 on AWS. Confirmed that the TargetDown alerting rule was properly updated (see attached screenshot) and that the `relabelings` rule on the MCD ServiceMonitor was added.

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-17-123640   True        False         14m     Cluster version is 4.8.0-0.nightly-2021-03-17-123640

$ oc -n openshift-machine-config-operator get servicemonitor/machine-config-daemon -o json | jq .spec.endpoints
[
  {
    "bearerTokenFile": "/var/run/secrets/kubernetes.io/serviceaccount/token",
    "interval": "30s",
    "path": "/metrics",
    "port": "metrics",
    "relabelings": [
      {
        "action": "replace",
        "regex": ";(.*)",
        "replacement": "$1",
        "separator": ";",
        "sourceLabels": [
          "node",
          "__meta_kubernetes_pod_node_name"
        ],
        "targetLabel": "node"
      }
    ],
    "scheme": "https",
    "tlsConfig": {
      "caFile": "/etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt",
      "serverName": "machine-config-daemon.openshift-machine-config-operator.svc"
    }
  }
]
```
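That relabeling joins the existing `node` label and `__meta_kubernetes_pod_node_name` with a `;` separator; when `node` is empty the anchored regex `;(.*)` matches and copies the pod's node name into the `node` label, so the MCD's up series can be joined against kube_node_spec_unschedulable. A hypothetical spot check from the console query page (label values are illustrative, not confirmed from this cluster):

```
# Each machine-config-daemon target should now expose a non-empty "node" label.
up{namespace="openshift-machine-config-operator", node!=""}

# A down MCD target on a cordoned (unschedulable) node should drop out of the
# suppressed-down count, per the sketch in the earlier comment.
(up{namespace="openshift-machine-config-operator"} == 0)
* on (node) group_left ()
(max by (node) (kube_node_spec_unschedulable) == 0)
```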
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438