Bug 1933805 - TargetDown alert fires during upgrades because of normal upgrade behavior
Summary: TargetDown alert fires during upgrades because of normal upgrade behavior
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Clayton Coleman
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-03-01 18:44 UTC by Clayton Coleman
Modified: 2021-07-30 12:13 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:48:44 UTC
Target Upstream Version:
Embargoed:


Attachments
TargetDown alert rule definition in console UI (95.01 KB, image/png)
2021-03-17 17:42 UTC, Micah Abbott


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1072 0 None open Bug 1933805: TargetDown should exclude unschedulable nodes 2021-03-04 23:36:00 UTC
Github openshift machine-config-operator pull 2446 0 None open DO NOT MERGE: Bug 1933805: Add node label to service monitor 2021-03-01 21:10:39 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:49:03 UTC

Description Clayton Coleman 2021-03-01 18:44:28 UTC
While adding the e2e test that prevents alerts from firing during upgrades, the TargetDown alert was seen firing on the machine-config-daemon. That led to the realization that, as currently written, TargetDown will always fire during a properly functioning upgrade, which is not desirable (no alerts should fire during normal upgrades).

TargetDown looks for job targets with up == 0 (i.e., targets that can't be scraped), but daemonsets aren't drained during a normal upgrade, so for the period while a node is rebooting its targets are listed as down.  The alert has a 10m `for` clause, but upgrading every node in the cluster always takes longer than that, and a proper upgrade works quickly and serially to keep at least one node upgrading and rebooting at any given time, so it's very likely the alert fires because one pod is always being updated.  During e2e runs the control plane and workers upgrade in parallel, which further increases the chance that the alert never resets.
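
For reference, the rule is roughly of the following shape (the upstream kube-prometheus wording; the exact expression shipped in OpenShift may differ slightly), evaluated with the 10m `for` clause mentioned above. Nothing in it distinguishes a target that is down because its node is rebooting from one that is down for a real reason:

```
100 * (  count(up == 0) by (job, namespace, service)
       / count(up)      by (job, namespace, service)) > 10
```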

Therefore TargetDown is not correct as an alert in the presence of upgrades, and we need to redefine it.  In other fixes we've identified the node being unschedulable as a potential signal that can be used to suppress alerts for daemonsets: we expect up == 0 from daemonset pods on nodes that are being drained.  Ideally we would have an MCO-driven metric that indicates intent ("node X is being upgraded") from the time the MCO decides a node should be updated until the time it is done, and then we would time-limit how long we suppress the condition so we start flagging nodes that remain down abnormally long (a normal node upgrade should have an SLO of completing within X time after drain).  In the short term, unschedulable is an acceptable signal on which to suppress the alert for a down pod (admins marking all their nodes unschedulable is on them).
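
A minimal sketch of that short-term suppression, assuming the affected targets carry a `node` label (which is what the ServiceMonitor relabeling shown below adds) and that kube-state-metrics exposes `kube_node_spec_unschedulable`; this is illustrative only, not the exact rule that was merged:

```
# Count down targets per job/namespace/service, but ignore any target whose
# node is currently marked unschedulable (i.e. cordoned for an upgrade drain).
count by (job, namespace, service) (
  (up == 0)
  unless on (node)
  (kube_node_spec_unschedulable == 1)
) > 0
```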

Comment 3 Micah Abbott 2021-03-17 17:42:04 UTC
Created attachment 1764135
TargetDown alert rule definition in console UI

Comment 4 Micah Abbott 2021-03-17 17:45:35 UTC
Sanity verification done with 4.8.0-0.nightly-2021-03-17-123640 on AWS.

Confirmed that the TargetDown alert rule was properly updated (see attached screenshot) and that the `relabelings` entry on the MCD ServiceMonitor was added.

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-17-123640   True        False         14m     Cluster version is 4.8.0-0.nightly-2021-03-17-123640
$ oc -n openshift-machine-config-operator get servicemonitor/machine-config-daemon -o json | jq .spec.endpoints
[
  {
    "bearerTokenFile": "/var/run/secrets/kubernetes.io/serviceaccount/token",
    "interval": "30s",
    "path": "/metrics",
    "port": "metrics",
    "relabelings": [
      {
        "action": "replace",
        "regex": ";(.*)",
        "replacement": "$1",
        "separator": ";",
        "sourceLabels": [
          "node",
          "__meta_kubernetes_pod_node_name"
        ],
        "targetLabel": "node"
      }
    ],
    "scheme": "https",
    "tlsConfig": {
      "caFile": "/etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt",
      "serverName": "machine-config-daemon.openshift-machine-config-operator.svc"
    }
  }
]
```
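
As an additional spot check (not part of the verification above), the relabeling should make the machine-config-daemon `up` series carry a `node` label copied from `__meta_kubernetes_pod_node_name`. A query along these lines, run from the console's Metrics page, should return one series per node; the `job` label value here is assumed to match the Service name:

```
up{namespace="openshift-machine-config-operator", job="machine-config-daemon"}
```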

Comment 7 errata-xmlrpc 2021-07-27 22:48:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

