In the test-platform build02 cluster, MCDDrainError is firing for nodes that have successfully drained and updated to their desired configuration.

The must-gather was taken 2021-05-05T17:05Z:

$ head -n1 timestamp
2021-05-05 17:08:17.05003762 +0000 UTC m=+1.084843367

The cluster has been running 4.7.3 for over a month:

$ yaml2json - <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.history[] | .startedTime + " " + .completionTime + " " + .state + " " + .version' | head -n1
2021-03-19T13:53:16Z 2021-03-20T01:02:35Z Completed 4.7.3

The 'worker' MachineConfigPool is happy, and went Updated=True around an hour before the must-gather:

$ yaml2json - <cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigpools/worker.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")'
2021-03-11T12:39:06Z RenderDegraded=False :
2021-05-05T05:05:57Z NodeDegraded=False :
2021-05-05T05:05:57Z Degraded=False :
2021-05-05T16:02:12Z Updated=True : All nodes are updated with rendered-worker-dde10a54c9d435b2e56364f36da94d89
2021-05-05T16:02:12Z Updating=False :

Picking on a single node, the MCD had issues with the drain, but eventually completed it ~17h before the must-gather:

$ grep -i drain namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/logs/current.log | tail -n3
2021-05-04T23:53:01.919309548Z I0504 23:53:01.919170 2245 update.go:221] Draining failed with: error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s, retrying
2021-05-04T23:55:10.672597261Z I0504 23:55:10.672517 2245 update.go:1901] drain complete
2021-05-04T23:55:10.677753981Z I0504 23:55:10.677621 2245 update.go:237] Successful drain took 433.872992811 seconds

However, MCDDrainError is still firing for that node. Checking the backing metric:

mcd_drain_err{pod="machine-config-daemon-bzzlq"}

For the past several hours, we've had two matching mcd_drain_err series: one with err="WaitTimeout" and a value of 5, and another with no err label and a value of 0. We probably only want the MCD serving a single mcd_drain_err series per pod.
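For anyone re-checking this, a small sketch of how to pull the raw series out of the in-cluster Prometheus (assumes curl is available in the prometheus container and that prometheus-k8s-0 is a reasonable replica to pick; the query is the one above):

$ oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
    curl -s http://localhost:9090/api/v1/query \
    --data-urlencode 'query=mcd_drain_err{pod="machine-config-daemon-bzzlq"}' \
  | jq -r '.data.result[] | (.metric.err // "<no err label>") + " " + .value[1]'
# On this cluster we'd expect two lines: err="WaitTimeout" with value 5, and the
# label-less series with value 0.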
Could not reproduce this - tested in 4.6/4.7, and the alert stopped firing correctly. I note that the firing alert is from machine-config-daemon-bzzlq, but this MCD isn't in your must-gather?? (There are 6 MCDs listed, FTR.) Can you clarify what's going on?
Should be in the must-gather:

$ sha1sum quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a.zip
914d56324cb9dffc884e5609eeb4c116f03b1f47  quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a.zip
$ unzip -l quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a.zip | grep machine-config-daemon-bzzlq
        0  2021-05-05 10:18   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/
        0  2021-05-05 10:17   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/oauth-proxy/
        0  2021-05-05 10:17   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/
    10329  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon-bzzlq.yaml
        0  2021-05-05 10:17   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/oauth-proxy/oauth-proxy/
        0  2021-05-05 10:17   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/
        0  2021-05-05 10:18   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/oauth-proxy/oauth-proxy/logs/
        0  2021-05-05 10:18   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/logs/
    69700  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/oauth-proxy/oauth-proxy/logs/current.log
        0  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/oauth-proxy/oauth-proxy/logs/previous.log
        0  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/oauth-proxy/oauth-proxy/logs/previous.insecure.log
  1037104  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/logs/current.log
        0  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/logs/previous.log
        0  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/logs/previous.insecure.log
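And, using the same yaml2json/jq helpers as above, a quick sketch to confirm which node that daemon pod is pinned to, straight from the pod YAML in the must-gather:

$ yaml2json - < namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon-bzzlq.yaml \
    | jq -r '.spec.nodeName'
# prints the node this MCD runs on, which should match the node named in the alert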
Checking on the alert history, the cluster's Alertmanager shows:

  Drain failed on build0-gstfj-w-c-b7znx.c.openshift-ci-build-farm.internal , updates may be blocked. For more details: oc logs -f -n openshift-machine-config-operator machine-config-daemon-<hash> -c machine-config-daemon

firing since 2021-05-04T23:39:36.021Z. That falls within the Degraded range from the comment 0 log snippet. Growing that log snippet a bit:

$ grep -i drain namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/logs/current.log | tail -n20
...
2021-04-30T19:43:16.836552509Z I0430 19:43:16.836467 2245 update.go:237] Successful drain took 518.677280257 seconds
2021-05-04T23:22:36.000290043Z I0504 23:22:35.999717 2245 update.go:1901] Update prepared; beginning drain
2021-05-04T23:39:17.158683480Z I0504 23:39:17.150734 2245 update.go:221] Draining failed with: [error when waiting for pod ... terminating: global timeout reached: 1m30s], retrying
2021-05-04T23:41:01.445342529Z I0504 23:41:01.445141 2245 update.go:221] Draining failed with: [error when waiting for pod ... global timeout reached: 1m30s], retrying
2021-05-04T23:42:53.181074080Z I0504 23:42:53.180944 2245 update.go:221] Draining failed with: [error when waiting for pod "prometheus-k8s-0" terminating: global timeout reached: 1m30s, error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s], retrying
2021-05-04T23:45:04.817022763Z I0504 23:45:04.816894 2245 update.go:221] Draining failed with: [error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s, error when waiting for pod "prometheus-k8s-0" terminating: global timeout reached: 1m30s], retrying
2021-05-04T23:47:56.668208726Z I0504 23:47:56.668098 2245 update.go:221] Draining failed with: [error when waiting for pod "prometheus-k8s-0" terminating: global timeout reached: 1m30s, error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s], retrying
2021-05-04T23:47:56.671027735Z E0504 23:47:56.670907 2245 writer.go:135] Marking Degraded due to: failed to drain node (5 tries): timed out waiting for the condition: [error when waiting for pod "prometheus-k8s-0" terminating: global timeout reached: 1m30s, error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s]
2021-05-04T23:47:56.799605095Z I0504 23:47:56.799579 2245 update.go:1901] Update prepared; beginning drain
2021-05-04T23:49:28.467357997Z I0504 23:49:28.467269 2245 update.go:221] Draining failed with: [error when waiting for pod "prometheus-k8s-0" terminating: global timeout reached: 1m30s, error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s], retrying
2021-05-04T23:51:10.158373495Z I0504 23:51:10.158238 2245 update.go:221] Draining failed with: [error when waiting for pod "prometheus-k8s-0" terminating: global timeout reached: 1m30s, error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s], retrying
2021-05-04T23:53:01.919309548Z I0504 23:53:01.919170 2245 update.go:221] Draining failed with: error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s, retrying
2021-05-04T23:55:10.672597261Z I0504 23:55:10.672517 2245 update.go:1901] drain complete
2021-05-04T23:55:10.677753981Z I0504 23:55:10.677621 2245 update.go:237] Successful drain took 433.872992811 seconds

So that matches up fairly well with when the alert went off.
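For reference, one way to pull the firing alert and its start time directly from Alertmanager (a sketch; it assumes amtool is shipped in the alertmanager container, which may vary by release):

$ oc -n openshift-monitoring exec alertmanager-main-0 -c alertmanager -- \
    amtool --alertmanager.url=http://localhost:9093 alert query alertname=MCDDrainError
# The "Starts At" column should line up with the 2021-05-04T23:39:36.021Z time above.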
I have a couple of ideas of what's going on here, but can you clarify what the cluster was doing / rolling out in these rendered configs on May 4th?
> ...but can you clarify what the cluster was doing / rolling out in these rendered configs on May 4th?

$ grep 'completed update for config rendered-worker-' namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/logs/current.log
2021-03-22T00:10:01.905135990Z I0322 00:10:01.905078 2245 update.go:1901] completed update for config rendered-worker-f89c57ed7b2922e1ade47cda11bacbc7
2021-04-07T06:44:02.392901728Z I0407 06:44:02.392818 2245 update.go:1901] completed update for config rendered-worker-dde10a54c9d435b2e56364f36da94d89
2021-04-30T19:43:18.657172046Z I0430 19:43:18.657077 2245 update.go:1901] completed update for config rendered-worker-b24c654a2f2e06bf8fe4d435450e5091
2021-05-04T23:55:12.340492818Z I0504 23:55:12.340379 2245 update.go:1901] completed update for config rendered-worker-dde10a54c9d435b2e56364f36da94d89
$ diff -u cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-b24c654a2f2e06bf8fe4d435450e5091.yaml cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-dde10a54c9d435b2e56364f36da94d89.yaml
--- cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-b24c654a2f2e06bf8fe4d435450e5091.yaml	2021-05-05 10:12:20.000000000 -0700
+++ cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-dde10a54c9d435b2e56364f36da94d89.yaml	2021-05-05 10:12:21.000000000 -0700
@@ -4,7 +4,7 @@
 metadata:
   annotations:
     machineconfiguration.openshift.io/generated-by-controller-version: 39de2f3d075d84c945f02d209c02c17b24d5dd71
-  creationTimestamp: "2021-04-30T17:53:00Z"
+  creationTimestamp: "2021-04-07T05:14:08Z"
   generation: 1
...
@@ -137,7 +137,7 @@
         overwrite: true
         path: /etc/NetworkManager/conf.d/sdn.conf
       - contents:
-          source: data:...
+          source: data:...
         mode: 384
         overwrite: true
         path: /var/lib/kubelet/config.json
$ function render() {
>   yaml2json - < "${1}" | jq -r '.spec.config.storage.files[] | select(.path == "/var/lib/kubelet/config.json").contents.source' | sed 's/^data:,//' | python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.stdin.read()))' | jq .
> }
$ diff -u0 <(render cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-b24c654a2f2e06bf8fe4d435450e5091.yaml) <(render cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-dde10a54c9d435b2e56364f36da94d89.yaml)
--- /dev/fd/63	2021-05-05 22:01:54.754413780 -0700
+++ /dev/fd/62	2021-05-05 22:01:54.754413780 -0700
@@ -11,3 +10,0 @@
-    "quay.io/multi-arch": {
-      "auth": "..."
-    },

So that seems to be rolling back to an older rendered-worker MachineConfig, the only thing being touched seems to be the cluster's pull secret, and the only change is dropping that one credential.
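For completeness, a live-cluster cross-check (not something the must-gather contains; assumes cluster access) of which registries currently have creds in the pull secret that those rendered configs are templated from:

$ oc -n openshift-config get secret pull-secret \
    -o go-template='{{index .data ".dockerconfigjson" | base64decode}}' \
  | jq -r '.auths | keys[]'
# If quay.io/multi-arch is absent here too, that matches the rendered-config diff above.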
Did some digging, and this issue only exists in 4.7.0-4.7.3 clusters. In those clusters the MCO drains the node without rebooting for several changes, including SSH keys, pull secrets, and CA certs; this was changed in >=4.7.4, where changing those no longer incurs a drain. That means that on those clusters, if a drain alert fired, the combination of the dual metric series per MCD and the lack of a reboot would leave the alert firing even after the issue was resolved.

The metric has a fix in flight as part of https://bugzilla.redhat.com/show_bug.cgi?id=1853264. The drain itself was removed, so this BZ can't be hit in 4.7.4+, via https://bugzilla.redhat.com/show_bug.cgi?id=1853264

*** This bug has been marked as a duplicate of bug 1853264 ***
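Until an affected cluster picks up the fix, one possible mitigation (my assumption, not something prescribed in the linked BZ) is to restart the stuck daemon pod once its node is back at the desired config; the daemonset recreates the pod under a new name, the stale err="WaitTimeout" series stops being scraped, and the alert should resolve:

$ oc -n openshift-machine-config-operator delete pod machine-config-daemon-bzzlq
# The machine-config-daemon daemonset re-creates the pod; check Alertmanager afterwards.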