Bug 1957421
| Summary: | MCDDrainError continues to fire despite a successful drain | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Machine Config Operator | Assignee: | Kirsten Garrison <kgarriso> |
| Status: | CLOSED DUPLICATE | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | low | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.7 | CC: | kgarriso |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-06 23:39:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
W. Trevor King
2021-05-05 18:25:10 UTC
Could not reproduce this - tested in 4.6/4.7 and the alert stopped firing correctly. I note that the firing alert is from machine-config-daemon-bzzlq, but this MCD isn't in your must-gather? (There are 6 MCDs listed, for the record.) Can you clarify what's going on?

Should be in the must-gather:

$ sha1sum quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a.zip
914d56324cb9dffc884e5609eeb4c116f03b1f47  quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a.zip
$ unzip -l quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a.zip | grep machine-config-daemon-bzzlq
       0  2021-05-05 10:18   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/
       0  2021-05-05 10:17   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/oauth-proxy/
       0  2021-05-05 10:17   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/
   10329  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon-bzzlq.yaml
       0  2021-05-05 10:17   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/oauth-proxy/oauth-proxy/
       0  2021-05-05 10:17   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/
       0  2021-05-05 10:18   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/oauth-proxy/oauth-proxy/logs/
       0  2021-05-05 10:18   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/logs/
   69700  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/oauth-proxy/oauth-proxy/logs/current.log
       0  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/oauth-proxy/oauth-proxy/logs/previous.log
       0  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/oauth-proxy/oauth-proxy/logs/previous.insecure.log
 1037104  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/logs/current.log
       0  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/logs/previous.log
       0  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/logs/previous.insecure.log
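(The relative namespaces/... paths in the greps below assume the archive has been extracted and that we're working from inside the extracted directory, e.g.:)

$ unzip quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a.zip
$ cd quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a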
Checking on the alert history, the cluster's Alertmanager shows:

    Drain failed on build0-gstfj-w-c-b7znx.c.openshift-ci-build-farm.internal, updates may be blocked. For more details: oc logs -f -n openshift-machine-config-operator machine-config-daemon-<hash> -c machine-config-daemon

firing since 2021-05-04T23:39:36.021Z. That falls within the Degraded range from the comment 0 log snippet.
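(For cross-checking the firing window without the Alertmanager UI, the same data can be read from the in-cluster Prometheus; a sketch, assuming curl is available in the prometheus container and that ALERTS_FOR_STATE carries the activation timestamp as its sample value:)

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -sG --data-urlencode 'query=ALERTS{alertname="MCDDrainError",alertstate="firing"}' \
    http://localhost:9090/api/v1/query
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -sG --data-urlencode 'query=ALERTS_FOR_STATE{alertname="MCDDrainError"}' \
    http://localhost:9090/api/v1/query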
Growing that log snippet a bit:

$ grep -i drain namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/logs/current.log | tail -n20
...
2021-04-30T19:43:16.836552509Z I0430 19:43:16.836467 2245 update.go:237] Successful drain took 518.677280257 seconds
2021-05-04T23:22:36.000290043Z I0504 23:22:35.999717 2245 update.go:1901] Update prepared; beginning drain
2021-05-04T23:39:17.158683480Z I0504 23:39:17.150734 2245 update.go:221] Draining failed with: [error when waiting for pod ... terminating: global timeout reached: 1m30s], retrying
2021-05-04T23:41:01.445342529Z I0504 23:41:01.445141 2245 update.go:221] Draining failed with: [error when waiting for pod ... global timeout reached: 1m30s], retrying
2021-05-04T23:42:53.181074080Z I0504 23:42:53.180944 2245 update.go:221] Draining failed with: [error when waiting for pod "prometheus-k8s-0" terminating: global timeout reached: 1m30s, error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s], retrying
2021-05-04T23:45:04.817022763Z I0504 23:45:04.816894 2245 update.go:221] Draining failed with: [error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s, error when waiting for pod "prometheus-k8s-0" terminating: global timeout reached: 1m30s], retrying
2021-05-04T23:47:56.668208726Z I0504 23:47:56.668098 2245 update.go:221] Draining failed with: [error when waiting for pod "prometheus-k8s-0" terminating: global timeout reached: 1m30s, error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s], retrying
2021-05-04T23:47:56.671027735Z E0504 23:47:56.670907 2245 writer.go:135] Marking Degraded due to: failed to drain node (5 tries): timed out waiting for the condition: [error when waiting for pod "prometheus-k8s-0" terminating: global timeout reached: 1m30s, error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s]
2021-05-04T23:47:56.799605095Z I0504 23:47:56.799579 2245 update.go:1901] Update prepared; beginning drain
2021-05-04T23:49:28.467357997Z I0504 23:49:28.467269 2245 update.go:221] Draining failed with: [error when waiting for pod "prometheus-k8s-0" terminating: global timeout reached: 1m30s, error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s], retrying
2021-05-04T23:51:10.158373495Z I0504 23:51:10.158238 2245 update.go:221] Draining failed with: [error when waiting for pod "prometheus-k8s-0" terminating: global timeout reached: 1m30s, error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s], retrying
2021-05-04T23:53:01.919309548Z I0504 23:53:01.919170 2245 update.go:221] Draining failed with: error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s, retrying
2021-05-04T23:55:10.672597261Z I0504 23:55:10.672517 2245 update.go:1901] drain complete
2021-05-04T23:55:10.677753981Z I0504 23:55:10.677621 2245 update.go:237] Successful drain took 433.872992811 seconds

So that matches up fairly well with when the alert went off. I have a couple of ideas about what's going on here, but can you clarify what the cluster was doing / rolling out in these rendered configs on May 4th?

> ...but can you clarify what the cluster was doing / rolling out in these rendered configs on May 4th?

$ grep 'completed update for config rendered-worker-' namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/logs/current.log
2021-03-22T00:10:01.905135990Z I0322 00:10:01.905078 2245 update.go:1901] completed update for config rendered-worker-f89c57ed7b2922e1ade47cda11bacbc7
2021-04-07T06:44:02.392901728Z I0407 06:44:02.392818 2245 update.go:1901] completed update for config rendered-worker-dde10a54c9d435b2e56364f36da94d89
2021-04-30T19:43:18.657172046Z I0430 19:43:18.657077 2245 update.go:1901] completed update for config rendered-worker-b24c654a2f2e06bf8fe4d435450e5091
2021-05-04T23:55:12.340492818Z I0504 23:55:12.340379 2245 update.go:1901] completed update for config rendered-worker-dde10a54c9d435b2e56364f36da94d89

$ diff -u cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-b24c654a2f2e06bf8fe4d435450e5091.yaml cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-dde10a54c9d435b2e56364f36da94d89.yaml
--- cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-b24c654a2f2e06bf8fe4d435450e5091.yaml	2021-05-05 10:12:20.000000000 -0700
+++ cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-dde10a54c9d435b2e56364f36da94d89.yaml	2021-05-05 10:12:21.000000000 -0700
@@ -4,7 +4,7 @@
 metadata:
   annotations:
     machineconfiguration.openshift.io/generated-by-controller-version: 39de2f3d075d84c945f02d209c02c17b24d5dd71
-  creationTimestamp: "2021-04-30T17:53:00Z"
+  creationTimestamp: "2021-04-07T05:14:08Z"
   generation: 1
...
@@ -137,7 +137,7 @@
       overwrite: true
       path: /etc/NetworkManager/conf.d/sdn.conf
     - contents:
-        source: data:...
+        source: data:...
       mode: 384
       overwrite: true
       path: /var/lib/kubelet/config.json

$ function render() {
>   yaml2json - < "${1}" | jq -r '.spec.config.storage.files[] | select(.path == "/var/lib/kubelet/config.json").contents.source' | sed 's/^data:,//' | python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.stdin.read()))' | jq .
> }
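(If yaml2json isn't handy, roughly the same decode can be done with python3 alone; a sketch, assuming the PyYAML module is installed, which is not part of the original workflow:)

$ function render2() {
>   python3 -c '
> import json, sys, urllib.parse
> import yaml  # PyYAML; an assumption, not used in the render() above
> mc = yaml.safe_load(open(sys.argv[1]))
> for f in mc["spec"]["config"]["storage"]["files"]:
>     if f["path"] == "/var/lib/kubelet/config.json":
>         # Ignition stores the file body as a data: URL; strip the prefix and unquote it
>         body = urllib.parse.unquote(f["contents"]["source"].split(",", 1)[1])
>         print(json.dumps(json.loads(body), indent=2))
> ' "${1}"
> }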
$ diff -u0 <(render cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-b24c654a2f2e06bf8fe4d435450e5091.yaml) <(render cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-dde10a54c9d435b2e56364f36da94d89.yaml)
--- /dev/fd/63	2021-05-05 22:01:54.754413780 -0700
+++ /dev/fd/62	2021-05-05 22:01:54.754413780 -0700
@@ -11,3 +10,0 @@
-    "quay.io/multi-arch": {
-      "auth": "..."
-    },

So that seems to be rolling back to an older rendered-worker MachineConfig, and the only thing being touched is the cluster's pull secret, where the only change is dropping that one credential.

Did some digging, and this issue only exists in 4.7.0-4.7.3 clusters. In those clusters the MCO drains without rebooting on several changes, including SSH keys, pull secrets, and CA certs; this was changed in >=4.7.4, where changing those no longer incurs a drain. That means that in those clusters, if a drain alert fired, the dual metric per MCD combined with the lack of a reboot would leave the alert firing even after the issue was resolved. The metric has a fix in flight as part of https://bugzilla.redhat.com/show_bug.cgi?id=1853264. The drain itself was removed, so this BZ can't be hit in 4.7.4+, via https://bugzilla.redhat.com/show_bug.cgi?id=1853264.

*** This bug has been marked as a duplicate of bug 1853264 ***
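(For clusters still on 4.7.0-4.7.3 that hit this, one way to see whether the alert is being held open by a stale per-daemon metric is to query it straight from Prometheus; a sketch, and the metric name mcd_drain_err is an assumption on my part from the alert wording, so confirm it against the MCO version in use:)

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -sG --data-urlencode 'query=mcd_drain_err != 0' \
    http://localhost:9090/api/v1/query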