Bug 1957421
| Summary: | MCDDrainError continues to fire despite a successful drain | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Machine Config Operator | Assignee: | Kirsten Garrison <kgarriso> |
| Status: | CLOSED DUPLICATE | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | low | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.7 | CC: | kgarriso |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-06 23:39:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
W. Trevor King
2021-05-05 18:25:10 UTC
Could not reproduce this - tested in 4.6/4.7 and the alert stopped firing correctly. I note that the firing alert is from machine-config-daemon-bzzlq, but this MCD isn't in your must-gather? (There are 6 MCDs listed, for the record.) Can you clarify what's going on?

Should be in the must-gather:

$ sha1sum quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a.zip
914d56324cb9dffc884e5609eeb4c116f03b1f47  quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a.zip
$ unzip -l quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a.zip | grep machine-config-daemon-bzzlq
       0  2021-05-05 10:18   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/
       0  2021-05-05 10:17   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/oauth-proxy/
       0  2021-05-05 10:17   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/
   10329  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon-bzzlq.yaml
       0  2021-05-05 10:17   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/oauth-proxy/oauth-proxy/
       0  2021-05-05 10:17   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/
       0  2021-05-05 10:18   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/oauth-proxy/oauth-proxy/logs/
       0  2021-05-05 10:18   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/logs/
   69700  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/oauth-proxy/oauth-proxy/logs/current.log
       0  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/oauth-proxy/oauth-proxy/logs/previous.log
       0  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/oauth-proxy/oauth-proxy/logs/previous.insecure.log
 1037104  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/logs/current.log
       0  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/logs/previous.log
       0  2021-05-05 10:12   quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/logs/previous.insecure.log
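(The relative namespaces/... paths in the greps below assume the archive has been extracted and that we're working from inside the extracted directory, e.g.:)

$ unzip quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a.zip
$ cd quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-218d8d18c27b34e65864ccd5adb4d9c8b32ab004935fceba9e513d6beb51562a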
Checking on the alert history, the cluster's Alertmanager shows:

    Drain failed on build0-gstfj-w-c-b7znx.c.openshift-ci-build-farm.internal, updates may be blocked. For more details: oc logs -f -n openshift-machine-config-operator machine-config-daemon-<hash> -c machine-config-daemon

firing since 2021-05-04T23:39:36.021Z. That falls within the Degraded range from the comment 0 log snippet.
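(For cross-checking the firing window without the Alertmanager UI, the same data can be read from the in-cluster Prometheus; a sketch, assuming curl is available in the prometheus container and that ALERTS_FOR_STATE carries the activation timestamp as its sample value:)

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -sG --data-urlencode 'query=ALERTS{alertname="MCDDrainError",alertstate="firing"}' \
    http://localhost:9090/api/v1/query
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -sG --data-urlencode 'query=ALERTS_FOR_STATE{alertname="MCDDrainError"}' \
    http://localhost:9090/api/v1/query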
Growing that log snippet a bit:

$ grep -i drain namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/logs/current.log | tail -n20
...
2021-04-30T19:43:16.836552509Z I0430 19:43:16.836467 2245 update.go:237] Successful drain took 518.677280257 seconds
2021-05-04T23:22:36.000290043Z I0504 23:22:35.999717 2245 update.go:1901] Update prepared; beginning drain
2021-05-04T23:39:17.158683480Z I0504 23:39:17.150734 2245 update.go:221] Draining failed with: [error when waiting for pod ... terminating: global timeout reached: 1m30s], retrying
2021-05-04T23:41:01.445342529Z I0504 23:41:01.445141 2245 update.go:221] Draining failed with: [error when waiting for pod ... global timeout reached: 1m30s], retrying
2021-05-04T23:42:53.181074080Z I0504 23:42:53.180944 2245 update.go:221] Draining failed with: [error when waiting for pod "prometheus-k8s-0" terminating: global timeout reached: 1m30s, error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s], retrying
2021-05-04T23:45:04.817022763Z I0504 23:45:04.816894 2245 update.go:221] Draining failed with: [error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s, error when waiting for pod "prometheus-k8s-0" terminating: global timeout reached: 1m30s], retrying
2021-05-04T23:47:56.668208726Z I0504 23:47:56.668098 2245 update.go:221] Draining failed with: [error when waiting for pod "prometheus-k8s-0" terminating: global timeout reached: 1m30s, error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s], retrying
2021-05-04T23:47:56.671027735Z E0504 23:47:56.670907 2245 writer.go:135] Marking Degraded due to: failed to drain node (5 tries): timed out waiting for the condition: [error when waiting for pod "prometheus-k8s-0" terminating: global timeout reached: 1m30s, error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s]
2021-05-04T23:47:56.799605095Z I0504 23:47:56.799579 2245 update.go:1901] Update prepared; beginning drain
2021-05-04T23:49:28.467357997Z I0504 23:49:28.467269 2245 update.go:221] Draining failed with: [error when waiting for pod "prometheus-k8s-0" terminating: global timeout reached: 1m30s, error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s], retrying
2021-05-04T23:51:10.158373495Z I0504 23:51:10.158238 2245 update.go:221] Draining failed with: [error when waiting for pod "prometheus-k8s-0" terminating: global timeout reached: 1m30s, error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s], retrying
2021-05-04T23:53:01.919309548Z I0504 23:53:01.919170 2245 update.go:221] Draining failed with: error when waiting for pod "e2e-aws" terminating: global timeout reached: 1m30s, retrying
2021-05-04T23:55:10.672597261Z I0504 23:55:10.672517 2245 update.go:1901] drain complete
2021-05-04T23:55:10.677753981Z I0504 23:55:10.677621 2245 update.go:237] Successful drain took 433.872992811 seconds

So that matches up fairly well with when the alert went off. I have a couple of ideas about what's going on here, but can you clarify what the cluster was doing / rolling out in these rendered configs on May 4th?

> ...but can you clarify what the cluster was doing / rolling out in these rendered configs on May 4th?

$ grep 'completed update for config rendered-worker-' namespaces/openshift-machine-config-operator/pods/machine-config-daemon-bzzlq/machine-config-daemon/machine-config-daemon/logs/current.log
2021-03-22T00:10:01.905135990Z I0322 00:10:01.905078 2245 update.go:1901] completed update for config rendered-worker-f89c57ed7b2922e1ade47cda11bacbc7
2021-04-07T06:44:02.392901728Z I0407 06:44:02.392818 2245 update.go:1901] completed update for config rendered-worker-dde10a54c9d435b2e56364f36da94d89
2021-04-30T19:43:18.657172046Z I0430 19:43:18.657077 2245 update.go:1901] completed update for config rendered-worker-b24c654a2f2e06bf8fe4d435450e5091
2021-05-04T23:55:12.340492818Z I0504 23:55:12.340379 2245 update.go:1901] completed update for config rendered-worker-dde10a54c9d435b2e56364f36da94d89

$ diff -u cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-b24c654a2f2e06bf8fe4d435450e5091.yaml cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-dde10a54c9d435b2e56364f36da94d89.yaml
--- cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-b24c654a2f2e06bf8fe4d435450e5091.yaml	2021-05-05 10:12:20.000000000 -0700
+++ cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-dde10a54c9d435b2e56364f36da94d89.yaml	2021-05-05 10:12:21.000000000 -0700
@@ -4,7 +4,7 @@
 metadata:
   annotations:
     machineconfiguration.openshift.io/generated-by-controller-version: 39de2f3d075d84c945f02d209c02c17b24d5dd71
-  creationTimestamp: "2021-04-30T17:53:00Z"
+  creationTimestamp: "2021-04-07T05:14:08Z"
   generation: 1
...
@@ -137,7 +137,7 @@
       overwrite: true
       path: /etc/NetworkManager/conf.d/sdn.conf
     - contents:
-        source: data:...
+        source: data:...
       mode: 384
       overwrite: true
       path: /var/lib/kubelet/config.json

$ function render() {
>   yaml2json - < "${1}" | jq -r '.spec.config.storage.files[] | select(.path == "/var/lib/kubelet/config.json").contents.source' | sed 's/^data:,//' | python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.stdin.read()))' | jq .
> }
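(If yaml2json isn't handy, roughly the same decode can be done with python3 alone; a sketch, assuming the PyYAML module is installed, which is not part of the original workflow:)

$ function render2() {
>   python3 -c '
> import json, sys, urllib.parse
> import yaml  # PyYAML; an assumption, not used in the render() above
> mc = yaml.safe_load(open(sys.argv[1]))
> for f in mc["spec"]["config"]["storage"]["files"]:
>     if f["path"] == "/var/lib/kubelet/config.json":
>         # Ignition stores the file body as a data: URL; strip the prefix and unquote it
>         body = urllib.parse.unquote(f["contents"]["source"].split(",", 1)[1])
>         print(json.dumps(json.loads(body), indent=2))
> ' "${1}"
> }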
$ diff -u0 <(render cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-b24c654a2f2e06bf8fe4d435450e5091.yaml) <(render cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-dde10a54c9d435b2e56364f36da94d89.yaml)
--- /dev/fd/63	2021-05-05 22:01:54.754413780 -0700
+++ /dev/fd/62	2021-05-05 22:01:54.754413780 -0700
@@ -11,3 +10,0 @@
-    "quay.io/multi-arch": {
-      "auth": "..."
-    },

So that seems to be rolling back to an older rendered-worker MachineConfig, and the only thing being touched is the cluster's pull secret, where the only change is dropping that one credential.

Did some digging, and this issue only exists in 4.7.0-4.7.3 clusters. In those clusters the MCO drains without rebooting on several changes, including SSH keys, pull secrets, and CA certs; this was changed in >=4.7.4, where changing those no longer incurs a drain. That means that in those clusters, if a drain alert fired, the dual metric per MCD combined with the lack of a reboot would leave the alert firing even after the issue was resolved. The metric has a fix in flight as part of https://bugzilla.redhat.com/show_bug.cgi?id=1853264. The drain itself was removed, so this BZ can't be hit in 4.7.4+, via https://bugzilla.redhat.com/show_bug.cgi?id=1853264.

*** This bug has been marked as a duplicate of bug 1853264 ***
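(For clusters still on 4.7.0-4.7.3 that hit this, one way to see whether the alert is being held open by a stale per-daemon metric is to query it straight from Prometheus; a sketch, and the metric name mcd_drain_err is an assumption on my part from the alert wording, so confirm it against the MCO version in use:)

$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
    curl -sG --data-urlencode 'query=mcd_drain_err != 0' \
    http://localhost:9090/api/v1/query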