Bug 1829999 - Warn about stuck-draining nodes so admins can mitigate
Summary: Warn about stuck-draining nodes so admins can mitigate
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Kirsten Garrison
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-04-30 17:19 UTC by W. Trevor King
Modified: 2020-10-05 20:59 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-05 20:59:45 UTC
Target Upstream Version:
Embargoed:



Description W. Trevor King 2020-04-30 17:19:01 UTC
Bug 1822269 (4.3) and bug 1822268 (4.4) fixed CRI-O bugs around stuck-in-terminating pods.  That protects nodes launched with the fixed CRI-O, but clusters whose nodes are still running the old, affected CRI-O may still have stuck-in-terminating pods blocking node drains.  Users should be alerted to these stuck nodes and pointed at suggested mitigations.  We may already alert on "we've been trying to drain $NODE, but have been stuck for a very long $DURATION".  If so, that would be a good place for a link to a Knowledgebase article or some such describing suggested diagnostics and mitigations.  Or the MCO could grow logic around auto-recovering from this situation, although personally I'm not clear on how it would distinguish between "stuck on drain because of a CRI-O bug, force-terminate" and "stuck on drain because the pod is blocked by a PDB, notify the admin about possible manual intervention".  But whether this is automatic or manual, we need some process in place to get folks who may not realize they are still impacted by the stuck-in-terminating bugs healed up and back to having healthy, maintained Machine(Set)s.
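
In the meantime, something like this would surface cordoned nodes and any pods stuck in Terminating that might be blocking their drains (an untested sketch; it assumes live-cluster access with oc, and guards against pods that have no nodeName):

  $ oc get nodes --field-selector spec.unschedulable=true
  $ oc get pods --all-namespaces -o json \
      | jq -r '.items[] | select(.metadata.deletionTimestamp != null)
          | .metadata.deletionTimestamp + " " + (.spec.nodeName // "") + " " + .metadata.namespace + "/" + .metadata.name'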

Comment 1 W. Trevor King 2020-04-30 17:21:49 UTC
This is similar to bug 1829664, but the two are slightly different.  That one is about blocking folks from making multiple minor bumps while they have stuck machines lagging behind by a minor version.  This one is about unsticking those stuck machines so folks can avoid that block.

Comment 2 Antonio Murdaca 2020-04-30 17:23:10 UTC
What's the difference from https://bugzilla.redhat.com/show_bug.cgi?id=1829664?  Can we just compact all of this into one bug?  The solution seems pretty straightforward and will include a block-and-notify mechanism.

Comment 3 Antonio Murdaca 2020-04-30 17:24:34 UTC
(In reply to W. Trevor King from comment #1)
> This is similar to bug 1829664, but they're slightly different.  That one is
> about blocking folks from making multiple minor bumps while folks have stuck
> machines that lag behind by a minor version.  This one is about unsticking
> stuck machines so folks can avoid that block.

The signal about a stuck situation will include an unblock path one way or the other, so I don't see why we need this extra BZ.

I think what's described here is going to be part of that other BZ, so it's covered there anyway.

Comment 4 W. Trevor King 2020-04-30 19:29:53 UTC
> I think what is here is gonna be part of that other BZ so it's covered there anyway

If there is a single PR that creates both the Upgradeable=False (or similar) blocker requested in bug 1829664 and the alert (or similar) upgrade-agnostic resolver, I'm fine with that.  Can we close one or the other of these as a dup if that happens, but leave both open until then to track the potentially decoupled fixes?  I don't want to hold up, say, an informing alert if we have that ready for this bug before we have some kind of update blocker in place for bug 1829664.

Comment 6 W. Trevor King 2020-05-01 02:27:08 UTC
> we do have a drain alert...

Hmm, that looks good to me.  And the CI cluster was firing these before yesterday's excitement.  Maybe it's just missing links to docs that cover the impact on the cluster and suggest mitigation procedures?
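
For the record, here's one way to check what the existing rule carries today (a sketch; it assumes the alert is defined in a PrometheusRule object and that you have cluster access):

  $ oc get prometheusrules --all-namespaces -o json \
      | jq '.items[] | .spec.groups[]? | .rules[]? | select(.alert == "MCDDrainError") | .annotations'

If the annotations only carry the generic message and no doc/runbook link, that's presumably where a pointer would go.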

Comment 7 Kirsten Garrison 2020-05-01 02:29:03 UTC
@wking what docs do you suggest we link to? I can take a look at what we can do / what makes the most sense.

Comment 8 W. Trevor King 2020-05-01 02:38:56 UTC
> I can take a look at what we can do/makes the most sense?

Probably good to have test-platform folks weigh in on how they reacted to the existing alert and what would have helped them understand and recover on their own.  But I dunno where to link.  [1,2] have some thoughts on potential candidates, but that's just me talking into the void.  Anything that makes sense to MCO folks in this space would be good, and we can always pivot later if we decide on a better place for some reason.

[1]: https://github.com/openshift/enhancements/commit/2b05511d7671ef9b75c8c8f01b69d06857dba0a0#diff-38b9cc6beb67f344dc174429e307aae2R31-R37
[2]: https://github.com/openshift/enhancements/pull/298

Comment 10 W. Trevor King 2020-05-01 21:21:08 UTC
History of the CI cluster:

* 2020-04-06T16:45Z: new 'kublet-debug' MachineConfig for workers [1,2].
* 2020-04-06T16:47Z: rendered-worker-965bc0a2f7f5db799ded97e2c12b800e created [7].
* 2020-04-08T20:31Z: worker MachineConfigPool sticks updating to rendered-worker-965bc0a2f7f5db799ded97e2c12b800e [5,6].
* 2020-04-16T17:03Z: Update to 4.3.0-0.nightly-2020-04-13-190424 initiated; completes in a bit over an hour [3].
* 2020-04-27T20:38Z: Update to unknown-version followed by 4.3.0-0.nightly-2020-04-13-190424, 4.3.17, and 4.4.0-rc.12 [3].
* 2020-04-27T23:02Z: Update to 4.4.0-rc.12 completes [3].
* 2020-04-27T23:48Z: MCDDrainError starts firing, bug 1822269 gets a (private, sorry) comment with node logs.
  [FIRING:1] MCDDrainError machine-config-daemon (190.373587037 sec metrics 5 tries: error when evicting pod "knative-eventing-filter-build": pods "knative-eventing-filter-build" is forbidden: unable to create new content in namespace ci-op-fthy1lds because it is being terminated 10.0.147.192:9001 openshift-machine-config-operator machine-config-daemon-mp4jv openshift-monitoring/k8s machine-config-daemon critical)
Drain failed on  , updates may be blocked. For more details:  oc logs -f -n openshift-machine-config-operator machine-config-daemon-<hash> -c machine-config-daemon
* 2020-04-28T14:13Z: must-gather collected

Oddities:

* No logs for machine-config-daemon-mp4jv in the must-gather [4].
* "All nodes are updating to rendered-worker-965bc0a2f7f5db799ded97e2c12b800e", from the MachineConfigPool [6] conflicts with node annotations, which show a range of desiredConfig value [8].  Three nodes seem to be happily on that version, with both desiredConfig and currentConfig matching, but one is attempting 965bc0... -> fd5d710..., one is on fd5d710..., and one is on 8758e49... [8] (8758e49... may be from the 16th's 4.3.0-0.nightly-2020-04-13-190424 update).  All three m5d4x seem happy at 049d20...
* The worker MachineConfigPool's 'machineCount: 9' [5] seems to count both worker and worker-m5d4x nodes [8].
* All the nodes have ready kubelets, many of which do not seem to have been restarted during the update to rc.12 [9].

Must-gathers don't seem to gather alerts (!), so I'm not sure if this cluster was still firing MCDDrainError when the must-gather was collected.  But it seems like *something* should have been firing in this space since at least 2020-04-08 when the worker MachineConfigPool got stuck.
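
For next time: on the live cluster, something like this should show whether MCDDrainError is still pending/firing (a sketch; it assumes curl is available in the prometheus container, which may need adjusting):

  $ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
      curl -s http://localhost:9090/api/v1/alerts \
      | jq -r '.data.alerts[] | select(.labels.alertname == "MCDDrainError") | .state + " since " + .activeAt'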

[1]: https://github.com/openshift/release/pull/8129/files#diff-942e0bef7d32b2de4004d7c529bce15aR5-R6
[2]: https://github.com/openshift/release/pull/8129#event-3206265985
[3]: $ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
     2020-04-27T22:02:43Z 2020-04-27T23:00:42Z 4.4.0-rc.12 Completed true
     2020-04-27T21:14:02Z 2020-04-27T21:55:13Z 4.3.17 Completed true
     2020-04-27T20:50:40Z 2020-04-27T20:51:18Z 4.3.0-0.nightly-2020-04-13-190424 Completed false
     2020-04-27T20:38:33Z 2020-04-27T20:50:40Z  Partial true
     2020-04-16T17:03:11Z 2020-04-16T18:13:18Z 4.3.0-0.nightly-2020-04-13-190424 Completed false
     2020-03-24T12:22:21Z 2020-03-24T13:06:13Z 4.3.0-0.nightly-2020-03-23-130439 Completed false
     ...
[4]: $ (cd namespaces/openshift-machine-config-operator/pods && ls -d machine-config-daemon-*)
     machine-config-daemon-29lhv  machine-config-daemon-4ngdw  machine-config-daemon-8z46n  machine-config-daemon-bvbf7  machine-config-daemon-hbsf7  machine-config-daemon-p6hp9
     machine-config-daemon-2mpx2  machine-config-daemon-7kllh  machine-config-daemon-bllbj  machine-config-daemon-fslhs  machine-config-daemon-m5jxl  machine-config-daemon-tp7f9
[5]: $ tail -n6 cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigpools/worker.yaml
       degradedMachineCount: 0
       machineCount: 9
       observedGeneration: 13
       readyMachineCount: 1
       unavailableMachineCount: 1
       updatedMachineCount: 1
[6]: $ yaml2json <cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigpools/worker.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + .reason + " " + .message'
     2020-01-30T13:45:46Z RenderDegraded False  
     2020-04-08T20:31:31Z Updated False  
     2020-04-08T20:31:31Z Updating True  All nodes are updating to rendered-worker-965bc0a2f7f5db799ded97e2c12b800e
     2020-04-28T00:26:35Z Degraded False  
     2020-04-28T00:26:35Z NodeDegraded False
[7]: $ for P in cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-[0-9a-f]*.yaml; do yaml2json < "${P}" | jq -r '.metadata.creationTimestamp + " " + .metadata.name'; done | sort
     2020-01-30T13:45:46Z rendered-worker-9f3937d51d62634b6d9ab2fe9c93b416
     2020-02-24T18:43:15Z rendered-worker-450ac93c7d08c437ad42815ddfd69056
     2020-02-26T14:26:58Z rendered-worker-a12fa4a597a22d2935bd9184e63fa601
     2020-03-05T15:45:42Z rendered-worker-2e0479535983547a0bcc9c3c043466a8
     2020-03-09T17:40:33Z rendered-worker-92a1e7a277ffefcf5fa4fa3d767cc52a
     2020-03-24T12:53:39Z rendered-worker-29d660dd6147feab0db552fbe701e9b1
     2020-04-03T16:42:35Z rendered-worker-b3b603424816b24c4fcc69582af95f68
     2020-04-06T16:47:34Z rendered-worker-965bc0a2f7f5db799ded97e2c12b800e
     2020-04-16T18:01:10Z rendered-worker-8758e49e3f03d58f6f2f1be35ea7ec7f
     2020-04-27T21:42:35Z rendered-worker-48c36cdab5e98e3a58c1c4887ab97bd4
     2020-04-27T22:42:37Z rendered-worker-fd5d7107556afa4cfaa157fe73e4a36e
[8]: $ for P in cluster-scoped-resources/core/nodes/*; do yaml2json < "${P}" | jq -r '.metadata | .annotations["machineconfiguration.openshift.io/desiredConfig"] + " " + .annotations["machineconfiguration.openshift.io/currentConfig"] + " " + .name'; done | grep rendered-worker | sort
     rendered-worker-8758e49e3f03d58f6f2f1be35ea7ec7f rendered-worker-8758e49e3f03d58f6f2f1be35ea7ec7f ip-10-0-174-84.ec2.internal
     rendered-worker-965bc0a2f7f5db799ded97e2c12b800e rendered-worker-965bc0a2f7f5db799ded97e2c12b800e ip-10-0-130-84.ec2.internal
     rendered-worker-965bc0a2f7f5db799ded97e2c12b800e rendered-worker-965bc0a2f7f5db799ded97e2c12b800e ip-10-0-132-55.ec2.internal
     rendered-worker-965bc0a2f7f5db799ded97e2c12b800e rendered-worker-965bc0a2f7f5db799ded97e2c12b800e ip-10-0-141-108.ec2.internal
     rendered-worker-fd5d7107556afa4cfaa157fe73e4a36e rendered-worker-965bc0a2f7f5db799ded97e2c12b800e ip-10-0-143-120.ec2.internal
     rendered-worker-fd5d7107556afa4cfaa157fe73e4a36e rendered-worker-fd5d7107556afa4cfaa157fe73e4a36e ip-10-0-152-34.ec2.internal
     rendered-worker-m5d4x-049d207be5cfe536c82a9ac80757780c rendered-worker-m5d4x-049d207be5cfe536c82a9ac80757780c ip-10-0-143-70.ec2.internal
     rendered-worker-m5d4x-049d207be5cfe536c82a9ac80757780c rendered-worker-m5d4x-049d207be5cfe536c82a9ac80757780c ip-10-0-146-117.ec2.internal
     rendered-worker-m5d4x-049d207be5cfe536c82a9ac80757780c rendered-worker-m5d4x-049d207be5cfe536c82a9ac80757780c ip-10-0-166-107.ec2.internal
[9]: $ for P in cluster-scoped-resources/core/nodes/*; do yaml2json < "${P}" | jq -r '(.status.conditions[] | select(.type == "Ready") | .lastTransitionTime + " " + .lastHeartbeatTime + " " + .status) + " " + .metadata.name'; done | sort
     2020-04-16T22:45:11Z 2020-04-28T14:18:22Z True ip-10-0-143-120.ec2.internal
     2020-04-17T08:39:38Z 2020-04-28T14:18:03Z True ip-10-0-132-55.ec2.internal
     2020-04-17T14:14:24Z 2020-04-28T14:18:30Z True ip-10-0-141-108.ec2.internal
     2020-04-17T15:02:07Z 2020-04-28T14:18:21Z True ip-10-0-166-107.ec2.internal
     2020-04-17T15:07:06Z 2020-04-28T14:18:33Z True ip-10-0-146-117.ec2.internal
     2020-04-20T13:31:19Z 2020-04-28T14:18:42Z True ip-10-0-174-84.ec2.internal
     2020-04-27T14:41:06Z 2020-04-28T14:18:17Z True ip-10-0-130-84.ec2.internal
     2020-04-27T14:43:12Z 2020-04-28T14:18:17Z True ip-10-0-143-70.ec2.internal
     2020-04-27T22:45:55Z 2020-04-28T14:17:06Z True ip-10-0-159-123.ec2.internal
     2020-04-27T22:50:28Z 2020-04-28T14:16:46Z True ip-10-0-140-81.ec2.internal
     2020-04-27T22:54:27Z 2020-04-28T14:16:21Z True ip-10-0-175-171.ec2.internal
     2020-04-28T00:14:31Z 2020-04-28T14:14:28Z True ip-10-0-152-34.ec2.internal

Comment 11 W. Trevor King 2020-05-01 21:22:10 UTC
Oops.  I'd been thinking about re-assigning to the machine-API operator, but it seems like this is really a machine-config issue, not a Machine issue.  Moving it back...

Comment 15 Antonio Murdaca 2020-06-16 12:58:00 UTC
Adding UpcomingSprint as this won't make the current sprint. We'll try to work on this bug in the next sprint.

Comment 18 Kirsten Garrison 2020-09-01 20:03:55 UTC
After thinking about this a lot, I think this would overload the alert itself, and that this bug is really pointing at a hole in our general troubleshooting documentation...
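
For example, such a doc would basically need to walk through: use the MCD logs to see which pod the drain is stuck on (the alert already prints the oc logs command for that), check whether the pod is genuinely stuck in Terminating rather than held back by a PDB, and only then force-remove it so the drain can proceed.  Roughly (a sketch with placeholder namespace/pod names, and with the caveat that force deletion is only appropriate when the kubelet/CRI-O side is what's wedged):

  $ oc get pod -n <namespace> <pod> -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'
  $ oc delete pod -n <namespace> <pod> --grace-period=0 --force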

Comment 20 Michelle Krejci 2020-10-05 20:59:45 UTC
Given that this issue has been pushed out over the last two releases, I am closing it.  Please reopen if it's worth revisiting.

