Bug 1822269 (4.3) and bug 1822268 (4.4) fixed CRI-O bugs around stuck-in-terminating pods. That protects nodes launched with the fixed CRI-O, but clusters may still have nodes running the old, affected CRI-O, with stuck-in-terminating pods blocking their drains. Users should be alerted to these stuck nodes and pointed at suggested mitigations. We may already alert on "we've been trying to drain $NODE, but have been stuck for a very long $DURATION". If so, that would be a good place for a link to a Knowledgebase article or some such describing suggested diagnostics and mitigations. Or the MCO could grow logic around auto-recovering from this situation, although personally I'm not clear on how it would distinguish between "stuck on drain because of a CRI-O bug, force-terminate" and "stuck on drain because the pod is blocked by a PDB, notify the admin about possible manual intervention". But whether this is automatic or manual, we need some process in place to get folks who may not realize they are still impacted by the stuck-in-terminating bugs healed up and back to having healthy, maintained Machine(Set)s.
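As a strawman for what such a Knowledgebase article could suggest, here's a minimal sketch of the manual path, assuming the admin has already confirmed the pod is only stuck because of the CRI-O bug and not held by a PDB or some other legitimate blocker (the NAMESPACE/POD variables are placeholders, not anything from a real cluster):

  # list pods that have a deletionTimestamp set but are still hanging around
  $ oc get pods --all-namespaces -o json | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.deletionTimestamp + " " + .metadata.namespace + "/" + .metadata.name'
  # once confident the pod is only stuck on the CRI-O bug, force-remove it so the drain can proceed
  $ oc delete pod -n "${NAMESPACE}" "${POD}" --grace-period=0 --force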
This is similar to bug 1829664, but they're slightly different. That one is about blocking folks from making multiple minor bumps while folks have stuck machines that lag behind by a minor version. This one is about unsticking stuck machines so folks can avoid that block.
What's the difference with https://bugzilla.redhat.com/show_bug.cgi?id=1829664 - can we just compact all this into one bug? The solution seems pretty straightforward and will include a block-and-notify mechanism.
(In reply to W. Trevor King from comment #1)
> This is similar to bug 1829664, but they're slightly different. That one is
> about blocking folks from making multiple minor bumps while folks have stuck
> machines that lag behind by a minor version. This one is about unsticking
> stuck machines so folks can avoid that block.

The signal about a stuck situation will include an unblock path one way or the other, so I don't see why we're using this extra BZ. I think what is here is gonna be part of that other BZ, so it's covered there anyway.
> I think what is here is gonna be part of that other BZ, so it's covered there anyway

If there is a single PR that creates both the Upgradeable=False (or otherwise) blocker requested in bug 1829664 and also the alert (or otherwise) upgrade-agnostic resolver, I'm fine with that. Can we close one or the other of these as a dup if that happens, but leave them both open to track the potentially decoupled fixes until then? I don't want to hold up, say, an informing alert if we have that ready for this bug before we have some kind of update blocker in place for bug 1829664.
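And if the bug 1829664 side does land as Upgradeable=False (that's my assumption about the mechanism; it may surface differently), a sketch of how an admin could see which operator is holding the line, for whatever docs we end up linking:

  $ oc get clusteroperators -o json | jq -r '.items[] | .metadata.name as $name | .status.conditions[] | select(.type == "Upgradeable" and .status == "False") | $name + " " + (.reason // "") + ": " + (.message // "")'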
we do have a drain alert in mco: https://github.com/openshift/machine-config-operator/blob/37b92529c4f9c8d554bdd93008cb044de3799b27/install/0000_90_machine-config-operator_01_prometheus-rules.yaml#L21
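For reference, something like this should show the rule as installed on a live cluster (assuming it ships as a PrometheusRule in the openshift-machine-config-operator namespace, which is my reading of that file path):

  $ oc -n openshift-machine-config-operator get prometheusrules -o yaml | grep -B2 -A8 'alert: MCDDrainError'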
> we do have a drain alert...

Hmm, that looks good to me. And the CI cluster was firing these before yesterday's excitement. Maybe it's just missing links to docs to cover the impact on the cluster and suggest mitigation procedures?
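For the record, one way to check whether MCDDrainError (or anything else) is still firing, assuming curl is available in the prometheus container (if it isn't, the Alertmanager UI shows the same thing):

  $ oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- curl -s http://localhost:9090/api/v1/alerts | jq -r '.data.alerts[] | select(.state == "firing") | .labels.severity + " " + .labels.alertname'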
@wking what docs do you suggest we link to? I can take a look at what we can do / what makes the most sense.
> I can take a look at what we can do / what makes the most sense.

Probably good to have test-platform folks weigh in on how they reacted to the existing alert and what would have helped them understand and recover on their own. But I dunno about where to link. [1,2] has some thoughts on potential candidates, but that's just me talking into the void. Anything that makes sense to MCO folks in this space would be good, and we can always pivot later if we decide on a better place for some reason.

[1]: https://github.com/openshift/enhancements/commit/2b05511d7671ef9b75c8c8f01b69d06857dba0a0#diff-38b9cc6beb67f344dc174429e307aae2R31-R37
[2]: https://github.com/openshift/enhancements/pull/298
History of the CI cluster:

* 2020-04-06T16:45Z: new 'kublet-debug' MachineConfig for workers [1,2].
* 2020-04-06T16:47Z: rendered-worker-965bc0a2f7f5db799ded97e2c12b800e created [7].
* 2020-04-08T20:31Z: worker MachineConfigPool sticks updating to rendered-worker-965bc0a2f7f5db799ded97e2c12b800e [5,6].
* 2020-04-16T17:03Z: Update to 4.3.0-0.nightly-2020-04-13-190424 initiated; completes in a bit over an hour [3].
* 2020-04-27T20:38Z: Update to unknown-version, followed by 4.3.0-0.nightly-2020-04-13-190424, 4.3.17, and 4.4.0-rc.12 [3].
* 2020-04-27T23:02Z: Update to 4.4.0-rc.12 completes [3].
* 2020-04-27T23:48Z: MCDDrainError starts firing, and bug 1822269 gets a (private, sorry) comment with node logs:

    [FIRING:1] MCDDrainError machine-config-daemon (190.373587037 sec metrics 5 tries: error when evicting pod "knative-eventing-filter-build": pods "knative-eventing-filter-build" is forbidden: unable to create new content in namespace ci-op-fthy1lds because it is being terminated 10.0.147.192:9001 openshift-machine-config-operator machine-config-daemon-mp4jv openshift-monitoring/k8s machine-config-daemon critical)
    Drain failed on , updates may be blocked. For more details: oc logs -f -n openshift-machine-config-operator machine-config-daemon-<hash> -c machine-config-daemon

* 2020-04-28T14:13Z: must-gather collected.

Oddities:

* No logs for machine-config-daemon-mp4jv in the must-gather [4].
* "All nodes are updating to rendered-worker-965bc0a2f7f5db799ded97e2c12b800e", from the MachineConfigPool [6], conflicts with node annotations, which show a range of desiredConfig values [8]. Three nodes seem to be happily on that version, with both desiredConfig and currentConfig matching, but one is attempting 965bc0... -> fd5d710..., one is on fd5d710..., and one is on 8758e49... [8] (8758e49... may be from the 16th's 4.3.0-0.nightly-2020-04-13-190424 update). All three m5d4x nodes seem happy at 049d20...
* The worker MachineConfigPool's 'machineCount: 9' [5] seems to count both worker and worker-m5d4x nodes [8].
* All the nodes have ready kubelets, many of which do not seem to have been restarted during the update to rc.12 [9].

Must-gathers don't seem to gather alerts (!), so I'm not sure if this cluster was still firing MCDDrainError when the must-gather was collected. But it seems like *something* should have been firing in this space since at least 2020-04-08, when the worker MachineConfigPool got stuck.

[1]: https://github.com/openshift/release/pull/8129/files#diff-942e0bef7d32b2de4004d7c529bce15aR5-R6
[2]: https://github.com/openshift/release/pull/8129#event-3206265985
[3]: $ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.history[] | .startedTime + " " + .completionTime + " " + .version + " " + .state + " " + (.verified | tostring)'
     2020-04-27T22:02:43Z 2020-04-27T23:00:42Z 4.4.0-rc.12 Completed true
     2020-04-27T21:14:02Z 2020-04-27T21:55:13Z 4.3.17 Completed true
     2020-04-27T20:50:40Z 2020-04-27T20:51:18Z 4.3.0-0.nightly-2020-04-13-190424 Completed false
     2020-04-27T20:38:33Z 2020-04-27T20:50:40Z  Partial true
     2020-04-16T17:03:11Z 2020-04-16T18:13:18Z 4.3.0-0.nightly-2020-04-13-190424 Completed false
     2020-03-24T12:22:21Z 2020-03-24T13:06:13Z 4.3.0-0.nightly-2020-03-23-130439 Completed false
     ...
[4]: $ (cd namespaces/openshift-machine-config-operator/pods && ls -d machine-config-daemon-*)
     machine-config-daemon-29lhv  machine-config-daemon-4ngdw  machine-config-daemon-8z46n  machine-config-daemon-bvbf7  machine-config-daemon-hbsf7  machine-config-daemon-p6hp9
     machine-config-daemon-2mpx2  machine-config-daemon-7kllh  machine-config-daemon-bllbj  machine-config-daemon-fslhs  machine-config-daemon-m5jxl  machine-config-daemon-tp7f9
[5]: $ tail -n6 cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigpools/worker.yaml
       degradedMachineCount: 0
       machineCount: 9
       observedGeneration: 13
       readyMachineCount: 1
       unavailableMachineCount: 1
       updatedMachineCount: 1
[6]: $ yaml2json <cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigpools/worker.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + " " + .status + " " + .reason + " " + .message'
     2020-01-30T13:45:46Z RenderDegraded False
     2020-04-08T20:31:31Z Updated False
     2020-04-08T20:31:31Z Updating True  All nodes are updating to rendered-worker-965bc0a2f7f5db799ded97e2c12b800e
     2020-04-28T00:26:35Z Degraded False
     2020-04-28T00:26:35Z NodeDegraded False
[7]: $ for P in cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-[0-9a-f]*.yaml; do yaml2json < "${P}" | jq -r '.metadata.creationTimestamp + " " + .metadata.name'; done | sort
     2020-01-30T13:45:46Z rendered-worker-9f3937d51d62634b6d9ab2fe9c93b416
     2020-02-24T18:43:15Z rendered-worker-450ac93c7d08c437ad42815ddfd69056
     2020-02-26T14:26:58Z rendered-worker-a12fa4a597a22d2935bd9184e63fa601
     2020-03-05T15:45:42Z rendered-worker-2e0479535983547a0bcc9c3c043466a8
     2020-03-09T17:40:33Z rendered-worker-92a1e7a277ffefcf5fa4fa3d767cc52a
     2020-03-24T12:53:39Z rendered-worker-29d660dd6147feab0db552fbe701e9b1
     2020-04-03T16:42:35Z rendered-worker-b3b603424816b24c4fcc69582af95f68
     2020-04-06T16:47:34Z rendered-worker-965bc0a2f7f5db799ded97e2c12b800e
     2020-04-16T18:01:10Z rendered-worker-8758e49e3f03d58f6f2f1be35ea7ec7f
     2020-04-27T21:42:35Z rendered-worker-48c36cdab5e98e3a58c1c4887ab97bd4
     2020-04-27T22:42:37Z rendered-worker-fd5d7107556afa4cfaa157fe73e4a36e
[8]: $ for P in cluster-scoped-resources/core/nodes/*; do yaml2json < "${P}" | jq -r '.metadata | .annotations["machineconfiguration.openshift.io/desiredConfig"] + " " + .annotations["machineconfiguration.openshift.io/currentConfig"] + " " + .name'; done | grep rendered-worker | sort
     rendered-worker-8758e49e3f03d58f6f2f1be35ea7ec7f rendered-worker-8758e49e3f03d58f6f2f1be35ea7ec7f ip-10-0-174-84.ec2.internal
     rendered-worker-965bc0a2f7f5db799ded97e2c12b800e rendered-worker-965bc0a2f7f5db799ded97e2c12b800e ip-10-0-130-84.ec2.internal
     rendered-worker-965bc0a2f7f5db799ded97e2c12b800e rendered-worker-965bc0a2f7f5db799ded97e2c12b800e ip-10-0-132-55.ec2.internal
     rendered-worker-965bc0a2f7f5db799ded97e2c12b800e rendered-worker-965bc0a2f7f5db799ded97e2c12b800e ip-10-0-141-108.ec2.internal
     rendered-worker-fd5d7107556afa4cfaa157fe73e4a36e rendered-worker-965bc0a2f7f5db799ded97e2c12b800e ip-10-0-143-120.ec2.internal
     rendered-worker-fd5d7107556afa4cfaa157fe73e4a36e rendered-worker-fd5d7107556afa4cfaa157fe73e4a36e ip-10-0-152-34.ec2.internal
     rendered-worker-m5d4x-049d207be5cfe536c82a9ac80757780c rendered-worker-m5d4x-049d207be5cfe536c82a9ac80757780c ip-10-0-143-70.ec2.internal
     rendered-worker-m5d4x-049d207be5cfe536c82a9ac80757780c rendered-worker-m5d4x-049d207be5cfe536c82a9ac80757780c ip-10-0-146-117.ec2.internal
     rendered-worker-m5d4x-049d207be5cfe536c82a9ac80757780c rendered-worker-m5d4x-049d207be5cfe536c82a9ac80757780c ip-10-0-166-107.ec2.internal
[9]: $ for P in cluster-scoped-resources/core/nodes/*; do yaml2json < "${P}" | jq -r '(.status.conditions[] | select(.type == "Ready") | .lastTransitionTime + " " + .lastHeartbeatTime + " " + .status) + " " + .metadata.name'; done | sort | sed 's/^/ /'
     2020-04-16T22:45:11Z 2020-04-28T14:18:22Z True ip-10-0-143-120.ec2.internal
     2020-04-17T08:39:38Z 2020-04-28T14:18:03Z True ip-10-0-132-55.ec2.internal
     2020-04-17T14:14:24Z 2020-04-28T14:18:30Z True ip-10-0-141-108.ec2.internal
     2020-04-17T15:02:07Z 2020-04-28T14:18:21Z True ip-10-0-166-107.ec2.internal
     2020-04-17T15:07:06Z 2020-04-28T14:18:33Z True ip-10-0-146-117.ec2.internal
     2020-04-20T13:31:19Z 2020-04-28T14:18:42Z True ip-10-0-174-84.ec2.internal
     2020-04-27T14:41:06Z 2020-04-28T14:18:17Z True ip-10-0-130-84.ec2.internal
     2020-04-27T14:43:12Z 2020-04-28T14:18:17Z True ip-10-0-143-70.ec2.internal
     2020-04-27T22:45:55Z 2020-04-28T14:17:06Z True ip-10-0-159-123.ec2.internal
     2020-04-27T22:50:28Z 2020-04-28T14:16:46Z True ip-10-0-140-81.ec2.internal
     2020-04-27T22:54:27Z 2020-04-28T14:16:21Z True ip-10-0-175-171.ec2.internal
     2020-04-28T00:14:31Z 2020-04-28T14:14:28Z True ip-10-0-152-34.ec2.internal
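One more cross-check that might help anyone doing similar must-gather spelunking: compare each node's currentConfig annotation against the pool's target rendered config. This is just a sketch against the layout above; it assumes spec.configuration.name is the pool's target, leans on the state annotation from memory, and will also list nodes from other pools (masters, m5d4x) that you'd have to filter out by hand:

  $ TARGET="$(yaml2json <cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigpools/worker.yaml | jq -r '.spec.configuration.name')"
  $ for P in cluster-scoped-resources/core/nodes/*; do yaml2json < "${P}" | jq -r --arg target "${TARGET}" '.metadata | select(.annotations["machineconfiguration.openshift.io/currentConfig"] != $target) | .name + " " + (.annotations["machineconfiguration.openshift.io/state"] // "") + " " + (.annotations["machineconfiguration.openshift.io/currentConfig"] // "")'; done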
Oops. I'd been thinking about re-assigning to the machine-API operator, but it seems like this is really a machine-config issue, not a Machine issue. Moving it back...
Adding UpcomingSprint as this won't make the current sprint. We'll try to work on this bug in the next sprint.
After thinking about this a lot, I think this is overloading the alert itself and that this is really addressing a hole in our general troubleshooting documentation...
Given that this issue has been moved out of the last two releases, I am closing it. Please reopen if it's worth revisiting.