Bug 1952694 - Normal (<24h) machine config draining is reported as "failures" and "degraded" - it is not
Summary: Normal (<24h) machine config draining is reported as "failures" and "degraded" - it is not
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Kirsten Garrison
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-04-22 21:33 UTC by Clayton Coleman
Modified: 2021-06-13 21:26 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-13 21:26:26 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/machine-config-operator pull 2597 (open): [WIP] Bug 1952694: Bump drain backoff 24h (last updated 2021-06-03 19:18:18 UTC)

Description Clayton Coleman 2021-04-22 21:33:26 UTC
The current drain logic and behavior of the MCD during normal upgrades reports very normal operations as errors.  The process of draining a node is a *normal* interaction between the workload (represented by PDBs and graceful shutdown) and the drain process.

Currently, the MCD reports these normal states as errors, using a number of messages that portray normal behavior as errors or failures, or that appear to a normal admin as a "degraded" state when it is actually a "progressing" state. It is entirely normal for draining nodes to take O(hours), and we should only be alerting or erroring when the cause is *not* "waiting for a pod to complete or a PDB to permit eviction".

As long as we have been draining for less than 24h (last I checked, that was our tolerance window), we are not in an error state, and we should instead be sending events along these lines (just an example of tone):

"Draining node X: 5 pods remaining after Y minutes (optionally list pods remaining in namespace/name form)"

The alert is also misplaced - we already have a PDB-at-limit alert, so drain should not be erroring or reporting errors when a PDB clearly tells us to wait (it is not an error, and we already have the warning alert PodDisruptionBudgetAtLimit).

To an end user, we have taken completely normal use of the product and portrayed it as a significant failure state, and that means the customer trusts the product and our alerts a bit less each time we overuse this framing.

I will be adding a test case that adds more PDB testing to upgrades, and that test will fail if we are alerting during upgrade due to other 4.8 changes.  I may grant a limited time exception to fix this, but this is high priority given how terribly the build01 4.7.8 cluster upgrade was presented.

----

Example of an alert saying "error" when this is not an error (also, this is not a timeout):

ALERTS{alertname="MCDDrainError", alertstate="firing", container="oauth-proxy", endpoint="metrics", err="WaitTimeout", instance="10.0.130.44:9001", job="machine-config-daemon", namespace="openshift-machine-config-operator", node="ip-10-0-130-44.ec2.internal", pod="machine-config-daemon-gdfm6", service="machine-config-daemon", severity="warning"}

Examples of error portrayals:

Apr 22 17:21:11.000 I node/ip-10-0-164-190.ec2.internal reason/Drain Draining node to update config. (38 times)
Apr 22 17:21:11.000 W node/ip-10-0-164-190.ec2.internal reason/FailedToDrain 5 tries: error when waiting for pod "0e03b42b-961c-11eb-bd4a-0a580a800913" terminating: global timeout reached: 1m30s (29 times)

Comment 1 Clayton Coleman 2021-04-22 21:54:06 UTC
In addition, we confirmed on the upgrading 4.7.8 cluster that the nodes being drained were reported as "degraded" due to the same error, showing:

   Message:               Node ip-10-0-164-190.ec2.internal is reporting: "failed to drain node (5 tries): timed out waiting for the condition: error when waiting for pod \"0e03b42b-961c-11eb-bd4a-0a580a800913\" terminating: global timeout reached: 1m30s", Node ip-10-0-170-27.ec2.internal is reporting: "failed to drain node (5 tries): timed out waiting for the condition: error when evicting pods/\"rpm-repo-55b8d65474-pz2dp\" -n \"ci-op-0cjpxgpy\": global timeout reached: 1m30s", Node ip-10-0-152-206.ec2.internal is reporting: "failed to drain node (5 tries): timed out waiting for the condition: [error when evicting pods/\"rpm-repo-6b9796d6dc-kdbz7\" -n \"ci-op-jnm9nlqs\": global timeout reached: 1m30s, error when evicting pods/\"rpm-repo-67468cbc78-ngqsd\" -n \"ci-op-0vy1gw11\": global timeout reached: 1m30s, error when evicting pods/\"ecc5bf3a-a3ac-11eb-b1ca-0a580a820289\" -n \"ci\": global timeout reached: 1m30s, error when evicting pods/\"9fcddd82-a3a0-11eb-8691-0a580a81043f\" -n \"ci\": global timeout reached: 1m30s, error when evicting pods/\"rpm-repo-9f845d9bf-9pl6p\" -n \"ci-op-8zbq9w3n\": global timeout reached: 1m30s]", Node ip-10-0-142-241.ec2.internal is reporting: "failed to drain node (5 tries): timed out waiting for the condition: error when evicting pods/\"rpm-repo-b78cc9bcc-xf9jl\" -n \"ci-op-ffqr2nt3\": global timeout reached: 1m30s", Node ip-10-0-147-73.ec2.internal is reporting: "failed to drain node (5 tries): timed out waiting for the condition: [error when evicting pods/\"4.6.0-0.okd-2021-04-22-205018-upgrade\" -n \"ci\": global timeout reached: 1m30s, error when evicting pods/\"3c68fc17-a3ac-11eb-a0ee-0a580a82028a\" -n \"ci\": global timeout reached: 1m30s, error when evicting pods/\"3c5a1b0a-a3ac-11eb-a0ee-0a580a82028a\" -n \"ci\": global timeout reached: 1m30s, error when evicting pods/\"3c6bd84e-a3ac-11eb-a0ee-0a580a82028a\" -n \"ci\": global timeout reached: 1m30s, error when evicting pods/\"ecc25d28-a3ac-11eb-b1ca-0a580a820289\" -n \"ci\": global timeout reached: 1m30s, error when evicting pods/\"a910d779-a3ac-11eb-aa6d-0a580a810c6d\" -n \"ci\": global timeout reached: 1m30s, error when evicting pods/\"rpm-repo-b44d8487d-pjzt2\" -n \"ci-op-78pdb3v6\": global timeout reached: 1m30s]", Node ip-10-0-149-24.ec2.internal is reporting: "failed to drain node (5 tries): timed out waiting for the condition: [error when evicting pods/\"chat-bot-2021-04-22-171327.038\" -n \"ci\": global timeout reached: 1m30s, error when evicting pods/\"4babbf5e-a3a0-11eb-b1d6-0a580a820a51\" -n \"ci\": global timeout reached: 1m30s, error when evicting pods/\"e2e-aws-operator-test\" -n \"ci-op-tgj50d18\": global timeout reached: 1m30s, error when evicting pods/\"415694bc-a3a0-11eb-8691-0a580a81043f\" -n \"ci\": global timeout reached: 1m30s, error when evicting pods/\"4baa1566-a3a0-11eb-b1d6-0a580a820a51\" -n \"ci\": global timeout reached: 1m30s, error when evicting pods/\"4d14809c-a3a0-11eb-8691-0a580a81043f\" -n \"ci\": global timeout reached: 1m30s, error when evicting pods/\"4c392788-a3a0-11eb-8691-0a580a81043f\" -n \"ci\": global timeout reached: 1m30s, error when evicting pods/\"c62e4522-a3ad-11eb-b81c-0a58ac1d0654\" -n \"ci\": global timeout reached: 1m30s, error when evicting pods/\"8bf604d9-a3a7-11eb-8691-0a580a81043f\" -n \"ci\": global timeout reached: 1m30s, error when evicting pods/\"rpm-repo-7cc744dc6f-dq22k\" -n \"ci-op-f8wfyxn1\": global timeout reached: 1m30s]", Node ip-10-0-167-153.ec2.internal is reporting: "failed to drain node (5 tries): timed out waiting for the 
condition: [error when waiting for pod \"db872992-9d34-11eb-8d4a-0a580a8208a8\" terminating: global timeout reached: 1m30s, error when evicting pods/\"rpm-repo-56cff9b449-rsld7\" -n \"ci-op-h6t2m22s\": global timeout reached: 1m30s]", Node ip-10-0-132-223.ec2.internal is reporting: "failed to drain node (5 tries): timed out waiting for the condition: error when waiting for pod \"bf1d5cee-9316-11eb-8453-0a580a8102f7\" terminating: global timeout reached: 1m30s", Node ip-10-0-138-161.ec2.internal is reporting: "failed to drain node (5 tries): timed out waiting for the condition: error when evicting pods/\"rpm-repo-56cff9b449-29ls9\" -n \"ci-op-h6t2m22s\": global timeout reached: 1m30s"

Comment 2 Yu Qi Zhang 2021-04-22 22:56:05 UTC
For background: this has always been the behaviour in the MCO, where errors during an upgrade process bubble up to the pool and eventually to the operator status. Up until recently this has been more hidden, since slow-draining pods generally happen on worker nodes, and worker nodes are not considered required during upgrades.

As of OpenShift 4.7, degraded pools block upgrade completion, including worker pools, so this is much more visible (and expected, so that we report issues faster and more transparently). The MCO should correspondingly improve how we report progressing vs. degraded nodes, especially since drains are currently considered "degraded" after a failure of only 1min30s (very short). In the case of a regular workload drain, we should consider the node progressing but not degraded, in accordance with how the CVO defines these statuses.
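
To illustrate the distinction, a sketch only (not the MCO's actual status handling; the drainTolerance constant and function name are made up):

package drainstatus

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// drainTolerance is the window being debated above; the value is hypothetical.
const drainTolerance = 24 * time.Hour

// drainCondition picks the condition a pool should report for an ongoing drain:
// Progressing while the drain is within tolerance, Degraded only once it is truly stuck.
func drainCondition(drainStart time.Time, lastErr error) metav1.Condition {
	if lastErr != nil && time.Since(drainStart) > drainTolerance {
		return metav1.Condition{
			Type:    "Degraded",
			Status:  metav1.ConditionTrue,
			Reason:  "DrainTimedOut",
			Message: lastErr.Error(),
		}
	}
	// A drain blocked on PDBs or graceful termination is normal progress.
	return metav1.Condition{
		Type:    "Progressing",
		Status:  metav1.ConditionTrue,
		Reason:  "Draining",
		Message: "waiting for pods to finish or PDBs to allow eviction",
	}
}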

This does bring up the further questions of:
1. how/whether we should distinguish between different types of drain failures (timeouts vs. mis-configured PDBs)
2. what really constitutes a failing drain, timeout-wise (is it 24 hrs? is that arbitrary?)

Comment 3 W. Trevor King 2021-05-14 17:06:15 UTC
> As of Openshift 4.7 degraded pools block upgrade completion, including workers, so it's much more visible...

This overlaps a good deal with bug 1943289.  I dunno if they're close enough to be worth closing one or the other as a dup.

> how/whether we should distinguish between different types of drain failures (timeouts vs mis-configured pdbs)

For PDBs in general, unblocking the drain involves both getting rid of an existing pod and bringing up replacements.  If it takes more than 15m to get a replacement ready, that's probably fairly safe to alarm on generically.  For getting rid of an existing pod, we can look at terminationGracePeriodSeconds [1] to get an appropriate duration for every pod, right?

And there's some contention between workload maintainers and pool maintainers here, so it might be appropriate for pool authors to have a knob on the pool for "ping me if drains take longer than $DURATION", so they can go discuss with the workload maintainers who are exceeding that threshold.  OSD has a hard cap on drains that's configurable but cannot exceed 8h.  I dunno if I'd go that far (being able to drain nodes is important, but some workloads are not interruptible, and they're important too), but it is precedent for pool-level drain thresholds, even if the MCO decides to take a different action (e.g. warning instead of forcing) when hitting the threshold.  And a pool-level guard would provide a backstop for pods which accidentally set crazy-high terminationGracePeriodSeconds values or whatever.

[1]: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-v1/#lifecycle
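
Putting those two ideas together in rough Go terms, a sketch only (poolMaxDrain, the 15m replacement slack, and the function name are illustrative, not real MCO knobs):

package drainstatus

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// poolMaxDrain is a hypothetical pool-level backstop, similar in spirit to OSD's hard cap.
const poolMaxDrain = 8 * time.Hour

// podDrainDeadline returns how long we could reasonably wait for this pod
// to go away before complaining about the drain.
func podDrainDeadline(pod *corev1.Pod) time.Duration {
	grace := 30 * time.Second // Kubernetes default when terminationGracePeriodSeconds is unset
	if pod.Spec.TerminationGracePeriodSeconds != nil {
		grace = time.Duration(*pod.Spec.TerminationGracePeriodSeconds) * time.Second
	}
	// Add some slack for a replacement pod to be scheduled and become ready.
	deadline := grace + 15*time.Minute
	if deadline > poolMaxDrain {
		// Backstop against pods that accidentally set an extreme grace period.
		deadline = poolMaxDrain
	}
	return deadline
}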

Comment 6 Kirsten Garrison 2021-06-13 21:26:26 UTC
After consulting with product and other interested parties, we built in a default so that a drain is considered a failure after 1hr: https://issues.redhat.com/browse/OCPBUGSM-30052
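
For the general shape, a sketch only (not the code from the linked change; names and backoff values are illustrative):

package drainstatus

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// drainFailureAfter is the default decided above.
const drainFailureAfter = 1 * time.Hour

// drainWithTolerance retries drainOnce with backoff and only returns an error
// once the drain has been failing for longer than drainFailureAfter.
func drainWithTolerance(drainOnce func() error) error {
	start := time.Now()
	backoff := wait.Backoff{Duration: 10 * time.Second, Factor: 2, Steps: 10, Cap: 5 * time.Minute}
	var lastErr error
	for {
		lastErr = drainOnce()
		if lastErr == nil {
			return nil
		}
		if time.Since(start) > drainFailureAfter {
			return lastErr // only now is it fair to call this a failure
		}
		time.Sleep(backoff.Step()) // Step advances the backoff and returns the next delay
	}
}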

I'll be looking into allowing users to specify custom per-pool drain periods in 4.9 via https://issues.redhat.com/browse/GRPA-3842

Closing this bug for now and tracking via the above JIRA card.

