Bug 1992328

Summary: Machine controller fails to drain node
Product: OpenShift Container Platform
Component: Cloud Compute
Cloud Compute sub component: Other Providers
Version: 4.8
Keywords: ServiceDeliveryImpact
Reporter: Trevor Nierman <tnierman>
Assignee: Joel Speed <jspeed>
QA Contact: sunzhaohua <zhsun>
CC: cblecker, jspeed, rrackow
Status: CLOSED INSUFFICIENT_DATA
Severity: unspecified
Priority: unspecified
Last Closed: 2021-08-16 14:43:53 UTC
Type: Bug

Description Trevor Nierman 2021-08-10 21:33:57 UTC
Description of problem:
Machine-controller failed to drain a worker node after a cluster upgrade to 4.8.3.

Version-Release number of selected component (if applicable):
4.8.3

How reproducible:


Steps to Reproduce:
1. Have a pod stuck in a terminating state (see the sketch after these steps for one way to simulate this)
2. Upgrade cluster
3. Observe failure logs in machine-controller
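
One way to simulate step 1 (a sketch under assumptions, not taken from the original report) is to add a custom finalizer to a pod and then delete it; the pod object then stays in Terminating until the finalizer is cleared. The client-go snippet below assumes a reachable kubeconfig and a hypothetical pod `stuck-pod` in the `default` namespace.

```
// Sketch: keep a pod in Terminating by adding a finalizer before deleting it.
// Assumes a reachable kubeconfig; the pod and namespace names are hypothetical.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Add a finalizer so the API server keeps the pod object around
	// (in Terminating) until the finalizer is removed again.
	patch := []byte(`{"metadata":{"finalizers":["example.com/block-deletion"]}}`)
	if _, err := client.CoreV1().Pods("default").Patch(ctx, "stuck-pod",
		types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}

	// Deleting the pod now leaves it stuck with a deletionTimestamp set.
	if err := client.CoreV1().Pods("default").Delete(ctx, "stuck-pod", metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("pod deleted; it will stay in Terminating until the finalizer is removed")
}
```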

Actual results:
The node drain fails and the machine never updates.

Expected results:
If the drain fails, the machine-controller should force-evict the remaining pods and complete the upgrade.
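
For illustration only, a fallback along those lines could be built on the upstream `k8s.io/kubectl/pkg/drain` helper. This is a minimal sketch under assumptions (kubeconfig-based client, placeholder timeout), not the machine-controller's actual code; the node name is the one from this bug, used purely as an example.

```
// Sketch of a force-drain fallback using the upstream drain helper.
// Not the machine-controller's actual implementation; values are placeholders.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/kubectl/pkg/drain"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	helper := &drain.Helper{
		Ctx:                 context.Background(),
		Client:              client,
		Force:               true, // also delete pods without a managing controller
		IgnoreAllDaemonSets: true,
		DeleteEmptyDirData:  true,
		GracePeriodSeconds:  0,    // do not wait out each pod's own grace period
		DisableEviction:     true, // plain delete, bypassing PDB-backed eviction
		Timeout:             60 * time.Second,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}

	nodeName := "ip-10-0-213-12.ca-central-1.compute.internal" // node from this bug, as an example
	node, err := client.CoreV1().Nodes().Get(context.Background(), nodeName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
		panic(err)
	}
	if err := drain.RunNodeDrain(helper, nodeName); err != nil {
		panic(err)
	}
}
```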

Additional info:

Comment 4 Joel Speed 2021-08-12 12:56:12 UTC
Draining seems to be blocked permanently by a single pod, configure-alertmanager-operator-registry-8j96p

```
2021-08-10T20:10:48.141504670Z W0810 20:10:48.141460       1 controller.go:434] drain failed for machine "adjstsaug10-nldwt-worker-ca-central-1a-8pc69": error when waiting for pod "configure-alertmanager-operator-registry-8j96p" terminating: global timeout reached: 20s
```

This log line appears 132 times and accounts for the entire list of drain-failed error messages.

This typically happens when either a PDB is blocking us from removing the pod, or the pod itself isn't shutting down and doesn't have a termination grace period set.
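
(As an aside, whether a PDB is the blocker can be probed directly with the eviction API, which rejects a PDB-blocked eviction with HTTP 429. Below is a minimal client-go sketch, using policy/v1beta1 to match a 4.8-era cluster and the pod from this bug; note it is not a dry run, so an allowed eviction will actually evict the pod.)

```
// Sketch: probe whether evicting a pod is blocked by a PodDisruptionBudget.
// A PDB-blocked eviction is rejected with HTTP 429 (TooManyRequests).
package main

import (
	"context"
	"fmt"

	policyv1beta1 "k8s.io/api/policy/v1beta1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	eviction := &policyv1beta1.Eviction{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "configure-alertmanager-operator-registry-8j96p",
			Namespace: "openshift-monitoring",
		},
	}
	// Not a dry run: if no PDB blocks it, the pod really is evicted.
	err = client.PolicyV1beta1().Evictions(eviction.Namespace).Evict(context.Background(), eviction)
	switch {
	case err == nil:
		fmt.Println("eviction accepted; a PDB is not the blocker")
	case apierrors.IsTooManyRequests(err):
		fmt.Println("eviction rejected by a PodDisruptionBudget:", err)
	default:
		fmt.Println("eviction failed for another reason:", err)
	}
}
```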

Looking at the pod in the must gather, the creation and deletion timestamps are in very quick succession:

```
    creationTimestamp: "2021-08-10T16:59:09Z"
    deletionGracePeriodSeconds: 1
    deletionTimestamp: "2021-08-10T16:59:10Z"
```

This is a bit weird; it looks like the pod was created and assigned to the Node even though the Node was already being drained.
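
As a side note, those two fields also fix the point at which the pod object should have been removed. A trivial sketch with the values above hard-coded (plus the timestamp of the drain-failure log line) shows that deadline was already hours in the past:

```
// Sketch: compute when the deleted pod should have been gone, from its
// deletionTimestamp and deletionGracePeriodSeconds in the must gather.
package main

import (
	"fmt"
	"time"
)

func mustParse(v string) time.Time {
	t, err := time.Parse(time.RFC3339, v)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	deletionTimestamp := mustParse("2021-08-10T16:59:10Z") // from the pod above
	deletionGracePeriod := 1 * time.Second                 // deletionGracePeriodSeconds: 1
	deadline := deletionTimestamp.Add(deletionGracePeriod)

	logTime := mustParse("2021-08-10T20:10:48Z") // drain-failure log line above
	fmt.Printf("pod removal deadline: %s (%s before the drain failures)\n",
		deadline.Format(time.RFC3339), logTime.Sub(deadline))
}
```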

I also noticed the following in the machine controller log:

```
deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: openshift-monitoring/configure-alertmanager-operator-registry-8j96p
```

Which leads me to ask: what is creating that pod? Is whatever is creating it doing the right thing with respect to cordoned nodes, or is it interfering with the drain process somehow?
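
One way to answer that on a live cluster is to inspect the pod's ownerReferences (in a must gather the same fields are visible directly in the pod YAML). A minimal client-go sketch, using the pod and namespace from this bug:

```
// Sketch: identify what created a pod by inspecting its ownerReferences.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pod, err := client.CoreV1().Pods("openshift-monitoring").Get(
		context.Background(), "configure-alertmanager-operator-registry-8j96p", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// The controlling owner (if any) is the object whose controller created this pod.
	if owner := metav1.GetControllerOf(pod); owner != nil {
		fmt.Printf("pod is controlled by %s %s (%s)\n", owner.Kind, owner.Name, owner.APIVersion)
	} else {
		fmt.Println("pod has no controlling ownerReference")
	}
	// Non-controlling owners are still worth listing.
	for _, ref := range pod.OwnerReferences {
		fmt.Printf("ownerReference: %s %s\n", ref.Kind, ref.Name)
	}
}
```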

The Node the pod is assigned to appears to be healthy and running, so I don't think this is a Node-level issue, but we can't rule that out just yet as the must gather doesn't include the kubelet logs for that node.

@cblecker Would you be able to retrieve the kubelet logs from the Node ip-10-0-213-12.ca-central-1.compute.internal so that we can rule out a host level issue for why this is being blocked?

Comment 6 Christoph Blecker 2021-08-12 14:56:59 UTC
@jspeed Unfortunately this cluster has since been destroyed, so further log collection outside of the MG will not be possible.

Comment 7 Joel Speed 2021-08-12 16:06:38 UTC
In that case, as per the slack thread, I don't think there's much more we can do here.

We know that the pod is created by the catalog source controller.
The pod is deleted and is past its deletion timestamp/grace period.
It should at this point be removed by Kubelet.

However, it has not been removed, yet the node is reporting healthy, implying that the kubelet is still running.

My suspicion is that the kubelet logs would have told us there was some problem removing the pod, but without that I don't think there's anything more we will be able to find from the must gather.

If anyone is able to reproduce this, please grab those kubelet logs and a fresh must gather.

As this isn't explicitly a Machine API problem, I won't have time to try to reproduce it myself.

Comment 8 Joel Speed 2021-08-16 14:43:53 UTC
I have spoken with Christoph on Slack, and as we are unable to reproduce this right now and have only seen one occurrence, we will close this out for now and reopen it later if we manage to reproduce it again.