Bug 1992328
| Summary: | Machine controller fails to drain node | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Trevor Nierman <tnierman> |
| Component: | Cloud Compute | Assignee: | Joel Speed <jspeed> |
| Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED INSUFFICIENT_DATA | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | cblecker, jspeed, rrackow |
| Version: | 4.8 | Keywords: | ServiceDeliveryImpact |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-08-16 14:43:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Trevor Nierman
2021-08-10 21:33:57 UTC
Draining seems to be blocked permanently by a single pod, configure-alertmanager-operator-registry-8j96p
```
2021-08-10T20:10:48.141504670Z W0810 20:10:48.141460 1 controller.go:434] drain failed for machine "adjstsaug10-nldwt-worker-ca-central-1a-8pc69": error when waiting for pod "configure-alertmanager-operator-registry-8j96p" terminating: global timeout reached: 20s
```
This log line appears 132 times and accounts for every "drain failed" error message in the log.
This typically happens when either a PodDisruptionBudget (PDB) is blocking us from removing the pod, or the pod itself isn't shutting down and doesn't have a termination grace period set.
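For context, here is a minimal sketch of where the 20s "global timeout reached" error comes from, assuming the machine controller drains via the upstream k8s.io/kubectl/pkg/drain helper (the error wording matches that library, though the exact vendored drain code may differ; kubeconfig handling and field values below are illustrative, not the controller's real configuration):
```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/kubectl/pkg/drain"
)

func main() {
	// Assumption: kubeconfig path comes from the environment; the real controller
	// runs in-cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}

	helper := &drain.Helper{
		Ctx:    context.TODO(),
		Client: kubernetes.NewForConfigOrDie(cfg),
		// Force allows deleting pods without a controller owner; this is where the
		// "not managed by ReplicationController, ..." warning seen in the log comes from.
		Force:               true,
		IgnoreAllDaemonSets: true,
		DeleteEmptyDirData:  true,
		// A negative value means "honour each pod's own terminationGracePeriodSeconds".
		GracePeriodSeconds: -1,
		// Once this is exceeded while waiting for pods to disappear, the helper
		// returns "global timeout reached: 20s".
		Timeout: 20 * time.Second,
		Out:     os.Stdout,
		ErrOut:  os.Stderr,
	}

	nodeName := "ip-10-0-213-12.ca-central-1.compute.internal"
	if err := drain.RunNodeDrain(helper, nodeName); err != nil {
		// The machine controller logs the failure and requeues, which is why the
		// same error repeats (132 times here) until the pod finally disappears.
		fmt.Printf("drain failed: %v\n", err)
	}
}
```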
Looking at the pod in the must-gather, the creation and deletion timestamps are in very quick succession:
```
creationTimestamp: "2021-08-10T16:59:09Z"
deletionGracePeriodSeconds: 1
deletionTimestamp: "2021-08-10T16:59:10Z"
```
This is a bit odd: it looks as though the pod was created and assigned to the Node even though the Node was already being drained.
I also noticed that the machine controller log contains:
```
deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: openshift-monitoring/configure-alertmanager-operator-registry-8j96p
```
Which leads me to the question: what is creating that pod? Is whatever creates it doing the right thing with respect to cordoned nodes, or is it interfering with the drain process somehow?
The Node the pod is assigned to appears to be healthy and running, so I don't think this is a Node issue, but we can't rule that out just yet as the must-gather doesn't include the kubelet logs for that node.
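If anyone reproduces this on a live cluster, one way to confirm the "kubelet hasn't removed a deleted pod" state before pulling kubelet logs might be a client-go check like the sketch below (a diagnostic illustration only, not product code; the node name is taken from this bug and the kubeconfig path is assumed):
```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	nodeName := "ip-10-0-213-12.ca-central-1.compute.internal"

	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// List all pods bound to the node in question.
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(),
		metav1.ListOptions{FieldSelector: "spec.nodeName=" + nodeName})
	if err != nil {
		panic(err)
	}

	for _, p := range pods.Items {
		if p.DeletionTimestamp == nil {
			continue // not being deleted
		}
		grace := int64(0)
		if p.DeletionGracePeriodSeconds != nil {
			grace = *p.DeletionGracePeriodSeconds
		}
		// A pod past deletionTimestamp + deletionGracePeriodSeconds should already
		// have been removed by the kubelet; anything printed here is stuck.
		deadline := p.DeletionTimestamp.Add(time.Duration(grace) * time.Second)
		if time.Now().After(deadline) {
			fmt.Printf("%s/%s is %s past its deletion deadline and has not been removed\n",
				p.Namespace, p.Name, time.Since(deadline).Round(time.Second))
		}
	}
}
```
For the pod in this bug, deletionGracePeriodSeconds was 1 and the deletionTimestamp was 16:59:10, so it was past its deadline almost immediately after deletion.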
@cblecker Would you be able to retrieve the kubelet logs from the Node ip-10-0-213-12.ca-central-1.compute.internal so that we can rule out a host-level issue as the reason this is being blocked?
@jspeed Unfortunately this cluster has since been destroyed, so further log collection outside of the must-gather will not be possible.

In that case, as per the slack thread, I don't think there's much more we can do here. We know that the pod is created by the catalog source controller. The pod is deleted and is past its deletion timestamp/grace period; at this point it should be removed by kubelet. However, it has not been removed, and the node is reporting healthy, implying that kubelet is still running. My suspicion is that the kubelet logs would have told us there was some problem removing the pod, but without them I don't think there's anything more we will be able to find from the must-gather.

If anyone is able to reproduce this, please grab those kubelet logs and a fresh must-gather. As this isn't explicitly a Machine API problem, I won't have time to try to reproduce this myself.

Have spoken with Christoph on slack, and as we are unable to reproduce this right now and have only seen one occurrence, we will close this out for now and reopen later if we manage to reproduce it again.