Created attachment 1606113 [details]
Description of problem:
While attempting to delete a machine, the machine hangs indefinitely. The logs show that the drain operation never completes. This does not appear to be related to PDBs: the API accepts the delete/evict request, as there is room to schedule on another node.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Delete one of the machines that is created by the installer. I'm not sure if these pods run on all the machines or just one.
Node never finishes draining.
"deleting pods with local storage: alertmanager-main-0, prometheus-adapter-6ffcd77594-226mt, prometheus-k8s-0;"
These pods should go away.
Attached version info, pod info
Created attachment 1606115 [details]
I have the must-gather, lmk what parts you think would be helpful.
Created attachment 1606119 [details]
must-gather for openshift-monitoring namespace
I would say this is a possible 4.2 blocker. If we can't reliably drain a node because one of our pods is broken somehow, that's going to be a problem for maintaining a healthy cluster.
This may have always been a problem and was uncovered by this bugfix: https://bugzilla.redhat.com/show_bug.cgi?id=1729243
In a standard development OpenShift installation you get 6 nodes (3 masters, 3 workers), and monitoring components are deployed only on worker nodes. We need 3 replicas for alertmanager, and we also set anti-affinity for those pods and do not set a PodDisruptionBudget. All this means that in a standard development OpenShift installation alertmanager always needs 3 replicas distributed across different nodes. This in turn will prevent node draining, as you experienced. That said, this is happening in a basic installation and should happen in clusters with more worker nodes, and as such I wouldn't consider it a release blocker or even a bug. Did you check on a larger cluster?
Sorry, it is late.
It SHOULDN'T happen in clusters with more worker nodes.
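A minimal sketch of the counting argument above, assuming the anti-affinity is a hard requirement (at most one alertmanager replica per worker). The function name and parameters are hypothetical. It models only the arithmetic in this comment, not what the scheduler or the drain code actually does:

```python
def replicas_fit(replicas: int, workers: int, draining: int = 1) -> bool:
    """Return True if every replica can still get its own worker node.

    Assumption: hard one-replica-per-node anti-affinity, and `draining`
    workers are cordoned and therefore unschedulable.
    """
    schedulable = workers - draining
    return replicas <= schedulable


if __name__ == "__main__":
    # 3 replicas on 3 workers with one node draining: no free node is left.
    print(replicas_fit(replicas=3, workers=3))
    # With a fourth worker, the evicted replica has somewhere to go.
    print(replicas_fit(replicas=3, workers=4))
```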
So, looks like QE has seen this before: https://bugzilla.redhat.com/show_bug.cgi?id=1733474
That is a 3 node cluster according to the logs. I was also pinged about a similar situation (no logs) in chat today.
If we're hitting this, our users are going to hit it. 3 worker nodes is the minimum for a production cluster, not 4, and this is definitely a bug.
Exactly, 3 nodes are a minimum requirement and you are trying to scale below that. So this is not a bug, but a user misconfiguration.
(In reply to Pawel Krupa from comment #9)
> Exactly, 3 nodes are a minimum requirement and you are trying to scale below
> that. So this is not a bug, but a user misconfiguration.
That is untrue. 3 node clusters support upgrades, and if you can't drain a node, it's going to block the upgrade.
So, I'm still working through this on my end, because it might be a problem with the node or something else rather than the monitoring pods; those may just be what's left after the node starts having problems.
So far, I've been unable to replicate on AWS with a nightly build. I'm going to try to replicate it on GCP in different conditions. In AWS, draining a node works perfectly fine; the monitoring pods terminate as expected.
Okay, I think we might be making some progress (hopefully?)
I have a node in GCP that's pending server-side CSR. I suspect this might be causing the issue? Not sure why it's still pending.
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-7msw5   14m   system:node:mgdevx-jt9sx-w-c-2qqdq.c.openshift-gce-devel.internal           Approved,Issued
csr-9rhvf   23m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-cvc4m   14m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-hpsqm   14m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-jq2hw   23m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-kb5nb   23m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-lfbvs   14m   system:node:mgdevx-jt9sx-w-d-lzd5h.c.openshift-gce-devel.internal           Pending
csr-ljf27   23m   system:node:mgdevx-jt9sx-m-2.c.openshift-gce-devel.internal                 Approved,Issued
csr-llpp6   23m   system:node:mgdevx-jt9sx-m-0.c.openshift-gce-devel.internal                 Approved,Issued
csr-w629r   14m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-zmn4x   14m   system:node:mgdevx-jt9sx-w-b-l7nvl.c.openshift-gce-devel.internal           Approved,Issued
csr-zz8lj   23m   system:node:mgdevx-jt9sx-m-1.c.openshift-gce-devel.internal                 Approved,Issued
@Michael: Could you approve all the pending CSRs (oc adm certificate approve [csr-id]), and retry your drain test? The logs for controller-manager and the machine-approver would be useful. Are you getting the CSR issue with other clusters?
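As a rough sketch of the "approve all the pending CSRs" step: the snippet below picks out the Pending entries from a pasted listing and prints one approve command per CSR. It assumes the four-column NAME/AGE/REQUESTOR/CONDITION layout shown in this bug; other releases may print different columns. The function names and the SAMPLE text (three rows copied from the listing above) are illustrative only, and it prints commands rather than running them:

```python
from __future__ import annotations


def pending_csrs(listing: str) -> list[str]:
    """Return the names of CSRs whose CONDITION column is exactly 'Pending'."""
    names = []
    for line in listing.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a 4-column data row.
        if len(fields) != 4 or fields[0] == "NAME":
            continue
        name, _age, _requestor, condition = fields
        if condition == "Pending":
            names.append(name)
    return names


def approve_commands(listing: str) -> list[str]:
    """Build one 'oc adm certificate approve' command per pending CSR."""
    return [f"oc adm certificate approve {name}" for name in pending_csrs(listing)]


# Three rows copied from the listing in this bug, for illustration.
SAMPLE = """\
NAME AGE REQUESTOR CONDITION
csr-7msw5 14m system:node:mgdevx-jt9sx-w-c-2qqdq.c.openshift-gce-devel.internal Approved,Issued
csr-lfbvs 14m system:node:mgdevx-jt9sx-w-d-lzd5h.c.openshift-gce-devel.internal Pending
csr-zz8lj 23m system:node:mgdevx-jt9sx-m-1.c.openshift-gce-devel.internal Approved,Issued
"""

if __name__ == "__main__":
    for cmd in approve_commands(SAMPLE):
        print(cmd)
```

Printing the commands first makes it easy to check which requests would be approved before running anything against the cluster.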
*** This bug has been marked as a duplicate of bug 1744029 ***