Bug 1743741 - GCP: monitoring pods stuck in terminating state
Summary: GCP: monitoring pods stuck in terminating state
Status: CLOSED DUPLICATE of bug 1744029
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.2.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
Depends On:
Reported: 2019-08-20 15:00 UTC by Michael Gugino
Modified: 2020-06-30 01:26 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2019-08-28 19:29:18 UTC
Target Upstream Version:

Attachments
nodes/machines output (2.11 KB, text/plain)
2019-08-20 15:00 UTC, Michael Gugino
alert-manager-pod.yaml (9.00 KB, text/plain)
2019-08-20 15:02 UTC, Michael Gugino
must-gather for openshift-monitoring namespace (375.20 KB, application/x-xz)
2019-08-20 15:31 UTC, Michael Gugino

Description Michael Gugino 2019-08-20 15:00:32 UTC
Created attachment 1606113
nodes/machines output

Description of problem:
While attempting to delete a machine, the machine hangs indefinitely. Logs show that the drain operation cannot complete. This does not appear to be related to PDBs: the API is accepting the delete/evict requests, as there is room to schedule on another node.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Delete one of the machines that is created by the installer.  I'm not sure if these pods run on all the machines or just one.

Actual results:
Node never finishes draining.

"deleting pods with local storage: alertmanager-main-0, prometheus-adapter-6ffcd77594-226mt, prometheus-k8s-0;"

Expected results:
These pods should go away.

Additional info:
Attached version info, pod info

Comment 1 Michael Gugino 2019-08-20 15:02:57 UTC
Created attachment 1606115
alert-manager-pod.yaml

Comment 2 Michael Gugino 2019-08-20 15:03:28 UTC
I have the must-gather, lmk what parts you think would be helpful.

Comment 3 Michael Gugino 2019-08-20 15:31:34 UTC
Created attachment 1606119
must-gather for openshift-monitoring namespace

Comment 4 Michael Gugino 2019-08-20 16:12:09 UTC
I would say this is a possible 4.2 blocker. If we can't reliably drain a node because one of our pods is broken somehow, that's going to be a problem for maintaining a healthy cluster.

Comment 5 Michael Gugino 2019-08-20 19:37:34 UTC
This may have always been a problem and was uncovered by this bugfix: https://bugzilla.redhat.com/show_bug.cgi?id=1729243

Comment 6 Pawel Krupa 2019-08-20 22:11:07 UTC
In a standard development OpenShift installation you get 6 nodes (3 masters, 3 workers), and monitoring components are deployed only on worker nodes. We need 3 replicas for alertmanager; we also set anti-affinity for those pods and do not set a PodDisruptionBudget. All this means that in a standard development OpenShift installation, alertmanager always needs 3 replicas distributed across different nodes. This in turn will prevent node draining, as you experienced. That said, this is happening in a basic installation and should happen in clusters with more worker nodes, and as such I wouldn't consider it a release blocker or even a bug. Did you check on a larger cluster?

Comment 7 Pawel Krupa 2019-08-20 22:35:36 UTC
Sorry, it is late.

It SHOULDN'T happen in clusters with more worker nodes.
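
The scheduling argument in comments 6 and 7 can be sketched as a small model. It assumes a required (hard) anti-affinity rule of at most one alertmanager replica per node, and that the drained worker hosts one of the replicas. It only checks whether a replacement pod could be placed once that worker is cordoned; it says nothing about whether the eviction itself completes. The function name `can_reschedule` is hypothetical.

```python
def can_reschedule(workers, replicas):
    """Return True if a pod evicted from one cordoned worker has a free node.

    Assumption: hard anti-affinity, so at most one replica per node, and
    every replica currently runs on a distinct worker.
    """
    if replicas > workers:
        raise ValueError("replicas cannot exceed workers under hard anti-affinity")
    remaining_workers = workers - 1  # one worker is cordoned for the drain
    occupied = replicas - 1          # replicas still running on other workers
    return remaining_workers - occupied >= 1


for workers in (3, 4):
    print(workers, "workers:", can_reschedule(workers, replicas=3))
```

Under these assumptions, 3 workers leave no free node for the replacement pod, while 4 workers leave one.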

Comment 8 Michael Gugino 2019-08-22 21:55:31 UTC
So, looks like QE has seen this before: https://bugzilla.redhat.com/show_bug.cgi?id=1733474

That is a 3 node cluster according to the logs. I was also pinged about a similar situation (no logs) in chat today.

If we're hitting this, our users are going to hit it.  3 worker nodes is the minimum for a production cluster, not 4, and this is definitely a bug.

Comment 9 Pawel Krupa 2019-08-22 22:14:06 UTC
Exactly, 3 nodes are a minimum requirement and you are trying to scale below that. So this is not a bug, but a user misconfiguration.

Comment 10 Michael Gugino 2019-08-22 22:23:14 UTC
(In reply to Pawel Krupa from comment #9)
> Exactly, 3 nodes are a minimum requirement and you are trying to scale below
> that. So this is not a bug, but a user misconfiguration.

That is untrue.  3 node clusters support upgrades, and if you can't drain a node, it's going to block the upgrade.

Comment 13 Michael Gugino 2019-08-23 12:43:45 UTC
So, I'm still working through this on my end, because it might be a problem with the node or something else rather than the monitoring pods, and those are just what's left after the node starts having problems.

So far, I've been unable to replicate on AWS with a nightly build. I'm going to try to replicate it on GCP in different conditions. In AWS, draining a node works perfectly fine; the monitoring pods terminate as expected.

Comment 14 Michael Gugino 2019-08-23 13:34:26 UTC
Okay, I think we might be making some progress (hopefully?)

I have a node in GCP that's pending server-side CSR.  I suspect this might be causing the issue?  Not sure why it's still pending.

NAME        AGE   REQUESTOR                                                                   CONDITION
csr-7msw5   14m   system:node:mgdevx-jt9sx-w-c-2qqdq.c.openshift-gce-devel.internal           Approved,Issued
csr-9rhvf   23m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-cvc4m   14m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-hpsqm   14m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-jq2hw   23m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-kb5nb   23m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-lfbvs   14m   system:node:mgdevx-jt9sx-w-d-lzd5h.c.openshift-gce-devel.internal           Pending
csr-ljf27   23m   system:node:mgdevx-jt9sx-m-2.c.openshift-gce-devel.internal                 Approved,Issued
csr-llpp6   23m   system:node:mgdevx-jt9sx-m-0.c.openshift-gce-devel.internal                 Approved,Issued
csr-w629r   14m   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-zmn4x   14m   system:node:mgdevx-jt9sx-w-b-l7nvl.c.openshift-gce-devel.internal           Approved,Issued
csr-zz8lj   23m   system:node:mgdevx-jt9sx-m-1.c.openshift-gce-devel.internal                 Approved,Issued
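
For longer listings, the pending entries can be picked out programmatically. The following is a minimal sketch that assumes the four-column layout shown above (NAME, AGE, REQUESTOR, CONDITION) with no spaces inside a column; the helper name `pending_csrs` is hypothetical, and the sample text is an excerpt of the listing above.

```python
CSR_OUTPUT = """\
NAME        AGE   REQUESTOR                                                                   CONDITION
csr-7msw5   14m   system:node:mgdevx-jt9sx-w-c-2qqdq.c.openshift-gce-devel.internal           Approved,Issued
csr-lfbvs   14m   system:node:mgdevx-jt9sx-w-d-lzd5h.c.openshift-gce-devel.internal           Pending
csr-zz8lj   23m   system:node:mgdevx-jt9sx-m-1.c.openshift-gce-devel.internal                 Approved,Issued
"""


def pending_csrs(table):
    """Return (name, requestor) pairs for rows whose CONDITION is Pending."""
    pending = []
    for line in table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) != 4:
            continue  # ignore blank or malformed rows
        name, _age, requestor, condition = fields
        if condition == "Pending":
            pending.append((name, requestor))
    return pending


print(pending_csrs(CSR_OUTPUT))
```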

Comment 16 Ryan Phillips 2019-08-26 18:39:07 UTC
@Michael: Could you approve all the pending CSRs (oc adm certificate approve [csr-id]), and retry your drain test? The logs for controller-manager and the machine-approver would be useful. Are you getting the CSR issue with other clusters?
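
As a rough sketch of the approval step, the snippet below builds one `oc adm certificate approve` invocation per CSR name. It only constructs and prints the commands and does not run them; the helper name `approve_commands` is hypothetical, and `csr-lfbvs` is the pending CSR from comment 14.

```python
import shlex


def approve_commands(csr_names):
    """Build one `oc adm certificate approve` argv list per CSR name."""
    return [["oc", "adm", "certificate", "approve", name] for name in csr_names]


for argv in approve_commands(["csr-lfbvs"]):
    # Print a shell-safe form of the command for review before running it.
    print(shlex.join(argv))
```

Each argv list could be passed to `subprocess.run(argv, check=True)` on a machine where `oc` is logged in to the cluster.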

Comment 18 Seth Jennings 2019-08-28 19:29:18 UTC
see https://bugzilla.redhat.com/show_bug.cgi?id=1744029#c4

*** This bug has been marked as a duplicate of bug 1744029 ***
