Bug 1916042 - pods not evicted after node goes NotReady
Summary: pods not evicted after node goes NotReady
Keywords:
Status: CLOSED DUPLICATE of bug 1901208
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.11.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.11.z
Assignee: Ryan Phillips
QA Contact: Weinan Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-14 01:47 UTC by Brendan Shirren
Modified: 2024-06-13 23:55 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-26 13:52:25 UTC
Target Upstream Version:
Embargoed:



Description Brendan Shirren 2021-01-14 01:47:06 UTC
Description of problem: Some pods (using persistent volumes) are not evicted after a node goes NotReady.

Version-Release number of selected component (if applicable): v3.11.248+

How reproducible: every time.

Steps to Reproduce:
1. Deploy an OCP 3.11 cluster with the Monitoring stack (Prometheus, Alertmanager) backed by persistent volumes (e.g. Gluster).
2. Take a node down as in a hardware failure (loss of the atomic-openshift-node and crio services).
3. Observe that some pods (e.g. Prometheus) are stuck in deletion forever and are never evicted (a command sketch follows below).
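
A minimal command sketch for steps 2-3, assuming SSH access to the node being failed (hostnames and namespace are taken from the environment described later in this report):

$ ssh infra-0.example.com 'systemctl stop atomic-openshift-node crio'   # simulate the node failure
$ oc get nodes -w                                                       # wait for the node to go NotReady
$ oc get pods -n openshift-monitoring -o wide -w                        # watch the eviction progress
$ oc describe pod prometheus-k8s-0 -n openshift-monitoring              # pod stays Terminating with reason NodeLost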

Actual results:

The pod is marked for deletion but never terminates, and the PVC is not detached.

Expected results:

The pod terminates after the "Termination Grace Period" and releases the PVC so that a new pod can start on another node.
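
A hedged sketch for checking the expected behaviour once the grace period has elapsed (namespace and object names taken from this report):

$ oc get pods -n openshift-monitoring -o wide -w                            # pod should leave Terminating and be rescheduled
$ oc get pvc -n openshift-monitoring                                        # claim should remain Bound to the same volume
$ oc get events -n openshift-monitoring | grep -iE 'attach|detach|evict'    # volume should detach from the failed node and attach to the new one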

Additional info:

$ oc get pvc -n openshift-monitoring
NAME                                       STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS              AGE
alertmanager-main-db-alertmanager-main-0   Bound     pvc-994475a1-4037-11eb-a07c-fa163ee92557   256Mi      RWO            glusterfs-storage-block   24d
alertmanager-main-db-alertmanager-main-1   Bound     pvc-b7c424cc-4037-11eb-a07c-fa163ee92557   256Mi      RWO            glusterfs-storage-block   24d
alertmanager-main-db-alertmanager-main-2   Bound     pvc-cdfea613-4037-11eb-a07c-fa163ee92557   256Mi      RWO            glusterfs-storage-block   24d
prometheus-k8s-db-prometheus-k8s-0         Bound     pvc-a058ce9a-4036-11eb-a07c-fa163ee92557   2Gi        RWO            glusterfs-storage-block   24d
prometheus-k8s-db-prometheus-k8s-1         Bound     pvc-f7fec9d5-4036-11eb-a07c-fa163ee92557   2Gi        RWO            glusterfs-storage-block   24d


Pod status 20 minutes after downing the "infra-0" node:

NAME                                           READY     STATUS     RESTARTS   AGE       IP              NODE
alertmanager-main-0                            3/3       Running    0          48m       10.131.0.17     node-0.example.com
alertmanager-main-1                            3/3       Running    0          24d       10.130.0.11     infra-1.example.com
alertmanager-main-2                            3/3       Running    0          17m       10.128.2.15     infra-2.example.com
cluster-monitoring-operator-5bf785d58c-z46ns   1/1       Running    0          24d       10.130.0.9      infra-1.example.com
grafana-65c68b58f5-fg8mq                       2/2       Running    0          17m       10.128.2.14     infra-2.example.com
grafana-65c68b58f5-w499j                       2/2       Unknown    0          24d       10.129.0.6      infra-0.example.com
kube-state-metrics-7ffb5dc8ff-989cp            3/3       Running    0          48m       10.131.0.16     node-0.example.com 
node-exporter-6xfb8                            2/2       Running    0          24d       10.74.180.149   infra-1.example.com
node-exporter-7jq98                            2/2       NodeLost   0          24d       10.74.182.171   infra-0.example.com
node-exporter-c92k7                            2/2       Running    4          24d       10.74.178.75    node-0.example.com
node-exporter-ch6jg                            2/2       Running    6          24d       10.74.180.200   infra-2.example.com
node-exporter-vttsp                            2/2       Running    0          24d       10.74.183.60    master-0.example.com
prometheus-k8s-0                               4/4       Unknown    1          24d       10.129.0.7      infra-0.example.com
prometheus-k8s-1                               4/4       Running    1          24d       10.130.0.8      infra-1.example.com
prometheus-operator-687c7dd6c4-djzgl           1/1       Unknown    0          24d       10.129.0.5      infra-0.example.com
prometheus-operator-687c7dd6c4-hccsd           1/1       Running    0          17m       10.131.0.18     node-0.example.com


The master-controllers log shows the Alertmanager PVC being force-detached after maxWaitForUnmountDuration (6m0s) expires:

I0110 23:44:48.359850       1 controller_utils.go:224] Recording status change NodeNotReady event message for node infra-0.example.com
I0110 23:44:48.359916       1 controller_utils.go:141] Update ready status of pods on node [infra-0.example.com]
I0110 23:46:18.841192       1 node_lifecycle_controller.go:651] Node is unresponsive. Adding Pods on Node infra-0.example.com to eviction queues: 2021-01-10 23:46:18.841162346 -0500 EST m=+2082764.641742551 is later than 2021-01-10 23:44:48.359817099 -0500 EST m=+2082674.160397328 + 1m30s
I0110 23:46:18.941484       1 controller_utils.go:212] Recording Deleting all Pods from Node infra-0.example.com. event message for node infra-0.example.com
W0110 23:46:19.335042       1 reconciler.go:327] Multi-Attach error for volume "pvc-cdfea613-4037-11eb-a07c-fa163ee92557" (UniqueName: "kubernetes.io/iscsi/10.74.180.200:iqn.2016-12.org.gluster-block:0345570f-eada-480d-a240-50a64ea9cf46:0") from node "infra-2.example.com" Volume is already exclusively attached to node infra-0.example.com and can't be attached to another
W0110 23:52:19.224855       1 reconciler.go:231] attacherDetacher.DetachVolume started for volume "pvc-cdfea613-4037-11eb-a07c-fa163ee92557" (UniqueName: "kubernetes.io/iscsi/10.74.180.200:iqn.2016-12.org.gluster-block:0345570f-eada-480d-a240-50a64ea9cf46:0") on node "infra-0.example.com" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching
I0110 23:52:19.224945       1 operation_generator.go:419] DetachVolume.Detach succeeded for volume "pvc-cdfea613-4037-11eb-a07c-fa163ee92557" (UniqueName: "kubernetes.io/iscsi/10.74.180.200:iqn.2016-12.org.gluster-block:0345570f-eada-480d-a240-50a64ea9cf46:0") on node "infra-0.example.com"


[master-controllers log]

I0110 23:44:48.636533       1 controller_utils.go:158] Updating ready status of pod prometheus-k8s-0 to false
I0110 23:46:19.266664       1 controller_utils.go:95] Starting deletion of pod openshift-monitoring/prometheus-k8s-0
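
The controller excerpts above can be gathered on the active master with something along these lines (a sketch, assuming the OCP 3.11 default layout where the controllers run as a static pod and the master-logs helper is present on the master host):

$ /usr/local/bin/master-logs controllers controllers 2>&1 | \
    grep -E 'NodeNotReady|eviction|Multi-Attach|DetachVolume'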


$ oc describe pod prometheus-k8s-0
Node:                      infra-0.example.com/10.74.182.171
Start Time:                Thu, 17 Dec 2020 02:13:56 -0500
Status:                    Terminating (lasts 4m)
Termination Grace Period:  600s
Reason:                    NodeLost
Message:                   Node infra-0.example.com which was running pod prometheus-k8s-0 is unresponsive
IP:                        10.129.0.7
Controlled By:             StatefulSet/prometheus-k8s

Comment 1 Ryan Phillips 2021-01-26 13:52:25 UTC
This has been fixed in 4.x. I talked to the engineer about backporting the patch to 3.11. The patch has a dependency on Golang, and 3.11 now receives mostly security patches only. The patch is too complex to backport to 3.11, so an upgrade to 4.x is advisable.

The issue is with HTTP/2 timeouts: stale HTTP/2 sockets can be placed back into the connection pool and re-used.
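
For context only, and not claimed to be the exact patch referenced here: newer Kubernetes client libraries can detect such stale HTTP/2 sockets with PING-based connection health checks, tunable via environment variables on components built against a client-go that supports them; these knobs do not exist in 3.11. A hedged illustration (./some-controller-binary is a hypothetical client binary):

# HTTP2_READ_IDLE_TIMEOUT_SECONDS: send an HTTP/2 PING if no frame has been
#   read for this many seconds (0 disables the health check).
# HTTP2_PING_TIMEOUT_SECONDS: close the connection, instead of reusing it,
#   if the PING is not answered within this many seconds.
$ HTTP2_READ_IDLE_TIMEOUT_SECONDS=30 HTTP2_PING_TIMEOUT_SECONDS=15 ./some-controller-binary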

*** This bug has been marked as a duplicate of bug 1901208 ***

