Description of problem:
The cluster shows some pods - static pods or pods created by a DaemonSet (etcd, apiserver, etc.) - in 'Running' state even after the node they were running on has been turned off.

Version-Release number of selected component (if applicable):

[vadim@vadim openshift4-ops]$ oc4 get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-rc.7   True        False         24h     Error while reconciling 4.1.0-rc.7: an unknown error has occurred

The same was tested on:

[vadim@vadim openshift4-ops]$ oc4 get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-rc.5   True        False         8d      Error while reconciling 4.1.0-rc.5: the cluster operator kube-scheduler is degraded

How reproducible:
Always

Steps to Reproduce:
1. Turn off any master node (in my case - from the AWS console).
2. Check that the node is in NotReady state (see also the note right after these steps):

[vadim@vadim openshift4-ops]$ oc4 get nodes -l node-role.kubernetes.io/master=
NAME                                        STATUS     ROLES    AGE   VERSION
ip-192-168-1-1.us-west-2.compute.internal   Ready      master   25h   v1.13.4+cb455d664
ip-192-168-1-2.us-west-2.compute.internal   NotReady   master   25h   v1.13.4+cb455d664
ip-192-168-1-3.us-west-2.compute.internal   Ready      master   25h   v1.13.4+cb455d664
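Note (not part of the original report, just a convenience check): the NotReady state can also be confirmed from the node's Ready condition. For a powered-off node it typically reports status Unknown with reason NodeStatusUnknown, because the kubelet has stopped posting status updates. Using the node name from this reproduction:

oc4 get node ip-192-168-1-2.us-west-2.compute.internal \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{" "}{.status.conditions[?(@.type=="Ready")].reason}{"\n"}'

While the node is down this is expected to print something like "Unknown NodeStatusUnknown".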
Actual results:
Pods on the node are still in Running state, even if the node has been turned off for a long period of time:

1. Get the static pod status, e.g. in openshift-etcd:

[vadim@vadim openshift4-ops]$ oc4 get pods -n openshift-etcd -o wide
NAME                                                    READY   STATUS    RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
etcd-member-ip-192-168-1-1.us-west-2.compute.internal   2/2     Running   4          25h   192.168.1.1   ip-192-168-1-1.us-west-2.compute.internal   <none>           <none>
etcd-member-ip-192-168-1-2.us-west-2.compute.internal   2/2     Running   5          25h   192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>           <none>
etcd-member-ip-192-168-1-3.us-west-2.compute.internal   2/2     Running   4          25h   192.168.1.3   ip-192-168-1-3.us-west-2.compute.internal   <none>           <none>

The pod 'etcd-member-ip-192-168-1-2.us-west-2.compute.internal' stays in Running state even though the node has been turned off for a long time (more than a couple of hours).

2. Try to connect to a pod on the disabled node:

[vadim@vadim openshift4-ops]$ oc4 -n openshift-etcd rsh etcd-member-ip-192-168-1-2.us-west-2.compute.internal
Defaulting container name to etcd-member.
Use 'oc describe pod/etcd-member-ip-192-168-1-2.us-west-2.compute.internal -n openshift-etcd' to see all of the containers in this pod.
Error from server: error dialing backend: dial tcp 192.168.1.2:10250: i/o timeout
[vadim@vadim openshift4-ops]$

3. The same happens for pods created by a DaemonSet.

openshift-image-registry namespace:

[vadim@vadim openshift4-ops]$ oc4 get ds -n openshift-image-registry
NAME      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
node-ca   5         5         5       5            5           beta.kubernetes.io/os=linux   26h

[vadim@vadim openshift4-ops]$ oc4 get pods -n openshift-image-registry -o wide | grep ip-192-168-1-2.us-west-2.compute.internal
node-ca-bgpt6   1/1   Running   2   26h   241.0.0.70   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>

openshift-apiserver namespace:

[vadim@vadim openshift4-ops]$ oc4 get all -n openshift-apiserver
NAME                  READY   STATUS    RESTARTS   AGE
pod/apiserver-8722v   1/1     Running   2          22h
pod/apiserver-knwrm   1/1     Running   1          22h
pod/apiserver-sjlgm   1/1     Running   2          22h

NAME          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/api   ClusterIP   172.31.17.111   <none>        443/TCP   26h

NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
daemonset.apps/apiserver   3         3         2       3            2           node-role.kubernetes.io/master=   26h

4. Get all pods filtered by Running state and node name:

[vadim@vadim openshift4-ops]$ oc4 get pods --all-namespaces -o wide | grep ip-192-168-1-2.us-west-2.compute.internal | grep Running
openshift-apiserver                      apiserver-8722v                                                       1/1   Running   2   21h    241.0.0.69    ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-cluster-node-tuning-operator   tuned-km8zr                                                           1/1   Running   2   25h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-dns                            dns-default-27bs5                                                     2/2   Running   4   25h    241.0.0.71    ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-etcd                           etcd-member-ip-192-168-1-2.us-west-2.compute.internal                 2/2   Running   5   25h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-image-registry                 node-ca-bgpt6                                                         1/1   Running   2   24h    241.0.0.70    ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-kube-apiserver                 kube-apiserver-ip-192-168-1-2.us-west-2.compute.internal              2/2   Running   0   6h4m   192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-kube-controller-manager        kube-controller-manager-ip-192-168-1-2.us-west-2.compute.internal     2/2   Running   0   6h6m   192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-kube-scheduler                 openshift-kube-scheduler-ip-192-168-1-2.us-west-2.compute.internal    1/1   Running   2   24h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-machine-config-operator        machine-config-daemon-2pb9f                                           1/1   Running   2   25h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-machine-config-operator        machine-config-server-lmhhp                                           1/1   Running   2   25h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-monitoring                     node-exporter-lcl2w                                                   2/2   Running   4   24h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-multus                         multus-bbrmz                                                          1/1   Running   2   25h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-sdn                            ovs-lsx4x                                                             1/1   Running   2   25h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-sdn                            sdn-bzfg6                                                             1/1   Running   4   25h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-sdn                            sdn-controller-wbj7b                                                  1/1   Running   2   25h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
[vadim@vadim openshift4-ops]$

5. Get all pods filtered by Terminating state and node name:

[vadim@vadim openshift4-ops]$ oc4 get pods --all-namespaces -o wide | grep ip-192-168-1-2.us-west-2.compute.internal | grep Termin
openshift-authentication               oauth-openshift-77586b5dbd-2xpwm     1/1   Terminating   0   8h      241.0.0.73    ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-controller-manager           controller-manager-sjw9x             1/1   Terminating   0   7h12m   241.0.0.81    ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-machine-config-operator      etcd-quorum-guard-66b78568d6-mtdqv   1/1   Terminating   1   22h     192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-operator-lifecycle-manager   packageserver-5894fb9cc5-z9zqp       1/1   Terminating   0   7h14m   241.0.0.79    ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>

It looks like this bug affects only static pods and pods created by a DaemonSet - they are still reported as 'Running', whereas pods created from Deployments are moved to 'Terminating'.
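A possible explanation (my assumption, not verified against the controller code): DaemonSet pods automatically tolerate the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable NoExecute taints, so taint-based eviction never removes them, and the status of static (mirror) pods is owned by the kubelet on the node, which cannot report anything once the node is powered off. Deployment pods, by contrast, do get deleted by the node controller but remain in Terminating because the unreachable kubelet never confirms the deletion. The DaemonSet tolerations can be checked on one of the affected pods from this report, e.g.:

oc4 get pod node-ca-bgpt6 -n openshift-image-registry -o jsonpath='{.spec.tolerations}'

which should include node.kubernetes.io/unreachable with operator Exists and effect NoExecute.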
Expected results:
Pods on the powered-off node have status Terminated (or any status other than Running).

Additional info:
Looks like the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1701099