Description of problem:
The cluster shows some pods - static pods or pods created by a DaemonSet (etcd, apiserver, etc.) - in 'Running' state even after the node they were running on has been turned off.

Version-Release number of selected component (if applicable):

[vadim@vadim openshift4-ops]$ oc4 get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-rc.7   True        False         24h     Error while reconciling 4.1.0-rc.7: an unknown error has occurred

The same was tested on:

[vadim@vadim openshift4-ops]$ oc4 get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-rc.5   True        False         8d      Error while reconciling 4.1.0-rc.5: the cluster operator kube-scheduler is degraded

How reproducible:
Always

Steps to Reproduce:
1. Turn off any master node (in my case - from the AWS console).
2. Check that the node is in NotReady state (see also the note right after these steps):

[vadim@vadim openshift4-ops]$ oc4 get nodes -l node-role.kubernetes.io/master=
NAME                                        STATUS     ROLES    AGE   VERSION
ip-192-168-1-1.us-west-2.compute.internal   Ready      master   25h   v1.13.4+cb455d664
ip-192-168-1-2.us-west-2.compute.internal   NotReady   master   25h   v1.13.4+cb455d664
ip-192-168-1-3.us-west-2.compute.internal   Ready      master   25h   v1.13.4+cb455d664
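Note (not part of the original report, just a convenience check): the NotReady state can also be confirmed from the node's Ready condition. For a powered-off node it typically reports status Unknown with reason NodeStatusUnknown, because the kubelet has stopped posting status updates. Using the node name from this reproduction:

oc4 get node ip-192-168-1-2.us-west-2.compute.internal \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{" "}{.status.conditions[?(@.type=="Ready")].reason}{"\n"}'

While the node is down this is expected to print something like "Unknown NodeStatusUnknown".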
Actual results:
Pods on the node are still in Running state, even if the node has been turned off for a long period of time:

1. Get the static pod status, e.g. in openshift-etcd:

[vadim@vadim openshift4-ops]$ oc4 get pods -n openshift-etcd -o wide
NAME                                                    READY   STATUS    RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
etcd-member-ip-192-168-1-1.us-west-2.compute.internal   2/2     Running   4          25h   192.168.1.1   ip-192-168-1-1.us-west-2.compute.internal   <none>           <none>
etcd-member-ip-192-168-1-2.us-west-2.compute.internal   2/2     Running   5          25h   192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>           <none>
etcd-member-ip-192-168-1-3.us-west-2.compute.internal   2/2     Running   4          25h   192.168.1.3   ip-192-168-1-3.us-west-2.compute.internal   <none>           <none>

The pod 'etcd-member-ip-192-168-1-2.us-west-2.compute.internal' stays in Running state even though the node has been turned off for a long time (more than a couple of hours).

2. Try to connect to a pod on the disabled node:

[vadim@vadim openshift4-ops]$ oc4 -n openshift-etcd rsh etcd-member-ip-192-168-1-2.us-west-2.compute.internal
Defaulting container name to etcd-member.
Use 'oc describe pod/etcd-member-ip-192-168-1-2.us-west-2.compute.internal -n openshift-etcd' to see all of the containers in this pod.
Error from server: error dialing backend: dial tcp 192.168.1.2:10250: i/o timeout
[vadim@vadim openshift4-ops]$

3. The same happens for pods created by a DaemonSet.

openshift-image-registry namespace:

[vadim@vadim openshift4-ops]$ oc4 get ds -n openshift-image-registry
NAME      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
node-ca   5         5         5       5            5           beta.kubernetes.io/os=linux   26h

[vadim@vadim openshift4-ops]$ oc4 get pods -n openshift-image-registry -o wide | grep ip-192-168-1-2.us-west-2.compute.internal
node-ca-bgpt6   1/1   Running   2   26h   241.0.0.70   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>

openshift-apiserver namespace:

[vadim@vadim openshift4-ops]$ oc4 get all -n openshift-apiserver
NAME                  READY   STATUS    RESTARTS   AGE
pod/apiserver-8722v   1/1     Running   2          22h
pod/apiserver-knwrm   1/1     Running   1          22h
pod/apiserver-sjlgm   1/1     Running   2          22h

NAME          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/api   ClusterIP   172.31.17.111   <none>        443/TCP   26h

NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
daemonset.apps/apiserver   3         3         2       3            2           node-role.kubernetes.io/master=   26h

4. Get all pods filtered by Running state and node name:

[vadim@vadim openshift4-ops]$ oc4 get pods --all-namespaces -o wide | grep ip-192-168-1-2.us-west-2.compute.internal | grep Running
openshift-apiserver                      apiserver-8722v                                                       1/1   Running   2   21h    241.0.0.69    ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-cluster-node-tuning-operator   tuned-km8zr                                                           1/1   Running   2   25h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-dns                            dns-default-27bs5                                                     2/2   Running   4   25h    241.0.0.71    ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-etcd                           etcd-member-ip-192-168-1-2.us-west-2.compute.internal                 2/2   Running   5   25h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-image-registry                 node-ca-bgpt6                                                         1/1   Running   2   24h    241.0.0.70    ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-kube-apiserver                 kube-apiserver-ip-192-168-1-2.us-west-2.compute.internal              2/2   Running   0   6h4m   192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-kube-controller-manager        kube-controller-manager-ip-192-168-1-2.us-west-2.compute.internal     2/2   Running   0   6h6m   192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-kube-scheduler                 openshift-kube-scheduler-ip-192-168-1-2.us-west-2.compute.internal    1/1   Running   2   24h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-machine-config-operator        machine-config-daemon-2pb9f                                           1/1   Running   2   25h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-machine-config-operator        machine-config-server-lmhhp                                           1/1   Running   2   25h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-monitoring                     node-exporter-lcl2w                                                   2/2   Running   4   24h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-multus                         multus-bbrmz                                                          1/1   Running   2   25h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-sdn                            ovs-lsx4x                                                             1/1   Running   2   25h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-sdn                            sdn-bzfg6                                                             1/1   Running   4   25h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-sdn                            sdn-controller-wbj7b                                                  1/1   Running   2   25h    192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
[vadim@vadim openshift4-ops]$

5. Get all pods filtered by Terminating state and node name:

[vadim@vadim openshift4-ops]$ oc4 get pods --all-namespaces -o wide | grep ip-192-168-1-2.us-west-2.compute.internal | grep Termin
openshift-authentication               oauth-openshift-77586b5dbd-2xpwm     1/1   Terminating   0   8h      241.0.0.73    ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-controller-manager           controller-manager-sjw9x             1/1   Terminating   0   7h12m   241.0.0.81    ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-machine-config-operator      etcd-quorum-guard-66b78568d6-mtdqv   1/1   Terminating   1   22h     192.168.1.2   ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>
openshift-operator-lifecycle-manager   packageserver-5894fb9cc5-z9zqp       1/1   Terminating   0   7h14m   241.0.0.79    ip-192-168-1-2.us-west-2.compute.internal   <none>   <none>

It looks like this bug affects only static pods and pods created by a DaemonSet - they are still reported as 'Running', whereas pods created from Deployments are moved to 'Terminating'.
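A possible explanation (my assumption, not verified against the controller code): DaemonSet pods automatically tolerate the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable NoExecute taints, so taint-based eviction never removes them, and the status of static (mirror) pods is owned by the kubelet on the node, which cannot report anything once the node is powered off. Deployment pods, by contrast, do get deleted by the node controller but remain in Terminating because the unreachable kubelet never confirms the deletion. The DaemonSet tolerations can be checked on one of the affected pods from this report, e.g.:

oc4 get pod node-ca-bgpt6 -n openshift-image-registry -o jsonpath='{.spec.tolerations}'

which should include node.kubernetes.io/unreachable with operator Exists and effect NoExecute.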
Expected results:
Pods on the powered-off node have status Terminated (or any status other than Running).

Additional info:
Looks like the same issue as https://bugzilla.redhat.com/show_bug.cgi?id=1701099