Bug 1701099

Summary: Pods are still shown as Running forever after the corresponding hosting node is powered off
Product: OpenShift Container Platform Reporter: Xingxing Xia <xxia>
Component: kube-controller-manager Assignee: Maciej Szulik <maszulik>
Status: CLOSED ERRATA QA Contact: zhou ying <yinzhou>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.1.0 CC: aos-bugs, clasohm, gblomqui, gparente, jokerman, mfojtik, mmccomas, srozen, sttts, vzharov, yinzhou
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-13 17:11:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Xingxing Xia 2019-04-18 05:13:22 UTC
Description of problem:
This bug is filed separately for https://bugzilla.redhat.com/show_bug.cgi?id=1672894#c16 .
Pods are still shown as Running long after the corresponding hosting node is powered off.

Version-Release number of selected component (if applicable):
4.0.0-0.nightly-2019-04-10-182914

How reproducible:
Always

Steps to Reproduce:
1. Create a NextGen cluster
2. Power off one master by ssh to the master and run `shutdown -h now`
3. Check the pods that were Running on the powered-off node (see the example commands below).
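
A minimal sketch of the check in step 3, assuming NODE holds the name of the powered-off node (the node name here is an example; substitute your own):

$ NODE=ip-172-31-167-112.us-east-2.compute.internal
$ oc get node $NODE                      # eventually reports NotReady
$ oc get pods -A -o wide | grep $NODE    # pods that were scheduled on that node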

Actual results:
3. The pods are still shown as Running after a long time.

Expected results:
3. The pods are no longer shown as Running.

Additional info:

Comment 1 Seth Jennings 2019-04-18 20:11:19 UTC
What is "a long time"?  We expect them to show Running for 5m until the node controller evicts all the pods on the non-responsive node, moving them to NodeLost state.

Comment 2 Xingxing Xia 2019-04-19 01:38:54 UTC
(In reply to Seth Jennings from comment #1)
> What is "a long time"?

Forever. 40 hours have elapsed since the time of https://bugzilla.redhat.com/show_bug.cgi?id=1672894#c15 , and the pods of the powered-off node (ip-172-31-167-112.us-east-2.compute.internal) in that environment are still observed as Running:
$ oc get no ip-172-31-167-112.us-east-2.compute.internal
NAME                                           STATUS     ROLES    AGE   VERSION
ip-172-31-167-112.us-east-2.compute.internal   NotReady   master   41h   v1.12.4+509916ce1

$ oc get po -o wide -n openshift-apiserver | grep ip-172-31-167-112.us-east-2.compute.internal
apiserver-bmgcw   1/1     Running   0          41h   10.130.0.31   ip-172-31-167-112.us-east-2.compute.internal   <none>           <none>

[xxia@fedora29 my]$ oc get po -o wide --all-namespaces | grep ip-172-31-167-112.us-east-2.compute.internal
kube-system                                             etcd-member-ip-172-31-167-112.us-east-2.compute.internal                2/2     Running       0          41h
   172.31.167.112   ip-172-31-167-112.us-east-2.compute.internal   <none>           <none>
openshift-apiserver                                     apiserver-bmgcw                                                         1/1     Running       0          41h
   10.130.0.31      ip-172-31-167-112.us-east-2.compute.internal   <none>           <none>
...snipped...

Comment 3 Seth Jennings 2019-04-22 14:02:14 UTC
Those pods are mirror pods (i.e. apiserver mirrors of the static pod manifests on the node).  I'm not sure if mirror pod status is ever updated by anything other than the node.  Since these are static pods, by definition there is no higher-level controller that could start the pod on another node.
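
One way to confirm these are mirror pods (a sketch; assumes the kubelet's usual kubernetes.io/config.mirror annotation on mirrored static pods):

$ oc get pod etcd-member-ip-172-31-167-112.us-east-2.compute.internal -n kube-system -ojson \
    | jq -r '.metadata.annotations["kubernetes.io/config.mirror"]'
# a non-null hash indicates the API object is only a mirror of a static pod manifest on the node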

Comment 4 Seth Jennings 2019-04-22 16:22:13 UTC
This is not a blocker and is probably by design (i.e. not a bug).  Confirming...

Comment 6 Sunil Choudhary 2019-05-31 08:19:13 UTC
*** Bug 1715672 has been marked as a duplicate of this bug. ***

Comment 7 Vadim Zharov 2019-06-19 21:56:38 UTC
It is not related only to static/mirror pods; it also affects all pods created by DaemonSets - I don't think it is by design.
For example, if you shut down a master node and run

oc get pods -n openshift-image-registry -o wide

it will show that pods are running on all nodes, including the master node.

At the same time, Prometheus shows that these pods are NOT "Ready" - if you query for kube_pod_status_ready{namespace="openshift-etcd",condition="true"}, it will show you only two pods, not three as oc get pods -n openshift-etcd shows.
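
The same discrepancy can be seen from the CLI by printing the Ready condition next to the phase; a sketch using jsonpath:

$ oc get pods -n openshift-etcd -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
# the pod on the powered-off node stays phase=Running while its Ready condition is False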

Comment 8 Seth Jennings 2019-08-05 17:55:15 UTC
I took down a master-1 in my test cluster, waited 10m and this is the result

$ oc get pod --all-namespaces -owide | grep master-1 | grep Running
openshift-apiserver                                     apiserver-plmkl                                                   1/1     Running       0          53m    10.131.0.38   master-1   <none>           <none>
openshift-cluster-node-tuning-operator                  tuned-9dsn5                                                       1/1     Running       0          55m    10.42.15.10   master-1   <none>           <none>
openshift-controller-manager                            controller-manager-bc7s9                                          1/1     Running       0          51m    10.131.0.44   master-1   <none>           <none>
openshift-dns                                           dns-default-8dpbm                                                 2/2     Running       0          58m    10.131.0.8    master-1   <none>           <none>
openshift-etcd                                          etcd-member-master-1                                              2/2     Running       0          59m    10.42.15.10   master-1   <none>           <none>
openshift-image-registry                                node-ca-tdxpk                                                     1/1     Running       0          54m    10.131.0.32   master-1   <none>           <none>
openshift-kube-apiserver                                kube-apiserver-master-1                                           3/3     Running       0          53m    10.42.15.10   master-1   <none>           <none>
openshift-kube-controller-manager                       kube-controller-manager-master-1                                  2/2     Running       0          53m    10.42.15.10   master-1   <none>           <none>
openshift-kube-scheduler                                openshift-kube-scheduler-master-1                                 1/1     Running       0          52m    10.42.15.10   master-1   <none>           <none>
openshift-machine-config-operator                       machine-config-daemon-7npbl                                       1/1     Running       0          58m    10.42.15.10   master-1   <none>           <none>
openshift-machine-config-operator                       machine-config-server-r4q87                                       1/1     Running       0          58m    10.42.15.10   master-1   <none>           <none>
openshift-monitoring                                    node-exporter-66st4                                               2/2     Running       0          54m    10.42.15.10   master-1   <none>           <none>
openshift-multus                                        multus-admission-controller-qjbmg                                 1/1     Running       0          60m    10.131.0.13   master-1   <none>           <none>
openshift-multus                                        multus-xftv5                                                      1/1     Running       0          60m    10.42.15.10   master-1   <none>           <none>
openshift-sdn                                           ovs-r45nv                                                         1/1     Running       0          60m    10.42.15.10   master-1   <none>           <none>
openshift-sdn                                           sdn-controller-nlxs2                                              1/1     Running       0          60m    10.42.15.10   master-1   <none>           <none>
openshift-sdn                                           sdn-kvfsf                                                         1/1     Running       1          59m    10.42.15.10   master-1   <none>           <none>

All of these are either static pods or DS pods.

The DS does track that the pod on the offline node is not available, and that node is not included in the desired count.

$ oc get ds -n openshift-machine-config-operator machine-config-server
NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
machine-config-server   2         2         2       2            2           node-role.kubernetes.io/master=   62m
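
The same information is available from the DaemonSet status fields; a sketch using jq:

$ oc get ds machine-config-server -n openshift-machine-config-operator -ojson \
    | jq '{desired: .status.desiredNumberScheduled, ready: .status.numberReady, available: .status.numberAvailable}'
# the offline node is excluded from desiredNumberScheduled, matching the DESIRED=2 shown above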

$ oc get pod -ojson -n openshift-machine-config-operator machine-config-server-r4q87 | jq '{phase: .status.phase, conditions: .status.conditions}'
{
  "phase": "Running",
  "conditions": [
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2019-08-05T16:41:29Z",
      "status": "True",
      "type": "Initialized"
    },
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2019-08-05T16:41:31Z",
      "status": "False",
      "type": "Ready"
    },
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2019-08-05T16:41:31Z",
      "status": "True",
      "type": "ContainersReady"
    },
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2019-08-05T16:41:29Z",
      "status": "True",
      "type": "PodScheduled"
    }
  ]
}

The pod phase is Running (what `oc get pod` shows) but the condition `Type: Ready` is `False`, which is not reflected in `oc get pod` output.

This is more a source of confusion than it is a bug.

The `oc get pod` printer has always listed a pod `STATUS` column which doesn't correspond directly to anything in the pod status.  It most closely mirrors status.phase (e.g. Pending, Running, Succeeded, Failed, Unknown), but it can take on other values that are not a pod phase as well (e.g. ImagePullBackOff).
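
Until the printer surfaces this, the pod Ready condition can be listed alongside the phase; a sketch using jq (master-1 is the node from the listing above):

$ oc get pods -A -ojson | jq -r '.items[] | select(.spec.nodeName=="master-1") | [.metadata.name, .status.phase, (.status.conditions[] | select(.type=="Ready") | .status)] | @tsv'
# prints e.g. "machine-config-server-r4q87  Running  False" for pods on the lost node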

Sending to CLI to see if they want to make this clearer in the pod printer.

Comment 9 Seth Jennings 2019-08-07 13:39:12 UTC
*** Bug 1738243 has been marked as a duplicate of this bug. ***

Comment 10 Maciej Szulik 2019-08-08 13:46:08 UTC
There's nothing in the pod definition or status that would allow the CLI to provide better information when invoking the oc get command.
oc describe on a pod will show a warning only if probes were defined, because those will fail. This might require a deeper upstream
discussion, so I'm moving this to 4.3.

Seth, do you have any ideas on whether the kubelet could provide such an unavailable condition, or something like that, on pods?

Comment 11 Seth Jennings 2019-08-08 19:10:26 UTC
The only thing I could think of is that we show container readiness in the `x/y` form but not pod readiness.  Maybe we could show the status as `Running,NotReady` to reflect this?  That was my only potential idea.

Comment 12 Maciej Szulik 2019-08-19 14:42:17 UTC
> The only thing I could think of is that we show container readiness in the `x/y` form but not pod readiness.  Maybe we could show the status as `Running,NotReady` to reflect this?  That was my only potential idea.

I think that's a reasonable middle ground; I'll work on a PR.

Comment 13 Maciej Szulik 2019-11-06 14:13:47 UTC
This will have to happen upstream first; moving to 4.4.

Comment 14 Maciej Szulik 2020-02-17 16:18:53 UTC
I've opened an upstream PR to address this, but I'm not planning on backporting it, so it'll be available after the next k8s bump.
With that, I'm moving this to 4.5.

Comment 15 Maciej Szulik 2020-04-23 12:02:18 UTC
This merged with https://github.com/openshift/origin/pull/24719, moving accordingly.

Comment 19 Xingxing Xia 2020-04-28 04:55:42 UTC
(In reply to zhou ying from comment #18)
> I could still reproduce the issue now; could you please double-check?
> version   4.5.0-0.nightly-2020-04-25-170442   True        False         9h      Cluster version is 4.5.0-0.nightly-2020-04-25-170442
The verification steps look correct to me. Then you can assign the bug back.

Comment 20 Maciej Szulik 2020-05-20 09:26:09 UTC
*** Bug 1808445 has been marked as a duplicate of this bug. ***

Comment 21 Maciej Szulik 2020-05-20 09:26:50 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 22 Maciej Szulik 2020-05-20 12:50:42 UTC
I've verified this with a 4.5.0-0.ci-2020-05-18-204038 cluster: when I bring one node down and wait 10 mins,
the pods created from user workloads are reported as terminated and are moved to a different node over time.
The only pods remaining in the Running state are those controlled by a DS, just like Seth described in comment 8,
but that's perfectly fine.

When testing, I'd suggest using:

oc get po -A -owide | grep node

and not filtering out Running as was done previously, since that will distort your point of view.
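
Another option (a sketch; <node-name> is a placeholder) is to select pods by node with a field selector instead of grepping:

$ oc get pods -A -o wide --field-selector spec.nodeName=<node-name>
# note: the READY x/y column reflects container readiness, not the pod Ready condition discussed in comment 8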

Comment 26 errata-xmlrpc 2020-07-13 17:11:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409