Bug 1701099
| Summary: | Pods are still forever shown Running after the corresponding hosting node is powered off | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Xingxing Xia <xxia> |
| Component: | kube-controller-manager | Assignee: | Maciej Szulik <maszulik> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.1.0 | CC: | aos-bugs, clasohm, gblomqui, gparente, jokerman, mfojtik, mmccomas, srozen, sttts, vzharov, yinzhou |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-07-13 17:11:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description  Xingxing Xia  2019-04-18 05:13:22 UTC
What is "a long time"? We expect them to show Running for 5m until the node controller evicts all the pods on the non-responsive node, moving them to NodeLost state. (In reply to Seth Jennings from comment #1) > What is "a long time"? Forever. So far since the time of https://bugzilla.redhat.com/show_bug.cgi?id=1672894#c15 , 40 hours elapsed, the pods of the powered-off node (ip-172-31-167-112.us-east-2.compute.internal) on the env are still observed Running : oc get no ip-172-31-167-112.us-east-2.compute.internal NAME STATUS ROLES AGE VERSION ip-172-31-167-112.us-east-2.compute.internal NotReady master 41h v1.12.4+509916ce1 $ oc get po -o wide -n openshift-apiserver | grep ip-172-31-167-112.us-east-2.compute.internal apiserver-bmgcw 1/1 Running 0 41h 10.130.0.31 ip-172-31-167-112.us-east-2.compute.internal <none> <none> [xxia@fedora29 my]$ oc get po -o wide --all-namespaces | grep ip-172-31-167-112.us-east-2.compute.internal kube-system etcd-member-ip-172-31-167-112.us-east-2.compute.internal 2/2 Running 0 41h 172.31.167.112 ip-172-31-167-112.us-east-2.compute.internal <none> <none> openshift-apiserver apiserver-bmgcw 1/1 Running 0 41h 10.130.0.31 ip-172-31-167-112.us-east-2.compute.internal <none> <none> ...snipped... Those pods are mirror pods (i.e. apiserver mirrors of a static pod manifests on the node). I'm not sure if mirror pod status is every updated by anything other than the node. Since, by definintion, this is no higher level controller that could start the pod on another node. This is not a blocker and is probably by design (i.e. not a bug). Confirming... *** Bug 1715672 has been marked as a duplicate of this bug. *** It is not related only to static/mirror pods, but also to all pods created by DaemonSets - don't think it is by design. For example, if you shutdown master node and run oc get pods -n openshift-image-registry -o wide it will show that pod are running on all nodes including master node. At the same time Prometheus shows that this pods are NOT in "Ready" - if query for kube_pod_status_ready{namespace="openshift-etcd",condition="true"} - it will show you only two pods, not three as oc get pods -n openshift-etcd shows. 
I took down master-1 in my test cluster, waited 10m, and this is the result:

```
$ oc get pod --all-namespaces -owide | grep master-1 | grep Running
openshift-apiserver                      apiserver-plmkl                     1/1   Running   0   53m   10.131.0.38   master-1   <none>   <none>
openshift-cluster-node-tuning-operator   tuned-9dsn5                         1/1   Running   0   55m   10.42.15.10   master-1   <none>   <none>
openshift-controller-manager             controller-manager-bc7s9            1/1   Running   0   51m   10.131.0.44   master-1   <none>   <none>
openshift-dns                            dns-default-8dpbm                   2/2   Running   0   58m   10.131.0.8    master-1   <none>   <none>
openshift-etcd                           etcd-member-master-1                2/2   Running   0   59m   10.42.15.10   master-1   <none>   <none>
openshift-image-registry                 node-ca-tdxpk                       1/1   Running   0   54m   10.131.0.32   master-1   <none>   <none>
openshift-kube-apiserver                 kube-apiserver-master-1             3/3   Running   0   53m   10.42.15.10   master-1   <none>   <none>
openshift-kube-controller-manager        kube-controller-manager-master-1    2/2   Running   0   53m   10.42.15.10   master-1   <none>   <none>
openshift-kube-scheduler                 openshift-kube-scheduler-master-1   1/1   Running   0   52m   10.42.15.10   master-1   <none>   <none>
openshift-machine-config-operator        machine-config-daemon-7npbl         1/1   Running   0   58m   10.42.15.10   master-1   <none>   <none>
openshift-machine-config-operator        machine-config-server-r4q87         1/1   Running   0   58m   10.42.15.10   master-1   <none>   <none>
openshift-monitoring                     node-exporter-66st4                 2/2   Running   0   54m   10.42.15.10   master-1   <none>   <none>
openshift-multus                         multus-admission-controller-qjbmg   1/1   Running   0   60m   10.131.0.13   master-1   <none>   <none>
openshift-multus                         multus-xftv5                        1/1   Running   0   60m   10.42.15.10   master-1   <none>   <none>
openshift-sdn                            ovs-r45nv                           1/1   Running   0   60m   10.42.15.10   master-1   <none>   <none>
openshift-sdn                            sdn-controller-nlxs2                1/1   Running   0   60m   10.42.15.10   master-1   <none>   <none>
openshift-sdn                            sdn-kvfsf                           1/1   Running   1   59m   10.42.15.10   master-1   <none>   <none>
```

All of these are either static pods or DS pods.

The DS does track that the pod on the offline node is not available, and the node is not included in the desired count.

```
$ oc get ds -n openshift-machine-config-operator machine-config-server
NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
machine-config-server   2         2         2       2            2           node-role.kubernetes.io/master=   62m

$ oc get pod -ojson -n openshift-machine-config-operator machine-config-server-r4q87 | jq '{phase: .status.phase, conditions: .status.conditions}'
{
  "phase": "Running",
  "conditions": [
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2019-08-05T16:41:29Z",
      "status": "True",
      "type": "Initialized"
    },
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2019-08-05T16:41:31Z",
      "status": "False",
      "type": "Ready"
    },
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2019-08-05T16:41:31Z",
      "status": "True",
      "type": "ContainersReady"
    },
    {
      "lastProbeTime": null,
      "lastTransitionTime": "2019-08-05T16:41:29Z",
      "status": "True",
      "type": "PodScheduled"
    }
  ]
}
```

The pod phase is Running (what `oc get pod` shows) but the condition `Type: Ready` is `False`, which is not reflected in `oc get pod` output. This is more a source of confusion than it is a bug. The `oc get pod` printer has always listed a pod `STATUS` column which doesn't correspond directly to anything in the pod status. It most closely mirrors status.phase (e.g. Pending, Running, Succeeded, Failed, Unknown), but it can take on other values that are not a pod phase as well (e.g. ImagePullBackOff).

Sending to CLI to see if they want to make this clearer in the pod printer.
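Until the printer itself changes, the Ready condition can be pulled into a listing with a jsonpath query. A sketch, using the node name from the output above and only standard `oc get` options:

```bash
# Namespace, name, phase, and Ready condition for every pod bound to master-1.
oc get pods --all-namespaces --field-selector spec.nodeName=master-1 \
  -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
```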
*** Bug 1738243 has been marked as a duplicate of this bug. ***

There's nothing in the pod definition or status that would allow the CLI to provide better information when invoking the `oc get` command. `oc describe` on a pod will show a warning only if probes were defined, because these will fail. This might require a deeper upstream discussion, so I'm moving this to 4.3. Seth, do you have any ideas whether the kubelet could provide such an unavailable condition or something like that on pods?

The only thing I could think of is that we show container readiness in the `x/y` form but not pod readiness. Maybe we show the status as `Running,NotReady` to reflect this? That was my only potential idea.

> The only thing I could think of is that we show container readiness in the `x/y` form but not pod readiness. Maybe we show the status as `Running,NotReady` to reflect this? That was my only potential idea.
I think that's a reasonable middle ground, I'll work on a PR.
This will have to happen upstream first, moving to 4.4.

I've opened an upstream PR to address this, but I'm not planning on backporting it, so it'll be available after the next k8s bump. With that, I'm moving this to 4.5.

This merged with https://github.com/openshift/origin/pull/24719, moving accordingly.

(In reply to zhou ying from comment #18)
> I still could reproduce the issue now, could you please double check?
> version   4.5.0-0.nightly-2020-04-25-170442   True   False   9h   Cluster version is 4.5.0-0.nightly-2020-04-25-170442

The verification steps look correct to me. Then you can assign the bug back.

*** Bug 1808445 has been marked as a duplicate of this bug. ***

I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

I've verified this with a 4.5.0-0.ci-2020-05-18-204038 cluster: when I bring one node down and wait 10 minutes, the pods created from user workloads are reported as terminated and are moved to a different node over time. The only pods remaining in the Running state are those controlled by DaemonSets, just like Seth described in comment 8, but that's perfectly fine. When testing, I'd suggest using `oc get po -A -owide | grep node` and not filtering on Running as was done previously, since that will distort your point of view.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
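As a small addendum to the verification guidance above, a minimal sketch of the suggested check; the node name and the user namespace below are placeholders, not values taken from this bug:

```bash
# Hypothetical node name of the powered-off node; replace it with yours.
NODE=master-1

# Pods still bound to that node after the ~10 minute eviction window;
# DaemonSet-managed and static pods are expected to remain listed here as Running.
oc get pods -A -o wide --field-selector spec.nodeName="$NODE"

# Pods from ordinary user workloads should have been terminated and
# rescheduled, so their replacements show up on other nodes.
oc get pods -n my-project -o wide   # my-project is a placeholder namespace
```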