+++ This bug was initially created as a clone of Bug #1822268 +++
+++ This bug was initially created as a clone of Bug #1820507 +++
+++ This bug was initially created as a clone of Bug #1819954 +++

Description of problem:

oc get ns | grep Terminating
ci-op-0kfbll3t   Terminating   15h
ci-op-0s7gqtnf   Terminating   27h
ci-op-30c903rw   Terminating   18h
ci-op-fpwy7vwm   Terminating   7d13h
ci-op-i7xcfyv9   Terminating   7d21h
ci-op-jripq40x   Terminating   7d13h
ci-op-lc136j9k   Terminating   18h
ci-op-m2qdij2v   Terminating   14h
ci-op-rmyyz9vc   Terminating   26h
ci-op-sw5540vd   Terminating   26h
ci-op-t59xf5b1   Terminating   7d13h
ci-op-vz3wn11c   Terminating   7d13h
ci-op-x7zyz5c5   Terminating   42h
ci-op-y1wt39w2   Terminating   7d18h
ci-op-z4fh5nin   Terminating   26h

oc get all -n ci-op-0kfbll3t -o wide
NAME                  READY   STATUS        RESTARTS   AGE   IP       NODE                           NOMINATED NODE   READINESS GATES
pod/release-initial   0/2     Terminating   0          14h   <none>   ip-10-0-169-106.ec2.internal   <none>           <none>

# typical description of the terminating pods
oc describe pod/release-initial -n ci-op-0kfbll3t | grep "was terminated" -A3 -B3
      State:          Terminated
        Reason:       ContainerStatusUnknown
        Message:      The container could not be located when the pod was terminated
        Exit Code:    137
        Started:      Mon, 01 Jan 0001 00:00:00 +0000
        Finished:     Mon, 01 Jan 0001 00:00:00 +0000

As a side effect, the autoscaler fails to drain the nodes hosting those pods and therefore cannot scale down the cluster.

Version-Release number of selected component (if applicable):
oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-23-130439   True        False         8d      Cluster version is 4.3.0-0.nightly-2020-03-23-130439

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

Also tried:

oc describe pod -n ci 355d34f2-733a-11ea-9b47-0a58ac10a1a6 | grep "was terminated" -A3 -B3
      /tools/entrypoint
      State:          Terminated
        Reason:       ContainerStatusUnknown
        Message:      The container could not be located when the pod was terminated
        Exit Code:    137
        Started:      Mon, 01 Jan 0001 00:00:00 +0000
        Finished:     Mon, 01 Jan 0001 00:00:00 +0000

oc get pod -n ci 355d34f2-733a-11ea-9b47-0a58ac10a1a6 -o wide
NAME                                   READY   STATUS        RESTARTS   AGE   IP             NODE                           NOMINATED NODE   READINESS GATES
355d34f2-733a-11ea-9b47-0a58ac10a1a6   0/2     Terminating   1          35h   10.128.64.76   ip-10-0-131-192.ec2.internal   <none>           <none>

oc adm drain "ip-10-0-131-192.ec2.internal" --delete-local-data --ignore-daemonsets --force
node/ip-10-0-131-192.ec2.internal already cordoned
WARNING: deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: ci/355d34f2-733a-11ea-9b47-0a58ac10a1a6, ci/70f91686-7339-11ea-9b47-0a58ac10a1a6; ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-zmpqj, openshift-dns/dns-default-tfg2f, openshift-image-registry/node-ca-c8gbg, openshift-machine-config-operator/machine-config-daemon-p2j7l, openshift-monitoring/node-exporter-xhzfd, openshift-multus/multus-67xxk, openshift-sdn/ovs-z2zsd, openshift-sdn/sdn-2sqht
evicting pod "70f91686-7339-11ea-9b47-0a58ac10a1a6"
evicting pod "kubevirt-test-build"
evicting pod "openshift-acme-exposer-build"
evicting pod "baremetal-installer-build"
evicting pod "355d34f2-733a-11ea-9b47-0a58ac10a1a6"
pod/70f91686-7339-11ea-9b47-0a58ac10a1a6 evicted
pod/355d34f2-733a-11ea-9b47-0a58ac10a1a6 evicted
There are pending pods in node "ip-10-0-131-192.ec2.internal" when an error occurred: [error when evicting pod "baremetal-installer-build": pods "baremetal-installer-build" is forbidden: unable to create new content in namespace ci-op-0s7gqtnf
because it is being terminated, error when evicting pod "openshift-acme-exposer-build": pods "openshift-acme-exposer-build" is forbidden: unable to create new content in namespace ci-op-i7xcfyv9 because it is being terminated, error when evicting pod "kubevirt-test-build": pods "kubevirt-test-build" is forbidden: unable to create new content in namespace ci-op-rmyyz9vc because it is being terminated]

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x23425ef]

goroutine 1 [running]:
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/drain.(*podDeleteList).Pods(0x0, 0xc00000e010, 0x3828790, 0x3d)
        /go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/drain/filters.go:49 +0x4f
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain.(*DrainCmdOptions).deleteOrEvictPodsSimple(0xc0010447e0, 0xc000630380, 0x0, 0x0)
        /go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain/drain.go:335 +0x219
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain.(*DrainCmdOptions).RunDrain(0xc0010447e0, 0x0, 0x3960c88)
        /go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain/drain.go:293 +0x5bd
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain.NewCmdDrain.func1(0xc0011aec80, 0xc0012cce00, 0x1, 0x4)
        /go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain/drain.go:185 +0xa0
github.com/openshift/oc/vendor/github.com/spf13/cobra.(*Command).execute(0xc0011aec80, 0xc0012ccdc0, 0x4, 0x4, 0xc0011aec80, 0xc0012ccdc0)
        /go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:830 +0x2ae
github.com/openshift/oc/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc00100db80, 0x2, 0xc00100db80, 0x2)
        /go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:914 +0x2fc
github.com/openshift/oc/vendor/github.com/spf13/cobra.(*Command).Execute(...)
        /go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:864
main.main()
        /go/src/github.com/openshift/oc/cmd/oc/oc.go:107 +0x835
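The SIGSEGV above has the usual Go nil-receiver shape: the first frame shows (*podDeleteList).Pods being called with a 0x0 receiver and then dereferencing it. A minimal standalone sketch of that failure pattern (the type and field names below are stand-ins, not the actual kubectl drain code):

    package main

    import "fmt"

    // podDeleteList is a stand-in for the kubectl drain type of the same name;
    // the real definition lives in k8s.io/kubectl/pkg/drain and is not copied here.
    type podDeleteList struct {
            items []string
    }

    // Pods dereferences the receiver. Calling it on a nil *podDeleteList panics
    // with "invalid memory address or nil pointer dereference", matching the
    // first frame of the stack trace above.
    func (l *podDeleteList) Pods() []string {
            return l.items
    }

    func main() {
            var l *podDeleteList // nil, e.g. because an earlier error path produced no list
            fmt.Println(l.Pods())
    }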
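The "unable to create new content in namespace ... because it is being terminated" messages are Forbidden errors returned by the pod eviction subresource, which is what both the cluster autoscaler and `oc adm drain` call. A rough client-go sketch of that call and error check; the namespace and pod names are just the ones from this report, and the Evict signature shown is the one in recent client-go releases (older releases take no context argument):

    package main

    import (
            "context"
            "fmt"
            "os"

            policyv1beta1 "k8s.io/api/policy/v1beta1"
            apierrors "k8s.io/apimachinery/pkg/api/errors"
            metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
            "k8s.io/client-go/kubernetes"
            "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
            ns, pod := "ci-op-0s7gqtnf", "baremetal-installer-build"

            cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
            if err != nil {
                    panic(err)
            }
            client := kubernetes.NewForConfigOrDie(cfg)

            // The eviction subresource creates a new object in the pod's namespace,
            // which the apiserver refuses while that namespace is terminating.
            eviction := &policyv1beta1.Eviction{
                    ObjectMeta: metav1.ObjectMeta{Name: pod, Namespace: ns},
            }
            err = client.CoreV1().Pods(ns).Evict(context.TODO(), eviction)
            switch {
            case err == nil:
                    fmt.Println("eviction accepted")
            case apierrors.IsForbidden(err):
                    fmt.Println("forbidden, likely a terminating namespace:", err)
            default:
                    fmt.Println("eviction failed:", err)
            }
    }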
--- Additional comment from Hongkai Liu on 2020-04-02 00:30:07 CEST ---

oc adm must-gather --dest-dir='./must-gather'
[must-gather ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:97ea12139f980154850164233b34c8eb4622823bd6dbb8e7772f873cb157f221
[must-gather ] OUT namespace/openshift-must-gather-zvdnt created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-f2282 created
[must-gather ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:97ea12139f980154850164233b34c8eb4622823bd6dbb8e7772f873cb157f221 created
[must-gather-zqwrb] POD Wrote inspect data to must-gather.
[must-gather-zqwrb] POD Gathering data for ns/openshift-cluster-version...
[must-gather-zqwrb] POD Wrote inspect data to must-gather.
[must-gather-zqwrb] POD Gathering data for ns/openshift-config...
[must-gather-zqwrb] POD Gathering data for ns/openshift-config-managed...
[must-gather-zqwrb] POD Gathering data for ns/openshift-authentication...
[must-gather-zqwrb] POD Gathering data for ns/openshift-authentication-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-ingress...
[must-gather-zqwrb] POD Gathering data for ns/openshift-cloud-credential-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-machine-api...
[must-gather-zqwrb] POD Gathering data for ns/openshift-console-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-console...
[must-gather-zqwrb] POD Gathering data for ns/openshift-dns-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-dns...
[must-gather-zqwrb] POD Gathering data for ns/openshift-image-registry...
[must-gather-zqwrb] OUT waiting for gather to complete
[must-gather-zqwrb] OUT gather never finished: timed out waiting for the condition
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-f2282 deleted
[must-gather ] OUT namespace/openshift-must-gather-zvdnt deleted
error: gather never finished for pod must-gather-zqwrb: timed out waiting for the condition

--- Additional comment from Hongkai Liu on 2020-04-02 01:37:51 CEST ---

https://coreos.slack.com/archives/CFDM5CQMN/p1585781207023200

--- Additional comment from Maciej Szulik on 2020-04-03 09:00:05 UTC ---

I split this part out of https://bugzilla.redhat.com/show_bug.cgi?id=1819954 because there are two problems: one is oc panicking, which is handled in the other BZ; the other is the terminating pods, and the latter is still very unclear. Hongkai, can you provide more concrete information about what failed and where? I'm asking for kubelet logs, which parts are failing, and what exactly is stuck; the description above is not enough to debug the problem.

--- Additional comment from Hongkai Liu on 2020-04-03 13:09:33 UTC ---

The cluster-autoscaler logs showed that it failed to scale down the cluster because it could not drain the nodes. I then ran `oc adm drain <node>` manually and it failed with similar output, e.g.

There are pending pods in node "ip-10-0-131-192.ec2.internal" when an error occurred: [error when evicting pod "baremetal-installer-build": pods "baremetal-installer-build" is forbidden: unable to create new content in namespace ci-op-0s7gqtnf because it is being terminated, error when evicting pod "openshift-acme-exposer-build": pods "openshift-acme-exposer-build" is forbidden: unable to create new content in namespace ci-op-i7xcfyv9 because it is being terminated, error when evicting pod "kubevirt-test-build": pods "kubevirt-test-build" is forbidden: unable to create new content in namespace ci-op-rmyyz9vc because it is being terminated]

and then it panicked after some time (see above).

I also tried `oc adm must-gather` because I thought you might want that data for debugging later, and it failed too. I collected the node logs with `oc adm node-logs` (only for ip-10-0-131-192.ec2.internal) and will upload them later.

Yesterday we deleted the nodes that the autoscaler failed to scale down. The cluster is our production CI infrastructure and we had to fix it. must-gather works now (I guess it does not help much, because the broken nodes have been removed from the cluster).

--- Additional comment from Hongkai Liu on 2020-04-03 13:14:14 UTC ---

http://file.rdu.redhat.com/~hongkliu/test_result/bz1820507/ip-10-0-131-192.ec2.internal.log.zip

--- Additional comment from Hongkai Liu on 2020-04-03 19:59:58 UTC ---

The Node team is helping us with this issue.
https://coreos.slack.com/archives/CHY2E1BL4/p1585927741220600

--- Additional comment from Ryan Phillips on 2020-04-06 16:15:41 UTC ---

Backport PR: https://github.com/openshift/origin/pull/24841

Going to duplicate this to the backport BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1821341

--- Additional comment from Ryan Phillips on 2020-04-08 15:34:30 UTC ---

Going to reopen... https://github.com/elastic/cloud-on-k8s/pull/1716
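For anyone chasing the "what exactly is stuck" question above: the pods that `oc get pods` shows as Terminating are simply pods that still exist in the API but carry a deletion timestamp. A small client-go sketch that enumerates them together with their nodes, which is roughly the set the autoscaler and drain keep tripping over (assumes a recent client-go and a KUBECONFIG pointing at the affected cluster):

    package main

    import (
            "context"
            "fmt"
            "os"

            metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
            "k8s.io/client-go/kubernetes"
            "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
            cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
            if err != nil {
                    panic(err)
            }
            client := kubernetes.NewForConfigOrDie(cfg)

            // List pods in all namespaces; a pod that is still listed but has a
            // deletion timestamp is what `oc get pods` displays as Terminating.
            pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
            if err != nil {
                    panic(err)
            }
            for _, p := range pods.Items {
                    if p.DeletionTimestamp != nil {
                            fmt.Printf("%s/%s on node %s, deletion requested at %s\n",
                                    p.Namespace, p.Name, p.Spec.NodeName, p.DeletionTimestamp.Format("2006-01-02 15:04:05"))
                    }
            }
    }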
This 4.3 bug depends on a 4.5 bug? https://bugzilla.redhat.com/show_bug.cgi?id=1819906

Wouldn't that bug have to be backported to 4.3 first, and this bug then depend on the backport?
Pod stuck in Terminating status; the error is:

Apr 29 09:50:53 qe-jia-nfsjd-w-a-l-0 hyperkube[1319]: I0429 09:50:53.924652    1319 kubelet_pods.go:934] Pod "node-exporter-mzlvs_openshift-monitoring(0e69a9e8-89c9-11ea-a50f-42010a000004)" is terminated, but some volumes have not been cleaned up

# oc -n openshift-monitoring get pod -o wide | grep node-exporter-mzlvs | grep Terminating
node-exporter-mzlvs   0/2   Terminating   0   6h59m   10.0.32.5   qe-jia-nfsjd-w-a-l-0   <none>   <none>

# oc -n openshift-monitoring describe pod node-exporter-mzlvs
...
Containers:
  node-exporter:
    Container ID:  cri-o://8816d2321a354d07da3ac09b9003f4cdf28b5e890075cd41175aa9abae8c22f8
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9ca176cdb8e9925ac20d2935be5470f75b6ca21a23976b527300b8fdefdbee62
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9ca176cdb8e9925ac20d2935be5470f75b6ca21a23976b527300b8fdefdbee62
    Port:          <none>
    Host Port:     <none>
    Args:
      --web.listen-address=127.0.0.1:9100
      --path.procfs=/host/proc
      --path.sysfs=/host/sys
      --path.rootfs=/host/root
      --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
      --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
      --no-collector.wifi
      --collector.cpu.info
      --collector.textfile.directory=/var/node_exporter/textfile
    State:       Terminated
      Reason:    Error
      ..
      Exit Code: 143
      Started:   Tue, 28 Apr 2020 23:25:35 -0400
      Finished:  Wed, 29 Apr 2020 05:49:40 -0400
    Ready:       False

For more info, see the attached file.
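As a side note on the exit codes in these reports: container runtimes report a process killed by a signal as 128 plus the signal number, so the 137 seen earlier is SIGKILL (128+9) and the 143 here is SIGTERM (128+15). A tiny Go check of that arithmetic:

    package main

    import (
            "fmt"
            "syscall"
    )

    // signalFromExitCode follows the common 128+N convention that container
    // runtimes use to report a process killed by signal N.
    func signalFromExitCode(code int) (syscall.Signal, bool) {
            if code > 128 && code < 128+64 {
                    return syscall.Signal(code - 128), true
            }
            return 0, false
    }

    func main() {
            for _, code := range []int{137, 143} { // the two exit codes seen in this bug
                    if sig, ok := signalFromExitCode(code); ok {
                            fmt.Printf("exit code %d => signal %d (%v)\n", code, sig, sig)
                    }
            }
    }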
Created attachment 1682838 [details]
pod in Terminating status

Exit Code: 143

# oc debug node/qe-jia-nfsjd-w-a-l-0
sh-4.2# chroot /host
sh-4.2# crictl ps -a | grep node-exporter
no result
Also related to this bug series are the two post-fix mitigation bugs: bug 1829664 and bug 1829999.
Once this gets fixed in 4.3, we will probably pull all edges from 4.2 -> earlier 4.3 to keep folks from getting stuck nodes. Folks who update in the meantime and happen to get stuck nodes will be caught and walked through mitigation via the bug 1829999 backstop.
https://bugzilla.redhat.com/show_bug.cgi?id=1820507#c13 has the impact-statement request (on the masterward-tip of this bug series).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2006
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475