Bug 1822269 - [4.3] pod/project stuck at terminating status: The container could not be located when the pod was terminated (Exit Code: 137)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.3.z
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.3.z
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On: 1822268
Blocks:
 
Reported: 2020-04-08 15:58 UTC by Ryan Phillips
Modified: 2021-04-05 17:36 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1822268
Environment:
Last Closed: 2020-05-11 21:20:39 UTC
Target Upstream Version:
Embargoed:


Attachments
pod in Terminating status (11.92 KB, text/plain), 2020-04-29 10:50 UTC, Junqi Zhao


Links
- GitHub openshift/machine-config-operator pull 1668 (closed): [release-4.3] Bug 1822269: Add new crio.conf field to the template (last updated 2020-11-23 21:10:51 UTC)
- Red Hat Product Errata RHBA-2020:2006 (last updated 2020-05-11 21:20:51 UTC)

Description Ryan Phillips 2020-04-08 15:58:20 UTC
+++ This bug was initially created as a clone of Bug #1822268 +++

+++ This bug was initially created as a clone of Bug #1820507 +++

+++ This bug was initially created as a clone of Bug #1819954 +++

Description of problem:

oc get ns  | grep Terminating
ci-op-0kfbll3t                                          Terminating   15h
ci-op-0s7gqtnf                                          Terminating   27h
ci-op-30c903rw                                          Terminating   18h
ci-op-fpwy7vwm                                          Terminating   7d13h
ci-op-i7xcfyv9                                          Terminating   7d21h
ci-op-jripq40x                                          Terminating   7d13h
ci-op-lc136j9k                                          Terminating   18h
ci-op-m2qdij2v                                          Terminating   14h
ci-op-rmyyz9vc                                          Terminating   26h
ci-op-sw5540vd                                          Terminating   26h
ci-op-t59xf5b1                                          Terminating   7d13h
ci-op-vz3wn11c                                          Terminating   7d13h
ci-op-x7zyz5c5                                          Terminating   42h
ci-op-y1wt39w2                                          Terminating   7d18h
ci-op-z4fh5nin                                          Terminating   26h

oc get all -n ci-op-0kfbll3t  -o wide
NAME                  READY   STATUS        RESTARTS   AGE   IP       NODE                           NOMINATED NODE   READINESS GATES
pod/release-initial   0/2     Terminating   0          14h   <none>   ip-10-0-169-106.ec2.internal   <none>           <none>

# Typical description of one of the terminating pods:
oc describe pod/release-initial -n ci-op-0kfbll3t | grep "was terminated" -A3 -B3

    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000

As a side effect, the autoscaler fails to drain the nodes running those pods and thus cannot scale down the cluster.
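
For triage, one quick way to enumerate the affected pods across the whole cluster is to filter on the ContainerStatusUnknown termination reason. This is only a minimal sketch, assuming jq is available and using the standard PodStatus field paths; the namespace name is taken from the listing above:

# List pods whose containers were terminated with reason ContainerStatusUnknown
oc get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(any(.status.containerStatuses[]?; .state.terminated.reason == "ContainerStatusUnknown"))
      | "\(.metadata.namespace)/\(.metadata.name)"'

# Check why one of the namespaces is stuck Terminating (usually a
# NamespaceContentRemaining condition listing the leftover pods):
oc get ns ci-op-0kfbll3t -o jsonpath='{.status.conditions}'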


Version-Release number of selected component (if applicable):
oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-23-130439   True        False         8d      Cluster version is 4.3.0-0.nightly-2020-03-23-130439

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:


Also tried:

oc describe pod -n ci 355d34f2-733a-11ea-9b47-0a58ac10a1a6 | grep "was terminated" -A3 -B3
      /tools/entrypoint
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000

oc get pod -n ci 355d34f2-733a-11ea-9b47-0a58ac10a1a6 -o wide
NAME                                   READY   STATUS        RESTARTS   AGE   IP             NODE                           NOMINATED NODE   READINESS GATES
355d34f2-733a-11ea-9b47-0a58ac10a1a6   0/2     Terminating   1          35h   10.128.64.76   ip-10-0-131-192.ec2.internal   <none>           <none>


oc adm drain "ip-10-0-131-192.ec2.internal"  --delete-local-data --ignore-daemonsets --force
node/ip-10-0-131-192.ec2.internal already cordoned
WARNING: deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: ci/355d34f2-733a-11ea-9b47-0a58ac10a1a6, ci/70f91686-7339-11ea-9b47-0a58ac10a1a6; ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-zmpqj, openshift-dns/dns-default-tfg2f, openshift-image-registry/node-ca-c8gbg, openshift-machine-config-operator/machine-config-daemon-p2j7l, openshift-monitoring/node-exporter-xhzfd, openshift-multus/multus-67xxk, openshift-sdn/ovs-z2zsd, openshift-sdn/sdn-2sqht
evicting pod "70f91686-7339-11ea-9b47-0a58ac10a1a6"
evicting pod "kubevirt-test-build"
evicting pod "openshift-acme-exposer-build"
evicting pod "baremetal-installer-build"
evicting pod "355d34f2-733a-11ea-9b47-0a58ac10a1a6"
pod/70f91686-7339-11ea-9b47-0a58ac10a1a6 evicted
pod/355d34f2-733a-11ea-9b47-0a58ac10a1a6 evicted
There are pending pods in node "ip-10-0-131-192.ec2.internal" when an error occurred: [error when evicting pod "baremetal-installer-build": pods "baremetal-installer-build" is forbidden: unable to create new content in namespace ci-op-0s7gqtnf because it is being terminated, error when evicting pod "openshift-acme-exposer-build": pods "openshift-acme-exposer-build" is forbidden: unable to create new content in namespace ci-op-i7xcfyv9 because it is being terminated, error when evicting pod "kubevirt-test-build": pods "kubevirt-test-build" is forbidden: unable to create new content in namespace ci-op-rmyyz9vc because it is being terminated]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x23425ef]

goroutine 1 [running]:
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/drain.(*podDeleteList).Pods(0x0, 0xc00000e010, 0x3828790, 0x3d)
	/go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/drain/filters.go:49 +0x4f
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain.(*DrainCmdOptions).deleteOrEvictPodsSimple(0xc0010447e0, 0xc000630380, 0x0, 0x0)
	/go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain/drain.go:335 +0x219
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain.(*DrainCmdOptions).RunDrain(0xc0010447e0, 0x0, 0x3960c88)
	/go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain/drain.go:293 +0x5bd
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain.NewCmdDrain.func1(0xc0011aec80, 0xc0012cce00, 0x1, 0x4)
	/go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain/drain.go:185 +0xa0
github.com/openshift/oc/vendor/github.com/spf13/cobra.(*Command).execute(0xc0011aec80, 0xc0012ccdc0, 0x4, 0x4, 0xc0011aec80, 0xc0012ccdc0)
	/go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:830 +0x2ae
github.com/openshift/oc/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc00100db80, 0x2, 0xc00100db80, 0x2)
	/go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:914 +0x2fc
github.com/openshift/oc/vendor/github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:864
main.main()
	/go/src/github.com/openshift/oc/cmd/oc/oc.go:107 +0x835

--- Additional comment from Hongkai Liu on 2020-04-02 00:30:07 CEST ---

oc adm must-gather --dest-dir='./must-gather'
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:97ea12139f980154850164233b34c8eb4622823bd6dbb8e7772f873cb157f221
[must-gather      ] OUT namespace/openshift-must-gather-zvdnt created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-f2282 created
[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:97ea12139f980154850164233b34c8eb4622823bd6dbb8e7772f873cb157f221 created
[must-gather-zqwrb] POD Wrote inspect data to must-gather.
[must-gather-zqwrb] POD Gathering data for ns/openshift-cluster-version...
[must-gather-zqwrb] POD Wrote inspect data to must-gather.
[must-gather-zqwrb] POD Gathering data for ns/openshift-config...
[must-gather-zqwrb] POD Gathering data for ns/openshift-config-managed...
[must-gather-zqwrb] POD Gathering data for ns/openshift-authentication...
[must-gather-zqwrb] POD Gathering data for ns/openshift-authentication-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-ingress...
[must-gather-zqwrb] POD Gathering data for ns/openshift-cloud-credential-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-machine-api...
[must-gather-zqwrb] POD Gathering data for ns/openshift-console-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-console...
[must-gather-zqwrb] POD Gathering data for ns/openshift-dns-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-dns...
[must-gather-zqwrb] POD Gathering data for ns/openshift-image-registry...
[must-gather-zqwrb] OUT waiting for gather to complete
[must-gather-zqwrb] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-f2282 deleted
[must-gather      ] OUT namespace/openshift-must-gather-zvdnt deleted
error: gather never finished for pod must-gather-zqwrb: timed out waiting for the condition

--- Additional comment from Hongkai Liu on 2020-04-02 01:37:51 CEST ---

https://coreos.slack.com/archives/CFDM5CQMN/p1585781207023200

--- Additional comment from Maciej Szulik on 2020-04-03 09:00:05 UTC ---

I split this part from https://bugzilla.redhat.com/show_bug.cgi?id=1819954 because there are two problems: one is oc panicking, which is handled in the other BZ, and the other is the terminating pods. The latter is still very unclear.

Hongkai, can you provide more concrete information about what failed and where? I'm asking for kubelet logs: which parts are failing, and what exactly is stuck? The description above does not help with debugging the problem.

--- Additional comment from Hongkai Liu on 2020-04-03 13:09:33 UTC ---

The cluster-autoscaler logs showed that it failed to scale down the cluster because draining the nodes failed.
I then ran `oc adm drain <node>` manually and it failed with similar output:

There are pending pods in node "ip-10-0-131-192.ec2.internal" when an error occurred: [error when evicting pod "baremetal-installer-build": pods "baremetal-installer-build" is forbidden: unable to create new content in namespace ci-op-0s7gqtnf because it is being terminated, error when evicting pod "openshift-acme-exposer-build": pods "openshift-acme-exposer-build" is forbidden: unable to create new content in namespace ci-op-i7xcfyv9 because it is being terminated, error when evicting pod "kubevirt-test-build": pods "kubevirt-test-build" is forbidden: unable to create new content in namespace ci-op-rmyyz9vc because it is being terminated]


It then panicked after some time (see the stack trace above).

I tried `oc adm must-gather` because I thought you might need the data for debugging later, and that failed too.
I collected the node log with `oc adm node-logs` (only for ip-10-0-131-192.ec2.internal).
I will upload it later.


Yesterday we deleted the nodes that the autoscaler failed to scale down.
The cluster is our production CI infrastructure, so we had to fix it.
And now must-gather works (I guess it does not help much, because the broken nodes have been removed from the cluster).
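
As a hedged mitigation sketch only (not the fix that later shipped via the linked machine-config-operator PR and errata), the usual way to unblock drain and namespace deletion in this state is to capture the node's kubelet journal and then force-remove the stuck pod objects, using the node and pod names from the comments above:

# Collect the kubelet journal from the affected node for later analysis:
oc adm node-logs ip-10-0-131-192.ec2.internal -u kubelet > ip-10-0-131-192.kubelet.log

# Force-remove a stuck Terminating pod, but only after confirming on the node
# (e.g. with `crictl ps -a`) that no container for it is still running:
oc delete pod 355d34f2-733a-11ea-9b47-0a58ac10a1a6 -n ci --force --grace-period=0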

--- Additional comment from Hongkai Liu on 2020-04-03 13:14:14 UTC ---

http://file.rdu.redhat.com/~hongkliu/test_result/bz1820507/ip-10-0-131-192.ec2.internal.log.zip

--- Additional comment from Hongkai Liu on 2020-04-03 19:59:58 UTC ---

Node team is helping us with this issue.
https://coreos.slack.com/archives/CHY2E1BL4/p1585927741220600

--- Additional comment from Ryan Phillips on 2020-04-06 16:15:41 UTC ---

Backport PR: https://github.com/openshift/origin/pull/24841

Going to close this as a duplicate of the backport BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1821341

--- Additional comment from Ryan Phillips on 2020-04-08 15:34:30 UTC ---

Going to reopen... https://github.com/elastic/cloud-on-k8s/pull/1716

Comment 1 Kirsten Garrison 2020-04-23 18:35:02 UTC
This 4.3 bug depends on a 4.5 bug? https://bugzilla.redhat.com/show_bug.cgi?id=1819906

Wouldn't that bug have to be backported to 4.3, with this bug then depending on the backport?

Comment 4 Junqi Zhao 2020-04-29 10:49:07 UTC
Pod stuck in Terminating status; the error is:
Apr 29 09:50:53 qe-jia-nfsjd-w-a-l-0 hyperkube[1319]: I0429 09:50:53.924652    1319 kubelet_pods.go:934] Pod "node-exporter-mzlvs_openshift-monitoring(0e69a9e8-89c9-11ea-a50f-42010a000004)" is terminated, but some volumes have not been cleaned up

# oc -n openshift-monitoring get pod -o wide | grep node-exporter-mzlvs | grep Terminating
node-exporter-mzlvs                            0/2     Terminating   0          6h59m   10.0.32.5     qe-jia-nfsjd-w-a-l-0                             <none>           <none>

# oc -n openshift-monitoring describe pod node-exporter-mzlvs
...
Containers:
  node-exporter:
    Container ID:  cri-o://8816d2321a354d07da3ac09b9003f4cdf28b5e890075cd41175aa9abae8c22f8
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9ca176cdb8e9925ac20d2935be5470f75b6ca21a23976b527300b8fdefdbee62
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9ca176cdb8e9925ac20d2935be5470f75b6ca21a23976b527300b8fdefdbee62
    Port:          <none>
    Host Port:     <none>
    Args:
      --web.listen-address=127.0.0.1:9100
      --path.procfs=/host/proc
      --path.sysfs=/host/sys
      --path.rootfs=/host/root
      --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+)($|/)
      --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$
      --no-collector.wifi
      --collector.cpu.info
      --collector.textfile.directory=/var/node_exporter/textfile
    State:      Terminated
      Reason:   Error
..

      Exit Code:    143
      Started:      Tue, 28 Apr 2020 23:25:35 -0400
      Finished:     Wed, 29 Apr 2020 05:49:40 -0400
    Ready:          False

For more info, see the attached file.

Comment 5 Junqi Zhao 2020-04-29 10:50:32 UTC
Created attachment 1682838 [details]
pod in Terminating status

Exit Code:    143

# oc debug node/qe-jia-nfsjd-w-a-l-0
sh-4.2# chroot /host
sh-4.2#  crictl ps -a | grep node-exporter
(no output: no node-exporter container found on the node)
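
Since the kubelet log in comment 4 says some volumes have not been cleaned up, a natural follow-up check on the node is whether the kubelet still tracks volume directories or mounts for that pod. A minimal sketch, assuming the standard kubelet pod directory layout under /var/lib/kubelet (the pod UID comes from the log line in comment 4):

# From the same `oc debug node/qe-jia-nfsjd-w-a-l-0` + `chroot /host` shell:
POD_UID=0e69a9e8-89c9-11ea-a50f-42010a000004
ls /var/lib/kubelet/pods/${POD_UID}/volumes/   # volume directories the kubelet still tracks
findmnt -r | grep ${POD_UID}                   # mounts for the pod that were never unmounted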

Comment 7 W. Trevor King 2020-04-30 17:23:41 UTC
Also related to this bug series are the two post-fix mitigation bugs: bug 1829664 and bug 1829999.

Comment 8 W. Trevor King 2020-04-30 18:05:13 UTC
Once this gets fixed in 4.3, we will probably pull all edges from 4.2 -> earlier 4.3 to keep folks from getting stuck nodes.  Folks who update in the meantime and happen to get stuck nodes will be caught and walked through mitigation via the bug 1829999 backstop.

Comment 9 W. Trevor King 2020-04-30 18:20:49 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1820507#c13 has the impact-statement request (on the masterward-tip of this bug series).

Comment 14 errata-xmlrpc 2020-05-11 21:20:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2006

Comment 15 W. Trevor King 2021-04-05 17:36:32 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

