Bug 1820507 - pod/project stuck at terminating status: The container could not be located when the pod was terminated (Exit Code: 137) [NEEDINFO]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.5.0
Assignee: Mrunal Patel
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks: 1822268
 
Reported: 2020-04-03 08:56 UTC by Maciej Szulik
Modified: 2020-08-04 18:04 UTC (History)
10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1819954
: 1822268 (view as bug list)
Environment:
Last Closed: 2020-08-04 18:03:57 UTC
Target Upstream Version:
wking: needinfo? (mpatel)


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1664 0 None closed Bug 1820507: Add new crio.conf field to the template 2020-11-26 09:59:01 UTC
Red Hat Product Errata RHBA-2020:2409 0 None None None 2020-08-04 18:04:01 UTC

Description Maciej Szulik 2020-04-03 08:56:37 UTC
+++ This bug was initially created as a clone of Bug #1819954 +++

Description of problem:

oc get ns  | grep Terminating
ci-op-0kfbll3t                                          Terminating   15h
ci-op-0s7gqtnf                                          Terminating   27h
ci-op-30c903rw                                          Terminating   18h
ci-op-fpwy7vwm                                          Terminating   7d13h
ci-op-i7xcfyv9                                          Terminating   7d21h
ci-op-jripq40x                                          Terminating   7d13h
ci-op-lc136j9k                                          Terminating   18h
ci-op-m2qdij2v                                          Terminating   14h
ci-op-rmyyz9vc                                          Terminating   26h
ci-op-sw5540vd                                          Terminating   26h
ci-op-t59xf5b1                                          Terminating   7d13h
ci-op-vz3wn11c                                          Terminating   7d13h
ci-op-x7zyz5c5                                          Terminating   42h
ci-op-y1wt39w2                                          Terminating   7d18h
ci-op-z4fh5nin                                          Terminating   26h

oc get all -n ci-op-0kfbll3t  -o wide
NAME                  READY   STATUS        RESTARTS   AGE   IP       NODE                           NOMINATED NODE   READINESS GATES
pod/release-initial   0/2     Terminating   0          14h   <none>   ip-10-0-169-106.ec2.internal   <none>           <none>

#typical description of the terminating pods
oc describe pod/release-initial -n ci-op-0kfbll3t | grep "was terminated" -A3 -B3

    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000

As a side effect, the autoscaler failed to drain the nodes hosting those pods and thus cannot scale down the cluster.
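For reference (not part of the original report): before a fixed build is available, a common manual mitigation for pods stuck in Terminating is to force-delete them so namespace deletion can proceed. This is a sketch only; the pod and namespace names are taken from the output above.

```shell
# Force-delete the stuck pod (skips graceful termination).
# Use with care: any resources the kubelet still holds for the
# container are not cleaned up by this command.
oc delete pod release-initial -n ci-op-0kfbll3t --grace-period=0 --force

# Verify nothing else is keeping the namespace alive
oc get all -n ci-op-0kfbll3t
```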


Version-Release number of selected component (if applicable):
oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-03-23-130439   True        False         8d      Cluster version is 4.3.0-0.nightly-2020-03-23-130439

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:


Also tried:

oc describe pod -n ci 355d34f2-733a-11ea-9b47-0a58ac10a1a6 | grep "was terminated" -A3 -B3
      /tools/entrypoint
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000

oc get pod -n ci 355d34f2-733a-11ea-9b47-0a58ac10a1a6 -o wide
NAME                                   READY   STATUS        RESTARTS   AGE   IP             NODE                           NOMINATED NODE   READINESS GATES
355d34f2-733a-11ea-9b47-0a58ac10a1a6   0/2     Terminating   1          35h   10.128.64.76   ip-10-0-131-192.ec2.internal   <none>           <none>


oc adm drain "ip-10-0-131-192.ec2.internal"  --delete-local-data --ignore-daemonsets --force
node/ip-10-0-131-192.ec2.internal already cordoned
WARNING: deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: ci/355d34f2-733a-11ea-9b47-0a58ac10a1a6, ci/70f91686-7339-11ea-9b47-0a58ac10a1a6; ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-zmpqj, openshift-dns/dns-default-tfg2f, openshift-image-registry/node-ca-c8gbg, openshift-machine-config-operator/machine-config-daemon-p2j7l, openshift-monitoring/node-exporter-xhzfd, openshift-multus/multus-67xxk, openshift-sdn/ovs-z2zsd, openshift-sdn/sdn-2sqht
evicting pod "70f91686-7339-11ea-9b47-0a58ac10a1a6"
evicting pod "kubevirt-test-build"
evicting pod "openshift-acme-exposer-build"
evicting pod "baremetal-installer-build"
evicting pod "355d34f2-733a-11ea-9b47-0a58ac10a1a6"
pod/70f91686-7339-11ea-9b47-0a58ac10a1a6 evicted
pod/355d34f2-733a-11ea-9b47-0a58ac10a1a6 evicted
There are pending pods in node "ip-10-0-131-192.ec2.internal" when an error occurred: [error when evicting pod "baremetal-installer-build": pods "baremetal-installer-build" is forbidden: unable to create new content in namespace ci-op-0s7gqtnf because it is being terminated, error when evicting pod "openshift-acme-exposer-build": pods "openshift-acme-exposer-build" is forbidden: unable to create new content in namespace ci-op-i7xcfyv9 because it is being terminated, error when evicting pod "kubevirt-test-build": pods "kubevirt-test-build" is forbidden: unable to create new content in namespace ci-op-rmyyz9vc because it is being terminated]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x23425ef]

goroutine 1 [running]:
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/drain.(*podDeleteList).Pods(0x0, 0xc00000e010, 0x3828790, 0x3d)
	/go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/drain/filters.go:49 +0x4f
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain.(*DrainCmdOptions).deleteOrEvictPodsSimple(0xc0010447e0, 0xc000630380, 0x0, 0x0)
	/go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain/drain.go:335 +0x219
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain.(*DrainCmdOptions).RunDrain(0xc0010447e0, 0x0, 0x3960c88)
	/go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain/drain.go:293 +0x5bd
github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain.NewCmdDrain.func1(0xc0011aec80, 0xc0012cce00, 0x1, 0x4)
	/go/src/github.com/openshift/oc/vendor/k8s.io/kubectl/pkg/cmd/drain/drain.go:185 +0xa0
github.com/openshift/oc/vendor/github.com/spf13/cobra.(*Command).execute(0xc0011aec80, 0xc0012ccdc0, 0x4, 0x4, 0xc0011aec80, 0xc0012ccdc0)
	/go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:830 +0x2ae
github.com/openshift/oc/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc00100db80, 0x2, 0xc00100db80, 0x2)
	/go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:914 +0x2fc
github.com/openshift/oc/vendor/github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/oc/vendor/github.com/spf13/cobra/command.go:864
main.main()
	/go/src/github.com/openshift/oc/cmd/oc/oc.go:107 +0x835

--- Additional comment from Hongkai Liu on 2020-04-02 00:30:07 CEST ---

oc adm must-gather --dest-dir='./must-gather'
[must-gather      ] OUT Using must-gather plugin-in image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:97ea12139f980154850164233b34c8eb4622823bd6dbb8e7772f873cb157f221
[must-gather      ] OUT namespace/openshift-must-gather-zvdnt created
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-f2282 created
[must-gather      ] OUT pod for plug-in image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:97ea12139f980154850164233b34c8eb4622823bd6dbb8e7772f873cb157f221 created
[must-gather-zqwrb] POD Wrote inspect data to must-gather.
[must-gather-zqwrb] POD Gathering data for ns/openshift-cluster-version...
[must-gather-zqwrb] POD Wrote inspect data to must-gather.
[must-gather-zqwrb] POD Gathering data for ns/openshift-config...
[must-gather-zqwrb] POD Gathering data for ns/openshift-config-managed...
[must-gather-zqwrb] POD Gathering data for ns/openshift-authentication...
[must-gather-zqwrb] POD Gathering data for ns/openshift-authentication-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-ingress...
[must-gather-zqwrb] POD Gathering data for ns/openshift-cloud-credential-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-machine-api...
[must-gather-zqwrb] POD Gathering data for ns/openshift-console-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-console...
[must-gather-zqwrb] POD Gathering data for ns/openshift-dns-operator...
[must-gather-zqwrb] POD Gathering data for ns/openshift-dns...
[must-gather-zqwrb] POD Gathering data for ns/openshift-image-registry...
[must-gather-zqwrb] OUT waiting for gather to complete
[must-gather-zqwrb] OUT gather never finished: timed out waiting for the condition
[must-gather      ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-f2282 deleted
[must-gather      ] OUT namespace/openshift-must-gather-zvdnt deleted
error: gather never finished for pod must-gather-zqwrb: timed out waiting for the condition

--- Additional comment from Hongkai Liu on 2020-04-02 01:37:51 CEST ---

https://coreos.slack.com/archives/CFDM5CQMN/p1585781207023200

Comment 1 Maciej Szulik 2020-04-03 09:00:05 UTC
I split this part off from https://bugzilla.redhat.com/show_bug.cgi?id=1819954 because there
are two problems: one is oc panicking, which is handled in the other BZ; the other is
the terminating pods. The latter is still very unclear.

Hongkai, can you provide more concrete information about what failed and where?
I'm asking for kubelet logs showing which parts are failing and what is stuck; the above
description does not help in debugging the problem.

Comment 2 Hongkai Liu 2020-04-03 13:09:33 UTC
The logs of the clusterautoscaler showed that it failed to scale down the cluster because node drains were failing.
I then tried to run `oc adm drain <node>` manually and it failed with logs similar to:

There are pending pods in node "ip-10-0-131-192.ec2.internal" when an error occurred: [error when evicting pod "baremetal-installer-build": pods "baremetal-installer-build" is forbidden: unable to create new content in namespace ci-op-0s7gqtnf because it is being terminated, error when evicting pod "openshift-acme-exposer-build": pods "openshift-acme-exposer-build" is forbidden: unable to create new content in namespace ci-op-i7xcfyv9 because it is being terminated, error when evicting pod "kubevirt-test-build": pods "kubevirt-test-build" is forbidden: unable to create new content in namespace ci-op-rmyyz9vc because it is being terminated]


And then it panicked after some time (see above).

I also tried `oc adm must-gather` because I thought you might need me to collect the data later for debugging, but it failed too.
I collected the node log with `oc adm node-logs` (only for ip-10-0-131-192.ec2.internal).
I will upload it later.


Yesterday, we deleted the nodes which the autoscaler had failed to scale down.
The cluster is used as production CI infrastructure and we had to fix it.
must-gather works now (I guess it does not help much because the broken nodes have been removed from the cluster).
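As a diagnostic note (an assumption on my part: this relies on the namespace controller reporting deletion progress in status conditions, which Kubernetes 1.16 / OCP 4.3 does), the namespace object itself usually names what is blocking deletion. A sketch, using a namespace name from the listing above:

```shell
# Print the namespace's deletion conditions; the messages list any
# resources that still have content or pending finalizers.
oc get ns ci-op-0kfbll3t -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.message}{"\n"}{end}'

# List pods still present in the terminating namespace
oc get pod -n ci-op-0kfbll3t
```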

Comment 9 Ryan Phillips 2020-04-18 01:26:11 UTC
To clarify, the dentries are leaking on curl liveness calls. This is technically a kernel bug, but we are fixing it in crio by adding default_env support (https://github.com/cri-o/cri-o/pull/3611) and setting NSS_SDB_USE_CACHE=no within containers. If a container sets a different value, the container's value overrides crio's injected variable.

There are other fixes going into crio that correct the error handling in deferred functions:

https://github.com/cri-o/cri-o/pull/3600
https://github.com/cri-o/cri-o/pull/3608
https://github.com/cri-o/cri-o/pull/3597
https://github.com/cri-o/cri-o/pull/3592
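
The machine-config-operator pull request linked in the tracker above adds the corresponding field to the rendered crio.conf template. A minimal sketch of what the relevant fragment looks like (the field name follows CRI-O's default_env support; the exact rendered file layout may differ by release):

```toml
# Fragment of crio.conf -- illustration only, not the full rendered file
[crio.runtime]
# Environment variables CRI-O injects into every container.
# NSS_SDB_USE_CACHE=no works around the dentry leak triggered by
# curl-based liveness probes; a value set on the container itself
# takes precedence over this injected default.
default_env = [
    "NSS_SDB_USE_CACHE=no",
]
```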

Comment 12 W. Trevor King 2020-04-30 17:23:07 UTC
Also related to this bug series are the two post-fix mitigation bugs: bug 1829664 and bug 1829999.

Comment 13 W. Trevor King 2020-04-30 18:18:41 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to bug 1822269. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
  Example: Customers running 4.3 (under some conditions?) who try to drain nodes. Also customers running 4.4 before bug 1822268 landed (in some specific 4.4 RC?).  We don't (usually?) pull edges into RCs or for non-regressions, so this would probably be pulling edges from 4.2 -> 4.3 for 4.3 that do not carry the fix for this bug's 4.3 clone (bug 1822269).
What is the impact?  Is it serious enough to warrant blocking edges?
  Example: Nodes get stuck draining forever and possibly need manual intervention to unstick them.
How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
  Example: Updating to a fixed version keeps *new* nodes from getting stuck like this, but does not automatically resolve existing stuck nodes.  Bug 1829999 is about alerting admins impacted by this issue and pointing them at mitigation.

Also, if anyone has ideas about how we can think about the remediation question more actively so we aren't surprised by the "fix avoids lock for new attempts but does not resolve resources that are already locked up" weeks after diagnosing the issue and floating a PR, we'd love to improve that part of our bugfix/impact flow.

Comment 16 errata-xmlrpc 2020-08-04 18:03:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

