Description of problem:

release-openshift-origin-installer-e2e-aws-serial-4.1 has 1 test failing at https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.1/996

Monitor cluster while tests execute 1h8m13s

6 error level events were detected during this test run:

Jul 30 15:46:08.408 E ns/openshift-image-registry pod/image-registry-6b6c598bf8-pl8z9 node/ip-10-0-138-97.ec2.internal container=registry container exited with code 137 (Error):
Jul 30 15:46:08.438 E ns/openshift-image-registry pod/node-ca-xtp98 node/ip-10-0-138-97.ec2.internal container=node-ca container exited with code 137 (Error):
Jul 30 15:51:24.937 E ns/openshift-image-registry pod/node-ca-tfxx4 node/ip-10-0-138-97.ec2.internal container=node-ca container exited with code 137 (Error):
Jul 30 16:02:59.133 E ns/openshift-image-registry pod/node-ca-sjr2x node/ip-10-0-138-97.ec2.internal container=node-ca container exited with code 137 (Error):
Jul 30 16:04:44.456 E ns/openshift-image-registry pod/node-ca-kngb4 node/ip-10-0-138-97.ec2.internal container=node-ca container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Jul 30 16:17:28.797 E ns/openshift-image-registry pod/node-ca-k47lm node/ip-10-0-138-97.ec2.internal container=node-ca container exited with code 137 (Error):
I suspect the node-ca daemon script is not terminating gracefully.
Yes, we have a separate bug to handle termination gracefully. But I suspect ContainerStatusUnknown shouldn't appear even when the pod wasn't terminated gracefully.
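For context on the exit code: 137 is 128 + SIGKILL, i.e. the process ignored (or never saw) SIGTERM and was killed at the end of the termination grace period. A minimal Go sketch of what graceful handling looks like in general (illustrative only; node-ca itself is a shell script in the daemonset, so this is not its actual code):

// Minimal illustration of a long-running loop that exits cleanly on SIGTERM,
// so the kubelet never has to escalate to SIGKILL (exit code 137).
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, syscall.SIGINT)

	ticker := time.NewTicker(60 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			// periodic work, e.g. refreshing CA certificates on the host
		case sig := <-stop:
			log.Printf("received %s, exiting cleanly", sig)
			return // exit 0 before the grace period expires, so no SIGKILL / 137
		}
	}
}

Whatever the language, the point is the same: catch SIGTERM and exit 0 within terminationGracePeriodSeconds.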
Clayton added this container termination state 3 months ago: https://github.com/openshift/origin/pull/22833
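For reference, the state that PR introduces is the kubelet's fallback for when it cannot locate a container's real status at termination time. A rough sketch of its shape (package and function names here are illustrative, not the kubelet's actual identifiers, but the reason, message, and exit code are exactly what we see in the events above):

// Sketch of the synthetic terminated state reported when the container's
// real status cannot be found at pod termination.
package statusfallback

import v1 "k8s.io/api/core/v1"

// unknownTerminatedState shows roughly what the kubelet fills in; the
// reason, message, and exit code match the quoted CI events.
func unknownTerminatedState() v1.ContainerState {
	return v1.ContainerState{
		Terminated: &v1.ContainerStateTerminated{
			ExitCode: 137,
			Reason:   "ContainerStatusUnknown",
			Message:  "The container could not be located when the pod was terminated",
		},
	}
}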
I'm not sure how that test determines what counts as an "error level event".
More specific link than the full rebase PR: [1]. Upstream PR still open [2].

[1]: https://github.com/openshift/origin/commit/dab036ffe5eaad960c69e38d701cc53444588f8b
[2]: https://github.com/kubernetes/kubernetes/pull/77870
ContainerStatusUnknown is now showing up in 309 of the past 24h's failures (26% of all failing e2e jobs) [1]. For example, from [2]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/330/build-log.txt | sort | uniq | grep ContainerStatusUnknown
Dec 16 21:40:33.739 E ns/openshift-service-catalog-apiserver pod/apiserver-95czv node/ci-op-4gxdm-m-2.c.openshift-gce-devel-ci.internal container=apiserver container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Dec 16 21:40:33.767 E ns/openshift-service-catalog-apiserver pod/apiserver-mxvgx node/ci-op-4gxdm-m-0.c.openshift-gce-devel-ci.internal container=apiserver container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Dec 16 21:40:33.775 E ns/openshift-service-catalog-apiserver pod/apiserver-zphcg node/ci-op-4gxdm-m-1.c.openshift-gce-devel-ci.internal container=apiserver container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
...

There are quite a few of these over the past 24h, but the leading containers are:

$ curl -s 'https://search.svc.ci.openshift.org/search?maxAge=24h&context=0&search=ContainerStatusUnknown' | jq -r '. | to_entries[].value | to_entries[].value[].context[]' | sed -n 's|.*ns/\([^ ]*\) pod/\([^ -]*\).* container=\([^ ]*\) .*|ns=\1 pod=\2... container=\3|p' | sort | uniq -c | sort -n | tail
  34 ns=openshift-image-registry pod=node... container=node-ca
  38 ns=openshift-monitoring pod=prometheus... container=prometheus-operator
  39 ns=openshift-controller-manager pod=controller... container=controller-manager
  40 ns=openshift-authentication pod=oauth... container=oauth-openshift
  62 ns=openshift-operator-lifecycle-manager pod=packageserver... container=packageserver
 113 ns=openshift-image-registry pod=image... container=registry
 182 ns=openshift-marketplace pod=csctestlabel... container=csctestlabel
 272 ns=openshift-marketplace pod=samename... container=samename
1392 ns=openshift-service-catalog-apiserver pod=apiserver... container=apiserver
1481 ns=openshift-service-catalog-controller-manager pod=controller... container=controller-manager

[1]: https://search.svc.ci.openshift.org/chart?search=ContainerStatusUnknown
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/330
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_oc/218/pull-ci-openshift-oc-master-e2e-aws/736

Dec 18 17:14:33.383 E ns/openshift-service-catalog-apiserver pod/apiserver-sm9l5 node/ip-10-0-138-211.ec2.internal container=apiserver container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Dec 18 17:14:33.406 E ns/openshift-service-catalog-apiserver pod/apiserver-c9psm node/ip-10-0-132-253.ec2.internal container=apiserver container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Dec 18 17:14:33.406 E ns/openshift-service-catalog-apiserver pod/apiserver-4zvgj node/ip-10-0-157-32.ec2.internal container=apiserver container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Dec 18 17:15:08.322 E ns/openshift-service-catalog-controller-manager pod/controller-manager-bxzhl node/ip-10-0-132-253.ec2.internal container=controller-manager container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Dec 18 17:15:08.341 E ns/openshift-service-catalog-controller-manager pod/controller-manager-q2jk5 node/ip-10-0-157-32.ec2.internal container=controller-manager container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Dec 18 17:15:08.367 E ns/openshift-service-catalog-controller-manager pod/controller-manager-7rf92 node/ip-10-0-138-211.ec2.internal container=controller-manager container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Dec 18 17:20:40.042 E ns/openshift-marketplace pod/samename-bd4c5f67b-4brv5 node/ip-10-0-137-238.ec2.internal container=samename container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Dec 18 17:32:48.807 E ns/openshift-marketplace pod/opsrctestlabel-69f88cd7bb-xnksw node/ip-10-0-137-238.ec2.internal container=opsrctestlabel container exited with code 2 (Error):
Dec 18 17:40:32.700 E ns/default pod/recycler-for-nfs-dtb5w node/ip-10-0-137-238.ec2.internal pod failed (DeadlineExceeded): Pod was active on the node longer than the specified deadline
I am seeing this in "release-openshift-origin-installer-e2e-aws-upgrade" jobs for 4.2.x to 4.3 nightly runs.

Mar 01 20:41:55.039 W clusteroperator/image-registry changed Progressing to True: DeploymentNotCompleted: The deployment has not completed
Mar 01 20:41:55.046 W ns/openshift-image-registry pod/image-registry-6cdb76454b-4x42l node/ip-10-0-130-6.us-west-2.compute.internal graceful deletion within 30s
Mar 01 20:41:55.068 E ns/openshift-image-registry pod/image-registry-6cdb76454b-4x42l node/ip-10-0-130-6.us-west-2.compute.internal container=registry container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Mar 01 20:41:55.245 W clusteroperator/image-registry changed Progressing to False: Ready: The registry is ready
Mar 01 20:41:55.276 I ns/openshift-ingress-operator pod/ingress-operator-5c94cd878c-4mrrr Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:05c22ede2c7b9596374b5d1cda99f90f808e0a49f6e72651411ed0313b9f2dfb"
Mar 01 20:41:55.380 W ns/openshift-monitoring pod/prometheus-operator-d7684d9fd-pz9wj MountVolume.SetUp failed for volume "prometheus-operator-token-q8xhr" : couldn't propagate object cache: timed out waiting for the condition
To help with impact analysis with respect to over-the-air updates, we need to find answers to the following questions. It is fine if we do not answer some of these questions at this point in time, but we should try to get answers.

What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing this bug exhibit?
What kind of clusters are impacted because of the bug?
What cluster functionality is degraded while hitting the bug?
Does the upgrade complete?
What is the expected rate of the failure (%) for vulnerable clusters which attempt the update?
What is the observed rate of failure we see in CI?
Can this bug cause data loss? Data loss = API server data loss or CRD state information loss, etc.
Is it possible to recover the cluster from the bug?
Is recovery automatic without intervention? I.e. is the condition transient?
Is recovery possible with the only intervention being 'oc adm upgrade …' to a new release image with a fix?
Is there a manual workaround that exists to recover from the bug? What are the manual steps?
Approximate time estimation for fixing this bug?
Is this a regression? From which version does this regress?
I see similar errors, i.e. "container exited with code 2", in more jobs:

Feb 27 22:04:02.332 E ns/openshift-monitoring pod/openshift-state-metrics-c77c6dff8-2plcr node/ip-10-0-135-170.us-west-1.compute.internal container=openshift-state-metrics container exited with code 2 (Error):
I think this bug is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1810136. I am seeing context deadline exceeded in the log messages. I submitted a backport here: https://github.com/openshift/origin/pull/24631
After further review, I don't think my comment in #16 is accurate for these failures. Every failure case has to do with networking healthchecks, golang network crashes, etc. This is likely an SDN or runtime issue of some sort.
This appears to be a problem in the kubelet where containers are restarted while a pod is being terminated. It looks like it has been fixed in 4.5, and I think the cherry-pick PR https://github.com/openshift/origin/pull/24649 will bring 4.4 to where we are in 4.5. This change in helpers.go in particular will prevent a container from trying to restart once the pod has been marked for deletion: https://github.com/openshift/origin/pull/24649/files#diff-210061ac34ae8005d47572b5008741e2R65-R68
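That guard is essentially an early return keyed off the pod's deletion timestamp. A paraphrased sketch (simplified, not a verbatim copy of helpers.go; package and function names are illustrative):

// Paraphrase of the guard the linked change adds: once a pod has been marked
// for deletion, the kubelet should not try to restart its containers, which
// avoids the teardown race that produces ContainerStatusUnknown events.
package restartguard

import v1 "k8s.io/api/core/v1"

// shouldContainerBeRestarted mirrors the early return added by the fix; the
// pre-existing restart-policy and container-status checks are elided here.
func shouldContainerBeRestarted(pod *v1.Pod) bool {
	// Once a pod has been marked deleted, its containers should not be restarted.
	if pod.DeletionTimestamp != nil {
		return false
	}
	// ... remaining checks from the real helper would follow here ...
	return true
}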
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
  Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non-standard admin activities
Is this a regression?
  No, it's always been like this; we just never noticed
  Yes, from 4.2 and 4.3.1

Additionally, specific to this bug: unless there are additional specific known fixes to eliminate problems that remain in the 4.4 and 4.5 codebases, I think this should be closed as a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1810652 and a clone of the 4.4 downstream bug created so that we can begin backporting to 4.3.z. I'd like to see these questions answered before we remove the UpgradeBlocker keyword, but to me this seems like it's fallen below the threshold which warrants being deemed a blocker.
Removing UpgradeBlocker from this older bug to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days