Description of problem:

release-openshift-origin-installer-e2e-aws-serial-4.1 has 1 test failing at https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.1/996

Monitor cluster while tests execute 1h8m13s

6 error level events were detected during this test run:

Jul 30 15:46:08.408 E ns/openshift-image-registry pod/image-registry-6b6c598bf8-pl8z9 node/ip-10-0-138-97.ec2.internal container=registry container exited with code 137 (Error):
Jul 30 15:46:08.438 E ns/openshift-image-registry pod/node-ca-xtp98 node/ip-10-0-138-97.ec2.internal container=node-ca container exited with code 137 (Error):
Jul 30 15:51:24.937 E ns/openshift-image-registry pod/node-ca-tfxx4 node/ip-10-0-138-97.ec2.internal container=node-ca container exited with code 137 (Error):
Jul 30 16:02:59.133 E ns/openshift-image-registry pod/node-ca-sjr2x node/ip-10-0-138-97.ec2.internal container=node-ca container exited with code 137 (Error):
Jul 30 16:04:44.456 E ns/openshift-image-registry pod/node-ca-kngb4 node/ip-10-0-138-97.ec2.internal container=node-ca container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Jul 30 16:17:28.797 E ns/openshift-image-registry pod/node-ca-k47lm node/ip-10-0-138-97.ec2.internal container=node-ca container exited with code 137 (Error):
I suspect the node-ca daemon script is not terminating gracefully.
Yes, we have a separate bug to handle termination gracefully. But I suspect ContainerStatusUnknown shouldn't appear even when the pod wasn't terminated gracefully.
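For context on the exit code: 137 is 128 + SIGKILL, i.e. the process ignored (or never saw) SIGTERM and was killed at the end of the termination grace period. A minimal Go sketch of what graceful handling looks like in general (illustrative only; node-ca itself is a shell script in the daemonset, so this is not its actual code):

// Minimal illustration of a long-running loop that exits cleanly on SIGTERM,
// so the kubelet never has to escalate to SIGKILL (exit code 137).
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, syscall.SIGINT)

	ticker := time.NewTicker(60 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			// periodic work, e.g. refreshing CA certificates on the host
		case sig := <-stop:
			log.Printf("received %s, exiting cleanly", sig)
			return // exit 0 before the grace period expires, so no SIGKILL / 137
		}
	}
}

Whatever the language, the point is the same: catch SIGTERM and exit 0 within terminationGracePeriodSeconds.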
Clayton added this container termination state 3 months ago: https://github.com/openshift/origin/pull/22833
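For reference, the state that PR introduces is the kubelet's fallback for when it cannot locate a container's real status at termination time. A rough sketch of its shape (package and function names here are illustrative, not the kubelet's actual identifiers, but the reason, message, and exit code are exactly what we see in the events above):

// Sketch of the synthetic terminated state reported when the container's
// real status cannot be found at pod termination.
package statusfallback

import v1 "k8s.io/api/core/v1"

// unknownTerminatedState shows roughly what the kubelet fills in; the
// reason, message, and exit code match the quoted CI events.
func unknownTerminatedState() v1.ContainerState {
	return v1.ContainerState{
		Terminated: &v1.ContainerStateTerminated{
			ExitCode: 137,
			Reason:   "ContainerStatusUnknown",
			Message:  "The container could not be located when the pod was terminated",
		},
	}
}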
I'm not sure how that test determines what counts as an "error level event".
More specific link than the full rebase PR: [1]. Upstream PR still open [2].

[1]: https://github.com/openshift/origin/commit/dab036ffe5eaad960c69e38d701cc53444588f8b
[2]: https://github.com/kubernetes/kubernetes/pull/77870
ContainerStatusUnknown is now showing up in 309 of the past 24h's failures (26% of all failing e2e jobs) [1]. For example, from [2]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/330/build-log.txt | sort | uniq | grep ContainerStatusUnknown
Dec 16 21:40:33.739 E ns/openshift-service-catalog-apiserver pod/apiserver-95czv node/ci-op-4gxdm-m-2.c.openshift-gce-devel-ci.internal container=apiserver container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Dec 16 21:40:33.767 E ns/openshift-service-catalog-apiserver pod/apiserver-mxvgx node/ci-op-4gxdm-m-0.c.openshift-gce-devel-ci.internal container=apiserver container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Dec 16 21:40:33.775 E ns/openshift-service-catalog-apiserver pod/apiserver-zphcg node/ci-op-4gxdm-m-1.c.openshift-gce-devel-ci.internal container=apiserver container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
...

There are quite a few of these over the past 24h, but the leading containers are:

$ curl -s 'https://search.svc.ci.openshift.org/search?maxAge=24h&context=0&search=ContainerStatusUnknown' | jq -r '. | to_entries[].value | to_entries[].value[].context[]' | sed -n 's|.*ns/\([^ ]*\) pod/\([^ -]*\).* container=\([^ ]*\) .*|ns=\1 pod=\2... container=\3|p' | sort | uniq -c | sort -n | tail
  34 ns=openshift-image-registry pod=node... container=node-ca
  38 ns=openshift-monitoring pod=prometheus... container=prometheus-operator
  39 ns=openshift-controller-manager pod=controller... container=controller-manager
  40 ns=openshift-authentication pod=oauth... container=oauth-openshift
  62 ns=openshift-operator-lifecycle-manager pod=packageserver... container=packageserver
 113 ns=openshift-image-registry pod=image... container=registry
 182 ns=openshift-marketplace pod=csctestlabel... container=csctestlabel
 272 ns=openshift-marketplace pod=samename... container=samename
1392 ns=openshift-service-catalog-apiserver pod=apiserver... container=apiserver
1481 ns=openshift-service-catalog-controller-manager pod=controller... container=controller-manager

[1]: https://search.svc.ci.openshift.org/chart?search=ContainerStatusUnknown
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.4/330
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_oc/218/pull-ci-openshift-oc-master-e2e-aws/736

Dec 18 17:14:33.383 E ns/openshift-service-catalog-apiserver pod/apiserver-sm9l5 node/ip-10-0-138-211.ec2.internal container=apiserver container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Dec 18 17:14:33.406 E ns/openshift-service-catalog-apiserver pod/apiserver-c9psm node/ip-10-0-132-253.ec2.internal container=apiserver container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Dec 18 17:14:33.406 E ns/openshift-service-catalog-apiserver pod/apiserver-4zvgj node/ip-10-0-157-32.ec2.internal container=apiserver container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Dec 18 17:15:08.322 E ns/openshift-service-catalog-controller-manager pod/controller-manager-bxzhl node/ip-10-0-132-253.ec2.internal container=controller-manager container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Dec 18 17:15:08.341 E ns/openshift-service-catalog-controller-manager pod/controller-manager-q2jk5 node/ip-10-0-157-32.ec2.internal container=controller-manager container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Dec 18 17:15:08.367 E ns/openshift-service-catalog-controller-manager pod/controller-manager-7rf92 node/ip-10-0-138-211.ec2.internal container=controller-manager container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Dec 18 17:20:40.042 E ns/openshift-marketplace pod/samename-bd4c5f67b-4brv5 node/ip-10-0-137-238.ec2.internal container=samename container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Dec 18 17:32:48.807 E ns/openshift-marketplace pod/opsrctestlabel-69f88cd7bb-xnksw node/ip-10-0-137-238.ec2.internal container=opsrctestlabel container exited with code 2 (Error):
Dec 18 17:40:32.700 E ns/default pod/recycler-for-nfs-dtb5w node/ip-10-0-137-238.ec2.internal pod failed (DeadlineExceeded): Pod was active on the node longer than the specified deadline
I am seeing this in "release-openshift-origin-installer-e2e-aws-upgrade" jobs for 4.2.x to 4.3 nightly runs.

Mar 01 20:41:55.039 W clusteroperator/image-registry changed Progressing to True: DeploymentNotCompleted: The deployment has not completed
Mar 01 20:41:55.046 W ns/openshift-image-registry pod/image-registry-6cdb76454b-4x42l node/ip-10-0-130-6.us-west-2.compute.internal graceful deletion within 30s
Mar 01 20:41:55.068 E ns/openshift-image-registry pod/image-registry-6cdb76454b-4x42l node/ip-10-0-130-6.us-west-2.compute.internal container=registry container exited with code 137 (ContainerStatusUnknown): The container could not be located when the pod was terminated
Mar 01 20:41:55.245 W clusteroperator/image-registry changed Progressing to False: Ready: The registry is ready
Mar 01 20:41:55.276 I ns/openshift-ingress-operator pod/ingress-operator-5c94cd878c-4mrrr Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:05c22ede2c7b9596374b5d1cda99f90f808e0a49f6e72651411ed0313b9f2dfb"
Mar 01 20:41:55.380 W ns/openshift-monitoring pod/prometheus-operator-d7684d9fd-pz9wj MountVolume.SetUp failed for volume "prometheus-operator-token-q8xhr" : couldn't propagate object cache: timed out waiting for the condition
To help with impact analysis with respect to over-the-air updates, we need to find answers to the following questions. It is fine if we do not answer some of these questions at this point in time, but we should try to get answers.

What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing this bug exhibit?
What kind of clusters are impacted because of the bug?
What cluster functionality is degraded while hitting the bug?
Does the upgrade complete?
What is the expected rate of the failure (%) for vulnerable clusters which attempt the update?
What is the observed rate of failure we see in CI?
Can this bug cause data loss? Data loss = API server data loss or CRD state information loss, etc.
Is it possible to recover the cluster from the bug?
Is recovery automatic without intervention? I.e. is the condition transient?
Is recovery possible with the only intervention being 'oc adm upgrade …' to a new release image with a fix?
Is there a manual workaround that exists to recover from the bug? What are the manual steps?
Approximate time estimation for fixing this bug?
Is this a regression? From which version does this regress?
I see similar errors, i.e. "container exited with code 2", in more jobs:

Feb 27 22:04:02.332 E ns/openshift-monitoring pod/openshift-state-metrics-c77c6dff8-2plcr node/ip-10-0-135-170.us-west-1.compute.internal container=openshift-state-metrics container exited with code 2 (Error):
I think this bug is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1810136. I am seeing context deadline exceeded in the log messages. I submitted a backport here: https://github.com/openshift/origin/pull/24631
After further review, I don't think my comment in #16 is accurate for these failures. Every failure case has to do with networking healthchecks, golang network crashes, etc. This is likely an SDN or runtime issue of some sort.
This appears to be a problem in the kubelet where containers are restarted while a pod is being terminated. It looks like it has been fixed in 4.5, and I think the cherry-pick PR https://github.com/openshift/origin/pull/24649 will bring 4.4 to where we are in 4.5. This change in helpers.go in particular will prevent a container from trying to restart once the pod has been marked for deletion: https://github.com/openshift/origin/pull/24649/files#diff-210061ac34ae8005d47572b5008741e2R65-R68
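That guard is essentially an early return keyed off the pod's deletion timestamp. A paraphrased sketch (simplified, not a verbatim copy of helpers.go; package and function names are illustrative):

// Paraphrase of the guard the linked change adds: once a pod has been marked
// for deletion, the kubelet should not try to restart its containers, which
// avoids the teardown race that produces ContainerStatusUnknown events.
package restartguard

import v1 "k8s.io/api/core/v1"

// shouldContainerBeRestarted mirrors the early return added by the fix; the
// pre-existing restart-policy and container-status checks are elided here.
func shouldContainerBeRestarted(pod *v1.Pod) bool {
	// Once a pod has been marked deleted, its containers should not be restarted.
	if pod.DeletionTimestamp != nil {
		return false
	}
	// ... remaining checks from the real helper would follow here ...
	return true
}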
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges.

Who is impacted?
  Customers upgrading from 4.2.99 to 4.3.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.2.z to 4.3.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non-standard admin activities
Is this a regression?
  No, it's always been like this; we just never noticed
  Yes, from 4.2 and 4.3.1

Additionally, specific to this bug: unless there are additional specific known fixes to eliminate problems that remain in the 4.4 and 4.5 codebases, I think this should be closed as a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1810652 and a clone of the 4.4 downstream bug created so that we can begin backporting to 4.3.z. I'd like to see these questions answered before we remove the UpgradeBlocker keyword, but to me this seems like it's fallen below the threshold which warrants being deemed a blocker.
Removing UpgradeBlocker from this older bug to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days