Bug 1908378

Summary: [sig-network] pods should successfully create sandboxes by getting pod - Static Pod Failures

| Field | Value |
|---|---|
| Product | OpenShift Container Platform |
| Component | Node |
| Sub component | Kubelet |
| Reporter | Tim Rozet <trozet> |
| Assignee | Elana Hashman <ehashman> |
| QA Contact | Weinan Liu <weinliu> |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | high |
| Version | 4.7 |
| Target Release | 4.8.0 |
| Keywords | Reopened |
| Flags | ehashman: needinfo- |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
| Clone Of | 1886922 |
| Cloned to | 1929674 (view as bug list) |
| Bug Depends On | 1886922 |
| Bug Blocks | 1919069 |
| Last Closed | 2021-07-27 22:35:07 UTC |
| CC | anbhat, anusaxen, aos-bugs, bbennett, deads, ecordell, ehashman, emoss, ffranz, fpaoline, gzaidman, jcallen, jchaloup, jerzhang, juzhao, kir, nagrawal, openshift-bugzilla-robot, rphillips, schoudha, tsweeney, weinliu, wking, xiyuan |

Environment (affected tests):

[sig-network] pods should successfully create sandboxes by getting pod
[sig-network] pods should successfully create sandboxes by other
[sig-network] pods should successfully create sandboxes by reading container
[sig-network] pods should successfully create sandboxes by writing network status

Doc Text:

Cause: Sometimes when pods are created and deleted rapidly, the pod sandbox does not have a chance to finish creation before the pod starts deletion.
Consequence: "CreatePodSandbox" returns an error.
Fix: We ignore this error if the pod is terminating.
Result: Pod termination no longer fails when CreatePodSandbox did not complete successfully during pod deletion.
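The Doc Text above describes the fix at a high level: a failed CreatePodSandbox is no longer treated as fatal when the pod is already terminating. The following is a minimal, hedged sketch of that decision in Go. It is not the actual kubelet change (see the linked PRs for that); the function and variable names are hypothetical.

```go
// Hedged sketch only: illustrates the behaviour described in the Doc Text,
// not the real kubelet code path. Names below are hypothetical.
package main

import (
	"errors"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// errCreatePodSandbox stands in for the sandbox-creation failure surfaced by
// the container runtime (e.g. a CNI "pods ... not found" error).
var errCreatePodSandbox = errors.New("CreatePodSandbox failed")

// handleSandboxError decides whether a sandbox-creation failure should fail
// the pod sync. If the pod has a deletion timestamp (it is terminating), the
// error is swallowed so termination can proceed.
func handleSandboxError(pod *v1.Pod, err error) error {
	if err == nil {
		return nil
	}
	if pod.DeletionTimestamp != nil {
		// The pod is being deleted; a half-created sandbox will be torn
		// down by normal termination anyway, so do not surface the error.
		fmt.Printf("ignoring sandbox error for terminating pod %s/%s: %v\n",
			pod.Namespace, pod.Name, err)
		return nil
	}
	return err
}

func main() {
	now := metav1.Now()
	terminating := &v1.Pod{ObjectMeta: metav1.ObjectMeta{
		Namespace:         "e2e-deployment-5578",
		Name:              "webserver-9569696c8-tgks7",
		DeletionTimestamp: &now,
	}}
	fmt.Println(handleSandboxError(terminating, errCreatePodSandbox)) // prints <nil>
}
```

The key signal in this sketch is the pod's deletionTimestamp: once it is set, surfacing the creation error only produces noisy FailedCreatePodSandBox events like the ones quoted in the comments below.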
Description

Tim Rozet, 2020-12-16 14:55:23 UTC

Seems to be cropping up again:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ocp-4.7-e2e-vsphere-upi/1347487017306427392
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.7/1347586669322178560
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.7/1347592202766782464

Build watcher: This bug is linked to the top four failing jobs for release 4.7, https://sippy.ci.openshift.org/?release=4.7#TopFailingTestsWithABug

On oVirt, the "[sig-network] pods should successfully create ..." tests are always failing; at least one of them fails each run. I know they are ignored in terms of whether the job passes, but it is still annoying. I see this is the only bug that is open, but do we know why? Is it a real problem that needs to be fixed, or just a problem in the test? For reference: https://search.ci.openshift.org/?search=pods+should+successfully+create+sandboxes+&maxAge=336h&context=1&type=junit&name=release-openshift-ocp-installer-e2e-ovirt-4.7&maxMatches=5&maxBytes=20971520&groupBy=job

Some of those errors from CI in the last comment look like this:

    ns/e2e-pods-4806 pod/pod-submit-status-2-0 node/ovirt12-xw2l5-worker-0-9z6pb - 21.03 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF

Tracking that at https://github.com/kubernetes/kubernetes/issues/98142 / https://bugzilla.redhat.com/show_bug.cgi?id=1915085. That is different than the errors linked in the initial report, fwiw, which look like:

    ns/e2e-deployment-5578 pod/webserver-9569696c8-tgks7 node/ci-op-2wxd6sh9-56515-7jz4t-worker-6xxnz - 21.99 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_webserver-9569696c8-tgks7_e2e-deployment-5578_dbb1018e-5a39-4710-94bb-b193ae9718ec_0(c407d51fd9da8c86c345bd42c067a222ccdd2513ac48ff9b03309e26bd743204): [e2e-deployment-5578/webserver-9569696c8-tgks7:openshift-sdn]: error adding container to network "openshift-sdn": CNI request failed with status 400: 'pods "webserver-9569696c8-tgks7" not found'

This is another test that involves a lot of pods being created and then deleted quickly. I suspect this is a manifestation of the race condition in bug 1915085. Will track there.

*** This bug has been marked as a duplicate of bug 1915085 ***

I see this was closed as a duplicate, but https://bugzilla.redhat.com/show_bug.cgi?id=1915085 is verified and CI is still failing (see for example https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-ovn-4.7/1361939592273465344). Reopening the bz.

Following comment #24, creating and deleting a static pod prior to start works without errors on 4.8.0-0.nightly-2021-03-22-104536. Should this be closed now?
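The failure pattern described above, many pods created and then deleted almost immediately, can be approximated outside the e2e suite. The following is a hedged, standalone client-go sketch and not the upstream test code; the namespace, pod count, sleep interval, and pause image are arbitrary choices for illustration.

```go
// Hedged reproducer sketch: rapidly create and delete pods, mimicking the
// e2e tests that trigger the sandbox-creation race. Not the upstream tests.
package main

import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	for i := 0; i < 20; i++ {
		name := fmt.Sprintf("sandbox-race-%d", i)
		pod := &v1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "default"},
			Spec: v1.PodSpec{
				Containers: []v1.Container{{Name: "pause", Image: "registry.k8s.io/pause:3.9"}},
			},
		}
		if _, err := client.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
		// Delete almost immediately, before the sandbox has a chance to come up.
		time.Sleep(100 * time.Millisecond)
		if err := client.CoreV1().Pods("default").Delete(ctx, name, metav1.DeleteOptions{}); err != nil {
			panic(err)
		}
	}
}
```

With the race present, node-level events for these pods show FailedCreatePodSandBox errors like the ones quoted above, because the pod object is gone from the API server before the CNI plugin finishes wiring up the sandbox.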
I see in the linked gcp-rt job that there is a long delay before the kube runtime tries to create the sandbox:

    Mar 30 09:14:49.322899 ci-op-6y6zi1g1-134e7-dc6xh-worker-d-whddz hyperkube[1554]: I0330 09:14:49.322891 1554 kubelet.go:1920] SyncLoop (ADD, "api"): "webserver-fc8b59899-w2tvg_e2e-deployment-427(d5e58a4b-c3ac-4bf4-b6b0-13e8cb4142bf)"
    Mar 30 09:15:09.297827 ci-op-6y6zi1g1-134e7-dc6xh-worker-d-whddz crio[1524]: time="2021-03-30 09:15:09.297415658Z" level=info msg="Running pod sandbox: e2e-deployment-427/webserver-fc8b59899-w2tvg/POD" id=2d15b70d-461a-46ec-8a99-ee0e9d9a6249
    Mar 30 09:15:09.574818 ci-op-6y6zi1g1-134e7-dc6xh-worker-d-whddz crio[1524]: time="2021-03-30 09:15:09.574201273Z" level=error msg="Error adding network: Multus: [e2e-deployment-427/webserver-fc8b59899-w2tvg]: error getting pod: pods \"webserver-fc8b59899-w2tvg\" not found"

The question is why it takes the kube runtime so long to start the sandbox. (A sketch of the failing pod lookup from the last log line appears at the end of this report.)

Ryan took a look at this with me and we think there is some type of network outage trying to reach the API service. Either way, the gcp-rt job looks pretty unstable with a lot of failures, and on the other gcp nightly job I don't see the failure. I think it's OK to close the bug for now and reopen if we see it again.

*** Bug 1927102 has been marked as a duplicate of this bug. ***

Checking some upgrade jobs between 4.7 and 4.8 today, I see a lot of failures of "[sig-network] pods should successfully create sandboxes by other", which links to this BZ. Recent examples:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-upgrade-from-stable-4.7-e2e-metal-ipi-upgrade/1377201785415929856
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1377172439389179904
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1377172439347236864
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1377218187266887680
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1377257801742553088
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1377233268348293120

Reopening with higher priority. Is this the same issue being observed?

Hi Yu Qi Zhang, I just looked at the CI runs linked and none of the pods affected are static pods. Updating the title to clarify.

Hi, thanks for clarifying. I see you've also moved this to verified. Is that associated with any PRs merging or a bump in upstream kube?

Hi Yu Qi, yes, see the linked PRs. This BZ was previously in VERIFIED when you moved it back to NEW; I returned it to its previous status. I've spun out a new bug 1948066 for Jerry's 4.7->4.8 update issue from comment 29.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
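For reference, the Multus error in the log excerpt above ("error getting pod: pods ... not found") is what a pod lookup returns once the pod has been deleted while its sandbox is still being set up. Below is a hedged illustration of that lookup pattern in Go using client-go; it is not Multus or openshift-sdn source code, and the function name is hypothetical.

```go
// Hedged sketch (not Multus/openshift-sdn source): shows why sandbox network
// setup fails with NotFound when the pod is deleted before setup completes.
package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// lookupPodForSandbox mimics the "fetch the pod before wiring its network"
// step that a CNI plugin performs during sandbox creation.
func lookupPodForSandbox(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	_, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// This is the situation in the logs above: the pod was deleted
		// between scheduling and sandbox setup, so setup cannot proceed.
		return fmt.Errorf("pod %s/%s deleted before sandbox setup finished: %w", ns, name, err)
	}
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	fmt.Println(lookupPodForSandbox(context.Background(), client, "e2e-deployment-427", "webserver-fc8b59899-w2tvg"))
}
```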