This test has gone from a 0.5% failure rate and 0.2% flake rate last week to an 8% failure rate and 10% flake rate this week:

https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bk8s%5C.io%5C%5D+%5C%5Bsig-node%5C%5D+Pods+Extended+%5C%5Bk8s%5C.io%5C%5D+Pod+Container+Status+should+never+report+success+for+a+pending+container

Realtime appears particularly vulnerable:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-gcp-rt-4.6&grid=old
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-rt-4.6/1311995948477452288
Looking at https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.6/1312045708018192384 for a non-RT baseline:

$ grep status-1-5 test.log
Oct 2 15:53:51.310: FAIL: timed out waiting for watch events for pod-submit-status-1-5
Oct 2 15:53:51.326: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-submit-status-1-5: { } Scheduled: Successfully assigned e2e-pods-9247/pod-submit-status-1-5 to ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd
Oct 2 15:53:51.327: INFO: At 2020-10-02 15:48:54 +0000 UTC - event for pod-submit-status-1-5: {multus } AddedInterface: Add eth0 [10.128.2.118/23]
Oct 2 15:53:51.327: INFO: At 2020-10-02 15:48:55 +0000 UTC - event for pod-submit-status-1-5: {kubelet ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct 2 15:53:51.344: INFO: pod-submit-status-1-5 ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd Pending 1s [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-10-02 15:48:51 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-10-02 15:48:51 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-10-02 15:48:51 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-10-02 15:48:51 +0000 UTC }]
fail [k8s.io/kubernetes.0/test/e2e/node/pods.go:323]: Oct 2 15:53:51.310: timed out waiting for watch events for pod-submit-status-1-5

pod-submit-status-1-5 seems to fail to start due to "error reading container (probably exited) json message: EOF".

$ grep EOF test.log
Oct 2 15:53:51.326: INFO: At 2020-10-02 15:48:37 +0000 UTC - event for pod-submit-status-2-2: {kubelet ci-op-vyjjxg8x-95aca-pnr79-worker-b-pxbvj} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct 2 15:53:51.327: INFO: At 2020-10-02 15:48:55 +0000 UTC - event for pod-submit-status-1-5: {kubelet ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct 2 15:53:51.327: INFO: At 2020-10-02 15:49:10 +0000 UTC - event for pod-submit-status-0-6: {kubelet ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct 2 15:53:51.327: INFO: At 2020-10-02 15:49:47 +0000 UTC - event for pod-submit-status-0-13: {kubelet ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF

3 other pods in that same test run hit the same error:

$ grep -e status-2-2 -e status-1-5 -e status-0-6 -e status-0-13 test.log
Oct 2 15:48:39.015: INFO: watch delete seen for pod-submit-status-2-2
Oct 2 15:48:39.015: INFO: Pod pod-submit-status-2-2 on node ci-op-vyjjxg8x-95aca-pnr79-worker-b-pxbvj timings total=7.015599937s t=694ms run=0s execute=0s
Oct 2 15:49:08.559: INFO: watch delete seen for pod-submit-status-0-6
Oct 2 15:49:08.559: INFO: Pod pod-submit-status-0-6 on node ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd timings total=7.595781401s t=371ms run=0s execute=0s
Oct 2 15:49:50.799: INFO: watch delete seen for pod-submit-status-0-13
Oct 2 15:49:50.799: INFO: Pod pod-submit-status-0-13 on node ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd timings total=7.416990452s t=991ms run=0s execute=0s
Oct 2 15:53:51.310: FAIL: timed out waiting for watch events for pod-submit-status-1-5

However, the other 3 did get the expected watch delete; only pod-submit-status-1-5 timed out.
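To repeat this cross-check on other runs without eyeballing the log, a rough shell sketch against a locally downloaded test.log (it assumes the pod-name pattern and log phrases shown above are stable, which is an assumption, not something I've verified across releases):

for pod in $(grep -ow 'pod-submit-status-[0-9]\+-[0-9]\+' test.log | sort -u); do
  # Count EOF sandbox-creation failures and "watch delete seen" events per pod.
  # A pod with an EOF event and zero watch deletes is the one the test times out on.
  eof=$(grep -w "$pod" test.log | grep -c 'FailedCreatePodSandBox.*EOF')
  del=$(grep -cw "watch delete seen for $pod" test.log)
  [ "$eof" -gt 0 ] && echo "$pod: sandbox EOF events=$eof, watch deletes=$del"
done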
Similar situation with the gcp-rt run mentioned above:

$ grep pod-submit-status-2-10 test.log
Oct 2 12:54:47.949: FAIL: timed out waiting for watch events for pod-submit-status-2-10
Oct 2 12:54:47.958: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-submit-status-2-10: { } Scheduled: Successfully assigned e2e-pods-4964/pod-submit-status-2-10 to ci-op-tbz0thi7-1a9c0-gkkhf-worker-c-vzd5b
Oct 2 12:54:47.958: INFO: At 2020-10-02 12:49:49 +0000 UTC - event for pod-submit-status-2-10: {multus } AddedInterface: Add eth0 [10.128.2.190/23]
Oct 2 12:54:47.958: INFO: At 2020-10-02 12:49:53 +0000 UTC - event for pod-submit-status-2-10: {kubelet ci-op-tbz0thi7-1a9c0-gkkhf-worker-c-vzd5b} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct 2 12:54:47.964: INFO: pod-submit-status-2-10 ci-op-tbz0thi7-1a9c0-gkkhf-worker-c-vzd5b Pending 1s [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-10-02 12:49:46 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-10-02 12:49:46 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-10-02 12:49:46 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-10-02 12:49:46 +0000 UTC }]
fail [k8s.io/kubernetes.0/test/e2e/node/pods.go:323]: Oct 2 12:54:47.949: timed out waiting for watch events for pod-submit-status-2-10

$ grep EOF test.log
Oct 2 12:54:47.958: INFO: At 2020-10-02 12:48:42 +0000 UTC - event for pod-submit-status-0-3: {kubelet ci-op-tbz0thi7-1a9c0-gkkhf-worker-d-wzhpc} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct 2 12:54:47.958: INFO: At 2020-10-02 12:49:53 +0000 UTC - event for pod-submit-status-2-10: {kubelet ci-op-tbz0thi7-1a9c0-gkkhf-worker-c-vzd5b} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct 2 12:54:47.958: INFO: At 2020-10-02 12:50:24 +0000 UTC - event for pod-submit-status-0-13: {kubelet ci-op-tbz0thi7-1a9c0-gkkhf-worker-c-vzd5b} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
The onset of the issue is correlated with a bump in runc. There were known issues with that bump, fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1883640

There is also a race in the test itself, resolved by this upstream PR: https://github.com/kubernetes/kubernetes/pull/95261
o/k pick: https://github.com/openshift/kubernetes/pull/385
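Before reading signal off a given nightly, it's worth confirming the payload actually carries the pick; something like this should do it (the release pullspec is a placeholder for whatever payload the job under inspection ran):

$ oc adm release info --commits <release-pullspec> | grep openshift/kubernetes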
Leaving this in 4.6 over the weekend to observe the cleaned-up CI signal once the race fix is in; I want to make sure the runc bump didn't do something bad. If the test is still failing at a higher rate, we can try reverting the runc bump and just picking forward the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1883640
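If we do go down the revert path, checking which runc build the workers actually end up with is cheap, before and after the change (node name is a placeholder):

$ oc debug node/<worker-node> -- chroot /host rpm -q runc
$ oc debug node/<worker-node> -- chroot /host runc --version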
*** This bug has been marked as a duplicate of bug 1884035 ***