Bug 1884697 - suddenly new failure: [k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container
Summary: suddenly new failure: [k8s.io] [sig-node] Pods Extended [k8s.io] Pod Containe...
Keywords:
Status: CLOSED DUPLICATE of bug 1884035
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 4.7.0
Assignee: Seth Jennings
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-10-02 17:31 UTC by David Eads
Modified: 2020-10-26 14:58 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
[k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container
Last Closed: 2020-10-08 14:57:07 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift kubernetes pull 385 0 None closed Bug 1884697: UPSTREAM: 95261: test: e2e: fix race in pods test 2021-02-17 09:24:46 UTC

Comment 1 Seth Jennings 2020-10-02 17:46:29 UTC
Looking at https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.6/1312045708018192384 for a non-RT baseline

$ grep status-1-5 test.log 
Oct  2 15:53:51.310: FAIL: timed out waiting for watch events for pod-submit-status-1-5
Oct  2 15:53:51.326: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-submit-status-1-5: { } Scheduled: Successfully assigned e2e-pods-9247/pod-submit-status-1-5 to ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd
Oct  2 15:53:51.327: INFO: At 2020-10-02 15:48:54 +0000 UTC - event for pod-submit-status-1-5: {multus } AddedInterface: Add eth0 [10.128.2.118/23]
Oct  2 15:53:51.327: INFO: At 2020-10-02 15:48:55 +0000 UTC - event for pod-submit-status-1-5: {kubelet ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct  2 15:53:51.344: INFO: pod-submit-status-1-5  ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd  Pending  1s     [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-10-02 15:48:51 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-10-02 15:48:51 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-10-02 15:48:51 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-10-02 15:48:51 +0000 UTC  }]
fail [k8s.io/kubernetes@v1.19.0/test/e2e/node/pods.go:323]: Oct  2 15:53:51.310: timed out waiting for watch events for pod-submit-status-1-5

pod-submit-status-1-5 seems to fail to start due to "error reading container (probably exited) json message: EOF"

$ grep EOF test.log 
Oct  2 15:53:51.326: INFO: At 2020-10-02 15:48:37 +0000 UTC - event for pod-submit-status-2-2: {kubelet ci-op-vyjjxg8x-95aca-pnr79-worker-b-pxbvj} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct  2 15:53:51.327: INFO: At 2020-10-02 15:48:55 +0000 UTC - event for pod-submit-status-1-5: {kubelet ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct  2 15:53:51.327: INFO: At 2020-10-02 15:49:10 +0000 UTC - event for pod-submit-status-0-6: {kubelet ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct  2 15:53:51.327: INFO: At 2020-10-02 15:49:47 +0000 UTC - event for pod-submit-status-0-13: {kubelet ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF

3 other pods in that same test experience the same error

$ grep -e status-2-2 -e status-1-5 -e status-0-6 -e status-0-13 test.log 
Oct  2 15:48:39.015: INFO: watch delete seen for pod-submit-status-2-2
Oct  2 15:48:39.015: INFO: Pod pod-submit-status-2-2 on node ci-op-vyjjxg8x-95aca-pnr79-worker-b-pxbvj timings total=7.015599937s t=694ms run=0s execute=0s
Oct  2 15:49:08.559: INFO: watch delete seen for pod-submit-status-0-6
Oct  2 15:49:08.559: INFO: Pod pod-submit-status-0-6 on node ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd timings total=7.595781401s t=371ms run=0s execute=0s
Oct  2 15:49:50.799: INFO: watch delete seen for pod-submit-status-0-13
Oct  2 15:49:50.799: INFO: Pod pod-submit-status-0-13 on node ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd timings total=7.416990452s t=991ms run=0s execute=0s
Oct  2 15:53:51.310: FAIL: timed out waiting for watch events for pod-submit-status-1-5

However, the other 3 have to expected watch delete.

Comment 2 Seth Jennings 2020-10-02 17:52:55 UTC
Similar situation with the mention gcp-rt run

$ grep pod-submit-status-2-10 test.log 
Oct  2 12:54:47.949: FAIL: timed out waiting for watch events for pod-submit-status-2-10
Oct  2 12:54:47.958: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-submit-status-2-10: { } Scheduled: Successfully assigned e2e-pods-4964/pod-submit-status-2-10 to ci-op-tbz0thi7-1a9c0-gkkhf-worker-c-vzd5b
Oct  2 12:54:47.958: INFO: At 2020-10-02 12:49:49 +0000 UTC - event for pod-submit-status-2-10: {multus } AddedInterface: Add eth0 [10.128.2.190/23]
Oct  2 12:54:47.958: INFO: At 2020-10-02 12:49:53 +0000 UTC - event for pod-submit-status-2-10: {kubelet ci-op-tbz0thi7-1a9c0-gkkhf-worker-c-vzd5b} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct  2 12:54:47.964: INFO: pod-submit-status-2-10  ci-op-tbz0thi7-1a9c0-gkkhf-worker-c-vzd5b  Pending  1s     [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-10-02 12:49:46 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-10-02 12:49:46 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-10-02 12:49:46 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-10-02 12:49:46 +0000 UTC  }]
fail [k8s.io/kubernetes@v1.19.0/test/e2e/node/pods.go:323]: Oct  2 12:54:47.949: timed out waiting for watch events for pod-submit-status-2-10

$ grep EOF test.log 
Oct  2 12:54:47.958: INFO: At 2020-10-02 12:48:42 +0000 UTC - event for pod-submit-status-0-3: {kubelet ci-op-tbz0thi7-1a9c0-gkkhf-worker-d-wzhpc} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct  2 12:54:47.958: INFO: At 2020-10-02 12:49:53 +0000 UTC - event for pod-submit-status-2-10: {kubelet ci-op-tbz0thi7-1a9c0-gkkhf-worker-c-vzd5b} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct  2 12:54:47.958: INFO: At 2020-10-02 12:50:24 +0000 UTC - event for pod-submit-status-0-13: {kubelet ci-op-tbz0thi7-1a9c0-gkkhf-worker-c-vzd5b} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF

Comment 3 Seth Jennings 2020-10-02 19:44:06 UTC
The onset of the issue is correlated with a bump in runc.  There were known issues with the bump fixed in

https://bugzilla.redhat.com/show_bug.cgi?id=1883640

There is also a race in the test itself resolved by this upstream PR

https://github.com/kubernetes/kubernetes/pull/95261

o/k pick

https://github.com/openshift/kubernetes/pull/385

Comment 4 Seth Jennings 2020-10-02 20:42:48 UTC
Leaving this in 4.6 over the weekend to observe cleaned up CI signal after race is removed.  I want to make sure the runc bump didn't do something bad.

If the test is still failing at a higher rate, we can try reverting the runc bump and just pick forward the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1883640

Comment 10 Seth Jennings 2020-10-08 14:57:07 UTC

*** This bug has been marked as a duplicate of bug 1884035 ***


Note You need to log in before you can comment on or make changes to this bug.