Bug 1884697

Summary: suddenly new failure: [k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container
Product: OpenShift Container Platform    Reporter: David Eads <deads>
Component: Node    Assignee: Seth Jennings <sjenning>
Status: CLOSED DUPLICATE    QA Contact: Sunil Choudhary <schoudha>
Severity: high    Docs Contact:
Priority: high
Version: 4.6    CC: aos-bugs, jokerman, nagrawal, sjenning
Target Milestone: ---
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:    Environment: [k8s.io] [sig-node] Pods Extended [k8s.io] Pod Container Status should never report success for a pending container
Last Closed: 2020-10-08 14:57:07 UTC    Type: Bug
Regression: ---    Mount Type: ---
Documentation: ---    CRM:
Verified Versions:    Category: ---
oVirt Team: ---    RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---    Target Upstream Version:
Embargoed:

Comment 1 Seth Jennings 2020-10-02 17:46:29 UTC
Looking at https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.6/1312045708018192384 for a non-RT baseline

$ grep status-1-5 test.log 
Oct  2 15:53:51.310: FAIL: timed out waiting for watch events for pod-submit-status-1-5
Oct  2 15:53:51.326: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-submit-status-1-5: { } Scheduled: Successfully assigned e2e-pods-9247/pod-submit-status-1-5 to ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd
Oct  2 15:53:51.327: INFO: At 2020-10-02 15:48:54 +0000 UTC - event for pod-submit-status-1-5: {multus } AddedInterface: Add eth0 [10.128.2.118/23]
Oct  2 15:53:51.327: INFO: At 2020-10-02 15:48:55 +0000 UTC - event for pod-submit-status-1-5: {kubelet ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct  2 15:53:51.344: INFO: pod-submit-status-1-5  ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd  Pending  1s     [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-10-02 15:48:51 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-10-02 15:48:51 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-10-02 15:48:51 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-10-02 15:48:51 +0000 UTC  }]
fail [k8s.io/kubernetes.0/test/e2e/node/pods.go:323]: Oct  2 15:53:51.310: timed out waiting for watch events for pod-submit-status-1-5

pod-submit-status-1-5 seems to fail to start due to "error reading container (probably exited) json message: EOF"
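For context, the test (test/e2e/node/pods.go) watches each pod it creates and fails if a container ever reports a successful exit while the pod is still Pending, or if the expected lifecycle events (ending in the pod's deletion) never arrive before the timeout, which is the "timed out waiting for watch events" failure above. A minimal sketch of that kind of check follows; it is illustrative only, not the upstream test code, and the package name, function name, and exact event handling are assumptions.

package e2esketch

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// waitNoSuccessWhilePending watches a single pod and returns an error if any
// container reports exit code 0 while the pod phase is still Pending, or if
// the watch does not observe the pod's deletion before the timeout.
func waitNoSuccessWhilePending(ctx context.Context, cs kubernetes.Interface, ns, name string, timeout time.Duration) error {
	w, err := cs.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{
		FieldSelector: "metadata.name=" + name,
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	deadline := time.After(timeout)
	for {
		select {
		case ev, ok := <-w.ResultChan():
			if !ok {
				return fmt.Errorf("watch channel closed for %s", name)
			}
			pod, ok := ev.Object.(*corev1.Pod)
			if !ok {
				continue
			}
			if ev.Type == watch.Deleted {
				// Pod finished its lifecycle and was removed; this is the
				// "watch delete seen" case in the log.
				return nil
			}
			if pod.Status.Phase == corev1.PodPending {
				for _, st := range pod.Status.ContainerStatuses {
					if st.State.Terminated != nil && st.State.Terminated.ExitCode == 0 {
						return fmt.Errorf("pod %s reported success while Pending", name)
					}
				}
			}
		case <-deadline:
			return fmt.Errorf("timed out waiting for watch events for %s", name)
		}
	}
}

In this run, pod-submit-status-1-5 never got past sandbox creation, so neither outcome was observed before the timeout.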

$ grep EOF test.log 
Oct  2 15:53:51.326: INFO: At 2020-10-02 15:48:37 +0000 UTC - event for pod-submit-status-2-2: {kubelet ci-op-vyjjxg8x-95aca-pnr79-worker-b-pxbvj} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct  2 15:53:51.327: INFO: At 2020-10-02 15:48:55 +0000 UTC - event for pod-submit-status-1-5: {kubelet ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct  2 15:53:51.327: INFO: At 2020-10-02 15:49:10 +0000 UTC - event for pod-submit-status-0-6: {kubelet ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct  2 15:53:51.327: INFO: At 2020-10-02 15:49:47 +0000 UTC - event for pod-submit-status-0-13: {kubelet ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF

Three other pods in the same test run hit the same error.

$ grep -e status-2-2 -e status-1-5 -e status-0-6 -e status-0-13 test.log 
Oct  2 15:48:39.015: INFO: watch delete seen for pod-submit-status-2-2
Oct  2 15:48:39.015: INFO: Pod pod-submit-status-2-2 on node ci-op-vyjjxg8x-95aca-pnr79-worker-b-pxbvj timings total=7.015599937s t=694ms run=0s execute=0s
Oct  2 15:49:08.559: INFO: watch delete seen for pod-submit-status-0-6
Oct  2 15:49:08.559: INFO: Pod pod-submit-status-0-6 on node ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd timings total=7.595781401s t=371ms run=0s execute=0s
Oct  2 15:49:50.799: INFO: watch delete seen for pod-submit-status-0-13
Oct  2 15:49:50.799: INFO: Pod pod-submit-status-0-13 on node ci-op-vyjjxg8x-95aca-pnr79-worker-d-9cptd timings total=7.416990452s t=991ms run=0s execute=0s
Oct  2 15:53:51.310: FAIL: timed out waiting for watch events for pod-submit-status-1-5

However, the other 3 do get the expected watch delete; only pod-submit-status-1-5 does not.
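The same cross-check can be done against the API directly: list the FailedCreatePodSandBox events in the e2e namespace and compare them with the pods whose watch delete was seen. A small illustrative helper, assuming the same client-go setup as the sketch above; the function name is hypothetical and the namespace value comes from the log.

package e2esketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// sandboxFailureEvents lists FailedCreatePodSandBox events in the given
// namespace, e.g. "e2e-pods-9247" from the log above.
func sandboxFailureEvents(ctx context.Context, cs kubernetes.Interface, ns string) ([]corev1.Event, error) {
	list, err := cs.CoreV1().Events(ns).List(ctx, metav1.ListOptions{
		FieldSelector: "reason=FailedCreatePodSandBox",
	})
	if err != nil {
		return nil, err
	}
	return list.Items, nil
}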

Comment 2 Seth Jennings 2020-10-02 17:52:55 UTC
Similar situation in the previously mentioned gcp-rt run:

$ grep pod-submit-status-2-10 test.log 
Oct  2 12:54:47.949: FAIL: timed out waiting for watch events for pod-submit-status-2-10
Oct  2 12:54:47.958: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for pod-submit-status-2-10: { } Scheduled: Successfully assigned e2e-pods-4964/pod-submit-status-2-10 to ci-op-tbz0thi7-1a9c0-gkkhf-worker-c-vzd5b
Oct  2 12:54:47.958: INFO: At 2020-10-02 12:49:49 +0000 UTC - event for pod-submit-status-2-10: {multus } AddedInterface: Add eth0 [10.128.2.190/23]
Oct  2 12:54:47.958: INFO: At 2020-10-02 12:49:53 +0000 UTC - event for pod-submit-status-2-10: {kubelet ci-op-tbz0thi7-1a9c0-gkkhf-worker-c-vzd5b} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct  2 12:54:47.964: INFO: pod-submit-status-2-10  ci-op-tbz0thi7-1a9c0-gkkhf-worker-c-vzd5b  Pending  1s     [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2020-10-02 12:49:46 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2020-10-02 12:49:46 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2020-10-02 12:49:46 +0000 UTC ContainersNotReady containers with unready status: [busybox]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2020-10-02 12:49:46 +0000 UTC  }]
fail [k8s.io/kubernetes.0/test/e2e/node/pods.go:323]: Oct  2 12:54:47.949: timed out waiting for watch events for pod-submit-status-2-10

$ grep EOF test.log 
Oct  2 12:54:47.958: INFO: At 2020-10-02 12:48:42 +0000 UTC - event for pod-submit-status-0-3: {kubelet ci-op-tbz0thi7-1a9c0-gkkhf-worker-d-wzhpc} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct  2 12:54:47.958: INFO: At 2020-10-02 12:49:53 +0000 UTC - event for pod-submit-status-2-10: {kubelet ci-op-tbz0thi7-1a9c0-gkkhf-worker-c-vzd5b} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF
Oct  2 12:54:47.958: INFO: At 2020-10-02 12:50:24 +0000 UTC - event for pod-submit-status-0-13: {kubelet ci-op-tbz0thi7-1a9c0-gkkhf-worker-c-vzd5b} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF

Comment 3 Seth Jennings 2020-10-02 19:44:06 UTC
The onset of the issue correlates with a bump in runc. There were known issues with that bump, which were fixed in

https://bugzilla.redhat.com/show_bug.cgi?id=1883640

There is also a race in the test itself, resolved by this upstream PR:

https://github.com/kubernetes/kubernetes/pull/95261

openshift/kubernetes (o/k) pick:

https://github.com/openshift/kubernetes/pull/385

Comment 4 Seth Jennings 2020-10-02 20:42:48 UTC
Leaving this in 4.6 over the weekend to observe the cleaned-up CI signal after the race is removed. I want to make sure the runc bump didn't break something.

If the test is still failing at a higher rate, we can try reverting the runc bump and just picking forward the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1883640

Comment 10 Seth Jennings 2020-10-08 14:57:07 UTC

*** This bug has been marked as a duplicate of bug 1884035 ***