Description of problem:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.1/932

[sig-scheduling] Multi-AZ Clusters should spread the pods of a replication controller across zones [Suite:openshift/conformance/parallel] [Suite:k8s] (27s)

fail [k8s.io/kubernetes/test/e2e/scheduling/ubernetes_lite.go:170]: Pods were not evenly spread across zones. 1 in one zone and 4 in another zone
Expected
    <int>: 1
to be ~
    <int>: 4

Version-Release number of selected component (if applicable):
Shows up in a 4.1 job and master/4.2 PR jobs

How reproducible:
Also shows up in 6 PR jobs in the past 7 days - https://search.svc.ci.openshift.org/?search=Pods+were+not+evenly+spread+across+zones&maxAge=168h&context=2&type=all
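For reference, the failing assertion compares the minimum and maximum pod counts per zone and allows them to differ by at most 1 (the "to be ~" in the gomega output above). A minimal sketch of that check, assuming it mirrors the logic at ubernetes_lite.go:170 (function and zone names here are illustrative, not the exact upstream code):

```go
package main

import "fmt"

// checkZoneSpreading reports whether pods are evenly spread: the check
// passes only if the least-populated and most-populated zones differ by
// at most one pod (the tolerance in the gomega "to be ~" assertion).
func checkZoneSpreading(podsPerZone map[string]int) error {
	minPods, maxPods := int(^uint(0)>>1), 0
	for _, n := range podsPerZone {
		if n < minPods {
			minPods = n
		}
		if n > maxPods {
			maxPods = n
		}
	}
	if maxPods-minPods > 1 {
		return fmt.Errorf("Pods were not evenly spread across zones. %d in one zone and %d in another zone", minPods, maxPods)
	}
	return nil
}

func main() {
	// The failing run above: 1 pod in one zone, 4 in another (zone names illustrative).
	fmt.Println(checkZoneSpreading(map[string]int{"us-east-1a": 1, "us-east-1b": 4}))
}
```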
Note https://bugzilla.redhat.com//show_bug.cgi?id=1690620#c10, but I don't think it's related: 3 workers were scheduled in this job.
I looked into the 2 scenarios where this test failed.

In one case:
`{kubelet ip-10-0-145-251.ec2.internal} FailedCreatePodSandBox: Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_ubelite-spread-rc-e380ea24-ae02-11e9-9588-0a58ac108b83-sdf2k_multi-az-3160_e385bba7`

In the other case:
`Jul 25 05:47:07.243: INFO: At 2019-07-25 05:45:10 +0000 UTC - event for ubelite-spread-rc-5a533c3f-ae9f-11e9-8f0c-0a58ac101b8c-7prwn: {kubelet ip-10-0-133-82.ec2.internal} Killing: Stopping container ubelite-spread-rc-5a533c3f-ae9f-11e9-8f0c-0a58ac101b8c`

I think that by the time the test completes, the kubelet is killing pods, causing them to be re-created somewhere else; since spreading is a priority function rather than a hard predicate, there is no guarantee they'd land in different zones again.

Seth, I think sometimes it's because of a failed pod sandbox, and sometimes it's not clear why the kubelet is killing the pod. Can you help debug this?
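To illustrate why re-created pods may not spread again: zone spreading is a scoring (priority) function, not a filter, so a node in an already-crowded zone only scores lower and can still be chosen if other priorities dominate. A simplified, hypothetical scoring sketch, not the actual scheduler implementation:

```go
package main

import "fmt"

// scoreNodeByZoneSpread gives nodes in less-populated zones a higher
// score. Crucially, it never marks a node infeasible: a node in a
// crowded zone just scores lower, and can still win when other priority
// functions (e.g. least-requested resources) outweigh spreading.
func scoreNodeByZoneSpread(nodeZone string, podsPerZone map[string]int, maxScore int) int {
	maxCount := 0
	for _, n := range podsPerZone {
		if n > maxCount {
			maxCount = n
		}
	}
	if maxCount == 0 {
		return maxScore // no pods anywhere yet; all zones equally good
	}
	// Fewer pods in the node's zone -> higher score, scaled to [0, maxScore].
	return maxScore * (maxCount - podsPerZone[nodeZone]) / maxCount
}

func main() {
	podsPerZone := map[string]int{"us-east-1a": 3, "us-east-1b": 0}
	fmt.Println("zone a score:", scoreNodeByZoneSpread("us-east-1a", podsPerZone, 10)) // 0
	fmt.Println("zone b score:", scoreNodeByZoneSpread("us-east-1b", podsPerZone, 10)) // 10
}
```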
This hasn't happened in the last 4 days. Possibly an infra thing. Either way, the kubelet does not initiate the deletion of any pods except in the eviction path.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/2072/pull-ci-openshift-installer-master-e2e-aws/6778/

Jul 24 11:05:50.281: INFO: At 2019-07-24 11:05:04 +0000 UTC - event for ubelite-spread-rc-e380ea24-ae02-11e9-9588-0a58ac108b83-54vcs: {default-scheduler } Scheduled: Successfully assigned multi-az-3160/ubelite-spread-rc-e380ea24-ae02-11e9-9588-0a58ac108b83-54vcs to ip-10-0-138-26.ec2.internal
Jul 24 11:05:50.281: INFO: At 2019-07-24 11:05:04 +0000 UTC - event for ubelite-spread-rc-e380ea24-ae02-11e9-9588-0a58ac108b83-7rjt9: {default-scheduler } Scheduled: Successfully assigned multi-az-3160/ubelite-spread-rc-e380ea24-ae02-11e9-9588-0a58ac108b83-7rjt9 to ip-10-0-135-92.ec2.internal
Jul 24 11:05:50.281: INFO: At 2019-07-24 11:05:04 +0000 UTC - event for ubelite-spread-rc-e380ea24-ae02-11e9-9588-0a58ac108b83-ccntn: {default-scheduler } Scheduled: Successfully assigned multi-az-3160/ubelite-spread-rc-e380ea24-ae02-11e9-9588-0a58ac108b83-ccntn to ip-10-0-138-26.ec2.internal
Jul 24 11:05:50.281: INFO: At 2019-07-24 11:05:04 +0000 UTC - event for ubelite-spread-rc-e380ea24-ae02-11e9-9588-0a58ac108b83-kxv9f: {default-scheduler } Scheduled: Successfully assigned multi-az-3160/ubelite-spread-rc-e380ea24-ae02-11e9-9588-0a58ac108b83-kxv9f to ip-10-0-135-92.ec2.internal
Jul 24 11:05:50.281: INFO: At 2019-07-24 11:05:04 +0000 UTC - event for ubelite-spread-rc-e380ea24-ae02-11e9-9588-0a58ac108b83-sdf2k: {default-scheduler } Scheduled: Successfully assigned multi-az-3160/ubelite-spread-rc-e380ea24-ae02-11e9-9588-0a58ac108b83-sdf2k to ip-10-0-145-251.ec2.internal

The logs indicate the original scheduling decision placed {ip-10-0-138-26.ec2.internal, ip-10-0-135-92.ec2.internal} in one zone and {ip-10-0-145-251.ec2.internal} in another. This runs in the parallel test group, but it seems like it could be influenced by load on the nodes. Maybe it should be [Serial]?
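For context on the [Serial] suggestion: as far as I know, parallel e2e runs skip specs whose names contain the [Serial] tag, so serializing the test is a matter of tagging the spec name rather than changing any API. A small sketch of that name-based filtering (spec names below are illustrative):

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Hypothetical illustration: parallel CI suites typically skip specs whose
	// names match \[Serial\], so adding the tag moves the test out of the
	// parallel group where load from concurrent tests can skew spreading.
	skip := regexp.MustCompile(`\[Serial\]`)
	specs := []string{
		"[sig-scheduling] Multi-AZ Clusters should spread the pods of a replication controller across zones [Suite:openshift/conformance/parallel] [Suite:k8s]",
		"[sig-scheduling] Multi-AZ Clusters should spread the pods of a replication controller across zones [Serial]",
	}
	for _, s := range specs {
		fmt.Printf("skipped in parallel run: %v: %s\n", skip.MatchString(s), s)
	}
}
```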
I've only seen this happen a few times, over a single day almost 2 weeks ago; I'm going to lower the priority and move it out of 4.2.
*** This bug has been marked as a duplicate of bug 1760193 ***