test: [sig-scheduling] Multi-AZ Clusters should spread the pods of a service across zones [Serial] is failing frequently in CI; see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-scheduling%5C%5D+Multi-AZ+Clusters+should+spread+the+pods+of+a+service+across+zones+%5C%5BSerial%5C%5D

Here are two particular jobs that are failing:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.6/1325615541506805760
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.6/1324165713664937984
Mike, you looked at this previously: https://bugzilla.redhat.com/show_bug.cgi?id=1806594
Saw this on a 4.5 job. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.5/1328647274242248704
Due to being occupied fixing higher-severity bugs, I was not able to address this. I'm adding UpcomingSprint to investigate it in a future sprint.
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. Additionally, you can add LifecycleFrozen into Keywords if you think this bug should never be marked as stale. Please consult with bug assignee before you do that.
This test is still flaking; removing LifecycleStale to investigate further.
The LifecycleStale keyword was removed because the needinfo? flag was reset and the bug got commented on recently. The bug assignee was notified.
I was unable to focus on this bug again this sprint, so I'm adding UpcomingSprint to look at it in the future. There is also an upstream issue for this test: https://github.com/kubernetes/kubernetes/issues/89178
Moving back to NEW - PR #525 is just skipping the test pending a fix.
Still see the test failing, so moving it back to assigned.
[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2362/pull-ci-openshift-machine-config-operator-master-e2e-aws-serial/1354846185298333696
[2] https://search.ci.openshift.org/?search=%5C%5Bsig-scheduling%5C%5D+Multi-AZ+Clusters+should+spread+the+pods+of+a+service+across+zones+%5C%5BSerial%5C%5D&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
The upstream PR for this is still undergoing review: https://github.com/kubernetes/kubernetes/pull/98583
Still see the case failing within the last 8 hours in link [1], so moving it back to assigned. One failing run:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/557/pull-ci-openshift-kubernetes-master-e2e-aws-serial/1359269500494548992
[1] https://search.ci.openshift.org/?search=%5C%5Bsig-scheduling%5C%5D+Multi-AZ+Clusters+should+spread+the+pods+of+a+service+across+zones+%5C%5BSerial%5C%5D&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
The recent failures are occurring when trying to create the balancing pod (balancing pods were just added to this test as part of the fix). The errors appear to be the same as in https://github.com/kubernetes/kubernetes/pull/98073 (not meeting the cri-o minimum memory limit).

The failure that you linked (for PR 557) is interesting, because at first it does appear to be a legitimate failure:
```
fail [k8s.io/kubernetes.0/test/e2e/scheduling/ubernetes_lite.go:190]: Pods were not evenly spread across zones. 3 in one zone and 7 in another zone
Expected
    <int>: 4
to be within 2 of ~
    <int>: 0
```
But looking at the logs, the pod distribution on the nodes was actually [3,3,4], which is within the limit:
```
STEP: Getting zone name for pod test-service-0, on node ip-10-0-182-156.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-1, on node ip-10-0-219-149.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-2, on node ip-10-0-145-212.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-3, on node ip-10-0-219-149.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-4, on node ip-10-0-182-156.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-5, on node ip-10-0-145-212.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-6, on node ip-10-0-219-149.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-7, on node ip-10-0-182-156.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-8, on node ip-10-0-145-212.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-9, on node ip-10-0-182-156.us-west-2.compute.internal
```
However, when I tried to check these nodes in the must-gather (attached), the node `ip-10-0-182-156` does not exist. This is very odd to me, and if the test was somehow looking at a non-existent node, I can see that being misinterpreted as some default. Since that node held the group of [4], grouping its pods with either of the other nodes would produce the [7,3] split reported in the failure.

I need to keep working on the kubernetes PR linked above to get approval, in order to make sure the minimum-memory-limit failures don't happen. In the meantime, @Rama please keep an eye out for more failures that match the "Expected / to be within 2 of" message. Those are the legitimate failures that we will need to be aware of, if more occur.
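For reference, the check behind this message is essentially a max-minus-min skew across zones with a tolerance of 2 (7 - 3 = 4, which is not within 2 of 0). A minimal Go sketch of that logic, for illustration only (this is not the upstream test code, and the zone names here are made up):
```
package main

import "fmt"

// zoneSkew returns the difference between the most and least populated zones.
func zoneSkew(podsPerZone map[string]int) int {
	first := true
	var min, max int
	for _, n := range podsPerZone {
		if first {
			min, max = n, n
			first = false
			continue
		}
		if n < min {
			min = n
		}
		if n > max {
			max = n
		}
	}
	return max - min
}

func main() {
	// The distribution reported by the failed run: 3 pods in one zone, 7 in another.
	counts := map[string]int{"zone-a": 3, "zone-b": 7}
	const tolerance = 2
	if skew := zoneSkew(counts); skew > tolerance {
		fmt.Printf("Pods were not evenly spread across zones: skew %d exceeds tolerance %d\n", skew, tolerance)
	}
}
```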
Created attachment 1756212: must-gather from failed CI run missing node
The relevant PRs I was referring to [1][2] have now merged, so moving this back to ON_QA to check for failures, per my comment above (https://bugzilla.redhat.com/show_bug.cgi?id=1896558#c14).
[1] https://github.com/openshift/kubernetes/pull/547
[2] https://github.com/openshift/kubernetes/pull/526
Sorry, this still needs a bump PR in origin
Switching this back to MODIFIED, as origin has since been bumped to include all the above changes
Hello Mike, I see that this test failed on a 4.8 cluster 34 hours ago; is this expected? Below is the link where I saw the failure:
[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.8/1394736589866799104
That does look like the same failure, but has it occurred in any more 4.8 runs? It's also odd that it looks like it's running from the v1.21-rc.0 tag, since we should be rebased onto the published 1.21 by now:
> fail [k8s.io/kubernetes.0-rc.0/test/e2e/scheduling/ubernetes_lite.go:190]: Pods were not evenly spread across zones.
(In reply to Mike Dame from comment #21)
> That does look like the same failure, but has it occurred in any more 4.8 runs?
AFAIS, that is the only run where it failed.
> It's also odd that it looks like it's running from the v1.21-rc.0 tag, since we should be rebased onto the published 1.21 by now:
> > fail [k8s.io/kubernetes.0-rc.0/test/e2e/scheduling/ubernetes_lite.go:190]: Pods were not evenly spread across zones.
Yup, you are right; not sure how that happened.
Moving the bug to assigned as per comment 21.
@knarra have there been any more failures since that one? I don't think it is relevant since it was on the `rc.0` version so if it's been clean otherwise then I think we can mark this resolved
Still see that the test is failing, so moving the bug back to the assigned state. Below are the links where the test fails:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-serial/1400395534132318208
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.8/1399645999227473920
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.8/1400578680282943488
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.8/1400619401945812992
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/622/pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-serial/1400840321608192000
Thanks for checking. The last 3 failures are timeouts coupled with many other test failures, so I think we can ignore those as infra flakes. The first 2 may or may not be valid, as they show some different behavior than what we have seen before. From the logs:

> Jun 3 10:59:08.874: INFO: ComputeCPUMemFraction for node: ip-10-0-181-127.ec2.internal
> Jun 3 10:59:08.874: INFO: Pod for on the node: service-ca-59c9bc945c-hb8fr, Cpu: 10, Mem: 125829120
> [the same service-ca pod line is repeated for every pod counted on this node]
> Jun 3 10:59:08.874: INFO: Node: ip-10-0-181-127.ec2.internal, totalRequestedCPUResource: 370, cpuAllocatableMil: 3500, cpuFraction: 0.10571428571428572
> Jun 3 10:59:08.874: INFO: Node: ip-10-0-181-127.ec2.internal, totalRequestedMemResource: 3502243840, memAllocatableVal: 15622287360, memFraction: 0.22418252585516388
> Jun 3 10:59:08.874: INFO: ComputeCPUMemFraction for node: ip-10-0-181-216.ec2.internal
> Jun 3 10:59:08.874: INFO: Pod for on the node: service-ca-59c9bc945c-hb8fr, Cpu: 10, Mem: 125829120
> [again, the same pod line repeated for every pod counted on this node]
> Jun 3 10:59:08.874: INFO: Node: ip-10-0-181-216.ec2.internal, totalRequestedCPUResource: 310, cpuAllocatableMil: 3500, cpuFraction: 0.08857142857142856
> Jun 3 10:59:08.874: INFO: Node: ip-10-0-181-216.ec2.internal, totalRequestedMemResource: 2747269120, memAllocatableVal: 15622287360, memFraction: 0.17585575381452975
> Jun 3 10:59:08.910: INFO: Waiting for running...
> Jun 3 10:59:13.995: INFO: Waiting for running...
> STEP: Compute Cpu, Mem Fraction after create balanced pods.
> Jun 3 10:59:19.164: INFO: ComputeCPUMemFraction for node: ip-10-0-181-127.ec2.internal
> Jun 3 10:59:19.164: INFO: Pod for on the node: service-ca-59c9bc945c-hb8fr, Cpu: 10, Mem: 125829120
> [same repeated pod line]
> Jun 3 10:59:19.164: INFO: Node: ip-10-0-181-127.ec2.internal, totalRequestedCPUResource: 380, cpuAllocatableMil: 3500, cpuFraction: 0.10857142857142857
> Jun 3 10:59:19.164: INFO: Node: ip-10-0-181-127.ec2.internal, totalRequestedMemResource: 3628072960, memAllocatableVal: 15622287360, memFraction: 0.23223698786193625
> STEP: Compute Cpu, Mem Fraction after create balanced pods.
> Jun 3 10:59:19.164: INFO: ComputeCPUMemFraction for node: ip-10-0-181-216.ec2.internal
> Jun 3 10:59:19.164: INFO: Pod for on the node: service-ca-59c9bc945c-hb8fr, Cpu: 10, Mem: 125829120
> [same repeated pod line]
> Jun 3 10:59:19.164: INFO: Node: ip-10-0-181-216.ec2.internal, totalRequestedCPUResource: 320, cpuAllocatableMil: 3500, cpuFraction: 0.09142857142857143
> Jun 3 10:59:19.164: INFO: Node: ip-10-0-181-216.ec2.internal, totalRequestedMemResource: 2873098240, memAllocatableVal: 15622287360, memFraction: 0.18391021582130213

This shows a few interesting things:
1) The memory calculation is only looking at the same single pod, over and over. Therefore the calculation is wrong, and the attempt at "balancing" is also wrong (you can see the fractions after balancing are different). I wonder if this could be related to the recent upstream change where we create the balanced pods in parallel, but I am not so sure (https://github.com/kubernetes/kubernetes/pull/102138).
2) There are only 2 nodes.

Point 2 is important because, further in the test logs, we see:

> Jun 3 10:59:19.542: INFO: Waiting for running...
> STEP: Getting zone name for pod test-service-0, on node ip-10-0-181-216.ec2.internal
> STEP: Getting zone name for pod test-service-1, on node ip-10-0-181-127.ec2.internal
> STEP: Getting zone name for pod test-service-2, on node ip-10-0-181-216.ec2.internal
> STEP: Getting zone name for pod test-service-3, on node ip-10-0-181-127.ec2.internal
> STEP: Getting zone name for pod test-service-4, on node ip-10-0-181-216.ec2.internal
> STEP: Getting zone name for pod test-service-5, on node ip-10-0-181-127.ec2.internal
> STEP: Getting zone name for pod test-service-6, on node ip-10-0-181-216.ec2.internal
> STEP: Getting zone name for pod test-service-7, on node ip-10-0-181-127.ec2.internal
> STEP: Getting zone name for pod test-service-8, on node ip-10-0-181-216.ec2.internal
> STEP: Getting zone name for pod test-service-9, on node ip-10-0-181-127.ec2.internal

This indicates that the pods are actually evenly spread between the 2 nodes (5 on each). However, the pods are all "Pending". The test then fails with:

> fail [k8s.io/kubernetes.1/test/e2e/scheduling/ubernetes_lite.go:190]: Pods were not evenly spread across zones. 0 in one zone and 10 in another zone
> Expected
>     <int>: 10
> to be within 2 of ~
>     <int>: 0

which does not reflect the scheduling balance that we see above. Therefore, I am currently unconvinced that this is not also an infra flake, especially given that it has only shown up in 2 runs. When I have the time to investigate further I will, but in the meantime I think we should continue to monitor these tests to see if there are any real failures that show up. We can dismiss rare flakes like this.
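To make point 1 above concrete: the memFraction values printed in those logs are just a node's total requested memory divided by its allocatable memory, so counting the same pod over and over inflates the fraction (and therefore the sizing of the "balancing" pods the test creates). A rough Go sketch of that arithmetic, for illustration only (the podRequest type, memFraction helper, and repetition count are made up; this is not the upstream ComputeCPUMemFraction code):
```
package main

import "fmt"

type podRequest struct {
	name string
	mem  int64 // requested memory in bytes
}

// memFraction computes totalRequestedMem / allocatable, optionally counting
// each distinct pod only once.
func memFraction(pods []podRequest, allocatable int64, dedup bool) float64 {
	seen := map[string]bool{}
	var total int64
	for _, p := range pods {
		if dedup && seen[p.name] {
			continue
		}
		seen[p.name] = true
		total += p.mem
	}
	return float64(total) / float64(allocatable)
}

func main() {
	const allocatable int64 = 15622287360 // memAllocatableVal from the quoted log
	// The same service-ca pod repeated, as in the quoted log (Mem: 125829120);
	// the repetition count here is arbitrary, purely for illustration.
	var pods []podRequest
	for i := 0; i < 27; i++ {
		pods = append(pods, podRequest{name: "service-ca-59c9bc945c-hb8fr", mem: 125829120})
	}
	fmt.Printf("counted repeatedly: %.3f\n", memFraction(pods, allocatable, false))
	fmt.Printf("deduplicated:       %.3f\n", memFraction(pods, allocatable, true))
}
```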
@knarra are we still observing any failures for this test?
*** Bug 1925941 has been marked as a duplicate of this bug. ***
Due to higher-priority tasks I have not been able to resolve this issue in time. Moving to the next sprint.
Given that the issue is reported against 4.6 (we are releasing 4.10 and already working on 4.11, so we are only fixing 4.7+) and the comment in https://bugzilla.redhat.com/show_bug.cgi?id=1896558#c27, I am closing this issue as not fixed.