Bug 1896558 - [sig-scheduling] Multi-AZ Clusters should spread the pods of a service across zones [Serial] [NEEDINFO]
Summary: [sig-scheduling] Multi-AZ Clusters should spread the pods of a service across...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Jan Chaloupka
QA Contact: RamaKasturi
URL:
Whiteboard: LifecycleReset tag-ci
Duplicates: 1925941
Depends On:
Blocks: 1925941 1929389
Reported: 2020-11-10 21:34 UTC by Patrick Dillon
Modified: 2022-03-07 14:54 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
[sig-scheduling] Multi-AZ Clusters should spread the pods of a service across zones [Serial]
Last Closed: 2022-03-07 14:54:07 UTC
Target Upstream Version:
mdame: needinfo? (knarra)


Attachments (Terms of Use)
must-gather from failed CI run missing node (12.90 MB, application/gzip)
2021-02-10 14:13 UTC, Mike Dame
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift kubernetes pull 525 0 None closed Bug 1896558: Skip multiaz e2e 2021-02-15 17:08:27 UTC
Github openshift kubernetes pull 545 0 None closed Bug 1896558: Revert undesired multi az skip 2021-02-15 17:08:26 UTC
Github openshift kubernetes pull 547 0 None closed Bug 1896558: Balance nodes in scheduling e2e 2021-02-15 17:08:26 UTC
Github openshift origin pull 25848 0 None closed Bug 1896558: bump(openshift/kubernetes): fix flaking multi-AZ test 2021-02-15 17:08:27 UTC

Comment 1 Maciej Szulik 2020-11-12 12:08:57 UTC
Mike you looked previously at this https://bugzilla.redhat.com/show_bug.cgi?id=1806594

Comment 3 Mike Dame 2020-12-04 18:14:27 UTC
Due to being occupied fixing higher-severity bugs, I was not able to address this. I'm adding UpcomingSprint to investigate it in a future sprint.

Comment 4 Michal Fojtik 2020-12-17 19:10:09 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority.

If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Additionally, you can add LifecycleFrozen to Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.

Comment 5 Mike Dame 2021-01-14 20:40:10 UTC
This test is still flaking; removing LifecycleStale to investigate further.

Comment 6 Michal Fojtik 2021-01-14 21:38:55 UTC
The LifecycleStale keyword was removed because the needinfo? flag was reset and the bug got commented on recently.
The bug assignee was notified.

Comment 7 Mike Dame 2021-01-17 02:23:50 UTC
I was unable to focus on this bug again this sprint, but adding upcomingsprint to look at it in the future. There is also an upstream issue for this test at https://github.com/kubernetes/kubernetes/issues/89178

Comment 8 Maru Newby 2021-01-18 15:12:37 UTC
Moving back to NEW - PR #525 is just skipping the test pending a fix.

Comment 11 Mike Dame 2021-02-04 18:09:53 UTC
The upstream PR for this is still undergoing review: https://github.com/kubernetes/kubernetes/pull/98583

Comment 14 Mike Dame 2021-02-10 14:12:26 UTC
The recent failures are occurring when trying to create the balancing pod (balancing pods were just added to this test as part of the fix). The errors appear to be the same as in https://github.com/kubernetes/kubernetes/pull/98073 (not meeting cri-o minimum memory limit)

The failure that you linked (for PR 557) is interesting, because at first it does appear to be a legitimate failure:
```
fail [k8s.io/kubernetes@v1.20.0/test/e2e/scheduling/ubernetes_lite.go:190]: Pods were not evenly spread across zones.  3 in one zone and 7 in another zone
Expected
    <int>: 4
to be within 2 of ~
    <int>: 0
```

But looking at the logs, the pod distribution on the nodes was actually [3,3,4] which is within the limit:
```
STEP: Getting zone name for pod test-service-0, on node ip-10-0-182-156.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-1, on node ip-10-0-219-149.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-2, on node ip-10-0-145-212.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-3, on node ip-10-0-219-149.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-4, on node ip-10-0-182-156.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-5, on node ip-10-0-145-212.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-6, on node ip-10-0-219-149.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-7, on node ip-10-0-182-156.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-8, on node ip-10-0-145-212.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-9, on node ip-10-0-182-156.us-west-2.compute.internal
```

However, when I tried to check these nodes in the must-gather (attached), the node `ip-10-0-182-156` does not exist. This is very odd to me: if the test was somehow looking up a non-existent node, I can see its zone being misread as some default value. Since that node held 4 pods, grouping its pods with either of the other two nodes would produce the [7,3] split reported in the failure.
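For reference, judging by the "to be within 2 of" Gomega-style message above, the test's assertion boils down to comparing the max and min pod counts per zone against a tolerance of 2. A minimal Python sketch of that check (the real logic lives in Go in `ubernetes_lite.go`; the function name here is illustrative, not the test's):

```python
from collections import Counter

def spread_diff(pod_zones):
    """Max-minus-min pod count across zones; the e2e test apparently
    requires this difference to be within 2 of 0."""
    counts = Counter(pod_zones)
    return max(counts.values()) - min(counts.values())

# The reported failure: 3 pods in one zone, 7 in another -> diff 4, fails
print(spread_diff(["us-west-2a"] * 3 + ["us-west-2b"] * 7))       # 4
# The per-node distribution actually logged, [3, 3, 4] -> diff 1, passes
print(spread_diff(["a"] * 3 + ["b"] * 3 + ["c"] * 4))             # 1
```

This is consistent with the theory above: if the 4 pods on the missing node were attributed to one of the other two zones, the counts become [7, 3] and the diff jumps to 4.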

I need to keep working on the kubernetes PR linked above to get approval, in order to make sure the minimum memory limit failures don't happen.

In the meantime, @Rama please keep an eye out for more failures that match the "Expected / to be within 2 of" message. These are the legitimate failures that we will need to be aware of, if more occur.

Comment 15 Mike Dame 2021-02-10 14:13:29 UTC
Created attachment 1756212 [details]
must-gather from failed CI run missing node

Comment 16 Mike Dame 2021-02-23 13:45:02 UTC
The relevant PRs I was referring to [1][2] have merged now, so moving this back to ON_QA to check for failures following my above comment: https://bugzilla.redhat.com/show_bug.cgi?id=1896558#c14

1. https://github.com/openshift/kubernetes/pull/547
2. https://github.com/openshift/kubernetes/pull/526

Comment 17 Mike Dame 2021-02-23 14:55:20 UTC
Sorry, this still needs a bump PR in origin

Comment 18 Mike Dame 2021-05-17 15:02:10 UTC
Switching this back to MODIFIED, as origin has since been bumped to include all the above changes

Comment 20 RamaKasturi 2021-05-20 07:24:04 UTC
Hello Mike,

   I see that this test failed on a 4.8 cluster 34 hours ago. Is this expected? Below is the link where I saw the failure:

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.8/1394736589866799104

Comment 21 Mike Dame 2021-05-20 14:08:29 UTC
That does look like the same failure, but has it occurred in any more 4.8 runs?

It's also odd that it looks like it's running from the v1.21-rc.0 tag, since we should be rebased onto the published 1.21 by now:
>fail [k8s.io/kubernetes@v1.21.0-rc.0/test/e2e/scheduling/ubernetes_lite.go:190]: Pods were not evenly spread across zones.

Comment 22 RamaKasturi 2021-05-20 15:58:17 UTC
(In reply to Mike Dame from comment #21)
> That does look like the same failure, but has it occurred in any more 4.8
> runs?
> 
AFAIS, that is the only run where it failed.

> It's also odd that it looks like it's running from the v1.21-rc.0 tag, since
> we should be rebased onto the published 1.21 by now:
> >fail [k8s.io/kubernetes@v1.21.0-rc.0/test/e2e/scheduling/ubernetes_lite.go:190]: Pods were not evenly spread across zones.

Yup, you are right, not sure how that has happened.

Comment 23 RamaKasturi 2021-05-26 12:50:26 UTC
Moving the bug to assigned as per comment 21.

Comment 24 Mike Dame 2021-06-01 13:55:15 UTC
@knarra@redhat.com have there been any more failures since that one? I don't think it is relevant, since it was on the `rc.0` version; if it's been clean otherwise, then I think we can mark this resolved.

Comment 27 Mike Dame 2021-06-17 15:04:11 UTC
Thanks for checking.
The last 3 failures are timeouts, coupled with many other test failures, so I think we can ignore those as infra flakes.

The first 2 may or may not be valid, as they show some different behavior than what we have seen before. From the logs:
> Jun  3 10:59:08.874: INFO: ComputeCPUMemFraction for node: ip-10-0-181-127.ec2.internal
> Jun  3 10:59:08.874: INFO: Pod for on the node: service-ca-59c9bc945c-hb8fr, Cpu: 10, Mem: 125829120
> [the line above repeats 27 times in the log, always naming the same pod]
> Jun  3 10:59:08.874: INFO: Node: ip-10-0-181-127.ec2.internal, totalRequestedCPUResource: 370, cpuAllocatableMil: 3500, cpuFraction: 0.10571428571428572
> Jun  3 10:59:08.874: INFO: Node: ip-10-0-181-127.ec2.internal, totalRequestedMemResource: 3502243840, memAllocatableVal: 15622287360, memFraction: 0.22418252585516388
> Jun  3 10:59:08.874: INFO: ComputeCPUMemFraction for node: ip-10-0-181-216.ec2.internal
> Jun  3 10:59:08.874: INFO: Pod for on the node: service-ca-59c9bc945c-hb8fr, Cpu: 10, Mem: 125829120
> [the line above repeats 21 times in the log, always naming the same pod]
> Jun  3 10:59:08.874: INFO: Node: ip-10-0-181-216.ec2.internal, totalRequestedCPUResource: 310, cpuAllocatableMil: 3500, cpuFraction: 0.08857142857142856
> Jun  3 10:59:08.874: INFO: Node: ip-10-0-181-216.ec2.internal, totalRequestedMemResource: 2747269120, memAllocatableVal: 15622287360, memFraction: 0.17585575381452975
> Jun  3 10:59:08.910: INFO: Waiting for running...
> Jun  3 10:59:13.995: INFO: Waiting for running...
> STEP: Compute Cpu, Mem Fraction after create balanced pods.
> Jun  3 10:59:19.164: INFO: ComputeCPUMemFraction for node: ip-10-0-181-127.ec2.internal
> Jun  3 10:59:19.164: INFO: Pod for on the node: service-ca-59c9bc945c-hb8fr, Cpu: 10, Mem: 125829120
> [the line above repeats 28 times in the log, always naming the same pod]
> Jun  3 10:59:19.164: INFO: Node: ip-10-0-181-127.ec2.internal, totalRequestedCPUResource: 380, cpuAllocatableMil: 3500, cpuFraction: 0.10857142857142857
> Jun  3 10:59:19.164: INFO: Node: ip-10-0-181-127.ec2.internal, totalRequestedMemResource: 3628072960, memAllocatableVal: 15622287360, memFraction: 0.23223698786193625
> STEP: Compute Cpu, Mem Fraction after create balanced pods.
> Jun  3 10:59:19.164: INFO: ComputeCPUMemFraction for node: ip-10-0-181-216.ec2.internal
> Jun  3 10:59:19.164: INFO: Pod for on the node: service-ca-59c9bc945c-hb8fr, Cpu: 10, Mem: 125829120
> [the line above repeats 22 times in the log, always naming the same pod]
> Jun  3 10:59:19.164: INFO: Node: ip-10-0-181-216.ec2.internal, totalRequestedCPUResource: 320, cpuAllocatableMil: 3500, cpuFraction: 0.09142857142857143
> Jun  3 10:59:19.164: INFO: Node: ip-10-0-181-216.ec2.internal, totalRequestedMemResource: 2873098240, memAllocatableVal: 15622287360, memFraction: 0.18391021582130213

This shows a few interesting things:
1) the calculation is only looking at 1 pod, over and over. Therefore the computed totals are wrong, and the attempt at "balancing" is also wrong (you can see the fractions on the two nodes still differ after balancing). I wonder if this could be related to the recent upstream change where we create the balanced pods in parallel, but I am not so sure (https://github.com/kubernetes/kubernetes/pull/102138)
2) There are only 2 nodes
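Incidentally, the fractions in the log are internally consistent with a plain sum-of-requests-over-allocatable division, so the problem in point 1 is the duplicated pod counting feeding the totals, not the division itself. A hedged Python sketch of what `ComputeCPUMemFraction` appears to compute (the function name below is mine, not the e2e helper's):

```python
def requested_fraction(total_requested, allocatable):
    """Fraction of a node's allocatable resource already requested,
    as the ComputeCPUMemFraction log lines suggest."""
    return total_requested / allocatable

# Numbers taken from the first node's log lines above:
print(requested_fraction(370, 3500))                # cpuFraction, ~0.1057
print(requested_fraction(3502243840, 15622287360))  # memFraction, ~0.2242
```

Plugging in the logged totals reproduces the logged `cpuFraction` and `memFraction` values exactly, which supports the reading that only the inputs (the repeated pod) are wrong.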

Point 2 is important because, further in the test logs, we see:
> Jun  3 10:59:19.542: INFO: Waiting for running...
> STEP: Getting zone name for pod test-service-0, on node ip-10-0-181-216.ec2.internal
> STEP: Getting zone name for pod test-service-1, on node ip-10-0-181-127.ec2.internal
> STEP: Getting zone name for pod test-service-2, on node ip-10-0-181-216.ec2.internal
> STEP: Getting zone name for pod test-service-3, on node ip-10-0-181-127.ec2.internal
> STEP: Getting zone name for pod test-service-4, on node ip-10-0-181-216.ec2.internal
> STEP: Getting zone name for pod test-service-5, on node ip-10-0-181-127.ec2.internal
> STEP: Getting zone name for pod test-service-6, on node ip-10-0-181-216.ec2.internal
> STEP: Getting zone name for pod test-service-7, on node ip-10-0-181-127.ec2.internal
> STEP: Getting zone name for pod test-service-8, on node ip-10-0-181-216.ec2.internal
> STEP: Getting zone name for pod test-service-9, on node ip-10-0-181-127.ec2.internal

This indicates that the pods are actually evenly spread between the 2 nodes (5 on each). However, the pods are all "Pending".


Then, the test fails with:
> fail [k8s.io/kubernetes@v1.21.1/test/e2e/scheduling/ubernetes_lite.go:190]: Pods were not evenly spread across zones.  0 in one zone and 10 in another zone
> Expected
>     <int>: 10
> to be within 2 of ~
>     <int>: 0

Which does not reflect the scheduling balance that we see above.

Therefore, I currently suspect that this is also an infra flake, especially given that it has only shown up in 2 runs. When I have the time to investigate further I will, but in the meantime I think we should continue to monitor these tests to see if any real failures show up. We can dismiss rare flakes like this.

Comment 28 Mike Dame 2021-09-03 15:47:21 UTC
@knarra@redhat.com are we still observing any failures for this test?

Comment 30 Maciej Szulik 2022-01-17 16:21:43 UTC
*** Bug 1925941 has been marked as a duplicate of this bug. ***

Comment 31 Jan Chaloupka 2022-02-18 11:24:01 UTC
Due to higher-priority tasks I have not been able to resolve this issue in time. Moving to the next sprint.

Comment 32 Jan Chaloupka 2022-03-07 14:54:07 UTC
Given that the issue is reported against 4.6 (we are releasing 4.10 and already working on 4.11, so we are only fixing 4.7+) and given the comment in https://bugzilla.redhat.com/show_bug.cgi?id=1896558#c27, I am closing this issue as not fixed.

