Bug 1896558 - [sig-scheduling] Multi-AZ Clusters should spread the pods of a service across zones [Serial] [NEEDINFO]
Summary: [sig-scheduling] Multi-AZ Clusters should spread the pods of a service across...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Jan Chaloupka
QA Contact: RamaKasturi
URL:
Whiteboard: LifecycleReset tag-ci
Duplicates: 1925941
Depends On:
Blocks: 1925941 1929389
Reported: 2020-11-10 21:34 UTC by Patrick Dillon
Modified: 2022-03-07 14:54 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
[sig-scheduling] Multi-AZ Clusters should spread the pods of a service across zones [Serial]
Last Closed: 2022-03-07 14:54:07 UTC
Target Upstream Version:
mdame: needinfo? (knarra)


Attachments (Terms of Use)
must-gather from failed CI run missing node (12.90 MB, application/gzip)
2021-02-10 14:13 UTC, Mike Dame
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift kubernetes pull 525 0 None closed Bug 1896558: Skip multiaz e2e 2021-02-15 17:08:27 UTC
Github openshift kubernetes pull 545 0 None closed Bug 1896558: Revert undesired multi az skip 2021-02-15 17:08:26 UTC
Github openshift kubernetes pull 547 0 None closed Bug 1896558: Balance nodes in scheduling e2e 2021-02-15 17:08:26 UTC
Github openshift origin pull 25848 0 None closed Bug 1896558: bump(openshift/kubernetes): fix flaking multi-AZ test 2021-02-15 17:08:27 UTC

Comment 1 Maciej Szulik 2020-11-12 12:08:57 UTC
Mike you looked previously at this https://bugzilla.redhat.com/show_bug.cgi?id=1806594

Comment 3 Mike Dame 2020-12-04 18:14:27 UTC
Due to being occupied fixing higher-severity bugs, I was not able to address this. I'm adding UpcomingSprint to investigate it in a future sprint.

Comment 4 Michal Fojtik 2020-12-17 19:10:09 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority.

If you have further information on the current state of the bug, please update it; otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Additionally, you can add LifecycleFrozen to Keywords if you think this bug should never be marked as stale. Please consult with the bug assignee before you do that.

Comment 5 Mike Dame 2021-01-14 20:40:10 UTC
This test is still flaking; removing LifecycleStale to investigate further.

Comment 6 Michal Fojtik 2021-01-14 21:38:55 UTC
The LifecycleStale keyword was removed because the needinfo? flag was reset and the bug got commented on recently.
The bug assignee was notified.

Comment 7 Mike Dame 2021-01-17 02:23:50 UTC
I was unable to focus on this bug again this sprint, but adding upcomingsprint to look at it in the future. There is also an upstream issue for this test at https://github.com/kubernetes/kubernetes/issues/89178

Comment 8 Maru Newby 2021-01-18 15:12:37 UTC
Moving back to NEW - PR #525 is just skipping the test pending a fix.

Comment 11 Mike Dame 2021-02-04 18:09:53 UTC
The upstream PR for this is still undergoing review: https://github.com/kubernetes/kubernetes/pull/98583

Comment 14 Mike Dame 2021-02-10 14:12:26 UTC
The recent failures are occurring when trying to create the balancing pod (balancing pods were just added to this test as part of the fix). The errors appear to be the same as in https://github.com/kubernetes/kubernetes/pull/98073 (not meeting cri-o minimum memory limit)

The failure that you linked (for PR 557) is interesting, because at first it does appear to be a legitimate failure:
```
fail [k8s.io/kubernetes@v1.20.0/test/e2e/scheduling/ubernetes_lite.go:190]: Pods were not evenly spread across zones.  3 in one zone and 7 in another zone
Expected
    <int>: 4
to be within 2 of ~
    <int>: 0
```

But looking at the logs, the pod distribution on the nodes was actually [3,3,4] which is within the limit:
```
STEP: Getting zone name for pod test-service-0, on node ip-10-0-182-156.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-1, on node ip-10-0-219-149.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-2, on node ip-10-0-145-212.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-3, on node ip-10-0-219-149.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-4, on node ip-10-0-182-156.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-5, on node ip-10-0-145-212.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-6, on node ip-10-0-219-149.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-7, on node ip-10-0-182-156.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-8, on node ip-10-0-145-212.us-west-2.compute.internal
STEP: Getting zone name for pod test-service-9, on node ip-10-0-182-156.us-west-2.compute.internal
```

However, when I tried to check these nodes in the must-gather (attached), the node `ip-10-0-182-156` does not exist. This is very odd to me: if the test was somehow looking up a non-existent node, I can see its zone being misread as some default value. Since that node held 4 pods, grouping its pods with either of the other two nodes would produce the [7,3] split reported in the failure.
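For reference, judging by the "to be within 2 of" Gomega-style message above, the test's assertion boils down to comparing the max and min pod counts per zone against a tolerance of 2. A minimal Python sketch of that check (the real logic lives in Go in `ubernetes_lite.go`; the function name here is illustrative, not the test's):

```python
from collections import Counter

def spread_diff(pod_zones):
    """Max-minus-min pod count across zones; the e2e test apparently
    requires this difference to be within 2 of 0."""
    counts = Counter(pod_zones)
    return max(counts.values()) - min(counts.values())

# The reported failure: 3 pods in one zone, 7 in another -> diff 4, fails
print(spread_diff(["us-west-2a"] * 3 + ["us-west-2b"] * 7))       # 4
# The per-node distribution actually logged, [3, 3, 4] -> diff 1, passes
print(spread_diff(["a"] * 3 + ["b"] * 3 + ["c"] * 4))             # 1
```

This is consistent with the theory above: if the 4 pods on the missing node were attributed to one of the other two zones, the counts become [7, 3] and the diff jumps to 4.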

I need to keep working on the kubernetes PR linked above to get approval, in order to make sure the minimum memory limit failures don't happen.

In the meantime, @Rama please keep an eye out for more failures that match the "Expected / to be within 2 of" message. These are the legitimate failures that we will need to be aware of, if more occur.

Comment 15 Mike Dame 2021-02-10 14:13:29 UTC
Created attachment 1756212 [details]
must-gather from failed CI run missing node

Comment 16 Mike Dame 2021-02-23 13:45:02 UTC
The relevant PRs I was referring to [1][2] have merged now, so moving this back to ON_QA to check for failures following my above comment: https://bugzilla.redhat.com/show_bug.cgi?id=1896558#c14

1. https://github.com/openshift/kubernetes/pull/547
2. https://github.com/openshift/kubernetes/pull/526

Comment 17 Mike Dame 2021-02-23 14:55:20 UTC
Sorry, this still needs a bump PR in origin

Comment 18 Mike Dame 2021-05-17 15:02:10 UTC
Switching this back to MODIFIED, as origin has since been bumped to include all the above changes

Comment 20 RamaKasturi 2021-05-20 07:24:04 UTC
Hello Mike,

   I see that this test failed on a 4.8 cluster 34 hours ago. Is this expected? Below is the link where I saw the failure:

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-serial-4.8/1394736589866799104

Comment 21 Mike Dame 2021-05-20 14:08:29 UTC
That does look like the same failure, but has it occurred in any more 4.8 runs?

It's also odd that it looks like it's running from the v1.21-rc.0 tag, since we should be rebased onto the published 1.21 by now:
>fail [k8s.io/kubernetes@v1.21.0-rc.0/test/e2e/scheduling/ubernetes_lite.go:190]: Pods were not evenly spread across zones.

Comment 22 RamaKasturi 2021-05-20 15:58:17 UTC
(In reply to Mike Dame from comment #21)
> That does look like the same failure, but has it occurred in any more 4.8
> runs?
> 
AFAIS, that is the only run where it failed.

> It's also odd that it looks like it's running from the v1.21-rc.0 tag, since
> we should be rebased onto the published 1.21 by now:
> >fail [k8s.io/kubernetes@v1.21.0-rc.0/test/e2e/scheduling/ubernetes_lite.go:190]: Pods were not evenly spread across zones.

Yup, you are right, not sure how that has happened.

Comment 23 RamaKasturi 2021-05-26 12:50:26 UTC
Moving the bug to assigned as per comment 21.

Comment 24 Mike Dame 2021-06-01 13:55:15 UTC
@knarra@redhat.com have there been any more failures since that one? I don't think it is relevant, since it was on the `rc.0` version; if it's been clean otherwise, then I think we can mark this resolved.

Comment 27 Mike Dame 2021-06-17 15:04:11 UTC
Thanks for checking.
The last 3 failures are timeouts, coupled with many other test failures, so I think we can ignore those as infra flakes.

The first 2 may or may not be valid, as they show some different behavior than what we have seen before. From the logs:
> Jun  3 10:59:08.874: INFO: ComputeCPUMemFraction for node: ip-10-0-181-127.ec2.internal
> Jun  3 10:59:08.874: INFO: Pod for on the node: service-ca-59c9bc945c-hb8fr, Cpu: 10, Mem: 125829120
> [the line above repeats 27 times in the log, always naming the same pod]
> Jun  3 10:59:08.874: INFO: Node: ip-10-0-181-127.ec2.internal, totalRequestedCPUResource: 370, cpuAllocatableMil: 3500, cpuFraction: 0.10571428571428572
> Jun  3 10:59:08.874: INFO: Node: ip-10-0-181-127.ec2.internal, totalRequestedMemResource: 3502243840, memAllocatableVal: 15622287360, memFraction: 0.22418252585516388
> Jun  3 10:59:08.874: INFO: ComputeCPUMemFraction for node: ip-10-0-181-216.ec2.internal
> Jun  3 10:59:08.874: INFO: Pod for on the node: service-ca-59c9bc945c-hb8fr, Cpu: 10, Mem: 125829120
> [the line above repeats 21 times in the log, always naming the same pod]
> Jun  3 10:59:08.874: INFO: Node: ip-10-0-181-216.ec2.internal, totalRequestedCPUResource: 310, cpuAllocatableMil: 3500, cpuFraction: 0.08857142857142856
> Jun  3 10:59:08.874: INFO: Node: ip-10-0-181-216.ec2.internal, totalRequestedMemResource: 2747269120, memAllocatableVal: 15622287360, memFraction: 0.17585575381452975
> Jun  3 10:59:08.910: INFO: Waiting for running...
> Jun  3 10:59:13.995: INFO: Waiting for running...
> STEP: Compute Cpu, Mem Fraction after create balanced pods.
> Jun  3 10:59:19.164: INFO: ComputeCPUMemFraction for node: ip-10-0-181-127.ec2.internal
> Jun  3 10:59:19.164: INFO: Pod for on the node: service-ca-59c9bc945c-hb8fr, Cpu: 10, Mem: 125829120
> [the line above repeats 28 times in the log, always naming the same pod]
> Jun  3 10:59:19.164: INFO: Node: ip-10-0-181-127.ec2.internal, totalRequestedCPUResource: 380, cpuAllocatableMil: 3500, cpuFraction: 0.10857142857142857
> Jun  3 10:59:19.164: INFO: Node: ip-10-0-181-127.ec2.internal, totalRequestedMemResource: 3628072960, memAllocatableVal: 15622287360, memFraction: 0.23223698786193625
> STEP: Compute Cpu, Mem Fraction after create balanced pods.
> Jun  3 10:59:19.164: INFO: ComputeCPUMemFraction for node: ip-10-0-181-216.ec2.internal
> Jun  3 10:59:19.164: INFO: Pod for on the node: service-ca-59c9bc945c-hb8fr, Cpu: 10, Mem: 125829120
> [the line above repeats 22 times in the log, always naming the same pod]
> Jun  3 10:59:19.164: INFO: Node: ip-10-0-181-216.ec2.internal, totalRequestedCPUResource: 320, cpuAllocatableMil: 3500, cpuFraction: 0.09142857142857143
> Jun  3 10:59:19.164: INFO: Node: ip-10-0-181-216.ec2.internal, totalRequestedMemResource: 2873098240, memAllocatableVal: 15622287360, memFraction: 0.18391021582130213

This shows a few interesting things:
1) the calculation is only looking at 1 pod, over and over. Therefore the computed totals are wrong, and the attempt at "balancing" is also wrong (you can see the fractions on the two nodes still differ after balancing). I wonder if this could be related to the recent upstream change where we create the balanced pods in parallel, but I am not so sure (https://github.com/kubernetes/kubernetes/pull/102138)
2) There are only 2 nodes
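Incidentally, the fractions in the log are internally consistent with a plain sum-of-requests-over-allocatable division, so the problem in point 1 is the duplicated pod counting feeding the totals, not the division itself. A hedged Python sketch of what `ComputeCPUMemFraction` appears to compute (the function name below is mine, not the e2e helper's):

```python
def requested_fraction(total_requested, allocatable):
    """Fraction of a node's allocatable resource already requested,
    as the ComputeCPUMemFraction log lines suggest."""
    return total_requested / allocatable

# Numbers taken from the first node's log lines above:
print(requested_fraction(370, 3500))                # cpuFraction, ~0.1057
print(requested_fraction(3502243840, 15622287360))  # memFraction, ~0.2242
```

Plugging in the logged totals reproduces the logged `cpuFraction` and `memFraction` values exactly, which supports the reading that only the inputs (the repeated pod) are wrong.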

Point 2 is important because, further in the test logs, we see:
> Jun  3 10:59:19.542: INFO: Waiting for running...
> STEP: Getting zone name for pod test-service-0, on node ip-10-0-181-216.ec2.internal
> STEP: Getting zone name for pod test-service-1, on node ip-10-0-181-127.ec2.internal
> STEP: Getting zone name for pod test-service-2, on node ip-10-0-181-216.ec2.internal
> STEP: Getting zone name for pod test-service-3, on node ip-10-0-181-127.ec2.internal
> STEP: Getting zone name for pod test-service-4, on node ip-10-0-181-216.ec2.internal
> STEP: Getting zone name for pod test-service-5, on node ip-10-0-181-127.ec2.internal
> STEP: Getting zone name for pod test-service-6, on node ip-10-0-181-216.ec2.internal
> STEP: Getting zone name for pod test-service-7, on node ip-10-0-181-127.ec2.internal
> STEP: Getting zone name for pod test-service-8, on node ip-10-0-181-216.ec2.internal
> STEP: Getting zone name for pod test-service-9, on node ip-10-0-181-127.ec2.internal

This indicates that the pods are actually evenly spread between the 2 nodes (5 on each). However, the pods are all "Pending".


Then, the test fails with:
> fail [k8s.io/kubernetes@v1.21.1/test/e2e/scheduling/ubernetes_lite.go:190]: Pods were not evenly spread across zones.  0 in one zone and 10 in another zone
> Expected
>     <int>: 10
> to be within 2 of ~
>     <int>: 0

Which does not reflect the scheduling balance that we see above.

Therefore, I currently suspect that this is also an infra flake, especially given that it has only shown up in 2 runs. When I have the time to investigate further I will, but in the meantime I think we should continue to monitor these tests to see if any real failures show up. We can dismiss rare flakes like this.

Comment 28 Mike Dame 2021-09-03 15:47:21 UTC
@knarra@redhat.com are we still observing any failures for this test?

Comment 30 Maciej Szulik 2022-01-17 16:21:43 UTC
*** Bug 1925941 has been marked as a duplicate of this bug. ***

Comment 31 Jan Chaloupka 2022-02-18 11:24:01 UTC
Due to higher-priority tasks I have not been able to resolve this issue in time. Moving to the next sprint.

Comment 32 Jan Chaloupka 2022-03-07 14:54:07 UTC
Given that the issue is reported against 4.6 (we are releasing 4.10 and already working on 4.11, so we are only fixing 4.7+) and given the comment in https://bugzilla.redhat.com/show_bug.cgi?id=1896558#c27, I am closing this issue as not fixed.

