Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1929389

Summary:	[sig-scheduling] Multi-AZ Clusters should spread the pods of a replication controller across zones
Product:	OpenShift Container Platform	Reporter:	Surya Seetharaman <surya>
Component:	kube-scheduler	Assignee:	Maciej Szulik <maszulik>
Status:	CLOSED EOL	QA Contact:	RamaKasturi <knarra>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.7	CC:	aos-bugs, dgoodwin, fpaoline, mfojtik
Target Milestone:	---
Target Release:	4.7.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	tag-ci
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1929684		Environment:	[sig-scheduling] Multi-AZ Clusters should spread the pods of a replication controller across zones
Last Closed:	2022-05-25 11:00:59 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1896558
Bug Blocks:	1929684

Description Surya Seetharaman 2021-02-16 18:53:58 UTC

The test:
[sig-scheduling] Multi-AZ Clusters should spread the pods of a replication controller across zones

is failing frequently in CI. See the search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-scheduling%5C%5D+Multi-AZ+Clusters+should+spread+the+pods+of+a+replication+controller+across+zones


Examples:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.7/1361634319889600512

[sig-scheduling] Multi-AZ Clusters should spread the pods of a replication controller across zones [Suite:openshift/conformance/parallel] [Suite:k8s]
Run #0: Failed (48s)
fail [k8s.io/kubernetes.0/test/e2e/scheduling/ubernetes_lite.go:174]: Pods were not evenly spread across zones.  3 in one zone and 6 in another zone
Expected
    <int>: 3
to be within 2 of ~
    <int>: 0

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-ovn-upgrade-4.6-stable-to-4.7-ci/1361548482682294272

[sig-scheduling] Multi-AZ Clusters should spread the pods of a replication controller across zones [Suite:openshift/conformance/parallel] [Suite:k8s] (14s)
fail [k8s.io/kubernetes.0/test/e2e/scheduling/ubernetes_lite.go:174]: Pods were not evenly spread across zones.  0 in one zone and 10 in another zone
Expected
    <int>: 10
to be within 2 of ~
    <int>: 0
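
For readers unfamiliar with the assertion, here is a minimal sketch of the check as it can be inferred from the two failure messages above: the difference between the most-populated and least-populated zone must be within 2 of 0. This is not the upstream Go code from ubernetes_lite.go; the function names, the zone labels "a" and "b", and the assumption of exactly two zones are hypothetical and used only for illustration.

```python
from collections import Counter


def zone_skew(pod_zones, all_zones):
    """Difference between the fullest and emptiest zone.

    pod_zones: one zone label per scheduled pod.
    all_zones: every zone that has schedulable nodes, so that a zone
               with no pods still counts as 0.
    """
    counts = Counter(pod_zones)
    per_zone = [counts.get(zone, 0) for zone in all_zones]
    return max(per_zone) - min(per_zone)


def evenly_spread(pod_zones, all_zones, tolerance=2):
    """Mirror of "Expected <skew> to be within 2 of ~ 0"."""
    return zone_skew(pod_zones, all_zones) <= tolerance


if __name__ == "__main__":
    zones = ["a", "b"]  # hypothetical zone labels
    gcp_run = ["a"] * 3 + ["b"] * 6   # "3 in one zone and 6 in another zone"
    aws_run = ["b"] * 10              # "0 in one zone and 10 in another zone"
    print(zone_skew(gcp_run, zones), evenly_spread(gcp_run, zones))
    print(zone_skew(aws_run, zones), evenly_spread(aws_run, zones))
```

Under these assumptions the first run has a skew of 3 and the second a skew of 10, which matches the `<int>` values reported in the two failures.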

Comment 1 Maciej Szulik 2021-02-18 10:49:47 UTC

*** Bug 1929684 has been marked as a duplicate of this bug. ***

Comment 2 Mike Dame 2021-02-23 13:45:49 UTC

This should be addressed by fixes added in https://github.com/openshift/kubernetes/pull/547 and https://github.com/openshift/kubernetes/pull/526

Comment 3 Mike Dame 2021-02-23 18:28:08 UTC

I am wondering if the fix from https://github.com/openshift/kubernetes/pull/547 (which seems to have fixed the Service spreading test) created this failure, or if this failure existed before that.

One thing I notice is that in these failures (example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.8/1364212636652146688) the pods being reported on each node are changing throughout the test. This makes it impossible for the above fix to actually balance the nodes, meaning that resource usage will interfere with the scheduling decision.

Take the output from the above test:

(before balancing)
> Feb 23 14:46:12.203: INFO: Waiting up to 1m0s for all nodes to be ready
> Feb 23 14:47:12.744: INFO: ComputeCPUMemFraction for node: ci-op-cvr5bfr2-df208-g28mm-worker-b-vcsxj
> Feb 23 14:47:12.873: INFO: Pod for on the node: pod-handle-http-request, Cpu: 100, Mem: 209715200
> Feb 23 14:47:12.873: INFO: Pod for on the node: csi-mockplugin-0, Cpu: 300, Mem: 629145600
> Feb 23 14:47:12.873: INFO: Pod for on the node: csi-mockplugin-attacher-0, Cpu: 100, Mem: 209715200
> Feb 23 14:47:12.873: INFO: Pod for on the node: test-recreate-deployment-5888b58954-2nwzf, Cpu: 100, Mem: 209715200
> Feb 23 14:47:12.873: INFO: Pod for on the node: simpletest.rc-5kngs, Cpu: 100, Mem: 209715200
> Feb 23 14:47:12.873: INFO: Pod for on the node: simpletest.rc-8pxdv, Cpu: 100, Mem: 209715200
> Feb 23 14:47:12.873: INFO: Pod for on the node: simpletest.rc-hdh6s, Cpu: 100, Mem: 209715200
> Feb 23 14:47:12.873: INFO: Pod for on the node: hostexec-ci-op-cvr5bfr2-df208-g28mm-worker-b-vcsxj-7nbt7, Cpu: 100, Mem: 209715200
> Feb 23 14:47:12.873: INFO: Pod for on the node: netserver-0, Cpu: 100, Mem: 209715200
> Feb 23 14:47:12.873: INFO: Pod for on the node: pod-submit-status-0-2, Cpu: 5, Mem: 10485760
> Feb 23 14:47:12.873: INFO: Pod for on the node: explicit-nonroot-uid, Cpu: 100, Mem: 209715200
> Feb 23 14:47:12.873: INFO: Pod for on the node: hostexec-ci-op-cvr5bfr2-df208-g28mm-worker-b-vcsxj-5pft2, Cpu: 100, Mem: 209715200
> Feb 23 14:47:12.873: INFO: Pod for on the node: gcp-pd-csi-driver-node-ck46r, Cpu: 30, Mem: 157286400
> Feb 23 14:47:12.873: INFO: Pod for on the node: tuned-zz6c9, Cpu: 10, Mem: 52428800
> Feb 23 14:47:12.873: INFO: Pod for on the node: downloads-846fcb6857-xxs7w, Cpu: 10, Mem: 52428800
> Feb 23 14:47:12.873: INFO: Pod for on the node: dns-default-9fszd, Cpu: 65, Mem: 137363456
> Feb 23 14:47:12.873: INFO: Pod for on the node: image-registry-5d7cbc6796-5phf5, Cpu: 100, Mem: 268435456
> Feb 23 14:47:12.873: INFO: Pod for on the node: node-ca-hp5w8, Cpu: 10, Mem: 10485760
> Feb 23 14:47:12.873: INFO: Pod for on the node: ingress-canary-hv7t8, Cpu: 10, Mem: 20971520
> Feb 23 14:47:12.873: INFO: Pod for on the node: router-default-58bb79bdb8-4q4wj, Cpu: 100, Mem: 268435456
> Feb 23 14:47:12.873: INFO: Pod for on the node: migrator-7bc78664fd-fwvcj, Cpu: 10, Mem: 209715200
> Feb 23 14:47:12.873: INFO: Pod for on the node: machine-config-daemon-98694, Cpu: 40, Mem: 104857600
> Feb 23 14:47:12.873: INFO: Pod for on the node: ab0ec41ac51719de72554e09c32400b13c6d15dcf7d38302d5ed14fcb2qfbfm, Cpu: 100, Mem: 209715200
> Feb 23 14:47:12.873: INFO: Pod for on the node: certified-operators-v2lm5, Cpu: 10, Mem: 52428800
> Feb 23 14:47:12.873: INFO: Pod for on the node: community-operators-mkzjl, Cpu: 10, Mem: 52428800
> Feb 23 14:47:12.873: INFO: Pod for on the node: community-operators-rhvb9, Cpu: 10, Mem: 52428800
> Feb 23 14:47:12.873: INFO: Pod for on the node: redhat-marketplace-k7t7v, Cpu: 10, Mem: 52428800
> Feb 23 14:47:12.873: INFO: Pod for on the node: redhat-operators-l8586, Cpu: 10, Mem: 52428800
> Feb 23 14:47:12.873: INFO: Pod for on the node: alertmanager-main-1, Cpu: 8, Mem: 283115520
> Feb 23 14:47:12.873: INFO: Pod for on the node: kube-state-metrics-54b6ff9dc-wfm7f, Cpu: 4, Mem: 125829120
> Feb 23 14:47:12.873: INFO: Pod for on the node: node-exporter-jx2mr, Cpu: 9, Mem: 220200960
> Feb 23 14:47:12.873: INFO: Pod for on the node: openshift-state-metrics-6757ffd766-mmrxq, Cpu: 3, Mem: 199229440
> Feb 23 14:47:12.873: INFO: Pod for on the node: prometheus-adapter-5557d74fdf-htmsl, Cpu: 1, Mem: 26214400
> Feb 23 14:47:12.873: INFO: Pod for on the node: prometheus-k8s-1, Cpu: 76, Mem: 1262485504
> Feb 23 14:47:12.873: INFO: Pod for on the node: telemeter-client-649ff75866-dfxb7, Cpu: 3, Mem: 73400320
> Feb 23 14:47:12.873: INFO: Pod for on the node: thanos-querier-57564f89f7-hzjnz, Cpu: 9, Mem: 96468992
> Feb 23 14:47:12.873: INFO: Pod for on the node: multus-dclw7, Cpu: 10, Mem: 157286400
> Feb 23 14:47:12.873: INFO: Pod for on the node: network-metrics-daemon-998jw, Cpu: 20, Mem: 125829120
> Feb 23 14:47:12.873: INFO: Pod for on the node: network-check-source-5584f5cfcc-2dcdt, Cpu: 10, Mem: 41943040
> Feb 23 14:47:12.873: INFO: Pod for on the node: network-check-target-c4tqd, Cpu: 10, Mem: 15728640
> Feb 23 14:47:12.873: INFO: Pod for on the node: ovs-9zwnr, Cpu: 15, Mem: 419430400
> Feb 23 14:47:12.873: INFO: Pod for on the node: sdn-287mj, Cpu: 110, Mem: 230686720
> Feb 23 14:47:12.873: INFO: Node: ci-op-cvr5bfr2-df208-g28mm-worker-b-vcsxj, totalRequestedCPUResource: 828, cpuAllocatableMil: 3500, cpuFraction: 0.23657142857142857
> Feb 23 14:47:12.873: INFO: Node: ci-op-cvr5bfr2-df208-g28mm-worker-b-vcsxj, totalRequestedMemResource: 4937744384, memAllocatableVal: 14568333312, memFraction: 0.33893680754357525
> Feb 23 14:47:12.873: INFO: ComputeCPUMemFraction for node: ci-op-cvr5bfr2-df208-g28mm-worker-c-sw428
> Feb 23 14:47:13.028: INFO: Pod for on the node: startup-b78f504b-237f-4758-9d3e-a89ce75ff8ea, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.028: INFO: Pod for on the node: simpletest.rc-4fcgf, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.028: INFO: Pod for on the node: simpletest.rc-6zwpd, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.028: INFO: Pod for on the node: simpletest.rc-ddjnt, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.028: INFO: Pod for on the node: simpletest.rc-kzt2n, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.028: INFO: Pod for on the node: gluster-server, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.028: INFO: Pod for on the node: busybox-readonly-fs8c25040f-a95a-4c95-ab00-1c4b8a16bf67, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.028: INFO: Pod for on the node: server-7fx9g, Cpu: 200, Mem: 419430400
> Feb 23 14:47:13.028: INFO: Pod for on the node: netserver-1, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.028: INFO: Pod for on the node: example-1-deploy, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.028: INFO: Pod for on the node: deployment-simple-1-deploy, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.028: INFO: Pod for on the node: deployment-simple-1-hook-pre, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.028: INFO: Pod for on the node: custom-builder-image-1-build, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.028: INFO: Pod for on the node: sample-custom-build-1-build, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.028: INFO: Pod for on the node: pod-6b81707d-e327-4646-86e7-4018c3794134, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.028: INFO: Pod for on the node: gcp-pd-csi-driver-node-49jdt, Cpu: 30, Mem: 157286400
> Feb 23 14:47:13.028: INFO: Pod for on the node: tuned-c2c5t, Cpu: 10, Mem: 52428800
> Feb 23 14:47:13.028: INFO: Pod for on the node: dns-default-mfxcx, Cpu: 65, Mem: 137363456
> Feb 23 14:47:13.028: INFO: Pod for on the node: node-ca-6z27x, Cpu: 10, Mem: 10485760
> Feb 23 14:47:13.028: INFO: Pod for on the node: ingress-canary-6cmx4, Cpu: 10, Mem: 20971520
> Feb 23 14:47:13.028: INFO: Pod for on the node: machine-config-daemon-6xd7h, Cpu: 40, Mem: 104857600
> Feb 23 14:47:13.028: INFO: Pod for on the node: node-exporter-q5pbd, Cpu: 9, Mem: 220200960
> Feb 23 14:47:13.028: INFO: Pod for on the node: multus-hzw4g, Cpu: 10, Mem: 157286400
> Feb 23 14:47:13.028: INFO: Pod for on the node: network-metrics-daemon-drp8f, Cpu: 20, Mem: 125829120
> Feb 23 14:47:13.028: INFO: Pod for on the node: network-check-target-zwx95, Cpu: 10, Mem: 15728640
> Feb 23 14:47:13.028: INFO: Pod for on the node: ovs-2lsjd, Cpu: 15, Mem: 419430400
> Feb 23 14:47:13.028: INFO: Pod for on the node: sdn-tgkxh, Cpu: 110, Mem: 230686720
> Feb 23 14:47:13.028: INFO: Node: ci-op-cvr5bfr2-df208-g28mm-worker-c-sw428, totalRequestedCPUResource: 439, cpuAllocatableMil: 3500, cpuFraction: 0.12542857142857142
> Feb 23 14:47:13.028: INFO: Node: ci-op-cvr5bfr2-df208-g28mm-worker-c-sw428, totalRequestedMemResource: 1757413376, memAllocatableVal: 14568333312, memFraction: 0.12063242502506522
> Feb 23 14:47:13.028: INFO: ComputeCPUMemFraction for node: ci-op-cvr5bfr2-df208-g28mm-worker-d-qp78t
> Feb 23 14:47:13.234: INFO: Pod for on the node: simpletest.rc-2zcbt, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.234: INFO: Pod for on the node: simpletest.rc-7tbcx, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.234: INFO: Pod for on the node: simpletest.rc-qsj72, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.234: INFO: Pod for on the node: gluster-client, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.234: INFO: Pod for on the node: agnhost-pod, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.234: INFO: Pod for on the node: netserver-2, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.234: INFO: Pod for on the node: readiness-1-deploy, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.234: INFO: Pod for on the node: example-1-g58rb, Cpu: 200, Mem: 419430400
> Feb 23 14:47:13.234: INFO: Pod for on the node: append-test, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.234: INFO: Pod for on the node: test-oauth-server, Cpu: 10, Mem: 52428800
> Feb 23 14:47:13.234: INFO: Pod for on the node: sample-webhook-deployment-7fdfd97c84-bqscf, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.234: INFO: Pod for on the node: gcp-pd-csi-driver-node-gdmg2, Cpu: 30, Mem: 157286400
> Feb 23 14:47:13.234: INFO: Pod for on the node: tuned-l5b4b, Cpu: 10, Mem: 52428800
> Feb 23 14:47:13.234: INFO: Pod for on the node: dns-default-2djwv, Cpu: 65, Mem: 137363456
> Feb 23 14:47:13.234: INFO: Pod for on the node: image-registry-5d7cbc6796-47p55, Cpu: 100, Mem: 268435456
> Feb 23 14:47:13.234: INFO: Pod for on the node: node-ca-swk58, Cpu: 10, Mem: 10485760
> Feb 23 14:47:13.234: INFO: Pod for on the node: ingress-canary-qhzf6, Cpu: 10, Mem: 20971520
> Feb 23 14:47:13.234: INFO: Pod for on the node: router-default-58bb79bdb8-zs7s6, Cpu: 100, Mem: 268435456
> Feb 23 14:47:13.234: INFO: Pod for on the node: machine-config-daemon-dplfm, Cpu: 40, Mem: 104857600
> Feb 23 14:47:13.234: INFO: Pod for on the node: alertmanager-main-0, Cpu: 8, Mem: 283115520
> Feb 23 14:47:13.234: INFO: Pod for on the node: alertmanager-main-2, Cpu: 8, Mem: 283115520
> Feb 23 14:47:13.234: INFO: Pod for on the node: grafana-5b8f5b6d96-gwb98, Cpu: 5, Mem: 125829120
> Feb 23 14:47:13.234: INFO: Pod for on the node: node-exporter-6pw82, Cpu: 9, Mem: 220200960
> Feb 23 14:47:13.234: INFO: Pod for on the node: prometheus-adapter-5557d74fdf-xj5sq, Cpu: 1, Mem: 26214400
> Feb 23 14:47:13.234: INFO: Pod for on the node: prometheus-k8s-0, Cpu: 76, Mem: 1262485504
> Feb 23 14:47:13.234: INFO: Pod for on the node: thanos-querier-57564f89f7-xvh4z, Cpu: 9, Mem: 96468992
> Feb 23 14:47:13.234: INFO: Pod for on the node: multus-d76x9, Cpu: 10, Mem: 157286400
> Feb 23 14:47:13.234: INFO: Pod for on the node: network-metrics-daemon-nv4nm, Cpu: 20, Mem: 125829120
> Feb 23 14:47:13.234: INFO: Pod for on the node: network-check-target-rz8wn, Cpu: 10, Mem: 15728640
> Feb 23 14:47:13.234: INFO: Pod for on the node: ovs-rxjl5, Cpu: 15, Mem: 419430400
> Feb 23 14:47:13.234: INFO: Pod for on the node: sdn-8q7d2, Cpu: 110, Mem: 230686720
> Feb 23 14:47:13.234: INFO: Node: ci-op-cvr5bfr2-df208-g28mm-worker-d-qp78t, totalRequestedCPUResource: 756, cpuAllocatableMil: 3500, cpuFraction: 0.216
> Feb 23 14:47:13.234: INFO: Node: ci-op-cvr5bfr2-df208-g28mm-worker-d-qp78t, totalRequestedMemResource: 4423942144, memAllocatableVal: 14568333312, memFraction: 0.30366837779281036
> Feb 23 14:47:13.327: INFO: Waiting for running...
> Feb 23 14:47:23.416: INFO: Waiting for running...
> Feb 23 14:47:33.686: INFO: Waiting for running...

(after balancing)
> STEP: Compute Cpu, Mem Fraction after create balanced pods.
> Feb 23 14:47:38.737: INFO: ComputeCPUMemFraction for node: ci-op-cvr5bfr2-df208-g28mm-worker-b-vcsxj
> Feb 23 14:47:39.794: INFO: Pod for on the node: csi-mockplugin-0, Cpu: 300, Mem: 629145600
> Feb 23 14:47:39.794: INFO: Pod for on the node: csi-mockplugin-attacher-0, Cpu: 100, Mem: 209715200
> Feb 23 14:47:39.794: INFO: Pod for on the node: csi-hostpath-attacher-0, Cpu: 100, Mem: 209715200
> Feb 23 14:47:39.794: INFO: Pod for on the node: csi-hostpath-provisioner-0, Cpu: 100, Mem: 209715200
> Feb 23 14:47:39.794: INFO: Pod for on the node: csi-hostpath-resizer-0, Cpu: 100, Mem: 209715200
> Feb 23 14:47:39.794: INFO: Pod for on the node: csi-hostpath-snapshotter-0, Cpu: 100, Mem: 209715200
> Feb 23 14:47:39.794: INFO: Pod for on the node: csi-hostpathplugin-0, Cpu: 300, Mem: 629145600
> Feb 23 14:47:39.794: INFO: Pod for on the node: inline-volume-tester-kr5cw, Cpu: 100, Mem: 209715200
> Feb 23 14:47:39.794: INFO: Pod for on the node: deployment-1e9e1d60-efb8-4d8a-a3a1-7443062287c6-675fd6b69bdwhct, Cpu: 100, Mem: 209715200
> Feb 23 14:47:39.794: INFO: Pod for on the node: f5c0a6bf-206a-485e-94a6-32762d3a07bc-0, Cpu: 358, Mem: 0
> Feb 23 14:47:39.794: INFO: Pod for on the node: hostexec-ci-op-cvr5bfr2-df208-g28mm-worker-b-vcsxj-n6xdb, Cpu: 100, Mem: 209715200
> Feb 23 14:47:39.794: INFO: Pod for on the node: pod-e96ffac9-93ed-470f-a36e-9899cedaa49b, Cpu: 100, Mem: 209715200
> Feb 23 14:47:39.794: INFO: Pod for on the node: hostexec-ci-op-cvr5bfr2-df208-g28mm-worker-b-vcsxj-7nbt7, Cpu: 100, Mem: 209715200
> Feb 23 14:47:39.794: INFO: Pod for on the node: hostexec-ci-op-cvr5bfr2-df208-g28mm-worker-b-vcsxj-cd9w4, Cpu: 100, Mem: 209715200
> Feb 23 14:47:39.794: INFO: Pod for on the node: host-test-container-pod, Cpu: 100, Mem: 209715200
> Feb 23 14:47:39.794: INFO: Pod for on the node: netserver-0, Cpu: 100, Mem: 209715200
> Feb 23 14:47:39.794: INFO: Pod for on the node: pod-submit-status-0-2, Cpu: 5, Mem: 10485760
> Feb 23 14:47:39.794: INFO: Pod for on the node: explicit-nonroot-uid, Cpu: 100, Mem: 209715200
> Feb 23 14:47:39.794: INFO: Pod for on the node: history-limit-1-5bxvx, Cpu: 100, Mem: 209715200
> Feb 23 14:47:39.795: INFO: Pod for on the node: bc-custom-1-build, Cpu: 100, Mem: 209715200
> Feb 23 14:47:39.795: INFO: Pod for on the node: exec-volume-test-preprovisionedpv-jc8b, Cpu: 100, Mem: 209715200
> Feb 23 14:47:39.795: INFO: Pod for on the node: gcp-pd-csi-driver-node-ck46r, Cpu: 30, Mem: 157286400
> Feb 23 14:47:39.795: INFO: Pod for on the node: tuned-zz6c9, Cpu: 10, Mem: 52428800
> Feb 23 14:47:39.795: INFO: Pod for on the node: downloads-846fcb6857-xxs7w, Cpu: 10, Mem: 52428800
> Feb 23 14:47:39.795: INFO: Pod for on the node: dns-default-9fszd, Cpu: 65, Mem: 137363456
> Feb 23 14:47:39.795: INFO: Pod for on the node: image-registry-5d7cbc6796-5phf5, Cpu: 100, Mem: 268435456
> Feb 23 14:47:39.795: INFO: Pod for on the node: node-ca-hp5w8, Cpu: 10, Mem: 10485760
> Feb 23 14:47:39.795: INFO: Pod for on the node: ingress-canary-hv7t8, Cpu: 10, Mem: 20971520
> Feb 23 14:47:39.795: INFO: Pod for on the node: router-default-58bb79bdb8-4q4wj, Cpu: 100, Mem: 268435456
> Feb 23 14:47:39.795: INFO: Pod for on the node: migrator-7bc78664fd-fwvcj, Cpu: 10, Mem: 209715200
> Feb 23 14:47:39.795: INFO: Pod for on the node: machine-config-daemon-98694, Cpu: 40, Mem: 104857600
> Feb 23 14:47:39.795: INFO: Pod for on the node: ab0ec41ac51719de72554e09c32400b13c6d15dcf7d38302d5ed14fcb2qfbfm, Cpu: 100, Mem: 209715200
> Feb 23 14:47:39.795: INFO: Pod for on the node: certified-operators-v2lm5, Cpu: 10, Mem: 52428800
> Feb 23 14:47:39.795: INFO: Pod for on the node: community-operators-mkzjl, Cpu: 10, Mem: 52428800
> Feb 23 14:47:39.795: INFO: Pod for on the node: redhat-marketplace-k7t7v, Cpu: 10, Mem: 52428800
> Feb 23 14:47:39.795: INFO: Pod for on the node: redhat-operators-l8586, Cpu: 10, Mem: 52428800
> Feb 23 14:47:39.795: INFO: Pod for on the node: alertmanager-main-1, Cpu: 8, Mem: 283115520
> Feb 23 14:47:39.795: INFO: Pod for on the node: kube-state-metrics-54b6ff9dc-wfm7f, Cpu: 4, Mem: 125829120
> Feb 23 14:47:39.795: INFO: Pod for on the node: node-exporter-jx2mr, Cpu: 9, Mem: 220200960
> Feb 23 14:47:39.795: INFO: Pod for on the node: openshift-state-metrics-6757ffd766-mmrxq, Cpu: 3, Mem: 199229440
> Feb 23 14:47:39.795: INFO: Pod for on the node: prometheus-adapter-5557d74fdf-htmsl, Cpu: 1, Mem: 26214400
> Feb 23 14:47:39.795: INFO: Pod for on the node: prometheus-k8s-1, Cpu: 76, Mem: 1262485504
> Feb 23 14:47:39.795: INFO: Pod for on the node: telemeter-client-649ff75866-dfxb7, Cpu: 3, Mem: 73400320
> Feb 23 14:47:39.795: INFO: Pod for on the node: thanos-querier-57564f89f7-hzjnz, Cpu: 9, Mem: 96468992
> Feb 23 14:47:39.795: INFO: Pod for on the node: multus-dclw7, Cpu: 10, Mem: 157286400
> Feb 23 14:47:39.795: INFO: Pod for on the node: network-metrics-daemon-998jw, Cpu: 20, Mem: 125829120
> Feb 23 14:47:39.795: INFO: Pod for on the node: network-check-source-5584f5cfcc-2dcdt, Cpu: 10, Mem: 41943040
> Feb 23 14:47:39.795: INFO: Pod for on the node: network-check-target-c4tqd, Cpu: 10, Mem: 15728640
> Feb 23 14:47:39.795: INFO: Pod for on the node: ovs-9zwnr, Cpu: 15, Mem: 419430400
> Feb 23 14:47:39.795: INFO: Pod for on the node: sdn-287mj, Cpu: 110, Mem: 230686720
> Feb 23 14:47:39.795: INFO: Node: ci-op-cvr5bfr2-df208-g28mm-worker-b-vcsxj, totalRequestedCPUResource: 1176, cpuAllocatableMil: 3500, cpuFraction: 0.336
> Feb 23 14:47:39.795: INFO: Node: ci-op-cvr5bfr2-df208-g28mm-worker-b-vcsxj, totalRequestedMemResource: 4885315584, memAllocatableVal: 14568333312, memFraction: 0.33533798818125227
> STEP: Compute Cpu, Mem Fraction after create balanced pods.
> Feb 23 14:47:39.795: INFO: ComputeCPUMemFraction for node: ci-op-cvr5bfr2-df208-g28mm-worker-c-sw428
> Feb 23 14:47:40.148: INFO: Pod for on the node: startup-b78f504b-237f-4758-9d3e-a89ce75ff8ea, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.148: INFO: Pod for on the node: pod-init-991109f2-3e8d-45f4-93d0-b1d59d834c23, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.148: INFO: Pod for on the node: agnhost-primary-vknxs, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.148: INFO: Pod for on the node: busybox-readonly-fs8c25040f-a95a-4c95-ab00-1c4b8a16bf67, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.148: INFO: Pod for on the node: c884b330-bf23-4f22-8086-835d75e71028-0, Cpu: 747, Mem: 3180331008
> Feb 23 14:47:40.148: INFO: Pod for on the node: client-can-connect-81-fd6ph, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.148: INFO: Pod for on the node: server-7fx9g, Cpu: 200, Mem: 419430400
> Feb 23 14:47:40.148: INFO: Pod for on the node: netserver-1, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.148: INFO: Pod for on the node: test-container-pod, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.148: INFO: Pod for on the node: alpine-nnp-nil-bef53aa2-5554-4f0e-9de5-fae83135f91f, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.148: INFO: Pod for on the node: example-1-deploy, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.148: INFO: Pod for on the node: history-limit-2-deploy, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.148: INFO: Pod for on the node: custom-builder-image-1-build, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.148: INFO: Pod for on the node: sample-custom-build-1-build, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.148: INFO: Pod for on the node: gcp-pd-csi-driver-node-49jdt, Cpu: 30, Mem: 157286400
> Feb 23 14:47:40.148: INFO: Pod for on the node: tuned-c2c5t, Cpu: 10, Mem: 52428800
> Feb 23 14:47:40.148: INFO: Pod for on the node: dns-default-mfxcx, Cpu: 65, Mem: 137363456
> Feb 23 14:47:40.148: INFO: Pod for on the node: node-ca-6z27x, Cpu: 10, Mem: 10485760
> Feb 23 14:47:40.148: INFO: Pod for on the node: ingress-canary-6cmx4, Cpu: 10, Mem: 20971520
> Feb 23 14:47:40.148: INFO: Pod for on the node: machine-config-daemon-6xd7h, Cpu: 40, Mem: 104857600
> Feb 23 14:47:40.148: INFO: Pod for on the node: node-exporter-q5pbd, Cpu: 9, Mem: 220200960
> Feb 23 14:47:40.148: INFO: Pod for on the node: multus-hzw4g, Cpu: 10, Mem: 157286400
> Feb 23 14:47:40.148: INFO: Pod for on the node: network-metrics-daemon-drp8f, Cpu: 20, Mem: 125829120
> Feb 23 14:47:40.148: INFO: Pod for on the node: network-check-target-zwx95, Cpu: 10, Mem: 15728640
> Feb 23 14:47:40.148: INFO: Pod for on the node: ovs-2lsjd, Cpu: 15, Mem: 419430400
> Feb 23 14:47:40.148: INFO: Pod for on the node: sdn-tgkxh, Cpu: 110, Mem: 230686720
> Feb 23 14:47:40.148: INFO: Node: ci-op-cvr5bfr2-df208-g28mm-worker-c-sw428, totalRequestedCPUResource: 1286, cpuAllocatableMil: 3500, cpuFraction: 0.36742857142857144
> Feb 23 14:47:40.148: INFO: Node: ci-op-cvr5bfr2-df208-g28mm-worker-c-sw428, totalRequestedMemResource: 5147459584, memAllocatableVal: 14568333312, memFraction: 0.353332084992867
> STEP: Compute Cpu, Mem Fraction after create balanced pods.
> Feb 23 14:47:40.148: INFO: ComputeCPUMemFraction for node: ci-op-cvr5bfr2-df208-g28mm-worker-d-qp78t
> Feb 23 14:47:40.376: INFO: Pod for on the node: dns-test-588d38bf-13a5-4f7d-b8b7-6fd0a8b65494, Cpu: 300, Mem: 629145600
> Feb 23 14:47:40.376: INFO: Pod for on the node: labelsupdate1563eafb-212d-4d40-aaaa-c7068b7ccf62, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.376: INFO: Pod for on the node: acabda1b-0083-4a80-a1cd-0a4c10cc5949-0, Cpu: 430, Mem: 513802239
> Feb 23 14:47:40.376: INFO: Pod for on the node: netserver-2, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.376: INFO: Pod for on the node: nosrc-build-1-build, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.376: INFO: Pod for on the node: readiness-1-deploy, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.376: INFO: Pod for on the node: readiness-1-ns69h, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.376: INFO: Pod for on the node: example-1-g58rb, Cpu: 200, Mem: 419430400
> Feb 23 14:47:40.376: INFO: Pod for on the node: history-limit-1-deploy, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.376: INFO: Pod for on the node: append-test, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.376: INFO: Pod for on the node: bc-docker-1-build, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.376: INFO: Pod for on the node: bc-source-1-build, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.376: INFO: Pod for on the node: test-oauth-server, Cpu: 10, Mem: 52428800
> Feb 23 14:47:40.376: INFO: Pod for on the node: execpod, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.376: INFO: Pod for on the node: hostexec-ci-op-cvr5bfr2-df208-g28mm-worker-d-qp78t-hmrth, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.376: INFO: Pod for on the node: local-injector, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.376: INFO: Pod for on the node: gcp-pd-csi-driver-node-gdmg2, Cpu: 30, Mem: 157286400
> Feb 23 14:47:40.376: INFO: Pod for on the node: tuned-l5b4b, Cpu: 10, Mem: 52428800
> Feb 23 14:47:40.376: INFO: Pod for on the node: dns-default-2djwv, Cpu: 65, Mem: 137363456
> Feb 23 14:47:40.376: INFO: Pod for on the node: image-registry-5d7cbc6796-47p55, Cpu: 100, Mem: 268435456
> Feb 23 14:47:40.376: INFO: Pod for on the node: node-ca-swk58, Cpu: 10, Mem: 10485760
> Feb 23 14:47:40.376: INFO: Pod for on the node: ingress-canary-qhzf6, Cpu: 10, Mem: 20971520
> Feb 23 14:47:40.376: INFO: Pod for on the node: router-default-58bb79bdb8-zs7s6, Cpu: 100, Mem: 268435456
> Feb 23 14:47:40.376: INFO: Pod for on the node: machine-config-daemon-dplfm, Cpu: 40, Mem: 104857600
> Feb 23 14:47:40.376: INFO: Pod for on the node: alertmanager-main-0, Cpu: 8, Mem: 283115520
> Feb 23 14:47:40.376: INFO: Pod for on the node: alertmanager-main-2, Cpu: 8, Mem: 283115520
> Feb 23 14:47:40.376: INFO: Pod for on the node: grafana-5b8f5b6d96-gwb98, Cpu: 5, Mem: 125829120
> Feb 23 14:47:40.376: INFO: Pod for on the node: node-exporter-6pw82, Cpu: 9, Mem: 220200960
> Feb 23 14:47:40.376: INFO: Pod for on the node: prometheus-adapter-5557d74fdf-xj5sq, Cpu: 1, Mem: 26214400
> Feb 23 14:47:40.376: INFO: Pod for on the node: prometheus-k8s-0, Cpu: 76, Mem: 1262485504
> Feb 23 14:47:40.376: INFO: Pod for on the node: thanos-querier-57564f89f7-xvh4z, Cpu: 9, Mem: 96468992
> Feb 23 14:47:40.376: INFO: Pod for on the node: multus-d76x9, Cpu: 10, Mem: 157286400
> Feb 23 14:47:40.376: INFO: Pod for on the node: network-metrics-daemon-nv4nm, Cpu: 20, Mem: 125829120
> Feb 23 14:47:40.376: INFO: Pod for on the node: network-check-target-rz8wn, Cpu: 10, Mem: 15728640
> Feb 23 14:47:40.376: INFO: Pod for on the node: ovs-rxjl5, Cpu: 15, Mem: 419430400
> Feb 23 14:47:40.376: INFO: Pod for on the node: sdn-8q7d2, Cpu: 110, Mem: 230686720
> Feb 23 14:47:40.376: INFO: Node: ci-op-cvr5bfr2-df208-g28mm-worker-d-qp78t, totalRequestedCPUResource: 1186, cpuAllocatableMil: 3500, cpuFraction: 0.33885714285714286
> Feb 23 14:47:40.376: INFO: Node: ci-op-cvr5bfr2-df208-g28mm-worker-d-qp78t, totalRequestedMemResource: 4937744383, memAllocatableVal: 14568333312, memFraction: 0.3389368074749332

You can see that the set of pods on each node differs between the two snapshots. I think these tests would benefit from being run serially (or being removed), because they are unpredictable in high-usage clusters like ours.
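
To make the churn easier to see, here is a rough sketch that diffs two snapshots of the "Pod for on the node" lines quoted above. It is only an illustration, not part of the e2e framework: the helper names (pods_in, churn, fraction) are hypothetical, the sample text is a four-line excerpt of the worker-c lists, and the units (CPU in millicores, memory in bytes) are my reading of the log rather than something it states.

```python
import re

# Matches the per-pod lines printed in the logs quoted above.
POD_LINE = re.compile(r"Pod for on the node: ([^,]+), Cpu: (\d+), Mem: (\d+)")


def pods_in(log_text):
    """Return {pod name: (cpu, mem)} for one node snapshot."""
    return {name: (int(cpu), int(mem))
            for name, cpu, mem in POD_LINE.findall(log_text)}


def churn(before_log, after_log):
    """Pods that disappeared and pods that appeared between two snapshots."""
    before, after = pods_in(before_log), pods_in(after_log)
    gone = sorted(before.keys() - after.keys())
    new = sorted(after.keys() - before.keys())
    return gone, new


def fraction(requested, allocatable):
    """Same ratio as the cpuFraction / memFraction values in the log."""
    return requested / allocatable


# Excerpt of the worker-c snapshot taken before balancing.
BEFORE = """\
> Feb 23 14:47:13.028: INFO: Pod for on the node: startup-b78f504b-237f-4758-9d3e-a89ce75ff8ea, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.028: INFO: Pod for on the node: simpletest.rc-4fcgf, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.028: INFO: Pod for on the node: gluster-server, Cpu: 100, Mem: 209715200
> Feb 23 14:47:13.028: INFO: Pod for on the node: server-7fx9g, Cpu: 200, Mem: 419430400
"""

# Excerpt of the worker-c snapshot taken after balancing.
AFTER = """\
> Feb 23 14:47:40.148: INFO: Pod for on the node: startup-b78f504b-237f-4758-9d3e-a89ce75ff8ea, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.148: INFO: Pod for on the node: pod-init-991109f2-3e8d-45f4-93d0-b1d59d834c23, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.148: INFO: Pod for on the node: agnhost-primary-vknxs, Cpu: 100, Mem: 209715200
> Feb 23 14:47:40.148: INFO: Pod for on the node: server-7fx9g, Cpu: 200, Mem: 419430400
"""

if __name__ == "__main__":
    gone, new = churn(BEFORE, AFTER)
    print("gone:", gone)
    print("new: ", new)
    # worker-b before balancing: totalRequestedCPUResource / cpuAllocatableMil
    print("cpuFraction:", fraction(828, 3500))
```

Even in this short excerpt, two pods from other tests left the node and two new ones arrived within the roughly 27 seconds between the snapshots, so the balancing step was computed against a state that no longer held.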

Comment 4 Mike Dame 2021-05-17 15:06:23 UTC

Moving to MODIFIED, as all the linked PRs have either merged or been closed in favor of other PRs that have since merged.

Comment 6 RamaKasturi 2021-05-20 07:33:03 UTC

Hello Mike,

  I tried verifying the bug and do not see any failures or flakes on 4.8 clusters, but the search results in [1] show that the test has been flaking consistently on 4.7. Is this expected? Thanks!

[1] https://search.ci.openshift.org/?search=Multi-AZ+Clusters+should+spread+the+pods+of+a+replication+controller+across+zones&maxAge=48h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 7 Mike Dame 2021-05-20 13:49:00 UTC

Okay, thanks for checking. It looks like the changes we merged need to be backported to 4.7 then (I wasn't sure if they already had been). I'll open those PRs and link them to this bug.

Comment 8 RamaKasturi 2021-05-20 16:16:45 UTC

I do not see any failures in the 4.8 runs, but the test still fails on 4.7, so I am moving the bug back to the ASSIGNED state.

Comment 9 RamaKasturi 2021-09-06 09:05:21 UTC

Hello Mike,

   I checked the bug again in the following test runs, and the test is still listed as flaky in all of the 4.7 runs. Looking into the details, the log says 'passed' with nil details in [1]..[6], and in one run it failed with the error shown at [7].

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-ovn-upgrade/1434744704989138944
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-upgrade-from-stable-4.6-e2e-aws-upgrade/1434735807087775744
[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.7-e2e-gcp/1434735817128939520
[4] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upi-4.7/1434735807347822592
[5] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-e2e-gcp/1434625145909022720
[6] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-e2e-gcp/1434267140524871680
[7] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/719/pull-ci-openshift-ovn-kubernetes-release-4.7-e2e-gcp-ovn/1434605526221590528

Thanks
kasturi

Comment 11 Devan Goodwin 2022-03-07 18:15:45 UTC

This only appears in 4.7 tests, and at a very low rate. I propose closing this bug; from the TRT perspective it is not a priority.

Comment 12 Maciej Szulik 2022-05-25 11:00:59 UTC

Given the priority and the current time frame, I don't think we'll be able to address this issue in 4.7.