Bug 1806594 - Test flake: [sig-scheduling] Multi-AZ Clusters should spread the pods of a service across zones
Summary: Test flake: [sig-scheduling] Multi-AZ Clusters should spread the pods of a service across zones
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Mike Dame
QA Contact: RamaKasturi
URL:
Whiteboard:
Duplicates: 1760193 1819961 1831844 1852903 1856341 1859900
Depends On:
Blocks: 1814359 1814360 1814363
 
Reported: 2020-02-24 15:23 UTC by Dan Mace
Modified: 2020-10-27 15:56 UTC
CC: 13 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1814359 1814360
Environment:
Last Closed: 2020-10-27 15:55:31 UTC
Target Upstream Version:
Embargoed:


Attachments: (none)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 15:56:10 UTC

Description Dan Mace 2020-02-24 15:23:40 UTC
Description of problem:

Top flake in the 4.3 blocking job grid:

[sig-scheduling] Multi-AZ Clusters should spread the pods of a service across zones [Suite:openshift/conformance/parallel] [Suite:k8s]

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.3-blocking#release-openshift-origin-installer-e2e-gcp-4.3

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.3/1434
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.3/1452

Not observed recently in 4.1, 4.2, or 4.4 branches. Hard to tell if that's significant.

This is not to be confused with https://bugzilla.redhat.com/show_bug.cgi?id=1760193, which is a similar test but for replication controllers.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 4 Sebastian Soto 2020-03-16 21:13:28 UTC
This issue caused a 4.5 CI build failure today: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.5/645

Comment 5 Clayton Coleman 2020-03-17 17:27:50 UTC
This fails in roughly 1 of 4 runs on AWS because the AWS clusters only use two zones.

Workaround in 4.5 (should be backported to at least 4.3 to clean up the flakes): https://github.com/openshift/origin/pull/24709

The upstream issue, https://github.com/kubernetes/kubernetes/issues/89178, is that the test is wrong (it assumes the zones of all nodes are schedulable), but fixing it requires significant rework of the upstream tests, so a proper fix won't be available anytime soon. The tests need to take as input the set of nodes that can be considered for the test.
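
To make that direction concrete, here is a minimal sketch (in Go, against the k8s.io/api core/v1 types; the helper name and the label constant are illustrative and not the actual upstream test code) of restricting the spread check to zones that contain at least one schedulable, Ready node:

// Sketch: compute the set of zones the spread test should consider,
// instead of assuming every node's zone can receive pods.
package e2esketch

import (
	v1 "k8s.io/api/core/v1"
)

// Well-known zone label; older clusters use
// "failure-domain.beta.kubernetes.io/zone" instead.
const zoneLabel = "topology.kubernetes.io/zone"

// schedulableZones returns the zones that have at least one node which is
// Ready and not marked unschedulable.
func schedulableZones(nodes []v1.Node) map[string]bool {
	zones := map[string]bool{}
	for _, node := range nodes {
		if node.Spec.Unschedulable {
			continue
		}
		ready := false
		for _, cond := range node.Status.Conditions {
			if cond.Type == v1.NodeReady && cond.Status == v1.ConditionTrue {
				ready = true
				break
			}
		}
		if !ready {
			continue
		}
		if zone, ok := node.Labels[zoneLabel]; ok {
			zones[zone] = true
		}
	}
	return zones
}

The spread assertion would then only count pods in zones returned by such a helper, rather than in every zone any node reports.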

Comment 6 Clayton Coleman 2020-03-17 17:28:45 UTC
*** Bug 1760193 has been marked as a duplicate of this bug. ***

Comment 7 Maciej Szulik 2020-04-03 11:02:32 UTC
*** Bug 1819961 has been marked as a duplicate of this bug. ***

Comment 8 Ben Luddy 2020-05-05 18:24:26 UTC
*** Bug 1831844 has been marked as a duplicate of this bug. ***

Comment 9 Abhinav Dahiya 2020-05-14 20:54:03 UTC
[buildcop] This test is currently the most flaky test on GCP in 4.3:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.3-blocking#release-openshift-origin-installer-e2e-gcp-4.3&sort-by-flakiness=

So yes, we should backport the workaround to 4.3 if possible.

Comment 10 Clayton Coleman 2020-05-21 13:40:54 UTC
There's still something failing in this test that we don't understand. I was hoping the other changes to the test would make the cause clearer (e.g. a machine dying), in which case the test itself isn't flaky; our product is.

Comment 11 Clayton Coleman 2020-05-21 13:49:04 UTC
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.3/1871

May 20 18:25:51.853: INFO: At 2020-05-20 18:25:06 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-4cj6w: {kubelet ci-op-w7csw-w-d-dnl9q.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb
May 20 18:25:51.853: INFO: At 2020-05-20 18:25:06 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-4sdhn: {kubelet ci-op-w7csw-w-b-2sqvx.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb
May 20 18:25:51.853: INFO: At 2020-05-20 18:25:06 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-c554c: {kubelet ci-op-w7csw-w-b-2sqvx.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb
May 20 18:25:51.853: INFO: At 2020-05-20 18:25:06 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-fdqdz: {kubelet ci-op-w7csw-w-c-8dq7b.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb
May 20 18:25:51.853: INFO: At 2020-05-20 18:25:06 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-mkrw6: {kubelet ci-op-w7csw-w-b-2sqvx.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb
May 20 18:25:51.853: INFO: At 2020-05-20 18:25:07 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-bw4j4: {kubelet ci-op-w7csw-w-d-dnl9q.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb
May 20 18:25:51.853: INFO: At 2020-05-20 18:25:07 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-t5vxn: {kubelet ci-op-w7csw-w-d-dnl9q.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb

7 pods, 3 zones, and the end result is 3 pods in zone b, 2 in zone d, and 1 in zone c. So this is a scheduling issue, possibly a race condition, and not the same problem my change addressed.
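
For context, the evenness assertion behind this failure is roughly of the following shape (a sketch, not the exact upstream code): the most- and least-loaded zones may differ by at most one pod. With 7 pods across 3 schedulable zones an even spread is 3/2/2, so the observed 3/2/1 placement has a skew of 2 and fails:

// Sketch of the Multi-AZ spread check: the max and min pod counts per
// zone may differ by at most 1.
package e2esketch

func spreadIsEven(podsPerZone map[string]int) bool {
	if len(podsPerZone) == 0 {
		return false
	}
	min, max := -1, 0
	for _, n := range podsPerZone {
		if min == -1 || n < min {
			min = n
		}
		if n > max {
			max = n
		}
	}
	return max-min <= 1
}

// Example from the events above:
//   spreadIsEven(map[string]int{"b": 3, "d": 2, "c": 1}) // false (skew 2)
//   spreadIsEven(map[string]int{"b": 3, "d": 2, "c": 2}) // true  (skew 1)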

Comment 12 Maciej Szulik 2020-05-25 11:58:54 UTC
This still requires further investigation and as such won't make the 4.5 cut; moving it to 4.6.

Comment 13 Maciej Szulik 2020-07-02 07:51:54 UTC
*** Bug 1852903 has been marked as a duplicate of this bug. ***

Comment 14 Mike Dame 2020-07-23 17:38:00 UTC
*** Bug 1856341 has been marked as a duplicate of this bug. ***

Comment 15 Mike Dame 2020-07-23 17:38:34 UTC
*** Bug 1859900 has been marked as a duplicate of this bug. ***

Comment 16 W. Trevor King 2020-07-24 04:36:51 UTC
Including:

[sig-scheduling] Multi-AZ Cluster Volumes [sig-storage] should schedule pods in the same zones as statically provisioned PVs

from bug 1859900, so Sippy will associate this bug with failures from that test-case.

Comment 18 Mike Dame 2020-08-21 13:28:59 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug in a future sprint.

Comment 19 Mike Dame 2020-09-10 16:09:02 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

However, I'm not seeing many recent failures in 4.6:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-blocking#release-openshift-origin-installer-e2e-gcp-4.6&sort-by-flakiness
(though I do see 2 failures in the past week for the similar test listed in https://bugzilla.redhat.com/show_bug.cgi?id=1806594#c16)

or in the 4.3 links shared above:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.3-blocking#release-openshift-origin-installer-e2e-gcp-4.3

So this flake may have fallen below the significance threshold. Can anyone confirm whether it is still occurring frequently?

Comment 20 Maciej Szulik 2020-09-11 10:05:33 UTC
Based on the previous comment and the fact that we recently landed Kubernetes 1.19, I'm moving this to QA for verification.

Comment 26 errata-xmlrpc 2020-10-27 15:55:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

