Bug 1806594
| Summary: | Test flake: [sig-scheduling] Multi-AZ Clusters should spread the pods of a service across zones | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Dan Mace <dmace> |
| Component: | kube-scheduler | Assignee: | Mike Dame <mdame> |
| Status: | CLOSED ERRATA | QA Contact: | RamaKasturi <knarra> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.3.0 | CC: | aos-bugs, bluddy, carangog, ccoleman, cdaley, jhou, maszulik, mfojtik, nmoraiti, ssoto, vrutkovs, wking, yinzhou |
| Target Milestone: | --- | Keywords: | UpcomingSprint |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1814359 1814360 (view as bug list) | Environment: | |
| Last Closed: | 2020-10-27 15:55:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1814359, 1814360, 1814363 | | |
Description (Dan Mace, 2020-02-24 15:23:40 UTC)
This issue caused a 4.5 CI build failure today: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.5/645

This fails 1 in 4 runs on AWS, because AWS only provides two zones. Workaround in 4.5 (should be backported to at least 4.3 to clean up flakes): https://github.com/openshift/origin/pull/24709

The upstream issue, https://github.com/kubernetes/kubernetes/issues/89178, is that the test itself is wrong (it assumes all zones of all nodes are schedulable), but fixing that requires significant rework of the upstream tests, so a proper fix won't be available anytime soon. The tests need to take as input the set of nodes that may be considered for the test.

*** Bug 1760193 has been marked as a duplicate of this bug. ***
*** Bug 1819961 has been marked as a duplicate of this bug. ***
*** Bug 1831844 has been marked as a duplicate of this bug. ***

[buildcop] This is the most flaky test on GCP on 4.3 (https://testgrid.k8s.io/redhat-openshift-ocp-release-4.3-blocking#release-openshift-origin-installer-e2e-gcp-4.3&sort-by-flakiness=), so yes, we should backport the workaround to 4.3 if possible.

There's still something we don't understand failing in the test. I was hoping the other changes to the test would make that clearer (for example, a machine dies), in which case the test isn't flaky; our product is.
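The proposed rework (letting the test take the eligible nodes as input instead of assuming every zone of every node can receive pods) can be sketched as follows. This is a minimal, illustrative model only: the real e2e tests use `k8s.io/api/core/v1.Node` and the `topology.kubernetes.io/zone` label, and the `Node` type and function names here are assumptions, not upstream code.

```go
package main

import "fmt"

// Node is a minimal stand-in for a cluster node; the real tests read
// node.Spec.Unschedulable and the topology.kubernetes.io/zone label.
type Node struct {
	Zone          string // zone label value
	Unschedulable bool   // true if the node is cordoned
}

// schedulableZones returns the zones that have at least one schedulable
// node. This is the input the upstream test is missing: it currently
// assumes every zone of every node can receive pods.
func schedulableZones(nodes []Node) map[string]bool {
	zones := map[string]bool{}
	for _, n := range nodes {
		if !n.Unschedulable {
			zones[n.Zone] = true
		}
	}
	return zones
}

func main() {
	nodes := []Node{
		{Zone: "us-east1-b"},
		{Zone: "us-east1-c"},
		{Zone: "us-east1-d", Unschedulable: true}, // e.g. cordoned mid-test
	}
	// Only two zones should be considered when checking the spread.
	fmt.Println(len(schedulableZones(nodes))) // prints 2
}
```

A spread check driven by this set would not flake when a zone is temporarily unschedulable, which is the failure mode described above.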
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.3/1871

May 20 18:25:51.853: INFO: At 2020-05-20 18:25:06 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-4cj6w: {kubelet ci-op-w7csw-w-d-dnl9q.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb
May 20 18:25:51.853: INFO: At 2020-05-20 18:25:06 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-4sdhn: {kubelet ci-op-w7csw-w-b-2sqvx.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb
May 20 18:25:51.853: INFO: At 2020-05-20 18:25:06 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-c554c: {kubelet ci-op-w7csw-w-b-2sqvx.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb
May 20 18:25:51.853: INFO: At 2020-05-20 18:25:06 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-fdqdz: {kubelet ci-op-w7csw-w-c-8dq7b.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb
May 20 18:25:51.853: INFO: At 2020-05-20 18:25:06 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-mkrw6: {kubelet ci-op-w7csw-w-b-2sqvx.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb
May 20 18:25:51.853: INFO: At 2020-05-20 18:25:07 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-bw4j4: {kubelet ci-op-w7csw-w-d-dnl9q.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb
May 20 18:25:51.853: INFO: At 2020-05-20 18:25:07 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-t5vxn: {kubelet ci-op-w7csw-w-d-dnl9q.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb

7 pods, 3 zones; the end result is 3 pods on b, 2 pods on d, and 1 pod on c. So this is scheduling; it could be a race condition. Not the same as my change.

This still requires further investigation, and as such won't make the 4.5 cut; moving to 4.6.

*** Bug 1852903 has been marked as a duplicate of this bug. ***
*** Bug 1856341 has been marked as a duplicate of this bug. ***
*** Bug 1859900 has been marked as a duplicate of this bug. ***

Including: [sig-scheduling] Multi-AZ Cluster Volumes [sig-storage] should schedule pods in the same zones as statically provisioned PVs from bug 1859900, so Sippy will associate this bug with failures from that test case.

I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug in a future sprint.

I'm adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint. However, I'm not seeing many recent failures in 4.6: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-blocking#release-openshift-origin-installer-e2e-gcp-4.6&sort-by-flakiness (though I do see 2 failures in the past week for the similar test listed in https://bugzilla.redhat.com/show_bug.cgi?id=1806594#c16, and in the 4.3 links shared above: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.3-blocking#release-openshift-origin-installer-e2e-gcp-4.3). So this flake may have fallen below the significance threshold. Can anyone confirm whether this is still occurring frequently?

Based on the previous comment, and the fact that we landed Kubernetes 1.19 recently, I'm moving this to QA for verification.
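For reference, the placement reported above (3 pods on b, 2 on d, 1 on c) fails even the simplest balance check: an even spread of 7 pods over 3 zones is 3/2/2, a maximum per-zone difference of 1, while the observed spread has a difference of 2. The sketch below computes that skew; it is illustrative only and is not the upstream test's exact formula (the real test also normalizes by node counts per zone).

```go
package main

import "fmt"

// maxSkew returns the difference between the most and least loaded zones,
// a simplified model of the balance the spread test expects.
func maxSkew(podsPerZone map[string]int) int {
	first := true
	lo, hi := 0, 0
	for _, n := range podsPerZone {
		if first {
			lo, hi = n, n
			first = false
			continue
		}
		if n < lo {
			lo = n
		}
		if n > hi {
			hi = n
		}
	}
	return hi - lo
}

func main() {
	// The failed run above: 7 pods ended up 3 on b, 1 on c, 2 on d.
	observed := map[string]int{"b": 3, "c": 1, "d": 2}
	fmt.Println(maxSkew(observed)) // prints 2; an even 3/2/2 spread gives 1
}
```

A skew of 2 with only 7 pods is exactly the kind of marginal imbalance that could come from a scheduling race rather than a broken scheduler, which is why the comment above calls for further investigation.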
Can't find the issue in https://testgrid.k8s.io/redhat-openshift-ocp-release-4.3-blocking#release-openshift-origin-installer-e2e-gcp-4.3.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196