Description of problem:

Top flake in the 4.3 blocking job grid:

[sig-scheduling] Multi-AZ Clusters should spread the pods of a service across zones [Suite:openshift/conformance/parallel] [Suite:k8s]

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.3-blocking#release-openshift-origin-installer-e2e-gcp-4.3
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.3/1434
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.3/1452

Not observed recently in 4.1, 4.2, or 4.4 branches. Hard to tell if that's significant.

This is not to be confused with https://bugzilla.redhat.com/show_bug.cgi?id=1760193, which is a similar test but for replication controllers.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
This issue caused a 4.5 CI build failure today: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-serial-4.5/645
This fails 1/4 runs on AWS because AWS only does two zones.

Workaround in 4.5 (should be backported to at least 4.3 to clean up the flakes): https://github.com/openshift/origin/pull/24709

The upstream issue, https://github.com/kubernetes/kubernetes/issues/89178, is that the test is wrong (it assumes all zones of all nodes are schedulable), but fixing it requires significant rework of the upstream tests, so that won't be available anytime soon. The tests need to take as input the set of nodes that can be considered for the test.
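For illustration only (this is not the code in the PR above), here is a minimal Go sketch of what "taking the candidate nodes as input" could look like: count only ready, uncordoned nodes per zone and skip the spread assertion when fewer than two zones remain. The helper name schedulableZones and the use of plain client-go are my own assumptions, not the upstream test's structure.

package main

// Sketch only: filter the cluster's nodes down to the set a zone-spread
// test should actually consider, instead of assuming every zone of every
// node is schedulable.

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// schedulableZones counts ready, uncordoned nodes per zone.
func schedulableZones(ctx context.Context, c kubernetes.Interface) (map[string]int, error) {
	nodes, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	zones := map[string]int{}
	for _, n := range nodes.Items {
		if n.Spec.Unschedulable {
			continue // cordoned nodes can't take test pods
		}
		ready := false
		for _, cond := range n.Status.Conditions {
			if cond.Type == v1.NodeReady && cond.Status == v1.ConditionTrue {
				ready = true
			}
		}
		if !ready {
			continue
		}
		// Prefer the GA topology label, fall back to the legacy one.
		zone := n.Labels["topology.kubernetes.io/zone"]
		if zone == "" {
			zone = n.Labels["failure-domain.beta.kubernetes.io/zone"]
		}
		if zone != "" {
			zones[zone]++
		}
	}
	return zones, nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	zones, err := schedulableZones(context.Background(), kubernetes.NewForConfigOrDie(cfg))
	if err != nil {
		panic(err)
	}
	fmt.Printf("schedulable zones: %v\n", zones)
	if len(zones) < 2 {
		fmt.Println("fewer than two schedulable zones; a zone-spread test should skip rather than fail")
	}
}

The point is only that the spread check should be computed over nodes the scheduler can actually use, rather than over every node in the cluster.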
*** Bug 1760193 has been marked as a duplicate of this bug. ***
*** Bug 1819961 has been marked as a duplicate of this bug. ***
*** Bug 1831844 has been marked as a duplicate of this bug. ***
[buildcop] Seeing that this is the most flaky test on GCP on 4.3: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.3-blocking#release-openshift-origin-installer-e2e-gcp-4.3&sort-by-flakiness= so yes, we should backport it to 4.3 if possible.
There's still something we don't understand failing in this test. I was hoping the other changes to the test would make that clearer (e.g. a machine dies), in which case the test isn't flaky, our product is.
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.3/1871

May 20 18:25:51.853: INFO: At 2020-05-20 18:25:06 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-4cj6w: {kubelet ci-op-w7csw-w-d-dnl9q.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb
May 20 18:25:51.853: INFO: At 2020-05-20 18:25:06 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-4sdhn: {kubelet ci-op-w7csw-w-b-2sqvx.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb
May 20 18:25:51.853: INFO: At 2020-05-20 18:25:06 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-c554c: {kubelet ci-op-w7csw-w-b-2sqvx.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb
May 20 18:25:51.853: INFO: At 2020-05-20 18:25:06 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-fdqdz: {kubelet ci-op-w7csw-w-c-8dq7b.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb
May 20 18:25:51.853: INFO: At 2020-05-20 18:25:06 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-mkrw6: {kubelet ci-op-w7csw-w-b-2sqvx.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb
May 20 18:25:51.853: INFO: At 2020-05-20 18:25:07 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-bw4j4: {kubelet ci-op-w7csw-w-d-dnl9q.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb
May 20 18:25:51.853: INFO: At 2020-05-20 18:25:07 +0000 UTC - event for ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb-t5vxn: {kubelet ci-op-w7csw-w-d-dnl9q.c.openshift-gce-devel-ci.internal} Killing: Stopping container ubelite-spread-rc-6304aa5e-6754-4515-85b7-58e31af48ffb

7 pods, 3 zones, end result is 3 pods on b, 2 pods on d, and 1 pod on c. So this is scheduling - could be a race condition. Not the same as my change.
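To make the failure concrete, here is a minimal sketch of a max-vs-min zone-skew check of the sort the e2e test performs; the tolerance of 1 is my assumption about the "evenly spread" criterion, not a quote of the upstream assertion. Under that assumption a 3/2/1 distribution has a skew of 2 and fails, even though every zone received at least one pod.

package main

// Sketch only: an "even spread" check over pods-per-zone counts.
// The tolerance of 1 is an assumption, not the literal upstream assertion.

import "fmt"

// zoneSkew returns the gap between the most- and least-loaded zones.
func zoneSkew(podsPerZone map[string]int) int {
	if len(podsPerZone) == 0 {
		return 0
	}
	min, max := -1, 0
	for _, n := range podsPerZone {
		if min == -1 || n < min {
			min = n
		}
		if n > max {
			max = n
		}
	}
	return max - min
}

func main() {
	// Distribution from the run above (zones b, d, c).
	observed := map[string]int{"b": 3, "d": 2, "c": 1}
	if skew := zoneSkew(observed); skew > 1 {
		fmt.Printf("spread check fails: skew %d > 1 for %v\n", skew, observed)
	} else {
		fmt.Printf("spread check passes for %v\n", observed)
	}
}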
This still requires further investigation and as such won't make the 4.5 cut; moving to 4.6.
*** Bug 1852903 has been marked as a duplicate of this bug. ***
*** Bug 1856341 has been marked as a duplicate of this bug. ***
*** Bug 1859900 has been marked as a duplicate of this bug. ***
Including "[sig-scheduling] Multi-AZ Cluster Volumes [sig-storage] should schedule pods in the same zones as statically provisioned PVs" from bug 1859900, so Sippy will associate this bug with failures from that test case.
Recent failure in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/928/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-serial/1292742478645956608
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug in a future sprint.
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

However, I'm not seeing many recent failures in 4.6: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-blocking#release-openshift-origin-installer-e2e-gcp-4.6&sort-by-flakiness (though I do see 2 failures in the past week for the similar test listed in https://bugzilla.redhat.com/show_bug.cgi?id=1806594#c16, or in the 4.3 links shared above: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.3-blocking#release-openshift-origin-installer-e2e-gcp-4.3). So this flake may have fallen below the significance threshold. Can anyone confirm whether it is still occurring frequently?
Based on the previous comment and the fact that we recently landed k8s 1.19, I'm moving this to QA for verification.
Can't find the issue in https://testgrid.k8s.io/redhat-openshift-ocp-release-4.3-blocking#release-openshift-origin-installer-e2e-gcp-4.3.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196