It looks like some of the tests on Azure are failing because of zone conflicts:

"message": "Disk /subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-2fyk0cvj-52597-kmkll-rg/providers/Microsoft.Compute/disks/e2e-e1220d04-a4bb-4e5c-8866-92b03686120c cannot be attached to the VM because it is not in the same zone as the VM. VM zone: '2'. Disk zone: '1'."

Test run: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/301/pull-ci-openshift-cluster-etcd-operator-master-e2e-azure/1237
This is creating a lot of noise in our CI runs; I'd like to see this prioritized and backported to 4.4.
This seems like the major reason currently. I notice this error in more than one test run:

Apr 13 16:28:38.349: INFO: Successfully deleted PD "/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-2fyk0cvj-52597-kmkll-rg/providers/Microsoft.Compute/disks/e2e-e1220d04-a4bb-4e5c-8866-92b03686120c".
Apr 13 16:28:38.349: INFO: In-tree plugin kubernetes.io/azure-disk is not migrated, not validating any metrics
[AfterEach] [Testpattern: Inline-volume (ext4)] volumes
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:179
STEP: Collecting events from namespace "e2e-volume-2319".
STEP: Found 4 events.
Apr 13 16:28:38.401: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for exec-volume-test-inlinevolume-trkh: {default-scheduler } Scheduled: Successfully assigned e2e-volume-2319/exec-volume-test-inlinevolume-trkh to ci-op-2fyk0cvj-52597-kmkll-worker-eastus22-5lwt7
Apr 13 16:28:38.401: INFO: At 2020-04-13 16:20:59 +0000 UTC - event for exec-volume-test-inlinevolume-trkh: {attachdetach-controller } FailedAttachVolume: AttachVolume.Attach failed for volume "vol1" : Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: { "error": { "code": "BadRequest", "message": "Disk /subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-2fyk0cvj-52597-kmkll-rg/providers/Microsoft.Compute/disks/e2e-e1220d04-a4bb-4e5c-8866-92b03686120c cannot be attached to the VM because it is not in the same zone as the VM. VM zone: '2'. Disk zone: '1'." } }
Apr 13 16:28:38.401: INFO: At 2020-04-13 16:22:54 +0000 UTC - event for exec-volume-test-inlinevolume-trkh: {kubelet ci-op-2fyk0cvj-52597-kmkll-worker-eastus22-5lwt7} FailedMount: Unable to attach or mount volumes: unmounted volumes=[vol1], unattached volumes=[vol1 default-token-r2tph]: timed out waiting for the condition
Apr 13 16:28:38.401: INFO: At 2020-04-13 16:27:24 +0000 UTC - event for exec-volume-test-inlinevolume-trkh: {kubelet ci-op-2fyk0cvj-52597-kmkll-worker-eastus22-5lwt7} FailedMount: Unable to attach or mount volumes: unmounted volumes=[default-token-r2tph vol1], unattached volumes=[default-token-r2tph vol1]: timed out waiting for the condition
Apr 13 16:28:38.415: INFO: POD NODE PHASE GRACE CONDITIONS
Apr 13 16:28:38.415: INFO:
Apr 13 16:28:38.437: INFO: skipping dumping cluster info - cluster too large
Apr 13 16:28:38.437: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-volume-2319" for this suite.
Apr 13 16:28:38.476: INFO: Running AfterSuite actions on all nodes
Apr 13 16:28:38.476: INFO: Running AfterSuite actions on node 1
fail [k8s.io/kubernetes/test/e2e/framework/util.go:798]: Unexpected error:
    <*errors.errorString | 0xc0026cde20>: {
        s: "expected pod \"exec-volume-test-inlinevolume-trkh\" success: Gave up after waiting 5m0s for pod \"exec-volume-test-inlinevolume-trkh\" to be \"Succeeded or Failed\"",
    }
    expected pod "exec-volume-test-inlinevolume-trkh" success: Gave up after waiting 5m0s for pod "exec-volume-test-inlinevolume-trkh" to be "Succeeded or Failed"
occurred
Given that these failures are coming from inline volumes, I think what is happening is that the disk provisioned outside of e2e is not in the zone selected for compute. These tests were previously disabled in https://bugzilla.redhat.com/show_bug.cgi?id=1723603 via https://github.com/openshift/origin/blob/cf923545a180bbe4bfd03db7d7fc01a2bf9ff23d/test/extended/util/test.go#L445, but it looks like they have been enabled again and are failing. I would like us to run these tests, so I wonder whether the cloud-config the test is using (not the *cluster's*) has the zone parameter set.
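To sanity-check this, it helps to see which zone each worker actually landed in. A minimal client-go sketch, not part of the test suite (the kubeconfig path is a placeholder, and the beta zone label is the one populated on clusters of this era):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from a local kubeconfig (path is a placeholder).
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		// Print each node alongside its zone label.
		fmt.Printf("%s\t%s\n", n.Name, n.Labels["failure-domain.beta.kubernetes.io/zone"])
	}
}
```

If the workers are spread across zones while the test disk is always provisioned into one fixed zone from the cloud-config, the attach will fail whenever the pod lands in a node in a different zone, which matches the "VM zone: '2'. Disk zone: '1'" error above.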
Also, the reason they broke after the rebase is that in 1.18 the driver name in the test was changed from "azure" to "azure-disk", and the current regexp that skips these tests no longer matches the string - https://github.com/openshift/origin/blob/master/test/extended/util/annotate/rules.go#L142
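For illustration, this is roughly why the rename slips past the skip rule; the pattern below is a hypothetical stand-in for the one in rules.go:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Hypothetical skip pattern in the shape the annotate rules use: it
	// pins the pre-1.18 driver name, closing bracket included.
	skip := regexp.MustCompile(`\[Driver: azure\]`)

	oldName := "[sig-storage] In-tree Volumes [Driver: azure] [Testpattern: Inline-volume (ext4)] volumes"
	newName := "[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Inline-volume (ext4)] volumes"

	fmt.Println(skip.MatchString(oldName)) // true: the old name was skipped
	fmt.Println(skip.MatchString(newName)) // false: 1.18's "azure-disk" slips past the skip rule
}
```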
There already is bug #1723603 to fix Azure tests. It seems some work has been done there (they were failing 100% of the time), but some tests are still flaky. Should we fix the regexp here and leave #1723603 for the zonal work?
Upstream fix so that the tests no longer require zone configuration - https://github.com/kubernetes/kubernetes/pull/90147
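I have not inlined the upstream diff here, but the general shape of such a fix is to derive the test disk's zone from an existing node's topology labels rather than from the e2e cloud-config; a rough, hypothetical sketch (the helper name and mechanics are mine, not the PR's):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// zoneFromNode derives the zone a test disk should be created in from an
// existing node's topology labels, so the e2e run does not need a zone in
// its cloud-config. Hypothetical helper; not taken from the upstream PR.
func zoneFromNode(node *v1.Node) string {
	// On 4.4/4.5-era clusters the beta label is the one populated.
	if zone, ok := node.Labels["failure-domain.beta.kubernetes.io/zone"]; ok && zone != "" {
		return zone
	}
	return node.Labels["topology.kubernetes.io/zone"]
}

func main() {
	n := &v1.Node{}
	n.Labels = map[string]string{"failure-domain.beta.kubernetes.io/zone": "eastus-2"}
	fmt.Println(zoneFromNode(n)) // eastus-2
}
```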
PR for OCP - https://github.com/openshift/origin/pull/24900
Checked some e2e logs from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/915; this issue is no longer seen. Marking this bug as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409