It looks like some of the tests on Azure are failing because of zone conflicts:

"message": "Disk /subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-2fyk0cvj-52597-kmkll-rg/providers/Microsoft.Compute/disks/e2e-e1220d04-a4bb-4e5c-8866-92b03686120c cannot be attached to the VM because it is not in the same zone as the VM. VM zone: '2'. Disk zone: '1'."

Test run: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/301/pull-ci-openshift-cluster-etcd-operator-master-e2e-azure/1237
This is creating a lot of noise in our CI runs; I'd like to see this prioritized and backported to 4.4.
This seems like the major reason currently. I notice this error in more than one test run:

Apr 13 16:28:38.349: INFO: Successfully deleted PD "/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-2fyk0cvj-52597-kmkll-rg/providers/Microsoft.Compute/disks/e2e-e1220d04-a4bb-4e5c-8866-92b03686120c".
Apr 13 16:28:38.349: INFO: In-tree plugin kubernetes.io/azure-disk is not migrated, not validating any metrics
[AfterEach] [Testpattern: Inline-volume (ext4)] volumes
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:179
STEP: Collecting events from namespace "e2e-volume-2319".
STEP: Found 4 events.
Apr 13 16:28:38.401: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for exec-volume-test-inlinevolume-trkh: {default-scheduler } Scheduled: Successfully assigned e2e-volume-2319/exec-volume-test-inlinevolume-trkh to ci-op-2fyk0cvj-52597-kmkll-worker-eastus22-5lwt7
Apr 13 16:28:38.401: INFO: At 2020-04-13 16:20:59 +0000 UTC - event for exec-volume-test-inlinevolume-trkh: {attachdetach-controller } FailedAttachVolume: AttachVolume.Attach failed for volume "vol1" : Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: { "error": { "code": "BadRequest", "message": "Disk /subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-2fyk0cvj-52597-kmkll-rg/providers/Microsoft.Compute/disks/e2e-e1220d04-a4bb-4e5c-8866-92b03686120c cannot be attached to the VM because it is not in the same zone as the VM. VM zone: '2'. Disk zone: '1'." } }
Apr 13 16:28:38.401: INFO: At 2020-04-13 16:22:54 +0000 UTC - event for exec-volume-test-inlinevolume-trkh: {kubelet ci-op-2fyk0cvj-52597-kmkll-worker-eastus22-5lwt7} FailedMount: Unable to attach or mount volumes: unmounted volumes=[vol1], unattached volumes=[vol1 default-token-r2tph]: timed out waiting for the condition
Apr 13 16:28:38.401: INFO: At 2020-04-13 16:27:24 +0000 UTC - event for exec-volume-test-inlinevolume-trkh: {kubelet ci-op-2fyk0cvj-52597-kmkll-worker-eastus22-5lwt7} FailedMount: Unable to attach or mount volumes: unmounted volumes=[default-token-r2tph vol1], unattached volumes=[default-token-r2tph vol1]: timed out waiting for the condition
Apr 13 16:28:38.415: INFO: POD NODE PHASE GRACE CONDITIONS
Apr 13 16:28:38.415: INFO:
Apr 13 16:28:38.437: INFO: skipping dumping cluster info - cluster too large
Apr 13 16:28:38.437: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-volume-2319" for this suite.
Apr 13 16:28:38.476: INFO: Running AfterSuite actions on all nodes
Apr 13 16:28:38.476: INFO: Running AfterSuite actions on node 1
fail [k8s.io/kubernetes/test/e2e/framework/util.go:798]: Unexpected error:
    <*errors.errorString | 0xc0026cde20>: {
        s: "expected pod \"exec-volume-test-inlinevolume-trkh\" success: Gave up after waiting 5m0s for pod \"exec-volume-test-inlinevolume-trkh\" to be \"Succeeded or Failed\"",
    }
    expected pod "exec-volume-test-inlinevolume-trkh" success: Gave up after waiting 5m0s for pod "exec-volume-test-inlinevolume-trkh" to be "Succeeded or Failed"
occurred
Given that these failures are coming from inline volumes, I think what is happening is that the disk provisioned outside of e2e is not in the zone selected for compute. These tests were previously disabled in https://bugzilla.redhat.com/show_bug.cgi?id=1723603 via https://github.com/openshift/origin/blob/cf923545a180bbe4bfd03db7d7fc01a2bf9ff23d/test/extended/util/test.go#L445, but it looks like they have been enabled again and are failing. I would like us to run these tests, so I wonder whether the cloud-config the test is using (not the *cluster's*) has the zone parameter set.
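To sanity-check this, it helps to see which zone each worker actually landed in. A minimal client-go sketch, not part of the test suite (the kubeconfig path is a placeholder, and the beta zone label is the one populated on clusters of this era):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from a local kubeconfig (path is a placeholder).
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		// Print each node alongside its zone label.
		fmt.Printf("%s\t%s\n", n.Name, n.Labels["failure-domain.beta.kubernetes.io/zone"])
	}
}
```

If the workers are spread across zones while the test disk is always provisioned into one fixed zone from the cloud-config, the attach will fail whenever the pod lands in a node in a different zone, which matches the "VM zone: '2'. Disk zone: '1'" error above.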
Also, the reason they broke after the rebase is that in 1.18 the driver name in the test was changed from "azure" to "azure-disk", and the current regexp that skips these tests no longer matches the string - https://github.com/openshift/origin/blob/master/test/extended/util/annotate/rules.go#L142
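For illustration, this is roughly why the rename slips past the skip rule; the pattern below is a hypothetical stand-in for the one in rules.go:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Hypothetical skip pattern in the shape the annotate rules use: it
	// pins the pre-1.18 driver name, closing bracket included.
	skip := regexp.MustCompile(`\[Driver: azure\]`)

	oldName := "[sig-storage] In-tree Volumes [Driver: azure] [Testpattern: Inline-volume (ext4)] volumes"
	newName := "[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Inline-volume (ext4)] volumes"

	fmt.Println(skip.MatchString(oldName)) // true: the old name was skipped
	fmt.Println(skip.MatchString(newName)) // false: 1.18's "azure-disk" slips past the skip rule
}
```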
There already is bug #1723603 to fix Azure tests. It seems some work has been done there (they were failing 100% of the time), but some tests are still flaky. Should we fix the regexp here and leave #1723603 for the zonal work?
Upstream fix so that the tests no longer require zone configuration - https://github.com/kubernetes/kubernetes/pull/90147
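I have not inlined the upstream diff here, but the general shape of such a fix is to derive the test disk's zone from an existing node's topology labels rather than from the e2e cloud-config; a rough, hypothetical sketch (the helper name and mechanics are mine, not the PR's):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// zoneFromNode derives the zone a test disk should be created in from an
// existing node's topology labels, so the e2e run does not need a zone in
// its cloud-config. Hypothetical helper; not taken from the upstream PR.
func zoneFromNode(node *v1.Node) string {
	// On 4.4/4.5-era clusters the beta label is the one populated.
	if zone, ok := node.Labels["failure-domain.beta.kubernetes.io/zone"]; ok && zone != "" {
		return zone
	}
	return node.Labels["topology.kubernetes.io/zone"]
}

func main() {
	n := &v1.Node{}
	n.Labels = map[string]string{"failure-domain.beta.kubernetes.io/zone": "eastus-2"}
	fmt.Println(zoneFromNode(n)) // eastus-2
}
```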
PR for OCP - https://github.com/openshift/origin/pull/24900
Checked some e2e logs from https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/915; this issue is no longer seen. Marking this bug as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409