Bug 1723603
Summary: Azure storage e2e tests are consistently failing

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Abhinav Dahiya <adahiya> |
| Component: | Storage | Assignee: | Fabio Bertinatto <fbertina> |
| Storage sub component: | Kubernetes | QA Contact: | Wei Duan <wduan> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | agarcial, aos-bugs, aos-storage-staff, brad.ison, fbertina, gblomqui, jchaloup, jsafrane |
| Version: | 3.10.0 | Keywords: | Reopened |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-27 15:54:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Abhinav Dahiya
2019-06-24 23:36:05 UTC
Updating severity to high as this is blocking CI for Azure.

openshift-tests itself looks misconfigured: it cannot pre-create a volume for tests, so all "Pre-provisioned PV" and "Inline-volume" tests fail. This is the code that's failing:

https://github.com/openshift/origin/blob/d7a4539442e59eb8ccd4bdc8aca5eec731dd219d/vendor/k8s.io/kubernetes/test/e2e/framework/providers/azure/azure.go#L65

The accountName, accountType, and location parameters are empty, which is then resolved in EnsureStorageAccount():

https://github.com/openshift/origin/blob/d7a4539442e59eb8ccd4bdc8aca5eec731dd219d/vendor/k8s.io/kubernetes/pkg/cloudprovider/providers/azure/azure_storageaccount.go#L93

How do you run openshift-tests? Does it have all the Azure-specific options / config files / environment variables so the account name + type discovery above can work? On the bright side, the cluster under test looks correctly configured; most tests with dynamically provisioned volumes work.

From the openshift-tests output:

```
Jun 24 19:30:37.629: INFO: Couldn't create a new PD, sleeping 5 seconds: could not get storage key for storage account : could not list storage accounts for account type : azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-dp142r4t-5cef7-2f6dk-rg/providers/Microsoft.Storage/storageAccounts?api-version=2018-07-01: StatusCode=404 -- Original Error: adal: Refresh request failed. Status Code = '404'.
Response body: <!DOCTYPE html> <html lang=en> ... <title>Error 404 (Not Found)!!1</title>
[Google 404 page CSS trimmed]
<p><b>404.</b> <ins>That’s an error.</ins>
<p>The requested URL <code>/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fmanagement.core.windows.net%2F</code> was not found on this server.
<ins>That’s all we know.</ins>
```

The contents of cloud.conf for the Azure tests:

```
aadClientCertPassword: ""
aadClientCertPath: ""
aadClientId: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
aadClientSecret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
cloud: AzurePublicCloud
cloudProviderBackoff: true
cloudProviderBackoffDuration: 6
cloudProviderBackoffExponent: 0
cloudProviderBackoffJitter: 0
cloudProviderBackoffMode: ""
cloudProviderBackoffRetries: 0
cloudProviderRateLimit: true
cloudProviderRateLimitBucket: 10
cloudProviderRateLimitBucketWrite: 10
cloudProviderRateLimitQPS: 6
cloudProviderRateLimitQPSWrite: 6
disableOutboundSNAT: null
excludeMasterFromStandardLB: null
loadBalancerSku: standard
location: centralus
maximumLoadBalancerRuleCount: 0
primaryAvailabilitySetName: ""
primaryScaleSetName: ""
resourceGroup: adahiya-1-zr9dr-rg
routeTableName: adahiya-1-zr9dr-node-routetable
securityGroupName: adahiya-1-zr9dr-node-nsg
subnetName: adahiya-1-zr9dr-node-subnet
subscriptionId: 433715e6-37fe-4328-af75-3661e13b15fc
tenantId: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
useInstanceMetadata: true
useManagedIdentityExtension: true
userAssignedIdentityID: ""
vmType: ""
vnetName: adahiya-1-zr9dr-vnet
vnetResourceGroup: adahiya-1-zr9dr-rg
```

What do you think is missing, jsafrane?
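To make the empty-parameter failure mode concrete, here is a minimal Python sketch of the account-discovery flow described above. This is a hypothetical mirror of the behavior, not the upstream Go code: the function name, dict keys, and error string are illustrative only. The key point is that with an empty accountName the provider has no choice but to list the subscription's storage accounts, and that listing is exactly the API call whose token refresh 404s in the log.

```python
def ensure_storage_account(list_accounts, account_name, account_type, location):
    """Hypothetical mirror of the EnsureStorageAccount() flow: if no account
    name is configured, fall back to listing accounts and filtering by
    type/location (empty filters match everything)."""
    if account_name:
        # An explicitly configured account short-circuits discovery entirely.
        return account_name
    # This is the step that fails in the log above: listing accounts calls the
    # Azure management API, which requires a refreshable credential.
    accounts = list_accounts()
    matches = [a for a in accounts
               if (not account_type or a["type"] == account_type)
               and (not location or a["location"] == location)]
    if not matches:
        raise RuntimeError(
            f"could not find storage account for account type {account_type!r}")
    return matches[0]["name"]


# With empty parameters, discovery runs and picks the first matching account.
accounts = [{"name": "sa1", "type": "Standard_LRS", "location": "centralus"}]
picked = ensure_storage_account(lambda: accounts, "", "", "")
```

So a correctly populated test configuration (explicit account name, or at least working credentials for the listing call) would avoid the failing path entirely.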
test case: `[sig-storage] In-tree Volumes [Driver: azure] [Testpattern: Dynamic PV (default fs)] provisioning should access volume from different nodes [Suite:openshift/conformance/parallel] [Suite:k8s]`

Seems to be failing because:

```
oc get nodes -ojson | jq '.items[].metadata.labels'
{
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/instance-type": "Standard_DS4_v2",
  "beta.kubernetes.io/os": "linux",
  "failure-domain.beta.kubernetes.io/region": "centralus",
  "failure-domain.beta.kubernetes.io/zone": "0",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "adahiya-1-zr9dr-master-0",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/master": "",
  "node.openshift.io/os_id": "rhcos"
}
{
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/instance-type": "Standard_DS4_v2",
  "beta.kubernetes.io/os": "linux",
  "failure-domain.beta.kubernetes.io/region": "centralus",
  "failure-domain.beta.kubernetes.io/zone": "0",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "adahiya-1-zr9dr-master-1",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/master": "",
  "node.openshift.io/os_id": "rhcos"
}
{
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/instance-type": "Standard_DS4_v2",
  "beta.kubernetes.io/os": "linux",
  "failure-domain.beta.kubernetes.io/region": "centralus",
  "failure-domain.beta.kubernetes.io/zone": "0",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "adahiya-1-zr9dr-master-2",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/master": "",
  "node.openshift.io/os_id": "rhcos"
}
{
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/instance-type": "Standard_DS4_v2",
  "beta.kubernetes.io/os": "linux",
  "failure-domain.beta.kubernetes.io/region": "centralus",
  "failure-domain.beta.kubernetes.io/zone": "centralus-1",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "adahiya-1-zr9dr-worker-ftcm4",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/worker": "",
  "node.openshift.io/os_id": "rhcos"
}
{
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/instance-type": "Standard_DS4_v2",
  "beta.kubernetes.io/os": "linux",
  "failure-domain.beta.kubernetes.io/region": "centralus",
  "failure-domain.beta.kubernetes.io/zone": "centralus-2",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "adahiya-1-zr9dr-worker-g98df",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/worker": "",
  "node.openshift.io/os_id": "rhcos"
}
{
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/instance-type": "Standard_DS4_v2",
  "beta.kubernetes.io/os": "linux",
  "failure-domain.beta.kubernetes.io/region": "centralus",
  "failure-domain.beta.kubernetes.io/zone": "centralus-1",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "adahiya-1-zr9dr-worker-zfl45",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/worker": "",
  "node.openshift.io/os_id": "rhcos"
}
```

```
oc get pv -oyaml
apiVersion: v1
items:
- apiVersion: v1
  kind: PersistentVolume
  metadata:
    annotations:
      pv.kubernetes.io/bound-by-controller: "yes"
      pv.kubernetes.io/provisioned-by: kubernetes.io/azure-disk
      volumehelper.VolumeDynamicallyCreatedByKey: azure-disk-dynamic-provisioner
    creationTimestamp: "2019-06-25T17:19:55Z"
    finalizers:
    - kubernetes.io/pv-protection
    labels:
      failure-domain.beta.kubernetes.io/region: centralus
      failure-domain.beta.kubernetes.io/zone: centralus-2
    name: pvc-6fdcb2a9-976d-11e9-9bd8-000d3a948bbe
    resourceVersion: "21282"
    selfLink: /api/v1/persistentvolumes/pvc-6fdcb2a9-976d-11e9-9bd8-000d3a948bbe
    uid: 7351eccf-976d-11e9-a460-000d3a3f59ef
  spec:
    accessModes:
    - ReadWriteOnce
    azureDisk:
      cachingMode: ReadOnly
      diskName: kubernetes-dynamic-pvc-6fdcb2a9-976d-11e9-9bd8-000d3a948bbe
      diskURI: /subscriptions/433715e6-37fe-4328-af75-3661e13b15fc/resourceGroups/adahiya-1-zr9dr-rg/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-6fdcb2a9-976d-11e9-9bd8-000d3a948bbe
      fsType: ""
      kind: Managed
      readOnly: false
    capacity:
      storage: 5Gi
    claimRef:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: pvc-z74wx
      namespace: provisioning-8879
      resourceVersion: "21254"
      uid: 6fdcb2a9-976d-11e9-9bd8-000d3a948bbe
    nodeAffinity:
      required:
        nodeSelectorTerms:
        - matchExpressions:
          - key: failure-domain.beta.kubernetes.io/region
            operator: In
            values:
            - centralus
          - key: failure-domain.beta.kubernetes.io/zone
            operator: In
            values:
            - centralus-2
    persistentVolumeReclaimPolicy: Delete
    storageClassName: provisioning-8879-azure-sc
    volumeMode: Filesystem
  status:
    phase: Bound
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```

```
oc describe pod pvc-reader-node2-554ch -n provisioning-8879
Name:               pvc-reader-node2-554ch
Namespace:          provisioning-8879
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             app=pvc-reader-node2
Annotations:        openshift.io/scc: anyuid
Status:             Pending
IP:
Containers:
  volume-tester:
    Image:      docker.io/library/busybox:1.29
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
      -c
      grep 'hello world' /mnt/test/data
    Environment:  <none>
    Mounts:
      /mnt/test from my-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-tfk8c (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  my-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-z74wx
    ReadOnly:   false
  default-token-tfk8c:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-tfk8c
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  88s (x3 over 2m54s)  default-scheduler  0/6 nodes are available: 1 node(s) didn't match node selector, 2 node(s) had volume node affinity conflict, 3 node(s) had taints that the pod didn't tolerate.
```

The test pod is failing to schedule because of a zone-level mismatch: the PV is pinned to zone centralus-2, but the test wants the pod on a different node. What is the expectation for the storage setup, in terms of VMs in specific zones, regions, etc.?
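The scheduling failure above follows mechanically from the PV's nodeAffinity: a node is eligible only if every matchExpression is satisfied by its labels. A simplified sketch of that check (the real logic lives in the kube-scheduler's volume topology handling; names here are illustrative), using the worker labels and PV affinity quoted above:

```python
def matches_node_affinity(node_labels, match_expressions):
    # A node satisfies the PV only if every "In" expression matches one
    # of its label values. (Other operators are omitted for brevity.)
    return all(node_labels.get(expr["key"]) in expr["values"]
               for expr in match_expressions)

# The PV's nodeAffinity from the `oc get pv` output above.
pv_affinity = [
    {"key": "failure-domain.beta.kubernetes.io/region", "operator": "In",
     "values": ["centralus"]},
    {"key": "failure-domain.beta.kubernetes.io/zone", "operator": "In",
     "values": ["centralus-2"]},
]

# Zone labels of the three (untainted) workers from `oc get nodes`.
workers = {
    "adahiya-1-zr9dr-worker-ftcm4": {
        "failure-domain.beta.kubernetes.io/region": "centralus",
        "failure-domain.beta.kubernetes.io/zone": "centralus-1"},
    "adahiya-1-zr9dr-worker-g98df": {
        "failure-domain.beta.kubernetes.io/region": "centralus",
        "failure-domain.beta.kubernetes.io/zone": "centralus-2"},
    "adahiya-1-zr9dr-worker-zfl45": {
        "failure-domain.beta.kubernetes.io/region": "centralus",
        "failure-domain.beta.kubernetes.io/zone": "centralus-1"},
}

eligible = [name for name, labels in workers.items()
            if matches_node_affinity(labels, pv_affinity)]
# Only worker-g98df (zone centralus-2) can mount this volume, which matches
# the "2 node(s) had volume node affinity conflict" scheduler event.
```

This is exactly why a "should access volume from different nodes" test fails here: only one worker sits in the volume's zone, so there is no second eligible node.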
> `[sig-storage] In-tree Volumes [Driver: azure] [Testpattern: Dynamic PV (default fs)] provisioning should access volume from different nodes [Suite:openshift/conformance/parallel] [Suite:k8s]`

This is tracked in bug #1711688. Let's focus on the openshift-tests setup here, which will fix most of the failures. Once that's working, we can sort out individual flakes in other bugs.

> The contents of cloud.conf for the Azure tests [snip]
> What do you think is missing, jsafrane?

I don't know. Our knowledge of Azure is very limited; I had hoped you would be more familiar with its setup. In the end, the cloud provider in openshift-tests should be set up the same way as the cloud provider in OpenShift itself. I don't know Azure well enough to judge what is wrong and where.

I found out why the test gets a 404: it uses a link-local address to get something from the cloud:

http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https%3A%2F%2Fmanagement.core.windows.net%2F

It expects to run on Azure, where such a URL might make sense, but it runs on GCE(?) and thus gets a 404. I don't know how the Kubernetes cloud provider works here, or whether it can be configured to provision a disk from a non-Azure machine.

[trying to re-assign to the Azure team] This bug is not about storage but about the cloud provider used in the e2e tests. It must be able to run outside of Azure, using only the Azure API to create volumes and no link-local addresses. In short, this file should work outside of Azure: https://github.com/kubernetes/kubernetes/blob/d3a902ff5b5b8737f2d5ff649656669b8223068f/test/e2e/framework/providers/azure/azure.go

> It expects to run on Azure, where such a URL might make sense, but it runs on GCE(?) and thus gets a 404.

Jan Safranek, I'm confused: the e2e tests for Azure are running on a cluster running on Azure.

No, our e2e tests do not run on Azure. I believe our OpenShift CI cluster (api.ci.openshift.org) runs on GCE and spawns a pod for each test. That pod installs a "cluster under test" into the real cloud (AWS/Azure/GCE/vSphere/...), but the test binary itself (openshift-tests) still runs in a pod on the OpenShift CI cluster, i.e. on GCE.

I asked upstream; it seems it should be enough to set `"useInstanceMetadata": false` in the Azure cloud config, see https://kubernetes.slack.com/archives/C5HJXTT9Q/p156156401228690. openshift-tests already does that, see https://github.com/openshift/origin/blame/5d555f05619ad069ca78670ecc06b6c0fb5f0047/test/extended/util/azure/config_file.go#L36, so maybe it's already fixed.

Thanks for the update, Jan, but I'm still confused. Regardless of where you run the binary from, e.g. your local machine, it deploys a cluster on your cloud of choice, e.g. AWS or Azure, and runs the e2e tests against that cloud environment to validate the expectations there.

openshift-tests contains this code: https://github.com/openshift/origin/blob/d7a4539442e59eb8ccd4bdc8aca5eec731dd219d/vendor/k8s.io/kubernetes/test/e2e/framework/providers/azure/azure.go#L65. I.e., it wants to create Azure disks for the tests. It does not run on Azure, yet it tried to use the Azure metadata service; hence the error. That was a long time ago; maybe it's better now.

Small update: these tests fail on AWS as well and have been skipped at upstream k/k too: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/storage/drivers/in_tree.go#L1553

Jan, do you think it's OK to skip the test completely? If so, we can just skip it and close this issue. It's already being skipped: https://github.com/openshift/origin/blob/master/test/extended/util/test.go#L442-L472

Initially at k/k these storage tests were GCE-only. Refactoring was done in https://github.com/kubernetes/kubernetes/pull/66577/files, which introduced interfaces; to satisfy the interface, half-baked tests were introduced in that PR for AWS and Azure.
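To illustrate why the metadata call can only succeed inside an Azure VM: the token request goes to the link-local IMDS address 169.254.169.254, which is answered by the Azure fabric only from within an Azure VM and requires a `Metadata: true` header. A small stdlib sketch that builds (but deliberately does not send) that request; the function name is hypothetical, while the endpoint, query parameters, and header follow the URL quoted in this thread:

```python
import urllib.parse
import urllib.request

IMDS = "http://169.254.169.254/metadata/identity/oauth2/token"

def build_imds_token_request(resource="https://management.core.windows.net/"):
    """Build the managed-identity token request that the Azure cloud
    provider issues when useInstanceMetadata is enabled."""
    query = urllib.parse.urlencode(
        {"api-version": "2018-02-01", "resource": resource})
    # 169.254.169.254 is link-local: from a GCE-hosted CI pod this address is
    # served by a different metadata service that does not know this path,
    # which is exactly the 404 (a Google error page) seen in the logs above.
    return urllib.request.Request(f"{IMDS}?{query}",
                                  headers={"Metadata": "true"})

req = build_imds_token_request()
```

This also explains why `useInstanceMetadata: false` in the test harness's cloud config helps: it steers the provider away from the link-local IMDS path toward the regular, routable Azure management API.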
The AWS storage tests have been commented out at upstream k/k since then. This is more of an enhancement than an issue; I will work on getting the storage tests working for Azure and AWS. We are already skipping the AWS storage tests, so skipping the Azure tests as well should, I think, be sufficient for this BZ. Closing it!

I think there are several issues mixed here.

1. The openshift-tests binary, running on GCE, needs to create/delete volumes on a remote Azure cluster. This is IMO not fixed; I can still see the corresponding tests disabled in our CI: https://github.com/openshift/origin/blob/cf923545a180bbe4bfd03db7d7fc01a2bf9ff23d/test/extended/util/test.go#L445

I think we're a step further. With all Azure tests enabled, I get this error:

```
Sep 3 12:52:13.178: INFO: At 2019-09-03 12:47:09 +0000 UTC - event for pod-subpath-test-azure-4f4h: {attachdetach-controller } FailedAttachVolume: AttachVolume.Attach failed for volume "test-volume" : Attach volume "e2e-effe3739-ce48-11e9-86bc-0a58ac100cb7.vhd" to instance "/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-85vrtic6-5cef7-dxct6-rg/providers/Microsoft.Compute/virtualMachines/ci-op-85vrtic6-5cef7-dxct6-worker-centralus3-t4685" failed with compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="OperationNotAllowed" Message="Addition of a blob based disk to VM with managed disks is not supported." Target="dataDisk"
```

The reason is that the test creates a "blob" disk, while the virtual machine can only work with "managed" disks. See https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/23715/pull-ci-openshift-origin-master-e2e-azure/33

I filed https://github.com/kubernetes/kubernetes/issues/82272 to fix this. I.e., there is some work to be done; disabling the tests should only be a temporary measure. We do want to run the tests!

2. The volume limit test was disabled upstream: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/storage/drivers/in_tree.go#L1553. It's OK to disable this particular test in our CI too, but it is completely orthogonal to item 1; storage has many more tests than volume limits. In any case, this is not a blocker bug.

This has not been prioritised yet and needs to be re-evaluated. Tagging UpcomingSprint.

I didn't encounter the described error at any point in CI, and the fix, https://github.com/kubernetes/kubernetes/pull/82324, was merged a while ago. It now creates the right kind of Azure disk (https://github.com/kubernetes/kubernetes/pull/82324/files#diff-d1a5ece2215eb348ea751cd0ac48592fR1470), so I assume this is done now. Could you please verify this? Moving to the storage team as well.

@Fabio, I have some queries about this PR; could you check whether my understanding is right?

1. It looks like we removed the skip for [Driver: azure], but our test cases are now [Driver: azure-disk], so I understand there is no change in which cases execute?
`\[sig-storage\] In-tree Volumes \[Driver: azure\] \[Testpattern: Inline-volume`,
`\[sig-storage\] In-tree Volumes \[Driver: azure\] \[Testpattern: Pre-provisioned PV`

2. I actually see some Azure-related [Inline-volume]/[Pre-provisioned] [subPath] cases being skipped; could you confirm this is expected? For example, in https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1298942957960826880:
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Pre-provisioned PV (default fs)] subPath ...
[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Inline-volume (default fs)] subPath ...

(In reply to Wei Duan from comment #23)

> 1. It looks like we removed the skip for [Driver: azure], but our test cases are now [Driver: azure-disk], so I understand there is no change in which cases execute?

That's correct; the PR was just a clean-up. At some point the driver name changed from "azure" to "azure-disk", which invalidated our skip rules. Since the tests are passing, I removed the skip rules instead of renaming them.

> 2. I actually see some Azure-related [Inline-volume]/[Pre-provisioned] [subPath] cases being skipped; could you confirm this is expected?

Good catch, @Wei. This is expected, as these tests are considered redundant: https://github.com/openshift/origin/blob/abc0e0c4013244b125b9f8bfcb32be8be355a3bc/vendor/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go#L85-L89

@Fabio, thanks, it is clear now. I changed the status to "Verified".

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
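As a closing technical note on the skip-rule mismatch discussed in the thread: a skip regex written against the literal `[Driver: azure]` silently stops matching once the driver reports itself as `azure-disk`, because the closing `\]` no longer lines up. A quick sketch (the rule string is patterned on the rules quoted above, not copied from the origin repository):

```python
import re

# Skip rule patterned on the old origin skip rules quoted in this thread.
skip_rule = re.compile(r"\[Driver: azure\] \[Testpattern: Inline-volume")

old_name = ("[sig-storage] In-tree Volumes [Driver: azure] "
            "[Testpattern: Inline-volume (default fs)] subPath ...")
new_name = ("[sig-storage] In-tree Volumes [Driver: azure-disk] "
            "[Testpattern: Inline-volume (default fs)] subPath ...")

matched_old = bool(skip_rule.search(old_name))  # old driver name: skipped
matched_new = bool(skip_rule.search(new_name))  # renamed driver: rule is dead
```

This is why the rename went unnoticed: a stale skip rule does not fail, it simply matches nothing.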