test:

[sig-storage] In-tree Volumes [Driver: azure-disk] [Testpattern: Dynamic PV (ext4)] volumes should store data

is failing frequently in CI, see search results:

https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-storage%5C%5D+In-tree+Volumes+%5C%5BDriver%3A+azure-disk%5C%5D+%5C%5BTestpattern%3A+Dynamic+PV+%5C%28ext4%5C%29%5C%5D+volumes+should+store+data
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=4.5&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-storage%5C%5D+In-tree+Volumes+%5C%5BDriver%3A+azure-disk%5C%5D+%5C%5BTestpattern%3A+Dynamic+PV+%5C%28ext4%5C%29%5C%5D+volumes+should+store+data

Example run: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-shared-vpc-4.5/1314751477436125184

```
Oct 10 03:28:36.483: INFO: Deleting PersistentVolumeClaim "pvc-protectionggn7n"
Oct 10 03:28:36.556: INFO: Running AfterSuite actions on all nodes
Oct 10 03:28:36.556: INFO: Running AfterSuite actions on node 1
fail [k8s.io/kubernetes/test/e2e/storage/pvc_protection.go:94]: While creating pod that uses the PVC or waiting for the Pod to become Running
Unexpected error:
    <*errors.errorString | 0xc000dc7a10>: {
        s: "pod \"pvc-tester-rskpn\" is not Running: timed out waiting for the condition",
    }
    pod "pvc-tester-rskpn" is not Running: timed out waiting for the condition
occurred
```

Checking the kubelet logs:

```
Oct 10 03:25:49.969270 ci-op-vp586qpi-98686-lg8ch-worker-centralus3-5qs27 hyperkube[1489]: E1010 03:25:49.969124 1489 kubelet.go:1700] Unable to attach or mount volumes for pod "pvc-tester-rskpn_e2e-pvc-protection-4329(39c9ebf0-610a-407d-b7f0-732a0e7aee07)": unmounted volumes=[volume1], unattached volumes=[volume1 default-token-m4wnd]: timed out waiting for the condition; skipping pod
Oct 10 03:28:06.617469 ci-op-vp586qpi-98686-lg8ch-worker-centralus3-5qs27 hyperkube[1489]: E1010 03:28:06.617424 1489 kubelet.go:1700] Unable to attach or mount volumes for pod "pvc-tester-rskpn_e2e-pvc-protection-4329(39c9ebf0-610a-407d-b7f0-732a0e7aee07)": unmounted volumes=[volume1], unattached volumes=[volume1 default-token-m4wnd]: timed out waiting for the condition; skipping pod
Oct 10 03:30:22.617201 ci-op-vp586qpi-98686-lg8ch-worker-centralus3-5qs27 hyperkube[1489]: E1010 03:30:22.617163 1489 kubelet.go:1700] Unable to attach or mount volumes for pod "pvc-tester-rskpn_e2e-pvc-protection-4329(39c9ebf0-610a-407d-b7f0-732a0e7aee07)": unmounted volumes=[volume1 default-token-m4wnd], unattached volumes=[volume1 default-token-m4wnd]: timed out waiting for the condition; skipping pod
```
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-shared-vpc-4.5/1314751477436125184

Azure was slow attaching a volume:

```
Oct 10 03:29:58.215: INFO: At 2020-10-10 03:24:49 +0000 UTC - event for azure-disk8q2vr: {persistentvolume-controller } WaitForFirstConsumer: waiting for first consumer to be created before binding
Oct 10 03:29:58.215: INFO: At 2020-10-10 03:25:00 +0000 UTC - event for azure-disk8q2vr: {persistentvolume-controller } ProvisioningSucceeded: Successfully provisioned volume pvc-a5a23ff4-0c98-4187-a994-c580842ecfea using kubernetes.io/azure-disk
Oct 10 03:29:58.215: INFO: At 2020-10-10 03:27:03 +0000 UTC - event for azure-injector: {kubelet ci-op-vp586qpi-98686-lg8ch-worker-centralus3-5qs27} FailedMount: Unable to attach or mount volumes: unmounted volumes=[azure-volume-0], unattached volumes=[azure-volume-0 default-token-wvwgq]: timed out waiting for the condition
Oct 10 03:29:58.215: INFO: At 2020-10-10 03:29:29 +0000 UTC - event for azure-injector: {attachdetach-controller } SuccessfulAttachVolume: AttachVolume.Attach succeeded for volume "pvc-a5a23ff4-0c98-4187-a994-c580842ecfea"
```

That is 4m30s from provisioning to attach; the test times out after 5 minutes.

From the kube-controller-manager logs:

```
I1010 03:25:00.881130 1 reconciler.go:275] attacherDetacher.AttachVolume started for volume "pvc-a5a23ff4-0c98-4187-a994-c580842ecfea" (UniqueName: "kubernetes.io/azure-disk//subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-vp586qpi-98686-lg8ch-rg/providers/Microsoft.Compute/disks/ci-op-vp586qpi-98686-lg8ch-dynamic-pvc-a5a23ff4-0c98-4187-a994-c580842ecfea") from node "ci-op-vp586qpi-98686-lg8ch-worker-centralus3-5qs27"
I1010 03:25:30.943687 1 attacher.go:84] GetDiskLun returned: cannot find Lun for disk ci-op-vp586qpi-98686-lg8ch-dynamic-pvc-a5a23ff4-0c98-4187-a994-c580842ecfea. Initiating attaching volume "/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-vp586qpi-98686-lg8ch-rg/providers/Microsoft.Compute/disks/ci-op-vp586qpi-98686-lg8ch-dynamic-pvc-a5a23ff4-0c98-4187-a994-c580842ecfea" to node "ci-op-vp586qpi-98686-lg8ch-worker-centralus3-5qs27".
I1010 03:29:19.428299 1 azure_controller_common.go:183] Trying to attach volume "/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-vp586qpi-98686-lg8ch-rg/providers/Microsoft.Compute/disks/ci-op-vp586qpi-98686-lg8ch-dynamic-pvc-a5a23ff4-0c98-4187-a994-c580842ecfea" lun 2 to node "ci-op-vp586qpi-98686-lg8ch-worker-centralus3-5qs27".
I1010 03:29:19.428363 1 azure_controller_standard.go:93] azureDisk - update(ci-op-vp586qpi-98686-lg8ch-rg): vm(ci-op-vp586qpi-98686-lg8ch-worker-centralus3-5qs27) - attach disk(ci-op-vp586qpi-98686-lg8ch-dynamic-pvc-a5a23ff4-0c98-4187-a994-c580842ecfea, /subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-vp586qpi-98686-lg8ch-rg/providers/Microsoft.Compute/disks/ci-op-vp586qpi-98686-lg8ch-dynamic-pvc-a5a23ff4-0c98-4187-a994-c580842ecfea) with DiskEncryptionSetID()
I1010 03:29:29.816271 1 azure_controller_standard.go:111] azureDisk - update(ci-op-vp586qpi-98686-lg8ch-rg): vm(ci-op-vp586qpi-98686-lg8ch-worker-centralus3-5qs27) - attach disk(ci-op-vp586qpi-98686-lg8ch-dynamic-pvc-a5a23ff4-0c98-4187-a994-c580842ecfea, /subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-vp586qpi-98686-lg8ch-rg/providers/Microsoft.Compute/disks/ci-op-vp586qpi-98686-lg8ch-dynamic-pvc-a5a23ff4-0c98-4187-a994-c580842ecfea) returned with <nil>
I1010 03:29:29.816315 1 attacher.go:89] Attach operation successful: volume "/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-vp586qpi-98686-lg8ch-rg/providers/Microsoft.Compute/disks/ci-op-vp586qpi-98686-lg8ch-dynamic-pvc-a5a23ff4-0c98-4187-a994-c580842ecfea" attached to node "ci-op-vp586qpi-98686-lg8ch-worker-centralus3-5qs27".
```

There is a 4-minute gap between "Initiating attaching volume" and "Trying to attach volume".

"Initiating attaching volume" is logged just before calling diskController.AttachDisk: https://github.com/openshift/origin/blob/16abec0d471f3c40e04622210edba33d43f21704/vendor/k8s.io/kubernetes/pkg/volume/azure_dd/attacher.go#L84

"Trying to attach volume" is logged at the end of diskController.AttachDisk: https://github.com/openshift/origin/blob/26362327ba5ed375517eded918ae48ec332d10bf/vendor/k8s.io/kubernetes/staging/src/k8s.io/legacy-cloud-providers/azure/azure_controller_common.go#L183

AttachDisk() can list VMs from Azure and do other potentially slow operations; however, we don't know which one took the 4 minutes.
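If we wanted to narrow that down, each suspect step inside AttachDisk() could be wrapped with a timing log. A minimal sketch of such instrumentation, assuming we are free to patch the provider; the helper and the step names are hypothetical, not the upstream code:

```go
package main

import (
	"log"
	"time"
)

// timed is a hypothetical helper: it runs one step of AttachDisk and logs how
// long it took, so a multi-minute stall would point at a concrete call.
func timed(step string, fn func() error) error {
	start := time.Now()
	err := fn()
	log.Printf("azureDisk - %s took %v (err=%v)", step, time.Since(start), err)
	return err
}

func main() {
	// Stand-ins for the potentially slow steps between "Initiating attaching
	// volume" and "Trying to attach volume"; real code would call the Azure API.
	_ = timed("get disk", func() error { time.Sleep(50 * time.Millisecond); return nil })
	_ = timed("resolve VM instance", func() error { time.Sleep(20 * time.Millisecond); return nil })
	_ = timed("wait for node attach lock + pick LUN", func() error { time.Sleep(10 * time.Millisecond); return nil })
}
```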
Some more evidence from Prometheus metrics from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-shared-vpc-4.5/1314751477436125184. The volume_attach histogram (this covers the whole Attach() volume plugin duration) shows:

- 64 volume_attach ops took less than 600 seconds
- 63 volume_attach ops took less than 300 seconds
- 61 volume_attach ops took less than 120 seconds
- 60 volume_attach ops took less than 50 seconds
- 50 volume_attach ops took less than 25 seconds
- 27 volume_attach ops took less than 15 seconds

These metrics do not show any related API errors; only a load balancer creation failed once:

- cloudprovider_azure_api_request_errors
- cloudprovider_azure_api_request_ratelimited_count
- cloudprovider_azure_api_request_throttled_count
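Note that Prometheus histogram buckets are cumulative, so the per-interval counts fall out by subtracting adjacent buckets. A small illustrative sketch using the numbers above (only the bucket values come from the job; the program itself is just an illustration):

```go
package main

import "fmt"

func main() {
	// Cumulative volume_attach counts from the job above: number of ops with
	// duration <= le seconds.
	buckets := []struct {
		le    float64
		count int
	}{
		{15, 27}, {25, 50}, {50, 60}, {120, 61}, {300, 63}, {600, 64},
	}

	// Buckets are cumulative, so the number of ops in each interval is the
	// difference between neighbouring buckets.
	prev := 0
	for _, b := range buckets {
		fmt.Printf("%2d attach ops took <= %gs (and more than the previous bound)\n", b.count-prev, b.le)
		prev = b.count
	}
	// The last line prints "1 attach ops took <= 600s", i.e. exactly one
	// attach landed in the 300-600s interval -- presumably the one that blew
	// the 5-minute test timeout.
}
```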
> There is a 4-minute gap between "Initiating attaching volume" and "Trying to attach volume"

It's probably caused by the Azure in-tree cloud provider serializing Attach operations to a single node: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/legacy-cloud-providers/azure/azure_controller_common.go#L190-L191

When one AttachDisk is slow, all subsequent Attach() calls to the same node must wait for it to complete. In this e2e run, I noticed three Attach() calls queued up, waiting for the first one to complete:

```
openshift-kube-controller-manager_kube-controller-manager-ci-op-vp586qpi-98686-lg8ch-master-0_kube-controller-manager.log:I1010 03:23:47.044838 1 attacher.go:84] GetDiskLun returned: cannot find Lun for disk ci-op-vp586qpi-98686-lg8ch-dynamic-pvc-e8d56c20-70fa-484c-9b32-d1244fed2ed0. Initiating attaching volume "/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-vp586qpi-98686-lg8ch-rg/providers/Microsoft.Compute/disks/ci-op-vp586qpi-98686-lg8ch-dynamic-pvc-e8d56c20-70fa-484c-9b32-d1244fed2ed0" to node "ci-op-vp586qpi-98686-lg8ch-worker-centralus3-5qs27".
openshift-kube-controller-manager_kube-controller-manager-ci-op-vp586qpi-98686-lg8ch-master-0_kube-controller-manager.log:I1010 03:24:43.745543 1 attacher.go:84] GetDiskLun returned: cannot find Lun for disk ci-op-vp586qpi-98686-lg8ch-dynamic-pvc-9ecee29b-47d7-4b51-bd8a-aeda5eb58255. Initiating attaching volume "/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-vp586qpi-98686-lg8ch-rg/providers/Microsoft.Compute/disks/ci-op-vp586qpi-98686-lg8ch-dynamic-pvc-9ecee29b-47d7-4b51-bd8a-aeda5eb58255" to node "ci-op-vp586qpi-98686-lg8ch-worker-centralus3-5qs27".
openshift-kube-controller-manager_kube-controller-manager-ci-op-vp586qpi-98686-lg8ch-master-0_kube-controller-manager.log:I1010 03:25:30.943687 1 attacher.go:84] GetDiskLun returned: cannot find Lun for disk ci-op-vp586qpi-98686-lg8ch-dynamic-pvc-a5a23ff4-0c98-4187-a994-c580842ecfea. Initiating attaching volume "/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-vp586qpi-98686-lg8ch-rg/providers/Microsoft.Compute/disks/ci-op-vp586qpi-98686-lg8ch-dynamic-pvc-a5a23ff4-0c98-4187-a994-c580842ecfea" to node "ci-op-vp586qpi-98686-lg8ch-worker-centralus3-5qs27".
openshift-kube-controller-manager_kube-controller-manager-ci-op-vp586qpi-98686-lg8ch-master-0_kube-controller-manager.log:I1010 03:29:03.828645 1 attacher.go:89] Attach operation successful: volume "/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-vp586qpi-98686-lg8ch-rg/providers/Microsoft.Compute/disks/ci-op-vp586qpi-98686-lg8ch-dynamic-pvc-e8d56c20-70fa-484c-9b32-d1244fed2ed0" attached to node "ci-op-vp586qpi-98686-lg8ch-worker-centralus3-5qs27".
openshift-kube-controller-manager_kube-controller-manager-ci-op-vp586qpi-98686-lg8ch-master-0_kube-controller-manager.log:I1010 03:29:19.398937 1 attacher.go:89] Attach operation successful: volume "/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-vp586qpi-98686-lg8ch-rg/providers/Microsoft.Compute/disks/ci-op-vp586qpi-98686-lg8ch-dynamic-pvc-9ecee29b-47d7-4b51-bd8a-aeda5eb58255" attached to node "ci-op-vp586qpi-98686-lg8ch-worker-centralus3-5qs27".
openshift-kube-controller-manager_kube-controller-manager-ci-op-vp586qpi-98686-lg8ch-master-0_kube-controller-manager.log:I1010 03:29:29.816315 1 attacher.go:89] Attach operation successful: volume "/subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-op-vp586qpi-98686-lg8ch-rg/providers/Microsoft.Compute/disks/ci-op-vp586qpi-98686-lg8ch-dynamic-pvc-a5a23ff4-0c98-4187-a994-c580842ecfea" attached to node "ci-op-vp586qpi-98686-lg8ch-worker-centralus3-5qs27".
```

The last one is the attach that actually caused the test to time out; it just waited ~4 minutes for the lock held by a previous Attach().
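For clarity, here is a minimal model of that per-node locking, assuming nothing beyond what the linked code shows; the real lockMap in legacy-cloud-providers/azure differs in detail, so treat this as a sketch only:

```go
package main

import (
	"sync"
	"time"
)

// lockMap is a simplified stand-in for the per-node lock in the Azure cloud
// provider: one mutex per node name, created on demand.
type lockMap struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func (l *lockMap) entry(node string) *sync.Mutex {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.locks[node] == nil {
		l.locks[node] = &sync.Mutex{}
	}
	return l.locks[node]
}

// attachDisk models the serialization in AttachDisk: every attach to the same
// node waits for that node's lock, so one slow Azure VM update delays all
// queued attaches behind it.
func attachDisk(l *lockMap, node string, azureCallDuration time.Duration) {
	m := l.entry(node)
	m.Lock() // queued Attach() calls for this node block here
	defer m.Unlock()
	time.Sleep(azureCallDuration) // stand-in for the VM update against the Azure API
}

func main() {
	l := &lockMap{locks: map[string]*sync.Mutex{}}
	var wg sync.WaitGroup
	// Three PVCs being attached to the same worker, as in the logs above:
	// the second and third wait for the first one to finish.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			attachDisk(l, "worker-centralus3-5qs27", time.Duration(i+1)*100*time.Millisecond)
		}(i)
	}
	wg.Wait()
}
```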
This PR can help surface long attach/detach times caused by the underlying cloud: https://github.com/openshift/origin/pull/25614. Beyond that, there is currently nothing we can do about it. Azure has promised to speed up attach/detach times (https://github.com/Azure/aks-engine/issues/2888#issuecomment-708275363), but with no deadline.
We made the attach flake more observable with https://github.com/openshift/origin/pull/25614. For example, in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-4.7/1326563613581774848:

- Some Azure tests failed with "Gave up after waiting 5m0s for pod XYZ ...".
- But there is a "[Late] Metrics should report short attach times" test that says:

```
flake: Operation volume_attach of plugin [kubernetes.io/azure-disk kubernetes.io/aws-ebs kubernetes.io/gce-pd kubernetes.io/cinder kubernetes.io/vsphere-volume] took more than 120 seconds: query failed:
# Operation volume_attach time of plugin kubernetes.io/azure-disk should be < 120 seconds
sum(max_over_time(storage_operation_duration_seconds_bucket{job="kube-controller-manager",le="+Inf",operation_name="volume_attach",volume_plugin="kubernetes.io/azure-disk"}[2h])) - sum(max_over_time(storage_operation_duration_seconds_bucket{job="kube-controller-manager",le="120",operation_name="volume_attach",volume_plugin="kubernetes.io/azure-disk"}[2h])) > 0: promQL query:
# Operation volume_attach time of plugin kubernetes.io/azure-disk should be < 120 seconds
sum(max_over_time(storage_operation_duration_seconds_bucket{job="kube-controller-manager",le="+Inf",operation_name="volume_attach",volume_plugin="kubernetes.io/azure-disk"}[2h])) - sum(max_over_time(storage_operation_duration_seconds_bucket{job="kube-controller-manager",le="120",operation_name="volume_attach",volume_plugin="kubernetes.io/azure-disk"}[2h])) > 0
had reported incorrect results: [{"metric":{},"value":[1605116721.105,"10"]}]
```

Translated to English: "10" attach operations took longer than 2 minutes. As written above, we can't do anything about the Azure slowness.
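For anyone who wants to re-run the same check against a live cluster, a rough sketch using the Prometheus Go client; the Prometheus address is a placeholder and authentication is omitted, but the query is the one from the test above:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// Count of volume_attach operations that took longer than 120 seconds in the
// last 2 hours (the same expression the [Late] test evaluates).
const slowAttachQuery = `
sum(max_over_time(storage_operation_duration_seconds_bucket{job="kube-controller-manager",le="+Inf",operation_name="volume_attach",volume_plugin="kubernetes.io/azure-disk"}[2h]))
-
sum(max_over_time(storage_operation_duration_seconds_bucket{job="kube-controller-manager",le="120",operation_name="volume_attach",volume_plugin="kubernetes.io/azure-disk"}[2h]))`

func main() {
	// Placeholder address; in a real cluster this would be the Prometheus
	// route/service, typically behind authentication.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// The result was "10" in the job quoted above.
	result, warnings, err := v1.NewAPI(client).Query(ctx, slowAttachQuery, time.Now())
	if err != nil {
		panic(err)
	}
	fmt.Println(warnings, result)
}
```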
Jan, thanks for the PR, lgtm, I'll change the bug state to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633