Bug 1467970 - Timeouts with Azure Disk attaching volumes to pods
Summary: Timeouts with Azure Disk attaching volumes to pods
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.5.z
Assignee: Hemant Kumar
QA Contact: Wenqi He
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-07-05 17:07 UTC by Eduardo Minguez
Modified: 2017-08-31 17:00 UTC (History)
3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The presence of the lazy_itable_init=0,lazy_journal_init=0 options in the mkfs arguments made ext4 formatting very slow on Azure.
Consequence: A pod could be stuck in Pending state, or the mount operation could simply time out, if formatting the disk took too long.
Fix: The lazy_itable_init=0,lazy_journal_init=0 options were removed from the mkfs command.
Result: The time it takes to format an ext4 disk on Azure is more than halved.
Clone Of:
Environment:
Last Closed: 2017-08-31 17:00:23 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
bunch of oc logs, oc describe, oc events, etc. infranodes and masters logs (516.29 KB, application/x-gzip)
2017-07-05 17:07 UTC, Eduardo Minguez
no flags Details
pv (1.04 KB, text/plain)
2017-07-05 17:09 UTC, Eduardo Minguez
no flags Details
pvc (780 bytes, text/plain)
2017-07-05 17:10 UTC, Eduardo Minguez
no flags Details
pv & pvc (1.17 KB, application/x-gzip)
2017-07-06 18:26 UTC, Eduardo Minguez
no flags Details
controllers (34.97 KB, application/x-gzip)
2017-07-06 18:27 UTC, Eduardo Minguez
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:1828 0 normal SHIPPED_LIVE OpenShift Container Platform 3.5, 3.4, and 3.3 bug fix update 2017-08-31 20:59:56 UTC

Description Eduardo Minguez 2017-07-05 17:07:42 UTC
Created attachment 1294672 [details]
bunch of oc logs, oc describe, oc events, etc. infranodes and masters logs

Description of problem: Deploying aggregated logging in OCP fails due to timeouts in Azure disk attach/mount operations.
Azure disks are created with a StorageClass as follows:

kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: slow
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
    volume.beta.kubernetes.io/storage-class: slow
    volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/azure-disk
provisioner: kubernetes.io/azure-disk
parameters:
  storageAccount: sapv1emingueztst
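
For reference, a minimal PVC that consumes this class might look like the following sketch (the claim name and size are hypothetical; the beta storage-class annotation is used because this is Kubernetes 1.5):

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: example-claim
  annotations:
    # beta annotation used by k8s 1.5 dynamic provisioning
    volume.beta.kubernetes.io/storage-class: slow
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi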

PVs and PVCs seem to be OK, but the events contain a lot of timeout errors such as:

2017-07-05 16:38:39 +0000 UTC   2017-07-05 16:34:13 +0000 UTC   3         logging-es-3typqox8-1-680vl   Pod                 Warning   FailedMount   {kubelet infranode2}   Unable to mount volumes for pod "logging-es-3typqox8-1-680vl_logging(41013793-619f-11e7-98d7-000d3a02f8d5)": timeout expired waiting for volumes to attach/mount for pod "logging"/"logging-es-3typqox8-1-680vl". list of unattached/unmounted volumes=[elasticsearch-storage]
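
One way to surface these events (the logging namespace is the one used in this deployment):

oc get events -n logging | grep FailedMount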

Version-Release number of selected component (if applicable):
oc v3.5.5.26
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://emingueztst.eastus2.cloudapp.azure.com:8443
openshift v3.5.5.26
kubernetes v1.5.2+43a9be4

How reproducible:
Create an OCP cluster following the Azure reference architecture (https://github.com/openshift/openshift-ansible-contrib/tree/master/reference-architecture/azure-ansible) and add logging.

Steps to Reproduce:
1. 
2.
3.

Actual results:
PVCs take a long time to attach/mount and the provisioning fails.

Expected results:
PVCs are attached/mounted quickly.

Master Log: attached

Node Log (of failed PODs): attached

PV Dump: attached

PVC Dump: attached

StorageClass Dump (if StorageClass used by PV/PVC): added before

Additional info:
I've found some upstream issues that may be related:
* https://github.com/colemickens/azure-kubernetes-status/issues/15
* https://github.com/Azure/acs-engine/issues/258
* https://github.com/kubernetes/kubernetes/pull/37845
* https://github.com/kubernetes/kubernetes/issues/32068

As you can see in the node logs, there are a lot of complaints about attaching/detaching the volume.

Also, it seems that upstream Kubernetes is improving the Azure storage code: https://github.com/kubernetes/kubernetes/pull/46360 (not merged yet).

Comment 1 Eduardo Minguez 2017-07-05 17:09:53 UTC
Created attachment 1294674 [details]
pv

Comment 2 Eduardo Minguez 2017-07-05 17:10:09 UTC
Created attachment 1294675 [details]
pvc

Comment 12 Hemant Kumar 2017-07-06 18:19:48 UTC
Also, can you post the problematic PVC that is not attaching?

Comment 13 Hemant Kumar 2017-07-06 18:22:13 UTC
Also, please add the PV & PVC info as before, along with the master-controller logs. Now that attach/detach has been moved to the controller, the controller logs should have clues about the error.
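
A sketch of how those logs can be collected on an OCP 3.5 RPM install (the HA service name atomic-openshift-master-controllers is assumed here; single-master clusters run atomic-openshift-master instead):

journalctl -u atomic-openshift-master-controllers --since "1 hour ago" > controllers.log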

Comment 14 Eduardo Minguez 2017-07-06 18:26:01 UTC
Created attachment 1295006 [details]
pv & pvc

Comment 15 Eduardo Minguez 2017-07-06 18:27:31 UTC
Created attachment 1295008 [details]
controllers

Comment 25 Mike Hepburn 2017-07-17 07:22:37 UTC
I found this to still be a problem for dynamic provisioning, using the ref. arch. Ansible installer and a deployment on Azure from a day ago using OCP:

oc v3.5.5.31
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ocpdemo.australiasoutheast.cloudapp.azure.com:8443
openshift v3.5.5.31
kubernetes v1.5.2+43a9be4

I fixed it by running:

oc annotate $(oc get node --no-headers -o name) volumes.kubernetes.io/controller-managed-attach-detach="true"

and restarting all node processes in the cluster.
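
A quick way to verify the annotation took effect (using one of the node names from the logs below):

oc get node infranode3 -o yaml | grep controller-managed-attach-detach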

I assume this Ansible argument would also have sufficed:

[OSEv3:vars]
openshift_node_kubelet_args={'enable-controller-attach-detach': ['true'] }

It also took some time (~7 min) to format the 10Gi disk partition, as raised elsewhere:

Jul 17 06:49:48 infranode3 atomic-openshift-node: I0717 06:49:48.563819    1583 mount_linux.go:364] Disk "/dev/sde" appears to be unformatted, attempting to format as type: "ext4" with options: [-E lazy_itable_init=0,lazy_journal_init=0 -F /dev/sde]
Jul 17 06:56:45 infranode3 atomic-openshift-node: I0717 06:56:45.753545    1583 mount_linux.go:369] Disk successfully formatted (mkfs): ext4 - /dev/sde /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/azure-disk/mounts/kubernetes-dynamic-pvc-03559b41-6abc-11e7-b9f2-000d3ae0ec80.vhd

but it seems to work OK after that. Hope that helps.

Comment 27 Hemant Kumar 2017-07-17 16:43:11 UTC
The PR changing the mkfs arguments has been merged: https://github.com/openshift/ose/pull/799. We can expect mkfs times to go down once the new build is released.
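
For illustration, based on the node log in comment 25, the change amounts to dropping the extended options from the format command (device path taken from that log):

# before the fix: forces full inode table and journal initialization at format time
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -F /dev/sde

# after the fix: ext4's default lazy initialization defers that work
mkfs.ext4 -F /dev/sde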

Comment 29 Wenqi He 2017-08-03 06:00:50 UTC
Tested this on the version below:
openshift v3.5.5.31.6
kubernetes v1.5.2+43a9be4

Created a 10Gi dynamic PVC and a pod to consume it; the pod was running within 3 minutes.
$ oc get pvc
NAME      STATUS    VOLUME                                     CAPACITY   ACCESSMODES   STORAGECLASS   AGE
azpvc     Bound     pvc-3ee2d76b-780f-11e7-a47b-000d3a1b87c9   10Gi       RWO           sc-qb6s8       1m
$ oc get pods
NAME      READY     STATUS              RESTARTS   AGE
azpvcpo   1/1       Running              0          3m
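
For reference, the consuming pod can be as simple as the following sketch (claim and pod names taken from the output above; container name, image, and mount path are hypothetical):

kind: Pod
apiVersion: v1
metadata:
  name: azpvcpo
spec:
  containers:
    - name: app                                  # hypothetical container name
      image: registry.access.redhat.com/rhel7   # hypothetical image
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /mnt/azure                 # hypothetical mount path
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: azpvc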

Thanks.

Comment 31 errata-xmlrpc 2017-08-31 17:00:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1828

