Created attachment 1294672 [details]
Bunch of oc logs, oc describe, oc events, etc.; infranode and master logs

Description of problem:
Trying to deploy aggregated logging in OCP fails due to timeouts in Azure disk tasks. Azure disks are created with a StorageClass as follows:

kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: slow
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "true"
    volume.beta.kubernetes.io/storage-class: slow
    volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/azure-disk
provisioner: kubernetes.io/azure-disk
parameters:
  storageAccount: sapv1emingueztst

The PVs and PVCs seem to be OK, but the events contain many timeout errors like:

2017-07-05 16:38:39 +0000 UTC   2017-07-05 16:34:13 +0000 UTC   3   logging-es-3typqox8-1-680vl   Pod   Warning   FailedMount   {kubelet infranode2}   Unable to mount volumes for pod "logging-es-3typqox8-1-680vl_logging(41013793-619f-11e7-98d7-000d3a02f8d5)": timeout expired waiting for volumes to attach/mount for pod "logging"/"logging-es-3typqox8-1-680vl". list of unattached/unmounted volumes=[elasticsearch-storage]

Version-Release number of selected component (if applicable):
oc v3.5.5.26
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://emingueztst.eastus2.cloudapp.azure.com:8443
openshift v3.5.5.26
kubernetes v1.5.2+43a9be4

How reproducible:
Create an OCP cluster following the Azure reference architecture (https://github.com/openshift/openshift-ansible-contrib/tree/master/reference-architecture/azure-ansible), adding the logging components.

Actual results:
PVCs take a long time to attach/mount, and provisioning fails.

Expected results:
PVCs are attached/mounted quickly.
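For reference, a minimal PVC that exercises this StorageClass would look roughly like the sketch below (the claim name and 10Gi size are illustrative, not taken from this report; on kube 1.5 the class is selected via the beta annotation rather than `storageClassName`):

```yaml
# Illustrative only — name and size are hypothetical.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: logging-es-pvc
  annotations:
    # Pre-1.6 style class selection; matches the "slow" class defined above.
    volume.beta.kubernetes.io/storage-class: slow
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```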
Master Log: attached
Node Log (of failed PODs): attached
PV Dump: attached
PVC Dump: attached
StorageClass Dump (if StorageClass used by PV/PVC): added before

Additional info:
I've found some upstream issues that may be related:
* https://github.com/colemickens/azure-kubernetes-status/issues/15
* https://github.com/Azure/acs-engine/issues/258
* https://github.com/kubernetes/kubernetes/pull/37845
* https://github.com/kubernetes/kubernetes/issues/32068

As you can see in the node logs, it complains a lot about attaching/detaching the volume. Also, it seems that upstream Kubernetes is improving the Azure storage code in https://github.com/kubernetes/kubernetes/pull/46360 (not merged yet).
Created attachment 1294674 [details] pv
Created attachment 1294675 [details] pvc
Also, can you post the problematic PVC that is not attaching?
Also, please add the PV & PVC info as before, plus the master-controller logs. Now that attach/detach has been moved to the controller, the controller logs should contain clues about the error.
Created attachment 1295006 [details] pv & pvc
Created attachment 1295008 [details] controllers
I found this to be still a problem for dynamic provisioning, using the ref. arch. ansible installer and a deployment on Azure from a day ago, with OCP:

oc v3.5.5.31
kubernetes v1.5.2+43a9be4
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://ocpdemo.australiasoutheast.cloudapp.azure.com:8443
openshift v3.5.5.31
kubernetes v1.5.2+43a9be4

I fixed it by running:

oc annotate $(oc get node --no-headers -o name) volumes.kubernetes.io/controller-managed-attach-detach="true"

and restarting all node processes in the cluster. I assume this ansible argument would also have sufficed:

[OSEv3:vars]
openshift_node_kubelet_args={'enable-controller-attach-detach': ['true'] }

Also, it took some time (~7 min) to format a 10Gi disk partition, as raised elsewhere:

Jul 17 06:49:48 infranode3 atomic-openshift-node: I0717 06:49:48.563819 1583 mount_linux.go:364] Disk "/dev/sde" appears to be unformatted, attempting to format as type: "ext4" with options: [-E lazy_itable_init=0,lazy_journal_init=0 -F /dev/sde]
Jul 17 06:56:45 infranode3 atomic-openshift-node: I0717 06:56:45.753545 1583 mount_linux.go:369] Disk successfully formatted (mkfs): ext4 - /dev/sde /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/azure-disk/mounts/kubernetes-dynamic-pvc-03559b41-6abc-11e7-b9f2-000d3ae0ec80.vhd

But it seems to work OK after that. Hope that helps.
The PR changing the mkfs arguments has been merged: https://github.com/openshift/ose/pull/799 We can expect mkfs times to go down once a new build is released.
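For context (my reading of the node log lines above, not verified against the PR itself): the slow format appears to come from `-E lazy_itable_init=0,lazy_journal_init=0`, which forces mkfs to write every inode table and the journal up front. With ext4's default lazy initialization, mkfs returns almost immediately and the kernel finishes initializing the tables in the background after mount. A sketch of the two invocations, assuming the merged PR drops the eager-init options:

```shell
# Eager init (as seen in the node log above) — mkfs zeroes all inode
# tables before returning; took ~7 min on a 10Gi Azure data disk:
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -F /dev/sde

# Lazy init (assumed post-fix behaviour) — mkfs returns quickly and
# ext4 initializes inode tables in the background after first mount:
mkfs.ext4 -F /dev/sde
```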
Tested this on the version below:

openshift v3.5.5.31.6
kubernetes v1.5.2+43a9be4

Created a 10Gi dynamic PVC and a pod to consume it; the pod was running within 3 minutes.

$ oc get pvc
NAME    STATUS   VOLUME                                     CAPACITY   ACCESSMODES   STORAGECLASS   AGE
azpvc   Bound    pvc-3ee2d76b-780f-11e7-a47b-000d3a1b87c9   10Gi       RWO           sc-qb6s8       1m

$ oc get pods
NAME      READY   STATUS    RESTARTS   AGE
azpvcpo   1/1     Running   0          3m

Thanks.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1828