Bug 1879944

Summary: [OCP 4.8] Slow PV creation with vSphere

Product: OpenShift Container Platform
Component: Storage
Storage sub component: Kubernetes
Version: 4.5
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Reporter: Mario Abajo <mabajodu>
Assignee: Hemant Kumar <hekumar>
QA Contact: Wei Duan <wduan>
CC: aos-bugs, hekumar, igreen, jsafrane, mfojtik, wduan
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Keywords: Reopened
Flags: igreen: needinfo+, igreen: needinfo?
Clone Of: --- (cloned as 1951726; view as bug list)
Bug Blocks: 1951726
Type: Bug
Last Closed: 2021-07-27 22:33:27 UTC

Description Mario Abajo 2020-09-17 12:12:16 UTC
Description of problem:
The customer updated from OpenShift 4.3.5 to 4.3.28 and has since been complaining about PVs taking 5 to 10 minutes to be created and bound.
In one of their clusters they tried the CSI driver from VMware, and it managed to create the volume in seconds.
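
For context, a StorageClass backed by the VMware vSphere CSI driver is typically along these lines (a minimal sketch; the StorageClass name is illustrative, and environment-specific parameters such as a datastore URL or storage policy are omitted):

~~~
# Minimal sketch of a StorageClass for the VMware vSphere CSI driver.
# "vsphere-csi" is an illustrative name; csi.vsphere.vmware.com is the
# driver's standard provisioner name. Environment-specific parameters
# (datastore URL, storage policy) are intentionally left out.
cat <<'EOF' | oc apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsphere-csi
provisioner: csi.vsphere.vmware.com
EOF
~~~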

Version-Release number of selected component (if applicable):

OpenShift 4.3.28
vSphere 6.7u3

How reproducible:
Always

Steps to Reproduce:
1. Define a PVC that uses a vSphere StorageClass (a sketch follows these steps)
2. Wait for 5 to 10 minutes
3. The PV is created and bound
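
A minimal sketch of such a claim, assuming an existing vSphere StorageClass (here named "thin" for illustration; the 2Gi size matches the 2097152 KB claim seen in the logs below):

~~~
# Hypothetical PVC reproducer; "thin" is a placeholder for any StorageClass
# backed by the in-tree kubernetes.io/vsphere-volume provisioner.
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-volume2
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: thin
  resources:
    requests:
      storage: 2Gi
EOF

# Watch how long the claim stays Pending before it turns Bound.
oc get pvc test-volume2 -n default -w
~~~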

Actual results:

Logs from the kube-controller-manager:
~~~
I0915 16:29:26.197884       1 vsphere.go:1173] Starting to create a vSphere volume with volumeOptions: &{CapacityKB:2097152 Tags:map[kubernetes.io/created-for/pv/name:pvc-089a7d35-eae9-480f-a8db-4f3440474847 kubernetes.io/created-for/pvc/name:test-volume2 kubernetes.io/created-for/pvc/namespace:default] Name:ilocpces401-rq4qs-dynamic-pvc-089a7d35-eae9-480f-a8db-4f3440474847 DiskFormat:zeroedthick Datastore:CSSA-IBOX2998-IL-OCP4-04-13-DYN-CES401 VSANStorageProfileData: StoragePolicyName: StoragePolicyID: SCSIControllerType: Zone:[] SelectedNode:nil}
^^ Here volume creation begins, with the name "ilocpces401-rq4qs-dynamic-pvc-089a7d35-eae9-480f-a8db-4f3440474847"

I0915 16:29:27.403350       1 vsphere.go:1225] Volume topology : []

I0915 16:36:38.143990       1 vsphere_volume_util.go:158] Successfully created vsphere volume ilocpces401-rq4qs-dynamic-pvc-089a7d35-eae9-480f-a8db-4f3440474847
^^ And here, about 7 minutes later, we see the volume created; this is when vSphere notifies us that the volume has been created

I0915 16:36:38.149080       1 pv_controller.go:1559] volume "pvc-089a7d35-eae9-480f-a8db-4f3440474847" provisioned for claim "default/test-volume2"
I0915 16:36:38.149266       1 event.go:255] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test-volume2", UID:"089a7d35-eae9-480f-a8db-4f3440474847", APIVersion:"v1", ResourceVersion:"123010048", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-089a7d35-eae9-480f-a8db-4f3440474847 using kubernetes.io/vsphere-volume
I0915 16:36:38.154599       1 pv_controller.go:798] volume "pvc-089a7d35-eae9-480f-a8db-4f3440474847" entered phase "Bound"
I0915 16:36:38.154626       1 pv_controller.go:901] volume "pvc-089a7d35-eae9-480f-a8db-4f3440474847" bound to claim "default/test-volume2"
^^ All of this is the "discovery" of the volume as part of the cluster and its binding to the PVC
~~~
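
The interesting part is the gap between "Starting to create" at 16:29:26 and "Successfully created" at 16:36:38, roughly 7 minutes spent inside the vSphere create call. A quick way to pull those two markers out of a saved controller log (the file name is illustrative):

~~~
# Extract the provisioning start/end markers from a saved controller log;
# kube-controller-manager.log is a placeholder for however the log was captured.
grep -E 'Starting to create a vSphere volume|Successfully created vsphere volume' \
    kube-controller-manager.log
~~~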

Expected results:
Volume creation back in the "seconds" range, on par with the CSI driver, if possible.

Additional info:
We are trying to gather logs and events on the vSphere side. The impact of the issue is high.
We have increased the log level on the kube-controller-manager using this:

$ oc patch kubecontrollermanager cluster -p='{"spec": { "logLevel": "debug"}}' --type=merge
$ oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "loglevel change-'"$( date --rfc-3339=ns )"'"}}' --type=merge
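
Once the operator has rolled the static pods out with the new log level, the logs can be pulled per master, e.g. as below (the pod name is a placeholder; the container flag assumes the usual multi-container layout of those pods):

~~~
# List the controller-manager pods, then fetch the main container's log;
# kube-controller-manager-master-0 is a placeholder for a real pod name.
oc get pods -n openshift-kube-controller-manager
oc logs -n openshift-kube-controller-manager \
    kube-controller-manager-master-0 -c kube-controller-manager
~~~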

Is there any other way to increase the log level or to get more information from the OpenShift side?

Comment 1 Maciej Szulik 2020-09-17 12:29:23 UTC
Sending over to storage team.

Comment 7 Jan Safranek 2020-10-23 14:25:10 UTC
We did not get the kubelet logs or the kubelet goroutine dump, so we are unable to diagnose this issue further. Feel free to reopen the bug once you have the requested information.

Comment 56 Wei Duan 2021-04-27 01:49:57 UTC
Marked as "VERIFIED" to go ahead for 4.7.

Comment 59 errata-xmlrpc 2021-07-27 22:33:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438