Bug 1863009 - vSphere provision failure on ocp46
Summary: vSphere provision failure on ocp46
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Danil Grigorev
QA Contact: Milind Yadav
URL:
Whiteboard:
Depends On:
Blocks: 1806034
 
Reported: 2020-08-03 13:25 UTC by Wei Duan
Modified: 2020-10-27 16:22 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:22:35 UTC
Target Upstream Version:
Embargoed:


Attachments
kcm-log (48.88 KB, application/gzip)
2020-08-03 13:25 UTC, Wei Duan
no flags
kcm-log-2 (52.78 KB, application/gzip)
2020-08-06 08:02 UTC, Wei Duan
no flags
kcm-2.log_20200811174916 (289.89 KB, application/x-shellscript)
2020-08-11 10:09 UTC, Wei Duan
no flags


Links
System ID Private Priority Status Summary Last Updated
Github kubernetes kubernetes pull 93971 0 None closed Refactor locks logic on registeredNodesLock to be non-blocking 2021-01-26 11:17:48 UTC
Github openshift kubernetes pull 319 0 None closed BUG 1863009: vSphere provision failure on ocp46 2021-01-26 11:17:47 UTC
Github openshift origin pull 25427 0 None closed BUG 1863009: vSphere provision failure on ocp46 2021-01-26 11:17:48 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:22:58 UTC

Internal Links: 1821280

Description Wei Duan 2020-08-03 13:25:29 UTC
Created attachment 1703295 [details]
kcm-log

Description of problem:
vSphere provisioning doesn't work. When checking the kcm log, we see the following error messages at the beginning:

E0803 00:55:03.056327       1 nodemanager.go:95] Node Discovery failed to get node uuid for node wduan0803b-jrsmj-master-0 with error: could not parse pre-release/metadata (-master+$Format:%h$) in version "v0.0.0-master+$Format:%h$"
I0803 00:55:03.056535       1 controllermanager.go:570] Started "replicationcontroller"
I0803 00:55:03.056576       1 controllermanager.go:555] Starting "daemonset"
E0803 00:55:03.056335       1 vsphere.go:1506] failed to add node &Node{ObjectMeta:{wduan0803b-jrsmj-master-0   /api/v1/nodes/wduan0803b-jrsmj-master-0 06643a06-6ccc-4260-9f2b-3475d4ebf24b 27297 0 2020-08-03 00:38:46 +0000 UTC <nil> <nil> map[beta.kubernetes.io/arch:amd64 beta.kubernetes.io/os:linux kubernetes.io/arch:amd64 kubernetes.io/hostname:wduan0803b-jrsmj-master-0 kubernetes.io/os:linux node-role.kubernetes.io/master: node.openshift.io/os_id:rhcos] map[machine.openshift.io/machine:openshift-machine-api/wduan0803b-jrsmj-master-0 machineconfiguration.openshift.io/currentConfig:rendered-master-f9b2805db7c6d6af2263b9b0c5f8ba1e machineconfiguration.openshift.io/desiredConfig:rendered-master-f9b2805db7c6d6af2263b9b0c5f8ba1e machineconfiguration.openshift.io/reason: machineconfiguration.openshift.io/state:Done volumes.kubernetes.io/controller-managed-attach-detach:true] [] []
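
For context, "v0.0.0-master+$Format:%h$" is the unexpanded git-archive placeholder ($Format:%h$) leaking into the kubelet version at build time, and it is not valid semver metadata. A minimal sketch of the parse failure, assuming the cloud provider parses the kubelet version with k8s.io/apimachinery's version helpers (which appear to be the source of this error text):

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/version"
)

func main() {
	// The unexpanded build placeholder seen in the node status above.
	v := "v0.0.0-master+$Format:%h$"
	if _, err := version.ParseSemantic(v); err != nil {
		// Prints: could not parse pre-release/metadata (-master+$Format:%h$)
		// in version "v0.0.0-master+$Format:%h$"
		fmt.Println(err)
	}
}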


We also get an incorrect version string from "oc get node":
[wduan@MINT 01_general]$ oc get node
NAME                            STATUS   ROLES    AGE     VERSION
wduan0803b-jrsmj-master-0       Ready    master   8h      v0.0.0-master+$Format:%h$
wduan0803b-jrsmj-master-1       Ready    master   8h      v0.0.0-master+$Format:%h$
wduan0803b-jrsmj-master-2       Ready    master   8h      v0.0.0-master+$Format:%h$
wduan0803b-jrsmj-worker-bfbll   Ready    worker   7h58m   v0.0.0-master+$Format:%h$
wduan0803b-jrsmj-worker-czrqp   Ready    worker   7h58m   v0.0.0-master+$Format:%h$


Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-02-091622

How reproducible:
Always

Steps to Reproduce:
1. Check the kcm log

Actual results:
The "Node Discovery failed" error appears in the kcm log.

Expected results:
No such error should appear.

Additional info:

Comment 3 Wei Duan 2020-08-06 08:02:54 UTC
Created attachment 1710626 [details]
kcm-log-2

Comment 4 Wei Duan 2020-08-06 08:05:28 UTC
Still reproduced in the latest nightly (both IPI and UPI).
We found some suspicious error messages:

Node Discovery failed to get node uuid for node xxx with error: could not parse pre-release/metadata (-master+$Format:%h$) in version "v0.0.0-master+$Format:%h$"

E0806 02:24:18.211296       1 connection.go:65] Failed to create govmomi client. err: ServerFaultCode: Cannot complete login due to an incorrect user name or password.

W0806 02:24:28.907766       1 endpointslice_controller.go:284] Error syncing endpoint slices for service "openshift-kube-controller-manager/kube-controller-manager", retrying. Error: Error updating kube-controller-manager-hk6xs EndpointSlice for Service openshift-kube-controller-manager/kube-controller-manager: Operation cannot be fulfilled on endpointslices.discovery.k8s.io "kube-controller-manager-hk6xs": the object has been modified; please apply your changes to the latest version and try again


During PV provisioning, we sometimes see errors like:
W0806 06:09:30.589954       1 connection.go:79] Creating new client session since the existing session is not valid or not authenticated
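
For reference, the "incorrect user name or password" ServerFaultCode is what vCenter returns on a failed login, and the warning above shows the provider discarding the stale session and building a new client. A minimal govmomi sketch (hypothetical endpoint and credentials) that exercises the same login path:

package main

import (
	"context"
	"fmt"
	"net/url"

	"github.com/vmware/govmomi"
)

func main() {
	ctx := context.Background()
	// Hypothetical vCenter URL; the embedded userinfo is used for login.
	u, err := url.Parse("https://user:wrong-password@vcenter.example.com/sdk")
	if err != nil {
		panic(err)
	}
	// NewClient connects and logs in; bad credentials surface as a
	// ServerFaultCode error like the one in the kcm log.
	if _, err := govmomi.NewClient(ctx, u, true); err != nil {
		fmt.Println("Failed to create govmomi client. err:", err)
	}
}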

Comment 5 Wei Duan 2020-08-11 10:07:05 UTC
After the kube version issue was resolved, we still hit the volume provisioning issue on both vSphere 6.5 and vSphere 6.7.

[wduan@MINT 01_general]$ oc describe pvc 
Name:          mypvc01
Namespace:     wduan
StorageClass:  thin
Status:        Pending
Volume:        
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/vsphere-volume
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      
Access Modes:  
VolumeMode:    Filesystem
Mounted By:    mypod01
Events:        <none>

In the kcm log, there is no response to the create request:
I0811 09:33:20.709535       1 vsphere.go:1235] Starting to create a vSphere volume with volumeOptions: &{CapacityKB:3145728 Tags:map[kubernetes.io/created-for/pv/name:pvc-767fe342-cbe4-4819-a62f-8953deb696c9 kubernetes.io/created-for/pvc/name:mypvc01 kubernetes.io/created-for/pvc/namespace:wduan] Name:qe-yapei46vs11-9t8rm-dynamic-pvc-767fe342-cbe4-4819-a62f-8953deb696c9 DiskFormat:thin Datastore: VSANStorageProfileData: StoragePolicyName: StoragePolicyID: SCSIControllerType: Zone:[] SelectedNode:nil}
I0811 09:33:20.763690       1 vsphere.go:1287] Volume topology : []

Comment 6 Wei Duan 2020-08-11 10:08:19 UTC
I changed Component back to storage.

Comment 7 Wei Duan 2020-08-11 10:09:06 UTC
Created attachment 1711059 [details]
kcm-2.log_20200811174916

Comment 9 Hemant Kumar 2020-08-13 15:45:52 UTC
I took a goroutine dump of the controller-managers that were stuck in volume provisioning, and I think we have a deadlock in the code:

goroutine 594 [semacquire, 79 minutes]:                                                                                                       
sync.runtime_SemacquireMutex(0xc000e7826c, 0x40d400, 0x0)                                                                                     
        /opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/sema.go:71 +0x47                                                                                                                                                                                            
sync.(*RWMutex).RLock(...)                                                                                                                                                                                                                                                                  
        /opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/sync/rwmutex.go:50                                    
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.(*NodeManager).GetNode(0xc000e78240, 0xc000d4a320, 0x1a, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/nodemanager.go:253 +0x158      
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.(*NodeManager).RediscoverNode(0xc000e78240, 0xc000d4a320, 0x1a, 0x0, 0xc001e825c0)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/nodemanager.go:244 +0x9c                                                                                                                                              
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.(*NodeManager).getRefreshedNodeInfo(0xc000e78240, 0xc000d4a320, 0x1a, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/nodemanager.go:327 +0x458
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.(*NodeManager).refreshNodes(0xc000e78240, 0x0, 0x0, 0x0)        
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/nodemanager.go:311 +0x166                                                                          
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.(*VSphere).refreshNodesForSecretChange(0xc000b6cb00)                         
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/vsphere.go:1566 +0x33
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.(*VSphere).SecretAdded(0xc000b6cb00, 0x457fcc0, 0xc00252a080)                
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/vsphere.go:1538 +0xe4
k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)                              
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/controller.go:227
k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache.(*processorListener).run.func1()                                   
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:777 +0x129
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc001398f60)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001eabf60, 0x4dc5bc0, 0xc001000750, 0x3c22501, 0xc0000b6d80)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xa3
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001398f60, 0x3b9aca00, 0x0, 0xc00049c501, 0xc0000b6d80)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
--                                  
                                                                                                                                              
goroutine 14153 [semacquire, 76 minutes]:
sync.runtime_SemacquireMutex(0xc000e78264, 0x0, 0x1)
        /opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc000e78260)
        /opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
        /opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/sync/mutex.go:81
sync.(*RWMutex).Lock(0xc000e78260)
        /opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/sync/rwmutex.go:98 +0x97
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.(*NodeManager).GetNodeDetails(0xc000e78240, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/nodemanager.go:291 +0x80
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.getSharedDatastoresInK8SCluster(0x4e3eae0, 0xc00b2cacc0, 0xc000e78240, 0x0, 0x45ec3f8, 0x14, 0xc00b4d2e70, 0x1)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/vsphere_util.go:130 +0x43
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.(*VSphere).CreateVolume.func1(0xc00b2dec80, 0x0, 0x0, 0x0, 0x0)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/vsphere.go:1377 +0x17a1
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.(*VSphere).CreateVolume(0xc000b6cb00, 0xc00b2dec80, 0x74db420, 0x0, 0x1, 0xc00b2a9f50)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/vsphere.go:1453 +0xe4
k8s.io/kubernetes/pkg/volume/vsphere_volume.(*VsphereDiskUtil).CreateVolume(0x74d9b38, 0xc00b28be30, 0x0, 0x74d9b38, 0x0, 0x0, 0xc00b417740, 0x2, 0x2)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/volume/vsphere_volume/vsphere_volume_util.go:142 +0xbf2
k8s.io/kubernetes/pkg/volume/vsphere_volume.(*vsphereVolumeProvisioner).Provision(0xc00b28be30, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00b2a9f50, 0x28)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/volume/vsphere_volume/vsphere_volume.go:382 +0x295
k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).provisionClaimOperation(0xc0008e21c0, 0xc00b3476c0, 0x7fdc57537c20, 0xc004b99f40, 0xc007ede600, 0x45c78b5, 0x9, 0x0, 0x0)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/volume/persistentvolume/pv_controller.go:1555 +0xfec
k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).provisionClaim.func1(0x99e752, 0xc008590180)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/volume/persistentvolume/pv_controller.go:1447 +0x204
k8s.io/kubernetes/pkg/util/goroutinemap.(*goRoutineMap).Run.func1(0xc000e694d0, 0xc00b372740, 0x40, 0xc00b37ca50, 0x0, 0x0)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/goroutinemap/goroutinemap.go:115 +0xe5
created by k8s.io/kubernetes/pkg/util/goroutinemap.(*goRoutineMap).Run
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/goroutinemap/goroutinemap.go:108 +0x2e7


I think we have a problem introduced when we merged https://github.com/kubernetes/kubernetes/pull/90836. When GetNodeDetails in nodemanager.go is called, it acquires a lock on registeredNodesLock, but the same code path ends up calling GetNode, which acquires registeredNodesLock again. The problem is that GetNode is called in a loop and tries to acquire the lock for each node it processes, so two concurrent operations can end up deadlocked against each other.
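
To make the failure mode concrete, here is a minimal sketch (hypothetical structure, not the actual NodeManager code) of the pattern in the two traces above: Go's sync.RWMutex blocks new readers once a writer is queued, so a goroutine that re-acquires a read lock it already holds can stall behind a pending writer, and both sides wait forever:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var registeredNodesLock sync.RWMutex

	// Plays the role of refreshNodes -> GetNode: takes a read lock,
	// then takes it again via a nested call.
	go func() {
		registeredNodesLock.RLock()
		defer registeredNodesLock.RUnlock()
		time.Sleep(100 * time.Millisecond) // let the writer queue up
		registeredNodesLock.RLock()        // nested RLock: blocks behind the pending writer
		defer registeredNodesLock.RUnlock()
		fmt.Println("never reached")
	}()

	time.Sleep(50 * time.Millisecond)
	// Plays the role of GetNodeDetails: waits for the write lock while
	// the reader above still holds its first RLock.
	registeredNodesLock.Lock()
	defer registeredNodesLock.Unlock()
	fmt.Println("never reached either")
	// The Go runtime aborts with "all goroutines are asleep - deadlock!".
}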

Comment 12 Danil Grigorev 2020-08-18 11:21:36 UTC
Waiting on the upstream PR to merge so we can verify the fix: https://github.com/kubernetes/kubernetes/pull/93971
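
For illustration, the non-blocking refactor named in the PR title generally means not holding registeredNodesLock across the nested per-node calls. A sketch of that pattern under assumed, simplified types (not the actual patch):

package main

import (
	"fmt"
	"sync"
)

// Simplified stand-ins for the real NodeManager types.
type NodeInfo struct{ Name string }

type NodeManager struct {
	registeredNodesLock sync.RWMutex
	registeredNodes     map[string]*NodeInfo
}

// Snapshot the registered nodes under the lock, release it, and only
// then do the per-node work, so nested helpers that also take
// registeredNodesLock can no longer deadlock against this path.
func (nm *NodeManager) GetNodeDetails() []*NodeInfo {
	nm.registeredNodesLock.RLock()
	snapshot := make([]*NodeInfo, 0, len(nm.registeredNodes))
	for _, n := range nm.registeredNodes {
		snapshot = append(snapshot, n)
	}
	nm.registeredNodesLock.RUnlock()

	// Per-node refresh/discovery happens here, outside the lock.
	return snapshot
}

func main() {
	nm := &NodeManager{registeredNodes: map[string]*NodeInfo{"node-a": {Name: "node-a"}}}
	fmt.Println(len(nm.GetNodeDetails()), "node(s)")
}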

Comment 13 Joel Speed 2020-08-20 12:08:02 UTC
We will need to cherry-pick the fix from upstream before we can close this bug; deferring to the next sprint to do this.

Comment 16 Wei Duan 2020-09-02 08:03:19 UTC
Verification passed on 4.6.0-0.nightly-2020-08-31-220837.

[wduan@MINT 01_general]$ oc get pod,pvc
NAME          READY   STATUS    RESTARTS   AGE
pod/mypod01   1/1     Running   0          60s

NAME                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/mypvc01   Bound    pvc-e39f0c9e-5802-48c6-9992-0d5c0b3543db   3Gi        RWO            thin           60s



[wduan@MINT 01_general]$ oc get pod mypod01 -ojson | jq .spec.volumes
[
  {
    "name": "local",
    "persistentVolumeClaim": {
      "claimName": "mypvc01"
    }
  },
  {
    "name": "default-token-km4bm",
    "secret": {
      "defaultMode": 420,
      "secretName": "default-token-km4bm"
    }
  }
]

Comment 18 errata-xmlrpc 2020-10-27 16:22:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

