Created attachment 1703295 [details]
kcm-log

Description of problem:
vSphere volume provisioning doesn't work. When checking the kube-controller-manager (KCM) log, we see the following errors near the beginning:

198 E0803 00:55:03.056327 1 nodemanager.go:95] Node Discovery failed to get node uuid for node wduan0803b-jrsmj-master-0 with error: could not parse pre-release/metadata (-master+$Format:%h$) in version "v0.0.0-master+$Format:%h$"
199 I0803 00:55:03.056535 1 controllermanager.go:570] Started "replicationcontroller"
200 I0803 00:55:03.056576 1 controllermanager.go:555] Starting "daemonset"
201 E0803 00:55:03.056335 1 vsphere.go:1506] failed to add node &Node{ObjectMeta:{wduan0803b-jrsmj-master-0 /api/v1/nodes/wduan0803b-jrsmj-master-0 06643a06-6ccc-4260-9f2b-3475d4ebf24b 27297 0 2020-08-03 00:38:46 +0000 UTC <nil> <nil> map[beta.kubernetes.io/arch:amd64 beta.kubernetes.io/os:linux kubernetes.io/arch:amd64 kubernetes.io/hostname:wduan0803b-jrsmj-master-0 kubernetes.io/os:linux node-role.kubernetes.io/master: node.openshift.io/os_id:rhcos] map[machine.openshift.io/machine:openshift-machine-api/wduan0803b-jrsmj-master-0 machineconfiguration.openshift.io/currentConfig:rendered-master-f9b2805db7c6d6af2263b9b0c5f8ba1e machineconfiguration.openshift.io/desiredConfig:rendered-master-f9b2805db7c6d6af2263b9b0c5f8ba1e machineconfiguration.openshift.io/reason: machineconfiguration.openshift.io/state:Done volumes.kubernetes.io/controller-managed-attach-detach:true] [] []

We also get an incorrect version string from "oc get node":

[wduan@MINT 01_general]$ oc get node
NAME                            STATUS   ROLES    AGE     VERSION
wduan0803b-jrsmj-master-0       Ready    master   8h      v0.0.0-master+$Format:%h$
wduan0803b-jrsmj-master-1       Ready    master   8h      v0.0.0-master+$Format:%h$
wduan0803b-jrsmj-master-2       Ready    master   8h      v0.0.0-master+$Format:%h$
wduan0803b-jrsmj-worker-bfbll   Ready    worker   7h58m   v0.0.0-master+$Format:%h$
wduan0803b-jrsmj-worker-czrqp   Ready    worker   7h58m   v0.0.0-master+$Format:%h$

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-02-091622

How reproducible:
Always

Steps to Reproduce:
1. Check the kcm log
2.
3.

Actual results:
"Node Discovery failed" errors appear in the log

Expected results:
No such errors should appear

Additional info:
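For reference, "$Format:%h$" is the git-archive export placeholder for the short commit hash; it apparently was never substituted at build time, so the reported kubelet version is not a parsable semver string. A quick sketch of why parsing rejects it, using golang.org/x/mod/semver purely for illustration (this is not the exact parser the vSphere node manager uses):

// version_check.go - illustration only; assumes golang.org/x/mod/semver,
// not the parser actually used inside the vSphere cloud provider.
package main

import (
	"fmt"

	"golang.org/x/mod/semver"
)

func main() {
	// The unsubstituted git-archive placeholder reported by the kubelet.
	bad := "v0.0.0-master+$Format:%h$"
	// What the version should look like once the hash is substituted.
	good := "v0.0.0-master+abc1234"

	// '$', ':' and '%' are not legal in semver pre-release/build metadata,
	// so the placeholder version cannot be parsed.
	fmt.Println(semver.IsValid(bad))  // false
	fmt.Println(semver.IsValid(good)) // true
}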
Created attachment 1710626 [details] kcm-log-2
Still reproduced in the latest nightly (both IPI and UPI). We see some suspicious error messages:

Node Discovery failed to get node uuid for node xxx with error: could not parse pre-release/metadata (-master+$Format:%h$) in version "v0.0.0-master+$Format:%h$"

E0806 02:24:18.211296 1 connection.go:65] Failed to create govmomi client. err: ServerFaultCode: Cannot complete login due to an incorrect user name or password.

W0806 02:24:28.907766 1 endpointslice_controller.go:284] Error syncing endpoint slices for service "openshift-kube-controller-manager/kube-controller-manager", retrying. Error: Error updating kube-controller-manager-hk6xs EndpointSlice for Service openshift-kube-controller-manager/kube-controller-manager: Operation cannot be fulfilled on endpointslices.discovery.k8s.io "kube-controller-manager-hk6xs": the object has been modified; please apply your changes to the latest version and try again

During PV provisioning, we sometimes see errors like:

W0806 06:09:30.589954 1 connection.go:79] Creating new client session since the existing session is not valid or not authenticated
After the kube version issue was resolved, we still hit the volume provisioning issue on both vSphere 6.5 and vSphere 6.7.

[wduan@MINT 01_general]$ oc describe pvc
Name:          mypvc01
Namespace:     wduan
StorageClass:  thin
Status:        Pending
Volume:
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/vsphere-volume
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Mounted By:    mypod01
Events:        <none>

In the KCM log, the create request never gets a response:

I0811 09:33:20.709535 1 vsphere.go:1235] Starting to create a vSphere volume with volumeOptions: &{CapacityKB:3145728 Tags:map[kubernetes.io/created-for/pv/name:pvc-767fe342-cbe4-4819-a62f-8953deb696c9 kubernetes.io/created-for/pvc/name:mypvc01 kubernetes.io/created-for/pvc/namespace:wduan] Name:qe-yapei46vs11-9t8rm-dynamic-pvc-767fe342-cbe4-4819-a62f-8953deb696c9 DiskFormat:thin Datastore: VSANStorageProfileData: StoragePolicyName: StoragePolicyID: SCSIControllerType: Zone:[] SelectedNode:nil}
I0811 09:33:20.763690 1 vsphere.go:1287] Volume topology : []
I changed Component back to storage.
Created attachment 1711059 [details] kcm-2.log_20200811174916
I took a goroutine dump of the controller-managers that were stuck in volume provisioning, and I think we have a deadlock in the code:

goroutine 594 [semacquire, 79 minutes]:
sync.runtime_SemacquireMutex(0xc000e7826c, 0x40d400, 0x0)
        /opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/sema.go:71 +0x47
sync.(*RWMutex).RLock(...)
        /opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/sync/rwmutex.go:50
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.(*NodeManager).GetNode(0xc000e78240, 0xc000d4a320, 0x1a, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/nodemanager.go:253 +0x158
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.(*NodeManager).RediscoverNode(0xc000e78240, 0xc000d4a320, 0x1a, 0x0, 0xc001e825c0)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/nodemanager.go:244 +0x9c
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.(*NodeManager).getRefreshedNodeInfo(0xc000e78240, 0xc000d4a320, 0x1a, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/nodemanager.go:327 +0x458
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.(*NodeManager).refreshNodes(0xc000e78240, 0x0, 0x0, 0x0)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/nodemanager.go:311 +0x166
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.(*VSphere).refreshNodesForSecretChange(0xc000b6cb00)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/vsphere.go:1566 +0x33
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.(*VSphere).SecretAdded(0xc000b6cb00, 0x457fcc0, 0xc00252a080)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/vsphere.go:1538 +0xe4
k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/controller.go:227
k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache.(*processorListener).run.func1()
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/client-go/tools/cache/shared_informer.go:777 +0x129
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc001398f60)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001eabf60, 0x4dc5bc0, 0xc001000750, 0x3c22501, 0xc0000b6d80)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0xa3
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc001398f60, 0x3b9aca00, 0x0, 0xc00049c501, 0xc0000b6d80)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
--
goroutine 14153 [semacquire, 76 minutes]:
sync.runtime_SemacquireMutex(0xc000e78264, 0x0, 0x1)
        /opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc000e78260)
        /opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
        /opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/sync/mutex.go:81
sync.(*RWMutex).Lock(0xc000e78260)
        /opt/rh/go-toolset-1.14/root/usr/lib/go-toolset-1.14-golang/src/sync/rwmutex.go:98 +0x97
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.(*NodeManager).GetNodeDetails(0xc000e78240, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/nodemanager.go:291 +0x80
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.getSharedDatastoresInK8SCluster(0x4e3eae0, 0xc00b2cacc0, 0xc000e78240, 0x0, 0x45ec3f8, 0x14, 0xc00b4d2e70, 0x1)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/vsphere_util.go:130 +0x43
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.(*VSphere).CreateVolume.func1(0xc00b2dec80, 0x0, 0x0, 0x0, 0x0)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/vsphere.go:1377 +0x17a1
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere.(*VSphere).CreateVolume(0xc000b6cb00, 0xc00b2dec80, 0x74db420, 0x0, 0x1, 0xc00b2a9f50)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/vsphere/vsphere.go:1453 +0xe4
k8s.io/kubernetes/pkg/volume/vsphere_volume.(*VsphereDiskUtil).CreateVolume(0x74d9b38, 0xc00b28be30, 0x0, 0x74d9b38, 0x0, 0x0, 0xc00b417740, 0x2, 0x2)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/volume/vsphere_volume/vsphere_volume_util.go:142 +0xbf2
k8s.io/kubernetes/pkg/volume/vsphere_volume.(*vsphereVolumeProvisioner).Provision(0xc00b28be30, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00b2a9f50, 0x28)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/volume/vsphere_volume/vsphere_volume.go:382 +0x295
k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).provisionClaimOperation(0xc0008e21c0, 0xc00b3476c0, 0x7fdc57537c20, 0xc004b99f40, 0xc007ede600, 0x45c78b5, 0x9, 0x0, 0x0)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/volume/persistentvolume/pv_controller.go:1555 +0xfec
k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).provisionClaim.func1(0x99e752, 0xc008590180)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/volume/persistentvolume/pv_controller.go:1447 +0x204
k8s.io/kubernetes/pkg/util/goroutinemap.(*goRoutineMap).Run.func1(0xc000e694d0, 0xc00b372740, 0x40, 0xc00b37ca50, 0x0, 0x0)
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/goroutinemap/goroutinemap.go:115 +0xe5
created by k8s.io/kubernetes/pkg/util/goroutinemap.(*goRoutineMap).Run
        /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/goroutinemap/goroutinemap.go:108 +0x2e7

I think we have a problem introduced when we merged https://github.com/kubernetes/kubernetes/pull/90836. What is happening is: when GetNodeDetails in nodemanager.go is called, it acquires a lock on registeredNodesLock, but the same function ends up calling GetNode, which acquires another lock on the same registeredNodesLock object. The problem is that GetNode is called in a loop and tries to acquire the lock for each node it processes, so two concurrent operations can end up deadlocked.
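To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern: an outer read lock held while per-node calls re-acquire the same sync.RWMutex in a loop, with a writer queued in between. The names (nodeLock, getNode, listNodes) are invented for the sketch and the locking layout is simplified; this is not the actual cloud-provider code, but running it trips Go's "all goroutines are asleep - deadlock!" detector:

// deadlock_sketch.go - simplified illustration of the lock pattern described
// above; not the vSphere cloud-provider implementation.
package main

import (
	"fmt"
	"sync"
	"time"
)

var nodeLock sync.RWMutex

// getNode stands in for a GetNode-style call: it takes its own read lock.
func getNode(name string) string {
	nodeLock.RLock()
	defer nodeLock.RUnlock()
	return name
}

// listNodes stands in for a GetNodeDetails-style call: it holds a read lock
// and then calls getNode for each node, re-acquiring the same RWMutex.
func listNodes(names []string) {
	nodeLock.RLock()
	defer nodeLock.RUnlock()
	for _, n := range names {
		// If a writer queued up after the outer RLock, this inner RLock
		// blocks forever: sync.RWMutex holds back new readers while a
		// Lock call is pending.
		_ = getNode(n)
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	go listNodes([]string{"node-a", "node-b", "node-c"})
	time.Sleep(50 * time.Millisecond)

	// A writer (think of the secret-change refresh path) asks for the
	// write lock while the reader above is mid-loop. The writer waits on
	// the outer RLock, the inner RLock waits on the writer: deadlock.
	nodeLock.Lock()
	fmt.Println("never reached while the reader loop is running")
	nodeLock.Unlock()
}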
Waiting on upstream PR to merge to verify the fix: https://github.com/kubernetes/kubernetes/pull/93971
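I have not restated the actual change from that PR here, but the usual remedy for this class of re-entrant lock bug is to copy the shared map under the lock and then iterate over the copy with the lock released. A minimal sketch of that pattern, with hypothetical names (nodeRegistry, getNode, listNodeDetails), purely to show the shape of the fix rather than the PR's code:

// copy_then_release.go - "copy under lock, iterate without lock" sketch;
// hypothetical types, not the vSphere cloud provider or the upstream PR.
package main

import (
	"fmt"
	"sync"
)

type nodeRegistry struct {
	mu    sync.RWMutex
	nodes map[string]string // node name -> UUID, for example
}

// getNode takes the read lock on its own, as before.
func (r *nodeRegistry) getNode(name string) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	uuid, ok := r.nodes[name]
	return uuid, ok
}

// listNodeDetails copies the node names under the lock, releases it, and only
// then calls getNode, so the RWMutex is never acquired re-entrantly.
func (r *nodeRegistry) listNodeDetails() []string {
	r.mu.RLock()
	names := make([]string, 0, len(r.nodes))
	for name := range r.nodes {
		names = append(names, name)
	}
	r.mu.RUnlock() // released before any further locking below

	details := make([]string, 0, len(names))
	for _, name := range names {
		if uuid, ok := r.getNode(name); ok {
			details = append(details, name+"="+uuid)
		}
	}
	return details
}

func main() {
	r := &nodeRegistry{nodes: map[string]string{"node-a": "uuid-a"}}
	fmt.Println(r.listNodeDetails())
}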
We will need to cherry-pick the fix from upstream before we can close this bug; deferring to the next sprint to do this.
Verified pass on 4.6.0-0.nightly-2020-08-31-220837

[wduan@MINT 01_general]$ oc get pod,pvc
NAME          READY   STATUS    RESTARTS   AGE
pod/mypod01   1/1     Running   0          60s

NAME                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
persistentvolumeclaim/mypvc01   Bound    pvc-e39f0c9e-5802-48c6-9992-0d5c0b3543db   3Gi        RWO            thin           60s

[wduan@MINT 01_general]$ oc get pod mypod01 -ojson | jq .spec.volumes
[
  {
    "name": "local",
    "persistentVolumeClaim": {
      "claimName": "mypvc01"
    }
  },
  {
    "name": "default-token-km4bm",
    "secret": {
      "defaultMode": 420,
      "secretName": "default-token-km4bm"
    }
  }
]
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196