Description of problem:
vSphere PV failed to provision with the error:

```
Warning  ProvisioningFailed  41s (x13 over 3m)  persistentvolume-controller  Failed to provision volume with StorageClass "vspheredefault": Kubernetes node nodeVmDetail details is empty. nodeVmDetails : []
```

Version-Release number of selected component (if applicable):
openshift v3.9.0-0.20.0
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.8

How reproducible:
Always

Steps to Reproduce:
1. Create the StorageClass and PVC
2. oc describe pvc

Actual results:
```
oc describe pvc vspherec
Name:          vspherec
Namespace:     jhou
StorageClass:  vspheredefault
Status:        Pending
Volume:
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner=kubernetes.io/vsphere-volume
Finalizers:    []
Capacity:
Access Modes:
VolumeMode:    Filesystem
Events:
  Type     Reason              Age               From                         Message
  ----     ------              ----              ----                         -------
  Warning  ProvisioningFailed  5m (x9 over 12m)  persistentvolume-controller  Failed to provision volume with StorageClass "vspheredefault": Kubernetes node nodeVmDetail details is empty. nodeVmDetails : []
```

Expected results:
PV provisioned successfully

Master Log:
```
Jan 16 18:39:20 ocp39 atomic-openshift-master-controllers: I0116 18:39:20.654036 4003 pv_controller.go:1269] provisionClaim[jhou/vspherec]: started
Jan 16 18:39:20 ocp39 atomic-openshift-master-controllers: I0116 18:39:20.654048 4003 pv_controller.go:1477] scheduleOperation[provision-jhou/vspherec[7c91ba0d-faa9-11e7-ac55-0050569f5abb]]
Jan 16 18:39:20 ocp39 atomic-openshift-master-controllers: I0116 18:39:20.654276 4003 pv_controller.go:1288] provisionClaimOperation [jhou/vspherec] started, class: "vspheredefault"
Jan 16 18:39:22 ocp39 atomic-openshift-master-controllers: I0116 18:39:22.163715 4003 vsphere.go:1007] Starting to create a vSphere volume with volumeOptions: &{CapacityKB:1048576 Tags:map[kubernetes.io/created-for/pvc/namespace:jhou kubernetes.io/created-for/pvc/name:vspherec kubernetes.io/created-for/pv/name:pvc-7c91ba0d-faa9-11e7-ac55-0050569f5abb] Name:kubernetes-dynamic-pvc-7c91ba0d-faa9-11e7-ac55-0050569f5abb DiskFormat: Datastore: VSANStorageProfileData: StoragePolicyName: StoragePolicyID: SCSIControllerType:}
Jan 16 18:39:22 ocp39 atomic-openshift-master-controllers: I0116 18:39:22.171773 4003 pv_controller.go:1379] failed to provision volume for claim "jhou/vspherec" with StorageClass "vspheredefault": Kubernetes node nodeVmDetail details is empty. nodeVmDetails : []
Jan 16 18:39:22 ocp39 atomic-openshift-master-controllers: I0116 18:39:22.172366 4003 event.go:218] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"jhou", Name:"vspherec", UID:"7c91ba0d-faa9-11e7-ac55-0050569f5abb", APIVersion:"v1", ResourceVersion:"1105744", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' Failed to provision volume with StorageClass "vspheredefault": Kubernetes node nodeVmDetail details is empty. nodeVmDetails : []
```

PVC Dump:
```
{
    "kind": "PersistentVolumeClaim",
    "apiVersion": "v1",
    "metadata": {
        "name": "vspherec"
    },
    "spec": {
        "accessModes": [
            "ReadWriteOnce"
        ],
        "resources": {
            "requests": {
                "storage": "1Gi"
            }
        }
    }
}
```

StorageClass Dump (if StorageClass used by PV/PVC):
```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  creationTimestamp: 2018-01-08T06:19:50Z
  name: vspheredefault
  resourceVersion: "361975"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/vspheredefault
  uid: eec2d482-f43b-11e7-80b8-0050569f5abb
provisioner: kubernetes.io/vsphere-volume
reclaimPolicy: Delete
```

Additional info:
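For reference, a minimal sketch of how the dumps above can be applied to reproduce this; the file names are illustrative assumptions, not from the original report:

```
# Save the StorageClass and PVC dumps above as sc.yaml and pvc.json (hypothetical names),
# then create them and inspect the claim:
oc create -f sc.yaml
oc create -f pvc.json -n jhou
oc describe pvc vspherec -n jhou
```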
Jianwei, can you please set up a VMware machine for us where this is reproducible? Our access to VMware is very limited right now (I am working on a solution, but it will take time).
It's more complicated; I filed https://github.com/kubernetes/kubernetes/issues/58747 upstream to get some feedback.
So, upstream does not provide a default policy for the vSphere cloud provider. We should create one during installation. What's needed is simply to apply the YAML from comment #5 when deploying 3.9 on vSphere.
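For illustration only (the YAML from comment #5 is not reproduced in this bug): the kind of RBAC objects such a policy typically contains is sketched below. The ClusterRole name matches the one referenced elsewhere in this bug; the ServiceAccount name and the rule set are assumptions, not the actual file contents.

```
# Illustrative sketch only -- not the actual contents of comment #5 or vsphere-svc.yml.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vsphere-cloud-provider          # assumed name
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: system:vsphere-cloud-provider   # name referenced in later comments
rules:
- apiGroups: [""]                       # core ("") API group on the nodes rule
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:vsphere-cloud-provider
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:vsphere-cloud-provider
subjects:
- kind: ServiceAccount
  name: vsphere-cloud-provider          # assumed name
  namespace: kube-system
```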
The OCP upgrade should then apply the YAML from comment #5 as well.
@Scott, can you please look at this with respect to Jianwei's statement about adding this to the OCP upgrade?
@Jianwei, given Jan's suggested workaround, does this still qualify as a TestBlocker?
Proposed PR - https://github.com/openshift/openshift-ansible/pull/7096
Verified that the playbook fixes the clusterrole and clusterrolebinding for the vSphere cloud provider.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0489
Hi,

A customer has updated to the newest 3.9 packages [0], which should include the fix from this errata; however, they are still seeing this issue:

```
LAST SEEN   FIRST SEEN   COUNT   NAME                    KIND                    SUBOBJECT   TYPE      REASON               SOURCE                        MESSAGE
4s          47s          4       test.152fbb4cd95dfc9f   PersistentVolumeClaim               Warning   ProvisioningFailed   persistentvolume-controller   Failed to provision volume with StorageClass "standard": Kubernetes node nodeVmDetail details is empty. nodeVmDetails : []
```

Is there anything in particular we should gather to investigate why the errata is not working in this case?

[0]
```
> grep openshift master/sosreport-ptrehiou.02100418-20180524093648/installed-rpms
atomic-openshift-3.9.27-1.git.0.964617d.el7.x86_64                  Wed May 23 15:57:49 2018
atomic-openshift-clients-3.9.27-1.git.0.964617d.el7.x86_64          Wed May 23 15:57:04 2018
atomic-openshift-docker-excluder-3.9.27-1.git.0.964617d.el7.noarch  Wed May 23 16:00:56 2018
atomic-openshift-excluder-3.9.27-1.git.0.964617d.el7.noarch         Wed May 23 16:01:23 2018
atomic-openshift-master-3.9.27-1.git.0.964617d.el7.x86_64           Wed May 23 16:03:40 2018
atomic-openshift-node-3.9.27-1.git.0.964617d.el7.x86_64             Wed May 23 17:33:50 2018
atomic-openshift-sdn-ovs-3.9.27-1.git.0.964617d.el7.x86_64          Wed May 23 17:34:15 2018
atomic-openshift-utils-3.9.27-1.git.0.52e35b5.el7.noarch            Wed May 23 16:34:40 2018
openshift-ansible-3.9.27-1.git.0.52e35b5.el7.noarch                 Wed May 23 16:34:40 2018
openshift-ansible-docs-3.9.27-1.git.0.52e35b5.el7.noarch            Wed May 23 16:34:40 2018
openshift-ansible-playbooks-3.9.27-1.git.0.52e35b5.el7.noarch       Wed May 23 16:34:40 2018
openshift-ansible-roles-3.9.27-1.git.0.52e35b5.el7.noarch           Wed May 23 16:34:40 2018

> grep openshift worker-node/sosreport-ptrehiou.02100418-20180524095110/installed-rpms
atomic-openshift-3.9.27-1.git.0.964617d.el7.x86_64                  Wed May 23 17:35:07 2018
atomic-openshift-clients-3.9.27-1.git.0.964617d.el7.x86_64          Wed May 23 17:34:50 2018
atomic-openshift-docker-excluder-3.9.27-1.git.0.964617d.el7.noarch  Wed May 23 16:27:03 2018
atomic-openshift-excluder-3.9.27-1.git.0.964617d.el7.noarch         Wed May 23 16:27:35 2018
atomic-openshift-node-3.9.27-1.git.0.964617d.el7.x86_64             Wed May 23 17:35:09 2018
atomic-openshift-sdn-ovs-3.9.27-1.git.0.964617d.el7.x86_64          Wed May 23 17:35:33 2018
atomic-openshift-utils-3.9.27-1.git.0.52e35b5.el7.noarch            Wed May 23 16:35:41 2018
openshift-ansible-3.9.27-1.git.0.52e35b5.el7.noarch                 Wed May 23 16:35:39 2018
openshift-ansible-docs-3.9.27-1.git.0.52e35b5.el7.noarch            Wed May 23 16:35:39 2018
openshift-ansible-playbooks-3.9.27-1.git.0.52e35b5.el7.noarch       Wed May 23 16:35:39 2018
openshift-ansible-roles-3.9.27-1.git.0.52e35b5.el7.noarch           Wed May 23 16:35:39 2018
```
Hey Eric,

It looks like the API group is missing from the clusterrole. Can we issue the following commands and test again?

```
oc delete clusterrole system:vsphere-cloud-provider
wget https://raw.githubusercontent.com/openshift/openshift-ansible/master/roles/openshift_cloud_provider/files/vsphere-svc.yml
oc create -f vsphere-svc.yml
oc describe clusterrole system:vsphere-cloud-provider
```
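Not part of the original comment, but as a hedged follow-up check: after recreating the role, the raw YAML can be inspected to confirm the nodes rule carries the core API group. The exact rule shown in the comments below is an assumption.

```
# Illustrative verification only:
oc get clusterrole system:vsphere-cloud-provider -o yaml
# Look for a rule along these lines (assumed; "" is the core API group):
#   rules:
#   - apiGroups:
#     - ""
#     resources:
#     - nodes
#     verbs:
#     - get
#     - list
#     - watch
```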
Hello Davis, the steps from comment #26 did not help the customer and the issue still persists.
One of our customers is running 3.9.27:

```
# rpm -qa|grep atomic
atomic-openshift-clients-3.9.27-1.git.0.964617d.el7.x86_64
atomic-openshift-excluder-3.9.27-1.git.0.964617d.el7.noarch
atomic-registries-1.22.1-3.git2fd0860.el7.x86_64
atomic-openshift-docker-excluder-3.9.27-1.git.0.964617d.el7.noarch
atomic-openshift-master-3.9.27-1.git.0.964617d.el7.x86_64
atomic-openshift-node-3.9.27-1.git.0.964617d.el7.x86_64
atomic-openshift-3.9.27-1.git.0.964617d.el7.x86_64
atomic-openshift-sdn-ovs-3.9.27-1.git.0.964617d.el7.x86_64
atomic-openshift-utils-3.9.27-1.git.0.52e35b5.el7.noarch
```

Neither the errata nor https://bugzilla.redhat.com/show_bug.cgi?id=1534955#c26 works.
Howdy Mahesh,

Is the customer willing to jump on a BlueJeans session with me?

@Muhammad, the errata update is included in the 3.9.30 RPM.

Let me know,
Davis
(In reply to davis phillips from comment #32)
> Howdy Mahesh,
>
> Is the customer willing to jump on a BlueJeans session with me?
>
> @Muhammad, the errata update is included in the 3.9.30 RPM.
>
> Let me know,
> Davis

Hi Davis, we redeployed the cluster with 3.9.30 and the same vSphere issue persists. We would like to work with you through a BlueJeans session. May I know your availability?
Same issue here with the 3.9.30 RPMs and a freshly installed cluster:

```
atomic-openshift-node-3.9.30-1.git.0.dec1ba7.el7.x86_64
atomic-registries-1.22.1-3.git2fd0860.el7.x86_64
atomic-openshift-docker-excluder-3.9.30-1.git.0.dec1ba7.el7.noarch
atomic-openshift-master-3.9.30-1.git.0.dec1ba7.el7.x86_64
atomic-openshift-sdn-ovs-3.9.30-1.git.0.dec1ba7.el7.x86_64
atomic-openshift-clients-3.9.30-1.git.0.dec1ba7.el7.x86_64
atomic-openshift-excluder-3.9.30-1.git.0.dec1ba7.el7.noarch
atomic-openshift-3.9.30-1.git.0.dec1ba7.el7.x86_64
```

Is there a workaround?
@sinsua.uk

If you do an 'oc get node', do the node names have the FQDN or the short name?

The cloud provider changes the node name to match the VM name. The error above can come from the missing service account (which is resolved with the RPM versions you have installed) or from having the FQDN as the node names.
I got the short names:

```
NAME       STATUS    ROLES     AGE       VERSION
app-0      Ready     compute   21h       v1.9.1+a0ce1bc657
app-1      Ready     compute   21h       v1.9.1+a0ce1bc657
app-2      Ready     compute   21h       v1.9.1+a0ce1bc657
infra-0    Ready     <none>    21h       v1.9.1+a0ce1bc657
infra-1    Ready     <none>    21h       v1.9.1+a0ce1bc657
infra-2    Ready     <none>    21h       v1.9.1+a0ce1bc657
master-0   Ready     master    21h       v1.9.1+a0ce1bc657
master-1   Ready     master    21h       v1.9.1+a0ce1bc657
master-2   Ready     master    21h       v1.9.1+a0ce1bc657
```
Davis, after updating from v3.9.27 to v3.9.30 we're still experiencing the same "nodeVmDetail details is empty" issue when provisioning new volumes. I'll open a SEV1 and associate it with this BZ since this is a high-impact customer issue. Please advise if additional information is required, or if you have further information on root cause or possible workarounds. We also attempted setting vmuuid, vmname, etc. in vsphere.conf without success.
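For context, a rough sketch of where such settings live in the legacy vSphere cloud-provider config used by OCP 3.9; the path, key names, and all values below are placeholders/assumptions, not the customer's actual configuration:

```
# Illustrative vsphere.conf sketch (typically /etc/origin/cloudprovider/vsphere.conf on OCP);
# every value here is a placeholder.
[Global]
user = "administrator@vsphere.local"
password = "changeme"
server = "vcenter.example.com"
port = "443"
insecure-flag = "1"
datacenter = "DC1"
datastore = "datastore1"
working-dir = "kubernetes"
vm-uuid = "4230f79d-8b15-4a43-8402-1e0a4a4e1572"   # optional per-node override; assumed key, normally auto-detected
vm-name = "app-0"                                  # optional per-node override; assumed key

[Disk]
scsicontrollertype = pvscsi
```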
Hey Jr, any chance for a customer call tomorrow afternoon? (I'm in CDT)
Davis, let me check with the customer on their availability.

Just a note that I was able to work around this issue in my lab by downgrading the VM hardware version, though depending on customer requirements this may not be an acceptable solution:

- Shut down each node/master serially
- Unregister each VM
- Download/edit each .vmx associated with each node/master and update virtualHW.version = "13" to virtualHW.version = "11"
- Register the VM and start it
- Confirm the output of cat /sys/class/dmi/id/product_uuid matches cat /sys/class/dmi/id/product_serial (see the check sketched below)
- Attempt creation of a new PV

This seems related to https://github.com/kubernetes/kubernetes/pull/59602 and should be fixed in k8s 1.9.4, but not in the 1.9.1 shipping with OCP 3.9.27 & 3.9.30.

I've requested a sosreport from the customer, which should be attached to the case tonight or tomorrow morning.
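A rough way to run that comparison on each node; the normalization below assumes the usual VMware formatting of the two files (product_serial prefixed with "VMware-" and space/dash separated) and is an illustration, not part of the original workaround:

```
# Illustrative check only; run as root on each node. Assumes typical VMware formatting.
uuid=$(tr -d '-' < /sys/class/dmi/id/product_uuid | tr '[:upper:]' '[:lower:]')
serial=$(sed 's/^VMware-//' /sys/class/dmi/id/product_serial | tr -d ' -' | tr '[:upper:]' '[:lower:]')
if [ "$uuid" = "$serial" ]; then
    echo "product_uuid and product_serial match"
else
    echo "MISMATCH: uuid=$uuid serial=$serial"
fi
```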
I'm backporting https://github.com/kubernetes/kubernetes/pull/59602 to 3.9. I am quite sure it will solve "nodeVmDetail details is empty"; however, I don't know if it helps with "AttachVolume.Attach failed for volume "pvc-073be41f-6f95-11e8-8ba2-0050569b3970" : No VM found".

Davis, is it the same issue?
Hey Jeff,

Yes. A temporary workaround until the PR is cherry-picked would be to create the VMs with hardware compatibility 11 and install the OS. Then power them down and upgrade to hardware compatibility 13. This makes the following files match:

/sys/class/dmi/id/product_uuid
/sys/class/dmi/id/product_serial

Another workaround would be to manually edit the VMX file. Please see the original issue on GitHub for more details: https://github.com/kubernetes/kubernetes/issues/58927
OSE 3.9 backport: https://github.com/openshift/ose/pull/1319 I checked that 3.10 already contains this patch.
@Jan Safranek, what happened here? I can't see this code in 3.10 or 3.11...