Bug 1534955 - [vSphere] Failed to provision volume, Kubernetes node nodeVmDetail details is empty
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.9.z
Assignee: Jan Safranek
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-01-16 10:56 UTC by Jianwei Hou
Modified: 2020-02-06 06:49 UTC
CC List: 28 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: ESXi hardware version 13.
Consequence: Failed to provision volume; "Kubernetes node nodeVmDetail details is empty".
Fix: Manually downgrade the HW version to 11, power on the VM, and, if needed, upgrade back to HW version 13.
Result: Provisioning works fine.
Clone Of:
Cloned to: 1599824
Environment:
Last Closed: 2018-07-10 16:16:29 UTC
Target Upstream Version:
Embargoed:




Links
Origin (GitHub) 19605 (last updated 2018-06-10 08:31:47 UTC)
Red Hat Product Errata RHBA-2018:0489 (last updated 2018-03-28 14:20:16 UTC)

Description Jianwei Hou 2018-01-16 10:56:20 UTC
Description of problem:
vSphere PV failed to provision, error:
Warning  ProvisioningFailed  41s (x13 over 3m)  persistentvolume-controller  Failed to provision volume with StorageClass "vspheredefault": Kubernetes node nodeVmDetail details is empty. nodeVmDetails : []

Version-Release number of selected component (if applicable):
openshift v3.9.0-0.20.0
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.8

How reproducible:
Always 

Steps to Reproduce:
1. Create StorageClass and PVC
2. oc describe pvc
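A minimal sketch of these steps (the file names below are hypothetical; the manifest contents are the StorageClass and PVC dumps further down in this report):
```
# Hypothetical file names; see the dumps below for the actual manifests.
oc create -f vspheredefault-storageclass.yaml
oc create -f vspherec-pvc.yaml
oc describe pvc vspherec
```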

Actual results:
oc describe pvc vspherec
Name:          vspherec
Namespace:     jhou
StorageClass:  vspheredefault
Status:        Pending
Volume:        
Labels:        <none>
Annotations:   volume.beta.kubernetes.io/storage-provisioner=kubernetes.io/vsphere-volume
Finalizers:    []
Capacity:      
Access Modes:  
VolumeMode:    Filesystem
Events:
  Type     Reason              Age                From                         Message
  ----     ------              ----               ----                         -------
  Warning  ProvisioningFailed  5m (x9 over 12m)   persistentvolume-controller  Failed to provision volume with StorageClass "vspheredefault": Kubernetes node nodeVmDetail details is empty. nodeVmDetails : []

Expected results:
PV provisioned successfully

Master Log:
```
Jan 16 18:39:20 ocp39 atomic-openshift-master-controllers: I0116 18:39:20.654036    4003 pv_controller.go:1269] provisionClaim[jhou/vspherec]: started                                         
Jan 16 18:39:20 ocp39 atomic-openshift-master-controllers: I0116 18:39:20.654048    4003 pv_controller.go:1477] scheduleOperation[provision-jhou/vspherec[7c91ba0d-faa9-11e7-ac55-0050569f5abb]]                                              
Jan 16 18:39:20 ocp39 atomic-openshift-master-controllers: I0116 18:39:20.654276    4003 pv_controller.go:1288] provisionClaimOperation [jhou/vspherec] started, class: "vspheredefault"       
Jan 16 18:39:22 ocp39 atomic-openshift-master-controllers: I0116 18:39:22.163715    4003 vsphere.go:1007] Starting to create a vSphere volume with volumeOptions: &{CapacityKB:1048576 Tags:map[kubernetes.io/created-for/pvc/namespace:jhou kubernetes.io/created-for/pvc/name:vspherec kubernetes.io/created-for/pv/name:pvc-7c91ba0d-faa9-11e7-ac55-0050569f5abb] Name:kubernetes-dynamic-pvc-7c91ba0d-faa9-11e7-ac55-0050569f5abb DiskFormat: Datastore: VSANStorageProfileData: StoragePolicyName: StoragePolicyID: SCSIControllerType:}                                                
Jan 16 18:39:22 ocp39 atomic-openshift-master-controllers: I0116 18:39:22.171773    4003 pv_controller.go:1379] failed to provision volume for claim "jhou/vspherec" with StorageClass "vspheredefault": Kubernetes node nodeVmDetail details is empty. nodeVmDetails : []                    
Jan 16 18:39:22 ocp39 atomic-openshift-master-controllers: I0116 18:39:22.172366    4003 event.go:218] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"jhou", Name:"vspherec", UID:"7c91ba0d-faa9-11e7-ac55-0050569f5abb", APIVersion:"v1", ResourceVersion:"1105744", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' Failed to provision volume with StorageClass "vspheredefault": Kubernetes node nodeVmDetail details is empty. nodeVmDetails : []     
```

PVC Dump:
{
  "kind": "PersistentVolumeClaim",
  "apiVersion": "v1",
  "metadata": {
    "name": "vspherec"
  },
  "spec": {
    "accessModes": [
      "ReadWriteOnce"
    ],
    "resources": {
      "requests": {
        "storage": "1Gi"
      }
    }
  }
}

StorageClass Dump (if StorageClass used by PV/PVC):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  creationTimestamp: 2018-01-08T06:19:50Z
  name: vspheredefault
  resourceVersion: "361975"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/vspheredefault
  uid: eec2d482-f43b-11e7-80b8-0050569f5abb
provisioner: kubernetes.io/vsphere-volume
reclaimPolicy: Delete

Additional info:

Comment 2 Jan Safranek 2018-01-19 12:08:49 UTC
Jianwei, can you please set up a VMware machine for us where it is reproducible? Our access to VMware is very limited right now (I am working on a solution, but it will take time).

Comment 6 Jan Safranek 2018-01-24 11:59:31 UTC
It's more complicated; I filed https://github.com/kubernetes/kubernetes/issues/58747 upstream to get some feedback.

Comment 7 Jan Safranek 2018-01-26 09:01:05 UTC
So, upstream does not provide a default RBAC policy for the vSphere cloud provider. We should create one during installation.

What's needed is simply to apply the YAML from comment #5 when deploying 3.9 on vSphere.
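For orientation, a hedged sketch of the kind of objects involved follows. The ClusterRole name matches the one used in comment 26 below; the ServiceAccount name, namespace, rules, and verbs are assumptions, not the actual contents of comment #5 or of openshift-ansible's vsphere-svc.yml:
```
# Hedged sketch only -- not the YAML from comment #5 or vsphere-svc.yml.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: vsphere-cloud-provider        # assumed name
  namespace: kube-system              # assumed namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: system:vsphere-cloud-provider # name referenced in comment 26
rules:
- apiGroups: [""]                     # core API group
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]     # assumed minimal set for node lookup
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:vsphere-cloud-provider
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:vsphere-cloud-provider
subjects:
- kind: ServiceAccount
  name: vsphere-cloud-provider        # assumed subject
  namespace: kube-system
```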

Comment 8 Jianwei Hou 2018-01-26 09:39:58 UTC
Then the OCP upgrade should apply the YAML from comment 5 too.

Comment 9 N. Harrison Ripps 2018-01-26 14:19:08 UTC
@Scott, can you please look at this with respect to Jianwei's statement about adding this to the OCP upgrade?

Comment 10 N. Harrison Ripps 2018-01-26 14:20:09 UTC
@Jianwei given Jan's suggested workaround, does this still qualify as a TestBlocker?

Comment 14 davis phillips 2018-02-13 17:38:31 UTC
Proposed PR - https://github.com/openshift/openshift-ansible/pull/7096

Comment 17 Jianwei Hou 2018-03-08 07:44:20 UTC
Verified that the playbook fixes the clusterrole and clusterrolebinding for the vSphere cloud provider.

Comment 20 errata-xmlrpc 2018-03-28 14:19:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0489

Comment 21 Eric Jones 2018-05-29 14:07:37 UTC
Hi,

A customer has updated to the newest 3.9 packages [0], which should include the fix from this errata; however, they are still seeing this issue:

LAST SEEN   FIRST SEEN   COUNT     NAME                    KIND                    SUBOBJECT   TYPE      REASON               SOURCE                        MESSAGE
4s          47s          4         test.152fbb4cd95dfc9f   PersistentVolumeClaim               Warning   ProvisioningFailed   persistentvolume-controller   Failed to provision volume with StorageClass "standard": Kubernetes node nodeVmDetail details is empty. nodeVmDetails : []


Is there anything in particular we should gather to investigate why the errata is not working in this case?


[0]
> grep openshift master/sosreport-ptrehiou.02100418-20180524093648/installed-rpms 
atomic-openshift-3.9.27-1.git.0.964617d.el7.x86_64          Wed May 23 15:57:49 2018
atomic-openshift-clients-3.9.27-1.git.0.964617d.el7.x86_64  Wed May 23 15:57:04 2018
atomic-openshift-docker-excluder-3.9.27-1.git.0.964617d.el7.noarch Wed May 23 16:00:56 2018
atomic-openshift-excluder-3.9.27-1.git.0.964617d.el7.noarch Wed May 23 16:01:23 2018
atomic-openshift-master-3.9.27-1.git.0.964617d.el7.x86_64   Wed May 23 16:03:40 2018
atomic-openshift-node-3.9.27-1.git.0.964617d.el7.x86_64     Wed May 23 17:33:50 2018
atomic-openshift-sdn-ovs-3.9.27-1.git.0.964617d.el7.x86_64  Wed May 23 17:34:15 2018
atomic-openshift-utils-3.9.27-1.git.0.52e35b5.el7.noarch    Wed May 23 16:34:40 2018
openshift-ansible-3.9.27-1.git.0.52e35b5.el7.noarch         Wed May 23 16:34:40 2018
openshift-ansible-docs-3.9.27-1.git.0.52e35b5.el7.noarch    Wed May 23 16:34:40 2018
openshift-ansible-playbooks-3.9.27-1.git.0.52e35b5.el7.noarch Wed May 23 16:34:40 2018
openshift-ansible-roles-3.9.27-1.git.0.52e35b5.el7.noarch   Wed May 23 16:34:40 2018

> grep openshift worker-node/sosreport-ptrehiou.02100418-20180524095110/installed-rpms 
atomic-openshift-3.9.27-1.git.0.964617d.el7.x86_64          Wed May 23 17:35:07 2018
atomic-openshift-clients-3.9.27-1.git.0.964617d.el7.x86_64  Wed May 23 17:34:50 2018
atomic-openshift-docker-excluder-3.9.27-1.git.0.964617d.el7.noarch Wed May 23 16:27:03 2018
atomic-openshift-excluder-3.9.27-1.git.0.964617d.el7.noarch Wed May 23 16:27:35 2018
atomic-openshift-node-3.9.27-1.git.0.964617d.el7.x86_64     Wed May 23 17:35:09 2018
atomic-openshift-sdn-ovs-3.9.27-1.git.0.964617d.el7.x86_64  Wed May 23 17:35:33 2018
atomic-openshift-utils-3.9.27-1.git.0.52e35b5.el7.noarch    Wed May 23 16:35:41 2018
openshift-ansible-3.9.27-1.git.0.52e35b5.el7.noarch         Wed May 23 16:35:39 2018
openshift-ansible-docs-3.9.27-1.git.0.52e35b5.el7.noarch    Wed May 23 16:35:39 2018
openshift-ansible-playbooks-3.9.27-1.git.0.52e35b5.el7.noarch Wed May 23 16:35:39 2018
openshift-ansible-roles-3.9.27-1.git.0.52e35b5.el7.noarch   Wed May 23 16:35:39 2018

Comment 26 davis phillips 2018-05-29 15:39:14 UTC
Hey Eric,

It looks like the API group is missing from the clusterrole. 

Can we issue the following commands and test again? 

oc delete clusterrole system:vsphere-cloud-provider 

wget https://raw.githubusercontent.com/openshift/openshift-ansible/master/roles/openshift_cloud_provider/files/vsphere-svc.yml

oc create -f vsphere-svc.yml 

oc describe clusterrole system:vsphere-cloud-provider
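The thing to look for in the describe output is that the rules carry an API group. A hedged sketch of what the relevant part of the recreated ClusterRole should contain (resources and verbs are assumptions based on this thread, not a dump from a live cluster):
```
# Hypothetical excerpt of the recreated ClusterRole.
rules:
- apiGroups: [""]                 # "" is the core API group; this is what was reported missing
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
```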

Comment 29 Mahesh Taru 2018-06-06 09:44:17 UTC
Hello Davis,

The steps from comment #26 did not help the customer and the issue still persists.

Comment 30 Muhammad Aizuddin Zali 2018-06-06 11:48:16 UTC
One of our customers is running 3.9.27:

# rpm -qa|grep atomic
atomic-openshift-clients-3.9.27-1.git.0.964617d.el7.x86_64
atomic-openshift-excluder-3.9.27-1.git.0.964617d.el7.noarch
atomic-registries-1.22.1-3.git2fd0860.el7.x86_64
atomic-openshift-docker-excluder-3.9.27-1.git.0.964617d.el7.noarch
atomic-openshift-master-3.9.27-1.git.0.964617d.el7.x86_64
atomic-openshift-node-3.9.27-1.git.0.964617d.el7.x86_64
atomic-openshift-3.9.27-1.git.0.964617d.el7.x86_64
atomic-openshift-sdn-ovs-3.9.27-1.git.0.964617d.el7.x86_64
atomic-openshift-utils-3.9.27-1.git.0.52e35b5.el7.noarch

Neither the errata nor https://bugzilla.redhat.com/show_bug.cgi?id=1534955#c26 works.

Comment 32 davis phillips 2018-06-06 20:18:51 UTC
Howdy Mahesh,

Is the customer willing to jump on a bluejeans session with me?

@Muhammad, the errata update is included in the 3.9.30 RPM. 

Let me know,
Davis

Comment 34 pk 2018-06-10 07:41:41 UTC
(In reply to davis phillips from comment #32)
> Howdy Mahesh,
> 
> Is the customer willing to jump on a bluejeans session with me?
> 
> @Muhammad, the errata update is included in the 3.9.30 RPM. 
> 
> Let me know,
> Davis

Hi Davis, we redeployed the cluster with 3.9.30 and the same vSphere issue persists. We would like to work with you through a BlueJeans session. May I know your availability?

Comment 40 sinsua 2018-06-13 15:24:38 UTC
Same issue here with 3.9.30 RPMs and a freshly installed cluster:

atomic-openshift-node-3.9.30-1.git.0.dec1ba7.el7.x86_64
atomic-registries-1.22.1-3.git2fd0860.el7.x86_64
atomic-openshift-docker-excluder-3.9.30-1.git.0.dec1ba7.el7.noarch
atomic-openshift-master-3.9.30-1.git.0.dec1ba7.el7.x86_64
atomic-openshift-sdn-ovs-3.9.30-1.git.0.dec1ba7.el7.x86_64
atomic-openshift-clients-3.9.30-1.git.0.dec1ba7.el7.x86_64
atomic-openshift-excluder-3.9.30-1.git.0.dec1ba7.el7.noarch
atomic-openshift-3.9.30-1.git.0.dec1ba7.el7.x86_64

Is there a workaround?

Comment 41 davis phillips 2018-06-13 21:00:11 UTC
@sinsua.uk 

If you do an 'oc get node', do the node names have the FQDN or the short name?

The cloud provider changes the node name to match the VM name. The error above can come from the missing service account (which is resolved with the RPM versions you have installed) or from having the FQDN as the node names.

Comment 42 sinsua 2018-06-14 08:27:44 UTC
I got the short names:

NAME       STATUS    ROLES     AGE       VERSION
app-0      Ready     compute   21h       v1.9.1+a0ce1bc657
app-1      Ready     compute   21h       v1.9.1+a0ce1bc657
app-2      Ready     compute   21h       v1.9.1+a0ce1bc657
infra-0    Ready     <none>    21h       v1.9.1+a0ce1bc657
infra-1    Ready     <none>    21h       v1.9.1+a0ce1bc657
infra-2    Ready     <none>    21h       v1.9.1+a0ce1bc657
master-0   Ready     master    21h       v1.9.1+a0ce1bc657
master-1   Ready     master    21h       v1.9.1+a0ce1bc657
master-2   Ready     master    21h       v1.9.1+a0ce1bc657

Comment 45 jrmorgan 2018-06-19 19:24:01 UTC
Davis, after updating from v3.9.27 to v3.9.30 we're still experiencing the same "nodeVmDetail details is empty" issue when provisioning new volumes. I'll open a SEV1 and associate it with this BZ since this is a high-impact customer issue. Please advise if additional information is required, or if you have further information on the root cause or possible workarounds. We also attempted setting vmuuid, vmname, etc. in vsphere.conf without success.

Comment 46 davis phillips 2018-06-19 22:12:48 UTC
Hey Jr, any chance for a customer call tomorrow afternoon? (I'm in CDT)

Comment 47 jrmorgan 2018-06-19 22:56:08 UTC
Davis, let me check with the customer on their availability. Just a note that I was able to work around this issue in my lab by downgrading the VM hardware version, though depending on customer requirements this may not be an acceptable solution:

- Shut down each node/master serially
- Unregister each VM
- Download/edit the vmx associated with each node/master and update virtualHW.version = "13" to virtualHW.version = "11"
- Register the VM and start it
- Confirm the output of cat /sys/class/dmi/id/product_uuid matches cat /sys/class/dmi/id/product_serial (see the sketch after this list)
- Attempt creation of a new PV
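A minimal sketch of the UUID/serial check above, run on each node (on vSphere VMs the serial is typically the same bytes as the UUID with a "VMware-" prefix, so an eyeball comparison is usually enough):
```
# Hedged sketch: both DMI values should refer to the same identifier.
cat /sys/class/dmi/id/product_uuid
cat /sys/class/dmi/id/product_serial
```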

This seems related to https://github.com/kubernetes/kubernetes/pull/59602 and should be fixed in k8s 1.9.4, but not in the 1.9.1 shipping with OCP 3.9.27 and 3.9.30.

I've requested a sosreport from the customer which should be attached to the case tonight or tomorrow morning.

Comment 48 Jan Safranek 2018-06-20 07:43:52 UTC
I'm backporting https://github.com/kubernetes/kubernetes/pull/59602 to 3.9. I am quite sure it will solve "nodeVmDetail details is empty"; however, I don't know if it helps with "AttachVolume.Attach failed for volume "pvc-073be41f-6f95-11e8-8ba2-0050569b3970" : No VM found".

Davis, is it the same issue?

Comment 55 davis phillips 2018-06-20 12:52:03 UTC
Hey Jeff,

Yes. A temporary workaround until the PR is cherry-picked would be to create the VMs with hardware compatibility 11 and install the OS. Then power them down and upgrade to hardware compatibility 13. This makes the following files match:

/sys/class/dmi/id/product_uuid
/sys/class/dmi/id/product_serial

Another workaround would be to manually edit the VMX file.

Please see the original issues on github for more details:

https://github.com/kubernetes/kubernetes/issues/58927

Comment 56 Jan Safranek 2018-06-21 09:54:11 UTC
OSE 3.9 backport: https://github.com/openshift/ose/pull/1319

I checked that 3.10 already contains this patch.

Comment 63 Emil 2020-02-06 06:49:34 UTC
@Jan Safranek, what happened here? I can't see this code in 3.10 or 3.11...

