Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1434709

Summary: vSphere disks don't get mounted on containerized OCP
Product: OpenShift Container Platform
Reporter: Jianwei Hou <jhou>
Component: Storage
Assignee: Matthew Wong <mawong>
Status: CLOSED NOTABUG
QA Contact: Jianwei Hou <jhou>
Severity: high
Docs Contact:
Priority: high
Version: 3.5.0
CC: aos-bugs, bchilds, eparis, jhou, ronald.l.gavlin.civ
Target Milestone: ---
Keywords: UpcomingRelease
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-05-15 14:07:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: ---
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host: ---
Cloudforms Team: ---
Target Upstream Version: ---
Embargoed:

Description Jianwei Hou 2017-03-22 08:33:24 UTC
Description of problem:
In a containerized OCP cluster, vSphere disks are successfully provisioned but never get mounted. This is not reproducible on an RPM-installed OCP.

Version-Release number of selected component (if applicable):
v3.5.0.55

How reproducible:
Always

Steps to Reproduce:
1. Create a StorageClass
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: vsphere
provisioner: kubernetes.io/vsphere-volume

2. Create a PVC
{
  "kind": "PersistentVolumeClaim",
  "apiVersion": "v1",
  "metadata": {
    "name": "vsphere",
    "annotations": {
        "volume.beta.kubernetes.io/storage-class": "vsphere"
    }
  },
  "spec": {
    "accessModes": [
      "ReadWriteOnce"
    ],
    "resources": {
      "requests": {
        "storage": "1Gi"
      }
    }
  }
}

3. Once PVC is bound, create a Pod
oc get pvc
NAME      STATUS    VOLUME                                     CAPACITY   ACCESSMODES   AGE
vsphere   Bound     pvc-9baba382-0ed0-11e7-8a82-0050569fb058   1Gi        RWO           48m

{
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "vsphere",
        "labels": {
            "name": "vsphere"
        }
    },
    "spec": {
        "containers": [{
            "name": "vsphere",
            "image": "aosqe/hello-openshift",
            "imagePullPolicy": "IfNotPresent",
            "volumeMounts": [{
                "mountPath": "/mnt/vsphere",
                "name": "vsphere"
            }]
        }],
       "securityContext": {
            "fsGroup": 123456,
            "supplementalGroups": [111,112,113],
            "seLinuxOptions": {
                 "level": "s0"
            }
       },
        "volumes": [{
            "name": "vsphere",
            "persistentVolumeClaim": {
                "claimName": "vsphere"
            }
        }]
    }
}


Actual results:
After step 3, the Pod is stuck in the 'ContainerCreating' state.

oc describe pod vsphere
Name:			vsphere
Namespace:		jhou
Security Policy:	storagescc
Node:			node1.c.vsphere.local/10.66.144.139
Start Time:		Wed, 22 Mar 2017 15:25:01 +0800
Labels:			name=vsphere
Status:			Pending
IP:			
Controllers:		<none>
Containers:
  vsphere:
    Container ID:	
    Image:		aosqe/hello-openshift
    Image ID:		
    Port:		
    State:		Waiting
      Reason:		ContainerCreating
    Ready:		False
    Restart Count:	0
    Volume Mounts:
      /mnt/vsphere from vsphere (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-lzwkl (ro)
    Environment Variables:	<none>
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	False 
  PodScheduled 	True 
Volumes:
  vsphere:
    Type:	PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:	vsphere
    ReadOnly:	false
  default-token-lzwkl:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-lzwkl
QoS Class:	BestEffort
Tolerations:	<none>
Events:
  FirstSeen	LastSeen	Count	From				SubObjectPath	Type		Reason		Message
  ---------	--------	-----	----				-------------	--------	------		-------
  48m		48m		1	{default-scheduler }				Normal		Scheduled	Successfully assigned vsphere to node1.c.vsphere.local
  46m		2m		21	{kubelet node1.c.vsphere.local}			Warning		FailedMount	Unable to mount volumes for pod "vsphere_jhou(a91ac559-0ed0-11e7-8a82-0050569fb058)": timeout expired waiting for volumes to attach/mount for pod "jhou"/"vsphere". list of unattached/unmounted volumes=[vsphere]
  46m		2m		21	{kubelet node1.c.vsphere.local}			Warning		FailedSync	Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "jhou"/"vsphere". list of unattached/unmounted volumes=[vsphere]

Expected results:
Pod should be 'Running'.

Additional info:

Comment 3 Matthew Wong 2017-05-08 21:20:31 UTC
@jianwei, I tried to reproduce on newer versions

openshift v3.5.5.13
kubernetes v1.5.2+43a9be4

and mounting dynamically provisioned volumes worked: the pod entered the Running state and could write to them.

However, I did find a different bug: when the pod is deleted and then the PVC is deleted, the PV enters the Failed state because the disk cannot be detached and then deleted. Can you try this again on the newer 3.5? I'll open a new bug for that issue.

PV Failed state:
  25s		25s		1	{persistentvolume-controller }			Warning		VolumeFailedDelete	Error caused by file /vmfs/volumes/57a232db-dca8824d-d9cd-44a842af585d/kubevols/kubernetes-dynamic-pvc-242a3c0f-3431-11e7-abad-005056a5d7df.vmd

node logs:
May 08 16:07:35 mawong atomic-openshift-node[32763]: E0508 16:07:35.592937     353 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/vsphere-volume/[ose3-vmware] kubevols/kubernetes-dynamic-pvc-e46bca08-3431-11e7-abad-005056a5d7df.vmdk\"" failed. No retries permitted until 2017-05-08 16:09:35.592905908 -0500 CDT (durationBeforeRetry 2m0s). Error: UnmountDevice failed for volume "kubernetes.io/vsphere-volume/[ose3-vmware] kubevols/kubernetes-dynamic-pvc-e46bca08-3431-11e7-abad-005056a5d7df.vmdk" (spec.Name: "pvc-e46bca08-3431-11e7-abad-005056a5d7df") with: remove /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/vsphere-volume/mounts/[ose3-vmware] kubevols/kubernetes-dynamic-pvc-e46bca08-3431-11e7-abad-005056a5d7df.vmdk: device or resource busy
May 08 16:07:35 mawong atomic-openshift-node[32763]: W0508 16:07:35.592860     353 util.go:84] Warning: "/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/vsphere-volume/mounts/[ose3-vmware] kubevols/kubernetes-dynamic-pvc-e46bca08-3431-11e7-abad-005056a5d7df.vmdk" is not a mountpoint, deleting
May 08 16:07:35 mawong atomic-openshift-node[32763]: I0508 16:07:35.592853     353 nsenter_mount.go:202] IsLikelyNotMountPoint: /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/vsphere-volume/mounts/[ose3-vmware] kubevols/kubernetes-dynamic-pvc-e46bca08-3431-11e7-abad-005056a5d7df.vmdk is not a mount point
May 08 16:07:35 mawong atomic-openshift-node[32763]: I0508 16:07:35.592837     353 nsenter_mount.go:196] IsLikelyNotMountPoint findmnt output for path /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/vsphere-volume/mounts/[ose3-vmware] kubevols/kubernetes-dynamic-pvc-e46bca08-3431-11e7-abad-005056a5d7df.vmdk: /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/vsphere-volume/mounts/[ose3-vmware]:
May 08 16:07:35 mawong atomic-openshift-node[32763]: I0508 16:07:35.582799     353 nsenter_mount.go:183] findmnt command: nsenter [--mount=/rootfs/proc/1/ns/mnt -- /bin/findmnt -o target,fstype --noheadings --first-only --target /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/vsphere-volume/mounts/[ose3-vmware] kubevols/kubernetes-dynamic-pvc-e46bca08-3431-11e7-abad-005056a5d7df.vmdk]
May 08 16:07:35 mawong atomic-openshift-node[32763]: W0508 16:07:35.582773     353 mount.go:136] could not determine device for path: "/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/vsphere-volume/mounts/[ose3-vmware] kubevols/kubernetes-dynamic-pvc-e46bca08-3431-11e7-abad-005056a5d7df.vmdk"
May 08 16:07:35 mawong atomic-openshift-node[32763]: I0508 16:07:35.582300     353 reconciler.go:341] UnmountDevice operation started for volume "kubernetes.io/vsphere-volume/[ose3-vmware] kubevols/kubernetes-dynamic-pvc-e46bca08-3431-11e7-abad-005056a5d7df.vmdk" (spec.Name: "pvc-e46bca08-3431-11e7-abad-005056a5d7df")

mount on host:
/dev/sde on /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/vsphere-volume/mounts/[ose3-vmware] kubevols/kubernetes-dynamic-pvc-e46bca08-3431-11e7-abad-005056a5d7df.vmdk type ext4 (rw,relatime,seclabel,data=ordered)

mounts in container:
/dev/sde on /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/vsphere-volume/mounts/[ose3-vmware] kubevols/kubernetes-dynamic-pvc-e46bca08-3431-11e7-abad-005056a5d7df.vmdk type ext4 (rw,relatime,seclabel,data=ordered)
/dev/sde on /rootfs/var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/vsphere-volume/mounts/[ose3-vmware] kubevols/kubernetes-dynamic-pvc-e46bca08-3431-11e7-abad-005056a5d7df.vmdk type ext4 (rw,relatime,seclabel,data=ordered)

Comment 4 Jianwei Hou 2017-05-09 06:57:08 UTC
@mawong thank you. I have upgraded my cluster to the latest v3.5.5.13-1, but I still couldn't get my pods running. The provisioned VMDK was attached to my node, and the log didn't report any errors; it just kept printing the following:

```
May  9 14:55:26 node1 journal: I0509 14:55:26.702346    6220 attacher.go:139] Checking VMDK "[datastore1] kubevols/kubernetes-dynamic-pvc-f9adb6ec-347e-11e7-9c55-0050569fb058.vmdk" is attached
May  9 14:55:26 node1 atomic-openshift-node: I0509 14:55:26.702346    6220 attacher.go:139] Checking VMDK "[datastore1] kubevols/kubernetes-dynamic-pvc-f9adb6ec-347e-11e7-9c55-0050569fb058.vmdk" is attached
```

Comment 5 Matthew Wong 2017-05-09 16:55:17 UTC
OK, my bad: the first time I tried it I had enable-controller-attach-detach set to true, but the default is false.

However, the result is the same for me: I tried with false, it still worked, and my pod is running. (The PV fails at deletion time, but again that's a different bug.)

Do you see "AttachVolume.Attach succeeded" or "AttachVolume.Attach failed" in the node logs? Could you attach the full kubelet/node logs capturing around the pod creation time and attempts to attach and mount? Thanks.

Comment 7 Matthew Wong 2017-05-10 18:04:35 UTC
So I dug through the logs and tried it myself on your env. The controller thinks it successfully attached the volume and returns a devicePath like /dev/disk/by-id/wwn-0x6000c29487801c9f215cffef219bc97d, but that devicePath does not exist on the node. The node is told by the controller that the volume is attached, yet since the devicePath does not actually exist, the node fails when it tries to verify at mount time that the VMDK is attached, hence the repeated "Checking VMDK ... is attached" messages.

May 11 00:46:02 node1.c.vsphere.local atomic-openshift-node[6160]: E0511 00:46:02.945067    6220 nestedpendingoperations.go:262] Operation for "\"kubernetes.io/vsphere-volume/[datastore1] kubevols/kubernetes-dynamic-pvc-1616e018-3541-11e7-9c55-0050569fb058.vmdk\"" failed. No retries permitted until 2017-05-11 00:48:02.945024277 +0800 CST (durationBeforeRetry 2m0s). Error: MountVolume.WaitForAttach failed for volume "kubernetes.io/vsphere-volume/[datastore1] kubevols/kubernetes-dynamic-pvc-1616e018-3541-11e7-9c55-0050569fb058.vmdk" (spec.Name: "pvc-1616e018-3541-11e7-9c55-0050569fb058") pod "1d943175-3541-11e7-9c55-0050569fb058" (UID: "1d943175-3541-11e7-9c55-0050569fb058") with: Could not find attached VMDK "[datastore1] kubevols/kubernetes-dynamic-pvc-1616e018-3541-11e7-9c55-0050569fb058.vmdk". Timeout waiting for mount paths to be created.

I don't know what the difference is between my vSphere env and yours that causes this. The vCenter API and UI insist the disk is attached when, from the node's point of view, it isn't. I think this needs to be looked at by the upstream VMware people.
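For what it's worth, the node-side symptom can be checked by hand. This is a minimal shell sketch, not the kubelet's actual code path; the wwn value is the one quoted in this comment, so substitute the path your controller reports:

```shell
# Rough manual check of the controller/node mismatch described above:
# does the devicePath the controller reported actually exist on the node?
check_attached() {
  # succeeds if the given by-id path exists on this host
  [ -e "$1" ]
}

# Hypothetical example path, taken from this comment's logs.
DEV="/dev/disk/by-id/wwn-0x6000c29487801c9f215cffef219bc97d"
if check_attached "$DEV"; then
  echo "device present: $DEV"
else
  echo "device missing: $DEV"
fi
```

If the path is missing while vCenter reports the disk as attached, that matches the WaitForAttach timeout above.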

Comment 8 Matthew Wong 2017-05-11 18:45:43 UTC
I've opened an upstream issue https://github.com/kubernetes/kubernetes/issues/45679 . @jianwei please assist with providing any other environment information you can if I missed anything or they ask for it, thank you!

Comment 9 Jianwei Hou 2017-05-12 01:30:50 UTC
(In reply to Matthew Wong from comment #8)
> I've opened an upstream issue
> https://github.com/kubernetes/kubernetes/issues/45679 . @jianwei please
> assist with providing any other environment information you can if I missed
> anything or they ask for it, thank you!

Thank you, I'll keep updating the issue.

Comment 10 Matthew Wong 2017-05-12 16:32:16 UTC
"looks like you are missing disk.EnableUUID = "TRUE" on the node VM." Can you check this? I'm not sure how to do it myself, thanks!

Comment 12 Jianwei Hou 2017-05-15 03:10:15 UTC
(In reply to Matthew Wong from comment #10)
> "looks like you are missing disk.EnableUUID = "TRUE" on the node VM." Can
> you check this? I'm not sure how to do it myself, thanks!

Thank you, that has fixed the problem. I have replied at https://github.com/kubernetes/kubernetes/issues/45679#issuecomment-301100081
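For readers hitting the same symptom: per comments 10 and 12, the resolution was enabling disk.EnableUUID on the node VM. As a reference (this fragment is not taken from the bug's attachments), the corresponding .vmx advanced-configuration entry looks like:

```
disk.EnableUUID = "TRUE"
```

If the govc CLI is available, this can reportedly be applied with something like `govc vm.change -vm <node-vm> -e disk.enableUUID=TRUE` (the VM name is a placeholder, and the VM typically needs to be power-cycled for the change to take effect); treat the exact invocation as an assumption and verify it against the vSphere documentation.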