Description of problem:
When a VM is reported as no longer present in the cloud provider and is deleted by the node controller, there is no attempt to detach the respective volumes. For example, if a VM is powered off and its pods are migrated to other nodes, then in the case of vSphere the VM cannot be started again, because it still holds mount points to volumes that are now attached to other VMs.

Please check:
https://github.com/kubernetes/kubernetes/pull/40118
https://github.com/kubernetes/kubernetes/issues/33061

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
The VM fails to start because the volume is still attached.

Expected results:
Volumes are detached when the VM is no longer present or is powered off.

Additional info:
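As a rough sketch of how to confirm the stale attachment (node and PV names here are placeholders, not taken from this report), the volumes the attach/detach controller still considers attached to the powered-off node can be compared against the PV's backing VMDK path:

# oc get node <node-name> -o jsonpath='{.status.volumesAttached}'
# oc get pv <pv-name> -o jsonpath='{.spec.vsphereVolume.volumePath}'

If the PV's volumePath still appears in the powered-off node's volumesAttached list while the replacement pod is scheduled elsewhere, the VM is holding the disk and cannot be powered back on.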
Can you please post the controller logs?
*** Bug 1622245 has been marked as a duplicate of this bug. ***
Tested with the below OCP version:
openshift v3.9.55
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.16

Prepare a pod with persistent volumes.

# oc get pods -n test
NAME                               READY     STATUS    RESTARTS   AGE
cakephp-mysql-persistent-1-build   1/1       Running   0          3m
mysql-1-zzfch                      1/1       Running   0          3m

# oc get pvc -n test
NAME      STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
mysql     Bound     pvc-a080ce24-f490-11e8-a6dc-0050569f5322   1Gi        RWO            standard       4m

# oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS    CLAIM        STORAGECLASS   REASON    AGE
pvc-a080ce24-f490-11e8-a6dc-0050569f5322   1Gi        RWO            Delete           Bound     test/mysql   standard                 4m

# oc get pods -n test mysql-1-zzfch -o yaml | grep -i nodename
  nodeName: qe-lxia-39-node-registry-router-1

Then shut down the node with "shutdown -h".
Wait (about 1 minute) for the node to become NotReady.
Wait (about 2 minutes) for the pod to become Unknown; a new pod is also created and stays in ContainerCreating.

Delete the old pod:
# oc delete pod -n test mysql-1-zzfch --grace-period=0 --force

The new pod eventually becomes Running after some time (8 minutes in my case).
# oc get pod -n test
NAME            READY     STATUS    RESTARTS   AGE
mysql-1-l5fqs   1/1       Running   0          12m

Though the events show a mount failure at first, the volume is finally mounted successfully and the pod becomes Running. The commands after this comment can be used to follow the attach/detach progress.

Events:
  Type     Reason                 Age              From                     Message
  ----     ------                 ----             ----                     -------
  Normal   Scheduled              10m              default-scheduler        Successfully assigned mysql-1-l5fqs to openshift-195
  Warning  FailedAttachVolume     10m              attachdetach-controller  Multi-Attach error for volume "pvc-a080ce24-f490-11e8-a6dc-0050569f5322" Volume is already used by pod(s) mysql-1-zzfch
  Normal   SuccessfulMountVolume  10m              kubelet, openshift-195   MountVolume.SetUp succeeded for volume "default-token-zspdf"
  Warning  FailedMount            3m (x3 over 8m)  kubelet, openshift-195   Unable to mount volumes for pod "mysql-1-l5fqs_test(5aef15d8-f492-11e8-a6dc-0050569f5322)": timeout expired waiting for volumes to attach/mount for pod "test"/"mysql-1-l5fqs". list of unattached/unmounted volumes=[mysql-data]
  Normal   SuccessfulMountVolume  1m               kubelet, openshift-195   MountVolume.SetUp succeeded for volume "pvc-a080ce24-f490-11e8-a6dc-0050569f5322"
  Normal   Pulled                 1m               kubelet, openshift-195   Container image "registry.access.redhat.com/rhscl/mysql-57-rhel7@sha256:75665d5efd7f051fa8b308207fac269b2d8cae0848007dcad4a6ffdcddf569cb" already present on machine
  Normal   Created                1m               kubelet, openshift-195   Created container
  Normal   Started                1m               kubelet, openshift-195   Started container
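For reference, the roughly 8-minute delay seen above is consistent with the attach/detach controller waiting for its force-detach timeout (about 6 minutes after the old pod is removed) before detaching the volume from the unreachable node. The progress can be followed with commands like the following (namespace and pod name are the ones used in this verification; the grep filters are only illustrative):

# oc get events -n test --sort-by=.lastTimestamp | grep -i volume
# oc describe pod -n test mysql-1-l5fqs | grep -A 10 Events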
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3748