Bug 1420645 - Cinder PV/volumes are not getting deleted by the reclaim policy
Summary: Cinder PV/volumes are not getting deleted by the reclaim policy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Jan Safranek
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-02-09 07:22 UTC by Jianwei Hou
Modified: 2017-07-24 14:11 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: OpenShift used the wrong InstanceID when checking whether volumes are attached to nodes, so it could consider a volume detached while it was still attached.
Consequence: Volumes were not detached once they were no longer needed, and therefore could not be deleted according to their reclaim policy.
Fix: OpenShift now uses the correct InstanceID for all attach/detach/check operations.
Result: Volumes are detached and deleted when they are no longer needed.
Clone Of:
Environment:
Last Closed: 2017-04-12 19:12:14 UTC
Target Upstream Version:
Embargoed:


Attachments
master log (9.71 MB, text/plain) - 2017-02-14 14:01 UTC, Jan Safranek


Links
Origin (GitHub) PR 12955 - 2017-02-14 19:48:37 UTC
Red Hat Product Errata RHBA-2017:0884 (SHIPPED_LIVE) - Red Hat OpenShift Container Platform 3.5 RPM Release Advisory - 2017-04-12 22:50:07 UTC

Description Jianwei Hou 2017-02-09 07:22:40 UTC
Description of problem:
Sometimes, dynamically provisioned PVs/volumes are not automatically deleted after their PVCs are deleted on OpenStack. On the nodes these Cinder volumes are all unmounted, but the OpenStack console still shows them as attached to the nodes.

Version-Release number of selected component (if applicable):
openshift v3.5.0.18+9a5d1aa
kubernetes v1.5.2+43a9be4

How reproducible:
Sometimes
Reproduced 3 times recently on two OpenStack environments.

Steps to Reproduce:
1. Rapidly create a batch of dynamic PVCs + pods (see the sketch after these steps). I have seen 7 volumes stuck in 'attached' status, so creating 14 or more should be enough to reproduce.
2. Delete Pods and PVCs
3. List PVs
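
For convenience, here is a minimal sketch of step 1 (my own illustration, not part of the original report): a small Go program that prints N PVC + pod manifest pairs to stdout, so they can be created quickly with "go run repro.go 14 | oc create -f -". The storage-class annotation, the class name "standard", the image, and the object names are assumptions; adjust them for your cluster.

// repro.go: a sketch only; storage class, image, and names are assumptions.
package main

import (
	"fmt"
	"os"
	"strconv"
)

// One PVC plus one pod that mounts it; the pod just sleeps so the volume
// stays attached until the pod is deleted.
const tmpl = `---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamic-pvc%d
  annotations:
    # Beta annotation used for dynamic provisioning on this release;
    # the class name "standard" is an assumption.
    volume.beta.kubernetes.io/storage-class: standard
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: dynamic-pod%d
spec:
  containers:
  - name: c
    image: registry.access.redhat.com/rhel7   # any long-running image works
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /mnt/data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: dynamic-pvc%d
`

func main() {
	n := 14 // matches the "14 or more" suggestion in step 1
	if len(os.Args) > 1 {
		if v, err := strconv.Atoi(os.Args[1]); err == nil {
			n = v
		}
	}
	for i := 1; i <= n; i++ {
		fmt.Printf(tmpl, i, i, i)
	}
}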

Actual results:
After step 3: many PVs are stuck in 'Released' status and are not being deleted.

NAME                                       CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS     CLAIM                      REASON    AGE
pvc-0799c71e-ee8b-11e6-b41f-fa163e86be01   1Gi        RWO           Delete          Released   9l2i0/dynamic-pvc1-9l2i0             1h
pvc-081d884a-ee8b-11e6-b41f-fa163e86be01   2Gi        RWX           Delete          Released   9l2i0/dynamic-pvc2-9l2i0             1h
pvc-08abe1a4-ee8b-11e6-b41f-fa163e86be01   3Gi        ROX           Delete          Released   9l2i0/dynamic-pvc3-9l2i0             1h
pvc-3db0c938-ee8d-11e6-b41f-fa163e86be01   1Gi        RWO           Delete          Released   htsdg/htsdg-mked7c3k                 57m
pvc-48c38a94-ee8a-11e6-81ba-fa163e86be01   1Gi        RWO           Delete          Released   rs8ap/dynamic-pvc1-rs8ap             1h
pvc-49481acd-ee8a-11e6-81ba-fa163e86be01   2Gi        RWX           Delete          Released   rs8ap/dynamic-pvc2-rs8ap             1h
pvc-49c703d5-ee8a-11e6-81ba-fa163e86be01   3Gi        ROX           Delete          Released   rs8ap/dynamic-pvc3-rs8ap             1h
pvc-56ccafc2-ee90-11e6-b41f-fa163e86be01   1Gi        RWO           Delete          Released   ni9ml/pvcsc                          35m
pvc-67b7b6a8-ee8c-11e6-b41f-fa163e86be01   1Gi        RWO           Delete          Released   0udxv/0udxv-mddly5nl                 1h
pvc-7825189a-ee90-11e6-b41f-fa163e86be01   1Gi        RWO           Delete          Released   jhou/cinderc                         34m
pvc-7d1d3e21-ee8c-11e6-b41f-fa163e86be01   1Gi        RWO           Delete          Released   34whw/34whw-0x66ulvo                 1h
pvc-916d6432-ee8c-11e6-b41f-fa163e86be01   1Gi        RWO           Delete          Released   57lw3/57lw3-kxt1-gvl                 1h
pvc-f514f8ed-ee8c-11e6-b41f-fa163e86be01   1Gi        RWO           Delete          Released   az5wb/az5wb-po48xolh  

On the OpenStack console, these volumes were shown as 'Attached to xxx host on /dev/xxx', but on the nodes none of these volumes were in use; they had already been unmounted.
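
To cross-check this kind of discrepancy from outside the cluster, something like the following rough Go sketch can be used (my own, not part of the report): it asks the OpenStack Block Storage v2 API which attachments it holds for a given Cinder volume, authenticating from the usual OS_* environment variables via the gophercloud library. Import paths and field layout may differ between gophercloud versions; treat it as a sketch.

// check_attachments.go: prints the attachment records OpenStack holds for a
// Cinder volume, to compare against what is actually mounted on the node.
// Assumptions: gophercloud v1 import paths, OS_* auth variables set, and a
// Block Storage v2 endpoint in the service catalog.
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack"
	"github.com/gophercloud/gophercloud/openstack/blockstorage/v2/volumes"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <cinder-volume-id>", os.Args[0])
	}

	// Authenticate from OS_AUTH_URL, OS_USERNAME, OS_PASSWORD, ...
	opts, err := openstack.AuthOptionsFromEnv()
	if err != nil {
		log.Fatal(err)
	}
	provider, err := openstack.AuthenticatedClient(opts)
	if err != nil {
		log.Fatal(err)
	}
	client, err := openstack.NewBlockStorageV2(provider, gophercloud.EndpointOpts{
		Region: os.Getenv("OS_REGION_NAME"),
	})
	if err != nil {
		log.Fatal(err)
	}

	// Fetch the volume and show its status plus raw attachment records
	// (server ID, device, ...), which is what the console summarises as
	// "Attached to <host> on /dev/<xxx>".
	vol, err := volumes.Get(client, os.Args[1]).Extract()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("status=%s\n", vol.Status)
	for _, a := range vol.Attachments {
		fmt.Printf("attachment: %+v\n", a)
	}
}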

Expected results:
PVs/Volumes should be deleted.

Additional info:

Comment 8 Jan Safranek 2017-02-14 14:01:18 UTC
Created attachment 1250268 [details]
master log

I reproduced the bug (simply by creating a PVC + pod and deleting both; no load was required). The attached log starts with the provisioning of the volume.

PVC.Namespace/name: default/myclaim
pod.Namespace/name: default/testpod
PV.Name: pvc-1fa0e8b4-f2b5-11e6-a8bb-fa163ecb84eb
Cinder volume ID: 11a4faea-bfc7-4713-88b3-dec492480dba

At 07:57:25.475258, the pod is created.
At 07:59:03.266783, the pod is deleted (i.e. probably enters Terminating phase).
At 07:59:33.871363, the pod is really deleted.

In the log, I can frequently see that the volume is marked as detached even though the pod is still running and the volume is still attached!

I0214 07:58:16.054214   77288 reconciler.go:107] Starting reconciling attached volumes still attached
I0214 07:58:16.154409   77288 reconciler.go:206] Volume "kubernetes.io/cinder/11a4faea-bfc7-4713-88b3-dec492480dba"/Node "host-8-175-110.host.centralci.eng.rdu2.redhat.com" is attached--touching.
I0214 07:58:16.235672   77288 openstack_volumes.go:122] 11a4faea-bfc7-4713-88b3-dec492480dba kubernetes-dynamic-pvc-1fa0e8b4-f2b5-11e6-a8bb-fa163ecb84eb [map[id:11a4faea-bfc7-4713-88b3-dec492480dba server_id:212f723e-513b-4f00-a275-38192ea79fa6 attachment_id:a46be665-35d2-410e-bd3f-e0fd8a100b23 host_name:<nil> volume_id:11a4faea-bfc7-4713-88b3-dec492480dba device:/dev/vdr]]
I0214 07:58:16.235705   77288 attacher.go:140] VolumesAreAttached: check volume "11a4faea-bfc7-4713-88b3-dec492480dba" (specName: "pvc-1fa0e8b4-f2b5-11e6-a8bb-fa163ecb84eb") is no longer attached
I0214 07:58:16.235718   77288 operation_executor.go:565] VerifyVolumesAreAttached determined volume "kubernetes.io/cinder/11a4faea-bfc7-4713-88b3-dec492480dba" (spec.Name: "pvc-1fa0e8b4-f2b5-11e6-a8bb-fa163ecb84eb") is no longer attached to node %!q(MISSING), therefore it was marked as detached.

The same thing happens when the pod is finally deleted and the volume should be detached: it is "marked as detached" but never actually detached from the node.
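
To make the failure chain concrete, here is a toy Go sketch of the check implied by the log above (the types and function are mine, not the real Kubernetes code; the IDs are copied from the attachment record in the log). A check of this shape only answers "attached" when one of the volume's attachment records matches the node InstanceID it was given, so a wrong InstanceID makes it answer "no longer attached"; the volume is then marked detached without a detach ever being issued, OpenStack keeps it attached, and the Cinder volume cannot be deleted by the reclaim policy.

// toy_check.go: a sketch of the failure, not Kubernetes code.
package main

import "fmt"

// attachment mirrors the fields visible in the log record above
// (server_id, volume_id, device).
type attachment struct {
	serverID string
	volumeID string
	device   string
}

// volumeIsAttached reports whether any attachment record points at the
// given node InstanceID; this is the shape of the check that went wrong.
func volumeIsAttached(atts []attachment, nodeInstanceID string) bool {
	for _, a := range atts {
		if a.serverID == nodeInstanceID {
			return true
		}
	}
	return false
}

func main() {
	// Values copied from the attachment map in the log above.
	atts := []attachment{{
		serverID: "212f723e-513b-4f00-a275-38192ea79fa6",
		volumeID: "11a4faea-bfc7-4713-88b3-dec492480dba",
		device:   "/dev/vdr",
	}}

	// With the node's real InstanceID the volume is correctly reported
	// as attached.
	fmt.Println(volumeIsAttached(atts, "212f723e-513b-4f00-a275-38192ea79fa6")) // true

	// With a wrong InstanceID the same volume looks detached, so the
	// controller never calls Detach, OpenStack keeps the attachment, and
	// the Cinder volume can never be deleted, which is why the PV
	// stays Released.
	fmt.Println(volumeIsAttached(atts, "wrong-instance-id")) // false
}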

Comment 10 Jan Safranek 2017-02-14 14:26:54 UTC
It could be fixed by https://github.com/kubernetes/kubernetes/pull/39998/, let me check...

Comment 11 Jan Safranek 2017-02-14 16:42:37 UTC
Confirmed, upstream #39998 fixed the issue for me. Backported to origin as https://bugzilla.redhat.com/show_bug.cgi?id=1420645

Comment 12 Eric Paris 2017-02-14 19:48:37 UTC
Jan meant https://github.com/openshift/origin/pull/12955

Comment 13 Jan Safranek 2017-02-15 09:05:18 UTC
Oh, thanks for the correction.

The PR got merged and will be available in the next build.

Comment 14 Troy Dawson 2017-02-17 20:39:42 UTC
Actually, the merge failed and needs someone to look at it.
https://github.com/openshift/origin/pull/12955

When the merge succeeds, please move this back to MODIFIED.

Comment 15 Jan Safranek 2017-02-20 08:23:34 UTC
Yeah, I'm sorry about that. It's merged now.

Comment 16 Troy Dawson 2017-02-21 22:34:24 UTC
This has been merged into OCP and is in OCP v3.5.0.32 or newer.

Comment 18 Jianwei Hou 2017-02-28 03:10:08 UTC
Verified this is fixed on 
openshift v3.5.0.34
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Comment 20 errata-xmlrpc 2017-04-12 19:12:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884

