Bug 1898521 - [CephFS] Deleting cephfsplugin pod along with app pods will make PV remain in Released state after deleting the PVC
Summary: [CephFS] Deleting cephfsplugin pod along with app pods will make PV remain in Released state after deleting the PVC
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: csi-driver
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.6.0
Assignee: Madhu Rajanna
QA Contact: Jilju Joy
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-17 12:31 UTC by Jilju Joy
Modified: 2020-12-17 06:25 UTC
CC List: 7 users

Fixed In Version: 4.6.0-169.ci
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-17 06:25:30 UTC
Embargoed:


Attachments: None


Links:
Github ceph ceph-csi pull 1690 (closed): cephfs: check only the stderror message for umount (last updated 2021-01-13 04:47:55 UTC)
Github openshift ceph-csi pull 12 (closed): BUG 1898521: check only the stderror message for umount (last updated 2021-01-13 04:47:58 UTC)
Red Hat Product Errata RHSA-2020:5605 (last updated 2020-12-17 06:25:44 UTC)

Description Jilju Joy 2020-11-17 12:31:39 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Deleted one csi-cephfsplugin pod in parallel with the deletion of app pods. The csi-cephfsplugin pod and the app pods were deleted successfully, and a new csi-cephfsplugin pod was created. However, when the PVCs that were attached to the deleted app pods are deleted, the PVs remain in Released state. This happens when the deleted app pod and the deleted csi-cephfsplugin pod were on the same node.

Describe output of one of the PVs (from the test case error details):

E               TimeoutError: Timeout when waiting for pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4 to delete. Describe output: Name:            pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4
E               Labels:          <none>
E               Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.cephfs.csi.ceph.com
E               Finalizers:      [kubernetes.io/pv-protection]
E               StorageClass:    ocs-storagecluster-cephfs
E               Status:          Released
E               Claim:           namespace-test-4716f967cd314d98979fdc3600f279fe/pvc-test-405d5bbac2604987b956c2c88c436195
E               Reclaim Policy:  Delete
E               Access Modes:    RWO
E               VolumeMode:      Filesystem
E               Capacity:        3Gi
E               Node Affinity:   <none>
E               Message:         
E               Source:
E                   Type:              CSI (a Container Storage Interface (CSI) volume source)
E                   Driver:            openshift-storage.cephfs.csi.ceph.com
E                   FSType:            
E                   VolumeHandle:      0001-0011-openshift-storage-0000000000000001-2e5b7bac-229e-11eb-97b0-0a580a81020d
E                   ReadOnly:          false
E                   VolumeAttributes:      clusterID=openshift-storage
E                                          fsName=ocs-storagecluster-cephfilesystem
E                                          storage.kubernetes.io/csiProvisionerIdentity=1604917071728-8081-openshift-storage.cephfs.csi.ceph.com
E                                          subvolumeName=csi-vol-2e5b7bac-229e-11eb-97b0-0a580a81020d
E               Events:
E                 Type     Reason              Age                 From                                                                                                                      Message
E                 ----     ------              ----                ----                                                                                                                      -------
E                 Warning  VolumeFailedDelete  59s (x8 over 2m5s)  openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-699d6c9544-2cqhp_3c9b58a5-0a28-4c59-b0c1-a7717ece122d  persistentvolume pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4 is still attached to node compute-1


The events show that the PV is still attached to node compute-1. The test case checks the df output from the node and ensures that the PV is unmounted after the app pod is deleted.


The error below is repeated in the csi-cephfsplugin container logs of the csi-cephfsplugin-rd5wt pod (the new pod on node compute-1):

2020-11-09T15:22:49.617284913Z I1109 15:22:49.617219       1 cephcmds.go:53] ID: 13 Req-ID: 0001-0011-openshift-storage-0000000000000001-2e5b7bac-229e-11eb-97b0-0a580a81020d an error (exit status 32) and stdError (umount: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4/globalmount: not mounted.
2020-11-09T15:22:49.617284913Z ) occurred while running umount args: [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4/globalmount]
2020-11-09T15:22:49.617284913Z E1109 15:22:49.617241       1 utils.go:163] ID: 13 Req-ID: 0001-0011-openshift-storage-0000000000000001-2e5b7bac-229e-11eb-97b0-0a580a81020d GRPC error: rpc error: code = Internal desc = an error (exit status 32) and stdError (umount: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4/globalmount: not mounted.
2020-11-09T15:22:49.617284913Z ) occurred while running umount args: [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4/globalmount]
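
For context, the linked ceph-csi fix (pull 1690 above) makes the node plugin inspect the umount stderr message instead of failing on the non-zero exit status, so an already-unmounted staging path is treated as an idempotent success. Below is a minimal Go sketch of that idea; the helper name and surrounding error handling are illustrative assumptions, not the actual ceph-csi code.

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// unmountStagePath is a hypothetical helper: it runs umount on the staging
// path and, if umount reports "not mounted" on stderr (exit status 32 in the
// logs above), treats the path as already unmounted and returns success so
// the unstage call stays idempotent instead of returning a GRPC Internal error.
func unmountStagePath(stagingPath string) error {
	cmd := exec.Command("umount", stagingPath)
	var stderr strings.Builder
	cmd.Stderr = &stderr

	if err := cmd.Run(); err != nil {
		if strings.Contains(stderr.String(), "not mounted") {
			// The staging path is already unmounted (e.g. the plugin pod
			// was restarted after kubelet cleaned it up); nothing to do.
			return nil
		}
		return fmt.Errorf("umount %s failed: %v, stderr: %s",
			stagingPath, err, stderr.String())
	}
	return nil
}

func main() {
	path := "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/example/globalmount"
	if err := unmountStagePath(path); err != nil {
		fmt.Println("unstage failed:", err)
	} else {
		fmt.Println("unstage succeeded (or path was not mounted)")
	}
}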



From the csi-provisioner container logs of the csi-cephfsplugin-provisioner-699d6c9544-2cqhp pod:


2020-11-09T15:23:32.488291409Z I1109 15:23:32.488242       1 controller.go:1453] delete "pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4": started
2020-11-09T15:23:32.490149096Z E1109 15:23:32.490120       1 controller.go:1463] delete "pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4": volume deletion failed: persistentvolume pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4 is still attached to node compute-1
2020-11-09T15:23:32.490176890Z W1109 15:23:32.490163       1 controller.go:998] Retrying syncing volume "pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4", failure 0
2020-11-09T15:23:32.490195058Z E1109 15:23:32.490182       1 controller.go:1016] error syncing volume "pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4": persistentvolume pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4 is still attached to node compute-1
2020-11-09T15:23:32.490231329Z I1109 15:23:32.490213       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4", UID:"3b263b4e-a00e-4f34-9ac7-9a187060a7e0", APIVersion:"v1", ResourceVersion:"217276", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' persistentvolume pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4 is still attached to node compute-1
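
The "is still attached to node compute-1" errors above indicate that the provisioner will not delete the PV while a node still holds an attachment for it, which appears to be a consequence of the failing unstage calls in the plugin logs. As a diagnostic, the VolumeAttachment objects that still reference the PV can be listed with client-go; the sketch below assumes a standard kubeconfig path and uses the PV name from the describe output above.

package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// PV name from the describe output above; the kubeconfig path is an assumption.
	pvName := "pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4"
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")

	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// List all VolumeAttachments and report the ones that still reference the PV.
	vaList, err := clientset.StorageV1().VolumeAttachments().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for _, va := range vaList.Items {
		pv := va.Spec.Source.PersistentVolumeName
		if pv != nil && *pv == pvName {
			fmt.Printf("VolumeAttachment %s: PV %s still attached to node %s (attached=%v)\n",
				va.Name, pvName, va.Spec.NodeName, va.Status.Attached)
		}
	}
}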


ocs-ci test case: tests.manage.pv_services.test_resource_deletion_during_pod_pvc_deletion.TestDeleteResourceDuringPodPvcDeletion.test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-cephfsplugin]


must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j006vu1cs33-t4cn/j006vu1cs33-t4cn_20201109T092655/logs/failed_testcase_ocs_logs_1604918119/test_disruptive_during_pod_pvc_deletion%5bCephFileSystem-delete_pods-cephfsplugin%5d_ocs_logs/

List of PVs in Released state: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j006vu1cs33-t4cn/j006vu1cs33-t4cn_20201109T092655/logs/failed_testcase_ocs_logs_1604918119/test_disruptive_during_pod_pvc_deletion%5bCephFileSystem-delete_pods-cephfsplugin%5d_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-87129c2124ed57e69eff5e20f8e4438ee602a30e6868ecbe581acd7d3ef4070a/cluster-scoped-resources/oc_output/get_pv

Test case debug logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j006vu1cs33-t4cn/j006vu1cs33-t4cn_20201109T092655/logs/ocs-ci-logs-1604918119/by_outcome/failed/tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py/TestDeleteResourceDuringPodPvcDeletion/test_disruptive_during_pod_pvc_deletion-CephFileSystem-delete_pods-cephfsplugin/

The df output from the worker nodes after deleting the app pods is present in the test case debug logs.

The name of the deleted csi-cephfsplugin pod is csi-cephfsplugin-ccxwh (node compute-1).


=============================================================================
Version of all relevant components (if applicable):
OCS operator	v4.6.0-156.ci

Ceph Version	14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)

Cluster Version	4.6.0-0.nightly-2020-11-07-035509

cephfsplugin 8214efd14326e38f7edfbf7c0e4110ab0ac613b059f41727ece35e128a913526

rook_csi_ceph cephcsi@sha256:8214efd14326e38f7edfbf7c0e4110ab0ac613b059f41727ece35e128a913526


=============================================================================
Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?
The PV is not deleted after the PVC is deleted.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

=============================================================================

Is this issue reproducible?
Yes, 3/5
Seems like a corner case.


Can this issue be reproduced from the UI?



If this is a regression, please provide more details to justify this:
All 4 runs of this test case in OCS 4.5 passed.

===============================================================================
Steps to Reproduce:
1. Create a set of PVCs and pods (minimum one pod on each node).
2. Start deleting the pods in a loop.
3. While step 2 is in progress, delete one csi-cephfsplugin pod. Wait for the new
   csi-cephfsplugin pod to reach the Running state.
4. Wait for step 2 to complete and ensure that the pods are deleted.
5. Delete the PVCs.
6. Ensure the PVCs are deleted.
7. Ensure the PVs are deleted (the reclaimPolicy is Delete).


OR

Run this test case 
tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-cephfsplugin]


Actual results:
Some of the PVs are not deleted. These are the volumes that were attached to the node where the deleted csi-cephfsplugin pod was running.


Expected results:
All of the PVs should be deleted.


Additional info:

Comment 3 Yaniv Kaul 2020-11-17 13:19:28 UTC
Why would you delete the csi-cephfsplugin pod?
I would suggest that a customer who does that open a customer case to resolve the issue.

I'd like to CLOSE-WONTFIX this BZ; I see no reason we'll handle this (unless I'm missing something here!)

Comment 15 Jilju Joy 2020-11-27 08:05:11 UTC
Verified in version:

OCS operator	v4.6.0-178.ci
Cluster Version	4.6.0-0.nightly-2020-11-26-234822
rook_csi_ceph	cephcsi@sha256:fc2de7d391db086c7758543d1ee81d8ec4d74a6eb6a8ef76d9ff9ac1718e64d7


Performed the step mentioned in comment #4 and then deleted the PVC. The PV was also deleted.


Logs from the csi-cephfsplugin container of the csi-cephfsplugin-zndvb pod while deleting the app pod:

I1127 07:47:56.621625       1 utils.go:160] ID: 203 Req-ID: 0001-0011-openshift-storage-0000000000000001-5fda1128-307d-11eb-9ffe-0a580a830015 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-ab504e99-a281-450d-b143-93d269de2b71/globalmount","volume_id":"0001-0011-openshift-storage-0000000000000001-5fda1128-307d-11eb-9ffe-0a580a830015"}
I1127 07:47:56.623216       1 cephcmds.go:53] ID: 203 Req-ID: 0001-0011-openshift-storage-0000000000000001-5fda1128-307d-11eb-9ffe-0a580a830015 an error (exit status 32) and stdError (umount: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-ab504e99-a281-450d-b143-93d269de2b71/globalmount: not mounted.
) occurred while running umount args: [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-ab504e99-a281-450d-b143-93d269de2b71/globalmount]
I1127 07:47:56.623243       1 nodeserver.go:301] ID: 203 Req-ID: 0001-0011-openshift-storage-0000000000000001-5fda1128-307d-11eb-9ffe-0a580a830015 cephfs: successfully unmounted volume 0001-0011-openshift-storage-0000000000000001-5fda1128-307d-11eb-9ffe-0a580a830015 from /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-ab504e99-a281-450d-b143-93d269de2b71/globalmount


Logs from the csi-provisioner container of the csi-cephfsplugin-provisioner-7877dbbb77-nm7wn pod while deleting the PVC:

I1127 07:49:41.387991       1 controller.go:1468] delete "pvc-ab504e99-a281-450d-b143-93d269de2b71": volume deleted
I1127 07:49:41.394162       1 controller.go:1518] delete "pvc-ab504e99-a281-450d-b143-93d269de2b71": persistentvolume deleted
E1127 07:49:41.394191       1 controller.go:1521] couldn't create key for object pvc-ab504e99-a281-450d-b143-93d269de2b71: object has no meta: object does not implement the Object interfaces
I1127 07:49:41.394210       1 controller.go:1523] delete "pvc-ab504e99-a281-450d-b143-93d269de2b71": succeeded



Also verified using the test case tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-cephfsplugin]
Test case passed - https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/15213/

Comment 17 errata-xmlrpc 2020-12-17 06:25:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605

