Description of problem (please be as detailed as possible and provide log snippets):

Deleted one csi-cephfsplugin pod in parallel with app pod deletion. The cephfsplugin pod and the app pods were deleted successfully, and a new csi-cephfsplugin pod was created. But when deleting the PVCs that were attached to the deleted app pods, the PVs remain in Released state. This happens if the deleted app pod and the deleted csi-cephfsplugin pod were on the same node.

Describe output of one of the PVs (from the test case error details):

TimeoutError: Timeout when waiting for pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4 to delete. Describe output:
Name:            pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.cephfs.csi.ceph.com
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    ocs-storagecluster-cephfs
Status:          Released
Claim:           namespace-test-4716f967cd314d98979fdc3600f279fe/pvc-test-405d5bbac2604987b956c2c88c436195
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        3Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            openshift-storage.cephfs.csi.ceph.com
    FSType:
    VolumeHandle:      0001-0011-openshift-storage-0000000000000001-2e5b7bac-229e-11eb-97b0-0a580a81020d
    ReadOnly:          false
    VolumeAttributes:  clusterID=openshift-storage
                       fsName=ocs-storagecluster-cephfilesystem
                       storage.kubernetes.io/csiProvisionerIdentity=1604917071728-8081-openshift-storage.cephfs.csi.ceph.com
                       subvolumeName=csi-vol-2e5b7bac-229e-11eb-97b0-0a580a81020d
Events:
  Type     Reason              Age                 From                                                                                                                       Message
  ----     ------              ----                ----                                                                                                                       -------
  Warning  VolumeFailedDelete  59s (x8 over 2m5s)  openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-699d6c9544-2cqhp_3c9b58a5-0a28-4c59-b0c1-a7717ece122d  persistentvolume pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4 is still attached to node compute-1

The events show that the PV is still attached to node compute-1.
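A quick way to spot volumes stuck this way (a sketch only; assumes `oc` access to the affected cluster) is to list the PVs whose phase is Released, then feed each name to `oc describe pv` to check for the "is still attached to node" event:

```shell
#!/usr/bin/env bash
# Sketch: list PVs stuck in the Released phase so their
# VolumeFailedDelete events can be inspected with `oc describe pv <name>`.
list_released_pvs() {
  oc get pv -o jsonpath='{range .items[?(@.status.phase=="Released")]}{.metadata.name}{"\n"}{end}'
}
```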
The test case checks the df output from the node and ensures that the PV is unmounted after deleting the app pod.

The below error is repeated in the csi-cephfsplugin container logs of pod csi-cephfsplugin-rd5wt (the new pod on node compute-1):

2020-11-09T15:22:49.617284913Z I1109 15:22:49.617219 1 cephcmds.go:53] ID: 13 Req-ID: 0001-0011-openshift-storage-0000000000000001-2e5b7bac-229e-11eb-97b0-0a580a81020d an error (exit status 32) and stdError (umount: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4/globalmount: not mounted.
2020-11-09T15:22:49.617284913Z ) occurred while running umount args: [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4/globalmount]
2020-11-09T15:22:49.617284913Z E1109 15:22:49.617241 1 utils.go:163] ID: 13 Req-ID: 0001-0011-openshift-storage-0000000000000001-2e5b7bac-229e-11eb-97b0-0a580a81020d GRPC error: rpc error: code = Internal desc = an error (exit status 32) and stdError (umount: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4/globalmount: not mounted.
2020-11-09T15:22:49.617284913Z ) occurred while running umount args: [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4/globalmount]

From the csi-provisioner container logs of pod csi-cephfsplugin-provisioner-699d6c9544-2cqhp:

2020-11-09T15:23:32.488291409Z I1109 15:23:32.488242 1 controller.go:1453] delete "pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4": started
2020-11-09T15:23:32.490149096Z E1109 15:23:32.490120 1 controller.go:1463] delete "pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4": volume deletion failed: persistentvolume pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4 is still attached to node compute-1
2020-11-09T15:23:32.490176890Z W1109 15:23:32.490163 1 controller.go:998] Retrying syncing volume "pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4", failure 0
2020-11-09T15:23:32.490195058Z E1109 15:23:32.490182 1 controller.go:1016] error syncing volume "pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4": persistentvolume pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4 is still attached to node compute-1
2020-11-09T15:23:32.490231329Z I1109 15:23:32.490213 1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4", UID:"3b263b4e-a00e-4f34-9ac7-9a187060a7e0", APIVersion:"v1", ResourceVersion:"217276", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' persistentvolume pvc-e622cd2b-0b87-4f49-a709-b89664ca6ec4 is still attached to node compute-1

ocs-ci test case:
tests.manage.pv_services.test_resource_deletion_during_pod_pvc_deletion.TestDeleteResourceDuringPodPvcDeletion.test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-cephfsplugin]

must-gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j006vu1cs33-t4cn/j006vu1cs33-t4cn_20201109T092655/logs/failed_testcase_ocs_logs_1604918119/test_disruptive_during_pod_pvc_deletion%5bCephFileSystem-delete_pods-cephfsplugin%5d_ocs_logs/

List of PVs in Released state:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j006vu1cs33-t4cn/j006vu1cs33-t4cn_20201109T092655/logs/failed_testcase_ocs_logs_1604918119/test_disruptive_during_pod_pvc_deletion%5bCephFileSystem-delete_pods-cephfsplugin%5d_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-87129c2124ed57e69eff5e20f8e4438ee602a30e6868ecbe581acd7d3ef4070a/cluster-scoped-resources/oc_output/get_pv

Test case debug logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j006vu1cs33-t4cn/j006vu1cs33-t4cn_20201109T092655/logs/ocs-ci-logs-1604918119/by_outcome/failed/tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py/TestDeleteResourceDuringPodPvcDeletion/test_disruptive_during_pod_pvc_deletion-CephFileSystem-delete_pods-cephfsplugin/

The df output from the worker nodes after deleting the app pods is present in the test case debug logs. The name of the deleted csi-cephfsplugin pod is csi-cephfsplugin-ccxwh (node compute-1).
=============================================================================
Version of all relevant components (if applicable):
OCS operator: v4.6.0-156.ci
Ceph version: 14.2.8-111.el8cp (2e6029d57bc594eceba4751373da6505028c2650) nautilus (stable)
Cluster version: 4.6.0-0.nightly-2020-11-07-035509
cephfsplugin: 8214efd14326e38f7edfbf7c0e4110ab0ac613b059f41727ece35e128a913526
rook_csi_ceph: cephcsi@sha256:8214efd14326e38f7edfbf7c0e4110ab0ac613b059f41727ece35e128a913526
=============================================================================
Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
The PV is not deleted after deleting the PVC.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3
=============================================================================
Can this issue be reproduced?
Yes, 3/5. Seems like a corner case.

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:
All 4 runs of this test case in OCS 4.5 passed.
===============================================================================
Steps to Reproduce:
1. Create a set of PVCs and pods (at least one pod on each node).
2. Start deleting the pods in a loop.
3. While step 2 is in progress, delete one csi-cephfsplugin pod. Wait for the new csi-cephfsplugin pod to reach Running state.
4. Wait for step 2 to complete and ensure that the pods are deleted.
5. Delete the PVCs.
6. Ensure the PVCs are deleted.
7. Ensure the PVs are deleted (reclaimPolicy is Delete).

OR

Run the test case tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-cephfsplugin]

Actual results:
Some of the PVs are not deleted. These are the volumes attached to the node where the deleted csi-cephfsplugin pod was running.

Expected results:
All of the PVs should be deleted.

Additional info:
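The manual steps above can be sketched as a shell helper. This is illustrative only: the app namespace is a parameter and the `app=csi-cephfsplugin` label selector is an assumption; adjust both for the cluster under test.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the reproduction steps; the namespace and the
# label selector are assumptions. Defined as a function so it can be
# dry-run against a stubbed `oc`.
reproduce_pv_leak() {
  local ns="${1:?usage: reproduce_pv_leak <app-namespace>}"

  # Step 2: start deleting the app pods in a loop, in the background.
  for pod in $(oc get pods -n "$ns" -o name); do
    oc delete -n "$ns" "$pod" --wait=false
  done &

  # Step 3: while the deletions are in progress, delete one csi-cephfsplugin pod.
  oc delete -n openshift-storage \
    "$(oc get pods -n openshift-storage -l app=csi-cephfsplugin -o name | head -n1)"

  # Step 4: wait for the app pod deletions to finish.
  wait

  # Steps 5-7: delete the PVCs; with reclaimPolicy Delete the bound PVs
  # should be removed as well, so none should remain in Released state.
  oc delete pvc --all -n "$ns"
  oc get pv
}
```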
Why would you delete the csi-cephfsplugin pod? I would expect a customer who does that to open a customer case to resolve the issue. I'd like to CLOSE-WONTFIX this BZ; I see no reason for us to handle this (unless I'm missing something here!)
Verified in version:
OCS operator: v4.6.0-178.ci
Cluster version: 4.6.0-0.nightly-2020-11-26-234822
rook_csi_ceph: cephcsi@sha256:fc2de7d391db086c7758543d1ee81d8ec4d74a6eb6a8ef76d9ff9ac1718e64d7

Performed the steps mentioned in comment #4 and then deleted the PVC. The PV also got deleted.

Logs from the csi-cephfsplugin container of pod csi-cephfsplugin-zndvb while deleting the app pod:

I1127 07:47:56.621625 1 utils.go:160] ID: 203 Req-ID: 0001-0011-openshift-storage-0000000000000001-5fda1128-307d-11eb-9ffe-0a580a830015 GRPC request: {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-ab504e99-a281-450d-b143-93d269de2b71/globalmount","volume_id":"0001-0011-openshift-storage-0000000000000001-5fda1128-307d-11eb-9ffe-0a580a830015"}
I1127 07:47:56.623216 1 cephcmds.go:53] ID: 203 Req-ID: 0001-0011-openshift-storage-0000000000000001-5fda1128-307d-11eb-9ffe-0a580a830015 an error (exit status 32) and stdError (umount: /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-ab504e99-a281-450d-b143-93d269de2b71/globalmount: not mounted.
) occurred while running umount args: [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-ab504e99-a281-450d-b143-93d269de2b71/globalmount]
I1127 07:47:56.623243 1 nodeserver.go:301] ID: 203 Req-ID: 0001-0011-openshift-storage-0000000000000001-5fda1128-307d-11eb-9ffe-0a580a830015 cephfs: successfully unmounted volume 0001-0011-openshift-storage-0000000000000001-5fda1128-307d-11eb-9ffe-0a580a830015 from /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-ab504e99-a281-450d-b143-93d269de2b71/globalmount

Logs from the csi-provisioner container of pod csi-cephfsplugin-provisioner-7877dbbb77-nm7wn while deleting the PVC:
I1127 07:49:41.387991 1 controller.go:1468] delete "pvc-ab504e99-a281-450d-b143-93d269de2b71": volume deleted
I1127 07:49:41.394162 1 controller.go:1518] delete "pvc-ab504e99-a281-450d-b143-93d269de2b71": persistentvolume deleted
E1127 07:49:41.394191 1 controller.go:1521] couldn't create key for object pvc-ab504e99-a281-450d-b143-93d269de2b71: object has no meta: object does not implement the Object interfaces
I1127 07:49:41.394210 1 controller.go:1523] delete "pvc-ab504e99-a281-450d-b143-93d269de2b71": succeeded

Also verified using the test case tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-cephfsplugin]

Test case passed - https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/15213/
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5605