Bug 1869330
Summary: | Deletion of PVC while performing Ceph/OCS pod deletion leaves behind PV in Released state | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Rachael <rgeorge>
Component: | csi-driver | Assignee: | Madhu Rajanna <mrajanna>
Status: | CLOSED ERRATA | QA Contact: | Oded <oviner>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | 4.5 | CC: | ebenahar, hchiramm, jijoy, madam, mrajanna, muagarwa, nberry, ocs-bugs
Target Milestone: | --- | |
Target Release: | OCS 4.5.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | 4.5.0-64.ci | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-09-15 10:18:38 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Comment 5
Neha Berry
2020-08-17 16:43:25 UTC

Proposing as a blocker, as this issue was seen in two different test cases in tier4 executions, and it would help to get an initial analysis.

(In reply to Neha Berry from comment #5)
> Proposing as a blocker as this issue was seen in two different test cases
> for tier4 executions and it would help to get an initial analysis

I'm sorry, but I'm really missing an understanding here of why this is even considered a blocker. Would you like to defer a release until this is fixed? Why? What's the user impact? What's the frequency? How hard is it to recover from it?

Not a blocker. Simple fix, the PR is already there.

(In reply to Mudit Agarwal from comment #12)
> Not a blocker. Simple fix, PR is already there.

True. The backport is merged too. This should not be a blocker, though.

Hi Madhu,

Can you please confirm whether the root cause is the same in the case below? This time it is another test case, which kills one MDS daemon while PVC deletion is in progress.

Test case: pv_services.test_daemon_kill_during_pvc_pod_deletion_and_io.TestDaemonKillDuringMultipleDeleteOperations::test_daemon_kill_during_pvc_pod_deletion_and_io[CephFileSystem-mds]

The test case failed due to a timeout while waiting for PV pvc-927f7f09-d36c-400b-b6a4-d31b5b848d4c to be deleted:

E TimeoutError: Timeout when waiting for pvc-927f7f09-d36c-400b-b6a4-d31b5b848d4c to delete. Describe output:
E Name:            pvc-927f7f09-d36c-400b-b6a4-d31b5b848d4c
E Labels:          <none>
E Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.cephfs.csi.ceph.com
E Finalizers:      [kubernetes.io/pv-protection]
E StorageClass:    ocs-storagecluster-cephfs
E Status:          Released
E Claim:           namespace-test-539904cade1f4ea6b7e0b7fd60e74e6d/pvc-test-cd1e10c8a9a447fa82c2657eaefda116
E Reclaim Policy:  Delete
E Access Modes:    RWX
E VolumeMode:      Filesystem
E Capacity:        3Gi
E Node Affinity:   <none>
E Message:
E Source:
E     Type:              CSI (a Container Storage Interface (CSI) volume source)
E     Driver:            openshift-storage.cephfs.csi.ceph.com
E     FSType:            ext4
E     VolumeHandle:      0001-0011-openshift-storage-0000000000000001-e79b8e2b-dff4-11ea-b618-0a580a800212
E     ReadOnly:          false
E     VolumeAttributes:  clusterID=openshift-storage
E                        fsName=ocs-storagecluster-cephfilesystem
E                        storage.kubernetes.io/csiProvisionerIdentity=1597555290158-8081-openshift-storage.cephfs.csi.ceph.com
E Events:
E   Type     Reason              Age                    From                                                                                                                      Message
E   ----     ------              ----                   ----                                                                                                                      -------
E   Warning  VolumeFailedDelete  7m24s                  openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-c748c89bf-b6bj6_658b2045-2da6-4e75-a8c6-e2f400f2b1ba  rpc error: code = DeadlineExceeded desc = context deadline exceeded
E   Warning  VolumeFailedDelete  6m39s (x6 over 7m23s)  openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-c748c89bf-b6bj6_658b2045-2da6-4e75-a8c6-e2f400f2b1ba  rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-e79b8e2b-dff4-11ea-b618-0a580a800212 already exists
E   Warning  VolumeFailedDelete  2m50s (x3 over 6m6s)   openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-c748c89bf-b6bj6_658b2045-2da6-4e75-a8c6-e2f400f2b1ba  rpc error: code = Aborted desc = an operation with the given Volume ID pvc-927f7f09-d36c-400b-b6a4-d31b5b848d4c already exists

ocs_ci/ocs/ocp.py:655: TimeoutError

I don't see the "subvolume 'csi-vol-e79b8e2b-dff4-11ea-b618-0a580a800212' does not exist" error in the PV describe output, but it is present in the csi-cephfsplugin container log of csi-cephfsplugin-provisioner-c748c89bf-b6bj6.
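For context on the repeated `Aborted` events above: ceph-csi keeps a per-volume-ID operation guard, so while one DeleteVolume call for a volume is still in flight (here, stuck on the missing subvolume), every retried delete for the same ID is rejected with "an operation with the given Volume ID ... already exists". Below is a minimal Python sketch of that kind of guard; it is purely illustrative (the real implementation lives in Go inside ceph-csi, and the `VolumeLocks`/`try_acquire` names here are hypothetical):

```python
import threading

class VolumeLocks:
    """Illustrative per-volume-ID operation guard, modeled on the
    behavior visible in the provisioner events: a second operation
    on the same volume ID is rejected while the first is in flight."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._in_flight = set()

    def try_acquire(self, volume_id: str) -> bool:
        """Return True if no other operation currently holds this ID."""
        with self._mutex:
            if volume_id in self._in_flight:
                return False
            self._in_flight.add(volume_id)
            return True

    def release(self, volume_id: str) -> None:
        with self._mutex:
            self._in_flight.discard(volume_id)


locks = VolumeLocks()

def delete_volume(volume_id: str) -> None:
    if not locks.try_acquire(volume_id):
        # This is the condition the CSI sidecar surfaces as:
        # rpc error: code = Aborted desc = an operation with the
        # given Volume ID <id> already exists
        raise RuntimeError(
            f"an operation with the given Volume ID {volume_id} already exists")
    try:
        pass  # the actual subvolume deletion would happen here
    finally:
        locks.release(volume_id)
```

This also explains why the events alternate between DeadlineExceeded and Aborted: the first delete times out at the gRPC deadline but keeps holding the guard, so the external provisioner's retries are aborted immediately.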
2020-08-16T19:31:04.938542271Z E0816 19:31:04.938488 1 volume.go:82] ID: 1768 Req-ID: 0001-0011-openshift-storage-0000000000000001-e79b8e2b-dff4-11ea-b618-0a580a800212 failed to get the rootpath for the vol csi-vol-e79b8e2b-dff4-11ea-b618-0a580a800212 (an error (exit status 2) occurred while running ceph args: [fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-e79b8e2b-dff4-11ea-b618-0a580a800212 --group_name csi -m 172.30.103.71:6789,172.30.230.64:6789,172.30.195.226:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]) stdError Error ENOENT: subvolume 'csi-vol-e79b8e2b-dff4-11ea-b618-0a580a800212' does not exist

2020-08-16T19:31:04.938598891Z E0816 19:31:04.938561 1 utils.go:163] ID: 1768 Req-ID: 0001-0011-openshift-storage-0000000000000001-e79b8e2b-dff4-11ea-b618-0a580a800212 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-927f7f09-d36c-400b-b6a4-d31b5b848d4c already exists

must-gather - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/prsurve-slave-14/prsurve-slave-14_20200816T040807/logs/failed_testcase_ocs_logs_1597555947/test_daemon_kill_during_pvc_pod_deletion_and_io%5bCephFileSystem-mds%5d_ocs_logs/

Test case log - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/prsurve-slave-14/prsurve-slave-14_20200816T040807/logs/ocs-ci-logs-1597555947/by_outcome/failed/tests/manage/pv_services/test_daemon_kill_during_pvc_pod_deletion_and_io.py/TestDaemonKillDuringMultipleDeleteOperations/test_daemon_kill_during_pvc_pod_deletion_and_io-CephFileSystem-mds/

Keywords: Regression
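The log above shows the likely shape of the underlying problem: the delete path errors out when the backing subvolume is already gone (ENOENT), even though the CSI spec requires DeleteVolume to be idempotent, i.e. deleting an already-deleted volume must succeed. I have not traced the exact fix PR here, but the general shape of such a fix can be sketched in Python; the `delete_subvolume` helper below is hypothetical (the real driver is written in Go and uses the ceph CLI/APIs directly rather than shelling out like this):

```python
import errno
import subprocess

def delete_subvolume(fs_name: str, subvol: str, group: str = "csi") -> None:
    """Hypothetical idempotent delete: if the subvolume is already gone
    (ENOENT, i.e. exit status 2 from the ceph CLI as in the log above),
    treat the delete as successful instead of failing."""
    result = subprocess.run(
        ["ceph", "fs", "subvolume", "rm", fs_name, subvol,
         "--group_name", group],
        capture_output=True, text=True)
    if result.returncode == 0:
        return
    if result.returncode == errno.ENOENT and "does not exist" in result.stderr:
        # Subvolume was already removed (e.g. by an earlier, partially
        # completed delete) -- report success so the PV can be reclaimed
        # instead of staying in Released.
        return
    raise RuntimeError(f"subvolume rm failed: {result.stderr.strip()}")
```

With a non-idempotent delete, a PV whose subvolume was removed in a partially completed earlier attempt can never finish its Delete reclaim and stays in Released, which matches the symptom in this bug's summary.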
Elad, I would like to point out that this is NOT a regression. This code has existed since OCS 4.2 and has been carried over in every release to date. The issue is not hit in most cases and is visible only in some corner scenarios; that is why neither customers nor QE hit it, even though we qualified all previous releases with the same issue present.

Can you please remove the regression flag from this bug?

Please share if you have any questions on this.
These test cases pass:

https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py
Test cases:
pytest.param(*[constants.CEPHFILESYSTEM, 'delete_pvcs', 'mgr'], polarion_id-OCS-920)
pytest.param(*[constants.CEPHFILESYSTEM, 'delete_pvcs', 'cephfsplugin_provisioner'], polarion_id-OCS-951, bugzilla-1860891)

https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/pv_services/test_resource_deletion_during_pvc_pod_deletion_and_io.py
Test cases:
pytest.param(*[constants.CEPHFILESYSTEM, 'mgr'], polarion_id-OCS-813)
pytest.param(*[constants.CEPHFILESYSTEM, 'cephfsplugin'], polarion_id-OCS-1012)

SetUp:
Provider: VMware
OCP Version: 4.5.0-0.nightly-2020-08-27-074254
OCS Version: 4.5.0-67.ci

Test Process:

1. Create 12 PVCs:
$ oc get pvc -n namespace-test-9531f6e7b986489e9b91d98e06ef2616
NAME                                        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                AGE
pvc-test-0947d77312f6427ca5216cc8ea987e98   Bound    pvc-3750674b-12f6-4b45-8963-649a11b86f55   3Gi        RWO            ocs-storagecluster-cephfs   6m21s
pvc-test-0bf78014db2c4f45bd53f955f0a74434   Bound    pvc-03cfc44e-d9b0-4b75-b9ef-9542527986ad   3Gi        RWO            ocs-storagecluster-cephfs   6m22s
pvc-test-39ff51d48b9c47ce9c528cd398518f22   Bound    pvc-b00c1875-9648-4b1e-b18e-b19564047356   3Gi        RWO            ocs-storagecluster-cephfs   6m19s
pvc-test-67b89b5ae93a47ea88a385ab24986ba1   Bound    pvc-f92bfcdb-6445-4431-b603-6ea72396b4f1   3Gi        RWX            ocs-storagecluster-cephfs   6m14s
pvc-test-77952c0b13da438989711d3b4606db5d   Bound    pvc-39baab3b-f76e-4d1b-acf5-da325c3aba82   3Gi        RWX            ocs-storagecluster-cephfs   6m13s
pvc-test-80ae0cf8ff1e4b87b4ca4660b0de08be   Bound    pvc-1b8e78d3-265c-47b7-bf86-946232f182bb   3Gi        RWX            ocs-storagecluster-cephfs   6m12s
pvc-test-9925bf805ef9497f8180ce6a72ca2936   Bound    pvc-1b88c3a0-a414-4bc3-acc3-e0219378c5c2   3Gi        RWX            ocs-storagecluster-cephfs   6m17s
pvc-test-bc291f3364474e6abb8c157da8afdbff   Bound    pvc-665b77d2-5ea7-49aa-b20d-daf21cc6bbfa   3Gi        RWO            ocs-storagecluster-cephfs   6m20s
pvc-test-bc4c173bae9a4b9bbb21dae77e837b6a   Bound    pvc-a5e65bda-c9ca-4f84-b4bb-0cca3e54d25e   3Gi        RWO            ocs-storagecluster-cephfs   6m18s
pvc-test-bd60536efe304bc1805a140f739fbe0b   Bound    pvc-746a2fe2-bb4f-4b75-86d6-73e08153c66f   3Gi        RWO            ocs-storagecluster-cephfs   6m23s
pvc-test-c5fb4bccdbdd400e8860d9ffc1fb7fb7   Bound    pvc-bd570231-a32f-4ce9-8bfb-9fe9e5ee1f96   3Gi        RWX            ocs-storagecluster-cephfs   6m16s
pvc-test-ddb63817a30141f989dc75de2c0db261   Bound    pvc-e4fcfc66-a20e-4b74-91b3-ecf0696f29cd   3Gi        RWX            ocs-storagecluster-cephfs   6m15s

2. Create 18 pods [image: nginx]:
$ oc get pods -n namespace-test-9531f6e7b986489e9b91d98e06ef2616
NAME                                               READY   STATUS    RESTARTS   AGE
pod-test-cephfs-1889c2ba8a2b4520bf36fc0b8fc99766   1/1     Running   0          7m33s
pod-test-cephfs-1eb293004907409a841c5f0a605c3014   1/1     Running   0          4m30s
pod-test-cephfs-287dd71e19334f50a17890b144a9b66f   1/1     Running   0          5m40s
pod-test-cephfs-31afce50e02f4995a0fe6933ad6190d8   1/1     Running   0          8m21s
pod-test-cephfs-3c6bece91a884518a7414cbc2f3d885c   1/1     Running   0          7m52s
pod-test-cephfs-3d1e81c26d4249dca591d4e520e77ada   1/1     Running   0          6m28s
pod-test-cephfs-4868ae1c97d140a69f7f06ba67e08579   1/1     Running   0          9m14s
pod-test-cephfs-5264688cb5024840a34dd8816b084009   1/1     Running   0          4m48s
pod-test-cephfs-7422ebfecb03482c8e4566d9ae785f8d   1/1     Running   0          7m19s
pod-test-cephfs-8763d7509fbd4ede9c7c6248504b2b33   1/1     Running   0          9m33s
pod-test-cephfs-88e9425699b7461eb3edf1e6f5ab6a40   1/1     Running   0          5m2s
pod-test-cephfs-90d3f55bd60140d5a5b2ad496d1148d4   1/1     Running   0          6m14s
pod-test-cephfs-90eb64578702403f93b83ccc5181ed7b   1/1     Running   0          7m
pod-test-cephfs-a4c8ca74d4ef497aa4bedef2e3e022f7   1/1     Running   0          5m59s
pod-test-cephfs-c417cd7f54734439af22aeb79a55e298   1/1     Running   0          3m50s
pod-test-cephfs-e3af369d1a224240a7291a8dbd69c2a3   1/1     Running   0          8m40s
pod-test-cephfs-f21f92c6f74b4ba1848e1d315e43c735   1/1     Running   0          5m16s
pod-test-cephfs-f6866c1e18dc49019a424991dc3d719d   1/1     Running   0          4m9s

3. Install FIO on all pods:
oc -n namespace-test-9531f6e7b986489e9b91d98e06ef2616 rsh pod-test-cephfs-3c6bece91a884518a7414cbc2f3d885c
* which apt-get
* apt-get update
* apt-get -y install fio

4. Run the FIO command on all pods:
oc -n namespace-test-9531f6e7b986489e9b91d98e06ef2616 rsh pod-test-cephfs-8763d7509fbd4ede9c7c6248504b2b33 fio --name=fio-rand-readwrite --filename=/var/lib/www/html/pod-test-cephfs-8763d7509fbd4ede9c7c6248504b2b33_io --readwrite=randrw --bs=4K --direct=1 --numjobs=1 --time_based=1 --runtime=60 --size=2G --iodepth=4 --invalidate=1 --fsync_on_close=1 --rwmixread=75 --ioengine=libaio --rate=1m --rate_process=poisson --output-format=json

5. Delete all 18 pods:
oc -n namespace-test-9531f6e7b986489e9b91d98e06ef2616 delete Pod pod-test-cephfs-8763d7509fbd4ede9c7c6248504b2b33 --wait=false

6. Check pod status:
oc -n namespace-test-9531f6e7b986489e9b91d98e06ef2616 get Pod pod-test-cephfs-4868ae1c97d140a69f7f06ba67e08579

7. Verify that the mount points are removed from the nodes after deleting the pods:
Starting pod/compute-0-debug ...
To use host binaries, run `chroot /host`
17:47:15 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc debug nodes/compute-2 -- df
17:47:26 - MainThread - ocs_ci.utility.utils - WARNING - Command stderr: Starting pod/compute-2-debug ...
To use host binaries, run `chroot /host`
Removing debug pod ...

8. Fetch the image UUID associated with each PVC:
oc get pv/pvc-746a2fe2-bb4f-4b75-86d6-73e08153c66f -o jsonpath='{.spec.csi.volumeHandle}'
oc -n namespace-test-9531f6e7b986489e9b91d98e06ef2616 get PersistentVolumeClaim -o yaml

9. Delete PVCs:
oc -n namespace-test-9531f6e7b986489e9b91d98e06ef2616 delete PersistentVolumeClaim pvc-test-bd60536efe304bc1805a140f739fbe0b

10. Delete the rook-ceph-mgr-a pod [in parallel with step 9]:
oc -n openshift-storage delete Pod rook-ceph-mgr-a-6944ff7f79-8cxc7 --grace-period=0 --force

11. Check PVC in Bound state:
oc get PersistentVolume pvc-1b8e78d3-265c-47b7-bf86-946232f182bb -o yaml

12. Go to the Ceph tools pod and run the "ceph fs" command:
oc -n openshift-storage rsh rook-ceph-tools-5b9cbc586c-m9whq
ceph fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-4beef4d8-e86f-11ea-9e5a-0a580a80020a csi --format json

13. Check Ceph health:
Ceph cluster health is HEALTH_OK.

This test case also passes:
https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/pv_services/test_daemon_kill_during_pvc_pod_creation_and_io.py
pytest.param(*[constants.CEPHFILESYSTEM, 'mgr'], polarion_id-OCS-1108)

SetUp:
Provider: VMware
OCP Version: 4.5.0-0.nightly-2020-08-27-074254
OCS Version: 4.5.0-67.ci
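For anyone reproducing this verification outside ocs-ci, the essential pass/fail criterion in the runs above is that each PV backing a deleted PVC leaves the Released state and disappears within a bounded time. A minimal standalone sketch of that check, assuming an `oc` client configured against the cluster (the helper name, timeout, and polling interval are illustrative, not ocs-ci's actual API):

```python
import subprocess
import time

def wait_for_pv_delete(pv_name: str, timeout: int = 180, interval: int = 10) -> None:
    """Poll until the PV is gone; raise TimeoutError with the describe
    output (as the failed ocs-ci run above did) if it still exists."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = subprocess.run(
            ["oc", "get", "pv", pv_name],
            capture_output=True, text=True)
        if result.returncode != 0 and "not found" in result.stderr.lower():
            return  # PV deleted -- reclaim completed successfully
        time.sleep(interval)
    describe = subprocess.run(
        ["oc", "describe", "pv", pv_name],
        capture_output=True, text=True).stdout
    raise TimeoutError(
        f"Timeout when waiting for {pv_name} to delete. "
        f"Describe output: {describe}")

# Example, using the PV left behind in the failed MDS-kill run above:
# wait_for_pv_delete("pvc-927f7f09-d36c-400b-b6a4-d31b5b848d4c")
```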
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days