Bug 1954030

Summary: [Tracker for Ceph BZ #1968325] AWS | reclaim capacity after snapshot deletion is very slow
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Avi Liani <alayani>
Component: ceph
Assignee: Patrick Donnelly <pdonnell>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Elad <ebenahar>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.7
CC: bniver, kramdoss, madam, muagarwa, ocs-bugs, odf-bz-bot, pdonnell
Target Milestone: ---
Keywords: Automation, Performance
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1968325
Environment:
Last Closed: 2022-03-10 01:55:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Bug Depends On:    
Bug Blocks: 1968325    

Description Avi Liani 2021-04-27 13:24:27 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

I created one PVC of 14 GiB on a CephFS volume and filled it with ~10 GiB of data.
I then took 100 snapshots of this PVC, rewriting all of the data after each snapshot, so the total amount of data written to the storage was ~3.3 TiB.
At the end of the test I deleted all snapshots, the PVC, and the project they were created in.
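For reference, a rough sketch of the workload in shell (illustrative only - the real logic lives in the ocs-ci test; the namespace, pod name, mount path and snapshot manifest used here are assumptions):

    NS=snap-test
    for i in $(seq 1 100); do
        # rewrite ~10 GiB of data inside the workload pod (volume assumed mounted at /mnt/data)
        oc -n "$NS" exec fio-pod -- \
            dd if=/dev/urandom of=/mnt/data/testfile bs=4M count=2560 conv=fsync
        # snapshot.yaml: a VolumeSnapshot referencing the CephFS PVC and the
        # default ocs-storagecluster-cephfsplugin-snapclass VolumeSnapshotClass
        sed "s/NAME/snap-$i/" snapshot.yaml | oc -n "$NS" create -f -
    done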

Watching `rados df` in the rook-ceph-toolbox pod confirmed that the data is actually being deleted from the backend, but very, very slowly - less than 1M per minute.
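One way to watch the reclaim rate from the toolbox pod is a loop like the one below (the pool name is the OCS 4.x default for the CephFS data pool and may need adjusting):

    TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n 1)
    oc -n openshift-storage rsh "$TOOLS" sh -c \
        'while true; do rados df -p ocs-storagecluster-cephfilesystem-data0; sleep 60; done'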

The cluster is on AWS with m5.4xlarge worker nodes and 2 TiB OSDs.

Version of all relevant components (if applicable):

Driver versions
================

        OCP versions
        ==============

                clientVersion:
                  buildDate: "2021-04-09T04:34:49Z"
                  compiler: gc
                  gitCommit: 2513fdbb36e2ddf13bc0b17460151c03eb3a3547
                  gitTreeState: clean
                  gitVersion: 4.7.0-202104090228.p0-2513fdb
                  goVersion: go1.15.7
                  major: ""
                  minor: ""
                  platform: linux/amd64
                openshiftVersion: 4.7.6
                releaseClientVersion: 4.7.7
                serverVersion:
                  buildDate: "2021-03-14T16:01:39Z"
                  compiler: gc
                  gitCommit: bafe72fb05eddc8246040b9945ec242b9f805935
                  gitTreeState: clean
                  gitVersion: v1.20.0+bafe72f
                  goVersion: go1.15.7
                  major: "1"
                  minor: "20"
                  platform: linux/amd64
                
                
                Cluster version:

                NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
                version   4.7.6     True        False         4h16m   Cluster version is 4.7.6
                
        OCS versions
        ==============

                NAME                         DISPLAY                       VERSION        REPLACES   PHASE
                ocs-operator.v4.7.0-360.ci   OpenShift Container Storage   4.7.0-360.ci              Succeeded
                
        Rook versions
        ===============

                rook: 4.7-133.80f8b1112.release_4.7
                go: go1.15.7
                
        Ceph versions
        ===============

                ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

no

Is there any workaround available to the best of your knowledge?

Not that I am aware of.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3

Is this issue reproducible?

yes

Can this issue be reproduced from the UI?

no

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy OCS 4.7.
2. Run the ocs-ci test tests/e2e/performance/test_pvc_multi_snapshot_performance.py::TestPvcMultiSnapshotPerformance::test_pvc_multiple_snapshot_performance[CephFS] - this takes ~3.5 hours (see the sketch below).
3. Watch the reclaim process from the UI or the CLI.
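A sketch of step 2, assuming an already-configured ocs-ci checkout (the config file and cluster directory paths are placeholders):

    run-ci "tests/e2e/performance/test_pvc_multi_snapshot_performance.py::TestPvcMultiSnapshotPerformance::test_pvc_multiple_snapshot_performance[CephFS]" \
        --ocsci-conf conf/my-cluster.yaml --cluster-path ~/my-cluster-dir

While the test tears down, the `rados df` loop from the description can be used to track the reclaim rate from the CLI.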


Actual results:

Capacity reclamation is very slow.
 
Expected results:

The capacity is reclaimed quickly after the snapshots and PVC are deleted.

Additional info:

All must-gather info will be uploaded.
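For reference, OCS must-gather is typically collected with something like the following (the image tag is an assumption for a 4.7 cluster):

    oc adm must-gather --image=registry.redhat.io/ocs4/ocs-must-gather-rhel8:v4.7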

Comment 9 Mudit Agarwal 2021-06-10 06:25:34 UTC
Hi Patrick, 
This is marked as a blocker for OCS 4.8; do we need to fix this in 4.8?

Comment 11 Mudit Agarwal 2021-06-11 03:06:46 UTC
Thanks Patrick, moving it out of 4.9
Will create a ceph clone if required.

Comment 15 Mudit Agarwal 2021-09-08 10:39:29 UTC
BZ #1968325 is not planned for 5.0z1

Comment 17 Mudit Agarwal 2022-01-26 11:37:55 UTC
BZ #1968325 is targeted for RHCS 5.2