Bug 1954030 - [Tracker for Ceph BZ #1968325] AWS | reclaim capacity after snapshot deletion is very slow
Summary: [Tracker for Ceph BZ #1968325] AWS | reclaim capacity after snapshot deletion is very slow
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Patrick Donnelly
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks: 1968325
 
Reported: 2021-04-27 13:24 UTC by Avi Liani
Modified: 2023-08-09 16:37 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1968325 (view as bug list)
Environment:
Last Closed: 2022-03-10 01:55:21 UTC
Embargoed:



Description Avi Liani 2021-04-27 13:24:27 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

I created one PVC of 14 GiB on a CephFS volume and filled it up with data (~10 GiB).
I took 100 snapshots of this PVC, rewriting all of the data after each snapshot - total data written to the storage was ~3.3 TiB.
At the end of the test, I deleted all snapshots, the PVC and the project they were created in.
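To make the scenario concrete, here is a minimal sketch of the snapshot-and-rewrite cycle; the namespace, PVC name, writer pod and snapshot class below are illustrative assumptions, not taken from the actual ocs-ci test code:

for i in $(seq 1 100); do
    # take snapshot number $i of the CephFS PVC (hypothetical names)
    cat <<EOF | oc apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pvc-snap-${i}
  namespace: test-project
spec:
  volumeSnapshotClassName: ocs-storagecluster-cephfsplugin-snapclass
  source:
    persistentVolumeClaimName: test-pvc
EOF
    # rewrite the ~10 GiB data set on the mounted volume after every snapshot
    oc -n test-project exec writer-pod -- dd if=/dev/urandom of=/mnt/data/testfile bs=1M count=10240 conv=fsync
done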

Watching the `rados df` output in the rook-ceph-toolbox pod confirmed that the data is actually being deleted from the backend, but very slowly - less than 1M/min.
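For reference, a rough sketch of how the reclaim rate can be watched from the CLI; the toolbox label selector and the pool-name filter are assumptions based on a default openshift-storage deployment:

# find the toolbox pod and sample the CephFS data pool usage once a minute;
# the difference between samples is the reclaim rate
TOOLBOX=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n 1)
while true; do
    date
    oc -n openshift-storage exec "${TOOLBOX}" -- rados df | grep cephfilesystem
    sleep 60
done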

The cluster is on AWS with the m5.4xlarge worker type and a 2 TiB OSD size.

Version of all relevant components (if applicable):

Driver versions
================

        OCP versions
        ==============

                clientVersion:
                  buildDate: "2021-04-09T04:34:49Z"
                  compiler: gc
                  gitCommit: 2513fdbb36e2ddf13bc0b17460151c03eb3a3547
                  gitTreeState: clean
                  gitVersion: 4.7.0-202104090228.p0-2513fdb
                  goVersion: go1.15.7
                  major: ""
                  minor: ""
                  platform: linux/amd64
                openshiftVersion: 4.7.6
                releaseClientVersion: 4.7.7
                serverVersion:
                  buildDate: "2021-03-14T16:01:39Z"
                  compiler: gc
                  gitCommit: bafe72fb05eddc8246040b9945ec242b9f805935
                  gitTreeState: clean
                  gitVersion: v1.20.0+bafe72f
                  goVersion: go1.15.7
                  major: "1"
                  minor: "20"
                  platform: linux/amd64
                
                
                Cluster version:

                NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
                version   4.7.6     True        False         4h16m   Cluster version is 4.7.6
                
        OCS versions
        ==============

                NAME                         DISPLAY                       VERSION        REPLACES   PHASE
                ocs-operator.v4.7.0-360.ci   OpenShift Container Storage   4.7.0-360.ci              Succeeded
                
        Rook versions
        ===============

                rook: 4.7-133.80f8b1112.release_4.7
                go: go1.15.7
                
        Ceph versions
        ===============

                ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

no

Is there any workaround available to the best of your knowledge?

Not that I am aware of.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

3

Is this issue reproducible?

yes

Can this issue be reproduced from the UI?

no

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy OCS 4.7.
2. Run the ocs-ci test: tests/e2e/performance/test_pvc_multi_snapshot_performance.py::TestPvcMultiSnapshotPerformance::test_pvc_multiple_snapshot_performance[CephFS] - this will take ~3.5 hours.
3. Watch the reclaim process from the UI or CLI (see the CLI sketch after this list).
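For steps 2 and 3, a hedged CLI sketch; the run-ci wrapper, its --cluster-path option and the toolbox label are assumed from a standard ocs-ci / openshift-storage setup, so adjust paths and names to your environment:

# step 2: run the multi-snapshot performance test from an ocs-ci checkout (~3.5 hours)
run-ci --cluster-path /path/to/cluster-dir \
    "tests/e2e/performance/test_pvc_multi_snapshot_performance.py::TestPvcMultiSnapshotPerformance::test_pvc_multiple_snapshot_performance[CephFS]"

# step 3: after the test deletes the snapshots and the PVC, watch the used capacity drop
TOOLBOX=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n 1)
oc -n openshift-storage exec "${TOOLBOX}" -- ceph df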


Actual results:

The capacity reclaim is very slow.
 
Expected results:

The capacity should be reclaimed quickly.

Additional info:

All must-gather info will be uploaded.

Comment 9 Mudit Agarwal 2021-06-10 06:25:34 UTC
Hi Patrick, 
This is marked as a blocker for OCS 4.8; do we need to fix this in 4.8?

Comment 11 Mudit Agarwal 2021-06-11 03:06:46 UTC
Thanks Patrick, moving it out of 4.9
Will create a ceph clone if required.

Comment 15 Mudit Agarwal 2021-09-08 10:39:29 UTC
BZ #1968325 is not planned for 5.0z1

Comment 17 Mudit Agarwal 2022-01-26 11:37:55 UTC
BZ #1968325 is targeted for RHCS 5.2

