Bug 2101874 - CephFS | PVC Clone with a lot of files takes hours to Bound
Summary: CephFS | PVC Clone with a lot of files takes hours to Bound
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Venky Shankar
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-28 15:17 UTC by Avi Liani
Modified: 2023-08-09 16:37 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-26 03:36:50 UTC
Embargoed:


Attachments
Small file test results graphed (17.44 KB, image/png)
2022-08-11 23:56 UTC, JC Lopez
Big files test results graphed (17.93 KB, image/png)
2022-08-11 23:56 UTC, JC Lopez

Description Avi Liani 2022-06-28 15:17:21 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

When creating a clone of a CephFS PVC that contains a lot of small files - more than 800 K files (~10 GiB total) -
it takes a very long time for the clone to reach the Bound state: ~1.5 hours (~1.8 MiB/s).



Version of all relevant components (if applicable):

        OCP versions
        ==============

                clientVersion:
                  buildDate: "2022-01-11T00:24:19Z"
                  compiler: gc
                  gitCommit: 293b52e3eb2e15df6312ea3541c1f8f397cfb24f
                  gitTreeState: clean
                  gitVersion: 4.10.0-202201102355.p0.g293b52e.assembly.stream-293b52e
                  goVersion: go1.17.2
                  major: ""
                  minor: ""
                  platform: linux/amd64
                openshiftVersion: 4.11.0-0.nightly-2022-06-25-081133
                releaseClientVersion: 4.10.0-0.nightly-2022-01-11-065245
                serverVersion:
                  buildDate: "2022-06-24T05:13:15Z"
                  compiler: gc
                  gitCommit: 0a57f1f59bda75ea2cf13d9f3b4ac5d202134f2d
                  gitTreeState: clean
                  gitVersion: v1.24.0+9ddc8b1
                  goVersion: go1.18.1
                  major: "1"
                  minor: "24"
                  platform: linux/amd64
                
                
                Cluster version:

                NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
                version   4.11.0-0.nightly-2022-06-25-081133   True        False         87m     Cluster version is 4.11.0-0.nightly-2022-06-25-081133
                
        OCS versions
        ==============

                NAME                              DISPLAY                       VERSION   REPLACES   PHASE
                mcg-operator.v4.11.0              NooBaa Operator               4.11.0               Succeeded
                ocs-operator.v4.11.0              OpenShift Container Storage   4.11.0               Succeeded
                odf-csi-addons-operator.v4.11.0   CSI Addons                    4.11.0               Succeeded
                odf-operator.v4.11.0              OpenShift Data Foundation     4.11.0               Succeeded
                
                ODF (OCS) build :                     full_version: 4.11.0-105
                
        Rook versions
        ===============

                rook: v4.11.0-0.3cf4f59e15f5dbbaccede303940e26e06ef4aec1
                go: go1.17.10
                
        Ceph versions
        ===============

                ceph version 16.2.8-59.el8cp (4e10c5fa8a9bc0a421a4dd0833f951ab1cfdcfa7) pacific (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?

Volume cloning is extremely slow.

Is there any workaround available to the best of your knowledge?

Not that I know of.


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

1


Is this issue reproducible?

All the time.

Can this issue be reproduced from the UI?

yes

If this is a regression, please provide more details to justify this:

This is a new test; I ran it on 4.10 as well with the same results.


Steps to Reproduce:

On OCP & ODF cluster :

1. Create a new CephFS PVC (more than 10 GiB) and attach it to a pod.
2. On the pod, extract the Linux kernel source 13 times (this yields > 800 K files, ~9.5 GiB) - takes ~7 minutes.
   This can be done by:
     a. using the image quay.io/ocsci/perf:latest
     b. mounting the PVC in the pod on /mnt
     c. running the command /opt/multiple_files.sh 13 /mnt inside the pod
     d. waiting for the pod to reach the Completed state
3. Take a clone of the PVC - can be done from the UI, or with a manifest like the sketch below.
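
For reference, the same clone can also be requested from the CLI. A minimal sketch, reusing the names from this report (namespace pas-test-namespace, source PVC pvc-pas-test, storage class ocs-storagecluster-cephfs); the size and access mode simply mirror the original PVC:

cat <<EOF | oc -n pas-test-namespace apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clone-pas-test-1
spec:
  storageClassName: ocs-storagecluster-cephfs
  # a dataSource of kind PersistentVolumeClaim triggers a CSI volume clone
  dataSource:
    kind: PersistentVolumeClaim
    name: pvc-pas-test
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
EOF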


Actual results:
The clone takes a very long time - ~1.5 hours.

Expected results:
The clone should take no more than a few minutes - certainly not more than the time it took to initially write the data (~7 minutes).


Additional info:

The output of `oc describe pvc` after the clone got into Bound state is:

Name:          clone-pas-test-1
Namespace:     pas-test-namespace
StorageClass:  ocs-storagecluster-cephfs
Status:        Bound
Volume:        pvc-4f91e953-7d67-40ad-9d56-afe836170aaf
Labels:        <none>
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: openshift-storage.cephfs.csi.ceph.com
               volume.kubernetes.io/storage-provisioner: openshift-storage.cephfs.csi.ceph.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      100Gi
Access Modes:  RWX
VolumeMode:    Filesystem
DataSource:
  Kind:   PersistentVolumeClaim
  Name:   pvc-pas-test
Used By:  <none>
Events:
  Type     Reason                Age                    From                                                                                                                      Message
  ----     ------                ----                   ----                                                                                                                      -------
  Warning  ProvisioningFailed    88m                    openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-8549bdb7d6-g8gl5_9259b5ba-75ce-48d4-9505-a61efd468227  failed to provision volume with StorageClass "ocs-storagecluster-cephfs": rpc error: code = Aborted desc = clone from snapshot is pending
  Warning  ProvisioningFailed    70m (x13 over 88m)     openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-8549bdb7d6-g8gl5_9259b5ba-75ce-48d4-9505-a61efd468227  failed to provision volume with StorageClass "ocs-storagecluster-cephfs": rpc error: code = Aborted desc = clone from snapshot is already in progress
  Normal   ExternalProvisioning  3m37s (x351 over 88m)  persistentvolume-controller                                                                                               waiting for a volume to be created, either by external provisioner "openshift-storage.cephfs.csi.ceph.com" or manually created by system administrator
  Normal   Provisioning          3s (x32 over 88m)      openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-8549bdb7d6-g8gl5_9259b5ba-75ce-48d4-9505-a61efd468227  External provisioner is provisioning volume for claim "pas-test-namespace/clone-pas-test-1"

and the output of `oc -n openshift-storage logs csi-cephfsplugin-provisioner-8549bdb7d6-g8gl5 -c csi-provisioner --since-time=2022-06-28T13:01:02Z | grep clone-pas-test-1 | grep started` is:
I0628 13:01:03.294650       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:01:03.900756       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:01:04.932317       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:01:06.958651       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:01:14.901555       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:01:22.923662       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:01:39.098629       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:02:11.120525       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:03:15.143955       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:05:23.166780       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:09:39.215652       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:14:39.259017       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:14:49.828064       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:19:39.308110       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:24:39.361502       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:29:39.404231       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:29:49.828838       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:34:39.445224       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:39:39.486306       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:44:39.527288       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:44:49.829368       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:49:39.566177       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:54:39.608320       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:59:39.647876       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 13:59:49.829705       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 14:04:39.691295       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 14:09:39.734306       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 14:14:39.774401       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 14:14:49.830236       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 14:19:39.824495       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 14:24:39.869367       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started
I0628 14:29:39.909425       1 controller.go:1337] provision "pas-test-namespace/clone-pas-test-1" class "ocs-storagecluster-cephfs": started


As you can see, it appears 32 times!
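
For reference, the retry count can be read directly from the same log by counting the matches instead of printing them:

oc -n openshift-storage logs csi-cephfsplugin-provisioner-8549bdb7d6-g8gl5 -c csi-provisioner --since-time=2022-06-28T13:01:02Z | grep clone-pas-test-1 | grep -c started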

Comment 3 Avi Liani 2022-06-28 16:16:54 UTC
I ran the same test on the same hardware (VMware LSO) with OCP 4.11 & ODF 4.10 and got the same results.
The test was also run on AWS with the same results.

Comment 4 JC Lopez 2022-08-11 23:53:29 UTC
Hi,

I ran a few tests to evaluate the impact of the volume size and the number of files when restoring a snapshot to a new PVC.

Test environment
- 3 worker nodes
- Storage backend is external array
- Platform VMware
- OCP 4.8.24
- ODF 4.8.14 (6 * 2TB OSDs with default configuration)

Results are expressed this way:
- Files (#) column is the number of files created
- Restore column is the time it took to restore the snapshot to a new PVC, in seconds
- Create column is the time it took to create the original dataset, in seconds

Here is what I got, in case it helps evaluate the time it takes.
Small 1 KB files:
Files (#)	Restore (s)	Create (s)
1000		14		4
5000		63		21
10000		128		39
50000		511		201
100000		927		428
200000		1684		852
400000		3512		1723
800000		7950		4117

Large 12 MB files:
Files (#)	Restore (s)	Create (s)
1000		247		253
5000		698		1227
10000		1713		2454
50000		7759		12653
100000		50422		27218
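
(For comparing the two columns, a throwaway awk one-liner can print the restore-to-create ratio per row; results.txt is just a hypothetical file holding one of the tables above, header line included.)

awk 'NR > 1 { printf "%s files: restore/create = %.1f\n", $1, $2/$3 }' results.txt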

Best regards
JC

Comment 5 JC Lopez 2022-08-11 23:56:12 UTC
Created attachment 1905055 [details]
Small file test results graphed

Comment 6 JC Lopez 2022-08-11 23:56:38 UTC
Created attachment 1905056 [details]
Big files test results graphed

Comment 7 Avi Liani 2022-08-24 05:55:52 UTC
(In reply to JC Lopez from comment #4)
> [test environment and results quoted in full in comment #4]

And are these results acceptable?
@jeanchlopez

Comment 8 JC Lopez 2022-09-01 02:56:50 UTC
Hi Avi

No, it is not. Spectrum Protect Plus times out mounting the volume.

We adjusted the timeout, but we are already at 45 minutes, which restricts the frequency of backups to an outside object store for DR purposes.

Best regards
JC

Comment 9 Venky Shankar 2022-09-01 04:44:22 UTC
Hi Avi,

CephFS PVC clones are slow for huge data sets because the current implementation does a full copy of the entire data set. This constraint is highlighted in the upstream documentation here:

        https://docs.ceph.com/en/quincy/cephfs/fs-volumes/#cloning-snapshots

and I presume also in the downstream documentation. Lots and lots of small files just make this even worse.

There are plans for the future to improve/mitigate this, but that is pretty much long term.
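
In the meantime, the backend copy can at least be observed while it runs. A minimal sketch from the rook-ceph toolbox, assuming the ceph-csi defaults used by ODF internal mode (filesystem ocs-storagecluster-cephfilesystem, subvolume group csi); <csi-vol-uuid> is a placeholder for the clone's subvolume name:

# list subvolumes; the in-flight clone shows up as a csi-vol-<uuid> entry
ceph fs subvolume ls ocs-storagecluster-cephfilesystem --group_name csi
# reports the clone state (in-progress / complete / failed)
ceph fs clone status ocs-storagecluster-cephfilesystem <csi-vol-uuid> --group_name csi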

Cheers,
Venky

