Description of problem (please be as detailed as possible and provide log snippets):

Creating many DataVolumes in parallel with the same source, which uses snapshot/smart-cloning, causes some DVs to hang forever because snapshots keep failing:

default 149m Warning SnapshotCreationFailed volumesnapshotcontent/snapcontent-ed26240f-c095-4e71-842f-a8d316e70d3c Failed to create snapshot: failed to take snapshot of the volume 0001-0011-openshift-storage-0000000000000001-7fd8f132-d5f6-11eb-ab7c-0a580a80084b: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
default 149m Warning SnapshotCreationFailed volumesnapshotcontent/snapcontent-6b866f92-bb0f-41d5-ae36-9062d64c0d9a Failed to create snapshot: failed to take snapshot of the volume 0001-0011-openshift-storage-0000000000000001-7fd8f132-d5f6-11eb-ab7c-0a580a80084b: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
[...]
14m Warning SnapshotContentCheckandUpdateFailed volumesnapshotcontent/snapcontent-47bb490d-6152-4ec5-abf5-4cd6ec477041 Failed to check and update snapshot content: failed to take snapshot of the volume 0001-0011-openshift-storage-0000000000000001-7fd8f132-d5f6-11eb-ab7c-0a580a80084b: "rpc error: code = Aborted desc = an operation with the given Volume ID snapshot-47bb490d-6152-4ec5-abf5-4cd6ec477041 already exists"
14m Warning SnapshotContentCheckandUpdateFailed volumesnapshotcontent/snapcontent-59bf4e5b-043a-4527-91c5-07c242e50461 Failed to check and update snapshot content: failed to take snapshot of the volume 0001-0011-openshift-storage-0000000000000001-7fd8f132-d5f6-11eb-ab7c-0a580a80084b: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"

Version of all relevant components (if applicable):
ocs-operator.v4.8.0-427.ci
  using primaryAffinity of "0.0" for a small set of OSDs (BZ1924946), multus, compression, replica=2
CNV v2.6.5
Client Version: 4.7.9
Server Version: 4.7.9
Kubernetes Version: v1.20.0+7d0a2b2

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, it is limiting how many VMs we can create for scale testing.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes; we also saw this on a previous build (BZ1972264).

Can this issue be reproduced from the UI?
Probably

If this is a regression, please provide more details to justify this:
In previous scale testing we have not seen DVs fail with snapshot errors.

Steps to Reproduce:
1. Create hundreds of DVs using the same source so they use snapshot/smart-clone, e.g.:

apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: win10-1-disk
spec:
  source:
    pvc:
      namespace: "default"
      name: "win10-dv-source"
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 40Gi
    volumeMode: Block
    storageClassName: ocs-storagecluster-ceph-rbd

2. Wait for all DVs to reach Succeeded state.
3. Find that many DVs are stuck in SnapshotForSmartCloneInProgress, for multiple days, even while the cluster is otherwise idle.

We have plenty of space:

sh-4.4# ceph df detail
RAW STORAGE:
    CLASS     SIZE       AVAIL      USED        RAW USED     %RAW USED
    ssd       77 TiB     71 TiB     6.0 TiB     6.1 TiB      7.97
    TOTAL     77 TiB     71 TiB     6.0 TiB     6.1 TiB      7.97

POOLS:
    POOL                                         ID     STORED      OBJECTS     USED        %USED     MAX AVAIL     QUOTA OBJECTS     QUOTA BYTES     DIRTY     USED COMPR     UNDER COMPR
    ocs-storagecluster-cephblockpool             1      8.7 TiB     2.29M       5.9 TiB     9.21      29 TiB        N/A               N/A             2.29M     2.8 TiB        11 TiB
    ocs-storagecluster-cephfilesystem-metadata   2      3.0 KiB     22          84 KiB      0         29 TiB        N/A               N/A             22        0 B            0 B
    ocs-storagecluster-cephfilesystem-data0      3      0 B         0           0 B         0         29 TiB        N/A               N/A             0         0 B            0 B

Actual results:
DVs never reach Succeeded state; they continue to fail over and over.

Expected results:
All DVs should finish cloning as long as space is available.

Additional info:
It is not clear yet whether primaryAffinity is related to this issue.
Note the PVCs for these DVs have not been allocated yet. Ceph status has been reporting HEALTH_OK the whole time, but I did notice messages like this, if it's relevant:

Flattening image ocs-storagecluster-cephblockpool/csi-snapXXXX

More event logs:

default 146m Normal Provisioning persistentvolumeclaim/win10-593-disk External provisioner is provisioning volume for claim "default/win10-593-disk"
default 146m Normal Bound datavolume/win10-592-disk PVC win10-592-disk Bound
default 146m Normal SmartClonePVCInProgress datavolume/win10-593-disk Creating PVC for smart-clone is in progress (for pvc default/win10-dv-source)
default 146m Normal CloneSucceeded datavolume/win10-593-disk Successfully cloned from default/win10-dv-source into default/win10-593-disk
default 146m Normal ProvisioningSucceeded persistentvolumeclaim/win10-593-disk Successfully provisioned volume pvc-71e6fb7c-1f9f-47b5-924d-0fd2d88a41dc
default 146m Normal Bound datavolume/win10-593-disk PVC win10-593-disk Bound
default 146m Warning SnapshotContentCheckandUpdateFailed volumesnapshotcontent Failed to check and update snapshot content: failed to take snapshot of the volume 0001-0011-openshift-storage-0000000000000001-7fd8f132-d5f6-11eb-ab7c-0a580a80084b: "rpc error: code = Internal desc = RBD image not found"
default 144m Warning SnapshotCreationFailed volumesnapshotcontent/snapcontent-2e3517b6-4dd4-448c-b178-7734bb642fc3 Failed to create snapshot: failed to take snapshot of the volume 0001-0011-openshift-storage-0000000000000001-7fd8f132-d5f6-11eb-ab7c-0a580a80084b: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
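For reference, the rounds of clone DVs in the steps above can be generated from a single template; this is a minimal sketch of how I create them, where the name prefix, namespace, and count are illustrative and not the exact tooling used in the test:

```shell
#!/usr/bin/env bash
# Sketch: generate N smart-clone DataVolume manifests that all point at the
# same source PVC (prefix/namespace/count are assumptions for illustration).
generate_clone_dvs() {
  local count="$1" prefix="${2:-win10}" ns="${3:-default}"
  for i in $(seq 1 "$count"); do
    cat <<EOF
---
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: ${prefix}-${i}-disk
  namespace: ${ns}
spec:
  source:
    pvc:
      namespace: "${ns}"
      name: "${prefix}-dv-source"
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 40Gi
    volumeMode: Block
    storageClassName: ocs-storagecluster-ceph-rbd
EOF
  done
}

# One round of 200 parallel clones could then be applied with:
#   generate_clone_dvs 200 | oc apply -f -
```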
I tried deleting the hung DVs and their volumesnapshots (after patching out the finalizers); reapplying one of the DVs did succeed:

default 4m13s Normal CreatingSnapshot volumesnapshot/win10-459-disk Waiting for a snapshot default/win10-459-disk to be created by the CSI driver.
default 4m13s Normal SnapshotForSmartCloneInProgress datavolume/win10-459-disk Creating snapshot for smart-clone is in progress (for pvc default/win10-dv-source)
openshift-cnv 2m12s Normal ReconcileHCO clusterserviceversion/kubevirt-hyperconverged-operator.v2.6.5 HCO Reconcile completed successfully
openshift-cnv 2m12s Normal ReconcileHCO pod/hco-operator-7b8c96b8b7-jdtnf HCO Reconcile completed successfully
default 2m12s Normal Modified networkaddonsconfig/cluster Config spec was modified
openshift-cnv 2m12s Normal ReconcileHCO hyperconverged/kubevirt-hyperconverged HCO Reconcile completed successfully
default 112s Normal CloneSucceeded datavolume/win10-459-disk Successfully cloned from default/win10-dv-source into default/win10-459-disk
default 112s Normal Provisioning persistentvolumeclaim/win10-459-disk External provisioner is provisioning volume for claim "default/win10-459-disk"
default 112s Normal ProvisioningSucceeded persistentvolumeclaim/win10-459-disk Successfully provisioned volume pvc-78916a3d-c623-462b-9ba1-b420518de192
default 112s Normal SmartClonePVCInProgress datavolume/win10-459-disk Creating PVC for smart-clone is in progress (for pvc default/win10-dv-source)
default 112s Normal ExternalProvisioning persistentvolumeclaim/win10-459-disk waiting for a volume to be created, either by external provisioner "openshift-storage.rbd.csi.ceph.com" or manually created by system administrator
default 111s Normal Bound datavolume/win10-459-disk PVC win10-459-disk Bound
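The manual cleanup described above can be sketched as a small script. This is a hedged sketch, not the exact commands used: it assumes the stuck VolumeSnapshot shares the DataVolume's name (as the win10-459-disk events suggest), and KUBECTL is parameterized so it can be dry-run:

```shell
#!/usr/bin/env bash
# Sketch of the manual cleanup: strip finalizers from a stuck VolumeSnapshot,
# delete it along with its DataVolume, then the DV manifest can be re-applied.
# KUBECTL can be overridden for a dry run (e.g. KUBECTL=echo).
KUBECTL="${KUBECTL:-oc}"

cleanup_stuck_dv() {
  local ns="$1" name="$2"
  # The snapshot's finalizers block deletion while it is still "in progress",
  # so clear them first with a merge patch.
  "$KUBECTL" patch volumesnapshot "$name" -n "$ns" --type=merge \
    -p '{"metadata":{"finalizers":null}}'
  "$KUBECTL" delete volumesnapshot "$name" -n "$ns" --ignore-not-found
  "$KUBECTL" delete datavolume "$name" -n "$ns" --ignore-not-found
}

# Example: cleanup_stuck_dv default win10-459-disk
```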
> 14m Warning SnapshotContentCheckandUpdateFailed volumesnapshotcontent/snapcontent-47bb490d-6152-4ec5-abf5-4cd6ec477041 Failed to check and update snapshot content: failed to take snapshot of the volume 0001-0011-openshift-storage-0000000000000001-7fd8f132-d5f6-11eb-ab7c-0a580a80084b: "rpc error: code = Aborted desc = an operation with the given Volume ID snapshot-47bb490d-6152-4ec5-abf5-4cd6ec477041 already exists"

At a glance, this looks like the snapshot controller operations not going through, or getting throttled, in this setup. The description mentions that the following settings are enabled in this cluster:

  using primaryAffinity of "0.0" for a small set of OSDs (BZ1924946), multus, compression, replica=2

One of the above, or some combination of them, looks to be contributing to this behaviour. Also, from a very quick look at the referenced bug, it seems that it is not only snapshot operations: some other operations are also slowed down in this setup. If that is the case, the 'primaryAffinity' setting could be contributing. We have also seen that when `compression` is enabled (I believe the setting here is `aggressive`, please correct me if I am wrong), snapshot operations are delayed coming out of the ceph cluster (backend).

As a first step of isolation, is it possible to test this scenario on the same build but without `primaryAffinity` set? Also, how many DVs are getting provisioned per batch in this test? Meanwhile we will dig into the logs and try to find out if anything can be spotted.
Yeah, it could be a combination of the different settings causing some throttling, and yes, compression is "aggressive" using the default lib. But if snapshots are throttled, the problem is they don't recover even after the cluster sat idle for 2 days. Yes, I would like to eventually try again without primaryAffinity or multus after we complete that testing. In the past we have not seen this issue on OCS 4.6/4.7 without these new features, even with compression enabled.

In this case I deployed 200 DVs in parallel, waited for some time, then deployed two more rounds of 200 DVs in parallel. I expected some amount of throttling due to the amount of work, but around 70 of them just never recovered. The "already exists" messages are interesting: the volumesnapshot object was created but no PVC. Is there any sort of "retry" logic in volumesnapshots?
> In this case I deployed 200 DVs in parallel, waited for some time, then deployed two more rounds of 200 DVs in parallel. I expected some amount of throttling due to the amount of work, but around 70 of them just never recovered.

Thanks for the details, Jenifer. So it is around 600 snapshot requests, and it looks like around 530 got satisfied; the remaining 70 didn't come up.

> The "already exists" messages are interesting, the volumesnapshots object was created but no PVC. Is there any sort of "re-try" logic in volumesnapshots?

The 'already exists' message is actually a temporary state. Ceph CSI takes locks while operating on a volume to make sure we are not executing parallel operations on the same volume while another operation on it is in progress, and reports back to kubernetes/the caller saying "an operation already exists...", so it is a "retry" signal. For example, at the time of clone we check whether there is any ongoing delete/expand operation; at the time of restore we check whether any "delete" operation is ongoing on the same volumeID, etc.

The volumesnapshotcontent's existence is also a very early stage of the snapshot transaction in the workflow: the volumesnapshotcontent comes first from the snapshot controller, then the actual snapshot creation gets started in the CSI driver. Once creation is successful, the "ready to use" field in the volumesnapshotcontent gets changed to "true", along with other fields. In short, if the operation is retried and finally reaches a "failed to create" state, the volumesnapshotcontent etc. has to be cleared properly, which also includes finalizer removal as part of the process. You actually did that manually later to get it cleared. Also, there is retry logic/exponential backoff, with a timeout, for each operation in the snapshot controller.
I am aware of some issues in the snapshot controller where it does not do proper updates on these objects at different stages in some scenarios. Some of those bugs are fixed, but one or two are still left. However, one concerning error (at ceph csi) from the provided logs is this:

default 146m Warning SnapshotContentCheckandUpdateFailed volumesnapshotcontent Failed to check and update snapshot content: failed to take snapshot of the volume 0001-0011-openshift-storage-0000000000000001-7fd8f132-d5f6-11eb-ab7c-0a580a80084b: "rpc error: code = Internal desc = RBD image not found"

This kind of gives a message that the "snapshot creation" itself failed, or that ceph csi returned "failure" here. That is not good unless the volume/image got deleted forcefully in the backend, or the data source or referenced object does not exist yet in the cluster.

One thing I am struggling with here (at least in the absence of knowledge of the test bed) is understanding the sequence of operations, because the operations listed here are "snapshot" creation, but there are also mentions of "smart cloning". Can you provide some details on the process executed here, or point to the template which you use for testing?

That said: "In this case I deployed 200 DVs in parallel" - do these "200 DVs" carry a DataSource of "snapshot" or "PVCs"? Or is it a mix? If those are PVCs (for smart cloning), does the referenced "PVC" exist properly here? I ask because there is a mention in the bz description that "the PVCs for these DVs have not been allocated yet." If those are "snapshots", does that 'snapshot' exist? Or are you using the same template/loop and executing both "snapshot" and DV creation together?
Thanks for the detailed response about the process and the ongoing snapshot controller work. I noticed ceph did report "Flattening image ocs-storagecluster-cephblockpool/csi-snapXXXX" during this state; would that be expected?

In this case the DataVolume (https://github.com/kubevirt/containerized-data-importer/blob/main/doc/datavolumes.md) creates the PVC itself. I started by creating a single DV to serve as the source for all others:

apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: win10-dv-source
spec:
  source:
    http:
      url: http://192.168.222.1:81/ocp4-upi/win10-1.img
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 40Gi
    volumeMode: Block
    storageClassName: ocs-storagecluster-ceph-rbd

And only after that DV was successful did I start to create the rounds of 200 DV clones using the above as the source pvc:

apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: win10-1-disk
spec:
  source:
    pvc:
      namespace: "default"
      name: "win10-dv-source"
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 40Gi
    volumeMode: Block
    storageClassName: ocs-storagecluster-ceph-rbd

It is interesting to note that only (some of) the last round of DVs got stuck in the SnapshotForSmartCloneInProgress state: 459, 478, 482, 483, 501-513, 515, 517, 594-600. Which seems like it could indicate a scale-related problem?
Here is the analysis so far (though there are too many snapshots (600) and logs to look at). It looks like 573 clones got completed and 26 snapshots are still pending as per the must-gather. In total that is 573+26 -> 599; iic, the bz says 600 in total, so one is missing here, but that should not be a concern at this stage.

*) Pending snapshots:
--------------------------------------------------
snapshot-010650c6-c68f-42b8-83ac-7301282dd4d0
snapshot-0b25cd4d-ce40-4a8b-a3b2-6a250ec61d6d
snapshot-2e3517b6-4dd4-448c-b178-7734bb642fc3
snapshot-315ee773-a6bc-4873-b32e-e2d6e29a2559
snapshot-355c401f-8b65-4380-90fc-752de4dbb11e
snapshot-422d9bc6-81d7-4b77-b0c2-0309452e1c76
snapshot-47bb490d-6152-4ec5-abf5-4cd6ec477041
snapshot-57098abc-4b1e-4da7-981d-bbad6470424c
snapshot-5745bba3-0fab-455d-8b34-d0977b66d693
snapshot-59bf4e5b-043a-4527-91c5-07c242e50461
snapshot-795cf661-84de-45b2-b1a6-9971b82c4069
snapshot-87b0c7dd-05dd-4604-b326-4d86897fdfd4
snapshot-8e9ff1d1-546c-4939-8241-f7184956b46f
snapshot-9721af7f-a473-45c5-9c57-7db314ed5776
snapshot-98811305-cbb3-4725-9149-1bb19a08914b
snapshot-a054c027-1331-43e6-846d-8d0688803195
snapshot-a6fa9d26-b9ba-42b9-9370-017adb3cd608
snapshot-aa92caa7-8ffd-41ea-9b3a-642acf3edb32
snapshot-af3d8578-e2c4-4a35-b825-557fe5a0cfc8
snapshot-b160657d-9250-4373-b525-cb43b3db161a
snapshot-b7e63728-4602-46bb-963a-101a55312013
snapshot-c28ae54e-b4e5-4af5-b3b7-1c767d8a6e4a
snapshot-cde741bb-3997-4569-87d2-0c2d32c155f7
snapshot-d7601e5c-64aa-4efe-a8c5-754e2cb855b0
snapshot-e7c375e7-4903-4b39-ba98-345aaeb01cfe
snapshot-f5340dab-23cc-4e4b-9c65-d488e84beeac

All these snapshots reported "rbd image not found" errors when they tried to perform the "flatten" task as part of this transaction:
...................
2021-06-28T15:24:54.872752262Z W0628 15:24:54.872738 1 rbd_util.go:469] ID: 55568 Req-ID: snapshot-e7c375e7-4903-4b39-ba98-345aaeb01cfe uncaught error while scheduling a task (an error (exit status 2) occurred while running ceph args: [rbd task add flatten ocs-storagecluster-cephblockpool/csi-snap-04e89960-d5f8-11eb-ab7c-0a580a80084b --id csi-rbd-provisioner --keyfile=***stripped*** -m 172.30.64.23:6789,172.30.22.136:6789,172.30.131.179:6789]): Error ENOENT: [errno 2] Image ocs-storagecluster-cephblockpool/csi-snap-04e89960-d5f8-11eb-ab7c-0a580a80084b does not exist
****
2021-06-28T15:24:54.887930530Z W0628 15:24:54.887860 1 rbd_util.go:469] ID: 55578 Req-ID: snapshot-5745bba3-0fab-455d-8b34-d0977b66d693 uncaught error while scheduling a task (an error (exit status 2) occurred while running ceph args: [rbd task add flatten ocs-storagecluster-cephblockpool/csi-snap-a30832a9-d5f7-11eb-ab7c-0a580a80084b --id csi-rbd-provisioner --keyfile=***stripped*** -m 172.30.64.23:6789,172.30.22.136:6789,172.30.131.179:6789]): Error ENOENT: [errno 2] Image ocs-storagecluster-cephblockpool/csi-snap-a30832a9-d5f7-11eb-ab7c-0a580a80084b does not exist
****
2021-06-28T15:24:55.480658120Z W0628 15:24:55.480632 1 rbd_util.go:469] ID: 55562 Req-ID: snapshot-87b0c7dd-05dd-4604-b326-4d86897fdfd4 uncaught error while scheduling a task (an error (exit status 2) occurred while running ceph args: [rbd task add flatten ocs-storagecluster-cephblockpool/csi-snap-1f41bba7-d5fa-11eb-ab7c-0a580a80084b --id csi-rbd-provisioner --keyfile=***stripped*** -m 172.30.64.23:6789,172.30.22.136:6789,172.30.131.179:6789]): Error ENOENT: [errno 2] Image ocs-storagecluster-cephblockpool/csi-snap-1f41bba7-d5fa-11eb-ab7c-0a580a80084b does not exist
..................
If we go ahead and map the images for these snapshot requests, we can see:

2021-06-28T15:27:14.110438114Z E0628 15:27:14.110419 1 utils.go:173] ID: 55568 Req-ID: snapshot-e7c375e7-4903-4b39-ba98-345aaeb01cfe GRPC error: rpc error: code = ResourceExhausted desc = rbd image ocs-storagecluster-cephblockpool/csi-vol-7fd8f132-d5f6-11eb-ab7c-0a580a80084b has 457 snapshots

---------------------------------------------

During our snapshot creation/clone process, 'flattening' takes place in a few scenarios; for example, if the parent has more snapshots than the configured `maxsnapshotsonimage`, we add background tasks to flatten the temporary cloned images, etc. Our snapshot creation process looks like this:

----------------------------------------------------------------------------------
rbd snap ls <RBD image for src k8s volume> --all

// If the parent has more snapshots than the configured `maxsnapshotsonimage`,
// add background tasks to flatten the temporary cloned images (temporary cloned
// image names will be the same as the snapshot names)
ceph rbd task add flatten <RBD image for temporary snap images>

rbd snap create <RBD image for src k8s volume>@<random snap name>
rbd clone --rbd-default-clone-format 2 --image-feature layering,deep-flatten <RBD image for src k8s volume>@<random snap> <RBD image for temporary snap image>
rbd snap rm <RBD image for src k8s volume>@<random snap name>
rbd snap create <RBD image for temporary snap image>@<random snap name>

// Check the depth: if the depth is greater than the configured hard limit, add a
// task to flatten the cloned image and return snapshot status ready as `false`;
// if the depth is greater than the soft limit, add a task to flatten the image
// and return success
ceph rbd task add flatten <RBD image for temporary snap image>
---------------------------------------------------------------------

That said, the maximum number of snapshots allowed on an RBD image is "510":
https://github.com/torvalds/linux/blob/master/drivers/block/rbd.c#L98

However, to be on the safe side, in ceph csi we have configured this value to "450", which is the hard limit that was crossed (457) in this case. The mentioned RBD image has 457 snapshots, there was an attempt to "flatten" the image in the backend which didn't go through, and snapshot creation is stuck.

*******
jfr:

    if len(snaps) > int(maxSnapshotsOnImage) {
        util.DebugLog(
            ctx, "snapshots count %d on image: %s reached configured hard limit %d",
            len(snaps), rbdVol, maxSnapshotsOnImage)
        err = flattenClonedRbdImages(
            ctx, snaps, rbdVol.Pool, rbdVol.Monitors, rbdVol.RbdImageName, cr)
        if err != nil {
            return status.Error(codes.Internal, err.Error())
        }
        return status.Errorf(codes.ResourceExhausted, "rbd image %s has %d snapshots", rbdVol, len(snaps))
    }
******

The source PV:
------------------------------------------------------------------
Name:            pvc-91ae1369-c277-4b75-9a24-ae5c9de5999b
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.rbd.csi.ceph.com
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    ocs-storagecluster-ceph-rbd
Status:          Bound
Claim:           default/win10-dv-source
Reclaim Policy:  Delete
Access Modes:    RWX
VolumeMode:      Block
Capacity:        40Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            openshift-storage.rbd.csi.ceph.com
    FSType:
    VolumeHandle:      0001-0011-openshift-storage-0000000000000001-7fd8f132-d5f6-11eb-ab7c-0a580a80084b
    ReadOnly:          false
    VolumeAttributes:  clusterID=openshift-storage
                       csi.storage.k8s.io/pv/name=pvc-91ae1369-c277-4b75-9a24-ae5c9de5999b
                       csi.storage.k8s.io/pvc/name=win10-dv-source
                       csi.storage.k8s.io/pvc/namespace=default
                       imageFeatures=layering
                       imageFormat=2
                       imageName=csi-vol-7fd8f132-d5f6-11eb-ab7c-0a580a80084b
                       journalPool=ocs-storagecluster-cephblockpool
                       pool=ocs-storagecluster-cephblockpool
                       storage.kubernetes.io/csiProvisionerIdentity=1624651293250-8081-openshift-storage.rbd.csi.ceph.com
                       thickProvision=false
---------------

I will go through the backend logs and try to figure out the details, but from the error message it looks like something got messed up in between and it reports the 'image not found' error.
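The hard-limit behaviour described above can be sketched in a small script: it mirrors the ceph-csi check (count > maxsnapshotsonimage, configured to 450 here per the comment above, returns ResourceExhausted) and notes how the live count could be obtained from the source image named in the PV. The rbd invocation is an assumption about how one would check from a toolbox pod, not a command taken from the logs:

```shell
#!/usr/bin/env bash
# Sketch: reproduce ceph-csi's hard-limit check for snapshots on an RBD image.
# Per the analysis above, ceph-csi uses 450 as maxsnapshotsonimage (the kernel
# limit is 510); above the hard limit CreateSnapshot returns ResourceExhausted.
check_snapshot_limit() {
  local count="$1" hard_limit="${2:-450}"
  if [ "$count" -gt "$hard_limit" ]; then
    echo "ResourceExhausted: image has $count snapshots (hard limit $hard_limit)"
    return 1
  fi
  echo "ok: $count snapshots (hard limit $hard_limit)"
}

# On a live cluster, the count could come from the toolbox pod (hypothetical
# invocation, using the pool/image from the source PV above):
#   count=$(rbd snap ls --all --format json \
#     ocs-storagecluster-cephblockpool/csi-vol-7fd8f132-d5f6-11eb-ab7c-0a580a80084b \
#     | jq length)
#   check_snapshot_limit "$count"
```

With the 457 snapshots reported in the GRPC error, this check trips, which matches the stuck state seen here.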
Moving it out of 4.8 while waiting for the live cluster. We can take the fix in 4.8.1 if required.
Karthick, while we wait for a live setup to debug this further, I have a couple of questions here:

Do we test the scale limit (512) with every new build we have in CI?

Is this test automated and run with each build, or is it manual? If it is manual, can we execute the same against this build and see the result?

If it is run by automation and passing for 4.8 builds (e.g. the one mentioned here, ocs-operator.v4.8.0-427.ci), it will help us isolate this scenario from any kind of regression/setup issue.
(In reply to Humble Chirammal from comment #20)
> Karthick, while we wait for a live setup to debug this further, I have a
> couple of questions here:
>
> Do we test the scale limit (512) with every new build we have in CI?
>
> Is this test automated and run with each build, or is it manual? If it is
> manual, can we execute the same against this build and see the result?
>
> If it is run by automation and passing for 4.8 builds (e.g. the one mentioned
> here, ocs-operator.v4.8.0-427.ci), it will help us isolate this scenario
> from any kind of regression/setup issue.

Yes - we run the scale snapshot test (https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/performance/test_pvc_multi_snapshot_performance.py) and it ran successfully in our recent runs.

In one of the discussions with Mudit, I remember it was mentioned that this issue was seen when cluster capacity was reached? We don't consume the entire cluster capacity while running this test, so we won't end up in this issue. So, I don't think this is a regression.
(In reply to krishnaram Karthick from comment #21)
> (In reply to Humble Chirammal from comment #20)
> > Karthick, while we wait for a live setup to debug this further, I have a
> > couple of questions here:
> >
> > Do we test the scale limit (512) with every new build we have in CI?
> >
> > Is this test automated and run with each build, or is it manual? If it is
> > manual, can we execute the same against this build and see the result?
>
> Yes - we run the scale snapshot test
> (https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/performance/test_pvc_multi_snapshot_performance.py)
> and it ran successfully in our recent runs.

Thanks, Karthick, for the confirmation!

> In one of the discussions with Mudit, I remember it was mentioned that this
> issue was seen when cluster capacity was reached?

Maybe that was a miscommunication. In this case the cluster capacity is not full (at least as per the bz comments above).

> We don't consume the entire cluster capacity while running this test, so we
> won't end up in this issue. So, I don't think this is a regression.

Out of curiosity, do we also have a clone scale test (PVC -> PVC) in our setup?
(In reply to Humble Chirammal from comment #22)
> (In reply to krishnaram Karthick from comment #21)
> > Yes - we run the scale snapshot test
> > (https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/performance/test_pvc_multi_snapshot_performance.py)
> > and it ran successfully in our recent runs.
>
> Thanks, Karthick, for the confirmation!
>
> Out of curiosity, do we also have a clone scale test (PVC -> PVC) in our setup?

Yes, 512 clones - https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/performance/test_pvc_multi_clone_performance.py
(In reply to krishnaram Karthick from comment #23)
> (In reply to Humble Chirammal from comment #22)
> > Out of curiosity, do we also have a clone scale test (PVC -> PVC) in our setup?
>
> Yes, 512 clones -
> https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/performance/test_pvc_multi_clone_performance.py

Thanks, Karthick! That helps rule out the possibility of a regression here. There are some extra settings on this cluster, like primaryAffinity etc., and maybe those special settings are also contributing to this issue. Anyway, it is good that these builds pass our scale testing for snapshot, clone, etc. operations.