Bug 1976936 - Failed to create snapshot
Summary: Failed to create snapshot
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-driver
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Humble Chirammal
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-28 15:25 UTC by Jenifer Abrams
Modified: 2023-08-09 16:37 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-26 05:27:45 UTC
Embargoed:



Description Jenifer Abrams 2021-06-28 15:25:43 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Creating many DataVolumes in parallel from the same source, which uses snapshot-based smart cloning, causes some DVs to hang indefinitely because their snapshots keep failing:

default                     149m        Warning   SnapshotCreationFailed                volumesnapshotcontent/snapcontent-ed26240f-c095-4e71-842f-a8d316e70d3c   Failed to create snapshot: failed to take snapshot of the volume 0001-0011-openshift-storage-0000000000000001-7fd8f132-d5f6-11eb-ab7c-0a580a80084b: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
default                     149m        Warning   SnapshotCreationFailed                volumesnapshotcontent/snapcontent-6b866f92-bb0f-41d5-ae36-9062d64c0d9a   Failed to create snapshot: failed to take snapshot of the volume 0001-0011-openshift-storage-0000000000000001-7fd8f132-d5f6-11eb-ab7c-0a580a80084b: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"

[...]

14m         Warning   SnapshotContentCheckandUpdateFailed   volumesnapshotcontent/snapcontent-47bb490d-6152-4ec5-abf5-4cd6ec477041   Failed to check and update snapshot content: failed to take snapshot of the volume 0001-0011-openshift-storage-0000000000000001-7fd8f132-d5f6-11eb-ab7c-0a580a80084b: "rpc error: code = Aborted desc = an operation with the given Volume ID snapshot-47bb490d-6152-4ec5-abf5-4cd6ec477041 already exists"
14m         Warning   SnapshotContentCheckandUpdateFailed   volumesnapshotcontent/snapcontent-59bf4e5b-043a-4527-91c5-07c242e50461   Failed to check and update snapshot content: failed to take snapshot of the volume 0001-0011-openshift-storage-0000000000000001-7fd8f132-d5f6-11eb-ab7c-0a580a80084b: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"

Version of all relevant components (if applicable):
ocs-operator.v4.8.0-427.ci
  using primaryAffinity of "0.0" for a small set of OSDs (BZ1924946), multus, compression, replica=2

CNV v2.6.5
Client Version: 4.7.9
Server Version: 4.7.9
Kubernetes Version: v1.20.0+7d0a2b2


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes it is limiting how many VMs we can create for scale testing. 


Is there any workaround available to the best of your knowledge?
No


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2


Can this issue reproducible?
yes, we also saw this on a previous build (BZ1972264)


Can this issue reproduce from the UI?
probably


If this is a regression, please provide more details to justify this:
In previous scale testing we have not seen DVs fail w/ snapshot errors. 


Steps to Reproduce:
1. Create 100s of DVs using the same source so they use snapshot/smart-clone (a loop sketch follows step 3), e.g.:

apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: win10-1-disk
spec:
  source:
    pvc:
      namespace: "default"
      name: "win10-dv-source"
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 40Gi
    volumeMode: Block
    storageClassName: ocs-storagecluster-ceph-rbd


2. Wait for all DVs to reach Succeeded state
3. Find that many DVs are stuck in SnapshotForSmartCloneInProgress, even after multiple days with the cluster otherwise idle.
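
A loop sketch for step 1, assuming the clone manifest above is saved as dv-clone.yaml (file name and count are illustrative); the second command is one way to spot the DVs described in step 3:

# Stamp out 200 uniquely named clone DVs from the template above.
for i in $(seq 1 200); do
  sed "s/win10-1-disk/win10-${i}-disk/" dv-clone.yaml | oc apply -f -
done

# List DataVolumes stuck in the smart-clone snapshot phase.
oc get dv -o custom-columns=NAME:.metadata.name,PHASE:.status.phase | grep SnapshotForSmartCloneInProgress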

We have plenty of space:
sh-4.4# ceph df detail
RAW STORAGE:
    CLASS     SIZE       AVAIL      USED        RAW USED     %RAW USED
    ssd       77 TiB     71 TiB     6.0 TiB      6.1 TiB          7.97
    TOTAL     77 TiB     71 TiB     6.0 TiB      6.1 TiB          7.97

POOLS:
    POOL                                           ID     STORED      OBJECTS     USED        %USED     MAX AVAIL     QUOTA OBJECTS     QUOTA BYTES     DIRTY     USED COMPR     UNDER COMPR
    ocs-storagecluster-cephblockpool                1     8.7 TiB       2.29M     5.9 TiB      9.21        29 TiB     N/A               N/A             2.29M        2.8 TiB          11 TiB
    ocs-storagecluster-cephfilesystem-metadata      2     3.0 KiB          22      84 KiB         0        29 TiB     N/A               N/A                22            0 B             0 B
    ocs-storagecluster-cephfilesystem-data0         3         0 B           0         0 B         0        29 TiB     N/A               N/A                 0            0 B             0 B


Actual results:
DVs never reach the Succeeded state; they keep failing over and over.


Expected results:
All DVs should finish cloning as long as space is available. 

Additional info:
It is not clear yet whether primaryAffinity is related to this issue.

Note the PVCs for these DVs have not been allocated yet. 

Ceph status has been reporting HEALTH_OK the whole time, but I did notice messages like this, if it's relevant:
Flattening image ocs-storagecluster-cephblockpool/csi-snapXXXX
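
If it is relevant, the flatten tasks the CSI driver has scheduled against the pool can be listed from the toolbox pod (deployment name assumed to be the default rook-ceph-tools):

# Inside the toolbox: show rbd tasks (such as flatten) queued via the mgr.
oc -n openshift-storage rsh deployment/rook-ceph-tools
ceph rbd task list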


More event logs:
default                     146m        Normal    Provisioning                          persistentvolumeclaim/win10-593-disk                                     External provisioner is provisioning volume for claim "default/win10-593-disk"
default                     146m        Normal    Bound                                 datavolume/win10-592-disk                                                PVC win10-592-disk Bound
default                     146m        Normal    SmartClonePVCInProgress               datavolume/win10-593-disk                                                Creating PVC for smart-clone is in progress (for pvc default/win10-dv-source)
default                     146m        Normal    CloneSucceeded                        datavolume/win10-593-disk                                                Successfully cloned from default/win10-dv-source into default/win10-593-disk
default                     146m        Normal    ProvisioningSucceeded                 persistentvolumeclaim/win10-593-disk                                     Successfully provisioned volume pvc-71e6fb7c-1f9f-47b5-924d-0fd2d88a41dc
default                     146m        Normal    Bound                                 datavolume/win10-593-disk                                                PVC win10-593-disk Bound
default                     146m        Warning   SnapshotContentCheckandUpdateFailed   volumesnapshotcontent                                                    Failed to check and update snapshot content: failed to take snapshot of the volume 0001-0011-openshift-storage-0000000000000001-7fd8f132-d5f6-11eb-ab7c-0a580a80084b: "rpc error: code = Internal desc = RBD image not found"
default                     144m        Warning   SnapshotCreationFailed                volumesnapshotcontent/snapcontent-2e3517b6-4dd4-448c-b178-7734bb642fc3   Failed to create snapshot: failed to take snapshot of the volume 0001-0011-openshift-storage-0000000000000001-7fd8f132-d5f6-11eb-ab7c-0a580a80084b: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"

Comment 3 Jenifer Abrams 2021-06-28 17:55:16 UTC
I tried deleting the hung DVs and their VolumeSnapshots (after patching out the finalizers); reapplying one of the DVs then succeeded.
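
A minimal sketch of that cleanup, assuming the stuck DV and its VolumeSnapshot are both named win10-459-disk and the DV manifest is saved locally (names and file path are illustrative):

# Strip the finalizers so the stuck VolumeSnapshot can actually be deleted,
# then remove the snapshot and the DataVolume and reapply the DV manifest.
oc -n default patch volumesnapshot win10-459-disk --type merge -p '{"metadata":{"finalizers":null}}'
oc -n default delete volumesnapshot win10-459-disk
oc -n default delete dv win10-459-disk
oc -n default apply -f win10-459-disk.yaml

Events after reapplying win10-459-disk: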

default                      4m13s       Normal    CreatingSnapshot                      volumesnapshot/win10-459-disk                                            Waiting for a snapshot default/win10-459-disk to be created by the CSI driver.
default                      4m13s       Normal    SnapshotForSmartCloneInProgress       datavolume/win10-459-disk                                                Creating snapshot for smart-clone is in progress (for pvc default/win10-dv-source)
openshift-cnv                2m12s       Normal    ReconcileHCO                          clusterserviceversion/kubevirt-hyperconverged-operator.v2.6.5            HCO Reconcile completed successfully
openshift-cnv                2m12s       Normal    ReconcileHCO                          pod/hco-operator-7b8c96b8b7-jdtnf                                        HCO Reconcile completed successfully
default                      2m12s       Normal    Modified                              networkaddonsconfig/cluster                                              Config spec was modified
openshift-cnv                2m12s       Normal    ReconcileHCO                          hyperconverged/kubevirt-hyperconverged                                   HCO Reconcile completed successfully
default                      112s        Normal    CloneSucceeded                        datavolume/win10-459-disk                                                Successfully cloned from default/win10-dv-source into default/win10-459-disk
default                      112s        Normal    Provisioning                          persistentvolumeclaim/win10-459-disk                                     External provisioner is provisioning volume for claim "default/win10-459-disk"
default                      112s        Normal    ProvisioningSucceeded                 persistentvolumeclaim/win10-459-disk                                     Successfully provisioned volume pvc-78916a3d-c623-462b-9ba1-b420518de192
default                      112s        Normal    SmartClonePVCInProgress               datavolume/win10-459-disk                                                Creating PVC for smart-clone is in progress (for pvc default/win10-dv-source)
default                      112s        Normal    ExternalProvisioning                  persistentvolumeclaim/win10-459-disk                                     waiting for a volume to be created, either by external provisioner "openshift-storage.rbd.csi.ceph.com" or manually created by system administrator
default                      111s        Normal    Bound                                 datavolume/win10-459-disk                                                PVC win10-459-disk Bound

Comment 4 Humble Chirammal 2021-06-30 12:19:18 UTC
>
14m         Warning   SnapshotContentCheckandUpdateFailed   volumesnapshotcontent/snapcontent-47bb490d-6152-4ec5-abf5-4cd6ec477041   Failed to check and update snapshot content: failed to take snapshot of the volume 0001-0011-openshift-storage-0000000000000001-7fd8f132-d5f6-11eb-ab7c-0a580a80084b: "rpc error: code = Aborted desc = an operation with the given Volume ID snapshot-47bb490d-6152-4ec5-abf5-4cd6ec477041 already exists"
>

At a glance, this looks like snapshot controller operations either not going through or getting throttled in this setup. The description mentions that the following settings are enabled in this cluster:

  using primaryAffinity of "0.0" for a small set of OSDs (BZ1924946), multus, compression, replica=2

It looks like one of the above, or a combination of them, is contributing to this behaviour. Also, from a very quick look at the referenced bug, it seems it is not only snapshot operations; some other operations are also slowed down in this setup. If that is the case, the 'primaryAffinity' setting could be contributing. We have also seen that when `compression` is enabled (I believe the setting here is `aggressive`, please correct me if I am wrong), snapshot operations take longer to come back from the Ceph cluster (backend).

As a first step of isolation, is it possible to test this scenario on the same build but without `primaryAffinity` set?
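
If it helps with that isolation, the current per-OSD primary affinity can be checked and reset from the toolbox along these lines (the OSD id is illustrative):

ceph osd tree                          # the PRI-AFF column shows primary affinity per OSD
ceph osd primary-affinity osd.3 1.0    # reset one of the lowered OSDs back to the default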

Also, how many DVs are being provisioned per batch in this test?

Meanwhile, we will dig into the logs and see if anything can be spotted.

Comment 5 Jenifer Abrams 2021-06-30 13:49:42 UTC
Yeah, it could be a combination of the different settings causing some throttling, and yes, it is compression "aggressive" using the default lib. But if snapshots are throttled, the problem is that they don't recover even after the cluster sat idle for 2 days.

Yes, I would like to eventually try again without primaryAffinity or multus after we complete that testing. In the past we have not seen this issue on OCS 4.6/4.7 without these new features, even with compression enabled.

In this case I deployed 200 DVs in parallel, waited for some time, then deployed two more rounds of 200 DVs in parallel. I expected some amount of throttling due to the amount of work, but around 70 of them just never recovered. The "already exists" messages are interesting; the VolumeSnapshot objects were created but no PVCs. Is there any sort of retry logic for volumesnapshots?

Comment 6 Humble Chirammal 2021-07-02 04:29:25 UTC
>In this case I deployed 200 DVs in parallel, waited for some time, then deployed two more rounds of 200 DVs in parallel. I expected some amount of throttling due to the amount of work, but around 70 of them just never recovered. The "already exists" messages are interesting, the volumesnapshots object was created but no PVC. Is there any sort of "re-try" logic in volumesnapshots?
>

Thanks for the details, Jenifer. So it is around 600 snapshot requests, and it looks like around 530 were satisfied; the remaining ~70 didn't come up.

>The "already exists" messages are interesting, the volumesnapshots object was created but no PVC. Is there any sort of "re-try" logic in volumesnapshots?

The 'already exists' message actually reflects a temporary state. Ceph CSI takes locks while operating on a volume to make sure we are not executing parallel operations on the same volume while another operation on it is in progress, and it reports back to Kubernetes/the caller saying "an operation with the given Volume ID ... already exists", so it is a "retry" signal.


For example, at clone time we check whether there is any ongoing delete/expand operation; at restore time, we check whether any "delete" operation is ongoing on the same volume ID, etc.
 

The existence of the volumesnapshotcontent is also a very early stage of the snapshot transaction in the workflow: the volumesnapshotcontent comes first from the snapshot controller, and only then does the actual snapshot creation get started in the CSI driver. Once creation is successful, the "readyToUse" field in the volumesnapshotcontent is changed to "true" along with other fields. In short, if the operation is retried and finally reaches a "failed to create" state, the volumesnapshotcontent etc. has to be cleaned up properly, which includes finalizer removal as part of the process; you actually did that manually later to get it cleared. Also, there is retry logic with exponential backoff and a timeout for each operation in the snapshot controller.
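
As a quick way to see where a given snapshot sits in that workflow, the readyToUse flag and any recorded error can be read straight from the objects; for example, using names taken from the events above:

# VolumeSnapshot readiness (namespaced) and the VolumeSnapshotContent status/error (cluster-scoped).
oc -n default get volumesnapshot win10-459-disk -o jsonpath='{.status.readyToUse}{"\n"}'
oc get volumesnapshotcontent snapcontent-47bb490d-6152-4ec5-abf5-4cd6ec477041 \
  -o jsonpath='{.status.readyToUse}{"  "}{.status.error.message}{"\n"}'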

I am aware of some issues in the snapshot controller where it does not properly update these objects at different stages in some scenarios. Some of those bugs are fixed, but one or two are still left.


However, one concerning error (at the Ceph CSI level) in the provided logs is this:

 
   default                     146m        Warning   SnapshotContentCheckandUpdateFailed   volumesnapshotcontent                                                    Failed to check and update snapshot content: failed to take snapshot of the volume 0001-0011-openshift-storage-0000000000000001-7fd8f132-d5f6-11eb-ab7c-0a580a80084b: "rpc error: code = Internal desc = RBD image not found"


This suggests that the snapshot creation itself failed, or that Ceph CSI returned a failure here. That is not good unless the volume/image was forcefully deleted in the backend, or the data source / referenced object does not yet exist in the cluster.

One thing I am struggling with here (at least without knowledge of the test bed) is understanding the sequence of operations: the operations listed here are "snapshot" creation, but there are also mentions of "smart cloning". Can you provide some details on the process executed here, or point to the template you use for testing?


That said:

"In this case I deployed 200 DVs in parallel" , are these "200 DVs" carry DataSource of "snapshot" or "PVCs"?  or its a mix? 

If those are PVCs (for smart cloning), does the referenced PVC exist properly here? I ask because the BZ description mentions that "the PVCs for these DVs have not been allocated yet."

If those are snapshots, does that snapshot exist? Or are you using the same template/loop and executing both the snapshot and the DV creation together?
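
To answer those quickly on the cluster, the DV source and the referenced objects can be checked with something like the following (names taken from the reproduction steps):

# What does the DV reference, and do the referenced source PVC / snapshots actually exist?
oc -n default get dv win10-1-disk -o jsonpath='{.spec.source}{"\n"}'
oc -n default get pvc win10-dv-source
oc -n default get volumesnapshot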

Comment 7 Jenifer Abrams 2021-07-02 20:25:52 UTC
Thanks for the detailed response about the process and the ongoing snapshot controller work. I noticed Ceph did report "Flattening image ocs-storagecluster-cephblockpool/csi-snapXXXX" during this state; would that be expected?

In this case the DataVolume (https://github.com/kubevirt/containerized-data-importer/blob/main/doc/datavolumes.md) creates the PVC itself. I started by creating a single DV to serve as the source for all the others:

apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: win10-dv-source
spec:
  source:
      http:
         url: http://192.168.222.1:81/ocp4-upi/win10-1.img
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 40Gi
    volumeMode: Block
    storageClassName: ocs-storagecluster-ceph-rbd


And only after that DV was successful did I start to create the rounds of 200 DV clones using the above as the source pvc:

apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: win10-1-disk
spec:
  source:
    pvc:
      namespace: "default"
      name: "win10-dv-source"
  pvc:
    accessModes:
      - ReadWriteMany
    resources:
      requests:
        storage: 40Gi
    volumeMode: Block
    storageClassName: ocs-storagecluster-ceph-rbd


It is interesting to note that only (some of) the last round of DVs got stuck in the SnapshotForSmartCloneInProgress state (26 in total):
459, 478, 482, 483, 501-513, 515, 517, 594-600

Which seems like it could indicate a scale-related problem?

Comment 9 Humble Chirammal 2021-07-05 08:25:18 UTC
Here is the analysis so far (there are a lot of snapshots (600) and logs to look at, though):

It looks like 573 clones completed and 26 snapshots are still pending as per the must-gather. In total that is 573 + 26 = 599; the BZ says 600 in total, so one is unaccounted for, but that should not be a concern at this stage:

*) Pending Snapshots:
--------------------------------------------------
snapshot-010650c6-c68f-42b8-83ac-7301282dd4d0
snapshot-0b25cd4d-ce40-4a8b-a3b2-6a250ec61d6d
snapshot-2e3517b6-4dd4-448c-b178-7734bb642fc3
snapshot-315ee773-a6bc-4873-b32e-e2d6e29a2559
snapshot-355c401f-8b65-4380-90fc-752de4dbb11e
snapshot-422d9bc6-81d7-4b77-b0c2-0309452e1c76
snapshot-47bb490d-6152-4ec5-abf5-4cd6ec477041
snapshot-57098abc-4b1e-4da7-981d-bbad6470424c
snapshot-5745bba3-0fab-455d-8b34-d0977b66d693
snapshot-59bf4e5b-043a-4527-91c5-07c242e50461
snapshot-795cf661-84de-45b2-b1a6-9971b82c4069
snapshot-87b0c7dd-05dd-4604-b326-4d86897fdfd4
snapshot-8e9ff1d1-546c-4939-8241-f7184956b46f
snapshot-9721af7f-a473-45c5-9c57-7db314ed5776
snapshot-98811305-cbb3-4725-9149-1bb19a08914b
snapshot-a054c027-1331-43e6-846d-8d0688803195
snapshot-a6fa9d26-b9ba-42b9-9370-017adb3cd608
snapshot-aa92caa7-8ffd-41ea-9b3a-642acf3edb32
snapshot-af3d8578-e2c4-4a35-b825-557fe5a0cfc8
snapshot-b160657d-9250-4373-b525-cb43b3db161a
snapshot-b7e63728-4602-46bb-963a-101a55312013
snapshot-c28ae54e-b4e5-4af5-b3b7-1c767d8a6e4a
snapshot-cde741bb-3997-4569-87d2-0c2d32c155f7
snapshot-d7601e5c-64aa-4efe-a8c5-754e2cb855b0
snapshot-e7c375e7-4903-4b39-ba98-345aaeb01cfe
snapshot-f5340dab-23cc-4e4b-9c65-d488e84beeac


All of these snapshots reported "rbd image not found" errors while trying to perform the "flatten" task as part of this transaction:

...................

2021-06-28T15:24:54.872752262Z W0628 15:24:54.872738       1 rbd_util.go:469] ID: 55568 Req-ID: snapshot-e7c375e7-4903-4b39-ba98-345aaeb01cfe uncaught error while scheduling a task (an error (exit status 2) occurred while running ceph args: [rbd task add flatten ocs-storagecluster-cephblockpool/csi-snap-04e89960-d5f8-11eb-ab7c-0a580a80084b --id csi-rbd-provisioner --keyfile=***stripped*** -m 172.30.64.23:6789,172.30.22.136:6789,172.30.131.179:6789]): Error ENOENT: [errno 2] Image ocs-storagecluster-cephblockpool/csi-snap-04e89960-d5f8-11eb-ab7c-0a580a80084b does not exist

****

2021-06-28T15:24:54.887930530Z W0628 15:24:54.887860       1 rbd_util.go:469] ID: 55578 Req-ID: snapshot-5745bba3-0fab-455d-8b34-d0977b66d693 uncaught error while scheduling a task (an error (exit status 2) occurred while running ceph args: [rbd task add flatten ocs-storagecluster-cephblockpool/csi-snap-a30832a9-d5f7-11eb-ab7c-0a580a80084b --id csi-rbd-provisioner --keyfile=***stripped*** -m 172.30.64.23:6789,172.30.22.136:6789,172.30.131.179:6789]): Error ENOENT: [errno 2] Image ocs-storagecluster-cephblockpool/csi-snap-a30832a9-d5f7-11eb-ab7c-0a580a80084b does not exist

****

2021-06-28T15:24:55.480658120Z W0628 15:24:55.480632       1 rbd_util.go:469] ID: 55562 Req-ID: snapshot-87b0c7dd-05dd-4604-b326-4d86897fdfd4 uncaught error while scheduling a task (an error (exit status 2) occurred while running ceph args: [rbd task add flatten ocs-storagecluster-cephblockpool/csi-snap-1f41bba7-d5fa-11eb-ab7c-0a580a80084b --id csi-rbd-provisioner --keyfile=***stripped*** -m 172.30.64.23:6789,172.30.22.136:6789,172.30.131.179:6789]): Error ENOENT: [errno 2] Image ocs-storagecluster-cephblockpool/csi-snap-1f41bba7-d5fa-11eb-ab7c-0a580a80084b does not exist

..................

If we map the images for these snapshot requests, we can see:

2021-06-28T15:27:14.110438114Z E0628 15:27:14.110419       1 utils.go:173] ID: 55568 Req-ID: snapshot-e7c375e7-4903-4b39-ba98-345aaeb01cfe GRPC error: rpc error: code = ResourceExhausted desc = rbd image ocs-storagecluster-cephblockpool/csi-vol-7fd8f132-d5f6-11eb-ab7c-0a580a80084b has 457 snapshots
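
For completeness, the snapshot count that trips this ResourceExhausted path can be confirmed from the toolbox; the image name below is the one from the source PV further down (the count is approximate since it includes a header line):

rbd snap ls --all ocs-storagecluster-cephblockpool/csi-vol-7fd8f132-d5f6-11eb-ab7c-0a580a80084b | wc -l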



---------------------------------------------

 
During our snapshot/clone creation process, flattening takes place in a few scenarios; for example, if the parent has more snapshots than the configured `maxsnapshotsonimage`, we add background tasks to flatten the temporary cloned images, etc.

For example our snapshot creation process looks like this:

----------------------------------------------------------------------------------

rbd snap ls <RBD image for src k8s volume> --all

// If the parent has more snapshots than the configured `maxsnapshotsonimage`
// add background tasks to flatten the temporary cloned images (temporary cloned
// image names will be same as snapshot names)
ceph rbd task add flatten <RBD image for temporary snap images>

rbd snap create <RBD image for src k8s volume>@<random snap name>
rbd clone --rbd-default-clone-format 2 --image-feature
    layering,deep-flatten <RBD image for src k8s volume>@<random snap>
    <RBD image for temporary snap image>
rbd snap rm <RBD image for src k8s volume>@<random snap name>
rbd snap create <RBD image for temporary snap image>@<random snap name>

// check the depth, if the depth is greater than configured hardlimit add a
// task to flatten the cloned image, return snapshot status ready as `false`,
// if the depth is greater than softlimit add a task to flatten the image
// and return success
ceph rbd task add flatten <RBD image for temporary snap image>


---------------------------------------------------------------------

That said, the maximum number of snapshots allowed on an RBD image is 510:

https://github.com/torvalds/linux/blob/master/drivers/block/rbd.c#L98

However, to be on the safer side, in Ceph CSI we have configured this value to 450, which is the hard limit that was crossed (457) in this case.

In this case, the mentioned RBD image had 457 snapshots and there was an attempt to flatten the image in the backend, which did not go through, so snapshot creation is stuck.
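
A quick way to confirm which limit the provisioner is actually running with (the ceph-csi flag is maxsnapshotsonimage; the deployment name below is the usual one in an OCS/ODF install):

# If nothing is printed, the flag is not set explicitly and the 450 default should apply.
oc -n openshift-storage get deployment csi-rbdplugin-provisioner -o yaml | grep -i maxsnapshotsonimage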

******* Just for reference (jfr), the relevant ceph-csi check:

if len(snaps) > int(maxSnapshotsOnImage) {
		util.DebugLog(
			ctx,
			"snapshots count %d on image: %s reached configured hard limit %d",
			len(snaps),
			rbdVol,
			maxSnapshotsOnImage)
		err = flattenClonedRbdImages(
			ctx,
			snaps,
			rbdVol.Pool,
			rbdVol.Monitors,
			rbdVol.RbdImageName,
			cr)
		if err != nil {
			return status.Error(codes.Internal, err.Error())
		}
		return status.Errorf(codes.ResourceExhausted, "rbd image %s has %d snapshots", rbdVol, len(snaps))
	}

******

The source PV:

------------------------------------------------------------------
Name:            pvc-91ae1369-c277-4b75-9a24-ae5c9de5999b
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.rbd.csi.ceph.com
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    ocs-storagecluster-ceph-rbd
Status:          Bound
Claim:           default/win10-dv-source
Reclaim Policy:  Delete
Access Modes:    RWX
VolumeMode:      Block
Capacity:        40Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            openshift-storage.rbd.csi.ceph.com
    FSType:
    VolumeHandle:      0001-0011-openshift-storage-0000000000000001-7fd8f132-d5f6-11eb-ab7c-0a580a80084b
    ReadOnly:          false
    VolumeAttributes:      clusterID=openshift-storage
                           csi.storage.k8s.io/pv/name=pvc-91ae1369-c277-4b75-9a24-ae5c9de5999b
                           csi.storage.k8s.io/pvc/name=win10-dv-source
                           csi.storage.k8s.io/pvc/namespace=default
                           imageFeatures=layering
                           imageFormat=2
                           imageName=csi-vol-7fd8f132-d5f6-11eb-ab7c-0a580a80084b
                           journalPool=ocs-storagecluster-cephblockpool
                           pool=ocs-storagecluster-cephblockpool
                           storage.kubernetes.io/csiProvisionerIdentity=1624651293250-8081-openshift-storage.rbd.csi.ceph.com
                           thickProvision=false

---------------



I will go through the backend logs and try to figure out the details, but from the error message it looks like something got messed up in between and it reports the 'image not found' error.

Comment 19 Mudit Agarwal 2021-07-14 08:36:10 UTC
Moving it out of 4.8 while waiting for the live cluster.
We can take the fix in 4.8.1 if required.

Comment 20 Humble Chirammal 2021-07-16 10:49:32 UTC
Karthick, while we wait for a live setup to debug this further, I have a couple of questions:

Do we test the scale limit (512) with the very latest build we have in CI?

Is this test automated and run with each build, or is it manual? If it is manual, can we execute the same against this build and see the result?

If it is run by automation and passing for 4.8 builds (e.g. the one mentioned here, ocs-operator.v4.8.0-427.ci), it will help us isolate this scenario from any kind of regression/setup issue.

Comment 21 krishnaram Karthick 2021-07-21 12:00:19 UTC
(In reply to Humble Chirammal from comment #20)
> Karthick, while we wait for a live setup to debug this further, I have a
> couple of questions:
> 
> Do we test the scale limit (512) with the very latest build we have in CI?
> 
> Is this test automated and run with each build, or is it manual? If it is
> manual, can we execute the same against this build and see the result?
> 
> If it is run by automation and passing for 4.8 builds (e.g. the one mentioned
> here, ocs-operator.v4.8.0-427.ci), it will help us isolate this scenario from
> any kind of regression/setup issue.

Yes - we run the scale snapshot test(https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/performance/test_pvc_multi_snapshot_performance.py) and it ran successfully in our recent runs. 
in one of the discussions with Mudit, I remember it was mentioned that this issue was seen when cluster capacity was reached?
we don't consume the entire cluster capacity while running this test. So, we won't end up in this issue. So, I don't think this is a regression.

Comment 22 Humble Chirammal 2021-07-22 04:33:57 UTC
(In reply to krishnaram Karthick from comment #21)
> Yes - we run the scale snapshot
> test(https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/
> performance/test_pvc_multi_snapshot_performance.py) and it ran successfully
> in our recent runs. 

Thanks Karthick for the confirmation! 


> in one of the discussions with Mudit, I remember it was mentioned that this
> issue was seen when cluster capacity was reached?

Maybe it was a miscommunication. In this case the cluster capacity is not full (at least as per the above BZ comments).

> we don't consume the entire cluster capacity while running this test. So, we
> won't end up in this issue. So, I don't think this is a regression.

Out of curiosity, do we also have a clone scale test (PVC -> PVC) in our setup?

Comment 23 krishnaram Karthick 2021-07-22 04:37:18 UTC
(In reply to Humble Chirammal from comment #22)
> Out of curiosity, do we also have a clone scale test (PVC -> PVC) in our setup?

Yes, 512 clones - https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/performance/test_pvc_multi_clone_performance.py

Comment 24 Humble Chirammal 2021-07-22 06:52:54 UTC
(In reply to krishnaram Karthick from comment #23)
> (In reply to Humble Chirammal from comment #22)

> > Out of curiosity, do we also have a clone scale test (PVC -> PVC) in our setup?
> 
> Yes, 512 clones -
> https://github.com/red-hat-storage/ocs-ci/blob/master/tests/e2e/performance/
> test_pvc_multi_clone_performance.py

Thanks Karthick! That helps rule out the possibility of a regression here.
There are some extra settings on this cluster, like primary affinity, etc., and maybe those special settings are also contributing to this issue. Anyway, it is good that these builds pass our scale testing for snapshot, clone, etc. operations.

