From the event logs this appeared to be an issue with snapshot availability; the original set of DVs eventually recovered ~5 hours later. Boaz created another 200 DVs (clones), and this time many got stuck in the "SnapshotForSmartCloneInProgress" state for 7+ hours, with many of these event errors:

37m Warning SnapshotContentCheckandUpdateFailed volumesnapshotcontent/snapcontent-02d52864-cab3-4dab-bcb7-19e0b4cc19c9 Failed to check and update snapshot content: failed to take snapshot of the volume, 0001-0011-openshift-storage-0000000000000003-9ca44b7f-cd3c-11eb-a625-0a580a82050f: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"

45m Warning SnapshotContentCheckandUpdateFailed volumesnapshotcontent/snapcontent-02e58a9d-25a2-4b21-8dea-da77b072e020 Failed to check and update snapshot content: failed to take snapshot of the volume, 0001-0011-openshift-storage-0000000000000003-9ca44b7f-cd3c-11eb-a625-0a580a82050f: "rpc error: code = Aborted desc = an operation with the given Volume ID snapshot-02e58a9d-25a2-4b21-8dea-da77b072e020 already exists"
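For reference, a minimal sketch of how this state can be surfaced with the standard tooling (no specific namespace is assumed; the grep pattern matches the event reason above):

    # DataVolume phase shows the stuck state directly:
    oc get dv -A

    # VolumeSnapshotContents are cluster-scoped; READYTOUSE stays false
    # while the snapshot is pending:
    oc get volumesnapshotcontent

    # Pull out the failing snapshot-content events across all namespaces:
    oc get events -A --field-selector type=Warning | grep SnapshotContentCheckandUpdateFailed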
Niels, could this be an issue with exceeding the maximum allowed number of snapshot clones? Is this a condition that recovers automatically? If this is an OCS issue, does the storage surface some sort of alert while it's happening?
This happened during a period when the cluster saw huge stalls once the OCS snapshot limit was reached. That condition should be recoverable, and we could not reproduce it again later -- it is covered in BZ1976936. I guess the question is: should CNV report any VM progress state while it is waiting on a DV/PVC?
(In reply to Adam Litke from comment #2)
> Niels, could this be an issue with exceeding the maximum allowed number of
> snapshot clones? Is this a condition that recovers automatically? If this
> is an OCS issue, does the storage surface some sort of alert while it's
> happening?

There is a limit on the number of snapshots an image can have before they get flattened. Once flattening (copying the data so that the snapshot becomes self-contained, independent of its parent) is happening, I/O might peak. This is a background operation run by the Ceph MGR component. I think there is also a limit on the number of concurrent flatten operations, which can cause delays while preparing the snapshots.
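If you want to check whether flattening is in progress, something like the following should work from the rook-ceph-tools pod (a sketch only; the pool and image names are placeholders):

    # Background tasks (including flattens) scheduled through the Ceph MGR
    # rbd_support module:
    ceph rbd task list

    # A cloned image that still shows a "parent:" line has not been
    # flattened yet ("ocs-storagecluster-cephblockpool" is the default ODF
    # block pool name; the image name is a placeholder):
    rbd info ocs-storagecluster-cephblockpool/csi-vol-<uuid>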
Thanks Niels. Can you check if there is an alert that fires while this process is ongoing? If so, I would be satisfied that the system is working as designed with an appropriate amount of transparency to affected users.
Hi Madhu,

In our scale tests we are triggering the rbd snapshot flattening process. During flattening there is a significant delay in processing additional snapshot requests. Do you know if ODF raises an alert while this process happens, so that a cluster user or admin would have visibility into what is causing the delays?
Hi Adam,

AFAIK there is no alert raised, as the flattening is done as an internal operation during CreateVolume/CreateSnapshot at the cephcsi level. CSI cannot talk to Kubernetes to raise any alerts in this case. If I remember correctly, this is logged in the PVC events.
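A minimal sketch of pulling those events (the namespace and PVC name are placeholders):

    # Events for a specific PVC, including messages from the CSI sidecars:
    oc describe pvc <clone-pvc-name> -n <namespace>

    # Or filter the event stream directly:
    oc get events -n <namespace> \
      --field-selector involvedObject.kind=PersistentVolumeClaim,involvedObject.name=<clone-pvc-name>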
Madhu,

Thanks for providing this information. This seems like an important thing to alert about, since it blocks progress for a long period of time. I am sure there is a way to detect this condition (whether at the CSI level or the Rook/Ceph level) and report it upwards via an alert. I created https://bugzilla.redhat.com/show_bug.cgi?id=2062339 to track our request to add an alert for this condition.
This is tracked by development in RHSTOR-3276, so closing this bug.