Bug 1972264 - VMs cloned from a DV on a scaled environment became "zombies".
Summary: VMs cloned from a DV on a scaled environment became "zombies".
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 2.6.4
Hardware: x86_64
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Adam Litke
QA Contact: Natalie Gavrielov
URL:
Whiteboard:
Depends On: 2062339
Blocks:
 
Reported: 2021-06-15 14:32 UTC by Boaz
Modified: 2022-09-28 15:47 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-06-22 12:24:54 UTC
Target Upstream Version:
Embargoed:


Links
  GitHub ceph/ceph-csi pull 1195 (Merged): "rbd: Handle maximum snapshots on a single rbd image" (last updated 2022-03-09 12:25:28 UTC)
  Red Hat Issue Tracker RHSTOR-3276 (last updated 2022-06-22 12:23:10 UTC)

Comment 1 Jenifer Abrams 2021-06-16 17:13:35 UTC
From the event logs this appeared to be an issue with snapshot availability; the original set of DVs eventually recovered ~5hrs later.
Boaz created another 200 DVs (clones), and this time many got stuck in the "SnapshotForSmartCloneInProgress" state for 7hrs+, with many of these event errors:

37m         Warning   SnapshotContentCheckandUpdateFailed   volumesnapshotcontent/snapcontent-02d52864-cab3-4dab-bcb7-19e0b4cc19c9   Failed to check and update snapshot content: failed to take snapshot of the volume, 0001-0011-openshift-storage-0000000000000003-9ca44b7f-cd3c-11eb-a625-0a580a82050f: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
45m         Warning   SnapshotContentCheckandUpdateFailed   volumesnapshotcontent/snapcontent-02e58a9d-25a2-4b21-8dea-da77b072e020   Failed to check and update snapshot content: failed to take snapshot of the volume, 0001-0011-openshift-storage-0000000000000003-9ca44b7f-cd3c-11eb-a625-0a580a82050f: "rpc error: code = Aborted desc = an operation with the given Volume ID snapshot-02e58a9d-25a2-4b21-8dea-da77b072e020 already exists"
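
For reference, a minimal sketch of how to survey this state on a similar cluster (apart from the snapcontent name taken from the first event above, resource names and namespaces are placeholders, not from this environment):

  # DataVolumes still waiting on the smart-clone snapshot
  oc get dv -A | grep SnapshotForSmartCloneInProgress

  # All snapshot-controller failures of the kind shown above
  oc get events -A --field-selector reason=SnapshotContentCheckandUpdateFailed

  # Details of one stuck VolumeSnapshotContent, e.g. the first one above
  oc describe volumesnapshotcontent snapcontent-02d52864-cab3-4dab-bcb7-19e0b4cc19c9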

Comment 2 Adam Litke 2021-08-23 20:25:14 UTC
Niels, could this be an issue with exceeding the maximum allowed number of snapshot clones?  Is this a condition that recovers automatically?  If this is an OCS issue, does the storage surface some sort of alert while it's happening?

Comment 3 Jenifer Abrams 2021-08-26 16:04:30 UTC
This happened during a period on the cluster where we saw huge stalls once the OCS snapshot limit was reached (which should be recoverable), and we could not reproduce it again later -- covered in BZ1976936.  I guess the question is: should CNV be reporting some VM progress state while it is waiting on a DV/PVC state?

Comment 7 Niels de Vos 2021-09-28 11:23:02 UTC
(In reply to Adam Litke from comment #2)
> Niels, could this be an issue with exceeding the maximum allowed number of
> snapshot clones?  Is this a condition that recovers automatically?  If this
> is an OCS issue, does the storage surface some sort of alert while it's
> happening?

There is a limit on the number of snapshots an image can have before they get flattened. While flattening (copying data so the snapshot becomes self-contained, independent of its parent) is happening, I/O might peak. This is a background operation run by the Ceph MGR component. I think there is also a limit on how many flatten operations run concurrently, which can cause delays while preparing new snapshots.
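
A rough way to watch this from the storage side during the next scale run (a sketch, assuming the default ODF toolbox deployment and pool names; the image name is a placeholder):

  # open a shell in the rook-ceph toolbox
  oc -n openshift-storage rsh deploy/rook-ceph-tools

  # background flatten tasks scheduled through the Ceph MGR
  ceph rbd task list

  # snapshot count on a given RBD image (flattening kicks in once this grows large)
  rbd snap ls ocs-storagecluster-cephblockpool/<image-name>

The per-image limit itself is, if I recall correctly, a cephcsi provisioner flag (--maxsnapshotsonimage) introduced by the ceph-csi PR linked above.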

Comment 8 Adam Litke 2021-10-05 12:10:43 UTC
Thanks Niels.  Can you check if there is an alert that fires while this process is ongoing?  If so, I would be satisfied that the system is working as designed with an appropriate amount of transparency to affected users.

Comment 11 Adam Litke 2022-03-09 12:24:13 UTC
Hi Madhu,

In our scale tests we are triggering the rbd snapshot flattening process.  During flattening there is a significant delay in processing additional snapshot requests.  Do you know if ODF is raising an alert while this process happens so that a cluster user or admin would have visibility into what is causing the delays?

Comment 12 Madhu Rajanna 2022-03-09 13:03:24 UTC
Hi Adam,

AFAIK there is no alert raised, as the flattening is done as an internal operation during CreateVolume/CreateSnapshot at the cephcsi level. CSI cannot talk to Kubernetes to raise any alerts in this case.
If I remember correctly, this will be logged in the PVC events.
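
For example, something along these lines should show them on the cloned PVC (name and namespace are placeholders):

  oc -n <namespace> describe pvc <cloned-pvc-name>

  # or filter the events directly
  oc -n <namespace> get events \
    --field-selector involvedObject.kind=PersistentVolumeClaim,involvedObject.name=<cloned-pvc-name>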

Comment 13 Adam Litke 2022-03-09 15:13:57 UTC
Madhu,

Thanks for providing this information.  This seems like an important thing to alert about, since it blocks progress for a long period of time.  I am sure there is a way to detect this condition (whether at the CSI level or the Rook/Ceph level) and report it upwards via an alert.  I created https://bugzilla.redhat.com/show_bug.cgi?id=2062339 to track our request to add an alert for this condition.
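
As a quick check, the absence of such a rule on a given cluster can be confirmed with something like this (a sketch, assuming the default openshift-storage namespace):

  oc -n openshift-storage get prometheusrules -o yaml | grep -i flatten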

Comment 14 Yan Du 2022-06-22 12:24:54 UTC
This is tracked by development in RHSTOR-3276, so closing this bug.

