Bug 1972264 - VMs cloned from a DV on a scaled environment became "zombies".
Summary: VMs cloned from a DV on a scaled environment became "zombies".
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 2.6.4
Hardware: x86_64
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Adam Litke
QA Contact: Natalie Gavrielov
URL:
Whiteboard:
Depends On: 2062339
Blocks:
 
Reported: 2021-06-15 14:32 UTC by Boaz
Modified: 2022-09-28 15:47 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-06-22 12:24:54 UTC
Target Upstream Version:
Embargoed:


Links
  GitHub ceph/ceph-csi pull 1195 (Merged): "rbd: Handle maximum snapshots on a single rbd image" (last updated 2022-03-09 12:25:28 UTC)
  Red Hat Issue Tracker RHSTOR-3276 (last updated 2022-06-22 12:23:10 UTC)

Comment 1 Jenifer Abrams 2021-06-16 17:13:35 UTC
From the event logs this appeared to be an issue with snapshot availability; the original set of DVs eventually recovered ~5hrs later.
Boaz created another 200 DVs (clones), and this time many got stuck in the "SnapshotForSmartCloneInProgress" state for 7hrs+, with many of these event errors:

37m         Warning   SnapshotContentCheckandUpdateFailed   volumesnapshotcontent/snapcontent-02d52864-cab3-4dab-bcb7-19e0b4cc19c9   Failed to check and update snapshot content: failed to take snapshot of the volume, 0001-0011-openshift-storage-0000000000000003-9ca44b7f-cd3c-11eb-a625-0a580a82050f: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
45m         Warning   SnapshotContentCheckandUpdateFailed   volumesnapshotcontent/snapcontent-02e58a9d-25a2-4b21-8dea-da77b072e020   Failed to check and update snapshot content: failed to take snapshot of the volume, 0001-0011-openshift-storage-0000000000000003-9ca44b7f-cd3c-11eb-a625-0a580a82050f: "rpc error: code = Aborted desc = an operation with the given Volume ID snapshot-02e58a9d-25a2-4b21-8dea-da77b072e020 already exists"
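
For reference, a minimal sketch of how to survey this state on a similar cluster (apart from the snapcontent name taken from the first event above, resource names and namespaces are placeholders, not from this environment):

  # DataVolumes still waiting on the smart-clone snapshot
  oc get dv -A | grep SnapshotForSmartCloneInProgress

  # All snapshot-controller failures of the kind shown above
  oc get events -A --field-selector reason=SnapshotContentCheckandUpdateFailed

  # Details of one stuck VolumeSnapshotContent, e.g. the first one above
  oc describe volumesnapshotcontent snapcontent-02d52864-cab3-4dab-bcb7-19e0b4cc19c9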

Comment 2 Adam Litke 2021-08-23 20:25:14 UTC
Niels, could this be an issue with exceeding the maximum allowed number of snapshot clones?  Is this a condition that recovers automatically?  If this is an OCS issue, does the storage surface some sort of alert while it's happening?

Comment 3 Jenifer Abrams 2021-08-26 16:04:30 UTC
This happened during a period on the cluster where we saw huge stalls once the OCS snapshot limit was reached (which should be recoverable), and we could not reproduce it again later -- covered in BZ1976936.  I guess the question is: should CNV be reporting some VM progress state while it is waiting on a DV/PVC state?

Comment 7 Niels de Vos 2021-09-28 11:23:02 UTC
(In reply to Adam Litke from comment #2)
> Niels, could this be an issue with exceeding the maximum allowed number of
> snapshot clones?  Is this a condition that recovers automatically?  If this
> is an OCS issue, does the storage surface some sort of alert while it's
> happening?

There is a limit on the number of snapshots an image can have before they get flattened. While flattening (copying data so the snapshot becomes self-contained, independent of its parent) is happening, I/O might peak. This is a background operation run by the Ceph MGR component. I think there is also a limit on how many flatten operations run concurrently, which can cause delays while preparing new snapshots.
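
A rough way to watch this from the storage side during the next scale run (a sketch, assuming the default ODF toolbox deployment and pool names; the image name is a placeholder):

  # open a shell in the rook-ceph toolbox
  oc -n openshift-storage rsh deploy/rook-ceph-tools

  # background flatten tasks scheduled through the Ceph MGR
  ceph rbd task list

  # snapshot count on a given RBD image (flattening kicks in once this grows large)
  rbd snap ls ocs-storagecluster-cephblockpool/<image-name>

The per-image limit itself is, if I recall correctly, a cephcsi provisioner flag (--maxsnapshotsonimage) introduced by the ceph-csi PR linked above.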

Comment 8 Adam Litke 2021-10-05 12:10:43 UTC
Thanks Niels.  Can you check if there is an alert that fires while this process is ongoing?  If so, I would be satisfied that the system is working as designed with an appropriate amount of transparency to affected users.

Comment 11 Adam Litke 2022-03-09 12:24:13 UTC
Hi Madhu,

In our scale tests we are triggering the rbd snapshot flattening process.  During flattening there is a significant delay in processing additional snapshot requests.  Do you know if ODF is raising an alert while this process happens so that a cluster user or admin would have visibility into what is causing the delays?

Comment 12 Madhu Rajanna 2022-03-09 13:03:24 UTC
Hi Adam,

AFAIK there is no alert raised, as the flattening is done as an internal operation during CreateVolume/CreateSnapshot at the cephcsi level. CSI cannot talk to Kubernetes to raise any alerts in this case.
If I remember correctly, this will be logged in the PVC events.
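
For example, something along these lines should show them on the cloned PVC (name and namespace are placeholders):

  oc -n <namespace> describe pvc <cloned-pvc-name>

  # or filter the events directly
  oc -n <namespace> get events \
    --field-selector involvedObject.kind=PersistentVolumeClaim,involvedObject.name=<cloned-pvc-name>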

Comment 13 Adam Litke 2022-03-09 15:13:57 UTC
Madhu,

Thanks for providing this information.  This seems like an important thing to alert about, since it blocks progress for a long period of time.  I am sure there is a way to detect this condition (whether at the CSI level or the Rook/Ceph level) and report it upwards via an alert.  I created https://bugzilla.redhat.com/show_bug.cgi?id=2062339 to track our request to add an alert for this condition.
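
As a quick check, the absence of such a rule on a given cluster can be confirmed with something like this (a sketch, assuming the default openshift-storage namespace):

  oc -n openshift-storage get prometheusrules -o yaml | grep -i flatten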

Comment 14 Yan Du 2022-06-22 12:24:54 UTC
This is tracked by development in RHSTOR-3276, so closing this bug.

