Right now, when you create a CephFS snapshot, the MDS simply sends all affected clients a message saying the snapshot now exists. This is incredibly fast, but does not maintain crash consistency if multiple clients are accessing the snapshotted tree. We need to build a mechanism which lets us take crash-consistent snapshots. I suspect the simplest mechanism will be to just revoke all client exclusive and write caps, then take the snapshot, and then let clients take back whatever caps they want. It has the huge advantage of not requiring any client updates. But we'll need to see how much work that is to implement.
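For illustration, a minimal sketch of that sequence is below; the `mds` object and its methods (`revoke_write_caps`, `mksnap`, `reissue_caps`) are hypothetical stand-ins for MDS-internal operations, not real Ceph interfaces — only the revoke → snapshot → reissue ordering comes from the comment above.

```python
# Hypothetical sketch of the cap-revocation approach; none of these names
# are actual Ceph APIs.

def crash_consistent_snapshot(mds, tree_root, snap_name):
    """Snapshot `tree_root` so the result is consistent across all clients."""
    # 1. Revoke exclusive/write caps on the subtree; clients flush any dirty
    #    data while returning their caps, so no further writes can land.
    revoked = mds.revoke_write_caps(tree_root)
    try:
        # 2. With writes paused, the snapshot captures a single point-in-time
        #    view that every client agrees on.
        mds.mksnap(tree_root, snap_name)
    finally:
        # 3. Let clients take back whatever caps they want; normal IO resumes.
        mds.reissue_caps(tree_root, revoked)
```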
Our take on the problem is that snapshot consistency boils down to whether clients are allowed to perform write IO while a snapshot is being taken. After analyzing the requirements, we concluded that we should create and expose a new API that manages a pause of IO to a set of file system paths; we refer to this as a "quiesce set" database. Exposing such an API covers a variety of enterprise use cases: given an active pause, one can schedule any maintenance that should appear atomic to the applications operating within the quiesced roots. This enables consistent FS snapshots by scheduling them during the pause, and it also makes it possible to take snapshots that are consistent across the FS and RBD volumes, by running the RBD snapshot(s) while an FS pause is active and then running the FS snapshots before releasing the pause.

As Greg suggested initially, our first implementation of the pause revokes write capabilities from the clients. This approach is backward compatible with all existing clients, but it carries the overhead of the capability ping-pong and of redundant write-cache flushes. NB: applications are required to issue flushes if they want crash-consistency guarantees from the system, so this latter overhead is not a complete waste; still, since it is asynchronous to the application, it will have some performance impact in the general case. Another drawback is that the MDS servers will have to absorb the added pressure of pending IOs as clients try to claim the capabilities back while the pause is active.

We also considered a new client quiesce protocol that would avoid all of these overheads by implementing the pause on the client side. That will require client-side changes and hence will have to go into a later release, subject to future planning.

The overall design is detailed in this slide deck: https://docs.google.com/presentation/d/1wE3-e9AAme7Q3qmeshUSthJoQGw7-fKTrtS9PsdAIVo/edit#slide=id.p
Ongoing work is tracked by the subtasks of the feature ticket: https://tracker.ceph.com/issues/63663
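To make the intended workflow concrete, here is a sketch of a cross-volume orchestration built on the pause. The `fs_client`/`rbd_image` objects and the `quiesce()`, `quiesce_release()`, and `mksnap()` helpers are hypothetical placeholders rather than the final API shape; only the sequence (pause → RBD snapshot(s) → FS snapshots → release) is taken from the description above.

```python
# Sketch, under assumptions, of driving the proposed pause to get snapshots
# that are consistent across CephFS roots and an RBD image. The quiesce
# helpers are hypothetical placeholders for the new API.

import uuid

def consistent_cross_volume_snapshot(fs_client, rbd_image, roots, snap_name):
    # 1. Create a quiesce set covering the FS roots and wait until write IO
    #    within those roots is fully paused.
    set_id = str(uuid.uuid4())
    fs_client.quiesce(roots, set_id=set_id, await_quiesced=True)
    try:
        # 2. While the pause is active, snapshots appear atomic to the
        #    applications operating within the quiesced roots.
        rbd_image.create_snap(snap_name)       # RBD snapshot(s) first ...
        for root in roots:
            fs_client.mksnap(root, snap_name)  # ... then the FS snapshots.
    finally:
        # 3. Release the quiesce set so clients can resume write IO.
        fs_client.quiesce_release(set_id)
```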
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:3925