Bug 2235753
| Summary: | Support snapshot crash consistency across clients | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Greg Farnum <gfarnum> |
| Component: | CephFS | Assignee: | Patrick Donnelly <pdonnell> |
| Status: | CLOSED ERRATA | QA Contact: | sumr |
| Severity: | high | Docs Contact: | Akash Raj <akraj> |
| Priority: | high | | |
| Version: | 6.0 | CC: | akraj, amk, ceph-eng-bugs, cephqe-warriors, hyelloji, lusov, ngangadh, pdonnell, tserlin |
| Target Milestone: | --- | | |
| Target Release: | 7.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ceph-18.2.1-58.el9cp | Doc Type: | Enhancement |
| Doc Text: |
.CephFS supports quiescing of subvolumes or directory trees
Previously, multiple clients would interleave reads and writes across a consistent snapshot barrier when out-of-band communication existed between the clients. This communication led clients to wrongly believe that they had reached a checkpoint that was mutually recoverable via a snapshot.
With this enhancement, CephFS supports quiescing of subvolumes or directory trees to enable the execution of crash-consistent snapshots. Clients are now forced to quiesce all I/O before the MDS executes the snapshot, which enforces a checkpoint across all clients of the subtree.
|
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-06-13 14:20:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2267614, 2298578, 2298579 | | |
Description
Greg Farnum
2023-08-29 15:46:15 UTC
Our approach to the problem is based on the observation that the consistency of our snapshots boils down to whether clients are allowed to perform write I/Os while snapshots are being taken. After analyzing the requirements we concluded that we should create and expose a new API to manage a pause of I/O to a set of file system paths. We refer to this as a "quiesce set" database.

By exposing such an API we cover a variety of enterprise use cases. Given an active pause, one can schedule any maintenance that should appear atomic to the enterprise applications operating within the quiesced roots. This enables consistent FS snapshots by scheduling them during the pause, and it also makes it possible to achieve consistent snapshots across the FS and RBD volumes: run the RBD snapshot(s) while an FS pause is active, then run the FS snapshots before releasing the pause.

As Greg suggested initially, our first implementation of the pause involves revoking write capabilities from the clients. This approach is backward compatible with all existing clients, but it carries the overhead of the caps ping-pong and redundant write cache flushes. NB: applications are required to issue flushes if they want crash-consistency guarantees from the system, so the latter overhead is not a complete waste, but since it is asynchronous to the application it will have some performance impact in the general case. Another drawback is that the MDS servers will have to deal with the added pressure of pending I/Os from clients trying to claim the capabilities back while the pause is active.

We also considered a new client quiesce protocol that would avoid all of these overheads by implementing the pause on the client side. This would require client-side changes and hence will have to go into a later release, subject to future planning.

The overall design is detailed in this slide deck: https://docs.google.com/presentation/d/1wE3-e9AAme7Q3qmeshUSthJoQGw7-fKTrtS9PsdAIVo/edit#slide=id.p

Ongoing work is tracked by the subtasks of the feature ticket: https://tracker.ceph.com/issues/63663

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925
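To make the pause-then-snapshot ordering described above concrete, here is a minimal orchestration sketch. It assumes the upstream `ceph fs quiesce` command and flags (`--set-id`, `--timeout`, `--expiration`, `--await`, `--release`) developed under the feature ticket linked above; the volume, subvolume, image, and snapshot names are hypothetical placeholders, and exact CLI syntax may differ between releases.

```python
#!/usr/bin/env python3
# Sketch: crash-consistent snapshots across CephFS subvolumes and an RBD image
# by taking all snapshots inside an active quiesce ("pause") window.
# Assumes the upstream `ceph fs quiesce` CLI and its flags (--set-id, --timeout,
# --expiration, --await, --release); vol1, sub1, sub2, rbdpool/app-data and the
# snapshot/set names are hypothetical placeholders.
import subprocess

def run(*cmd: str) -> None:
    """Run a CLI command, raising on failure."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

VOL = "vol1"                      # CephFS volume (file system) name
SUBVOLS = ["sub1", "sub2"]        # quiesce roots: the subvolumes to pause
RBD_IMAGE = "rbdpool/app-data"    # RBD image snapshotted inside the same pause
SET_ID = "nightly-backup"         # quiesce-set identifier
SNAP = "consistent-2024-06-13"    # snapshot name used for both FS and RBD

# 1. Quiesce: block write I/O on all members of the set before proceeding.
#    --await returns only once every root reports quiesced (or the timeout hits).
run("ceph", "fs", "quiesce", VOL, *SUBVOLS,
    "--set-id", SET_ID, "--timeout", "120", "--expiration", "600", "--await")

try:
    # 2. With the pause active, the RBD and CephFS snapshots observe the same
    #    application-level checkpoint.
    run("rbd", "snap", "create", f"{RBD_IMAGE}@{SNAP}")
    for sub in SUBVOLS:
        run("ceph", "fs", "subvolume", "snapshot", "create", VOL, sub, SNAP)
finally:
    # 3. Release the quiesce set so clients may resume writes.
    run("ceph", "fs", "quiesce", VOL, "--set-id", SET_ID, "--release", "--await")
```

The `finally` block reflects that the pause must be explicitly released; the `--expiration` value is there to bound how long clients can stay paused if the orchestrator dies before releasing the set.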