Right now, when you create a CephFS snapshot, the MDS simply sends all affected clients a message saying the snapshot now exists. This is incredibly fast, but does not maintain crash consistency if multiple clients are accessing the snapshotted tree. We need to build a mechanism which lets us take crash-consistent snapshots. I suspect the simplest mechanism will be to just revoke all client exclusive and write caps, then take the snapshot, and then let clients take back whatever caps they want. It has the huge advantage of not requiring any client updates. But we'll need to see how much work that is to implement.
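For illustration, a minimal sketch of that sequence is below; the `mds` object and its methods (`revoke_write_caps`, `mksnap`, `reissue_caps`) are hypothetical stand-ins for MDS-internal operations, not real Ceph interfaces — only the revoke → snapshot → reissue ordering comes from the comment above.

```python
# Hypothetical sketch of the cap-revocation approach; none of these names
# are actual Ceph APIs.

def crash_consistent_snapshot(mds, tree_root, snap_name):
    """Snapshot `tree_root` so the result is consistent across all clients."""
    # 1. Revoke exclusive/write caps on the subtree; clients flush any dirty
    #    data while returning their caps, so no further writes can land.
    revoked = mds.revoke_write_caps(tree_root)
    try:
        # 2. With writes paused, the snapshot captures a single point-in-time
        #    view that every client agrees on.
        mds.mksnap(tree_root, snap_name)
    finally:
        # 3. Let clients take back whatever caps they want; normal IO resumes.
        mds.reissue_caps(tree_root, revoked)
```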
Our take on the problem is that snapshot consistency boils down to whether clients are allowed to perform write IO while a snapshot is being taken. After analyzing the requirements, we concluded that we should create and expose a new API that manages a pause of IO to a set of file system paths; we refer to this as a "quiesce set" database. Exposing such an API covers a variety of enterprise use cases: given an active pause, one can schedule any maintenance that should appear atomic to the applications operating within the quiesced roots. This enables consistent FS snapshots by scheduling them during the pause, and it also makes it possible to take snapshots that are consistent across the FS and RBD volumes, by running the RBD snapshot(s) while an FS pause is active and then running the FS snapshots before releasing the pause.

As Greg suggested initially, our first implementation of the pause revokes write capabilities from the clients. This approach is backward compatible with all existing clients, but it carries the overhead of the capability ping-pong and of redundant write-cache flushes. NB: applications are required to issue flushes if they want crash-consistency guarantees from the system, so this latter overhead is not a complete waste; still, since it is asynchronous to the application, it will have some performance impact in the general case. Another drawback is that the MDS servers will have to absorb the added pressure of pending IOs as clients try to claim the capabilities back while the pause is active.

We also considered a new client quiesce protocol that would avoid all of these overheads by implementing the pause on the client side. That will require client-side changes and hence will have to go into a later release, subject to future planning.

The overall design is detailed in this slide deck: https://docs.google.com/presentation/d/1wE3-e9AAme7Q3qmeshUSthJoQGw7-fKTrtS9PsdAIVo/edit#slide=id.p
Ongoing work is tracked by the subtasks of the feature ticket: https://tracker.ceph.com/issues/63663
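To make the intended workflow concrete, here is a sketch of a cross-volume orchestration built on the pause. The `fs_client`/`rbd_image` objects and the `quiesce()`, `quiesce_release()`, and `mksnap()` helpers are hypothetical placeholders rather than the final API shape; only the sequence (pause → RBD snapshot(s) → FS snapshots → release) is taken from the description above.

```python
# Sketch, under assumptions, of driving the proposed pause to get snapshots
# that are consistent across CephFS roots and an RBD image. The quiesce
# helpers are hypothetical placeholders for the new API.

import uuid

def consistent_cross_volume_snapshot(fs_client, rbd_image, roots, snap_name):
    # 1. Create a quiesce set covering the FS roots and wait until write IO
    #    within those roots is fully paused.
    set_id = str(uuid.uuid4())
    fs_client.quiesce(roots, set_id=set_id, await_quiesced=True)
    try:
        # 2. While the pause is active, snapshots appear atomic to the
        #    applications operating within the quiesced roots.
        rbd_image.create_snap(snap_name)       # RBD snapshot(s) first ...
        for root in roots:
            fs_client.mksnap(root, snap_name)  # ... then the FS snapshots.
    finally:
        # 3. Release the quiesce set so clients can resume write IO.
        fs_client.quiesce_release(set_id)
```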
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:3925