Bug 2369721 - group replayer can erroneously delete and later re-create non-primary mirror snapshots
Summary: group replayer can erroneously delete and later re-create non-primary mirror ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD-Mirror
Version: 8.1
Hardware: Unspecified
OS: Unspecified
unspecified
urgent
Target Milestone: ---
: 8.1z2
Assignee: Prasanna Kumar Kalever
QA Contact: Chaitanya
Rivka Pollack
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2025-06-02 08:49 UTC by Ilya Dryomov
Modified: 2025-08-18 14:02 UTC (History)
6 users (show)

Fixed In Version: ceph-19.2.1-240.el9cp
Doc Type: Bug Fix
Doc Text:
.Group replayer now correctly retains previous mirror group snapshots Previously, the group replayer mistakenly pruned the previous mirror group snapshot when a user-created group snapshot was present, treating it as the previous snapshot. Because user snapshots do not include mirror image snapshots, this caused unnecessary deletion and subsequent recreation of mirror snapshots. With this fix, the group replayer always retains the previous mirror group snapshot until the current one completes syncing, regardless of the presence of user snapshots. This change prevents incorrect pruning, avoids redundant snapshot recreation, and ensures smoother and more efficient mirroring.
Clone Of:
Environment:
Last Closed: 2025-08-18 14:02:25 UTC
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-11510 0 None None None 2025-06-02 08:51:03 UTC
Red Hat Product Errata RHBA-2025:14015 0 None None None 2025-08-18 14:02:28 UTC

Description Ilya Dryomov 2025-06-02 08:49:44 UTC
Description of problem:

Nithya was looking into something else when she ran these steps:

1. Created a group with 2 images, enabled it for mirroring and waited until it synced to the secondary.
2. Created 3 user group snaps snap1, snap2, snap3 and then created a mirror snapshot on the primary.
3. Listed the group and image snaps for 1 image on the secondary every second:

Fri May 16 12:07:45 PM IST 2025
ID            NAME                                                                   STATE       NAMESPACE                                                                           
106376cf19a1  .mirror.non-primary.37f42794-30d3-48ef-a9f4-adb1c35f7ac8.106376cf19a1    complete  mirror (non-primary peer_uuids:[] f4822494-dcad-4052-968d-06ece8573023:106376cf19a1)
1065ecbf3075  snap1                                                                  incomplete  user                                                                                

Image snaps
:SNAPID  NAME                                                                                  SIZE    PROTECTED  TIMESTAMP                 NAMESPACE                                                                        
    15  .mirror.non_primary.079c83d3-f96e-48ca-8bfe-c733f74301e9.2_10358a37e370_106376cf19a1  12 MiB             Fri May 16 12:03:45 2025  mirror (non-primary peer_uuids:[] f4822494-dcad-4052-968d-06ece8573023:13 copied)
------------------------------------------

####### The second user group snap snap2 is created 5 s later.

Fri May 16 12:07:50 PM IST 2025
ID            NAME                                                                   STATE       NAMESPACE                                                                           
106376cf19a1  .mirror.non-primary.37f42794-30d3-48ef-a9f4-adb1c35f7ac8.106376cf19a1    complete  mirror (non-primary peer_uuids:[] f4822494-dcad-4052-968d-06ece8573023:106376cf19a1)
1065ecbf3075  snap1                                                                  incomplete  user                                                                                
106793333556  snap2                                                                  incomplete  user                                                                                

Image snaps
:SNAPID  NAME                                                                                  SIZE    PROTECTED  TIMESTAMP                 NAMESPACE                                                                        
    15  .mirror.non_primary.079c83d3-f96e-48ca-8bfe-c733f74301e9.2_10358a37e370_106376cf19a1  12 MiB             Fri May 16 12:03:45 2025  mirror (non-primary peer_uuids:[] f4822494-dcad-4052-968d-06ece8573023:13 copied)
------------------------------------------

####### The third user group snap snap3 is created 5 s later.

Fri May 16 12:07:56 PM IST 2025
ID            NAME                                                                   STATE       NAMESPACE                                                                           
106376cf19a1  .mirror.non-primary.37f42794-30d3-48ef-a9f4-adb1c35f7ac8.106376cf19a1    complete  mirror (non-primary peer_uuids:[] f4822494-dcad-4052-968d-06ece8573023:106376cf19a1)
1065ecbf3075  snap1                                                                  incomplete  user                                                                                
106793333556  snap2                                                                  incomplete  user                                                                                
10697c05fdec  snap3                                                                  incomplete  user                                                                                

Image snaps
:SNAPID  NAME                                                                                  SIZE    PROTECTED  TIMESTAMP                 NAMESPACE                                                                        
    15  .mirror.non_primary.079c83d3-f96e-48ca-8bfe-c733f74301e9.2_10358a37e370_106376cf19a1  12 MiB             Fri May 16 12:03:45 2025  mirror (non-primary peer_uuids:[] f4822494-dcad-4052-968d-06ece8573023:13 copied)
------------------------------------------

### Finally the incomplete mirror snap is created after another 5s. And now a new Image mirror snap is finally created and starts syncing

Fri May 16 12:08:01 PM IST 2025
ID            NAME                                                                   STATE       NAMESPACE                                                                           
106376cf19a1  .mirror.non-primary.37f42794-30d3-48ef-a9f4-adb1c35f7ac8.106376cf19a1    complete  mirror (non-primary peer_uuids:[] f4822494-dcad-4052-968d-06ece8573023:106376cf19a1)
1065ecbf3075  snap1                                                                  incomplete  user                                                                                
106793333556  snap2                                                                  incomplete  user                                                                                
10697c05fdec  snap3                                                                  incomplete  user                                                                                
106bc744cc77  .mirror.non-primary.37f42794-30d3-48ef-a9f4-adb1c35f7ac8.106bc744cc77  incomplete  mirror (non-primary peer_uuids:[] f4822494-dcad-4052-968d-06ece8573023:106bc744cc77)

Image snaps
:SNAPID  NAME                                                                                  SIZE    PROTECTED  TIMESTAMP                 NAMESPACE                                                                            
    18  .mirror.non_primary.079c83d3-f96e-48ca-8bfe-c733f74301e9.2_10358a37e370_106376cf19a1  12 MiB             Fri May 16 12:08:01 2025  mirror (non-primary peer_uuids:[] f4822494-dcad-4052-968d-06ece8573023:13 33% copied)
------------------------------------------

Initial observations:

This is something really weird going in there: snap 13 was already synced to the secondary, attached to snap 15.  It looks like the act of syncing a mirror snapshot that logically contains user snapshots removed the already synced mirror snapshot only to recreate it a few moments later?

Notice

15  .mirror.non_primary.079c83d3-f96e-48ca-8bfe-c733f74301e9.2_10358a37e370_106376cf19a1  12 MiB             Fri May 16 12:03:45 2025  mirror (non-primary peer_uuids:[] f4822494-dcad-4052-968d-06ece8573023:13 copied)

and then in the next output

18  .mirror.non_primary.079c83d3-f96e-48ca-8bfe-c733f74301e9.2_10358a37e370_106376cf19a1  12 MiB             Fri May 16 12:08:01 2025  mirror (non-primary peer_uuids:[] f4822494-dcad-4052-968d-06ece8573023:13 33% copied)

Two different non-primary snapshots (15 and 18) for the same group snapshot (106376cf19a1) attached to the same primary snapshot (13) at different points in time

Relevant log snippets:

2025-05-16T12:08:00.662+0530 7f4ca01cd6c0 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x561656b38000 load_local_image_meta: 
2025-05-16T12:08:00.662+0530 7f4ca01cd6c0 15 librbd::mirror::snapshot::ImageMeta: 0x561656365ea0 load: oid=rbd_header.10352cde925c, key=.rbd_mirror.71ae63ed-8eb1-48e5-88a0-40c462d34478
2025-05-16T12:08:00.664+0530 7f4c9a1c16c0 15 librbd::mirror::snapshot::ImageMeta: 0x561656365ea0 handle_load: r=-2
2025-05-16T12:08:00.664+0530 7f4c9a1c16c0 15 librbd::mirror::snapshot::ImageMeta: 0x561656365ea0 handle_load: no snapshot-based mirroring image-meta: (2) No such file or directory
2025-05-16T12:08:00.664+0530 7f4c9a1c16c0 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x561656b38000 handle_load_local_image_meta: r=-2
2025-05-16T12:08:00.664+0530 7f4c9a1c16c0 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x561656b38000 scan_local_mirror_snapshots: 
2025-05-16T12:08:00.664+0530 7f4c9a1c16c0 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x561656b38000 scan_local_mirror_snapshots: local mirror snapshot: id=15, mirror_ns=[mirror state=non-primary, complete=1, mirror_peer_uuids=, primary_mirror_uuid=f4822494-dcad-4052-968d-06ece8573023, primary_snap_id=d, last_copied_object_number=3, snap_seqs={13=18446744073709551614}, group_spec={pool_id=2, group_id=10358a37e370}, group_snap_id=106376cf19a1]
2025-05-16T12:08:00.664+0530 7f4c9a1c16c0  5 rbd::mirror::image_replayer::snapshot::Replayer: 0x561656b38000 scan_local_mirror_snapshots: pruning unused non-primary snapshot 15
2025-05-16T12:08:00.664+0530 7f4c9a1c16c0 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x561656b38000 prune_non_primary_snapshot: snap_id=15
2025-05-16T12:08:00.664+0530 7f4c9a1c16c0 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x561656b38000 prune_non_primary_snapshot: removing from group list snap_id=15
2025-05-16T12:08:00.664+0530 7f4c9a1c16c0  5 librbd::Operations: 0x561656063790 snap_remove: snap_name=.mirror.non_primary.079c83d3-f96e-48ca-8bfe-c733f74301e9.2_10358a37e370_106376cf19a1

The image snap with id 15 is being pruned even though it is the only local image mirror snap.  The group replayer is pruning these here:

2025-05-16T12:07:55.627+0530 7f4c999c06c0 10 librbd::group::ListSnapshotsRequest: 0x561656bb0b60 handle_list_snaps: r=0
2025-05-16T12:07:55.628+0530 7f4c999c06c0 10 librbd::group::ListSnapshotsRequest: 0x561656bb0b60 sort_snaps: 
2025-05-16T12:07:55.628+0530 7f4c999c06c0 10 librbd::group::ListSnapshotsRequest: 0x561656bb0b60 finish: r=0
2025-05-16T12:07:55.628+0530 7f4c999c06c0 10 rbd::mirror::group_replayer::Replayer: 0x561656177680 handle_load_local_group_snapshots: r=0
2025-05-16T12:07:55.628+0530 7f4c999c06c0 10 rbd::mirror::group_replayer::Replayer: 0x561656177680 load_remote_group_snapshots: m_remote_group_id=1059ed525071
2025-05-16T12:07:55.630+0530 7f4c931b36c0 10 rbd::mirror::group_replayer::Replayer: 0x561656177680 handle_load_remote_group_snapshots: r=0
2025-05-16T12:07:55.630+0530 7f4c931b36c0 10 rbd::mirror::group_replayer::Replayer: 0x561656177680 is_resync_requested: m_local_group_id=10358a37e370
2025-05-16T12:07:55.631+0530 7f4c931b36c0 10 rbd::mirror::group_replayer::Replayer: 0x561656177680 prune_all_image_snapshots: attempting to unlink image snaps from group snap: 106376cf19a1
2025-05-16T12:07:55.632+0530 7f4c931b36c0 10 rbd::mirror::group_replayer::Replayer: 0x561656177680 prune_all_image_snapshots: pruning: f
2025-05-16T12:07:55.633+0530 7f4c931b36c0 10 rbd::mirror::group_replayer::Replayer: 0x561656177680 prune_all_image_snapshots: pruning: 10
2025-05-16T12:07:55.633+0530 7f4c931b36c0 10 rbd::mirror::group_replayer::Replayer: 0x561656177680 scan_for_unsynced_group_snapshots: 
2025-05-16T12:07:55.633+0530 7f4c931b36c0 10 rbd::mirror::group_replayer::Replayer: 0x561656177680 try_create_group_snapshot: 10697c05fdec

How reproducible:

Always (later reproduced by myself and Prasanna)

Comment 7 errata-xmlrpc 2025-08-18 14:02:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 8.1 security and bug fix updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2025:14015


Note You need to log in before you can comment on or make changes to this bug.