Bug 2211290 - [rbd-mirror] : snapshot schedules stopped : possibly due to hang in MirrorSnapshotScheduleHandler.shutdown which could be in wait_for_pending()
Summary: [rbd-mirror] : snapshot schedules stopped : possibly due to hang in MirrorSna...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD-Mirror
Version: 6.1
Hardware: Unspecified
OS: Unspecified
unspecified
urgent
Target Milestone: ---
: 6.1z2
Assignee: Ram Raja
QA Contact: Sunil Angadi
Akash Raj
URL:
Whiteboard:
Depends On:
Blocks: 2067095 2235257
TreeView+ depends on / blocked
 
Reported: 2023-05-31 04:04 UTC by Vasishta
Modified: 2023-11-03 04:01 UTC (History)
11 users (show)

Fixed In Version: ceph-17.2.6-121.el9cp
Doc Type: Bug Fix
Doc Text:
.The librbd client correctly propagates the block-listing error to the caller Previously, when the `rbd_support` module's RADOS client was block-listed, the module's `mirror_snapshot_schedule` handler would not always shut down correctly. The handler's `librbd` client would not propagate the block-list error, thereby stalling the handler's shutdown. This lead to the failures of the `mirror_snapshot_schedule` handler and the `rbd_support` module to automatically recover from repeated client block-listing. The `rbd_support` module stopped scheduling mirror snapshots after its client was repeatedly block-listed. With this fix, the race in the `librbd` client between its exclusive lock acquisition and handling of block-listing is fixed. This allows the `librbd` client to propagate the block-listing error correctly to the caller, for example, the `mirror_snapshot_schedule` handler, while waiting to acquire an exclusive lock. The `mirror_snapshot_schedule` handler and the `rbd_support_module` automatically recover from repeated client block-listing.
Clone Of:
: 2237303 (view as bug list)
Environment:
Last Closed: 2023-10-12 16:34:26 UTC
Embargoed:


Attachments (Terms of Use)
mgr logs around when rbd_support module recovery failed on client blocklisting (541.86 KB, text/plain)
2023-06-06 12:52 UTC, Ram Raja
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 61607 0 None None None 2023-06-06 20:39:44 UTC
Red Hat Issue Tracker RHCEPH-6770 0 None None None 2023-05-31 04:05:11 UTC
Red Hat Product Errata RHSA-2023:5693 0 None None None 2023-10-12 16:36:17 UTC

Description Vasishta 2023-05-31 04:04:14 UTC
Description of problem:
Configured two clusters with two way snapshot based mirroring with both clusters hosting primary images with delay in between the clusters simulated using qdisc.
Images were observed for days for snapshot scheduling with multiple build to build upgrades.

While testing blocklist recovery fix,tried blocklisting for a an hour in loop.
Later observed that the snapshot schedule had stopped for all the primary images in one of the clusters.

Upon observation, it appeared like client blocklist had got recovered but there seemed  MirrorSnapshotScheduleHandler.shutdown might have got hung possibly in wait_for_pending()

Version-Release number of selected component (if applicable):
 17.2.6-69

How reproducible:
Tried once on 22 images, snapshot scheduling has got affected on all images as handler might have got affected.

Steps to Reproduce:
Configure snapshot based mirroring at image level, add rbd_support client blocklisting in loop and observe snapshot schedule.

Actual results:
Snapshot schedule got stuck stalling mirroring

Expected results:
Snapshot scheduling must be intact

Additional info:

Comment 3 Ram Raja 2023-06-01 13:35:44 UTC
Vasishta's test: 22 images were scheduled to have their mirror-snapshots created once every 3 minutes. rbd_support module's rados client was added to the blocklist every 5 seconds.

Observation:
Close to 50 minutes into this test, the rbd_support module failed to recover from  blocklisting. The rbd_support module was able to recover from blocklisting until then. It was observed that the MirrorSnapshotScheduleHandler failed to shutdown properly, and this stopped the rbd_support module recovery to proceed further. The handler shutdown got stuck waiting for all the pending snapshot creation requests to fail or finish.

Initial analysis of the failed recovery:
Snapshot requests were created at 2023-05-29T18:51:00 ( identified by log message "CreateSnapshotRequests.add") for all the 22 images. The add requests started at 2023-05-29T18:50:59.999+0000 and ended at 2023-05-29T18:51:00.021+0000.  First sign of blocklisting occurred at 2023-05-29T18:51:00.262+0000 , librbd::watcher::RewatchRequest: 0x55fa591f4910 handle_unwatch r=-108 .  After blocklisting, it was observed that create snapshot requests for 20 images finished (1 completed and 19 errored out). Only create snapshot requests for 2 images ( "2//1231886181b19" and "2//17109c5f76935") failed to finish.  The failure of these 2 snapshot requests to finish must have blocked the shutdown of the MirrorSnapshotScheduleHandler.

I need to dig deeper to figure out why these 2 requests failed to error out or complete.

Comment 5 Ram Raja 2023-06-05 12:55:48 UTC
Of the 22 images whose snapshots were scheduled, 12 errored out when opening the image for snapshot, 8 errored out when creating a snapshot for the image, and 2 failed to error out (that caused the recovery to hang). Please note the correction in https://bugzilla.redhat.com/show_bug.cgi?id=2211290#c3 , where I mentioned that 1 mirror snapshot creation completed.

For the 2 scheduled mirror snapshot operations that failed to error out, I see "librbd::api::Mirror: image_snapshot_create" and later "librbd::ImageWatcher: 0x55fa64aa8f00 remote snap_create request". However, for a mirror snapshot schedule that errored out when creating a snapshot, I see "librbd::api::Mirror: image_snapshot_create" followed by "librbd::mirror::snapshot::CreatePrimaryRequest: 0x55fa5955e9c0 create_snapshot ", "librbd::Operations: 0x55fa5e73c7e0 execute_snap_create",  and then "[rbd_support ERROR root] error when creating snapshot for 2//14a6a474a58d0: -108".

Please see below for detailed log snippets.

Log snippet for mirror snapshot schedule that *did not error out* and caused the rbd_support module recovery to fail,
```
2023-05-29T18:51:00.008+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_get_mirror_info 2//1231886181b19: r=0 info={'global_id': '0d8be64e-956a-42be-817c-b1945343b921', 'state': 1, 'primary': True}
2023-05-29T18:51:00.008+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.create_snapshot for 2//1231886181b19
2023-05-29T18:51:00.008+0000 7fb11add0640 20 librbd::api::Mirror: image_snapshot_create: ictx=0x55fa61151000
2023-05-29T18:51:00.008+0000 7fb11add0640 20 librbd::mirror::GetInfoRequest: 0x55fa5beb12c0 get_mirror_image:
2023-05-29T18:51:00.008+0000 7fb11add0640 10 librbd::image::RefreshRequest: 0x55fa5d8e3880 handle_v2_get_group: r=0
2023-05-29T18:51:00.008+0000 7fb11add0640 10 librbd::image::RefreshRequest: 0x55fa5d8e3880 send_v2_get_snapshots
2023-05-29T18:51:00.008+0000 7fb11add0640 10 librbd::image::OpenRequest: handle_v2_get_initial_metadata: r=0
2023-05-29T18:51:00.008+0000 7fb11add0640 10 librbd::image::OpenRequest: 0x55fa5d9bb840 send_v2_get_create_timestamp
2023-05-29T18:51:00.008+0000 7fb11add0640 20 librbd::mirror::GetInfoRequest: 0x55fa5beb1980 handle_get_mirror_image: r=0
2023-05-29T18:51:00.008+0000 7fb11add0640 10 librbd::mirror::GetInfoRequest: 0x55fa5beb1980 calc_promotion_state: promotion_state=1, primary_mirror_uuid=
2023-05-29T18:51:00.008+0000 7fb11add0640 20 librbd::mirror::GetInfoRequest: 0x55fa5beb1980 finish: r=0
2023-05-29T18:51:00.008+0000 7fb11add0640 20 librbd: C_AioCompletion::finish: r=0
...
2023-05-29T18:51:00.010+0000 7fb11add0640 10 librbd::ImageWatcher: 0x55fa64aa8f00 remote snap_create request: [342496,94533739825408,15101] [mirror state=primary, complete=0, mirror_peer_uuids=52d86adc-3b8e-4aba-b0d2-9818b7f83340, clean_since_snap_id=head] .mirror.primary.0d8be64e-956a-42be-817c-b1945343b921.cf193596-7096-4d58-bd60-a6727703a3c5 0
2023-05-29T18:51:00.010+0000 7fb11add0640 20 librbd::ExclusiveLock: 0x55fa59570ea0 accept_request=0 (request_type=0)
2023-05-29T18:51:00.010+0000 7fb11add0640 10 librbd::Watcher::C_NotifyAck 0x55fa594dd4a0 finish: r=0
2023-05-29T18:51:00.010+0000 7fb11b5d1640 20 librbd::mirror::GetInfoRequest: 0x55fa5beb0f00 handle_get_mirror_image: r=0
2023-05-29T18:51:00.010+0000 7fb11b5d1640 10 librbd::mirror::GetInfoRequest: 0x55fa5beb0f00 calc_promotion_state: promotion_state=1, primary_mirror_uuid=
2023-05-29T18:51:00.010+0000 7fb11b5d1640 20 librbd::mirror::GetInfoRequest: 0x55fa5beb0f00 finish: r=0
2023-05-29T18:51:00.010+0000 7fb11b5d1640 20 librbd: C_AioCompletion::finish: r=0
```

Log snippet for a mirror snapshot schedule that errored out when creating a snapshot as expected,
```
2023-05-29T18:51:00.008+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_get_mirror_info 2//14a6a474a58d0: r=0 info={'global_id': '242275a6-bedc-4aac-aa5d-dde6a01eab32', 'state': 1, 'primary': True}
2023-05-29T18:51:00.008+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.create_snapshot for 2//14a6a474a58d0
2023-05-29T18:51:00.008+0000 7fb11add0640 20 librbd::api::Mirror: image_snapshot_create: ictx=0x55fa60969000
2023-05-29T18:51:00.008+0000 7fb11add0640 20 librbd::mirror::GetInfoRequest: 0x55fa5beb1980 get_mirror_image:
2023-05-29T18:51:00.008+0000 7fb11add0640 10 librbd::image::RefreshRequest: 0x55fa5d8e3880 handle_v2_get_snapshots: r=0
2023-05-29T18:51:00.008+0000 7fb11add0640 10 librbd::image::RefreshRequest: 0x55fa5d8e3880 send_v2_init_exclusive_lock
2023-05-29T18:51:00.008+0000 7fb11add0640 10 librbd::ExclusiveLock: 0x55fa51e82000 init: features=61
2023-05-29T18:51:00.008+0000 7fb11add0640  5 librbd::io::Dispatcher: 0x55fa5e73d320 register_dispatch: dispatch_layer=3
2023-05-29T18:51:00.008+0000 7fb11add0640 20 librbd::exclusive_lock::ImageDispatch: 0x55fa594b21c0 set_require_lock: direction=1, enabled=1
2023-05-29T18:51:00.008+0000 7fb11add0640 20 librbd::io::AsyncOperation: 0x55fa5bb6bdf0 start_op                      
...
2023-05-29T18:51:00.009+0000 7fb11add0640 15 librbd::mirror::snapshot::CreatePrimaryRequest: 0x55fa5955e9c0 handle_get_mirror_peers: r=0
2023-05-29T18:51:00.009+0000 7fb11add0640 15 librbd::mirror::snapshot::CreatePrimaryRequest: 0x55fa5955e9c0 create_snapshot: name=.mirror.primary.242275a6-bedc-4aac-aa5d-dde6a01eab32.8b6baca8-5d16-4fb9-8f83-eda964f303e4, ns=[mirror state=primary, complete=0, mirror_peer_uuids=52d86adc-3b8e-4aba-b0d2-9818b7f83340, clean_since_snap_id=head]
2023-05-29T18:51:00.009+0000 7fb11add0640  5 librbd::Operations: 0x55fa5e73c7e0 snap_create: snap_name=.mirror.primary.242275a6-bedc-4aac-aa5d-dde6a01eab32.8b6baca8-5d16-4fb9-8f83-eda964f303e4
...
2023-05-29T18:51:00.010+0000 7fb11add0640 20 librbd::Operations: start_op: snap create 0x55fa59006710
2023-05-29T18:51:00.010+0000 7fb11add0640 20 librbd::Operations: start snap create 0x55fa59006710
2023-05-29T18:51:00.010+0000 7fb11add0640 20 librbd::Operations: execute_local_request
2023-05-29T18:51:00.010+0000 7fb11add0640 20 librbd::ManagedLock: 0x55fa51ea4018 is_lock_owner: =1
2023-05-29T18:51:00.010+0000 7fb11b5d1640 10 librbd::Watcher: 0x55fa5d493800 handle_register_watch: r=0
2023-05-29T18:51:00.010+0000 7fb11add0640  5 librbd::Operations: 0x55fa5e73c7e0 execute_snap_create: snap_name=.mirror.primary.242275a6-bedc-4aac-aa5d-dde6a01eab32.8b6baca8-5d16-4fb9-8f83-eda964f303e4
2023-05-29T18:51:00.010+0000 7fb11b5d1640 10 librbd::image::OpenRequest: 0x55fa5d9bb840 handle_register_watch: r=0
2023-05-29T18:51:00.010+0000 7fb11add0640  5 librbd::SnapshotCreateRequest: 0x55fa590c6000 send_notify_quiesce
...
2023-05-29T18:51:00.010+0000 7fb11b5d1640 20 librbd: C_OpenComplete::finish: r=0
2023-05-29T18:51:00.010+0000 7fb11b5d1640 20 librbd: C_AioCompletion::finish: r=0
...
2023-05-29T18:51:02.612+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_create_snapshot for 2//14a6a474a58d0: r=-108, snap_id=None
2023-05-29T18:51:02.612+0000 7fb11b5d1640  0 [rbd_support ERROR root] error when creating snapshot for 2//14a6a474a58d0: -108
2023-05-29T18:51:02.612+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.close_image 2//14a6a474a58d0
```

Comment 6 Ram Raja 2023-06-06 12:52:13 UTC
Created attachment 1969296 [details]
mgr logs around when rbd_support module recovery failed on client blocklisting

The mgr logs around when the rbd_support module failed to recover are in "2023-05-29T18:51.txt" . The file is used to compare the logs of images scheduled for snapshots, images whose snapshots creation errored out with those whose snapshot creation didn't on the rados client being blocklisted.

Below are log snippets of images scheduled for snapshots whose snapshot creation errored out as expected on client being blocklisted.
```
$ grep -E '171317162a276|a92d6ef8-68ce-4a33-b9d8-916198f12469' 2023-05-29T18\:51.txt
2023-05-29T18:51:00.006+0000 7fb0fa44f640  0 [rbd_support DEBUG root] CreateSnapshotRequests.add: 2//171317162a276
2023-05-29T18:51:00.006+0000 7fb0fa44f640  0 [rbd_support DEBUG root] CreateSnapshotRequests.open_image: 2//171317162a276
2023-05-29T18:51:00.008+0000 7fb0fa44f640 10 librbd::ImageCtx: 0x55fa6610e000 ImageCtx: image_name=, image_id=171317162a276
2023-05-29T18:51:00.008+0000 7fb0fa44f640  0 [rbd_support DEBUG root] MirrorSnapshotScheduleHandler: scheduling 2//171317162a276 at 2023-05-29 18:54:00
2023-05-29T18:51:00.009+0000 7fb11add0640 10 librbd::ImageCtx: init_layout stripe_unit 4194304 stripe_count 1 object_size 4194304 prefix rbd_data.171317162a276 format rbd_data.171317162a276.%016llx
2023-05-29T18:51:00.010+0000 7fb11b5d1640 20 librbd::image::RefreshRequest: new snapshot id=484290 name=.mirror.primary.a92d6ef8-68ce-4a33-b9d8-916198f12469.8e9f4188-eb6b-4af8-8627-f93403afd0c5 size=107374182400
2023-05-29T18:51:00.010+0000 7fb11b5d1640 20 librbd::image::RefreshRequest: new snapshot id=484108 name=.mirror.primary.a92d6ef8-68ce-4a33-b9d8-916198f12469.c9ef235d-c490-4604-a11a-6930f8f8daad size=107374182400
2023-05-29T18:51:00.010+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_open_image 2//171317162a276: r=0
2023-05-29T18:51:00.010+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.get_mirror_mode: 2//171317162a276
2023-05-29T18:51:00.011+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_get_mirror_mode 2//171317162a276: r=0 mode=1
2023-05-29T18:51:00.011+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.get_mirror_info: 2//171317162a276
2023-05-29T18:51:00.011+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_get_mirror_info 2//171317162a276: r=0 info={'global_id': 'a92d6ef8-68ce-4a33-b9d8-916198f12469', 'state': 1, 'primary': True}
2023-05-29T18:51:00.011+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.create_snapshot for 2//171317162a276
2023-05-29T18:51:00.011+0000 7fb11add0640 15 librbd::mirror::snapshot::CreatePrimaryRequest: 0x55fa5955fd40 create_snapshot: name=.mirror.primary.a92d6ef8-68ce-4a33-b9d8-916198f12469.b733b096-4479-446d-af38-96e4bf4dd3d1, ns=[mirror state=primary, complete=0, mirror_peer_uuids=52d86adc-3b8e-4aba-b0d2-9818b7f83340, clean_since_snap_id=head]
2023-05-29T18:51:00.011+0000 7fb11add0640  5 librbd::Operations: 0x55fa5e73d3b0 snap_create: snap_name=.mirror.primary.a92d6ef8-68ce-4a33-b9d8-916198f12469.b733b096-4479-446d-af38-96e4bf4dd3d1
2023-05-29T18:51:00.012+0000 7fb11b5d1640 10 librbd::object_map::RefreshRequest: 0x55fa5b547860 send_lock: oid=rbd_object_map.171317162a276
2023-05-29T18:51:00.012+0000 7fb11b5d1640 10 librbd::object_map::LockRequest: 0x55fa5dc66fc0 send_lock: oid=rbd_object_map.171317162a276
2023-05-29T18:51:00.012+0000 7fb11b5d1640 10 librbd::object_map::RefreshRequest: 0x55fa5b547860 send_load: oid=rbd_object_map.171317162a276
2023-05-29T18:51:00.013+0000 7fb11b5d1640  5 librbd::Operations: 0x55fa5e73d3b0 execute_snap_create: snap_name=.mirror.primary.a92d6ef8-68ce-4a33-b9d8-916198f12469.b733b096-4479-446d-af38-96e4bf4dd3d1
2023-05-29T18:51:02.613+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_create_snapshot for 2//171317162a276: r=-108, snap_id=None
2023-05-29T18:51:02.613+0000 7fb11b5d1640  0 [rbd_support ERROR root] error when creating snapshot for 2//171317162a276: -108
2023-05-29T18:51:02.613+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.close_image 2//171317162a276
2023-05-29T18:51:02.613+0000 7fb11b5d1640 10 librbd::object_map::UnlockRequest: 0x55fa63e074d0 send_unlock: oid=rbd_object_map.171317162a276
2023-05-29T18:51:02.616+0000 7fb1045a3640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_close_image 2//171317162a276: r=0
2023-05-29T18:51:02.616+0000 7fb1045a3640  0 [rbd_support DEBUG root] CreateSnapshotRequests.finish: 2//171317162a276
```

and

```
$ grep -E '1bded98edb7a0|35496098-d456-4cf9-940f-fb67aac15d69' 2023-05-29T18\:51.txt
2023-05-29T18:51:00.008+0000 7fb0fa44f640  0 [rbd_support DEBUG root] CreateSnapshotRequests.add: 2//1bded98edb7a0
2023-05-29T18:51:00.008+0000 7fb0fa44f640  0 [rbd_support DEBUG root] CreateSnapshotRequests.open_image: 2//1bded98edb7a0
2023-05-29T18:51:00.010+0000 7fb0fa44f640 10 librbd::ImageCtx: 0x55fa5d37a000 ImageCtx: image_name=, image_id=1bded98edb7a0
2023-05-29T18:51:00.010+0000 7fb0fa44f640  0 [rbd_support DEBUG root] MirrorSnapshotScheduleHandler: scheduling 2//1bded98edb7a0 at 2023-05-29 18:54:00
2023-05-29T18:51:00.011+0000 7fb11b5d1640 10 librbd::ImageCtx: init_layout stripe_unit 4194304 stripe_count 1 object_size 4194304 prefix rbd_data.1bded98edb7a0 format rbd_data.1bded98edb7a0.%016llx
2023-05-29T18:51:00.012+0000 7fb11add0640 20 librbd::image::RefreshRequest: new snapshot id=484296 name=.mirror.primary.35496098-d456-4cf9-940f-fb67aac15d69.a32de766-e1b4-40f2-8dd1-208d31e27d23 size=107374182400
2023-05-29T18:51:00.012+0000 7fb11add0640 20 librbd::image::RefreshRequest: new snapshot id=484114 name=.mirror.primary.35496098-d456-4cf9-940f-fb67aac15d69.21c93a23-c7af-4d50-9ee7-8692ef5e6ebb size=107374182400
2023-05-29T18:51:00.012+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_open_image 2//1bded98edb7a0: r=0
2023-05-29T18:51:00.012+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.get_mirror_mode: 2//1bded98edb7a0
2023-05-29T18:51:00.013+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_get_mirror_mode 2//1bded98edb7a0: r=0 mode=1
2023-05-29T18:51:00.013+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.get_mirror_info: 2//1bded98edb7a0
2023-05-29T18:51:00.013+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_get_mirror_info 2//1bded98edb7a0: r=0 info={'global_id': '35496098-d456-4cf9-940f-fb67aac15d69', 'state': 1, 'primary': True}
2023-05-29T18:51:00.013+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.create_snapshot for 2//1bded98edb7a0
2023-05-29T18:51:00.013+0000 7fb11add0640 15 librbd::mirror::snapshot::CreatePrimaryRequest: 0x55fa5955f6c0 create_snapshot: name=.mirror.primary.35496098-d456-4cf9-940f-fb67aac15d69.6b6c5a31-828e-48c7-8ebe-c017679df345, ns=[mirror state=primary, complete=0, mirror_peer_uuids=52d86adc-3b8e-4aba-b0d2-9818b7f83340, clean_since_snap_id=head]
2023-05-29T18:51:00.013+0000 7fb11add0640  5 librbd::Operations: 0x55fa5e73d560 snap_create: snap_name=.mirror.primary.35496098-d456-4cf9-940f-fb67aac15d69.6b6c5a31-828e-48c7-8ebe-c017679df345
2023-05-29T18:51:00.014+0000 7fb11add0640 10 librbd::ImageWatcher: 0x55fa650cdb00 remote snap_create request: [342496,94533745813632,15107] [mirror state=primary, complete=0, mirror_peer_uuids=52d86adc-3b8e-4aba-b0d2-9818b7f83340, clean_since_snap_id=head] .mirror.primary.35496098-d456-4cf9-940f-fb67aac15d69.6b6c5a31-828e-48c7-8ebe-c017679df345 0
2023-05-29T18:51:00.314+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_create_snapshot for 2//1bded98edb7a0: r=-108, snap_id=None
2023-05-29T18:51:00.314+0000 7fb11b5d1640  0 [rbd_support ERROR root] error when creating snapshot for 2//1bded98edb7a0: -108
2023-05-29T18:51:00.314+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.close_image 2//1bded98edb7a0
2023-05-29T18:51:00.314+0000 7fb1045a3640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_close_image 2//1bded98edb7a0: r=0
2023-05-29T18:51:00.314+0000 7fb1045a3640  0 [rbd_support DEBUG root] CreateSnapshotRequests.finish: 2//1bded98edb7a0
```

Following is the log snippet of the image that was scheduled for a mirror snapshot. On rbd_support module's client being blocklisted, the image's snapshot creation did not error out. This stalled the recovery of the rbd_support module.
```
$ grep -E '1231886181b19|0d8be64e-956a-42be-817c-b1945343b921' 2023-05-29T18\:51.txt
2023-05-29T18:50:59.999+0000 7fb0fa44f640  0 [rbd_support DEBUG root] CreateSnapshotRequests.add: 2//1231886181b19
2023-05-29T18:50:59.999+0000 7fb0fa44f640  0 [rbd_support DEBUG root] CreateSnapshotRequests.open_image: 2//1231886181b19
2023-05-29T18:51:00.001+0000 7fb0fa44f640 10 librbd::ImageCtx: 0x55fa61151000 ImageCtx: image_name=, image_id=1231886181b19
2023-05-29T18:51:00.001+0000 7fb0fa44f640  0 [rbd_support DEBUG root] MirrorSnapshotScheduleHandler: scheduling 2//1231886181b19 at 2023-05-29 18:54:00
2023-05-29T18:51:00.002+0000 7fb11add0640 10 librbd::ImageCtx: init_layout stripe_unit 4194304 stripe_count 1 object_size 4194304 prefix rbd_data.1231886181b19 format rbd_data.1231886181b19.%016llx
2023-05-29T18:51:00.005+0000 7fb11b5d1640 20 librbd::image::RefreshRequest: new snapshot id=484294 name=.mirror.primary.0d8be64e-956a-42be-817c-b1945343b921.a45b233b-5297-4504-8c8a-cf16fa1b5b86 size=107374182400
2023-05-29T18:51:00.005+0000 7fb11b5d1640 20 librbd::image::RefreshRequest: new snapshot id=484112 name=.mirror.primary.0d8be64e-956a-42be-817c-b1945343b921.f5fea28e-0750-4df4-90f1-66176b95b092 size=107374182400
2023-05-29T18:51:00.007+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_open_image 2//1231886181b19: r=0
2023-05-29T18:51:00.007+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.get_mirror_mode: 2//1231886181b19
2023-05-29T18:51:00.008+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_get_mirror_mode 2//1231886181b19: r=0 mode=1
2023-05-29T18:51:00.008+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.get_mirror_info: 2//1231886181b19
2023-05-29T18:51:00.008+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_get_mirror_info 2//1231886181b19: r=0 info={'global_id': '0d8be64e-956a-42be-817c-b1945343b921', 'state': 1, 'primary': True}
2023-05-29T18:51:00.008+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.create_snapshot for 2//1231886181b19
2023-05-29T18:51:00.008+0000 7fb11add0640 15 librbd::mirror::snapshot::CreatePrimaryRequest: 0x55fa5955e4e0 create_snapshot: name=.mirror.primary.0d8be64e-956a-42be-817c-b1945343b921.cf193596-7096-4d58-bd60-a6727703a3c5, ns=[mirror state=primary, complete=0, mirror_peer_uuids=52d86adc-3b8e-4aba-b0d2-9818b7f83340, clean_since_snap_id=head]
2023-05-29T18:51:00.008+0000 7fb11add0640  5 librbd::Operations: 0x55fa5e73c630 snap_create: snap_name=.mirror.primary.0d8be64e-956a-42be-817c-b1945343b921.cf193596-7096-4d58-bd60-a6727703a3c5
2023-05-29T18:51:00.010+0000 7fb11add0640 10 librbd::ImageWatcher: 0x55fa64aa8f00 remote snap_create request: [342496,94533739825408,15101] [mirror state=primary, complete=0, mirror_peer_uuids=52d86adc-3b8e-4aba-b0d2-9818b7f83340, clean_since_snap_id=head] .mirror.primary.0d8be64e-956a-42be-817c-b1945343b921.cf193596-7096-4d58-bd60-a6727703a3c5 0
```

Looking at these log snippets, it seems like librbd code sometimes fails to error out due to client blocklisting after remote snap_create request is issued.

Comment 9 Mudit Agarwal 2023-07-13 03:58:14 UTC
Ilya, are we tracking this as a 6.1z1 blocker?

Comment 27 Akash Raj 2023-09-14 10:33:13 UTC
Hi Ram.

Could you please confirm if this BZ needs to be added to the 6.1z2 release notes? If so, please provide the doc type and text.

Thanks.

Comment 28 Ram Raja 2023-09-15 03:39:14 UTC
I've added the required doc text.

Comment 30 errata-xmlrpc 2023-10-12 16:34:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security, enhancement, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:5693


Note You need to log in before you can comment on or make changes to this bug.