Bug 2237303 - [rbd-mirror] : snapshot schedules stopped : possibly due to hang in MirrorSnapshotScheduleHandler.shutdown which could be in wait_for_pending() [7.0]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD-Mirror
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 7.0
Assignee: Ram Raja
QA Contact: Sunil Angadi
Docs Contact: Rivka Pollack
URL:
Whiteboard:
Depends On:
Blocks: 2237662
Reported: 2023-09-04 16:19 UTC by Ilya Dryomov
Modified: 2023-12-13 15:22 UTC
CC List: 13 users

Fixed In Version: ceph-18.2.0-9.el9cp
Doc Type: Bug Fix
Doc Text:
.The `librbd` client correctly propagates the block-listing error to the caller
Previously, when the `rbd_support` module's RADOS client was block-listed, the module's `mirror_snapshot_schedule` handler would not always shut down correctly. The handler's `librbd` client would not propagate the block-listing error, thereby stalling the handler's shutdown. This prevented the `mirror_snapshot_schedule` handler and the `rbd_support` module from automatically recovering from repeated client block-listing, and the `rbd_support` module stopped scheduling mirror snapshots after its client was repeatedly block-listed.
With this fix, the race in the `librbd` client between exclusive lock acquisition and the handling of block-listing is resolved. This allows the `librbd` client to correctly propagate the block-listing error to the caller, for example, the `mirror_snapshot_schedule` handler, while it waits to acquire the exclusive lock. As a result, the `mirror_snapshot_schedule` handler and the `rbd_support` module automatically recover from repeated client block-listing.
Clone Of: 2211290
Environment:
Last Closed: 2023-12-13 15:22:41 UTC
Embargoed:




Links
Ceph Project Bug Tracker 61607 (last updated 2023-09-04 16:19:48 UTC)
Red Hat Issue Tracker RHCEPH-7308 (last updated 2023-09-04 16:21:06 UTC)
Red Hat Product Errata RHBA-2023:7780 (last updated 2023-12-13 15:22:45 UTC)

Description Ilya Dryomov 2023-09-04 16:19:48 UTC
+++ This bug was initially created as a clone of Bug #2211290 +++

Description of problem:
Configured two clusters with two-way snapshot-based mirroring, with both clusters hosting primary images and with delay between the clusters simulated using qdisc.
Images were observed for days for snapshot scheduling across multiple build-to-build upgrades.

While testing the blocklist recovery fix, blocklisting was tried in a loop for about an hour.
Later, it was observed that the snapshot schedule had stopped for all the primary images in one of the clusters.

On inspection, it appeared that the client had recovered from the blocklisting, but MirrorSnapshotScheduleHandler.shutdown seemed to have hung, possibly in wait_for_pending().

Version-Release number of selected component (if applicable):
 17.2.6-69

How reproducible:
Tried once with 22 images; snapshot scheduling was affected on all of them, as the handler itself might have been affected.

Steps to Reproduce:
Configure snapshot-based mirroring at the image level, blocklist the rbd_support module's client in a loop, and observe the snapshot schedules.
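
A minimal sketch of the blocklisting loop, for illustration only (the actual reproduction script is linked in a later comment). The address, duration, and interval below are placeholders; in practice the rbd_support module's RADOS client address has to be re-discovered on each iteration, since the module creates a new client after recovering from the blocklist.
```
# Illustrative sketch only, assuming 'ceph osd blocklist add' is run with
# admin credentials. CLIENT_ADDR is a placeholder: the rbd_support module's
# RADOS client address changes whenever the module reconnects, so a real
# reproduction script re-discovers it on every iteration.
import subprocess
import time

CLIENT_ADDR = "192.168.0.1:0/123456789"   # placeholder address
DURATION_SECS = 3600                       # run for about an hour
INTERVAL_SECS = 5                          # blocklist every few seconds

deadline = time.time() + DURATION_SECS
while time.time() < deadline:
    # Blocklist the module's current RADOS client; ignore failures so the
    # loop keeps running while the module is recovering.
    subprocess.run(["ceph", "osd", "blocklist", "add", CLIENT_ADDR],
                   check=False)
    time.sleep(INTERVAL_SECS)
```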

Actual results:
The snapshot schedule got stuck, stalling mirroring.

Expected results:
Snapshot scheduling must remain intact.

Additional info:

--- Additional comment from Vasishta on 2023-05-31 04:07:55 UTC ---

Log link - http://magna002.ceph.redhat.com/ceph-qe-logs/vasishta/2023/block_list/cluster_9/ceph-mgr.cali024.edzqcd.log-20230530.gz
Respective log snippets can be found after the timestamp 18:06:00

--- Additional comment from Vasishta on 2023-05-31 04:09:03 UTC ---

Adding the initial analysis from an offline conversation for reference -

Ilya Dryomov, Yesterday 8:54 PM
Looks like an issue in rbd_support logic at first sight -- MirrorSnapshotScheduleHandler initially detects blocklisting, TrashPurgeScheduleHandler picks it up in a bit too but after recovery has started:
2023-05-29T18:51:51.962+0000 7fb0fa44f640  0 [rbd_support DEBUG root] MirrorSnapshotScheduleHandler: refresh_images
2023-05-29T18:51:51.962+0000 7fb0fa44f640  0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: load_schedules
2023-05-29T18:51:51.964+0000 7fb0fa44f640  0 [rbd_support DEBUG root] CreateSnapshotRequests.wait_for_pending
2023-05-29T18:51:51.965+0000 7fb0fa44f640  0 [rbd_support INFO root] load_schedules: mirror_pool, start_after=
2023-05-29T18:51:51.966+0000 7fb0f9c4e640  0 [rbd_support DEBUG root] PerfHandler: tick
2023-05-29T18:51:51.966+0000 7fb0fa44f640  0 [rbd_support ERROR root] MirrorSnapshotScheduleHandler: client blocklisted
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/rbd_support/mirror_snapshot_schedule.py", line 361, in run
    refresh_delay = self.refresh_images()  
  File "/usr/share/ceph/mgr/rbd_support/mirror_snapshot_schedule.py", line 400, in refresh_images
    self.load_schedules()
  File "/usr/share/ceph/mgr/rbd_support/mirror_snapshot_schedule.py", line 390, in load_schedules
    self.schedules.load(namespace_validator, image_validator)
  File "/usr/share/ceph/mgr/rbd_support/schedule.py", line 419, in load
    self.load_from_pool(ioctx, namespace_validator,
  File "/usr/share/ceph/mgr/rbd_support/schedule.py", line 443, in load_from_pool
    ioctx.operate_read_op(read_op, self.handler.SCHEDULE_OID)
  File "rados.pyx", line 3720, in rados.Ioctx.operate_read_op
rados.ConnectionShutdown: [errno 108] RADOS connection was shutdown (Failed to operate read op for oid rbd_mirror_snapshot_schedule)
2023-05-29T18:51:51.966+0000 7fb117dca640  0 [rbd_support INFO root] recovering from blocklisting
2023-05-29T18:51:51.966+0000 7fb117dca640  0 [rbd_support INFO root] MirrorSnapshotScheduleHandler: shutting down
2023-05-29T18:51:51.966+0000 7fb117dca640  0 [rbd_support DEBUG root] CreateSnapshotRequests.wait_for_pending
2023-05-29T18:51:51.967+0000 7fb0f944d640  0 [rbd_support DEBUG root] TaskHandler: tick
2023-05-29T18:51:51.968+0000 7fb0f8c4c640  0 [rbd_support DEBUG root] TrashPurgeScheduleHandler: refresh_pools
2023-05-29T18:51:51.968+0000 7fb0f8c4c640  0 [rbd_support INFO root] TrashPurgeScheduleHandler: load_schedules
2023-05-29T18:51:51.973+0000 7fb0f8c4c640  0 [rbd_support DEBUG root] CreateSnapshotRequests.wait_for_pending
[...]
2023-05-29T18:51:51.973+0000 7fb0f8c4c640  0 [rbd_support DEBUG root] CreateSnapshotRequests.wait_for_pending
2023-05-29T18:51:51.975+0000 7fb0f8c4c640  0 [rbd_support INFO root] load_schedules: mirror_pool, start_after=
2023-05-29T18:51:51.975+0000 7fb0f8c4c640  0 [rbd_support ERROR root] TrashPurgeScheduleHandler: client blocklisted
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/rbd_support/trash_purge_schedule.py", line 48, in run
    refresh_delay = self.refresh_pools()
  File "/usr/share/ceph/mgr/rbd_support/trash_purge_schedule.py", line 97, in refresh_pools
    self.load_schedules()
  File "/usr/share/ceph/mgr/rbd_support/trash_purge_schedule.py", line 87, in load_schedules
    self.schedules.load()
  File "/usr/share/ceph/mgr/rbd_support/schedule.py", line 419, in load
    self.load_from_pool(ioctx, namespace_validator,
  File "/usr/share/ceph/mgr/rbd_support/schedule.py", line 443, in load_from_pool
    ioctx.operate_read_op(read_op, self.handler.SCHEDULE_OID)
  File "rados.pyx", line 3720, in rados.Ioctx.operate_read_op
rados.ConnectionShutdown: [errno 108] RADOS connection was shutdown (Failed to operate read op for oid rbd_trash_purge_schedule)
2023-05-29T18:51:56.966+0000 7fb0f9c4e640  0 [rbd_support DEBUG root] PerfHandler: tick
2023-05-29T18:51:56.967+0000 7fb0f944d640  0 [rbd_support DEBUG root] TaskHandler: tick

Ilya Dryomov, Yesterday 8:54 PM, Edited
But PerfHandler and TaskHandler happily continue ticking...

Ramana Krisna Venkatesh Raja, Yesterday 8:55 PM
just caught the same in the log

Ilya Dryomov, Yesterday 8:57 PM, Edited
Looks like a hang in MirrorSnapshotScheduleHandler.shutdown() to me, most likely in wait_for_pending()...
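
For context, the recovery pattern being discussed looks roughly like the following. This is a simplified sketch with illustrative names, not the actual rbd_support sources: a handler's run loop hits rados.ConnectionShutdown once the client is blocklisted, the module then shuts every handler down before recreating its RADOS client, and recovery cannot get past a handler whose shutdown() never returns while the other handlers keep ticking.
```
# Simplified sketch of the recovery flow; names are illustrative and do not
# match the rbd_support sources exactly.
import threading
import rados


class Handler:
    def __init__(self, module):
        self.module = module
        self.stop_event = threading.Event()

    def refresh(self):
        # In the real module this reads schedules through the RADOS client;
        # once the client is blocklisted it raises rados.ConnectionShutdown
        # (errno 108, as in the tracebacks above).
        ...

    def run(self):
        try:
            while not self.stop_event.is_set():
                self.refresh()
        except rados.ConnectionShutdown:
            # Signal the module to start blocklist recovery.
            self.module.client_blocklisted.set()

    def shutdown(self):
        self.stop_event.set()
        # The real MirrorSnapshotScheduleHandler additionally waits here for
        # all in-flight snapshot-create requests (wait_for_pending), which is
        # where the hang is suspected.


class Module:
    def __init__(self):
        self.client_blocklisted = threading.Event()
        self.handlers = [Handler(self) for _ in range(4)]

    def recover(self):
        # Recovery cannot get past a handler whose shutdown() never returns.
        for handler in self.handlers:
            handler.shutdown()
        # ... recreate the RADOS client and the handlers ...
```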

--- Additional comment from Ram Raja on 2023-06-01 13:35:44 UTC ---

Vasishta's test: 22 images were scheduled to have their mirror-snapshots created once every 3 minutes. rbd_support module's rados client was added to the blocklist every 5 seconds.

Observation:
Close to 50 minutes into the test, the rbd_support module failed to recover from blocklisting; it had been able to recover from every blocklisting until then. The MirrorSnapshotScheduleHandler failed to shut down properly, which prevented the rbd_support module's recovery from proceeding. The handler's shutdown got stuck waiting for all pending snapshot creation requests to fail or finish.

Initial analysis of the failed recovery:
Snapshot requests were created at 2023-05-29T18:51:00 (identified by the log message "CreateSnapshotRequests.add") for all 22 images. The add requests started at 2023-05-29T18:50:59.999+0000 and ended at 2023-05-29T18:51:00.021+0000. The first sign of blocklisting occurred at 2023-05-29T18:51:00.262+0000: librbd::watcher::RewatchRequest: 0x55fa591f4910 handle_unwatch r=-108. After blocklisting, the create snapshot requests for 20 images finished (1 completed and 19 errored out). Only the create snapshot requests for 2 images ("2//1231886181b19" and "2//17109c5f76935") failed to finish. The failure of these 2 snapshot requests to finish must have blocked the shutdown of the MirrorSnapshotScheduleHandler.

I need to dig deeper to figure out why these 2 requests failed to error out or complete.
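
For reference, the "wait for pending" pattern that the shutdown is suspected to be stuck in looks roughly like this. A minimal sketch, assuming a condition variable guarding a set of in-flight requests; it is not the actual CreateSnapshotRequests code:
```
# Minimal sketch of the suspected hang: shutdown waits until every pending
# request has called back, so a completion that never fires blocks it forever.
# Names are illustrative, not the actual CreateSnapshotRequests code.
import threading


class CreateSnapshotRequestsSketch:
    def __init__(self):
        self.lock = threading.Lock()
        self.condition = threading.Condition(self.lock)
        self.pending = set()           # image specs with in-flight requests

    def add(self, image_spec):
        with self.lock:
            self.pending.add(image_spec)
        # ... issue the asynchronous open/snapshot-create chain here ...

    def finish(self, image_spec):
        # Called from the completion callback, whether the request completed
        # or errored out (e.g. with -108/ESHUTDOWN on blocklisting).
        with self.lock:
            self.pending.discard(image_spec)
            self.condition.notify_all()

    def wait_for_pending(self):
        # Shutdown blocks here; if librbd never completes a request (the two
        # stuck images), this never returns.
        with self.lock:
            while self.pending:
                self.condition.wait()
```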

--- Additional comment from Vasishta on 2023-06-02 04:14:19 UTC ---

Script used to hit the issue - http://pastebin.test.redhat.com/1101663
The script needs to be run on a node with the admin keyring in place.
(The script can be optimised/groomed further with respect to variables.)

--- Additional comment from Ram Raja on 2023-06-05 12:55:48 UTC ---

Of the 22 images whose snapshots were scheduled, 12 errored out when opening the image for the snapshot, 8 errored out when creating a snapshot for the image, and 2 failed to error out (which caused the recovery to hang). Please note that this corrects https://bugzilla.redhat.com/show_bug.cgi?id=2211290#c3 , where I mentioned that 1 mirror snapshot creation completed.

For the 2 scheduled mirror snapshot operations that failed to error out, I see "librbd::api::Mirror: image_snapshot_create" and later "librbd::ImageWatcher: 0x55fa64aa8f00 remote snap_create request". However, for a mirror snapshot schedule that errored out when creating a snapshot, I see "librbd::api::Mirror: image_snapshot_create" followed by "librbd::mirror::snapshot::CreatePrimaryRequest: 0x55fa5955e9c0 create_snapshot ", "librbd::Operations: 0x55fa5e73c7e0 execute_snap_create",  and then "[rbd_support ERROR root] error when creating snapshot for 2//14a6a474a58d0: -108".

Please see below for detailed log snippets.

Log snippet for mirror snapshot schedule that *did not error out* and caused the rbd_support module recovery to fail,
```
2023-05-29T18:51:00.008+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_get_mirror_info 2//1231886181b19: r=0 info={'global_id': '0d8be64e-956a-42be-817c-b1945343b921', 'state': 1, 'primary': True}
2023-05-29T18:51:00.008+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.create_snapshot for 2//1231886181b19
2023-05-29T18:51:00.008+0000 7fb11add0640 20 librbd::api::Mirror: image_snapshot_create: ictx=0x55fa61151000
2023-05-29T18:51:00.008+0000 7fb11add0640 20 librbd::mirror::GetInfoRequest: 0x55fa5beb12c0 get_mirror_image:
2023-05-29T18:51:00.008+0000 7fb11add0640 10 librbd::image::RefreshRequest: 0x55fa5d8e3880 handle_v2_get_group: r=0
2023-05-29T18:51:00.008+0000 7fb11add0640 10 librbd::image::RefreshRequest: 0x55fa5d8e3880 send_v2_get_snapshots
2023-05-29T18:51:00.008+0000 7fb11add0640 10 librbd::image::OpenRequest: handle_v2_get_initial_metadata: r=0
2023-05-29T18:51:00.008+0000 7fb11add0640 10 librbd::image::OpenRequest: 0x55fa5d9bb840 send_v2_get_create_timestamp
2023-05-29T18:51:00.008+0000 7fb11add0640 20 librbd::mirror::GetInfoRequest: 0x55fa5beb1980 handle_get_mirror_image: r=0
2023-05-29T18:51:00.008+0000 7fb11add0640 10 librbd::mirror::GetInfoRequest: 0x55fa5beb1980 calc_promotion_state: promotion_state=1, primary_mirror_uuid=
2023-05-29T18:51:00.008+0000 7fb11add0640 20 librbd::mirror::GetInfoRequest: 0x55fa5beb1980 finish: r=0
2023-05-29T18:51:00.008+0000 7fb11add0640 20 librbd: C_AioCompletion::finish: r=0
...
2023-05-29T18:51:00.010+0000 7fb11add0640 10 librbd::ImageWatcher: 0x55fa64aa8f00 remote snap_create request: [342496,94533739825408,15101] [mirror state=primary, complete=0, mirror_peer_uuids=52d86adc-3b8e-4aba-b0d2-9818b7f83340, clean_since_snap_id=head] .mirror.primary.0d8be64e-956a-42be-817c-b1945343b921.cf193596-7096-4d58-bd60-a6727703a3c5 0
2023-05-29T18:51:00.010+0000 7fb11add0640 20 librbd::ExclusiveLock: 0x55fa59570ea0 accept_request=0 (request_type=0)
2023-05-29T18:51:00.010+0000 7fb11add0640 10 librbd::Watcher::C_NotifyAck 0x55fa594dd4a0 finish: r=0
2023-05-29T18:51:00.010+0000 7fb11b5d1640 20 librbd::mirror::GetInfoRequest: 0x55fa5beb0f00 handle_get_mirror_image: r=0
2023-05-29T18:51:00.010+0000 7fb11b5d1640 10 librbd::mirror::GetInfoRequest: 0x55fa5beb0f00 calc_promotion_state: promotion_state=1, primary_mirror_uuid=
2023-05-29T18:51:00.010+0000 7fb11b5d1640 20 librbd::mirror::GetInfoRequest: 0x55fa5beb0f00 finish: r=0
2023-05-29T18:51:00.010+0000 7fb11b5d1640 20 librbd: C_AioCompletion::finish: r=0
```

Log snippet for a mirror snapshot schedule that errored out when creating a snapshot as expected,
```
2023-05-29T18:51:00.008+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_get_mirror_info 2//14a6a474a58d0: r=0 info={'global_id': '242275a6-bedc-4aac-aa5d-dde6a01eab32', 'state': 1, 'primary': True}
2023-05-29T18:51:00.008+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.create_snapshot for 2//14a6a474a58d0
2023-05-29T18:51:00.008+0000 7fb11add0640 20 librbd::api::Mirror: image_snapshot_create: ictx=0x55fa60969000
2023-05-29T18:51:00.008+0000 7fb11add0640 20 librbd::mirror::GetInfoRequest: 0x55fa5beb1980 get_mirror_image:
2023-05-29T18:51:00.008+0000 7fb11add0640 10 librbd::image::RefreshRequest: 0x55fa5d8e3880 handle_v2_get_snapshots: r=0
2023-05-29T18:51:00.008+0000 7fb11add0640 10 librbd::image::RefreshRequest: 0x55fa5d8e3880 send_v2_init_exclusive_lock
2023-05-29T18:51:00.008+0000 7fb11add0640 10 librbd::ExclusiveLock: 0x55fa51e82000 init: features=61
2023-05-29T18:51:00.008+0000 7fb11add0640  5 librbd::io::Dispatcher: 0x55fa5e73d320 register_dispatch: dispatch_layer=3
2023-05-29T18:51:00.008+0000 7fb11add0640 20 librbd::exclusive_lock::ImageDispatch: 0x55fa594b21c0 set_require_lock: direction=1, enabled=1
2023-05-29T18:51:00.008+0000 7fb11add0640 20 librbd::io::AsyncOperation: 0x55fa5bb6bdf0 start_op                      
...
2023-05-29T18:51:00.009+0000 7fb11add0640 15 librbd::mirror::snapshot::CreatePrimaryRequest: 0x55fa5955e9c0 handle_get_mirror_peers: r=0
2023-05-29T18:51:00.009+0000 7fb11add0640 15 librbd::mirror::snapshot::CreatePrimaryRequest: 0x55fa5955e9c0 create_snapshot: name=.mirror.primary.242275a6-bedc-4aac-aa5d-dde6a01eab32.8b6baca8-5d16-4fb9-8f83-eda964f303e4, ns=[mirror state=primary, complete=0, mirror_peer_uuids=52d86adc-3b8e-4aba-b0d2-9818b7f83340, clean_since_snap_id=head]
2023-05-29T18:51:00.009+0000 7fb11add0640  5 librbd::Operations: 0x55fa5e73c7e0 snap_create: snap_name=.mirror.primary.242275a6-bedc-4aac-aa5d-dde6a01eab32.8b6baca8-5d16-4fb9-8f83-eda964f303e4
...
2023-05-29T18:51:00.010+0000 7fb11add0640 20 librbd::Operations: start_op: snap create 0x55fa59006710
2023-05-29T18:51:00.010+0000 7fb11add0640 20 librbd::Operations: start snap create 0x55fa59006710
2023-05-29T18:51:00.010+0000 7fb11add0640 20 librbd::Operations: execute_local_request
2023-05-29T18:51:00.010+0000 7fb11add0640 20 librbd::ManagedLock: 0x55fa51ea4018 is_lock_owner: =1
2023-05-29T18:51:00.010+0000 7fb11b5d1640 10 librbd::Watcher: 0x55fa5d493800 handle_register_watch: r=0
2023-05-29T18:51:00.010+0000 7fb11add0640  5 librbd::Operations: 0x55fa5e73c7e0 execute_snap_create: snap_name=.mirror.primary.242275a6-bedc-4aac-aa5d-dde6a01eab32.8b6baca8-5d16-4fb9-8f83-eda964f303e4
2023-05-29T18:51:00.010+0000 7fb11b5d1640 10 librbd::image::OpenRequest: 0x55fa5d9bb840 handle_register_watch: r=0
2023-05-29T18:51:00.010+0000 7fb11add0640  5 librbd::SnapshotCreateRequest: 0x55fa590c6000 send_notify_quiesce
...
2023-05-29T18:51:00.010+0000 7fb11b5d1640 20 librbd: C_OpenComplete::finish: r=0
2023-05-29T18:51:00.010+0000 7fb11b5d1640 20 librbd: C_AioCompletion::finish: r=0
...
2023-05-29T18:51:02.612+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_create_snapshot for 2//14a6a474a58d0: r=-108, snap_id=None
2023-05-29T18:51:02.612+0000 7fb11b5d1640  0 [rbd_support ERROR root] error when creating snapshot for 2//14a6a474a58d0: -108
2023-05-29T18:51:02.612+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.close_image 2//14a6a474a58d0
```

--- Additional comment from Ram Raja on 2023-06-06 12:52:13 UTC ---

The mgr logs from around the time the rbd_support module failed to recover are in "2023-05-29T18:51.txt". The file is used to compare, among the images scheduled for snapshots, the logs of images whose snapshot creation errored out when the RADOS client was blocklisted with those whose snapshot creation did not.

Below are log snippets for images scheduled for snapshots whose snapshot creation errored out as expected when the client was blocklisted.
```
$ grep -E '171317162a276|a92d6ef8-68ce-4a33-b9d8-916198f12469' 2023-05-29T18\:51.txt
2023-05-29T18:51:00.006+0000 7fb0fa44f640  0 [rbd_support DEBUG root] CreateSnapshotRequests.add: 2//171317162a276
2023-05-29T18:51:00.006+0000 7fb0fa44f640  0 [rbd_support DEBUG root] CreateSnapshotRequests.open_image: 2//171317162a276
2023-05-29T18:51:00.008+0000 7fb0fa44f640 10 librbd::ImageCtx: 0x55fa6610e000 ImageCtx: image_name=, image_id=171317162a276
2023-05-29T18:51:00.008+0000 7fb0fa44f640  0 [rbd_support DEBUG root] MirrorSnapshotScheduleHandler: scheduling 2//171317162a276 at 2023-05-29 18:54:00
2023-05-29T18:51:00.009+0000 7fb11add0640 10 librbd::ImageCtx: init_layout stripe_unit 4194304 stripe_count 1 object_size 4194304 prefix rbd_data.171317162a276 format rbd_data.171317162a276.%016llx
2023-05-29T18:51:00.010+0000 7fb11b5d1640 20 librbd::image::RefreshRequest: new snapshot id=484290 name=.mirror.primary.a92d6ef8-68ce-4a33-b9d8-916198f12469.8e9f4188-eb6b-4af8-8627-f93403afd0c5 size=107374182400
2023-05-29T18:51:00.010+0000 7fb11b5d1640 20 librbd::image::RefreshRequest: new snapshot id=484108 name=.mirror.primary.a92d6ef8-68ce-4a33-b9d8-916198f12469.c9ef235d-c490-4604-a11a-6930f8f8daad size=107374182400
2023-05-29T18:51:00.010+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_open_image 2//171317162a276: r=0
2023-05-29T18:51:00.010+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.get_mirror_mode: 2//171317162a276
2023-05-29T18:51:00.011+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_get_mirror_mode 2//171317162a276: r=0 mode=1
2023-05-29T18:51:00.011+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.get_mirror_info: 2//171317162a276
2023-05-29T18:51:00.011+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_get_mirror_info 2//171317162a276: r=0 info={'global_id': 'a92d6ef8-68ce-4a33-b9d8-916198f12469', 'state': 1, 'primary': True}
2023-05-29T18:51:00.011+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.create_snapshot for 2//171317162a276
2023-05-29T18:51:00.011+0000 7fb11add0640 15 librbd::mirror::snapshot::CreatePrimaryRequest: 0x55fa5955fd40 create_snapshot: name=.mirror.primary.a92d6ef8-68ce-4a33-b9d8-916198f12469.b733b096-4479-446d-af38-96e4bf4dd3d1, ns=[mirror state=primary, complete=0, mirror_peer_uuids=52d86adc-3b8e-4aba-b0d2-9818b7f83340, clean_since_snap_id=head]
2023-05-29T18:51:00.011+0000 7fb11add0640  5 librbd::Operations: 0x55fa5e73d3b0 snap_create: snap_name=.mirror.primary.a92d6ef8-68ce-4a33-b9d8-916198f12469.b733b096-4479-446d-af38-96e4bf4dd3d1
2023-05-29T18:51:00.012+0000 7fb11b5d1640 10 librbd::object_map::RefreshRequest: 0x55fa5b547860 send_lock: oid=rbd_object_map.171317162a276
2023-05-29T18:51:00.012+0000 7fb11b5d1640 10 librbd::object_map::LockRequest: 0x55fa5dc66fc0 send_lock: oid=rbd_object_map.171317162a276
2023-05-29T18:51:00.012+0000 7fb11b5d1640 10 librbd::object_map::RefreshRequest: 0x55fa5b547860 send_load: oid=rbd_object_map.171317162a276
2023-05-29T18:51:00.013+0000 7fb11b5d1640  5 librbd::Operations: 0x55fa5e73d3b0 execute_snap_create: snap_name=.mirror.primary.a92d6ef8-68ce-4a33-b9d8-916198f12469.b733b096-4479-446d-af38-96e4bf4dd3d1
2023-05-29T18:51:02.613+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_create_snapshot for 2//171317162a276: r=-108, snap_id=None
2023-05-29T18:51:02.613+0000 7fb11b5d1640  0 [rbd_support ERROR root] error when creating snapshot for 2//171317162a276: -108
2023-05-29T18:51:02.613+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.close_image 2//171317162a276
2023-05-29T18:51:02.613+0000 7fb11b5d1640 10 librbd::object_map::UnlockRequest: 0x55fa63e074d0 send_unlock: oid=rbd_object_map.171317162a276
2023-05-29T18:51:02.616+0000 7fb1045a3640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_close_image 2//171317162a276: r=0
2023-05-29T18:51:02.616+0000 7fb1045a3640  0 [rbd_support DEBUG root] CreateSnapshotRequests.finish: 2//171317162a276
```

and

```
$ grep -E '1bded98edb7a0|35496098-d456-4cf9-940f-fb67aac15d69' 2023-05-29T18\:51.txt
2023-05-29T18:51:00.008+0000 7fb0fa44f640  0 [rbd_support DEBUG root] CreateSnapshotRequests.add: 2//1bded98edb7a0
2023-05-29T18:51:00.008+0000 7fb0fa44f640  0 [rbd_support DEBUG root] CreateSnapshotRequests.open_image: 2//1bded98edb7a0
2023-05-29T18:51:00.010+0000 7fb0fa44f640 10 librbd::ImageCtx: 0x55fa5d37a000 ImageCtx: image_name=, image_id=1bded98edb7a0
2023-05-29T18:51:00.010+0000 7fb0fa44f640  0 [rbd_support DEBUG root] MirrorSnapshotScheduleHandler: scheduling 2//1bded98edb7a0 at 2023-05-29 18:54:00
2023-05-29T18:51:00.011+0000 7fb11b5d1640 10 librbd::ImageCtx: init_layout stripe_unit 4194304 stripe_count 1 object_size 4194304 prefix rbd_data.1bded98edb7a0 format rbd_data.1bded98edb7a0.%016llx
2023-05-29T18:51:00.012+0000 7fb11add0640 20 librbd::image::RefreshRequest: new snapshot id=484296 name=.mirror.primary.35496098-d456-4cf9-940f-fb67aac15d69.a32de766-e1b4-40f2-8dd1-208d31e27d23 size=107374182400
2023-05-29T18:51:00.012+0000 7fb11add0640 20 librbd::image::RefreshRequest: new snapshot id=484114 name=.mirror.primary.35496098-d456-4cf9-940f-fb67aac15d69.21c93a23-c7af-4d50-9ee7-8692ef5e6ebb size=107374182400
2023-05-29T18:51:00.012+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_open_image 2//1bded98edb7a0: r=0
2023-05-29T18:51:00.012+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.get_mirror_mode: 2//1bded98edb7a0
2023-05-29T18:51:00.013+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_get_mirror_mode 2//1bded98edb7a0: r=0 mode=1
2023-05-29T18:51:00.013+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.get_mirror_info: 2//1bded98edb7a0
2023-05-29T18:51:00.013+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_get_mirror_info 2//1bded98edb7a0: r=0 info={'global_id': '35496098-d456-4cf9-940f-fb67aac15d69', 'state': 1, 'primary': True}
2023-05-29T18:51:00.013+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.create_snapshot for 2//1bded98edb7a0
2023-05-29T18:51:00.013+0000 7fb11add0640 15 librbd::mirror::snapshot::CreatePrimaryRequest: 0x55fa5955f6c0 create_snapshot: name=.mirror.primary.35496098-d456-4cf9-940f-fb67aac15d69.6b6c5a31-828e-48c7-8ebe-c017679df345, ns=[mirror state=primary, complete=0, mirror_peer_uuids=52d86adc-3b8e-4aba-b0d2-9818b7f83340, clean_since_snap_id=head]
2023-05-29T18:51:00.013+0000 7fb11add0640  5 librbd::Operations: 0x55fa5e73d560 snap_create: snap_name=.mirror.primary.35496098-d456-4cf9-940f-fb67aac15d69.6b6c5a31-828e-48c7-8ebe-c017679df345
2023-05-29T18:51:00.014+0000 7fb11add0640 10 librbd::ImageWatcher: 0x55fa650cdb00 remote snap_create request: [342496,94533745813632,15107] [mirror state=primary, complete=0, mirror_peer_uuids=52d86adc-3b8e-4aba-b0d2-9818b7f83340, clean_since_snap_id=head] .mirror.primary.35496098-d456-4cf9-940f-fb67aac15d69.6b6c5a31-828e-48c7-8ebe-c017679df345 0
2023-05-29T18:51:00.314+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_create_snapshot for 2//1bded98edb7a0: r=-108, snap_id=None
2023-05-29T18:51:00.314+0000 7fb11b5d1640  0 [rbd_support ERROR root] error when creating snapshot for 2//1bded98edb7a0: -108
2023-05-29T18:51:00.314+0000 7fb11b5d1640  0 [rbd_support DEBUG root] CreateSnapshotRequests.close_image 2//1bded98edb7a0
2023-05-29T18:51:00.314+0000 7fb1045a3640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_close_image 2//1bded98edb7a0: r=0
2023-05-29T18:51:00.314+0000 7fb1045a3640  0 [rbd_support DEBUG root] CreateSnapshotRequests.finish: 2//1bded98edb7a0
```

The following is the log snippet for the image that was scheduled for a mirror snapshot and whose snapshot creation did not error out when the rbd_support module's client was blocklisted. This stalled the recovery of the rbd_support module.
```
$ grep -E '1231886181b19|0d8be64e-956a-42be-817c-b1945343b921' 2023-05-29T18\:51.txt
2023-05-29T18:50:59.999+0000 7fb0fa44f640  0 [rbd_support DEBUG root] CreateSnapshotRequests.add: 2//1231886181b19
2023-05-29T18:50:59.999+0000 7fb0fa44f640  0 [rbd_support DEBUG root] CreateSnapshotRequests.open_image: 2//1231886181b19
2023-05-29T18:51:00.001+0000 7fb0fa44f640 10 librbd::ImageCtx: 0x55fa61151000 ImageCtx: image_name=, image_id=1231886181b19
2023-05-29T18:51:00.001+0000 7fb0fa44f640  0 [rbd_support DEBUG root] MirrorSnapshotScheduleHandler: scheduling 2//1231886181b19 at 2023-05-29 18:54:00
2023-05-29T18:51:00.002+0000 7fb11add0640 10 librbd::ImageCtx: init_layout stripe_unit 4194304 stripe_count 1 object_size 4194304 prefix rbd_data.1231886181b19 format rbd_data.1231886181b19.%016llx
2023-05-29T18:51:00.005+0000 7fb11b5d1640 20 librbd::image::RefreshRequest: new snapshot id=484294 name=.mirror.primary.0d8be64e-956a-42be-817c-b1945343b921.a45b233b-5297-4504-8c8a-cf16fa1b5b86 size=107374182400
2023-05-29T18:51:00.005+0000 7fb11b5d1640 20 librbd::image::RefreshRequest: new snapshot id=484112 name=.mirror.primary.0d8be64e-956a-42be-817c-b1945343b921.f5fea28e-0750-4df4-90f1-66176b95b092 size=107374182400
2023-05-29T18:51:00.007+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_open_image 2//1231886181b19: r=0
2023-05-29T18:51:00.007+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.get_mirror_mode: 2//1231886181b19
2023-05-29T18:51:00.008+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_get_mirror_mode 2//1231886181b19: r=0 mode=1
2023-05-29T18:51:00.008+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.get_mirror_info: 2//1231886181b19
2023-05-29T18:51:00.008+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.handle_get_mirror_info 2//1231886181b19: r=0 info={'global_id': '0d8be64e-956a-42be-817c-b1945343b921', 'state': 1, 'primary': True}
2023-05-29T18:51:00.008+0000 7fb11add0640  0 [rbd_support DEBUG root] CreateSnapshotRequests.create_snapshot for 2//1231886181b19
2023-05-29T18:51:00.008+0000 7fb11add0640 15 librbd::mirror::snapshot::CreatePrimaryRequest: 0x55fa5955e4e0 create_snapshot: name=.mirror.primary.0d8be64e-956a-42be-817c-b1945343b921.cf193596-7096-4d58-bd60-a6727703a3c5, ns=[mirror state=primary, complete=0, mirror_peer_uuids=52d86adc-3b8e-4aba-b0d2-9818b7f83340, clean_since_snap_id=head]
2023-05-29T18:51:00.008+0000 7fb11add0640  5 librbd::Operations: 0x55fa5e73c630 snap_create: snap_name=.mirror.primary.0d8be64e-956a-42be-817c-b1945343b921.cf193596-7096-4d58-bd60-a6727703a3c5
2023-05-29T18:51:00.010+0000 7fb11add0640 10 librbd::ImageWatcher: 0x55fa64aa8f00 remote snap_create request: [342496,94533739825408,15101] [mirror state=primary, complete=0, mirror_peer_uuids=52d86adc-3b8e-4aba-b0d2-9818b7f83340, clean_since_snap_id=head] .mirror.primary.0d8be64e-956a-42be-817c-b1945343b921.cf193596-7096-4d58-bd60-a6727703a3c5 0
```

Looking at these log snippets, it seems that librbd sometimes fails to error out on client blocklisting after the remote snap_create request is issued.
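
To spell out the caller-side consequence: the rbd_support handler only learns the outcome of a scheduled snapshot through librbd's completion callback, so if librbd never completes the operation, the request never leaves the pending set and the shutdown's wait_for_pending() never returns. A small sketch of that dependency; the asynchronous call and its callback below are hypothetical stand-ins, not the real rbd bindings:
```
# Sketch of the caller-side dependency; aio_create_mirror_snapshot and the
# callback signature are hypothetical stand-ins, not the real rbd bindings.
ESHUTDOWN = 108


class DummyImage:
    """Stand-in that completes immediately with -ESHUTDOWN, as the errored-out
    images did. The stuck images behave as if on_complete never runs."""
    def aio_create_mirror_snapshot(self, on_complete):
        on_complete(-ESHUTDOWN, None)


def create_snapshot(image, image_spec, pending):
    def handle_create_snapshot(result, snap_id):
        # The completion callback is the only place the request is removed
        # from the pending set; if librbd never calls it, wait_for_pending()
        # in the handler shutdown never returns.
        pending.discard(image_spec)

    image.aio_create_mirror_snapshot(handle_create_snapshot)


pending = {"2//1231886181b19"}
create_snapshot(DummyImage(), "2//1231886181b19", pending)
assert not pending   # callback fired, request drained as expected
```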

--- Additional comment from Ilya Dryomov on 2023-06-06 20:39:21 UTC ---

(In reply to Ram Raja from comment #6)
> Looking at these log snippets, it seems like librbd code sometimes fails to
> error out due to client blocklisting after remote snap_create request is
> issued.

I think you got sidetracked by the remote snap_create request code because the image actually gets past it.  The conclusion is on point though -- we are looking at another exclusive lock acquisition vs. blocklisting race.  I homed in on this image and filed https://tracker.ceph.com/issues/61607 with my findings.

--- Additional comment from Ilya Dryomov on 2023-06-19 17:44:40 UTC ---

Likely also encountered in BZ 2215982, this time in ODF and so without blocklisting on purpose.

--- Additional comment from Mudit Agarwal on 2023-07-13 03:58:14 UTC ---

Ilya, are we tracking this as a 6.1z1 blocker?

--- Additional comment from Ilya Dryomov on 2023-07-13 08:14:31 UTC ---

Hi Mudit,

I haven't heard of a decision to block 6.1z1 on anything that is still WIP, so this BZ is very likely to be moved to 6.1z2.

--- Additional comment from Red Hat Bugzilla on 2023-07-31 21:50:26 UTC ---

remove performed by PnT Account Manager <pnt-expunge>

--- Additional comment from Red Hat Bugzilla on 2023-07-31 21:50:34 UTC ---

remove performed by PnT Account Manager <pnt-expunge>

--- Additional comment from Red Hat Bugzilla on 2023-08-03 08:28:48 UTC ---

Account disabled by LDAP Audit

--- Additional comment from Ilya Dryomov on 2023-09-04 15:39:51 UTC ---

Pushed to ceph-6.1-rhel-patches.

Comment 7 Akash Raj 2023-09-20 08:32:16 UTC
Hi Ram.

Could you please confirm if this BZ needs to be added in the 7.0 RN? If so, please provide the doc type and text.

Thanks.

Comment 8 Ram Raja 2023-09-25 15:46:12 UTC
Hi Akash,

Yes, you can add the BZ to the release notes.

The doc text is the same as the one I provided for the BZ this was cloned from, https://bugzilla.redhat.com/show_bug.cgi?id=2211290. I've copy-pasted it here.

Comment 9 errata-xmlrpc 2023-12-13 15:22:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7780

