Bug 2247543 - [rbd_support] fix hangs and mgr crash when rbd_support module tries to recover from repeated blocklisting
Summary: [rbd_support] fix hangs and mgr crash when rbd_support module tries to recover from repeated blocklisting
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD-Mirror
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 6.1z3
Assignee: Ram Raja
QA Contact: Sunil Angadi
URL:
Whiteboard:
Depends On: 2247531
Blocks:
 
Reported: 2023-11-01 21:54 UTC by Ram Raja
Modified: 2023-12-12 13:56 UTC
CC List: 7 users

Fixed In Version: ceph-17.2.6-153.el9cp
Doc Type: Bug Fix
Doc Text:
.`rbd_support` module no longer fails to recover from repeated blocklisting of its client
Previously, the `rbd_support` module failed to recover from repeated blocklisting of its client due to a recursive deadlock in the `rbd_support` module, a race condition in the `rbd_support` module's librbd client, and a bug in the librbd Cython bindings that sometimes crashed the `ceph-mgr` daemon.
With this release, all three issues are fixed, and the `rbd_support` module no longer fails to recover from repeated blocklisting of its client.
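The recursive deadlock mentioned above can be pictured as a handler re-acquiring a non-reentrant lock it already holds. Below is a minimal, hypothetical Python sketch of that pattern only; it is not the actual `rbd_support` module code, and the function names are made up for illustration.

    import threading

    # Hypothetical illustration only; not the actual rbd_support code.
    # A non-reentrant lock acquired twice on the same thread never unblocks.
    lock = threading.Lock()

    def refresh_schedules():
        with lock:                  # second acquire on the same thread
            pass                    # never reached: the thread deadlocks above

    def handle_client_blocklisted():
        with lock:                  # first acquire
            refresh_schedules()     # re-enters code that takes the same lock

    # Calling handle_client_blocklisted() would hang the calling thread forever;
    # using threading.RLock(), or not re-entering, avoids the self-deadlock.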
Clone Of: 2247531
Environment:
Last Closed: 2023-12-12 13:56:05 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 62891 0 None None None 2023-11-01 21:54:00 UTC
Ceph Project Bug Tracker 62994 0 None None None 2023-11-01 21:54:00 UTC
Ceph Project Bug Tracker 63009 0 None None None 2023-11-01 21:54:00 UTC
Ceph Project Bug Tracker 63028 0 None None None 2023-11-01 21:54:00 UTC
Red Hat Issue Tracker RHCEPH-7842 0 None None None 2023-11-01 21:55:03 UTC
Red Hat Product Errata RHSA-2023:7740 0 None None None 2023-12-12 13:56:07 UTC

Comment 5 Ram Raja 2023-11-22 18:03:31 UTC
(In reply to Sunil Angadi from comment #4)
...
> 
> Tried the test mentioned here
> https://tracker.ceph.com/issues/61607#note-2
> 
> ran the script for more than 3 hours,
> 
> also used this upstream script https://github.com/ceph/ceph/pull/53535
> to test it on downstream Ceph clusters
> 
> + sleep 10
> ++ ceph mgr dump
> ++ jq '.active_clients[]'
> ++ jq 'select(.name == "rbd_support")'
> ++ jq -r '[.addrvec[0].addr, "/", .addrvec[0].nonce|tostring] | add'
> + CLIENT_ADDR=10.8.131.14:0/3677632432
> ++ date +%s
> + CURRENT_TIME=1700653795
> + (( CURRENT_TIME <= END_TIME ))
> + [[ -n 10.8.131.14:0/3677632432 ]]
> + [[ 10.8.131.14:0/3677632432 !=
> \1\0\.\8\.\1\3\1\.\1\4\:\0\/\3\6\7\7\6\3\2\4\3\2 ]]
> + sleep 10
> 
> Tried it on multiple snapshot scheduling intervals like 1m, 3m, and 5m
> while initiating the client blocklist script, and was
> able to run IO continuously.
> 

Hey Sunil, this is great. Can you let me know how many times you ran the script for 3 hours?

> Each time, observed that the rbd_support client recovered and
> mirror snapshots got created as per the schedule.
> 
...
> 
> 
>  @Raman,
> seen some logs in mgr as below;
> can you please confirm these are expected due to the client blocklist test?
> 
> 2023-11-22T10:14:02.085+0000 7fe573234640 -1 librbd::io::AioCompletion:
> 0x55d8d209fce0 fail: (108) Cannot send after transport endpoint shutdown
> 2023-11-22T10:14:02.085+0000 7fe573a35640 -1
> librbd::watcher::RewatchRequest: 0x55d8ce46b810 handle_unwatch client
> blocklisted
> 2023-11-22T10:14:02.085+0000 7fe573a35640 -1
> librbd::watcher::RewatchRequest: 0x55d8d068a5a0 handle_unwatch client
> blocklisted
> 2023-11-22T10:14:02.085+0000 7fe573a35640 -1 librbd::ImageWatcher:
> 0x55d8cd5e0c00 image watch failed: 94389613608960, (107) Transport endpoint
> is not connected
> 2023-11-22T10:14:02.085+0000 7fe573234640 -1
> librbd::managed_lock::BreakRequest: 0x55d8cca16500 handle_break_lock: failed
> to break lock: (108) Cannot send after transport endpoint shutdown
> 2023-11-22T10:14:02.085+0000 7fe573a35640 -1 librbd::Watcher: 0x55d8cd5e0c00
> handle_error: handle=94389613608960: (107) Transport endpoint is not
> connected
> 2023-11-22T10:14:02.085+0000 7fe573234640 -1
> librbd::managed_lock::AcquireRequest: 0x55d8d0000690 handle_break_lock:
> failed to break lock : (108) Cannot send after transport endpoint shutdown
> 2023-11-22T10:14:02.085+0000 7fe573234640 -1 librbd::ManagedLock:
> 0x55d8d0531bb8 handle_acquire_lock: failed to acquire exclusive lock: (108)
> Cannot send after transport endpoint shutdown
> 2023-11-22T10:14:02.085+0000 7fe573a35640 -1 librbd::ImageWatcher:
> 0x55d8cdcb6900 image watch failed: 94389702976512, (107) Transport endpoint
> is not connected
> 2023-11-22T10:14:02.085+0000 7fe573a35640 -1 librbd::Watcher: 0x55d8cdcb6900
> handle_error: handle=94389702976512: (107) Transport endpoint is not
> connected
> 2023-11-22T10:14:02.085+0000 7fe573a35640 -1
> librbd::mirror::snapshot::CreatePrimaryRequest: 0x55d8c7c1a820
> handle_create_snapshot: failed to create mirror snapshot: (108) Cannot send
> after transport endpoint shutdown
> 2023-11-22T10:14:02.085+0000 7fe573a35640 -1 librbd::io::AioCompletion:
> 0x55d8d1758580 fail: (108) Cannot send after transport endpoint shutdown
> 2023-11-22T10:14:02.085+0000 7fe573234640 -1 librbd::ImageWatcher:
> 0x55d8cdb3e300 image watch failed: 94389554526848, (107) Transport endpoint
> is not connected
> 2023-11-22T10:14:02.085+0000 7fe573234640 -1 librbd::Watcher: 0x55d8cdb3e300
> handle_error: handle=94389554526848: (107) Transport endpoint is not
> connected
> 2023-11-22T10:14:02.085+0000 7fe573234640 -1
> librbd::watcher::RewatchRequest: 0x55d8ce223810 handle_unwatch client
> blocklisted
> 2023-11-22T10:14:02.085+0000 7fe573234640 -1
> librbd::watcher::RewatchRequest: 0x55d8d3f448c0 handle_unwatch client
> blocklisted
> 2023-11-22T10:14:02.085+0000 7fe573a35640 -1 librbd::image::OpenRequest:
> failed to retrieve name: (108) Cannot send after transport endpoint shutdown
> 2023-11-22T10:14:02.085+0000 7fe573234640 -1 librbd::ImageState:
> 0x55d8d193db80 failed to open image: (108) Cannot send after transport
> endpoint shutdown
> 2023-11-22T10:14:02.087+0000 7fe573a35640 -1 librbd::image::OpenRequest:
> failed to retrieve name: (108) Cannot send after transport endpoint shutdown
> 2023-11-22T10:14:02.087+0000 7fe573234640 -1 librbd::ImageState:
> 0x55d8cad28200 failed to open image: (108) Cannot send after transport
> endpoint shutdown
> 2023-11-22T10:14:02.089+0000 7fe573a35640 -1 librbd::image::OpenRequest:
> failed to retrieve name: (108) Cannot send after transport endpoint shutdown

Yes, this is expected. You're seeing the result of blocklisting the rbd_support module's client (by the script); the client is no longer able to connect to the cluster.
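
For anyone reproducing this, below is a minimal Python sketch of the blocklist loop exercised by the test, reconstructed from the `ceph mgr dump` / `jq` trace quoted above. It is not the upstream script from https://github.com/ceph/ceph/pull/53535; the helper name and loop count are made up, and only `ceph mgr dump` and `ceph osd blocklist add` are assumed to be available on the cluster.

    import json
    import subprocess
    import time

    def rbd_support_client_addr():
        # Parse `ceph mgr dump` for the rbd_support module's client address,
        # matching the jq pipeline in the quoted trace (addr + "/" + nonce).
        dump = json.loads(subprocess.check_output(["ceph", "mgr", "dump"]))
        for client in dump.get("active_clients", []):
            if client.get("name") == "rbd_support":
                av = client["addrvec"][0]
                return "{}/{}".format(av["addr"], av["nonce"])
        return None

    # Repeatedly blocklist the module's client; after each blocklisting the
    # module is expected to re-register with a new address (new nonce).
    for _ in range(10):
        addr = rbd_support_client_addr()
        if addr:
            subprocess.check_call(["ceph", "osd", "blocklist", "add", addr])
        time.sleep(10)

With the fix in place, the module's client should reappear in `ceph mgr dump` with a new nonce shortly after each blocklisting, which matches the recovery behaviour observed in the test above.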

Comment 8 errata-xmlrpc 2023-12-12 13:56:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security, enhancements, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:7740

