Bug 2247543

Summary: [rbd_support] fix hangs and mgr crash when rbd_support module tries to recover from repeated blocklisting
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Ram Raja <rraja>
Component: RBD-Mirror    Assignee: Ram Raja <rraja>
Status: CLOSED ERRATA QA Contact: Sunil Angadi <sangadi>
Severity: high Docs Contact:
Priority: unspecified    
Version: 6.1    CC: ceph-eng-bugs, cephqe-warriors, dwalveka, idryomov, mgowri, sangadi, tserlin
Target Milestone: ---   
Target Release: 6.1z3   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ceph-17.2.6-153.el9cp Doc Type: Bug Fix
Doc Text:
.`rbd_support` module no longer fails to recover from repeated blocklisting of its client

Previously, the `rbd_support` module failed to recover from repeated blocklisting of its client due to a recursive deadlock in the `rbd_support` module, a race condition in the module's librbd client, and a bug in the librbd Cython bindings that sometimes crashed `ceph-mgr`.

With this release, all three issues are fixed, and the `rbd_support` module recovers from repeated blocklisting of its client.
Story Points: ---
Clone Of: 2247531 Environment:
Last Closed: 2023-12-12 13:56:05 UTC
Bug Depends On: 2247531    
Bug Blocks:    

Comment 5 Ram Raja 2023-11-22 18:03:31 UTC
(In reply to Sunil Angadi from comment #4)
...
> 
> Tried the test mentioned here
> https://tracker.ceph.com/issues/61607#note-2
> 
> ran the script for more than 3 hours,
> 
> also used this upstream script https://github.com/ceph/ceph/pull/53535
> to test it on downstream Ceph clusters
> 
> + sleep 10
> ++ ceph mgr dump
> ++ jq '.active_clients[]'
> ++ jq 'select(.name == "rbd_support")'
> ++ jq -r '[.addrvec[0].addr, "/", .addrvec[0].nonce|tostring] | add'
> + CLIENT_ADDR=10.8.131.14:0/3677632432
> ++ date +%s
> + CURRENT_TIME=1700653795
> + (( CURRENT_TIME <= END_TIME ))
> + [[ -n 10.8.131.14:0/3677632432 ]]
> + [[ 10.8.131.14:0/3677632432 !=
> \1\0\.\8\.\1\3\1\.\1\4\:\0\/\3\6\7\7\6\3\2\4\3\2 ]]
> + sleep 10
> 
> Tried it with multiple snapshot scheduling intervals like 1m, 3m, and 5m
> while running the client blocklist script, and
> was able to run IO continuously
> 

Hey Sunil, this is great. Can you let me know how many times you ran the script for 3 hours?
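
As an aside, the three chained jq invocations in the trace can be collapsed
into a single filter. A minimal sketch, assuming the `ceph mgr dump` output
schema shown above:

  ceph mgr dump | jq -r '
      .active_clients[]
      | select(.name == "rbd_support")
      | [.addrvec[0].addr, "/", (.addrvec[0].nonce | tostring)]
      | add'

This prints the module client's address as addr/nonce, e.g.
10.8.131.14:0/3677632432, which is what the script compares across
iterations.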

> Each time, observed that the rbd_support client recovered and
> mirror snapshots were created as per the schedule
> 
...
> 
> 
>  @Ram,
> seen some logs in the mgr as below,
> can you please confirm these are expected due to the client blocklist test?
> 
> 2023-11-22T10:14:02.085+0000 7fe573234640 -1 librbd::io::AioCompletion:
> 0x55d8d209fce0 fail: (108) Cannot send after transport endpoint shutdown
> 2023-11-22T10:14:02.085+0000 7fe573a35640 -1
> librbd::watcher::RewatchRequest: 0x55d8ce46b810 handle_unwatch client
> blocklisted
> 2023-11-22T10:14:02.085+0000 7fe573a35640 -1
> librbd::watcher::RewatchRequest: 0x55d8d068a5a0 handle_unwatch client
> blocklisted
> 2023-11-22T10:14:02.085+0000 7fe573a35640 -1 librbd::ImageWatcher:
> 0x55d8cd5e0c00 image watch failed: 94389613608960, (107) Transport endpoint
> is not connected
> 2023-11-22T10:14:02.085+0000 7fe573234640 -1
> librbd::managed_lock::BreakRequest: 0x55d8cca16500 handle_break_lock: failed
> to break lock: (108) Cannot send after transport endpoint shutdown
> 2023-11-22T10:14:02.085+0000 7fe573a35640 -1 librbd::Watcher: 0x55d8cd5e0c00
> handle_error: handle=94389613608960: (107) Transport endpoint is not
> connected
> 2023-11-22T10:14:02.085+0000 7fe573234640 -1
> librbd::managed_lock::AcquireRequest: 0x55d8d0000690 handle_break_lock:
> failed to break lock : (108) Cannot send after transport endpoint shutdown
> 2023-11-22T10:14:02.085+0000 7fe573234640 -1 librbd::ManagedLock:
> 0x55d8d0531bb8 handle_acquire_lock: failed to acquire exclusive lock: (108)
> Cannot send after transport endpoint shutdown
> 2023-11-22T10:14:02.085+0000 7fe573a35640 -1 librbd::ImageWatcher:
> 0x55d8cdcb6900 image watch failed: 94389702976512, (107) Transport endpoint
> is not connected
> 2023-11-22T10:14:02.085+0000 7fe573a35640 -1 librbd::Watcher: 0x55d8cdcb6900
> handle_error: handle=94389702976512: (107) Transport endpoint is not
> connected
> 2023-11-22T10:14:02.085+0000 7fe573a35640 -1
> librbd::mirror::snapshot::CreatePrimaryRequest: 0x55d8c7c1a820
> handle_create_snapshot: failed to create mirror snapshot: (108) Cannot send
> after transport endpoint shutdown
> 2023-11-22T10:14:02.085+0000 7fe573a35640 -1 librbd::io::AioCompletion:
> 0x55d8d1758580 fail: (108) Cannot send after transport endpoint shutdown
> 2023-11-22T10:14:02.085+0000 7fe573234640 -1 librbd::ImageWatcher:
> 0x55d8cdb3e300 image watch failed: 94389554526848, (107) Transport endpoint
> is not connected
> 2023-11-22T10:14:02.085+0000 7fe573234640 -1 librbd::Watcher: 0x55d8cdb3e300
> handle_error: handle=94389554526848: (107) Transport endpoint is not
> connected
> 2023-11-22T10:14:02.085+0000 7fe573234640 -1
> librbd::watcher::RewatchRequest: 0x55d8ce223810 handle_unwatch client
> blocklisted
> 2023-11-22T10:14:02.085+0000 7fe573234640 -1
> librbd::watcher::RewatchRequest: 0x55d8d3f448c0 handle_unwatch client
> blocklisted
> 2023-11-22T10:14:02.085+0000 7fe573a35640 -1 librbd::image::OpenRequest:
> failed to retrieve name: (108) Cannot send after transport endpoint shutdown
> 2023-11-22T10:14:02.085+0000 7fe573234640 -1 librbd::ImageState:
> 0x55d8d193db80 failed to open image: (108) Cannot send after transport
> endpoint shutdown
> 2023-11-22T10:14:02.087+0000 7fe573a35640 -1 librbd::image::OpenRequest:
> failed to retrieve name: (108) Cannot send after transport endpoint shutdown
> 2023-11-22T10:14:02.087+0000 7fe573234640 -1 librbd::ImageState:
> 0x55d8cad28200 failed to open image: (108) Cannot send after transport
> endpoint shutdown
> 2023-11-22T10:14:02.089+0000 7fe573a35640 -1 librbd::image::OpenRequest:
> failed to retrieve name: (108) Cannot send after transport endpoint shutdown

Yes, this is expected. You're seeing the result of blocklisting the rbd_support module's client (by the script); the client is no longer able to connect to the cluster.
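
For anyone reproducing this by hand rather than with the script, the cycle
boils down to something like the following. This is only a sketch, assuming
`jq` is installed and the `ceph` CLI can reach the cluster; the variable
names are illustrative:

  # Grab the rbd_support client's current addr/nonce from the mgr map.
  OLD_ADDR=$(ceph mgr dump | jq -r '.active_clients[]
      | select(.name == "rbd_support")
      | [.addrvec[0].addr, "/", (.addrvec[0].nonce | tostring)] | add')

  # Blocklist it; in-flight librbd operations then fail with the (107)/(108)
  # errors shown above until the module reconnects.
  ceph osd blocklist add "$OLD_ADDR"

  # The module has recovered once it re-registers under a new address/nonce.
  while true; do
      NEW_ADDR=$(ceph mgr dump | jq -r '.active_clients[]
          | select(.name == "rbd_support")
          | [.addrvec[0].addr, "/", (.addrvec[0].nonce | tostring)] | add')
      [[ -n "$NEW_ADDR" && "$NEW_ADDR" != "$OLD_ADDR" ]] && break
      sleep 10
  done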

Comment 8 errata-xmlrpc 2023-12-12 13:56:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security, enhancements, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:7740