Previously, due to an implementation defect, the `rbd-mirror` daemon did not properly dispose of outdated `PoolReplayer` instances, in particular when refreshing the mirror peer configuration.
This led to unnecessary resource consumption, and the leftover `PoolReplayer` instances competed with each other, causing the `rbd-mirror` daemon health to be reported as ERROR and, in some cases, replication to hang. To resume replication, the administrator had to restart the `rbd-mirror` daemon.
With this fix, the implementation defect is corrected and the `rbd-mirror` daemon now properly disposes of outdated `PoolReplayer` instances.
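For reference, the workaround before the fix was a daemon restart. On a cephadm-managed deployment such as the one used for this verification, that would typically be done with something like the following (assuming the default rbd-mirror service name):

# workaround prior to the fix: bounce all rbd-mirror daemons managed by cephadm
ceph orch restart rbd-mirror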
Tested using:
ceph version 16.2.10-260.el8cp (b20e1a5452628262667a6b060687917fde010343) pacific (stable)

Followed these test steps:
---
1. Set up bidirectional mirroring on a test pool as usual
2. Verify that "rbd mirror pool status" reports "health: OK" on both clusters
3. Grab service_id and instance_id from "rbd mirror pool status --verbose" output on cluster B
4. Grab peer UUID ("UUID: ...", not "Mirror UUID: ...") from "rbd mirror pool info" output on cluster B
5. Run "rbd mirror pool peer set <peer UUID from step 4> client client.invalid" command on cluster B
6. Wait 30-90 seconds and verify that "rbd mirror pool status" reports "health: ERROR" on cluster B and "health: WARNING" on cluster A
7. Run "rbd mirror pool peer set <peer UUID from step 4> client client.rbd-mirror-peer" command on cluster B
8. Wait 30-90 seconds and verify that "rbd mirror pool status" reports "health: OK" on both clusters again
9. Grab service_id and instance_id from "rbd mirror pool status --verbose" output on cluster B again
10. Verify that service_id from step 3 is equal to the one from step 9
11. Verify that instance_id from step 3 is less than the one from step 9 (a scripted sketch of these checks follows this list)
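A minimal shell sketch of steps 3-5, 7, and 9-11 on cluster B, assuming a cephadm shell and using the pool from this run as a placeholder; the grep patterns simply pick the relevant lines out of the plain-text output shown below:

POOL=ec_img_pool_EXnhoJmOKa   # pool under test in this verification run

# steps 3 and 9: capture the service id and instance_id from the verbose status
rbd mirror pool status -p "$POOL" --verbose | grep -E 'service |instance_id'

# step 4: capture the peer UUID (the "UUID:" line, not "Mirror UUID:")
rbd mirror pool info -p "$POOL" | grep -E '^[[:space:]]*UUID:'

# step 5: break the peer connection, then step 7: restore it
rbd mirror pool peer set --pool "$POOL" <peer UUID from step 4> client client.invalid
rbd mirror pool peer set --pool "$POOL" <peer UUID from step 4> client client.rbd-mirror-peer

# steps 10-11: the service id should stay the same across the two captures,
# while instance_id should increase (old PoolReplayer disposed of, new one created)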
On primary
---
[ceph: root@ceph-rbd1-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool status -p ec_img_pool_EXnhoJmOKa --verbose
health: OK
daemon health: OK
image health: OK
images: 1 total
    1 replaying

DAEMONS
service 24425:
  instance_id: 24611
  client_id: ceph-rbd1-sangadi-bz-tfrmy6-node5.qxijuw
  hostname: ceph-rbd1-sangadi-bz-tfrmy6-node5
  version: 16.2.10-260.el8cp
  leader: true
  health: OK

IMAGES
ec_imageBkzgATmZuh:
  global_id: 8b2121a1-93d6-4d02-80e0-d95324a285e9
  state: up+stopped
  description: local image is primary
  service: ceph-rbd1-sangadi-bz-tfrmy6-node5.qxijuw on ceph-rbd1-sangadi-bz-tfrmy6-node5
  last_update: 2024-06-04 08:54:15
  peer_sites:
    name: ceph-rbd2
    state: up+replaying
    description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"local_snapshot_timestamp":1717491240,"remote_snapshot_timestamp":1717491240,"replay_state":"idle"}
    last_update: 2024-06-04 08:54:20
On secondary
---
[ceph: root@ceph-rbd2-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool status -p ec_img_pool_EXnhoJmOKa --verbose
health: OK
daemon health: OK
image health: OK
images: 1 total
    1 replaying

DAEMONS
service 24418:
  instance_id: 24586
  client_id: ceph-rbd2-sangadi-bz-tfrmy6-node5.hfmyci
  hostname: ceph-rbd2-sangadi-bz-tfrmy6-node5
  version: 16.2.10-260.el8cp
  leader: true
  health: OK

IMAGES
ec_imageBkzgATmZuh:
  global_id: 8b2121a1-93d6-4d02-80e0-d95324a285e9
  state: up+replaying
  description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"local_snapshot_timestamp":1717491000,"remote_snapshot_timestamp":1717491000,"replay_state":"idle"}
  service: ceph-rbd2-sangadi-bz-tfrmy6-node5.hfmyci on ceph-rbd2-sangadi-bz-tfrmy6-node5
  last_update: 2024-06-04 08:50:20
  peer_sites:
    name: ceph-rbd1
    state: up+stopped
    description: local image is primary
    last_update: 2024-06-04 08:50:45
[ceph: root@ceph-rbd2-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool info -p ec_img_pool_EXnhoJmOKa
Mode: image
Site Name: ceph-rbd2
Peer Sites:
UUID: c468309d-1c30-4e4f-83df-a4c5550a84d5
Name: ceph-rbd1
Mirror UUID: 51e60cf3-b64f-4efd-a3ee-ab6240c36f40
Direction: rx-tx
Client: client.rbd-mirror-peer
Set an invalid client ID for the mirror peer:
[ceph: root@ceph-rbd2-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool peer set --pool ec_img_pool_EXnhoJmOKa c468309d-1c30-4e4f-83df-a4c5550a84d5 client client.invalid
[ceph: root@ceph-rbd2-sangadi-bz-tfrmy6-node1-installer /]#
As expected, the status changed accordingly.
On secondary
---
[ceph: root@ceph-rbd2-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool status -p ec_img_pool_EXnhoJmOKa --verbose
health: ERROR
daemon health: ERROR
image health: OK
images: 1 total
    1 stopped

DAEMONS
service 24418:
  instance_id: 24586
  client_id: ceph-rbd2-sangadi-bz-tfrmy6-node5.hfmyci
  hostname: ceph-rbd2-sangadi-bz-tfrmy6-node5
  version: 16.2.10-260.el8cp
  leader: false
  health: ERROR
  callouts: unable to connect to remote cluster

IMAGES
ec_imageBkzgATmZuh:
  global_id: 8b2121a1-93d6-4d02-80e0-d95324a285e9
  state: down+stopped
  description: stopped
  last_update: 2024-06-04 08:59:52
  peer_sites:
    name: ceph-rbd1
    state: up+stopped
    description: local image is primary
    last_update: 2024-06-04 09:02:15
On primary
---
[ceph: root@ceph-rbd1-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool status -p ec_img_pool_EXnhoJmOKa --verbose
health: WARNING
daemon health: OK
image health: WARNING
images: 1 total
    1 unknown

DAEMONS
service 24425:
  instance_id: 24611
  client_id: ceph-rbd1-sangadi-bz-tfrmy6-node5.qxijuw
  hostname: ceph-rbd1-sangadi-bz-tfrmy6-node5
  version: 16.2.10-260.el8cp
  leader: true
  health: OK

IMAGES
ec_imageBkzgATmZuh:
  global_id: 8b2121a1-93d6-4d02-80e0-d95324a285e9
  state: up+stopped
  description: local image is primary
  service: ceph-rbd1-sangadi-bz-tfrmy6-node5.qxijuw on ceph-rbd1-sangadi-bz-tfrmy6-node5
  last_update: 2024-06-04 09:03:15
  peer_sites:
    name: ceph-rbd2
    state: down+stopped
    description: stopped
    last_update: 2024-06-04 08:59:52
After resetting the peer client back to client.rbd-mirror-peer, the status recovered properly.
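The reset used the same syntax as the earlier peer set command (step 7), i.e. a command of the form:

rbd mirror pool peer set --pool ec_img_pool_EXnhoJmOKa c468309d-1c30-4e4f-83df-a4c5550a84d5 client client.rbd-mirror-peer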
On secondary
---
[ceph: root@ceph-rbd2-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool status -p ec_img_pool_EXnhoJmOKa --verbose
health: OK
daemon health: OK
image health: OK
images: 1 total
    1 replaying

DAEMONS
service 24418:
  instance_id: 24742
  client_id: ceph-rbd2-sangadi-bz-tfrmy6-node5.hfmyci
  hostname: ceph-rbd2-sangadi-bz-tfrmy6-node5
  version: 16.2.10-260.el8cp
  leader: true
  health: OK

IMAGES
ec_imageBkzgATmZuh:
  global_id: 8b2121a1-93d6-4d02-80e0-d95324a285e9
  state: up+replaying
  description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"local_snapshot_timestamp":1717492080,"remote_snapshot_timestamp":1717492080,"replay_state":"idle"}
  service: ceph-rbd2-sangadi-bz-tfrmy6-node5.hfmyci on ceph-rbd2-sangadi-bz-tfrmy6-node5
  last_update: 2024-06-04 09:08:53
  peer_sites:
    name: ceph-rbd1
    state: up+stopped
    description: local image is primary
    last_update: 2024-06-04 09:08:45
On primary
---
[ceph: root@ceph-rbd1-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool status -p ec_img_pool_EXnhoJmOKa --verbose
health: OK
daemon health: OK
image health: OK
images: 1 total
    1 replaying

DAEMONS
service 24425:
  instance_id: 24611
  client_id: ceph-rbd1-sangadi-bz-tfrmy6-node5.qxijuw
  hostname: ceph-rbd1-sangadi-bz-tfrmy6-node5
  version: 16.2.10-260.el8cp
  leader: true
  health: OK

IMAGES
ec_imageBkzgATmZuh:
  global_id: 8b2121a1-93d6-4d02-80e0-d95324a285e9
  state: up+stopped
  description: local image is primary
  service: ceph-rbd1-sangadi-bz-tfrmy6-node5.qxijuw on ceph-rbd1-sangadi-bz-tfrmy6-node5
  last_update: 2024-06-04 09:09:45
  peer_sites:
    name: ceph-rbd2
    state: up+replaying
    description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"local_snapshot_timestamp":1717492140,"remote_snapshot_timestamp":1717492140,"replay_state":"idle"}
    last_update: 2024-06-04 09:09:53
Also noted that the service id on cluster B remains the same (24418) while the instance_id increased from the earlier capture to the later one (24586 before, 24742 after), i.e. the instance_id from step 3 is less than the one from step 9.
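For completeness, the step 10/11 comparison using the values observed above (variable names are illustrative only):

BEFORE_SERVICE=24418; BEFORE_INSTANCE=24586   # from the status captured before the peer change
AFTER_SERVICE=24418;  AFTER_INSTANCE=24742    # from the status captured after restoring the peer client

# step 10: service id unchanged; step 11: instance_id strictly increased
[ "$BEFORE_SERVICE" -eq "$AFTER_SERVICE" ] && [ "$BEFORE_INSTANCE" -lt "$AFTER_INSTANCE" ] \
  && echo "PASS: PoolReplayer was recreated within the same service" || echo "FAIL"

Both checks pass for the outputs above.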
RBD mirror sanity tests also look good.
Moving it to Verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4118