Bug 2281592

Summary: rbd-mirror daemon in ERROR state, requires manual restart [5.3z]
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Ilya Dryomov <idryomov>
Component: RBD-Mirror
Assignee: Ilya Dryomov <idryomov>
Status: CLOSED ERRATA
QA Contact: Sunil Angadi <sangadi>
Severity: high
Docs Contact: Disha Walvekar <dwalveka>
Priority: unspecified
Version: 5.3
CC: asriram, ceph-eng-bugs, cephqe-warriors, dwalveka, sangadi, tserlin
Target Release: 5.3z7
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-16.2.10-258.el8cp
Doc Type: Bug Fix
Doc Text:
Previously, due to an implementation defect, the `rbd-mirror` daemon did not properly dispose of outdated `PoolReplayer` instances, in particular when refreshing the mirror peer configuration. This caused unnecessary resource consumption, and the competing `PoolReplayer` instances caused the daemon's health to be reported as ERROR; in some cases, replication would hang. To resume replication, the administrator had to restart the `rbd-mirror` daemon. With this fix, the implementation defect is corrected and the `rbd-mirror` daemon now properly disposes of outdated `PoolReplayer` instances.
Clone Of: 2279528
Last Closed: 2024-06-26 10:02:43 UTC
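
The Doc Text above boils down to a disposal pattern applied on peer-configuration refresh. A minimal Python sketch of that pattern, with hypothetical names (the actual fix is in the rbd-mirror C++ code, which is not shown here):

class PoolReplayer:
    """Stands in for rbd-mirror's per-pool, per-peer replayer."""
    def __init__(self, peer_spec):
        self.peer_spec = peer_spec  # e.g. (peer UUID, peer client id)

    def start(self):
        print(f"starting replayer for {self.peer_spec}")

    def shut_down(self):
        print(f"stopping replayer for {self.peer_spec}")

def refresh_pool_replayers(pool_replayers, current_peer_specs):
    # The defect: stale entries were not removed on refresh, so outdated
    # replayers lingered and competed with their replacements.
    for spec in list(pool_replayers):
        if spec not in current_peer_specs:
            pool_replayers.pop(spec).shut_down()  # the corrected behavior
    for spec in current_peer_specs:
        if spec not in pool_replayers:
            replayer = PoolReplayer(spec)
            replayer.start()
            pool_replayers[spec] = replayer

Changing the peer client (as done in the verification below) changes the peer spec, so the old replayer must be shut down and a new one created; this is why the instance_id increases while the service_id stays the same.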

Comment 5 Sunil Angadi 2024-06-04 12:49:43 UTC
Tested using
ceph version 16.2.10-260.el8cp (b20e1a5452628262667a6b060687917fde010343) pacific (stable)

Followed these test steps (a scripted sketch of the ID checks in steps 3 and 9-11 follows the list)
---
1. Set up bidirectional mirroring on a test pool as usual
2. Verify that "rbd mirror pool status" reports "health: OK" on both clusters
3. Grab service_id and instance_id from "rbd mirror pool status --verbose" output on cluster B
4. Grab peer UUID ("UUID: ...", not "Mirror UUID: ...") from "rbd mirror pool info" output on cluster B
5. Run "rbd mirror pool peer set <peer UUID from step 4> client client.invalid" command on cluster B
6. Wait 30-90 seconds and verify that "rbd mirror pool status" reports "health: ERROR" on cluster B and "health: WARNING" on cluster A
7. Run "rbd mirror pool peer set <peer UUID from step 4> client client.rbd-mirror-peer" command on cluster B
8. Wait 30-90 seconds and verify that "rbd mirror pool status" reports "health: OK" on both clusters again
9. Grab service_id and instance_id from "rbd mirror pool status --verbose" output on cluster B again
10. Verify that service_id from step 3 is equal to the one from step 9
11. Verify that instance_id from step 3 is less than the one from step 9
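
Steps 3 and 9-11 can be scripted; a minimal Python sketch, run inside cluster B's shell, with the pool name taken from this test and the parsing based on the DAEMONS section format shown in the outputs below:

import re
import subprocess

POOL = "ec_img_pool_EXnhoJmOKa"

def daemon_ids(pool):
    # Parse "service <id>:" and "instance_id: <id>" out of the DAEMONS
    # section of "rbd mirror pool status --verbose".
    out = subprocess.run(
        ["rbd", "mirror", "pool", "status", "-p", pool, "--verbose"],
        capture_output=True, text=True, check=True).stdout
    service = int(re.search(r"^service (\d+):", out, re.M).group(1))
    instance = int(re.search(r"^\s*instance_id: (\d+)", out, re.M).group(1))
    return service, instance

# Step 3: capture the IDs on cluster B before breaking the peer.
service_before, instance_before = daemon_ids(POOL)

# ... steps 5-8: peer set to client.invalid, wait, set it back, wait ...

# Steps 9-11: same service_id (daemon was not restarted), higher
# instance_id (a new PoolReplayer instance was created).
service_after, instance_after = daemon_ids(POOL)
assert service_after == service_before
assert instance_after > instance_before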

On primary
---

[ceph: root@ceph-rbd1-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool status -p ec_img_pool_EXnhoJmOKa --verbose
health: OK
daemon health: OK
image health: OK
images: 1 total
    1 replaying

DAEMONS
service 24425:
  instance_id: 24611
  client_id: ceph-rbd1-sangadi-bz-tfrmy6-node5.qxijuw
  hostname: ceph-rbd1-sangadi-bz-tfrmy6-node5
  version: 16.2.10-260.el8cp
  leader: true
  health: OK


IMAGES
ec_imageBkzgATmZuh:
  global_id:   8b2121a1-93d6-4d02-80e0-d95324a285e9
  state:       up+stopped
  description: local image is primary
  service:     ceph-rbd1-sangadi-bz-tfrmy6-node5.qxijuw on ceph-rbd1-sangadi-bz-tfrmy6-node5
  last_update: 2024-06-04 08:54:15
  peer_sites:
    name: ceph-rbd2
    state: up+replaying
    description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"local_snapshot_timestamp":1717491240,"remote_snapshot_timestamp":1717491240,"replay_state":"idle"}
    last_update: 2024-06-04 08:54:20

On secondary
---

[ceph: root@ceph-rbd2-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool status -p ec_img_pool_EXnhoJmOKa --verbose
health: OK
daemon health: OK
image health: OK
images: 1 total
    1 replaying

DAEMONS
service 24418:
  instance_id: 24586
  client_id: ceph-rbd2-sangadi-bz-tfrmy6-node5.hfmyci
  hostname: ceph-rbd2-sangadi-bz-tfrmy6-node5
  version: 16.2.10-260.el8cp
  leader: true
  health: OK


IMAGES
ec_imageBkzgATmZuh:
  global_id:   8b2121a1-93d6-4d02-80e0-d95324a285e9
  state:       up+replaying
  description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"local_snapshot_timestamp":1717491000,"remote_snapshot_timestamp":1717491000,"replay_state":"idle"}
  service:     ceph-rbd2-sangadi-bz-tfrmy6-node5.hfmyci on ceph-rbd2-sangadi-bz-tfrmy6-node5
  last_update: 2024-06-04 08:50:20
  peer_sites:
    name: ceph-rbd1
    state: up+stopped
    description: local image is primary
    last_update: 2024-06-04 08:50:45


[ceph: root@ceph-rbd2-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool info -p ec_img_pool_EXnhoJmOKa
Mode: image
Site Name: ceph-rbd2

Peer Sites:

UUID: c468309d-1c30-4e4f-83df-a4c5550a84d5
Name: ceph-rbd1
Mirror UUID: 51e60cf3-b64f-4efd-a3ee-ab6240c36f40
Direction: rx-tx
Client: client.rbd-mirror-peer

Set an invalid client ID for the mirror peer:

[ceph: root@ceph-rbd2-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool peer set --pool ec_img_pool_EXnhoJmOKa c468309d-1c30-4e4f-83df-a4c5550a84d5 client client.invalid
[ceph: root@ceph-rbd2-sangadi-bz-tfrmy6-node1-installer /]#

As expected, the status changes accordingly.

On secondary
---
[ceph: root@ceph-rbd2-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool status -p ec_img_pool_EXnhoJmOKa --verbose
health: ERROR
daemon health: ERROR
image health: OK
images: 1 total
    1 stopped

DAEMONS
service 24418:
  instance_id: 24586
  client_id: ceph-rbd2-sangadi-bz-tfrmy6-node5.hfmyci
  hostname: ceph-rbd2-sangadi-bz-tfrmy6-node5
  version: 16.2.10-260.el8cp
  leader: false
  health: ERROR
  callouts: unable to connect to remote cluster


IMAGES
ec_imageBkzgATmZuh:
  global_id:   8b2121a1-93d6-4d02-80e0-d95324a285e9
  state:       down+stopped
  description: stopped
  last_update: 2024-06-04 08:59:52
  peer_sites:
    name: ceph-rbd1
    state: up+stopped
    description: local image is primary
    last_update: 2024-06-04 09:02:15

On primary
---
[ceph: root@ceph-rbd1-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool status -p ec_img_pool_EXnhoJmOKa --verbose
health: WARNING
daemon health: OK
image health: WARNING
images: 1 total
    1 unknown

DAEMONS
service 24425:
  instance_id: 24611
  client_id: ceph-rbd1-sangadi-bz-tfrmy6-node5.qxijuw
  hostname: ceph-rbd1-sangadi-bz-tfrmy6-node5
  version: 16.2.10-260.el8cp
  leader: true
  health: OK


IMAGES
ec_imageBkzgATmZuh:
  global_id:   8b2121a1-93d6-4d02-80e0-d95324a285e9
  state:       up+stopped
  description: local image is primary
  service:     ceph-rbd1-sangadi-bz-tfrmy6-node5.qxijuw on ceph-rbd1-sangadi-bz-tfrmy6-node5
  last_update: 2024-06-04 09:03:15
  peer_sites:
    name: ceph-rbd2
    state: down+stopped
    description: stopped
    last_update: 2024-06-04 08:59:52


After resetting the peer client back to client.rbd-mirror-peer,
the status recovered properly.

On secondary
---

[ceph: root@ceph-rbd2-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool status -p ec_img_pool_EXnhoJmOKa --verbose
health: OK
daemon health: OK
image health: OK
images: 1 total
    1 replaying

DAEMONS
service 24418:
  instance_id: 24742
  client_id: ceph-rbd2-sangadi-bz-tfrmy6-node5.hfmyci
  hostname: ceph-rbd2-sangadi-bz-tfrmy6-node5
  version: 16.2.10-260.el8cp
  leader: true
  health: OK


IMAGES
ec_imageBkzgATmZuh:
  global_id:   8b2121a1-93d6-4d02-80e0-d95324a285e9
  state:       up+replaying
  description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"local_snapshot_timestamp":1717492080,"remote_snapshot_timestamp":1717492080,"replay_state":"idle"}
  service:     ceph-rbd2-sangadi-bz-tfrmy6-node5.hfmyci on ceph-rbd2-sangadi-bz-tfrmy6-node5
  last_update: 2024-06-04 09:08:53
  peer_sites:
    name: ceph-rbd1
    state: up+stopped
    description: local image is primary
    last_update: 2024-06-04 09:08:45

On primary
---

[ceph: root@ceph-rbd1-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool status -p ec_img_pool_EXnhoJmOKa --verbose
health: OK
daemon health: OK
image health: OK
images: 1 total
    1 replaying

DAEMONS
service 24425:
  instance_id: 24611
  client_id: ceph-rbd1-sangadi-bz-tfrmy6-node5.qxijuw
  hostname: ceph-rbd1-sangadi-bz-tfrmy6-node5
  version: 16.2.10-260.el8cp
  leader: true
  health: OK


IMAGES
ec_imageBkzgATmZuh:
  global_id:   8b2121a1-93d6-4d02-80e0-d95324a285e9
  state:       up+stopped
  description: local image is primary
  service:     ceph-rbd1-sangadi-bz-tfrmy6-node5.qxijuw on ceph-rbd1-sangadi-bz-tfrmy6-node5
  last_update: 2024-06-04 09:09:45
  peer_sites:
    name: ceph-rbd2
    state: up+replaying
    description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"local_snapshot_timestamp":1717492140,"remote_snapshot_timestamp":1717492140,"replay_state":"idle"}
    last_update: 2024-06-04 09:09:53

Also noted that the service_id remains the same and the previous instance_id is less than the new one, i.e. the daemon created a new PoolReplayer instance without being restarted.

RBD mirror sanity checks also look good.

Moving it to Verified.

Comment 8 errata-xmlrpc 2024-06-26 10:02:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4118