Bug 2281592 - rbd-mirror daemon in ERROR state, requires manual restart [5.3z]
Summary: rbd-mirror daemon in ERROR state, requires manual restart [5.3z]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD-Mirror
Version: 5.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 5.3z7
Assignee: Ilya Dryomov
QA Contact: Sunil Angadi
Docs Contact: Disha Walvekar
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-05-20 07:00 UTC by Ilya Dryomov
Modified: 2024-06-26 10:02 UTC
CC List: 6 users

Fixed In Version: ceph-16.2.10-258.el8cp
Doc Type: Bug Fix
Doc Text:
Previously, due to an implementation defect, the `rbd-mirror` daemon did not properly dispose of outdated `PoolReplayer` instances, in particular when refreshing the mirror peer configuration. As a result, resources were consumed unnecessarily and the outdated `PoolReplayer` instances competed with the current ones, causing the `rbd-mirror` daemon health to be reported as ERROR and, in some cases, replication to hang. To resume replication, the administrator had to restart the `rbd-mirror` daemon. With this fix, the implementation defect is corrected and the `rbd-mirror` daemon now properly disposes of outdated `PoolReplayer` instances.
Clone Of: 2279528
Environment:
Last Closed: 2024-06-26 10:02:43 UTC
Embargoed:




Links
System                    ID              Last Updated
Ceph Project Bug Tracker  65487           2024-05-20 07:00:28 UTC
Red Hat Issue Tracker     RHCEPH-9060     2024-05-20 07:08:47 UTC
Red Hat Product Errata    RHSA-2024:4118  2024-06-26 10:02:46 UTC

Comment 5 Sunil Angadi 2024-06-04 12:49:43 UTC
Tested using
ceph version 16.2.10-260.el8cp (b20e1a5452628262667a6b060687917fde010343) pacific (stable)

Followed these test steps (a condensed command sketch follows the list)
---
1. Set up bidirectional mirroring on a test pool as usual
2. Verify that "rbd mirror pool status" reports "health: OK" on both clusters
3. Grab service_id and instance_id from "rbd mirror pool status --verbose" output on cluster B
4. Grab peer UUID ("UUID: ...", not "Mirror UUID: ...") from "rbd mirror pool info" output on cluster B
5. Run "rbd mirror pool peer set <peer UUID from step 4> client client.invalid" command on cluster B
6. Wait 30-90 seconds and verify that "rbd mirror pool status" reports "health: ERROR" on cluster B and "health: WARNING" on cluster A
7. Run "rbd mirror pool peer set <peer UUID from step 4> client client.rbd-mirror-peer" command on cluster B
8. Wait 30-90 seconds and verify that "rbd mirror pool status" reports "health: OK" on both clusters again
9. Grab service_id and instance_id from "rbd mirror pool status --verbose" output on cluster B again
10. Verify that service_id from step 3 is equal to the one from step 9
11. Verify that instance_id from step 3 is less than the one from step 9
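
For reference, a condensed sketch of the commands behind steps 2-8; the pool name and peer UUID are placeholders, and the exact invocations used during this verification appear in the transcript below:

POOL=<mirrored pool>
# Steps 2-3: check health and note service_id / instance_id
rbd mirror pool status -p "$POOL" --verbose
# Step 4: the peer UUID is the "UUID:" line, not the "Mirror UUID:" line
rbd mirror pool info -p "$POOL"
# Step 5: break the peer connection by pointing it at a non-existent client
rbd mirror pool peer set --pool "$POOL" <peer UUID> client client.invalid
# Step 7: restore the correct peer client
rbd mirror pool peer set --pool "$POOL" <peer UUID> client client.rbd-mirror-peer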

On primary
---

[ceph: root@ceph-rbd1-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool status -p ec_img_pool_EXnhoJmOKa --verbose
health: OK
daemon health: OK
image health: OK
images: 1 total
    1 replaying

DAEMONS
service 24425:
  instance_id: 24611
  client_id: ceph-rbd1-sangadi-bz-tfrmy6-node5.qxijuw
  hostname: ceph-rbd1-sangadi-bz-tfrmy6-node5
  version: 16.2.10-260.el8cp
  leader: true
  health: OK


IMAGES
ec_imageBkzgATmZuh:
  global_id:   8b2121a1-93d6-4d02-80e0-d95324a285e9
  state:       up+stopped
  description: local image is primary
  service:     ceph-rbd1-sangadi-bz-tfrmy6-node5.qxijuw on ceph-rbd1-sangadi-bz-tfrmy6-node5
  last_update: 2024-06-04 08:54:15
  peer_sites:
    name: ceph-rbd2
    state: up+replaying
    description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"local_snapshot_timestamp":1717491240,"remote_snapshot_timestamp":1717491240,"replay_state":"idle"}
    last_update: 2024-06-04 08:54:20

On secondary
---

[ceph: root@ceph-rbd2-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool status -p ec_img_pool_EXnhoJmOKa --verbose
health: OK
daemon health: OK
image health: OK
images: 1 total
    1 replaying

DAEMONS
service 24418:
  instance_id: 24586
  client_id: ceph-rbd2-sangadi-bz-tfrmy6-node5.hfmyci
  hostname: ceph-rbd2-sangadi-bz-tfrmy6-node5
  version: 16.2.10-260.el8cp
  leader: true
  health: OK


IMAGES
ec_imageBkzgATmZuh:
  global_id:   8b2121a1-93d6-4d02-80e0-d95324a285e9
  state:       up+replaying
  description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"local_snapshot_timestamp":1717491000,"remote_snapshot_timestamp":1717491000,"replay_state":"idle"}
  service:     ceph-rbd2-sangadi-bz-tfrmy6-node5.hfmyci on ceph-rbd2-sangadi-bz-tfrmy6-node5
  last_update: 2024-06-04 08:50:20
  peer_sites:
    name: ceph-rbd1
    state: up+stopped
    description: local image is primary
    last_update: 2024-06-04 08:50:45


[ceph: root@ceph-rbd2-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool info -p ec_img_pool_EXnhoJmOKa
Mode: image
Site Name: ceph-rbd2

Peer Sites:

UUID: c468309d-1c30-4e4f-83df-a4c5550a84d5
Name: ceph-rbd1
Mirror UUID: 51e60cf3-b64f-4efd-a3ee-ab6240c36f40
Direction: rx-tx
Client: client.rbd-mirror-peer
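
For step 4, the peer UUID can be pulled from this output with a simple filter (an illustrative one-liner that assumes the plain-text layout shown above; it matches the "UUID:" line and deliberately skips "Mirror UUID:"):

rbd mirror pool info -p ec_img_pool_EXnhoJmOKa | awk '/^UUID:/ {print $2}'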

Set an invalid client ID for the mirror peer (step 5):

[ceph: root@ceph-rbd2-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool peer set --pool ec_img_pool_EXnhoJmOKa c468309d-1c30-4e4f-83df-a4c5550a84d5 client client.invalid
[ceph: root@ceph-rbd2-sangadi-bz-tfrmy6-node1-installer /]#

As expected, the status changed accordingly.

On secondary
---
[ceph: root@ceph-rbd2-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool status -p ec_img_pool_EXnhoJmOKa --verbose
health: ERROR
daemon health: ERROR
image health: OK
images: 1 total
    1 stopped

DAEMONS
service 24418:
  instance_id: 24586
  client_id: ceph-rbd2-sangadi-bz-tfrmy6-node5.hfmyci
  hostname: ceph-rbd2-sangadi-bz-tfrmy6-node5
  version: 16.2.10-260.el8cp
  leader: false
  health: ERROR
  callouts: unable to connect to remote cluster


IMAGES
ec_imageBkzgATmZuh:
  global_id:   8b2121a1-93d6-4d02-80e0-d95324a285e9
  state:       down+stopped
  description: stopped
  last_update: 2024-06-04 08:59:52
  peer_sites:
    name: ceph-rbd1
    state: up+stopped
    description: local image is primary
    last_update: 2024-06-04 09:02:15

On primary
---
[ceph: root@ceph-rbd1-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool status -p ec_img_pool_EXnhoJmOKa --verbose
health: WARNING
daemon health: OK
image health: WARNING
images: 1 total
    1 unknown

DAEMONS
service 24425:
  instance_id: 24611
  client_id: ceph-rbd1-sangadi-bz-tfrmy6-node5.qxijuw
  hostname: ceph-rbd1-sangadi-bz-tfrmy6-node5
  version: 16.2.10-260.el8cp
  leader: true
  health: OK


IMAGES
ec_imageBkzgATmZuh:
  global_id:   8b2121a1-93d6-4d02-80e0-d95324a285e9
  state:       up+stopped
  description: local image is primary
  service:     ceph-rbd1-sangadi-bz-tfrmy6-node5.qxijuw on ceph-rbd1-sangadi-bz-tfrmy6-node5
  last_update: 2024-06-04 09:03:15
  peer_sites:
    name: ceph-rbd2
    state: down+stopped
    description: stopped
    last_update: 2024-06-04 08:59:52


After resetting the peer client back to client.rbd-mirror-peer (step 7), the status recovered properly.

on secondary
---

[ceph: root@ceph-rbd2-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool status -p ec_img_pool_EXnhoJmOKa --verbose
health: OK
daemon health: OK
image health: OK
images: 1 total
    1 replaying

DAEMONS
service 24418:
  instance_id: 24742
  client_id: ceph-rbd2-sangadi-bz-tfrmy6-node5.hfmyci
  hostname: ceph-rbd2-sangadi-bz-tfrmy6-node5
  version: 16.2.10-260.el8cp
  leader: true
  health: OK


IMAGES
ec_imageBkzgATmZuh:
  global_id:   8b2121a1-93d6-4d02-80e0-d95324a285e9
  state:       up+replaying
  description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"local_snapshot_timestamp":1717492080,"remote_snapshot_timestamp":1717492080,"replay_state":"idle"}
  service:     ceph-rbd2-sangadi-bz-tfrmy6-node5.hfmyci on ceph-rbd2-sangadi-bz-tfrmy6-node5
  last_update: 2024-06-04 09:08:53
  peer_sites:
    name: ceph-rbd1
    state: up+stopped
    description: local image is primary
    last_update: 2024-06-04 09:08:45

on primary
---

[ceph: root@ceph-rbd1-sangadi-bz-tfrmy6-node1-installer /]# rbd mirror pool status -p ec_img_pool_EXnhoJmOKa --verbose
health: OK
daemon health: OK
image health: OK
images: 1 total
    1 replaying

DAEMONS
service 24425:
  instance_id: 24611
  client_id: ceph-rbd1-sangadi-bz-tfrmy6-node5.qxijuw
  hostname: ceph-rbd1-sangadi-bz-tfrmy6-node5
  version: 16.2.10-260.el8cp
  leader: true
  health: OK


IMAGES
ec_imageBkzgATmZuh:
  global_id:   8b2121a1-93d6-4d02-80e0-d95324a285e9
  state:       up+stopped
  description: local image is primary
  service:     ceph-rbd1-sangadi-bz-tfrmy6-node5.qxijuw on ceph-rbd1-sangadi-bz-tfrmy6-node5
  last_update: 2024-06-04 09:09:45
  peer_sites:
    name: ceph-rbd2
    state: up+replaying
    description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"local_snapshot_timestamp":1717492140,"remote_snapshot_timestamp":1717492140,"replay_state":"idle"}
    last_update: 2024-06-04 09:09:53

Also noted that the service_id stays the same (24418) and that the previous instance_id (24586) is less than the new one (24742), as expected from steps 10 and 11.
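
For the step 10-11 comparison, a hypothetical helper that pulls both IDs out of the plain "--verbose" output shown above (the awk patterns assume the field layout of this release):

rbd mirror pool status -p ec_img_pool_EXnhoJmOKa --verbose | awk '/^service /{sub(":","",$2); print "service_id=" $2} /instance_id:/{print "instance_id=" $2}'

Running it before step 5 and again after step 8 yields the two (service_id, instance_id) pairs to compare.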

RBD mirror sanity tests also look good.

Moving it to Verified.

Comment 8 errata-xmlrpc 2024-06-26 10:02:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 5.3 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4118

