.Detect the blocklisted client
Previously, if a client requested an exclusive lock while blocklisted, the delayed request would not continue and the call that requested the lock would never complete.
With this fix, the blocklisted client is detected and the stuck request completes with an appropriate error code.
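On the operator side, a quick way to confirm that a client has in fact been blocklisted is to inspect the OSD blocklist. A minimal sketch follows; the address below is a placeholder, not taken from this report:

# List current blocklist entries (client address:nonce plus expiry time)
ceph osd blocklist ls

# If a stale entry needs to be cleared manually, remove it by address
ceph osd blocklist rm 10.0.0.1:0/123456789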
Description of problem:
In a cluster with separate cluster and public networks, while the cluster was being upgraded using ceph orch upgrade and ceph orch upgrade status was reporting that OSDs were being upgraded, rbd mirror pool status reported:
=============================
health: ERROR
daemon health: WARNING
image health: ERROR
images: 106 total
    79 error
    26 replaying
    1 stopping_replay
============================
***After 13+ minutes, pool status and image status were back to OK.***
There were no recent image operations (failover or failback) involved.
The cluster was hosting ~26 secondary and ~80 primary images.
The peer cluster reported all images as unknown; it too reported everything as OK after 13+ minutes.
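For reference, the per-image detail behind a summary like the one above can be pulled with the verbose form of the same command (the pool name "data" is illustrative):

# Per-image states and descriptions, not just the summary counts
rbd mirror pool status data --verbose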
Later, after ~2.5 hours, pool mirror status was observed as:
==============================
health: ERROR
daemon health: WARNING
image health: ERROR
images: 106 total
    3 error
    26 replaying
    2 stopping_replay
    75 stopped
==============================
Around 10 hours later, the pool status was the same.
Upon checking the snapshot schedule, it was found stopped on 1/26 primary images.
Mirroring on the images stopped sometime around when the upgrade completed successfully.
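A minimal sketch of how the schedule state can be cross-checked (the pool name "data" is illustrative; both are standard rbd subcommands):

# List configured mirror snapshot schedules, including per-image ones
rbd mirror snapshot schedule ls --pool data --recursive

# Show when each image's next scheduled mirror snapshot is due
rbd mirror snapshot schedule status --pool data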
Version-Release number of selected component (if applicable):
(from ceph orch ps)
rbd-mirror.e22-h24-b01-fc640.xrklhe e22-h24-b01-fc640.rdu2.scalelab.redhat.com running (21h) 7m ago 21h 1221M - 16.2.10-82.el8cp 9600fe784925 79bd65b3b55d
How reproducible:
Observed once
Steps to Reproduce:
1. Explained above in the description
Actual results:
Snapshot schedule stopped on some primary images after the upgrade, and pool mirror health remained in ERROR with images in error/stopped states.
Expected results:
No snapshot schedule miss and healthy mirroring
Additional info:
Observed multiple blocklists from OSDs; more details will be provided in an upcoming update.
Observed snapshot scheduling across multiple upgrades, over a period of more than a week.
Tried multiple rbd-mirror daemon restarts (restart sketch below).
Did not observe the snapshot schedule getting stuck for any image.
Moving to Verified state.
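For reference, a minimal restart sketch, assuming the cephadm-managed service is named rbd-mirror as suggested by the ceph orch ps output above:

# Restart all rbd-mirror daemons managed by cephadm
ceph orch restart rbd-mirror

# Confirm the daemons are running again
ceph orch ps | grep rbd-mirror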
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security and bug fix update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2023:3623