Bug 2153673

Summary: snapshot schedule stopped on one image and mirroring stopped on secondary images while upgrading from 16.2.10-82 to 16.2.10-84
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Vasishta <vashastr>
Component: RBD-Mirror    Assignee: Christopher Hoffman <choffman>
Status: CLOSED ERRATA QA Contact: Vasishta <vashastr>
Severity: urgent Docs Contact: Akash Raj <akraj>
Priority: unspecified    
Version: 5.3    CC: akraj, ceph-eng-bugs, cephqe-warriors, choffman, idryomov, mmurthy, ocs-bugs, rmandyam, sostapov, tserlin
Target Milestone: ---    Flags: choffman: needinfo+
Target Release: 6.1   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ceph-17.2.6-11.el9cp Doc Type: Bug Fix
Doc Text:
.Detect the blocklisted client
Previously, if the client requested an exclusive lock while blocklisted, the delayed request would not continue and the call that requested the lock would never complete. With this fix, the blocklisted client is detected and the stuck condition completes with an appropriate error code.
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-06-15 09:16:27 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2192813    

Description Vasishta 2022-12-15 06:26:26 UTC
Description of problem:

In a cluster with separate cluster and public networks, while the cluster was being upgraded using ceph orch upgrade, and while ceph orch upgrade status was reporting that OSDs were being upgraded, rbd mirror pool status reported:
=============================
health: ERROR
daemon health: WARNING
image health: ERROR
images: 106 total
    79 error
    26 replaying
    1 stopping_replay
============================
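A summary like the one above comes from the RBD mirroring status commands; a minimal sketch, assuming a hypothetical pool name `mirror_pool` (not given in the report):

```shell
# Pool-level mirroring health summary; --verbose adds per-image status,
# which is how the per-image error/replaying/stopping_replay counts
# above can be broken down.
rbd mirror pool status --verbose mirror_pool
```
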

***After 13+ minutes, both pool status and image status were back to OK.***

No RECENT image operations (failover or failback) were involved.

The cluster was hosting ~26 secondary and ~80 primary images.

The peer cluster reported all images as unknown; it too reported all OK after 13+ minutes.

When the pool mirror status was observed again ~2.5 hours later:
==============================
health: ERROR
daemon health: WARNING
image health: ERROR
images: 106 total
    3 error
    26 replaying
    2 stopping_replay
    75 stopped
==============================

Observed again around 10 hours later, the pool status was the same.

Upon checking the snapshot schedule, it was found to be stopped on 1/26 primary images.
Mirroring on the images stopped some time after the upgrade succeeded.
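The stopped schedule and per-image state described above can be inspected with the snapshot-schedule and image-status commands; a sketch, again assuming hypothetical pool/image names (`mirror_pool`, `image_1`):

```shell
# List configured mirror-snapshot schedules at every level (pool/image).
rbd mirror snapshot schedule ls --pool mirror_pool --recursive

# Show when the next scheduled mirror snapshots are due; an image missing
# from this output (or with no upcoming time) points at a stuck schedule.
rbd mirror snapshot schedule status --pool mirror_pool

# Per-image mirroring state (e.g. up+replaying, up+error, up+stopped).
rbd mirror image status mirror_pool/image_1
```
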

Version-Release number of selected component (if applicable):
(from ceph orch ps)
rbd-mirror.e22-h24-b01-fc640.xrklhe                    e22-h24-b01-fc640.rdu2.scalelab.redhat.com               running (21h)     7m ago  21h    1221M        -  16.2.10-82.el8cp  9600fe784925  79bd65b3b55d

How reproducible:
Observed once

Steps to Reproduce:
1. Explained above in the description

Actual results:


Expected results:
No snapshot schedule miss and healthy mirroring

Additional info:
Observed multiple blocklists from OSDs; will provide more details in an upcoming comment.
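The blocklist entries mentioned above can be listed directly from the cluster; a minimal sketch (note that Pacific, the 16.2.x release in this report, renamed "blacklist" to "blocklist"):

```shell
# List currently blocklisted client addresses and their expiry times.
# A blocklisted rbd-mirror client is consistent with the stuck
# exclusive-lock acquisition described in the Doc Text.
ceph osd blocklist ls
```
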

Comment 22 Vasishta 2023-05-08 17:37:26 UTC
Observed snapshot scheduling across multiple upgrades over a period of more than a week.
Tried multiple rbd-mirror daemon restarts.

Did not observe the snapshot schedule getting stuck for any images.
Moving to Verified state.

Comment 25 errata-xmlrpc 2023-06-15 09:16:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:3623