.Detect the blocklisted client
Previously, if a client requested an exclusive lock while blocklisted, the delayed request would not continue and the call that requested the lock would never complete.
With this fix, the blocklisted client is detected and the stuck request completes with an appropriate error code.
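On the operator side, a quick way to confirm that a client has in fact been blocklisted is to inspect the OSD blocklist. A minimal sketch follows; the address below is a placeholder, not taken from this report:

# List current blocklist entries (client address:nonce plus expiry time)
ceph osd blocklist ls

# If a stale entry needs to be cleared manually, remove it by address
ceph osd blocklist rm 10.0.0.1:0/123456789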
Description of problem:
In a cluster with separate cluster and public networks, while the cluster was being upgraded using ceph orch upgrade and ceph orch upgrade status was reporting that OSDs were being upgraded, rbd mirror pool status reported:
=============================
health: ERROR
daemon health: WARNING
image health: ERROR
images: 106 total
    79 error
    26 replaying
    1 stopping_replay
============================
***After 13+ minutes, pool status and image status were back to OK.***
There were no recent image operations (failover or failback) involved.
The cluster was hosting ~26 secondary and ~80 primary images.
The peer cluster reported all images as unknown; it too reported everything as OK after 13+ minutes.
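For reference, the per-image detail behind a summary like the one above can be pulled with the verbose form of the same command (the pool name "data" is illustrative):

# Per-image states and descriptions, not just the summary counts
rbd mirror pool status data --verbose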
Later, after ~2.5 hours, pool mirror status was observed as:
==============================
health: ERROR
daemon health: WARNING
image health: ERROR
images: 106 total
    3 error
    26 replaying
    2 stopping_replay
    75 stopped
==============================
Around 10 hours later, the pool status was the same.
Upon checking the snapshot schedule, it was found stopped on 1/26 primary images.
Mirroring on the images stopped sometime around when the upgrade completed successfully.
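A minimal sketch of how the schedule state can be cross-checked (the pool name "data" is illustrative; both are standard rbd subcommands):

# List configured mirror snapshot schedules, including per-image ones
rbd mirror snapshot schedule ls --pool data --recursive

# Show when each image's next scheduled mirror snapshot is due
rbd mirror snapshot schedule status --pool data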
Version-Release number of selected component (if applicable):
(from ceph orch ps)
rbd-mirror.e22-h24-b01-fc640.xrklhe e22-h24-b01-fc640.rdu2.scalelab.redhat.com running (21h) 7m ago 21h 1221M - 16.2.10-82.el8cp 9600fe784925 79bd65b3b55d
How reproducible:
Observed once
Steps to Reproduce:
1. Explained above in the description
Actual results:
Snapshot schedule stopped on some primary images after the upgrade, and pool mirror health remained in ERROR with images in error/stopped states.
Expected results:
No snapshot schedule miss and healthy mirroring
Additional info:
Observed multiple blocklists from OSDs; more details will be provided in an upcoming update.
Observed snapshot scheduling across multiple upgrades, over a period of more than a week.
Tried multiple rbd-mirror daemon restarts (restart sketch below).
Did not observe the snapshot schedule getting stuck for any image.
Moving to Verified state.
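For reference, a minimal restart sketch, assuming the cephadm-managed service is named rbd-mirror as suggested by the ceph orch ps output above:

# Restart all rbd-mirror daemons managed by cephadm
ceph orch restart rbd-mirror

# Confirm the daemons are running again
ceph orch ps | grep rbd-mirror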
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: Red Hat Ceph Storage 6.1 security and bug fix update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2023:3623