1550026 – upgrade to 12.2.3 will cause MDSs to fail during upgrade

Summary: upgrade to 12.2.3 will cause MDSs to fail during upgrade

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	Documentation
Sub Component:
Version:	3.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	z2
Target Release:	3.0
Assignee:	Erin Donnelly
QA Contact:	Manohar Murthy
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1557269

Reported:	2018-02-28 10:45 UTC by Patrick Donnelly
Modified:	2019-08-26 06:55 UTC
CC List:	10 users
Fixed In Version:
Doc Type:	Known Issue
Doc Text:	.When using the `rolling_update.yml` playbook to upgrade to Red Hat Ceph Storage 3.0 and from version 3.0 to other zStream releases of 3.0, users who use CephFS must manually upgrade the MDS cluster
Currently, the Metadata Server (MDS) cluster does not have built-in versioning or file system flags to support seamless upgrades of the MDS nodes without potentially causing assertions or other faults due to incompatible messages or other functional differences. For this reason, it is necessary during any cluster upgrade to first reduce the number of active MDS nodes for a file system to one, so that two active MDS nodes running different versions do not communicate with each other. Further, it is also necessary to take standbys offline, because any new `CompatSet` flags will propagate via the MDSMap to all MDS nodes and cause older MDS nodes to suicide.

To upgrade the MDS cluster:

. Reduce the number of ranks to 1:
+
----
ceph fs set <fs_name> max_mds 1
----

. Deactivate all non-zero ranks, from the highest rank to the lowest, while waiting for each MDS to finish stopping:
+
----
ceph mds deactivate <fs_name>:<n>
ceph status # wait for MDS to finish stopping
----

. Take all standbys offline using `systemctl`:
+
----
systemctl stop ceph-mds.target
ceph status # confirm only one MDS is online and is active
----

. Upgrade the single active MDS and restart the daemon using `systemctl`:
+
----
systemctl restart ceph-mds.target
----

. Upgrade and start the standby daemons.

. Restore the previous `max_mds` value for your cluster:
+
----
ceph fs set <fs_name> max_mds <old_max_mds>
----

For steps on how to upgrade the MDS cluster in a container, refer to the https://access.redhat.com/articles/2789521[Updating Red Hat Ceph Storage deployed as a Container Image] Knowledgebase article.
Clone Of:
Environment:
Last Closed:	2019-08-26 06:55:46 UTC
Embargoed:
Dependent Products:
Flags:	pdonnell: needinfo+
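
The manual procedure in the Doc Text can be sketched as a dry-run helper that only prints the commands in the documented order. This is an illustration, not a supported tool: `plan_mds_upgrade` is a hypothetical function name, and the sketch assumes the active ranks are numbered 0 through `max_mds - 1`, which should be confirmed with `ceph status` on a real cluster. Nothing is executed against the cluster.

```shell
# Sketch only: print the manual MDS upgrade steps from the Doc Text in order.
# plan_mds_upgrade is a hypothetical helper, not part of Ceph.
#   $1 = file system name
#   $2 = current max_mds value (restored in the last step)
# The body runs in a subshell so its variables do not leak into the caller.
plan_mds_upgrade() (
    fs_name="$1"
    old_max_mds="$2"

    echo "ceph fs set ${fs_name} max_mds 1"

    # Assumption: active ranks are 0..old_max_mds-1.
    # Deactivate the non-zero ranks from the highest to the lowest.
    rank=$((old_max_mds - 1))
    while [ "$rank" -ge 1 ]; do
        echo "ceph mds deactivate ${fs_name}:${rank}"
        echo "ceph status  # wait for MDS to finish stopping"
        rank=$((rank - 1))
    done

    echo "systemctl stop ceph-mds.target  # on each standby MDS node"
    echo "ceph status  # confirm only one MDS is online and is active"
    echo "# upgrade the single active MDS"
    echo "systemctl restart ceph-mds.target  # on the active MDS node"
    echo "# upgrade and start the standby daemons"
    echo "ceph fs set ${fs_name} max_mds ${old_max_mds}"
)
```

For example, `plan_mds_upgrade cephfs 3` (with `cephfs` as a placeholder file system name) lists the deactivation of rank 2 before rank 1 and ends by restoring `max_mds` to 3. Printing the plan rather than running it keeps the waiting and confirmation steps under the operator's control.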

Links
System	ID	Priority	Status	Summary	Last Updated
Ceph Project Bug Tracker	23172	None	None	None	2018-02-28 10:45:33 UTC
Github	ceph ceph pull 21263	None	closed	doc: outline upgrade procedure for mds cluster	2020-12-18 09:02:35 UTC
Red Hat Bugzilla	1548067	medium	CLOSED	rebase ceph to 12.2.4	2022-02-21 18:21:59 UTC
Red Hat Bugzilla	1569689	high	CLOSED	MDS rolling-upgrade process needs to be changed to follow new recommendations	2021-02-22 00:41:40 UTC

Internal Links: 1548067 1569689

Description Patrick Donnelly 2018-02-28 10:45:33 UTC

Description of problem:

When MDSs are upgraded to 12.2.3 or later, all online MDSs still running an older version will suicide after the first upgraded MDS goes online.

Version-Release number of selected component (if applicable):

N/A yet

How reproducible:

100%

Steps to Reproduce:
1. Take an RHCS 3.0 cluster and upgrade an MDS to a release based on 12.2.3.

Actual results:

MDSs running 12.2.2 or earlier will suicide.

Expected results:

MDSs running 12.2.2 or earlier continue functioning.

Additional info:

Caused by this backport: https://github.com/ceph/ceph/pull/18782
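
To make the affected range concrete, here is a minimal sketch of a check for whether a given MDS version is older than 12.2.3 and therefore at risk of suiciding once an upgraded MDS comes online. `mds_predates_12_2_3` is a hypothetical helper name, the input is assumed to be a bare upstream version string such as `12.2.2`, and the comparison relies on the `-V` (version sort) option of GNU `sort`.

```shell
# Sketch only: mds_predates_12_2_3 is a hypothetical helper, not a Ceph command.
# Exits 0 when version $1 sorts strictly before 12.2.3, non-zero otherwise.
# Assumes GNU sort, whose -V option orders strings as version numbers.
mds_predates_12_2_3() {
    [ "$1" != "12.2.3" ] &&
        [ "$(printf '%s\n' "$1" "12.2.3" | sort -V | head -n 1)" = "$1" ]
}
```

A caller could use it as `mds_predates_12_2_3 12.2.2 && echo "at risk"`. How the version string is obtained for each daemon is left open here; on Luminous, the output of `ceph versions` is one possible source.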

Comment 9 Yan, Zheng 2018-04-05 07:11:01 UTC

I think it's caused by

commit cb8eff43b1abd8c268df9e57906d677ff4be8d95
Author: Yan, Zheng <zyan>
Date:   Wed Oct 18 20:58:15 2017 +0800

    mds: don't rdlock locks in replica object while auth mds is recovering
    
    Auth mds may take xlock on the lock and change the object when replaying
    unsafe requests. To guarantee new requests and replayed unsafe requests
    (on auth mds) get processed in proper order, we shouldn't rdlock locks in
    replica object while auth mds of the object is recovering
    
    Signed-off-by: "Yan, Zheng" <zyan>
    (cherry picked from commit 0afbc0338e1b9f32340eaa74899d8d43ac8608fe)


The commit modified CInode::encode_replica and CInode::_encode_locks_state_for_replica.

Comment 34 Vasishta 2018-04-16 14:40:02 UTC

Hi Erin,

Can you please also add the changes made in the RHEL installation guide to the Ubuntu installation guide and the Container Guide?

Regards,
Vasishta Shastry
AQE, Ceph

Comment 36 Harish NV Rao 2018-04-18 09:18:25 UTC

Pushing this to the assigned state based on comments 34 and 35.

Comment 40 Ramakrishnan Periyasamy 2018-04-20 14:15:00 UTC

Moving this bz to the verified state; the doc text for RHEL, Ubuntu and Container looks good.
