Bug 1976820 - [cephadm] 5.0 - Stopping mgr service using orch command is making cluster inaccessible - We need warning message and --force option for the "stop" service command
Summary: [cephadm] 5.0 - Stopping mgr service using orch command is making cluster ina...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 5.0
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 5.0z1
Assignee: Adam King
QA Contact: Sunil Kumar Nagaraju
Docs Contact: Mary Frances Hull
URL:
Whiteboard:
Depends On:
Blocks: 1959686
 
Reported: 2021-06-28 10:25 UTC by Preethi
Modified: 2021-11-02 16:39 UTC (History)
10 users

Fixed In Version: ceph-16.2.0-125.el8cp
Doc Type: Bug Fix
Doc Text:
.Users are no longer able to remove the Ceph Manager service using `cephadm`
Previously, if a user ran the `ceph orch rm mgr` command, `cephadm` removed all the Ceph Manager daemons in the storage cluster, making the storage cluster inaccessible. With this release, attempting to remove the Ceph Manager, Ceph Monitor, or Ceph OSD service using the `ceph orch rm _SERVICE_NAME_` command displays a warning message stating that it is not safe to remove these services, and no action is taken.
Clone Of:
Environment:
Last Closed: 2021-11-02 16:38:26 UTC
Embargoed:




Links
System ID                                Last Updated
Ceph Project Bug Tracker 51298           2021-06-29 12:40:47 UTC
Red Hat Issue Tracker RHCEPH-746         2021-08-18 21:57:25 UTC
Red Hat Product Errata RHBA-2021:4105    2021-11-02 16:39:08 UTC

Description Preethi 2021-06-28 10:25:10 UTC
Description of problem: [cephadm] 5.0 - Stopping the mgr service using the orch command makes the cluster inaccessible. We need a warning message and a --force option for the "stop" command for all services managed through ceph orch, for a better user experience.

upstream tracker  - https://tracker.ceph.com/issues/51298


Version-Release number of selected component (if applicable):
[ceph: root@ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp /]# ceph version
ceph version 16.2.0-78.el8cp (4c0b27dfdc25bd5a62233bef76eee4821089d79e) pacific (stable)


How reproducible:


Steps to Reproduce:
1. Deploy a 5.0 cluster with all Ceph services (mgr, mon, osd, rgw)
2. Run the stop command: ceph orch stop <service name>
3. Observe the behaviour




[ceph: root@ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp /]# ceph orch stop mgr
Scheduled to stop mgr.ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp on host 'ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp'
Scheduled to stop mgr.ceph-sshtest-1624545282997-node2-osd-mon-mgr-mds-node-exporter on host 'ceph-sshtest-1624545282997-node2-osd-mon-mgr-mds-node-exporter'
Scheduled to stop mgr.ceph-sshtest-1624545282997-node3-mon-osd-node-exporter-crash-rg on host 'ceph-sshtest-1624545282997-node3-mon-osd-node-exporter-crash-rg'


[ceph: root@ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp /]# ceph status
  cluster:
    id:     f64f341c-655d-11eb-8778-fa163e914bcc
    health: HEALTH_WARN
            no active mgr
            1 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp,ceph-sshtest-1624545282997-node2-osd-mon-mgr-mds-node-exporter,ceph-sshtest-1624545282997-node3-mon-osd-node-exporter-crash-rg (age 108s)
    mgr: no daemons active (since 74s)
    mds: 1/1 daemons up, 1 standby
    osd: 12 osds: 11 up (since 2m), 11 in (since 38m)
    rgw: 2 daemons active (2 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   7 pools, 169 pgs
    objects: 243 objects, 43 KiB
    usage:   778 MiB used, 164 GiB / 165 GiB avail
    pgs:     169 active+clean
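The status output above shows `mgr: no daemons active`, which is why the `ceph orch ls` call below returns nothing: the orchestrator module runs inside the active mgr, so with no mgr up, orchestrator commands cannot respond. A minimal sketch of a pre-stop guard for this state, assuming the mgr line is taken from `ceph -s` output (here the line is copied verbatim from the transcript above so the sketch is self-contained):

```shell
# Hedged sketch: refuse to proceed when `ceph -s` reports no active mgr.
# On a live cluster this line would come from: ceph -s | grep 'mgr:'
status_line="    mgr: no daemons active (since 74s)"

case "$status_line" in
  *"no daemons active"*)
    mgr_state="down"
    echo "WARNING: no active mgr - orchestrator commands cannot respond"
    ;;
  *)
    mgr_state="up"
    ;;
esac
```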


[ceph: root@ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp /]# ceph orch ls
Actual results: The cluster becomes inaccessible. `ceph orch stop mgr` stops every mgr daemon, after which Ceph cannot be managed. Hence, we recommend adding a warning message and a --force option to the `ceph orch stop` command, so the user is warned that stopping the mgr service will stop all of its daemons.


NOTE: Use the systemctl command to stop a single daemon, to avoid such failures at customer sites.
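The NOTE above can be sketched as follows. cephadm runs each daemon as a templated systemd unit named ceph-<fsid>@<daemon>.service; the fsid below matches the cluster in this report, while "mgr.node2" is a hypothetical daemon name (real names come from `ceph orch ps`):

```shell
# Illustrative values: fsid is from the `ceph status` output above;
# "mgr.node2" is a hypothetical daemon name, not from this cluster.
fsid="f64f341c-655d-11eb-8778-fa163e914bcc"
daemon="mgr.node2"

# cephadm's systemd unit naming convention: ceph-<fsid>@<daemon>.service
unit="ceph-${fsid}@${daemon}.service"
echo "$unit"   # prints ceph-f64f341c-655d-11eb-8778-fa163e914bcc@mgr.node2.service

# Stopping only this unit leaves the other mgr daemons running:
# systemctl stop "$unit"
```

Stopping one unit this way keeps the remaining mgr daemons up, so a standby can take over and the cluster stays manageable.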

Upstream tracker for reference :
https://tracker.ceph.com/issues/51298


Expected results: The user should see a warning message stating that this will be the behaviour, and a --force argument should be added to "stop".


Additional info:
10.0.211.15 cephuser/cephuser

Comment 1 Sebastian Wagner 2021-06-28 10:36:35 UTC
moving this to 5.0z1

Comment 2 Preethi 2021-06-28 17:45:25 UTC
We can recover the cluster by following the steps below. The issue is now seen only for MON and MGR; the cluster remains accessible for other services such as OSDs and RGW, but all daemons of the stopped service will be down. Hence, we need warning messages for all service types when using the ceph orch stop <service name> option.


How to recover:

1. Go to /var/lib/ceph/<fsid> and note the mon service name.

2. Run systemctl start ceph-<fsid>@<mon service name>, using the name found under /var/lib/ceph/<fsid>. Repeat this on all mon/mgr nodes.

3. Run ps -ef | grep ceph-mon (or ceph-mgr) to verify that the ceph-mon/ceph-mgr process was created.

Ex: systemctl start ceph-f64f341c-655d-11eb-8778-fa163e914bcc@mon.<hostname>.service
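The recovery steps above can be sketched as a loop over the daemon directories. To keep the sketch self-contained, the /var/lib/ceph/<fsid> layout is simulated under /tmp with hypothetical daemon names (mon.node1, mgr.node1); on a real node the loop would run over /var/lib/ceph/<fsid> directly and actually invoke systemctl:

```shell
fsid="f64f341c-655d-11eb-8778-fa163e914bcc"

# Simulated layout for illustration; on a real node use:
#   base="/var/lib/ceph/${fsid}"
base="/tmp/ceph-recovery-demo/${fsid}"
mkdir -p "${base}/mon.node1" "${base}/mgr.node1"

# Each mon/mgr directory name is the daemon name in the systemd unit,
# so the directory listing yields the exact units to start.
for d in "${base}"/mon.* "${base}"/mgr.*; do
  name="$(basename "$d")"
  echo "systemctl start ceph-${fsid}@${name}.service"
done
```

On a real node, replacing the `echo` with the command itself starts each mon/mgr daemon found on that host, which is exactly the manual recovery described above.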

Comment 5 Veera Raghava Reddy 2021-06-30 07:31:56 UTC
Hi Mike,
Can you review, from a support perspective, whether the recovery procedure is good enough to defer this BZ to 5.0z1?

Comment 10 Sebastian Wagner 2021-08-17 12:53:24 UTC
PR is merged in upstream, but not yet in z1

Comment 22 errata-xmlrpc 2021-11-02 16:38:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4105

