Bug 1976820

Summary: [cephadm] 5.0 - Stopping mgr service using orch command is making cluster inaccessible - We need warning message and --force option for the "stop" service command
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Preethi <pnataraj>
Component: Cephadm Assignee: Adam King <adking>
Status: CLOSED ERRATA QA Contact: Sunil Kumar Nagaraju <sunnagar>
Severity: urgent Docs Contact: Mary Frances Hull <mhull>
Priority: urgent    
Version: 5.0 CC: adking, agunn, asakthiv, gsitlani, mhackett, sewagner, sunnagar, tserlin, vereddy, vumrao
Target Milestone: ---   
Target Release: 5.0z1   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ceph-16.2.0-125.el8cp Doc Type: Bug Fix
Doc Text:
.Users are no longer able to remove the Ceph Manager service using `cephadm`
Previously, if a user ran a `ceph orch rm mgr` command, it would cause `cephadm` to remove all the Ceph Manager daemons in the storage cluster, making the storage cluster inaccessible. With this release, attempting to remove the Ceph Manager, a Ceph Monitor, or a Ceph OSD service using the `ceph orch rm _SERVICE_NAME_` command displays a warning message stating that it is not safe to remove these services, and results in no actions taken.
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-11-02 16:38:26 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1959686    

Description Preethi 2021-06-28 10:25:10 UTC
Description of problem: [cephadm] 5.0 - Stopping the mgr service using the orch command makes the cluster inaccessible. We need a warning message and a --force option on the "stop" command for all services managed through ceph orch, for a better user experience.

upstream tracker  - https://tracker.ceph.com/issues/51298


Version-Release number of selected component (if applicable):
[ceph: root@ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp /]# ceph version
ceph version 16.2.0-78.el8cp (4c0b27dfdc25bd5a62233bef76eee4821089d79e) pacific (stable)


How reproducible:


Steps to Reproduce:
1. Deploy a 5.0 cluster with all Ceph services, such as mgr, mon, osd, and rgw
2. Stop a service using ceph orch stop <service name>
3. Observe the behaviour




[ceph: root@ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp /]# ceph orch stop mgr
Scheduled to stop mgr.ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp on host 'ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp'
Scheduled to stop mgr.ceph-sshtest-1624545282997-node2-osd-mon-mgr-mds-node-exporter on host 'ceph-sshtest-1624545282997-node2-osd-mon-mgr-mds-node-exporter'
Scheduled to stop mgr.ceph-sshtest-1624545282997-node3-mon-osd-node-exporter-crash-rg on host 'ceph-sshtest-1624545282997-node3-mon-osd-node-exporter-crash-rg'


[ceph: root@ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp /]# ceph status
  cluster:
    id:     f64f341c-655d-11eb-8778-fa163e914bcc
    health: HEALTH_WARN
            no active mgr
            1 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp,ceph-sshtest-1624545282997-node2-osd-mon-mgr-mds-node-exporter,ceph-sshtest-1624545282997-node3-mon-osd-node-exporter-crash-rg (age 108s)
    mgr: no daemons active (since 74s)
    mds: 1/1 daemons up, 1 standby
    osd: 12 osds: 11 up (since 2m), 11 in (since 38m)
    rgw: 2 daemons active (2 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   7 pools, 169 pgs
    objects: 243 objects, 43 KiB
    usage:   778 MiB used, 164 GiB / 165 GiB avail
    pgs:     169 active+clean


[ceph: root@ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp /]# ceph orch ls











Actual results: The cluster becomes inaccessible. Stopping mgr stops all the daemons of the service and Ceph crashes. Hence, we recommend that the stop command in ceph orch show a warning message to the user and require a force option, because stopping the mgr service stops all of its daemons.


NOTE: To stop a single daemon, use the systemctl command instead; this avoids such failures on the customer side.

Upstream tracker for reference :
https://tracker.ceph.com/issues/51298


Expected results: The user should see a warning message stating that this will be the behaviour, and "stop" should require a force argument.
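
The requested behaviour could look roughly like the sketch below. It is only an illustration of "warn, and do nothing unless forced": the wrapper name guarded_orch_stop, the way --force is passed, the CEPH_CMD override, and the list of critical services are all assumptions made for this example and are not part of the ceph CLI or of the eventual fix.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around "ceph orch stop" (names are illustrative only).

# Return 0 if the service is one whose loss can make the cluster unusable.
# The list (mgr, mon, osd) is an assumption based on this report.
is_critical_service() {
    case "$1" in
        mgr|mon|osd|mgr.*|mon.*|osd.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Refuse to stop a critical service unless "--force" is given as 2nd argument.
guarded_orch_stop() {
    local service="$1" force="${2:-}"
    if is_critical_service "$service" && [ "$force" != "--force" ]; then
        echo "WARNING: stopping '$service' stops every daemon of that service and may make the cluster inaccessible; pass --force to proceed." >&2
        return 1
    fi
    # CEPH_CMD allows a dry run, e.g. CEPH_CMD="echo ceph".
    ${CEPH_CMD:-ceph} orch stop "$service"
}
```

With CEPH_CMD="echo ceph" set, guarded_orch_stop mgr prints only the warning and returns non-zero, while guarded_orch_stop mgr --force prints the command that would be run.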


Additional info:
10.0.211.15 cephuser/cephuser

Comment 1 Sebastian Wagner 2021-06-28 10:36:35 UTC
moving this to 5.0z1

Comment 2 Preethi 2021-06-28 17:45:25 UTC
We can recover the cluster by following the steps below. The issue is seen only for MON and MGR now. For other services such as OSD and RGW, the cluster remains accessible, but all daemons of the stopped service will be down. Hence, we need warning messages for all types of services when ceph orch stop <service name> is used.


How to recover:

Go to /var/lib/ceph/<fsid> and note the mon service name.

Then run systemctl start ceph-<fsid>@<mon service name>, using the name found under /var/lib/ceph/<fsid>. Repeat this on all mon/mgr nodes.

Check ps -ef | grep ceph-mon (or ceph-mgr) to verify that a process was created for ceph-mon/ceph-mgr.


Ex: systemctl start ceph-f64f341c-655d-11eb-8778-fa163e914bcc
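
The recovery steps above can be sketched as a small helper that only prints the systemctl commands for review, without running them. It assumes the usual cephadm layout, with one directory per daemon under /var/lib/ceph/<fsid>/ named <type>.<id>, and systemd units named ceph-<fsid>@<type>.<id>.service. The function name list_start_commands is made up for this example.

```shell
#!/usr/bin/env bash
# Print (do not run) the commands that would start the mon/mgr daemons
# found on this host. Assumed layout: <base_dir>/<fsid>/<type>.<id>/
list_start_commands() {
    local base_dir="$1" fsid="$2" dir name
    for dir in "$base_dir/$fsid"/mon.* "$base_dir/$fsid"/mgr.*; do
        [ -d "$dir" ] || continue      # skip unmatched globs and plain files
        name="${dir##*/}"              # directory name, e.g. mon.node1
        echo "systemctl start ceph-${fsid}@${name}.service"
    done
}

# Example invocation on a mon/mgr node (review the output before using it):
# list_start_commands /var/lib/ceph f64f341c-655d-11eb-8778-fa163e914bcc
```

Run it on each mon/mgr node, then check with ps -ef that the ceph-mon and ceph-mgr processes are back.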

Comment 5 Veera Raghava Reddy 2021-06-30 07:31:56 UTC
Hi Mike,
Can you review, from a support perspective, whether the recovery procedure is good enough to defer this BZ to 5.0z1?

Comment 10 Sebastian Wagner 2021-08-17 12:53:24 UTC
PR is merged in upstream, but not yet in z1

Comment 22 errata-xmlrpc 2021-11-02 16:38:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 5.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4105