Description of problem:
[cephadm] 5.0 - Stopping the mgr service with "ceph orch stop" makes the cluster inaccessible. We need a warning message and a --force option for the "stop" command for all services managed through ceph orch, for a better user experience.

Upstream tracker: https://tracker.ceph.com/issues/51298

Version-Release number of selected component (if applicable):
[ceph: root@ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp /]# ceph version
ceph version 16.2.0-78.el8cp (4c0b27dfdc25bd5a62233bef76eee4821089d79e) pacific (stable)

How reproducible:

Steps to Reproduce:
1. Deploy a 5.0 cluster with all Ceph services: mgr, mon, OSDs, RGW.
2. Stop a service with "ceph orch stop <service name>".
3. Observe the behaviour.

[ceph: root@ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp /]# ceph orch stop mgr
Scheduled to stop mgr.ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp on host 'ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp'
Scheduled to stop mgr.ceph-sshtest-1624545282997-node2-osd-mon-mgr-mds-node-exporter on host 'ceph-sshtest-1624545282997-node2-osd-mon-mgr-mds-node-exporter'
Scheduled to stop mgr.ceph-sshtest-1624545282997-node3-mon-osd-node-exporter-crash-rg on host 'ceph-sshtest-1624545282997-node3-mon-osd-node-exporter-crash-rg'

[ceph: root@ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp /]# ceph status
  cluster:
    id:     f64f341c-655d-11eb-8778-fa163e914bcc
    health: HEALTH_WARN
            no active mgr
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp,ceph-sshtest-1624545282997-node2-osd-mon-mgr-mds-node-exporter,ceph-sshtest-1624545282997-node3-mon-osd-node-exporter-crash-rg (age 108s)
    mgr: no daemons active (since 74s)
    mds: 1/1 daemons up, 1 standby
    osd: 12 osds: 11 up (since 2m), 11 in (since 38m)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   7 pools, 169 pgs
    objects: 243 objects, 43 KiB
    usage:   778 MiB used, 164 GiB / 165 GiB avail
    pgs:     169 active+clean

[ceph: root@ceph-sshtest-1624545282997-node1-installer-mon-mgr-osd-node-exp /]# ceph orch ls

Actual results:
The cluster becomes inaccessible: "ceph orch stop mgr" stops all mgr daemons of the service, leaving the cluster with no active mgr. Hence, we recommend adding a warning message, together with a --force option, to the "ceph orch stop" command, telling the user that it will stop every daemon of the given service.

NOTE: Until then, use systemctl to stop a single daemon, to avoid such failures at customer sites.

Upstream tracker for reference: https://tracker.ceph.com/issues/51298

Expected results:
A warning message to the user stating that this is the behaviour, and a --force argument for "stop".

Additional info:
10.0.211.15 cephuser/cephuser
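As noted above, a single daemon can be stopped without taking down the whole service. A minimal sketch, assuming cephadm's per-daemon orch commands and systemd unit naming; <daemon-id> and <fsid> are placeholders, not values from this cluster:

# List the mgr daemons and the hosts they run on:
ceph orch ps --daemon-type mgr
# Stop only one daemon through the orchestrator:
ceph orch daemon stop mgr.<daemon-id>
# Or, on that daemon's host, through systemd:
systemctl stop ceph-<fsid>@mgr.<daemon-id>.service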
moving this to 5.0z1
We can recover the cluster with the following steps. The issue is now seen only for MON and MGR; for other services such as OSDs and RGW the cluster stays accessible, but all daemons of the stopped service will be down. Hence, we need warning messages for all service types when "ceph orch stop <service name>" is used.

How to recover (see the sketch after these steps):
1. Go to /var/lib/ceph/<fsid> and note the mon service name.
2. Run "systemctl start ceph-<fsid>@<mon service name>" (the name found under /var/lib/ceph/<fsid>), and repeat this on all mon/mgr nodes.
3. Run "ps -ef | grep ceph-mon" (or ceph-mgr) to verify that the ceph-mon/ceph-mgr process was created.

Ex: systemctl start ceph-f64f341c-655d-11eb-8778-fa163e914bcc@<mon service name>
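A consolidated recovery sketch, assuming the fsid from the "ceph status" output above; the mon/mgr daemon names are placeholders to be replaced with the directory names actually found under /var/lib/ceph/<fsid>:

# On each mon/mgr host, find the daemon directories (mon.*, mgr.*):
ls /var/lib/ceph/f64f341c-655d-11eb-8778-fa163e914bcc/
# Start the corresponding cephadm systemd units:
systemctl start ceph-f64f341c-655d-11eb-8778-fa163e914bcc@mon.<hostname>
systemctl start ceph-f64f341c-655d-11eb-8778-fa163e914bcc@mgr.<hostname>.<suffix>
# Verify that the processes are back:
ps -ef | grep -E 'ceph-mon|ceph-mgr'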
Hi Mike, can you review, from a support perspective, whether the recovery procedure is good enough to defer this BZ to 5.0z1?
The PR is merged upstream, but is not yet in z1.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 5.0 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:4105