Bug 1946888

Summary: Host removal procedure
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Juan Miguel Olmo <jolmomar>
Component: Documentation
Assignee: Ranjini M N <rmandyam>
Status: CLOSED CURRENTRELEASE
QA Contact: Preethi <pnataraj>
Severity: high
Docs Contact:
Priority: unspecified
Version: 5.0
CC: hyelloji, kdreyer, rmandyam, sewagner, vashastr
Target Milestone: ---
Target Release: 5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-09-16 13:40:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Juan Miguel Olmo 2021-04-07 07:17:29 UTC
Describe the issue:
-------------------

We need to state clearly that in RHCS 5.0, Ceph nodes cannot be removed without first manually removing the daemons running on the host that we want to remove.

The upstream documentation about this fact is in:
https://docs.ceph.com/en/latest/cephadm/host-management/#removing-hosts
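
For reference, a minimal sketch of the flow described on the upstream page above (host names are placeholders; the exact wording of the downstream steps may differ):

[ceph: root@host01 /]# ceph orch host label add host03 _no_schedule    # stop scheduling new daemons on the host
[ceph: root@host01 /]# ceph orch ps host03                             # verify no daemons remain on the host
[ceph: root@host01 /]# ceph orch host rm host03                        # remove the host from the cluster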

Comment 1 RHEL Program Management 2021-04-07 07:17:35 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 5 Vasishta 2021-05-10 12:20:26 UTC
Hi Ranjini,

Thanks for the information.

Here are my comments on the existing steps -
1) Since cephadm is designed to migrate running daemons (apart from OSDs) to other nodes when the host label is set to _no_schedule, as explained in 2.b, I think step 2.a would be redundant, right?

https://docs.ceph.com/en/latest/cephadm/host-management/#special-host-labels
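
For illustration, applying the special label looks like this (the host name is a placeholder):

[ceph: root@host01 /]# ceph orch host label add host03 _no_schedule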

2) With the current behavior observed during testing, apart from adding the _no_schedule label, users also need to remove the host's other labels.
So we either need to provide instructions to remove the labels (as sketched below) or fix this in the orchestrator.
I will update with more info based on an offline conversation.
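
A minimal illustration of removing a label (host and label names are placeholders):

[ceph: root@host01 /]# ceph orch host label rm host03 mon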

3) There are multiple typos in 2.b:
In the syntax, remove the literal _no_schedule, since label_name already represents the label's name.
In the example, remove the extra '_' at the end of the label.

4) In the note we had already mentioned "If the node you want to remove is running OSDs, remove the OSDs before removing the hosts."
So I referred to section 2.8.8, "Removing the OSD daemons using the Ceph Orchestrator", to remove the OSDs from the cluster before proceeding.

From that perspective, step 2.c is redundant.
Please let me know your thoughts.
Can we provide step 2.c in section 2.8.8 somehow? I think step 2.c was missing in section 2.8.8 and could be a good addition.

5) In step 3, we need to highlight to users that the commands must be executed on the particular host which they want to remove (see the sketch below).
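
For illustration (this is my reading of what step 3 covers, not quoted from the doc), running the following directly on the host being removed lists the daemons still deployed there:

[root@host03 ~]# cephadm ls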


Regards,
Vasishta Shastry
QE, ceph

Comment 12 Preethi 2021-06-06 17:04:20 UTC
@Ranjini, Below are my comments

1) You can remove hosts of a Ceph cluster with the Ceph Orchestrators. You can also remove the node-exporter and crash services.

Note
If the node you want to remove is running OSDs, remove the OSDs before removing the hosts.
{PN} - We need to remove the first line above stating that you can also remove the node-exporter and crash services. Make it general, or remove that line, and the NOTE should be applicable to all services, not only OSDs.

Basically, we need to remove the services running on the respective host before applying the ceph orch host rm command to the host being removed (see the sketch below).
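
For illustration, checking that nothing is still running on the host before removing it (host name is a placeholder):

[ceph: root@host01 /]# ceph orch ps host03
[ceph: root@host01 /]# ceph orch host rm host03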

2)For all Ceph services, except node-exporter and crash, run the following commands:

Navigate to the path where the cluster.yml file is located:

Example

[ceph: root@host01 /]# cd /var/lib/ceph/cluster.yml/
Remove the host from the cluster.yml file:

Example

service_type: rgw
placement:
  hosts:
  - host01
  - host02
In this example, host03 is removed from the placement specification.

{PN} - Remove node-exporter and crash from point 2 and give a generic example. Also, cluster.yml is a file, not a directory, so cd into it is not a valid path.
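
As an illustration of how the edited placement specification would be applied afterwards (the file name here is hypothetical, not from the doc):

[ceph: root@host01 /]# ceph orch apply -i rgw.yml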

c) Remove the OSDs from the host:

Syntax

---
for osd_id in $(ceph orch ps HOST_NAME --daemon_type osd | grep osd | awk '{print  $1}' | cut -c 5-) ; do ceph orch osd rm OSD_ID ; done
---
Run this step one after the other to remove all the OSDs on the host.

Example

[ceph: root@host01 /]# for osd_id in $(ceph orch ps host03 --daemon_type osd | grep osd | awk '{print  $1}' | cut -c 5-) ; do ceph orch osd rm 7; done


{PN} If we are specifically showing how to remove OSDs, then we need to show the same for all services or give a generic example.
Also, the example above is incorrect:
ceph orch osd rm $osd_id should be used instead of 7 there (see the corrected sketch below).
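
For reference, a corrected sketch of the loop from the example above (host name is a placeholder; the loop variable replaces the hard-coded ID):

[ceph: root@host01 /]# for osd_id in $(ceph orch ps host03 --daemon_type osd | grep osd | awk '{print $1}' | cut -c 5-) ; do ceph orch osd rm $osd_id ; done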

********************************
NOTE :
1) I am not sure whether the node-exporter and crash services have separate removal steps, unlike other services - let's get it confirmed by Juan. If yes, let's keep them in a separate NOTE and make the steps for the other services generic.
2) We have BZ1886120, because of which services cannot be removed directly using host rm. Hence, the above is the workaround: remove the services on the host before removing the host.
Let's get clarity from Juan on this.

Comment 18 Sebastian Wagner 2021-06-09 12:21:41 UTC
LGTM

Comment 19 Preethi 2021-06-09 17:18:22 UTC
@ranjini, the docs look good to me. However, the NOTE provided should be generic for all services, not only OSDs - check with Juan on this as well, as we need to remove all services running on the host before applying the remove command.

Below are the observations after removing a host (i.e. MON3) from the cluster:

host ls --> the host is removed
no daemons are reported under the host
ceph orch ls --> reports 2/2, which is correct, so the mon is removed
ceph status --> still shows 3 mons, which is incorrect

The warnings below are seen when we do the host removal - we need to add stray host removal steps to the documentation under troubleshooting.

[ceph: root@ceph-sunil1adm-1622017621066-node1-installer-mon-mgr-osd-node-e /]# ceph health detail
HEALTH_WARN 8 stray daemon(s) not managed by cephadm; 1 stray host(s) with 8 daemon(s) not managed by cephadm; 3 daemons have recently crashed
[WRN] CEPHADM_STRAY_DAEMON: 8 stray daemon(s) not managed by cephadm
    stray daemon mds.cephfs.ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash.oipecp on host ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash not managed by cephadm
    stray daemon mon.ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash on host ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash not managed by cephadm
    stray daemon osd.10 on host ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash not managed by cephadm
    stray daemon osd.11 on host ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash not managed by cephadm
    stray daemon osd.8 on host ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash not managed by cephadm
    stray daemon osd.9 on host ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash not managed by cephadm
    stray daemon rgw.my-rgw.ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash.imkgco on host ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash not managed by cephadm
    stray daemon rgw.my-rgw.ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash.bvzgru on host ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash not managed by cephadm
[WRN] CEPHADM_STRAY_HOST: 1 stray host(s) with 8 daemon(s) not managed by cephadm
    stray host ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash has 8 stray daemons: ['mds.cephfs.ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash.oipecp', 'mon.ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash', 'osd.10', 'osd.11', 'osd.8', 'osd.9', 'rgw.my-rgw.ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash.imkgco', 'rgw.my-rgw.ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash.bvzgru']
[WRN] RECENT_CRASH: 3 daemons have recently crashed
    osd.5 crashed on host ceph-sunil1adm-1622017621066-node2-osd-mon-mgr-mds-node-exporte at 2021-06-02T19:53:18.196773Z
    osd.2 crashed on host ceph-sunil1adm-1622017621066-node1-installer-mon-mgr-osd-node-e at 2021-06-04T09:53:15.854728Z
    osd.0 crashed on host ceph-sunil1adm-1622017621066-node1-installer-mon-mgr-osd-node-e at 2021-06-04T09:53:15.856856Z
[ceph: root@ceph-sunil1adm-1622017621066-node1-installer-mon-mgr-osd-node-e /]# ceph -s
  cluster:
    id:     f64f341c-655d-11eb-8778-fa163e914bcc
    health: HEALTH_WARN
            8 stray daemon(s) not managed by cephadm
            1 stray host(s) with 8 daemon(s) not managed by cephadm
            3 daemons have recently crashed
 
  services:
    mon: 3 daemons, quorum ceph-sunil1adm-1622017621066-node1-installer-mon-mgr-osd-node-e,ceph-sunil1adm-1622017621066-node2-osd-mon-mgr-mds-node-exporte,ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash (age 80m)
    mgr: ceph-sunil1adm-1622017621066-node1-installer-mon-mgr-osd-node-e.mnwazb(active, since 92m), standbys: ceph-sunil1adm-1622017621066-node2-osd-mon-mgr-mds-node-exporte.newdrq
    mds: 1/1 daemons up, 1 standby
    osd: 12 osds: 12 up (since 90m), 12 in (since 9d)
    rgw: 4 daemons active (2 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   8 pools, 201 pgs
    objects: 396 objects, 57 KiB
    usage:   3.4 GiB used, 177 GiB / 180 GiB avail
    pgs:     201 active+clean
 
  io:
    client:   85 B/s rd, 0 op/s rd, 0 op/s wr
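
For context, a possible cleanup sketch for the stray entries above (these commands are my illustration, not steps from the doc under review; daemon and host names are placeholders, the fsid is the one shown in ceph -s): the stray mon can be removed from the monmap with ceph mon remove, and leftover daemons can be removed on the old host itself with cephadm rm-daemon:

[ceph: root@host01 /]# ceph mon remove ceph-node3
[root@ceph-node3 ~]# cephadm rm-daemon --name osd.10 --fsid f64f341c-655d-11eb-8778-fa163e914bcc --force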

Comment 20 Preethi 2021-06-10 06:56:05 UTC
@Ranjini, I recommend testing the workflow to check that the documented steps are correct and that no warnings/errors are seen when we perform the host removal.

Comment 21 Preethi 2021-06-14 11:23:27 UTC
@Ranjini, verified the changes in the link below. Looks good. However, we need to add a NOTE to set the OSDs to unmanaged with "ceph orch apply osd --all-available-devices --unmanaged=true" if the user has deployed OSDs with all available devices, before running the for loop for removing the OSDs (see the example after the link).

https://jenkins.dxp.redhat.com/job/CCS/job/ccs-mr-preview/337/artifact/preview/index.html#removing-hosts-using-the-ceph-orchestrator_ops
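
For illustration, the command as it would be run from the cephadm shell:

[ceph: root@host01 /]# ceph orch apply osd --all-available-devices --unmanaged=true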

Comment 23 Preethi 2021-06-16 08:17:55 UTC
Hi Ranjini,

Content looks good to me.

Regards,
Preethi