Bug 1946888
| Summary: | Host removal procedure | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Juan Miguel Olmo <jolmomar> |
| Component: | Documentation | Assignee: | Ranjini M N <rmandyam> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Preethi <pnataraj> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 5.0 | CC: | hyelloji, kdreyer, rmandyam, sewagner, vashastr |
| Target Milestone: | --- | | |
| Target Release: | 5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-09-16 13:40:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
|
Description
Juan Miguel Olmo
2021-04-07 07:17:29 UTC
Please specify the severity of this bug. Severity is defined here: https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Hi Ranjini,

Thanks for the information. Here are my comments on the existing steps:

1) Since cephadm is designed to migrate running daemons, apart from OSDs, to other nodes when the host is given the _no_schedule label as explained in 2.b, I think step 2.a would be redundant, right? See https://docs.ceph.com/en/latest/cephadm/host-management/#special-host-labels

2) With the current behavior during testing, we are observing that, apart from adding the _no_schedule label, users also need to remove the other labels. So either we need to provide instructions to remove labels, or fix this in the orchestrator. Will update with more information based on an offline conversation.

3) There are multiple typos in 2.b. In the syntax, remove _no_schedule, since label_name already represents the label's name. In the example, remove the extra '_' at the end of the label.

4) In the note we had already mentioned, "If the node you want to remove is running OSDs, remove the OSDs before removing the hosts." So I referred to section 2.8.8, "Removing the OSD daemons using the Ceph Orchestrator", to remove the OSDs from the cluster before proceeding. From that perspective, step 2.c is redundant. Please let me know your thoughts. Can we provide step 2.c in section 2.8.8 somehow? I think step 2.c was missing in section 2.8.8 and could be a good addition.

5) In step 3 we need to highlight to users that the commands have to be executed on the particular host which they want to remove.

Regards,
Vasishta Shastry
QE, Ceph

@Ranjini, below are my comments:
1) You can remove hosts of a Ceph cluster with the Ceph Orchestrators. You can also remove the node-exporter and crash services.
Note
If the node you want to remove is running OSDs, remove the OSDs before removing the hosts.
{PN} - We need to remove the first line above stating that you can also remove the node-exporter and crash services. Make it general, or remove that line, and the NOTE should be applicable to all services, not only to OSDs.
Basically, we need to remove the running services on the respective hosts before applying the ceph orch host rm command to the hosts being removed.
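For context, here is a minimal sketch of the host removal flow being discussed, assuming a hypothetical host host03; the _no_schedule label behavior is the one described in the upstream cephadm documentation linked in the first comment:
Example (illustrative)
---
ceph orch host label add host03 _no_schedule   # stop cephadm from scheduling new daemons on this host
ceph orch ps host03                            # confirm which daemons are still running there
ceph orch host rm host03                       # remove the host once no daemons remain on it
---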
2) For all Ceph services, except node-exporter and crash, run the following commands:
Navigate to the path where the cluster.yml file is located:
Example
[ceph: root@host01 /]# cd /var/lib/ceph/cluster.yml/
Remove the host from the cluster.yml file:
Example
service_type: rgw
placement:
hosts:
- host01
- host02
In this example, host03 is removed from the placement specification.
{PN} - Remove node-exporter and crash from point 2 and give a generic example. Also, cluster.yml is a file, not a directory, so cd to that path is not valid.
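If the spec-file approach is kept, the edited placement specification would typically be re-applied with ceph orch apply so the orchestrator reschedules the daemons accordingly; this is a hedged sketch and the file path is illustrative only:
Example (illustrative)
---
# Re-apply the specification after removing the host from its placement section.
[ceph: root@host01 /]# ceph orch apply -i /var/lib/ceph/rgw-spec.yml
---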
c) Remove the OSDs from the host:
Syntax
---
for osd_id in $(ceph orch ps HOST_NAME --daemon_type osd | grep osd | awk '{print $1}' | cut -c 5-) ; do ceph orch osd rm OSD_ID ; done
---
Run this step one after the other to remove all the OSDs on the host.
Example
[ceph: root@host01 /]# for osd_id in $(ceph orch ps host03 --daemon_type osd | grep osd | awk '{print $1}' | cut -c 5-) ; do ceph orch osd rm 7; done
{PN} - If we are specifying how to remove OSDs, then we need to show the same for all services, or give a generic example.
Also, the example above is incorrect:
ceph orch osd rm $osd_id should be used instead of ceph orch osd rm 7.
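For clarity, a corrected version of that example might look like the following sketch, using the loop variable instead of a hard-coded OSD ID (host03 is the example host from the draft):
Example (illustrative)
---
[ceph: root@host01 /]# for osd_id in $(ceph orch ps host03 --daemon_type osd | grep osd | awk '{print $1}' | cut -c 5-) ; do ceph orch osd rm $osd_id ; done
---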
********************************
NOTE :
1) Not sure if the node-exporter and crash services have separate removal steps, unlike other services - let's get it confirmed by Juan. If yes, let's keep them separate in a NOTE and make the other services generic.
2) We have BZ 1886120, where services cannot be removed directly using host rm. Hence, the above is the workaround to remove the services on the host before removing the host.
Let's get clarity from Juan on this.
LGTM @ranjini, the docs look good to me. However, the NOTE provided should be generic for all services, not only for OSDs - check with Juan on this as well, as we need to remove all services running on the host before applying the remove command.
Below are the observations after removing the host, i.e., MON3, from the cluster:
host ls -> the host is removed
no daemons are reported under the host
ceph orch ls -> reporting 2/2, which is correct, so the mon is removed
ceph status -> still showing 3 mons, which is incorrect
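As a possible follow-up check (a hedged sketch, not from the draft docs), the monitor map can be inspected and a leftover monitor removed manually; the monitor name below is illustrative:
Example (illustrative)
---
[ceph: root@host01 /]# ceph mon dump            # list the monitors currently in the monmap
[ceph: root@host01 /]# ceph mon remove host03   # drop the leftover monitor entry for the removed host
---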
The warnings below are seen when we do the host removal - we need to add stray host removal steps in the documentation under the troubleshooting steps:
[ceph: root@ceph-sunil1adm-1622017621066-node1-installer-mon-mgr-osd-node-e /]# ceph health detail
HEALTH_WARN 8 stray daemon(s) not managed by cephadm; 1 stray host(s) with 8 daemon(s) not managed by cephadm; 3 daemons have recently crashed
[WRN] CEPHADM_STRAY_DAEMON: 8 stray daemon(s) not managed by cephadm
stray daemon mds.cephfs.ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash.oipecp on host ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash not managed by cephadm
stray daemon mon.ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash on host ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash not managed by cephadm
stray daemon osd.10 on host ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash not managed by cephadm
stray daemon osd.11 on host ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash not managed by cephadm
stray daemon osd.8 on host ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash not managed by cephadm
stray daemon osd.9 on host ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash not managed by cephadm
stray daemon rgw.my-rgw.ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash.imkgco on host ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash not managed by cephadm
stray daemon rgw.my-rgw.ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash.bvzgru on host ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash not managed by cephadm
[WRN] CEPHADM_STRAY_HOST: 1 stray host(s) with 8 daemon(s) not managed by cephadm
stray host ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash has 8 stray daemons: ['mds.cephfs.ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash.oipecp', 'mon.ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash', 'osd.10', 'osd.11', 'osd.8', 'osd.9', 'rgw.my-rgw.ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash.imkgco', 'rgw.my-rgw.ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash.bvzgru']
[WRN] RECENT_CRASH: 3 daemons have recently crashed
osd.5 crashed on host ceph-sunil1adm-1622017621066-node2-osd-mon-mgr-mds-node-exporte at 2021-06-02T19:53:18.196773Z
osd.2 crashed on host ceph-sunil1adm-1622017621066-node1-installer-mon-mgr-osd-node-e at 2021-06-04T09:53:15.854728Z
osd.0 crashed on host ceph-sunil1adm-1622017621066-node1-installer-mon-mgr-osd-node-e at 2021-06-04T09:53:15.856856Z
[ceph: root@ceph-sunil1adm-1622017621066-node1-installer-mon-mgr-osd-node-e /]# ceph -s
cluster:
id: f64f341c-655d-11eb-8778-fa163e914bcc
health: HEALTH_WARN
8 stray daemon(s) not managed by cephadm
1 stray host(s) with 8 daemon(s) not managed by cephadm
3 daemons have recently crashed
services:
mon: 3 daemons, quorum ceph-sunil1adm-1622017621066-node1-installer-mon-mgr-osd-node-e,ceph-sunil1adm-1622017621066-node2-osd-mon-mgr-mds-node-exporte,ceph-sunil1adm-1622017621066-node3-mon-osd-node-exporter-crash (age 80m)
mgr: ceph-sunil1adm-1622017621066-node1-installer-mon-mgr-osd-node-e.mnwazb(active, since 92m), standbys: ceph-sunil1adm-1622017621066-node2-osd-mon-mgr-mds-node-exporte.newdrq
mds: 1/1 daemons up, 1 standby
osd: 12 osds: 12 up (since 90m), 12 in (since 9d)
rgw: 4 daemons active (2 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 8 pools, 201 pgs
objects: 396 objects, 57 KiB
usage: 3.4 GiB used, 177 GiB / 180 GiB avail
pgs: 201 active+clean
io:
client: 85 B/s rd, 0 op/s rd, 0 op/s wr
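Regarding the CEPHADM_STRAY_DAEMON and CEPHADM_STRAY_HOST warnings above, a hedged sketch of the kind of troubleshooting step that could be documented: remove the leftover daemons on the old host with cephadm rm-daemon, or silence the warnings via the cephadm mgr settings. The host prompt and daemon name below are illustrative; the fsid is taken from the output above.
Example (illustrative)
---
# On the removed host, delete a leftover daemon (daemon name is an example only):
[root@removed-host ~]# cephadm rm-daemon --name osd.10 --fsid f64f341c-655d-11eb-8778-fa163e914bcc --force

# Or, from the cluster, silence the stray warnings if they are expected:
[ceph: root@host01 /]# ceph config set mgr mgr/cephadm/warn_on_stray_hosts false
[ceph: root@host01 /]# ceph config set mgr mgr/cephadm/warn_on_stray_daemons false
---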
@Ranjini, I recommend testing the workflow to check that the steps we documented are correct and that no warnings/errors are seen when we perform the host removal.

@Ranjini, verified the changes in the link below. Looks good. However, we need to add a NOTE to set the OSDs to unmanaged with "ceph orch apply osd --all-available-devices --unmanaged=true" if the user has deployed with all-available devices, before performing the for loop for removing the OSDs.
https://jenkins.dxp.redhat.com/job/CCS/job/ccs-mr-preview/337/artifact/preview/index.html#removing-hosts-using-the-ceph-orchestrator_ops

Hi Ranjini,
Content looks good to me.
Regards, Preethi
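For reference, a minimal sketch of the suggested NOTE; the command is the one quoted in the comment above, run before the OSD removal loop so that cephadm does not re-deploy OSDs on the available devices:
Example (illustrative)
---
[ceph: root@host01 /]# ceph orch apply osd --all-available-devices --unmanaged=true
---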