Bug 1812796

Summary: | [DOCS] RHCS 4 Need ceph-ansible [mds] uninstall documentation | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Jerrin Jose <jjose>
Component: | Documentation | Assignee: | Ranjini M N <rmandyam>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Yogesh Mane <ymane>
Severity: | medium | Docs Contact: | Aron Gunn <agunn>
Priority: | unspecified | |
Version: | 4.0 | CC: | agunn, asriram, hyelloji, kdreyer, rmandyam, ymane
Target Milestone: | rc | |
Target Release: | 4.3 | |
Hardware: | x86_64 | |
OS: | Linux | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Last Closed: | 2021-09-16 13:40:47 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Bug Blocks: | 1966534 | |
Comment 1
John Brier
2021-03-01 21:25:35 UTC
I tested this procedure. In my testing it did remove the node from the cluster but not from the Ansible inventory file, so it would be reprovisioned if site.yml/site-containers.yml was run again. I wonder if all the shrink playbooks fail to remove the node from the inventory file. If so, we should mention that, or add a step to do it, in the procedures we already have on removing OSDs and MONs.

Cool thing: I asked it to remove the active MDS and it did that and made the standby the new active. When I reran it on the remaining active MDS, it removed the FS too. It does not remove the pools, however.

Testing logs:

== Cluster state before remove

[admin@cluster1-node1 infrastructure-playbooks]$ cat ../hosts
[grafana-server]
cluster1-node2

[mons]
cluster1-node2
cluster1-node3
cluster1-node4

[osds]
cluster1-node2
cluster1-node3
cluster1-node4
cluster1-node5
cluster1-node6

[mgrs]
cluster1-node2
cluster1-node3
cluster1-node4

[mdss]
cluster1-node5
cluster1-node6

[clients]

[root@cluster1-node2 ~]# ceph -s
  cluster:
    id:     bb89661e-7d6c-48af-8473-ebfe6c2cdc31
    health: HEALTH_WARN
            2 pool(s) have non-power-of-two pg_num

  services:
    mon: 3 daemons, quorum cluster1-node2,cluster1-node3,cluster1-node4 (age 6m)
    mgr: cluster1-node3(active, since 6m), standbys: cluster1-node4, cluster1-node2
    mds: cephfs:1 {0=cluster1-node5=up:active} 1 up:standby
    osd: 5 osds: 5 up (since 6m), 5 in (since 6m)

  task status:
    scrub status:
        mds.cluster1-node5: idle

  data:
    pools:   10 pools, 401 pgs
    objects: 1.54k objects, 5.6 GiB
    usage:   23 GiB used, 52 GiB / 75 GiB avail
    pgs:     401 active+clean

[root@cluster1-node2 ~]# ceph mds stat
cephfs:1 {0=cluster1-node5=up:active} 1 up:standby

[root@cluster1-node2 ~]# ceph fs dump
dumped fsmap epoch 2487
e2487
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 2

Filesystem 'cephfs' (2)
fs_name cephfs
epoch   2487
flags   12
created 2021-02-18 17:32:23.929674
modified        2021-03-01 16:35:44.159538
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
min_compat_client       -1 (unspecified)
last_failure    0
last_failure_osd_epoch  1044
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in      0
up      {0=454117}
failed
damaged
stopped
data_pools      [16]
metadata_pool   17
inline_data     disabled
balancer
standby_count_wanted    1
[mds.cluster1-node5{0:454117} state up:active seq 733 addr [v2:192.168.0.35:6800/2671665574,v1:192.168.0.35:6801/2671665574]]

Standby daemons:

[mds.cluster1-node6{-1:454128} state up:standby seq 1 addr [v2:192.168.0.36:6800/3651113900,v1:192.168.0.36:6801/3651113900]]

[root@cluster1-node3 ~]# ceph osd pool ls
block-device-pool
device_health_metrics
.rgw.root
default.rgw.control
default.rgw.meta
default.rgw.log
default.rgw.otp
trash-test-pool
cephfs_data
cephfs_metadata

[root@client1 ~]# mount -t ceph :/ /mnt/cephfs/ -o name=1
[root@client1 ~]# df -h
Filesystem              Size  Used Avail Use% Mounted on
devtmpfs                880M     0  880M   0% /dev
tmpfs                   897M   84K  896M   1% /dev/shm
tmpfs                   897M   18M  879M   2% /run
tmpfs                   897M     0  897M   0% /sys/fs/cgroup
/dev/mapper/rhel-root    13G  2.7G  9.9G  22% /
/dev/nvme0n1p1         1014M  324M  691M  32% /boot
shm                      63M     0   63M   0%
/var/lib/containers/storage/overlay-containers/f558efc6f3f8714ee7e2c89547a94238eb2c3e0bda8444fdcaefd730328f43ce/userdata/shm overlay 13G 2.7G 9.9G 22% /var/lib/containers/storage/overlay/dc694a6f437c1be6e0b8de4d8802ca2126ac6c898ae2f0f13dd16b2ee6f454d4/merged tmpfs 180M 0 180M 0% /run/user/0 192.168.0.32:6789,192.168.0.33:6789,192.168.0.34:6789:/ 16G 0 16G 0% /mnt/cephfs [root@client1 ~]# ls /mnt/cephfs/ [root@client1 ~]# ls anaconda-ks.cfg ceph.client.1.secret.backup ceph.client.admin.secret.backup == Remove active MDS [admin@cluster1-node1 ceph-ansible]$ ansible-playbook infrastructure-playbooks/shrink-mds.yml -e mds_to_kill=cluster1-node5 -i hosts PLAY [gather facts and check the init system] **************************************************************** TASK [Gathering Facts] *************************************************************************************** Wednesday 03 March 2021 13:55:04 -0500 (0:00:00.052) 0:00:00.052 ******* ok: [cluster1-node4] ok: [cluster1-node2] ok: [cluster1-node6] ok: [cluster1-node5] ok: [cluster1-node3] TASK [debug] ************************************************************************************************* Wednesday 03 March 2021 13:55:07 -0500 (0:00:02.584) 0:00:02.636 ******* ok: [cluster1-node2] => msg: gather facts on all Ceph hosts for following reference ok: [cluster1-node3] => msg: gather facts on all Ceph hosts for following reference ok: [cluster1-node4] => msg: gather facts on all Ceph hosts for following reference ok: [cluster1-node5] => msg: gather facts on all Ceph hosts for following reference ok: [cluster1-node6] => msg: gather facts on all Ceph hosts for following reference TASK [ceph-facts : check if podman binary is present] ******************************************************** Wednesday 03 March 2021 13:55:07 -0500 (0:00:00.096) 0:00:02.733 ******* ok: [cluster1-node3] ok: [cluster1-node5] ok: [cluster1-node4] ok: [cluster1-node6] ok: [cluster1-node2] TASK [ceph-facts : set_fact container_binary] **************************************************************** Wednesday 03 March 2021 13:55:08 -0500 (0:00:00.934) 0:00:03.667 ******* ok: [cluster1-node2] ok: [cluster1-node3] ok: [cluster1-node4] ok: [cluster1-node5] ok: [cluster1-node6] Are you sure you want to shrink the cluster? 
[no]: yes PLAY [perform checks, remove mds and print cluster health] *************************************************** TASK [exit playbook, if no mds was given] ******************************************************************** Wednesday 03 March 2021 13:55:37 -0500 (0:00:29.090) 0:00:32.758 ******* skipping: [cluster1-node2] TASK [exit playbook, if the mds is not part of the inventory] ************************************************ Wednesday 03 March 2021 13:55:37 -0500 (0:00:00.020) 0:00:32.779 ******* skipping: [cluster1-node2] TASK [exit playbook, if user did not mean to shrink cluster] ************************************************* Wednesday 03 March 2021 13:55:37 -0500 (0:00:00.021) 0:00:32.800 ******* skipping: [cluster1-node2] TASK [set_fact container_exec_cmd for mon0] ****************************************************************** Wednesday 03 March 2021 13:55:37 -0500 (0:00:00.032) 0:00:32.832 ******* skipping: [cluster1-node2] TASK [exit playbook, if can not connect to the cluster] ****************************************************** Wednesday 03 March 2021 13:55:37 -0500 (0:00:00.022) 0:00:32.855 ******* changed: [cluster1-node2] TASK [set_fact mds_to_kill_hostname] ************************************************************************* Wednesday 03 March 2021 13:55:38 -0500 (0:00:00.602) 0:00:33.457 ******* ok: [cluster1-node2] TASK [exit mds when containerized deployment] **************************************************************** Wednesday 03 March 2021 13:55:38 -0500 (0:00:00.028) 0:00:33.485 ******* skipping: [cluster1-node2] TASK [get ceph status] *************************************************************************************** Wednesday 03 March 2021 13:55:38 -0500 (0:00:00.019) 0:00:33.505 ******* changed: [cluster1-node2] TASK [set_fact current_max_mds] ****************************************************************************** Wednesday 03 March 2021 13:55:38 -0500 (0:00:00.513) 0:00:34.019 ******* ok: [cluster1-node2] TASK [fail if removing that mds node wouldn't satisfy max_mds anymore] *************************************** Wednesday 03 March 2021 13:55:38 -0500 (0:00:00.021) 0:00:34.041 ******* skipping: [cluster1-node2] TASK [stop mds service] ************************************************************************************** Wednesday 03 March 2021 13:55:38 -0500 (0:00:00.038) 0:00:34.079 ******* changed: [cluster1-node2 -> cluster1-node5] TASK [ensure that the mds is stopped] ************************************************************************ Wednesday 03 March 2021 13:55:42 -0500 (0:00:03.622) 0:00:37.702 ******* changed: [cluster1-node2 -> cluster1-node5] TASK [get new ceph status] *********************************************************************************** Wednesday 03 March 2021 13:55:42 -0500 (0:00:00.216) 0:00:37.918 ******* changed: [cluster1-node2] TASK [get active mds nodes list] ***************************************************************************** Wednesday 03 March 2021 13:55:43 -0500 (0:00:00.467) 0:00:38.386 ******* ok: [cluster1-node2] => (item={'filesystem_id': 2, 'rank': 0, 'name': 'cluster1-node6', 'status': 'up:rejoin', 'gid': 454128}) TASK [get ceph fs dump status] ******************************************************************************* Wednesday 03 March 2021 13:55:43 -0500 (0:00:00.031) 0:00:38.417 ******* changed: [cluster1-node2] TASK [create a list of standby mdss] ************************************************************************* Wednesday 03 March 2021 13:55:43 -0500 
(0:00:00.472) 0:00:38.890 ******* ok: [cluster1-node2] TASK [fail if mds just killed is being reported as active or standby] **************************************** Wednesday 03 March 2021 13:55:43 -0500 (0:00:00.018) 0:00:38.909 ******* skipping: [cluster1-node2] TASK [delete the filesystem when killing last mds] *********************************************************** Wednesday 03 March 2021 13:55:43 -0500 (0:00:00.019) 0:00:38.928 ******* skipping: [cluster1-node2] TASK [purge mds store] *************************************************************************************** Wednesday 03 March 2021 13:55:43 -0500 (0:00:00.029) 0:00:38.958 ******* changed: [cluster1-node2 -> cluster1-node5] TASK [show ceph health] ************************************************************************************** Wednesday 03 March 2021 13:55:43 -0500 (0:00:00.345) 0:00:39.303 ******* changed: [cluster1-node2] PLAY RECAP *************************************************************************************************** cluster1-node2 : ok=16 changed=8 unreachable=0 failed=0 skipped=8 rescued=0 ignored=0 cluster1-node3 : ok=4 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0 cluster1-node4 : ok=4 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0 cluster1-node5 : ok=4 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0 cluster1-node6 : ok=4 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0 Wednesday 03 March 2021 13:55:44 -0500 (0:00:00.486) 0:00:39.790 ******* =============================================================================== ceph-facts : set_fact container_binary --------------------------------------------------------------- 29.09s stop mds service -------------------------------------------------------------------------------------- 3.62s Gathering Facts --------------------------------------------------------------------------------------- 2.58s ceph-facts : check if podman binary is present -------------------------------------------------------- 0.93s exit playbook, if can not connect to the cluster ------------------------------------------------------ 0.60s get ceph status --------------------------------------------------------------------------------------- 0.51s show ceph health -------------------------------------------------------------------------------------- 0.49s get ceph fs dump status ------------------------------------------------------------------------------- 0.47s get new ceph status ----------------------------------------------------------------------------------- 0.47s purge mds store --------------------------------------------------------------------------------------- 0.35s ensure that the mds is stopped ------------------------------------------------------------------------ 0.22s debug ------------------------------------------------------------------------------------------------- 0.10s fail if removing that mds node wouldn't satisfy max_mds anymore --------------------------------------- 0.04s exit playbook, if user did not mean to shrink cluster ------------------------------------------------- 0.03s get active mds nodes list ----------------------------------------------------------------------------- 0.03s delete the filesystem when killing last mds ----------------------------------------------------------- 0.03s set_fact mds_to_kill_hostname ------------------------------------------------------------------------- 0.03s set_fact container_exec_cmd for mon0 
------------------------------------------------------------------ 0.02s exit playbook, if the mds is not part of the inventory ------------------------------------------------ 0.02s set_fact current_max_mds ------------------------------------------------------------------------------ 0.02s == Cluster state after removal of first MDS [root@cluster1-node2 ~]# ceph -s cluster: id: bb89661e-7d6c-48af-8473-ebfe6c2cdc31 health: HEALTH_WARN insufficient standby MDS daemons available 2 pool(s) have non-power-of-two pg_num services: mon: 3 daemons, quorum cluster1-node2,cluster1-node3,cluster1-node4 (age 16m) mgr: cluster1-node3(active, since 16m), standbys: cluster1-node4, cluster1-node2 mds: cephfs:1 {0=cluster1-node6=up:active} osd: 5 osds: 5 up (since 16m), 5 in (since 16m) task status: scrub status: mds.cluster1-node6: idle data: pools: 10 pools, 401 pgs objects: 1.54k objects, 5.6 GiB usage: 23 GiB used, 52 GiB / 75 GiB avail pgs: 401 active+clean [root@cluster1-node2 ~]# ceph fs dump dumped fsmap epoch 2492 e2492 enable_multiple, ever_enabled_multiple: 0,0 compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2} legacy client fscid: 2 Filesystem 'cephfs' (2) fs_name cephfs epoch 2492 flags 12 created 2021-02-18 17:32:23.929674 modified 2021-03-03 13:55:43.234398 tableserver 0 root 0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 min_compat_client -1 (unspecified) last_failure 0 last_failure_osd_epoch 1111 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2} max_mds 1 in 0 up {0=454128} failed damaged stopped data_pools [16] metadata_pool 17 inline_data disabled balancer standby_count_wanted 1 [mds.cluster1-node6{0:454128} state up:active seq 41512 addr [v2:192.168.0.36:6800/3651113900,v1:192.168.0.36:6801/3651113900]] [root@client1 ~]# mount -t ceph :/ /mnt/cephfs/ -o name=1 [root@client1 ~]# umount /mnt/cephfs == Removal of second MDS [admin@cluster1-node1 ceph-ansible]$ ansible-playbook infrastructure-playbooks/shrink-mds.yml -e mds_to_kill=cluster1-node6 -i hosts PLAY [gather facts and check the init system] **************************************************************** TASK [debug] ************************************************************************************************* Wednesday 03 March 2021 14:01:26 -0500 (0:00:00.070) 0:00:00.070 ******* ok: [cluster1-node2] => msg: gather facts on all Ceph hosts for following reference ok: [cluster1-node3] => msg: gather facts on all Ceph hosts for following reference ok: [cluster1-node4] => msg: gather facts on all Ceph hosts for following reference ok: [cluster1-node5] => msg: gather facts on all Ceph hosts for following reference ok: [cluster1-node6] => msg: gather facts on all Ceph hosts for following reference TASK [ceph-facts : check if podman binary is present] ******************************************************** Wednesday 03 March 2021 14:01:26 -0500 (0:00:00.109) 0:00:00.180 ******* ok: [cluster1-node4] ok: [cluster1-node3] ok: [cluster1-node5] ok: [cluster1-node6] ok: [cluster1-node2] TASK [ceph-facts : set_fact container_binary] **************************************************************** Wednesday 03 March 2021 
14:01:27 -0500 (0:00:00.630) 0:00:00.810 ******* ok: [cluster1-node2] ok: [cluster1-node3] ok: [cluster1-node4] ok: [cluster1-node5] ok: [cluster1-node6] Are you sure you want to shrink the cluster? [no]: yes PLAY [perform checks, remove mds and print cluster health] *************************************************** TASK [exit playbook, if no mds was given] ******************************************************************** Wednesday 03 March 2021 14:01:29 -0500 (0:00:02.253) 0:00:03.064 ******* skipping: [cluster1-node2] TASK [exit playbook, if the mds is not part of the inventory] ************************************************ Wednesday 03 March 2021 14:01:29 -0500 (0:00:00.027) 0:00:03.091 ******* skipping: [cluster1-node2] TASK [exit playbook, if user did not mean to shrink cluster] ************************************************* Wednesday 03 March 2021 14:01:29 -0500 (0:00:00.025) 0:00:03.117 ******* skipping: [cluster1-node2] TASK [set_fact container_exec_cmd for mon0] ****************************************************************** Wednesday 03 March 2021 14:01:29 -0500 (0:00:00.037) 0:00:03.154 ******* skipping: [cluster1-node2] TASK [exit playbook, if can not connect to the cluster] ****************************************************** Wednesday 03 March 2021 14:01:29 -0500 (0:00:00.024) 0:00:03.179 ******* changed: [cluster1-node2] TASK [set_fact mds_to_kill_hostname] ************************************************************************* Wednesday 03 March 2021 14:01:30 -0500 (0:00:00.577) 0:00:03.756 ******* ok: [cluster1-node2] TASK [exit mds when containerized deployment] **************************************************************** Wednesday 03 March 2021 14:01:30 -0500 (0:00:00.038) 0:00:03.795 ******* skipping: [cluster1-node2] TASK [get ceph status] *************************************************************************************** Wednesday 03 March 2021 14:01:30 -0500 (0:00:00.024) 0:00:03.819 ******* changed: [cluster1-node2] TASK [set_fact current_max_mds] ****************************************************************************** Wednesday 03 March 2021 14:01:30 -0500 (0:00:00.476) 0:00:04.296 ******* ok: [cluster1-node2] TASK [fail if removing that mds node wouldn't satisfy max_mds anymore] *************************************** Wednesday 03 March 2021 14:01:30 -0500 (0:00:00.024) 0:00:04.321 ******* skipping: [cluster1-node2] TASK [stop mds service] ************************************************************************************** Wednesday 03 March 2021 14:01:30 -0500 (0:00:00.038) 0:00:04.359 ******* changed: [cluster1-node2 -> cluster1-node6] TASK [ensure that the mds is stopped] ************************************************************************ Wednesday 03 March 2021 14:01:34 -0500 (0:00:03.851) 0:00:08.211 ******* changed: [cluster1-node2 -> cluster1-node6] TASK [get new ceph status] *********************************************************************************** Wednesday 03 March 2021 14:01:34 -0500 (0:00:00.215) 0:00:08.426 ******* changed: [cluster1-node2] TASK [get active mds nodes list] ***************************************************************************** Wednesday 03 March 2021 14:01:35 -0500 (0:00:00.473) 0:00:08.899 ******* TASK [get ceph fs dump status] ******************************************************************************* Wednesday 03 March 2021 14:01:35 -0500 (0:00:00.022) 0:00:08.921 ******* changed: [cluster1-node2] TASK [create a list of standby mdss] 
************************************************************************* Wednesday 03 March 2021 14:01:35 -0500 (0:00:00.507) 0:00:09.429 ******* ok: [cluster1-node2] TASK [fail if mds just killed is being reported as active or standby] **************************************** Wednesday 03 March 2021 14:01:35 -0500 (0:00:00.021) 0:00:09.450 ******* skipping: [cluster1-node2] TASK [delete the filesystem when killing last mds] *********************************************************** Wednesday 03 March 2021 14:01:35 -0500 (0:00:00.022) 0:00:09.472 ******* changed: [cluster1-node2] TASK [purge mds store] *************************************************************************************** Wednesday 03 March 2021 14:01:36 -0500 (0:00:01.005) 0:00:10.478 ******* changed: [cluster1-node2 -> cluster1-node6] TASK [show ceph health] ************************************************************************************** Wednesday 03 March 2021 14:01:37 -0500 (0:00:00.331) 0:00:10.809 ******* changed: [cluster1-node2] PLAY RECAP *************************************************************************************************** cluster1-node2 : ok=15 changed=9 unreachable=0 failed=0 skipped=8 rescued=0 ignored=0 cluster1-node3 : ok=3 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0 cluster1-node4 : ok=3 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0 cluster1-node5 : ok=3 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0 cluster1-node6 : ok=3 changed=0 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0 Wednesday 03 March 2021 14:01:37 -0500 (0:00:00.530) 0:00:11.339 ******* =============================================================================== stop mds service -------------------------------------------------------------------------------------- 3.85s ceph-facts : set_fact container_binary ---------------------------------------------------------------- 2.25s delete the filesystem when killing last mds ----------------------------------------------------------- 1.01s ceph-facts : check if podman binary is present -------------------------------------------------------- 0.63s exit playbook, if can not connect to the cluster ------------------------------------------------------ 0.58s show ceph health -------------------------------------------------------------------------------------- 0.53s get ceph fs dump status ------------------------------------------------------------------------------- 0.51s get ceph status --------------------------------------------------------------------------------------- 0.48s get new ceph status ----------------------------------------------------------------------------------- 0.47s purge mds store --------------------------------------------------------------------------------------- 0.33s ensure that the mds is stopped ------------------------------------------------------------------------ 0.22s debug ------------------------------------------------------------------------------------------------- 0.11s fail if removing that mds node wouldn't satisfy max_mds anymore --------------------------------------- 0.04s set_fact mds_to_kill_hostname ------------------------------------------------------------------------- 0.04s exit playbook, if user did not mean to shrink cluster ------------------------------------------------- 0.04s exit playbook, if no mds was given -------------------------------------------------------------------- 0.03s exit playbook, if the mds is not part of the inventory 
------------------------------------------------ 0.03s set_fact current_max_mds ------------------------------------------------------------------------------ 0.02s set_fact container_exec_cmd for mon0 ------------------------------------------------------------------ 0.02s exit mds when containerized deployment ---------------------------------------------------------------- 0.02s == Cluster state after removal of second MDS [root@cluster1-node2 ~]# ceph -s cluster: id: bb89661e-7d6c-48af-8473-ebfe6c2cdc31 health: HEALTH_WARN 2 pool(s) have non-power-of-two pg_num services: mon: 3 daemons, quorum cluster1-node2,cluster1-node3,cluster1-node4 (age 41m) mgr: cluster1-node3(active, since 40m), standbys: cluster1-node4, cluster1-node2 osd: 5 osds: 5 up (since 41m), 5 in (since 41m) data: pools: 10 pools, 401 pgs objects: 1.54k objects, 5.6 GiB usage: 23 GiB used, 52 GiB / 75 GiB avail pgs: 401 active+clean [root@cluster1-node2 ~]# [root@cluster1-node2 ~]# ceph fs dump dumped fsmap epoch 2494 e2494 enable_multiple, ever_enabled_multiple: 0,0 compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2} legacy client fscid: -1 No filesystems configured [root@cluster1-node3 ~]# ceph osd pool ls block-device-pool device_health_metrics .rgw.root default.rgw.control default.rgw.meta default.rgw.log default.rgw.otp trash-test-pool cephfs_data cephfs_metadata [root@cluster1-node3 ~]# [root@client1 ~]# mount -t ceph :/ /mnt/cephfs/ -o name=1 mount error 110 = Connection timed out [root@client1 ~]# In the previous comment I said "In my testing it did remove the node from the cluster but not from the Ansible inventory file, so it would be reprovisioned if site.yml/site-containers.yml was run again." I just reran site.yml and it did try to reprovision the MDS servers but failed because the old pools with objects in them still exist: fatal: [cluster1-node5 -> cluster1-node2]: FAILED! => changed=false cmd: - ceph - --cluster - ceph - fs - new - cephfs - cephfs_metadata - cephfs_data delta: '0:00:00.312410' end: '2021-03-04 11:47:57.381653' invocation: module_args: _raw_params: ceph --cluster ceph fs new cephfs cephfs_metadata cephfs_data _uses_shell: false argv: null chdir: null creates: null executable: null removes: null stdin: null stdin_add_newline: true strip_empty_ends: true warn: true msg: non-zero return code rc: 22 start: '2021-03-04 11:47:57.069243' stderr: 'Error EINVAL: pool ''cephfs_metadata'' already contains some objects. Use an empty pool instead.' 
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>

NO MORE HOSTS LEFT ******************************************************************************************

PLAY RECAP **************************************************************************************************
client1        : ok=51  changed=3  unreachable=0 failed=0 skipped=141 rescued=0 ignored=0
cluster1-node2 : ok=328 changed=24 unreachable=0 failed=0 skipped=438 rescued=0 ignored=0
cluster1-node3 : ok=280 changed=23 unreachable=0 failed=0 skipped=396 rescued=0 ignored=0
cluster1-node4 : ok=286 changed=23 unreachable=0 failed=0 skipped=396 rescued=0 ignored=0
cluster1-node5 : ok=183 changed=16 unreachable=0 failed=1 skipped=331 rescued=0 ignored=0
cluster1-node6 : ok=165 changed=14 unreachable=0 failed=0 skipped=313 rescued=0 ignored=0

INSTALLER STATUS ********************************************************************************************
Install Ceph Monitor       : Complete (0:01:45)
Install Ceph Manager       : Complete (0:01:50)
Install Ceph OSD           : Complete (0:03:15)
Install Ceph MDS           : In Progress (0:00:26)

I noticed in the FS Guide for the procedure to remove a CephFS [1] there is an optional step at the end to remove the pools, and if that is done and I run site.yml again it successfully reprovisions the MDS servers:

[root@cluster1-node2 ~]# ceph osd pool delete cephfs_metadata cephfs_metadata --yes-i-really-really-mean-it
pool 'cephfs_metadata' removed
[root@cluster1-node2 ~]# ceph osd pool delete cephfs_data cephfs_data --yes-i-really-really-mean-it
pool 'cephfs_data' removed

INSTALLER STATUS ********************************************************************************************
Install Ceph Monitor       : Complete (0:02:14)
Install Ceph Manager       : Complete (0:02:22)
Install Ceph OSD           : Complete (0:03:19)
Install Ceph MDS           : Complete (0:00:45)
Install Ceph Client        : Complete (0:00:09)
Install Ceph Grafana       : In Progress (0:05:25)
        This phase can be restarted by running: roles/ceph-grafana/tasks/main.yml
Install Ceph Node Exporter : Complete (0:00:53)

Thursday 04 March 2021 12:58:37 -0500 (0:00:00.004) 0:19:10.927 ********
===============================================================================

Why doesn't shrink-mds.yml remove the pools? Can you recreate the FS w/ old pools? Otherwise what is the point in keeping them?

1) https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/4/html-single/file_system_guide/index#removing-a-ceph-file-system_fs

Not only did it reprovision the MDS servers, it created an FS. I don't remember it doing that the first time I set up CephFS. If it didn't and I had to manually create it last time, maybe the old configuration was saved somewhere and ceph-ansible went ahead and recreated it based on previous settings?
[root@cluster1-node2 ~]# ceph fs dump dumped fsmap epoch 2499 e2499 enable_multiple, ever_enabled_multiple: 0,0 compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2} legacy client fscid: 3 Filesystem 'cephfs' (3) fs_name cephfs epoch 2498 flags 12 created 2021-03-04 12:51:54.792670 modified 2021-03-04 12:52:10.404081 tableserver 0 root 0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 min_compat_client -1 (unspecified) last_failure 0 last_failure_osd_epoch 0 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2} max_mds 1 in 0 up {0=624306} failed damaged stopped data_pools [18] metadata_pool 19 inline_data disabled balancer standby_count_wanted 1 [mds.cluster1-node6{0:624306} state up:active seq 3 addr [v2:192.168.0.36:6808/825808135,v1:192.168.0.36:6809/825808135]] Standby daemons: [mds.cluster1-node5{-1:644348} state up:standby seq 2 addr [v2:192.168.0.35:6808/1480844303,v1:192.168.0.35:6809/1480844303]] [root@client1 ~]# mount -t ceph :/ /mnt/cephfs/ -o name=1 [root@client1 ~]# df Filesystem 1K-blocks Used Available Use% Mounted on devtmpfs 900552 0 900552 0% /dev tmpfs 917512 84 917428 1% /dev/shm tmpfs 917512 18048 899464 2% /run tmpfs 917512 0 917512 0% /sys/fs/cgroup /dev/mapper/rhel-root 13092864 2768652 10324212 22% / /dev/nvme0n1p1 1038336 331580 706756 32% /boot shm 64000 0 64000 0% /var/lib/containers/storage/overlay-containers/f558efc6f3f8714ee7e2c89547a94238eb2c3e0bda8444fdcaefd730328f43ce/userdata/shm overlay 13092864 2768652 10324212 22% /var/lib/containers/storage/overlay/dc694a6f437c1be6e0b8de4d8802ca2126ac6c898ae2f0f13dd16b2ee6f454d4/merged tmpfs 183500 0 183500 0% /run/user/0 192.168.0.32:6789,192.168.0.33:6789,192.168.0.34:6789:/ 15757312 0 15757312 0% /mnt/cephfs [root@client1 ~]# > Not only did it reprovision the MDS servers, it created an FS. I don't remember it doing that the first time I set up CephFS.
I just tested reprovisioning MDS nodes on a new cluster and ceph-ansible does automatically create the FS during that process.
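Putting the observations above together, the sequence that worked in this test looks roughly like the following. This is only a sketch based on the runs above, not the documented procedure: the host names, file system name, and pool names are the ones from this test cluster and will differ elsewhere.

# From the ceph-ansible directory, shrink one MDS at a time and answer "yes"
# at the "Are you sure you want to shrink the cluster?" prompt.
[admin@cluster1-node1 ceph-ansible]$ ansible-playbook infrastructure-playbooks/shrink-mds.yml -e mds_to_kill=cluster1-node5 -i hosts
[admin@cluster1-node1 ceph-ansible]$ ansible-playbook infrastructure-playbooks/shrink-mds.yml -e mds_to_kill=cluster1-node6 -i hosts

# Removing the last MDS also removed the file system, but not its pools.
# If the MDS nodes will later be redeployed by site.yml, delete the old pools
# first, otherwise the "fs new" step fails with EINVAL as shown above.
[root@cluster1-node2 ~]# ceph osd pool delete cephfs_metadata cephfs_metadata --yes-i-really-really-mean-it
[root@cluster1-node2 ~]# ceph osd pool delete cephfs_data cephfs_data --yes-i-really-really-mean-it

# Finally, remove the retired nodes from the [mdss] group in the inventory file
# by hand; the playbook does not do this, so a rerun of site.yml would
# otherwise provision them again.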
Hi,

In steps 1 and 6 of the procedure in section 4.11, the location of the hosts file should be "/usr/share/ceph-ansible/hosts" instead of /etc/ansible/hosts.

In step 3, the command line needs the hosts inventory parameter added, which would look something like:

ansible-playbook infrastructure-playbooks/shrink-mds.yml -e mds_to_kill=MDS_NODE -i hosts

And IMHO we could swap steps 1 and 2. That way we navigate to the "/usr/share/ceph-ansible/" directory in step 1 and open the hosts file in step 2.

Hi,

Step 4, "Optional: Repeat the process for any additional MDS nodes", should also have the "-i hosts" parameter.

There is also an additional step needed before running the playbook: set max_mds to 1 (if not already set), otherwise the playbook will fail.

# ceph fs set fs_name max_mds 1

And in step 7, the first step should be removing the file system, as Ansible will only remove the MDS.

# ceph fs rm fs_name --yes-i-really-mean-it

And step 5 should be removed, as the file system will still exist. Instead we can add "ceph fs status" with output showing the failed file system with no MDSs:

# ceph fs status
cephfs - 0 clients
==========
+------+--------+-----+----------+-----+------+
| Rank | State  | MDS | Activity | dns | inos |
+------+--------+-----+----------+-----+------+
|  0   | failed |     |          |     |      |
+------+--------+-----+----------+-----+------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  627M |  108G |
|    data_pool    |   data   |  576k |  108G |
+-----------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
+-------------+
+---------+---------+
| version | daemons |
+---------+---------+
+---------+---------+

LGTM
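For reference, folding the corrections above into the documented procedure would give something roughly like the sketch below. FS_NAME and MDS_NODE are placeholders, the admin and monitor host names are hypothetical, the inventory path assumes the default /usr/share/ceph-ansible location mentioned above, and whether the explicit "ceph fs rm" is still needed may depend on whether shrink-mds.yml already deleted the file system when the last MDS was removed, as it did in the earlier test run.

# 1. Change to the ceph-ansible directory and review the [mdss] group in the inventory file
[admin@admin-node ~]$ cd /usr/share/ceph-ansible
[admin@admin-node ceph-ansible]$ vi hosts

# 2. Make sure only one active MDS is configured, otherwise the playbook fails
[root@mon-node ~]# ceph fs set FS_NAME max_mds 1

# 3. Run the shrink playbook with the inventory file, once per MDS node to remove
[admin@admin-node ceph-ansible]$ ansible-playbook infrastructure-playbooks/shrink-mds.yml -e mds_to_kill=MDS_NODE -i hosts

# 4. Remove the file system itself if it still exists; the playbook only removes the MDS daemons
[root@mon-node ~]# ceph fs rm FS_NAME --yes-i-really-mean-it

# 5. Verify that no MDS daemons or file systems remain
[root@mon-node ~]# ceph fs status
[root@mon-node ~]# ceph mds stat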