Bug 1337915
| Summary: | purge-cluster.yml confused by presence of ceph installer, ceph kernel threads | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Ben England <bengland> |
| Component: | Ceph-Ansible | Assignee: | Guillaume Abrioux <gabrioux> |
| Status: | CLOSED ERRATA | QA Contact: | Vasishta <vashastr> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 3.0 | CC: | adeza, agunn, anharris, aschoen, bengland, ceph-eng-bugs, ddharwar, edonnell, flucifre, gabrioux, kdreyer, nthomas, rraja, sankarshan, seb, shan, tserlin |
| Target Milestone: | rc | | |
| Target Release: | 3.3 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | RHEL: ceph-ansible-3.2.18-1.el7cp; Ubuntu: ceph-ansible_3.2.18-2redhat1 | Doc Type: | Bug Fix |
| Doc Text: | The `purge-cluster.yml` playbook no longer causes issues with redeploying a cluster. Previously, the `purge-cluster.yml` Ansible playbook did not clean up all Red Hat Ceph Storage kernel threads as it should and could leave CephFS mount points mounted and Ceph Block Devices mapped, which could prevent redeploying a cluster. With this update, the `purge-cluster.yml` Ansible playbook cleans up all Ceph kernel threads, unmounts all Ceph-related mount points on client nodes, and unmaps Ceph Block Devices so that the cluster can be redeployed. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-08-21 15:10:24 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1726135 | | |
Description
Ben England
2016-05-20 12:59:45 UTC
Ramana, please take Ben's suggestion and send a PR for this.

Hi Ramana, did you get a chance to take a look at this BZ? (see C5 comment from Sebastien).

(In reply to Guillaume Abrioux from comment #6)
> Hi Ramana,
>
> did you get a chance to take a look at this BZ ? (see C5 comment from
> Sebastien).

Yes, I did look at the BZ before, but I haven't started working on it. I am OK with someone else taking up this BZ if it is urgent.

I can no longer reproduce this issue. `purge-cluster.yml` completed on all hosts, including the Ceph client machine where a Ceph filesystem was mounted and unmounted using the kernel client.
The issue (back in 2016/05/20) seems to have been caused by the "check for anything running ceph" task in purge-cluster.yml working incorrectly (the pattern-based `ps | grep` check matches any process whose command line merely contains "ceph-", so it is prone to false positives). This particular task has undergone quite a few changes since then and now appears to work correctly.

When this BZ was filed, the task was:
```yaml
- name: check for anything running ceph
  shell: "ps awux | grep -v grep | grep -q -- ceph-"
  register: check_for_running_ceph
  failed_when: check_for_running_ceph.rc == 0
```
introduced by commit 90fd2c70
Now the task is:
```yaml
- name: check for anything running ceph
  command: "ps -u ceph -U ceph"
  register: check_for_running_ceph
  failed_when: check_for_running_ceph.rc == 0
```
introduced by commit 5a3f95dfc
Also, I don't think we want to remove the libceph and rbd kernel modules, as ceph-ansible doesn't explicitly install them; the modules come bundled with the distribution kernel.
Ben, are you still hitting this issue with RHCS 3.2? If not, we can close this BZ.
I haven't tried running purge-cluster or purge-docker-cluster playbooks this way lately. So as far as I know it's still an issue.

So it's true that ceph-ansible doesn't install the libceph and rbd kernel modules, and we really don't have to remove the modules, but we *do* have to unmap all RBD devices and unmount all CephFS mountpoints before we install a new version or configuration of Ceph, right? Otherwise these devices and mountpoints will have nothing supporting them and become difficult to clean up or re-initialize (it may even require a reboot). The "modprobe -rv" was put there only because it would fail if someone left /dev/rbd* devices or CephFS mountpoints active on any host in the inventory.

But in hindsight, I see that modprobe -rv is not sufficient to solve this problem, because of the way that Ansible works: it will fail on that one host, stopping purge-cluster on that host, but other hosts will continue, so the cluster will still be mostly taken down and the left-over RBD devices and mountpoints will still be problematic. Can we change the modprobe -rv task so that it is done first, and if *any* host fails to do "modprobe -rv rbd libceph", the whole playbook stops right there? If not, the playbook should explicitly remove the /dev/rbd* devices and CephFS mountpoints, with "rbd unmap" and "umount" commands, and do that at the very beginning of the purge-cluster run, before any other steps have been taken.

RGW does not have this problem because the clients are using S3 or Swift, which are HTTP-based protocols, and there are no client resources that need to be removed in order to disconnect the clients from the servers.

(In reply to Ben England from comment #9)
> I haven't tried running purge-cluster or purge-docker-cluster playbooks this
> way lately. So as far as I know it's still an issue.

Oh OK. So you're also concerned about a 'clean' purge of client machines with Ceph filesystems actively mounted via the cephfs kernel client, or devices mapped via the rbd kernel client. I didn't get this while reading the bug description. Thanks for the explanation.

> So it's true that ceph-ansible doesn't install libceph and rbd kernel
> module, and we really don't have to remove the modules, but we *do* have to
> unmap all RBD devices and unmount all Cephfs mountpoints before we install a
> new version or configuration of Ceph, right? Otherwise these devices and
> mountpoints will have nothing supporting them and become difficult to clean
> up or re-initialize (may even require a reboot). The "modprobe -rv" was put
> there only because it would fail if someone left /dev/rbd* or Cephfs
> mountpoints active on any host in the inventory.
>
> But in hindsight, I see that modprobe -rv is not sufficient to solve this
> problem, because of the way that ansible works - it will fail on that one
> host, stopping purge-cluster on that host, but other hosts will continue,
> and so the cluster will still be mostly taken down and the left-over RBD
> devices and mountpoints will still be problematic. Can we change the
> modprobe -rv task so that it is done first, and if *any* host fails to do
> modprobe -rv rbd libceph, the whole playbook stops right there?

Yes, I think we can do this by first checking on the client hosts whether the rbd or ceph kernel modules are in use. If a kernel module is in use, then fail on all hosts using the `any_errors_fatal: true` setting (https://docs.ansible.com/ansible/latest/user_guide/playbooks_error_handling.html#aborting-the-play).
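For illustration, here is a minimal sketch of what such a fail-fast check could look like. This is not taken from ceph-ansible; the play name, the `clients` group name, and the detection commands are assumptions for the sketch, and it only covers the kernel clients discussed here:

```yaml
# Hypothetical fail-fast play, run before any purge tasks.
# any_errors_fatal makes a failure on any single host abort the play for all hosts.
- name: refuse to purge while Ceph kernel clients are active
  hosts: clients
  become: true
  any_errors_fatal: true
  tasks:
    - name: look for active Ceph kernel clients
      # Kernel CephFS mounts appear with filesystem type "ceph" in /proc/mounts;
      # mapped RBD devices appear under /sys/bus/rbd/devices.
      shell: |
        grep -qw ceph /proc/mounts && exit 0
        ls /sys/bus/rbd/devices/* >/dev/null 2>&1 && exit 0
        exit 1
      register: kernel_clients
      changed_when: false
      failed_when: false

    - name: abort the purge everywhere if a kernel client is still in use
      fail:
        msg: >-
          Ceph kernel clients are still active on {{ inventory_hostname }};
          unmount CephFS and unmap RBD devices before purging.
      when: kernel_clients.rc == 0
```

Whether to fail outright like this or to go ahead and unmount/unmap at the same early point is the design choice discussed in the rest of the thread.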
> If not, the playbook should explicitly remove the /dev/rbd* and Cephfs
> mountpoints, with "rbd unmap" and "umount" commands, and do that at the very
> beginning of the purge-cluster run, before any other steps have been taken.

Yes, we can do this too, using the following commands:

```
umount -a -t ceph    # unmount Ceph filesystems
rbdmap unmap-all     # unmap all RBD devices
```

> RGW does not have this problem because the clients are using S3 or Swift,
> which are HTTP-based protocols, and there are no client resources that need
> to be removed in order to disconnect the clients from the servers.

Here we're worried about only kernel clients. Would there be similar "difficult to clean up or re-initialize (may even require a reboot)" issues with leftover RBD devices and CephFS mounts that use userspace Ceph clients, e.g. rbd-nbd/librbd, ceph-fuse/libcephfs?

I'll defer to you on the implementation method; all of the above sounds good, and I didn't know about the rbdmap unmap-all command. What about iSCSI? Is there a way to ask the iSCSI daemon whether it's serving any remote clients before it is shut down? How would NFS client mountpoints be handled if you uninstall Ceph while the Ganesha Ceph FSAL is serving NFS clients? Is there a way to detect that an NFS Ganesha process is actively serving an NFS client mountpoint before you shut it down? Don't forget about FUSE CephFS mounts - some sites use both FUSE and kernel CephFS. Is nbd still in the picture? I thought they decided not to implement it. If nbd is being used, that's a kernel block device, so there should be some way to detect that it's attached to a Ceph daemon?

rr> Here we're worried about only kernel clients. Would there be similar
rr> "difficult to clean up or re-initialize (may even require a reboot)" issues
rr> with leftover RBD devices and CephFS mounts that use userspace Ceph clients,
rr> e.g. rbd-nbd/librbd, ceph-fuse/libcephfs?

If you rip Ceph out from underneath a process that's using libcephfs or the like, you may have to restart the client process, but you won't have to reboot the host. And the client's TCP socket should get a disconnection event that should shut it down.

Thanks for looking at this. It's strange, but cluster teardown is important functionality because sometimes you need to re-do an install (it's hard to get an install done right the first time), and purge*cluster.yml is the only practical way to achieve that.

Updating the QA Contact to Hemant. Hemant will be rerouting them to the appropriate QE Associate. Regards, Giri

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2538