Bug 2153827 - `ceph orch host drain` should not always delete contents of /etc/ceph/
Summary: `ceph orch host drain` should not always delete contents of /etc/ceph/
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 5.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 7.0
Assignee: Adam King
QA Contact: Aditya Ramteke
Docs Contact: Rivka Pollack
URL:
Whiteboard:
Depends On:
Blocks: 2237662
 
Reported: 2022-12-15 15:10 UTC by John Fulton
Modified: 2023-12-13 15:19 UTC
CC List: 6 users

Fixed In Version: ceph-18.2.0-4.el9cp
Doc Type: Enhancement
Doc Text:
.Users can now drain a host of daemons without draining the client `conf` or `keyring` files
With this enhancement, users can drain a host of daemons without also draining the client `conf` or `keyring` files deployed on the host, by passing the `--keep-conf-keyring` flag to the `ceph orch host drain` command. Users can now mark a host to have all daemons drained or not placed there, while Cephadm still manages the `conf` and `keyring` files on the host.
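A minimal usage sketch of the flag described above; the host name `host01` is a placeholder:

    # drain all daemons from host01 but keep the cephadm-deployed
    # conf and keyring files under /etc/ceph/
    ceph orch host drain host01 --keep-conf-keyring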
Clone Of:
Environment:
Last Closed: 2023-12-13 15:19:47 UTC
Embargoed:




Links
Red Hat Issue Tracker RHCEPH-5805 (Last Updated: 2022-12-15 15:17:43 UTC)
Red Hat Product Errata RHBA-2023:7780 (Last Updated: 2023-12-13 15:19:51 UTC)

Description John Fulton 2022-12-15 15:10:43 UTC
There are cases where someone might want to run `ceph orch host drain` without having the contents of /etc/ceph/ deleted. This is similar to the following PR, but it would be nice to have a flag so the behavior is user controllable, since that PR does not address a use case that occurs in Red Hat OpenStack 17.

  https://github.com/ceph/ceph/pull/45174

In Red Hat OpenStack 17 we have a controller replacement procedure which involves draining the Ceph mon/mgr/rgw/mds daemons from a node before we shut the node down and replace it with a new node. We automate the testing of that procedure with this Ansible playbook:

https://review.gerrithub.io/c/rhos-infra/cloud-config/+/547322/1/post_tasks/roles/replace-controller/tasks/remove_ceph_monitor.yml#160

Note the steps we have to take to avoid losing the keyring and conf:

      - name: drain ceph daemons on host being removed
        shell: |
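            # back up the conf and keyring that the drain would otherwise delete from /etc/ceph/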
            cp /etc/ceph/ceph.conf /home/tripleo-admin
            cp /etc/ceph/ceph.client.admin.keyring /home/tripleo-admin
            cephadm shell ceph orch host drain {{ install.controller.to.remove }}
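            # restore the files so later cephadm commands on this node keep working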
            cp /home/tripleo-admin/ceph.conf /etc/ceph
            cp /home/tripleo-admin/ceph.client.admin.keyring /etc/ceph
            rm /home/tripleo-admin/ceph.conf
            rm /home/tripleo-admin/ceph.client.admin.keyring
        delegate_to: "{{ install.controller.to.remove }}"
        when:
            - rc_controller_is_reachable
            - '"No daemons reported" not in ceph_daemon_status_predrain.stdout'

If we don't do the above workaround to back up the ceph conf and keyring and restore them, then we hit an error in our procedure because we can't run any more cephadm commands on that node.

"When doing the procedure manually I went into cephadm shell and then executed the host drain command. Worked with no problem. When doing the automation it was all one step i.e. automation executes the command: cephadm shell ceph orch host drain. After that command is executed all ceph commands failed with error: ObjectNotFound('RADOS object not found. After looking at it for a long time I saw the /etc/ceph/ceph.conf and /etc/ceph/ceph.client.admin.keyring were being deleted. After that I saved them, let them be deleted, and then replaced them. The automation worked after that. If you know of a way to do the ceph drain without it deleting those files then I would do it. I could not find a way."

This bug requests that the above workaround not be necessary.
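
For illustration, a sketch of how the task could look with the `--keep-conf-keyring` flag described in the Doc Text above: the backup and restore steps are no longer needed, and the variables and conditions are carried over from the original task.

      - name: drain ceph daemons on host being removed
        shell: |
            # `--` makes cephadm pass the flag through to the ceph command;
            # --keep-conf-keyring preserves the conf and keyring under /etc/ceph/
            cephadm shell -- ceph orch host drain {{ install.controller.to.remove }} --keep-conf-keyring
        delegate_to: "{{ install.controller.to.remove }}"
        when:
            - rc_controller_is_reachable
            - '"No daemons reported" not in ceph_daemon_status_predrain.stdout'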

Comment 1 RHEL Program Management 2022-12-15 15:10:55 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 John Fulton 2022-12-16 13:59:17 UTC
(In reply to John Fulton from comment #0)
> If we don't do the above workaround to back up the ceph conf and keyring and
> restore them then we have an error in our procedure because we can't run
> anymore cephadm commands on that node. 
...
> This bug requests that the above workaround not be necessary.

Update: we avoid having to back up and restore the files by executing all subsequent cephadm commands on a different node; i.e., the drain is the last cephadm command on that node.

Regardless, it would be nice to have an option to not delete /etc/ceph (deleting files in /etc after removing a daemon violates the principle of least surprise IMHO).
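
For example (a sketch with placeholder hostnames `controller-1` and `controller-2`): the drain is the last cephadm command issued on the node being removed, and any subsequent orchestrator commands are run from a node that still has its /etc/ceph/ files:

    # on controller-1 (the node being removed), the last cephadm command run there
    cephadm shell -- ceph orch host drain controller-1
    # subsequent commands, e.g. removing the host, are run from controller-2,
    # which still has its /etc/ceph/ conf and keyring
    cephadm shell -- ceph orch host rm controller-1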

Comment 13 errata-xmlrpc 2023-12-13 15:19:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7780

