Bug 2153827 - `ceph orch host drain` should not always delete contents of /etc/ceph/
Summary: `ceph orch host drain` should not always delete contents of /etc/ceph/
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 5.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 7.0
Assignee: Adam King
QA Contact: Aditya Ramteke
Docs Contact: Rivka Pollack
URL:
Whiteboard:
Depends On:
Blocks: 2237662
 
Reported: 2022-12-15 15:10 UTC by John Fulton
Modified: 2023-12-13 15:19 UTC
CC List: 6 users

Fixed In Version: ceph-18.2.0-4.el9cp
Doc Type: Enhancement
Doc Text:
.Users can now drain a host of daemons without draining the client `conf` or `keyring` files
With this enhancement, users can drain a host of daemons without also draining the client `conf` or `keyring` files deployed on the host, by passing the `--keep-conf-keyring` flag to the `ceph orch host drain` command. Users can now mark a host to have all daemons drained or not placed there, while Cephadm still manages the `conf` and `keyring` files on the host.
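A minimal usage sketch of the flag described above; the host name `host01` is a placeholder:

    # drain all daemons from host01 but keep the cephadm-deployed
    # conf and keyring files under /etc/ceph/
    ceph orch host drain host01 --keep-conf-keyring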
Clone Of:
Environment:
Last Closed: 2023-12-13 15:19:47 UTC
Embargoed:




Links
Red Hat Issue Tracker RHCEPH-5805 (Last Updated: 2022-12-15 15:17:43 UTC)
Red Hat Product Errata RHBA-2023:7780 (Last Updated: 2023-12-13 15:19:51 UTC)

Description John Fulton 2022-12-15 15:10:43 UTC
There are cases where someone might want to run `ceph orch host drain` without having the contents of /etc/ceph/ deleted. This is similar to the following PR, but it would be nice to have a flag so the behavior is user controllable, since that PR does not address a use case that occurs in Red Hat OpenStack 17.

  https://github.com/ceph/ceph/pull/45174

In Red Hat OpenStack 17 we have a controller replacement procedure which involves draining the Ceph mon/mgr/rgw/mds daemons from a node before we shut the node down and replace it with a new node. We automate the testing of that procedure with this Ansible playbook:

https://review.gerrithub.io/c/rhos-infra/cloud-config/+/547322/1/post_tasks/roles/replace-controller/tasks/remove_ceph_monitor.yml#160

Note the steps we have to take to avoid losing the keyring and conf:

      - name: drain ceph daemons on host being removed
        shell: |
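            # back up the conf and keyring that the drain would otherwise delete from /etc/ceph/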
            cp /etc/ceph/ceph.conf /home/tripleo-admin
            cp /etc/ceph/ceph.client.admin.keyring /home/tripleo-admin
            cephadm shell ceph orch host drain {{ install.controller.to.remove }}
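            # restore the files so later cephadm commands on this node keep working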
            cp /home/tripleo-admin/ceph.conf /etc/ceph
            cp /home/tripleo-admin/ceph.client.admin.keyring /etc/ceph
            rm /home/tripleo-admin/ceph.conf
            rm /home/tripleo-admin/ceph.client.admin.keyring
        delegate_to: "{{ install.controller.to.remove }}"
        when:
            - rc_controller_is_reachable
            - '"No daemons reported" not in ceph_daemon_status_predrain.stdout'

If we don't do the above workaround to back up the ceph conf and keyring and restore them, then we hit an error in our procedure because we can't run any more cephadm commands on that node.

"When doing the procedure manually I went into cephadm shell and then executed the host drain command. Worked with no problem. When doing the automation it was all one step i.e. automation executes the command: cephadm shell ceph orch host drain. After that command is executed all ceph commands failed with error: ObjectNotFound('RADOS object not found. After looking at it for a long time I saw the /etc/ceph/ceph.conf and /etc/ceph/ceph.client.admin.keyring were being deleted. After that I saved them, let them be deleted, and then replaced them. The automation worked after that. If you know of a way to do the ceph drain without it deleting those files then I would do it. I could not find a way."

This bug requests that the above workaround not be necessary.
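
For illustration, a sketch of how the task could look with the `--keep-conf-keyring` flag described in the Doc Text above: the backup and restore steps are no longer needed, and the variables and conditions are carried over from the original task.

      - name: drain ceph daemons on host being removed
        shell: |
            # `--` makes cephadm pass the flag through to the ceph command;
            # --keep-conf-keyring preserves the conf and keyring under /etc/ceph/
            cephadm shell -- ceph orch host drain {{ install.controller.to.remove }} --keep-conf-keyring
        delegate_to: "{{ install.controller.to.remove }}"
        when:
            - rc_controller_is_reachable
            - '"No daemons reported" not in ceph_daemon_status_predrain.stdout'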

Comment 1 RHEL Program Management 2022-12-15 15:10:55 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 John Fulton 2022-12-16 13:59:17 UTC
(In reply to John Fulton from comment #0)
> If we don't do the above workaround to back up the ceph conf and keyring and
> restore them then we have an error in our procedure because we can't run
> anymore cephadm commands on that node. 
...
> This bug requests that the above workaround not be necessary.

Update: we avoid having to back up and restore the files by executing all subsequent cephadm commands on a different node; i.e., the drain is the last cephadm command on that node.

Regardless, it would be nice to have an option to not delete /etc/ceph (deleting files in /etc after removing a daemon violates the principle of least surprise IMHO).
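
For example (a sketch with placeholder hostnames `controller-1` and `controller-2`): the drain is the last cephadm command issued on the node being removed, and any subsequent orchestrator commands are run from a node that still has its /etc/ceph/ files:

    # on controller-1 (the node being removed), the last cephadm command run there
    cephadm shell -- ceph orch host drain controller-1
    # subsequent commands, e.g. removing the host, are run from controller-2,
    # which still has its /etc/ceph/ conf and keyring
    cephadm shell -- ceph orch host rm controller-1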

Comment 13 errata-xmlrpc 2023-12-13 15:19:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7780

