Description of problem:

When Ceph is deployed in a tripleo-ipa context, stray hosts warning shows up in the cluster for RGW.

[WRN] CEPHADM_STRAY_HOST: 3 stray host(s) with 3 daemon(s) not managed by cephadm
    stray host controller-0.redhat.local has 1 stray daemons: ['rgw.rgw.controller-0.cfqldy']
    stray host controller-1.redhat.local has 1 stray daemons: ['rgw.rgw.controller-1.qmexqp']
    stray host controller-2.redhat.local has 1 stray daemons: ['rgw.rgw.controller-2.gbtzli']

If a cephadm is performed, it's extended to all the daemons.

HEALTH_WARN 6 stray host(s) with 23 daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_HOST: 6 stray host(s) with 23 daemon(s) not managed by cephadm
    stray host ceph-0.redhat.local has 5 stray daemons: ['osd.0', 'osd.1', 'osd.2', 'osd.3', 'osd.4']
    stray host ceph-1.redhat.local has 5 stray daemons: ['osd.11', 'osd.13', 'osd.5', 'osd.7', 'osd.9']
    stray host ceph-2.redhat.local has 5 stray daemons: ['osd.10', 'osd.12', 'osd.14', 'osd.6', 'osd.8']
    stray host controller-0.redhat.local has 3 stray daemons: ['mgr.controller-0.sxxchf', 'mon.controller-0', 'rgw.rgw.controller-0.cfqldy']
    stray host controller-1.redhat.local has 3 stray daemons: ['mgr.controller-1.dcxuoe', 'mon.controller-1', 'rgw.rgw.controller-1.qmexqp']
    stray host controller-2.redhat.local has 2 stray daemons: ['mgr.controller-2.fuxajb', 'rgw.rgw.controller-2.gbtzli']
[ceph: root@controller-0 /]#

The problem is related to the fact that ceph is deployed using short names, and even though we're able to pass CephSpecFqdn: true to the overcloud deployment command, the bootstrap process doesn't use the same approach.

We should have an option `--tld` that should be passed to deployed ceph to properly build the spec and enroll the right hosts when the cluster is created, e.g.

openstack overcloud ceph spec --tld example.com

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
The new option should be available to both commands:

`openstack overcloud ceph spec --tld redhat.com`
`openstack overcloude deploy deploy --tld redhat.com`

It injects the Top Level Domain passed by the user into the generated Ceph spec. It should be a new parameter to this module and the behavior of the fqdn boolean parameter will need to be adjusted.

https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/ansible_plugins/modules/ceph_spec_bootstrap.py#L66-L69

That should include this check:

https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/ansible_plugins/modules/ceph_spec_bootstrap.py#L545-L549

It could be updated this way:

# 5. fqdn is only supported for the inventory method or if TLD is provided

The original reason why we had this check was because the output of metalsmith alone does not contain the TLD. However, if the user provides it, then there's no reason we can't just append what they provide.
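A minimal sketch of how the relaxed check and the TLD append could look. This is not the code in ceph_spec_bootstrap.py: the function names `validate_fqdn_option` and `apply_tld` and the method label "inventory" are hypothetical, chosen only to illustrate the rule in the proposed comment.

```python
def validate_fqdn_option(method, fqdn, tld):
    """Return a list of error strings; empty means the options are valid.

    method: hypothetical label for how hosts were discovered,
            e.g. "inventory" or "metalsmith" (assumed names).
    fqdn:   bool, whether FQDNs were requested in the spec.
    tld:    str or None, the domain supplied by the user via --tld.
    """
    errors = []
    # 5. fqdn is only supported for the inventory method or if TLD is provided
    if fqdn and method != "inventory" and not tld:
        errors.append(
            "fqdn is only supported for the inventory method "
            "or if a TLD is provided"
        )
    return errors


def apply_tld(hostname, tld):
    """Append the user-provided domain to a short hostname.

    Leaves the hostname unchanged when no TLD was given or when the
    hostname already ends with that domain.
    """
    if not tld:
        return hostname
    tld = tld.strip(".")
    if hostname.endswith("." + tld):
        return hostname
    return hostname + "." + tld
```

With this shape, metalsmith output (short names only) plus a user-supplied domain passes validation, while requesting fqdn from metalsmith output without a domain is still rejected.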
Initial testing was done simulating the --tld method and it is necessary but not sufficient to solve the issue. Two more changes will be needed for this bug.

Mikolaj simulated having the --tld option by doing the following:

1. `openstack overcloud ceph spec`
2. add the TLD to the spec using sed or similar
3. `openstack overcloud ceph deploy --ceph-spec ... `

He then had an Ansible error that the host was unreachable:

2023-02-27 08:32:36.139561 | 52540004-dd75-d9e7-acca-00000000002b | UNREACHABLE | Install cephadm package | controller-0.redhat.local

because the following code needs to be changed:

https://opendev.org/openstack/tripleo-ansible/src/branch/master/tripleo_ansible/playbooks/cli-deployed-ceph.yaml#L96

He then manually changed that code and cephadm failed here:

orchestrator._interface.OrchestratorError: Host controller-0.redhat.local (192.168.24.41) failed check(s): ['hostname "controller-0" does not match expected hostname "controller-0.redhat.local"']

From:

https://github.com/ceph/ceph/blob/v16.2.8/src/cephadm/cephadm#L6226-L6232

However, the above will be addressed by adding `--skip-prepare-host` to the bootstrap.yaml tasks file, which is proposed here:

https://review.opendev.org/c/openstack/tripleo-ansible/+/874630/1/tripleo_ansible/roles/tripleo_cephadm/tasks/bootstrap.yaml#51
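For anyone repeating step 2 of that simulation without sed, here is a rough sketch in Python. It assumes the short hostnames are known up front; `add_tld_to_spec` is a hypothetical helper, and `SAMPLE_SPEC` is an illustrative fragment written for this example, not actual output of `openstack overcloud ceph spec`.

```python
import re

# Illustrative spec fragment (assumed layout, not real command output).
SAMPLE_SPEC = """\
---
service_type: host
addr: 192.168.24.41
hostname: controller-0
labels:
- mon
---
service_type: mon
placement:
  hosts:
  - controller-0
"""


def add_tld_to_spec(spec_text, short_names, tld):
    """Append "."+tld to every standalone occurrence of each short hostname.

    A name counts as standalone when it is not directly preceded or
    followed by a word character, a dot, or a hyphen. That keeps names
    that are already fully qualified untouched, so running the function
    twice gives the same result as running it once.
    """
    tld = tld.strip(".")
    out = spec_text
    # Longest names first, purely as a precaution against overlapping names.
    for name in sorted(short_names, key=len, reverse=True):
        fqdn = name + "." + tld
        pattern = r"(?<![\w.-])" + re.escape(name) + r"(?![\w.-])"
        # Use a function as the replacement so fqdn is inserted literally.
        out = re.sub(pattern, lambda match, fqdn=fqdn: fqdn, out)
    return out


if __name__ == "__main__":
    print(add_tld_to_spec(SAMPLE_SPEC, ["controller-0"], "redhat.local"))
```

This only rewrites the spec. As described above, the playbook host reference and the cephadm hostname check still need their own changes.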
(In reply to John Fulton from comment #1)
> The new option should be available to both commands:
>
> `openstack overcloud ceph spec --tld redhat.com`
> `openstack overcloude deploy deploy --tld redhat.com`

I typed the last command wrong. It should be:

`openstack overcloud ceph deploy --tld redhat.com`

The idea is that the customer shouldn't always have to generate and pass the spec using --ceph-spec. Instead we can pass the --tld through to the spec generation code.
Hi Jenny-Anne, updated doc text looks good to me.
Updated doc text looks good.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2023:4577