Bug 2173101 - ceph stray_host issue: fqdn vs short host names when deploy with freeipa / tls-everywhere [NEEDINFO]
Summary: ceph stray_host issue: fqdn vs short host names when deploy with freeipa / tl...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: beta
Target Release: 17.1
Assignee: Manoj Katari
QA Contact: Alfredo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-02-23 22:33 UTC by Francesco Pantano
Modified: 2023-08-16 01:14 UTC (History)

Fixed In Version: tripleo-ansible-3.3.1-1.20230416001818.c6edb35.el9ost , python-tripleoclient-16.5.1-1.20230416001030.78730a3.el9ost
Doc Type: Bug Fix
Doc Text:
Before this update, when users deployed Red Hat Ceph Storage in a tripleo-ipa context, a `stray hosts` warning showed in the cluster for the Ceph Object Gateway (RADOS Gateway [RGW]). With this update, during a Ceph Storage deployment, you can pass the option `--tld` in a tripleo-ipa context to use the correct hosts when you create the cluster.
Clone Of:
Environment:
Last Closed: 2023-08-16 01:13:59 UTC
Target Upstream Version:
Embargoed:
jelynch: needinfo?


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 879302 0 None MERGED Fix Ceph Stray host(s)/daemon(s) issue 2023-04-17 14:48:40 UTC
OpenStack gerrit 879303 0 None MERGED Fix Ceph Stray host(s)/daemon(s) issue 2023-04-17 14:48:40 UTC
OpenStack gerrit 879486 0 None MERGED Use short_hostnames when tls-everywhere is enabled 2023-04-17 18:13:15 UTC
Red Hat Issue Tracker OSP-22651 0 None None None 2023-02-23 22:35:36 UTC
Red Hat Product Errata RHEA-2023:4577 0 None None None 2023-08-16 01:14:23 UTC

Description Francesco Pantano 2023-02-23 22:33:56 UTC
Description of problem:

When Ceph is deployed in a tripleo-ipa context, a stray hosts warning shows up in the cluster for RGW:


[WRN] CEPHADM_STRAY_HOST: 3 stray host(s) with 3 daemon(s) not managed by cephadm
    stray host controller-0.redhat.local has 1 stray daemons: ['rgw.rgw.controller-0.cfqldy']
    stray host controller-1.redhat.local has 1 stray daemons: ['rgw.rgw.controller-1.qmexqp']
    stray host controller-2.redhat.local has 1 stray daemons: ['rgw.rgw.controller-2.gbtzli']

If a cephadm is performed, the warning extends to all the daemons:

HEALTH_WARN 6 stray host(s) with 23 daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_HOST: 6 stray host(s) with 23 daemon(s) not managed by cephadm
    stray host ceph-0.redhat.local has 5 stray daemons: ['osd.0', 'osd.1', 'osd.2', 'osd.3', 'osd.4']
    stray host ceph-1.redhat.local has 5 stray daemons: ['osd.11', 'osd.13', 'osd.5', 'osd.7', 'osd.9']
    stray host ceph-2.redhat.local has 5 stray daemons: ['osd.10', 'osd.12', 'osd.14', 'osd.6', 'osd.8']
    stray host controller-0.redhat.local has 3 stray daemons: ['mgr.controller-0.sxxchf', 'mon.controller-0', 'rgw.rgw.controller-0.cfqldy']
    stray host controller-1.redhat.local has 3 stray daemons: ['mgr.controller-1.dcxuoe', 'mon.controller-1', 'rgw.rgw.controller-1.qmexqp']
    stray host controller-2.redhat.local has 2 stray daemons: ['mgr.controller-2.fuxajb', 'rgw.rgw.controller-2.gbtzli']
[ceph: root@controller-0 /]#


The problem is that Ceph is deployed using short names: even though we're able to pass `CephSpecFqdn: true` to the overcloud deployment command, the bootstrap process doesn't use the same approach.
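
To illustrate why the name mismatch produces the warning, here is a minimal sketch in Python. It is a simplified model, not cephadm's actual code: it assumes a host counts as stray when a daemon reports a hostname that is not in the managed host inventory. The function name `find_stray_hosts` and the sample data layout are hypothetical; the host and daemon names come from the output above.

```python
def find_stray_hosts(managed_hosts, daemon_hosts):
    """Group daemons by hosts that are absent from the managed inventory.

    managed_hosts: iterable of hostnames enrolled in the cluster (assumed
                   to be short names, as created by the bootstrap).
    daemon_hosts:  mapping of daemon name -> hostname the daemon reports.
    """
    managed = set(managed_hosts)
    stray = {}
    for daemon, host in daemon_hosts.items():
        if host not in managed:
            stray.setdefault(host, []).append(daemon)
    return stray


# Hosts were enrolled with short names...
managed = ["controller-0", "controller-1", "controller-2"]
# ...but one daemon reports the FQDN, so the exact-match lookup fails.
daemons = {
    "rgw.rgw.controller-0.cfqldy": "controller-0.redhat.local",
    "mon.controller-0": "controller-0",
}
stray = find_stray_hosts(managed, daemons)
print(stray)
```

In this model the fix is to make both sides agree on one naming scheme, which is what the `--tld` proposal below is meant to achieve.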

We should have a `--tld` option that can be passed to deployed ceph so that it properly builds the spec and enrolls the right hosts when the cluster is created.

e.g.

openstack overcloud ceph spec --tld example.com



Comment 1 John Fulton 2023-02-27 12:25:14 UTC
The new option should be available to both commands:

`openstack overcloud ceph spec --tld redhat.com`
`openstack overcloude deploy deploy --tld redhat.com`

It injects the Top Level Domain passed by the user into the generated Ceph spec. It should be a new parameter to this module and the behavior of the fqdn boolean parameter will need to be adjusted.

  https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/ansible_plugins/modules/ceph_spec_bootstrap.py#L66-L69

That should include this check:

  https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/ansible_plugins/modules/ceph_spec_bootstrap.py#L545-L549

It could be updated this way:

    # 5. fqdn is only supported for the inventory method or if TLD is provided

The original reason for this check was that the output of metalsmith alone does not contain the TLD. However, if the user provides it, then there's no reason we can't just append what they provide.
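
A rough sketch of the adjusted behavior described above, in Python. This is not the code in `ceph_spec_bootstrap.py`; the function names `validate_fqdn_option` and `apply_tld`, their parameters, and the method strings are hypothetical and only illustrate the rule "fqdn is only supported for the inventory method or if TLD is provided".

```python
def validate_fqdn_option(fqdn, method, tld=None):
    """Reject fqdn=True unless the inventory method is used or a TLD is given.

    `method` is a hypothetical label for where host data comes from,
    e.g. "inventory" or "metalsmith".
    """
    if fqdn and method != "inventory" and not tld:
        raise ValueError(
            "fqdn is only supported for the inventory method "
            "or if TLD is provided"
        )


def apply_tld(hostname, tld=None):
    """Append the user-provided TLD to a short hostname.

    Returns the hostname unchanged when no TLD is given or when the
    hostname already ends with it, so the call is safe to repeat.
    """
    if not tld:
        return hostname
    tld = tld.lstrip(".")
    if hostname.endswith("." + tld):
        return hostname
    return f"{hostname}.{tld}"


# metalsmith output has no TLD, but the user supplied one, so this passes.
validate_fqdn_option(True, "metalsmith", tld="redhat.local")
print(apply_tld("controller-0", "redhat.local"))
```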

Comment 2 John Fulton 2023-02-27 12:30:55 UTC
Initial testing was done simulating the --tld method and it is necessary but not sufficient to solve the issue. Two more changes will be needed for this bug.

Mikolaj simulated having the --tld option by doing the following:

1. `openstack overcloud ceph spec`
2. add the TLD to the spec using sed or similar
3. `openstack overcloud ceph deploy --ceph-spec ... `
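
Step 2 could look something like the following Python sketch. It is only an illustration of the "sed or similar" edit, not what Mikolaj actually ran: the function name `add_tld_to_spec` is hypothetical, and the sample spec is an assumed minimal example rather than real output of `openstack overcloud ceph spec`.

```python
import re

# Matches "hostname: <name>" lines and "- <name>" list items.
HOST_LINE = re.compile(r"^(\s*(?:hostname:|-)\s+)(\S+)\s*$")


def add_tld_to_spec(spec_text, tld):
    """Append `tld` to every known hostname in a Ceph spec given as text."""
    # Only names declared on "hostname:" lines are rewritten, so other
    # list items (labels, for example) are left alone.
    known = set(
        re.findall(r"^\s*hostname:\s+(\S+)\s*$", spec_text, flags=re.M)
    )
    out = []
    for line in spec_text.splitlines():
        match = HOST_LINE.match(line)
        if match and match.group(2) in known:
            name = match.group(2)
            if not name.endswith("." + tld):
                line = f"{match.group(1)}{name}.{tld}"
        out.append(line)
    return "\n".join(out) + "\n"


sample_spec = """\
---
service_type: host
hostname: controller-0
addr: 192.168.24.41
---
service_type: mon
placement:
  hosts:
    - controller-0
"""
result = add_tld_to_spec(sample_spec, "redhat.local")
print(result)
```

Both the host entries and the placement lists need the same names, which is why the sketch rewrites list items as well as `hostname:` lines.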

He then hit an Ansible error reporting that the host was unreachable:

 2023-02-27 08:32:36.139561 | 52540004-dd75-d9e7-acca-00000000002b | UNREACHABLE | Install cephadm package | controller-0.redhat.local 

because the following code needs to be changed:

 https://opendev.org/openstack/tripleo-ansible/src/branch/master/tripleo_ansible/playbooks/cli-deployed-ceph.yaml#L96

He then manually changed that code and cephadm failed here:

  orchestrator._interface.OrchestratorError: Host controller-0.redhat.local (192.168.24.41) failed check(s): ['hostname "controller-0" does not match expected hostname "controller-0.redhat.local"']

From:

  https://github.com/ceph/ceph/blob/v16.2.8/src/cephadm/cephadm#L6226-L6232
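
For readers who don't want to open the link, the failure boils down to an exact string comparison between the hostname the node reports and the hostname the spec expects. The Python sketch below is a simplified model of that comparison, not the cephadm source; the function name `check_expected_hostname` is hypothetical, and the message format is copied from the error above.

```python
def check_expected_hostname(actual, expected):
    """Return a list of failed-check messages (empty when the names match)."""
    errors = []
    if actual != expected:
        errors.append(
            f'hostname "{actual}" does not match expected hostname "{expected}"'
        )
    return errors


# The node still reports its short name while the spec now carries the FQDN.
errors = check_expected_hostname("controller-0", "controller-0.redhat.local")
print(errors)
```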

However, the above will be addressed by adding `--skip-prepare-host` to the bootstrap.yaml tasks file which is proposed here:

  https://review.opendev.org/c/openstack/tripleo-ansible/+/874630/1/tripleo_ansible/roles/tripleo_cephadm/tasks/bootstrap.yaml#51

Comment 3 John Fulton 2023-02-27 12:34:11 UTC
(In reply to John Fulton from comment #1)
> The new option should be available to both commands:
> 
> `openstack overcloud ceph spec --tld redhat.com`
> `openstack overcloude deploy deploy --tld redhat.com`

I typed the last command wrong. It should be:

 `openstack overcloud ceph deploy --tld redhat.com`

The idea is that the customer shouldn't always have to generate and pass the spec using --ceph-spec. Instead we can pass the --tld through to the spec generation code.

Comment 18 Manoj Katari 2023-06-02 19:20:01 UTC
Hi Jenny-Anne,

updated doc text looks good to me.

Comment 23 Manoj Katari 2023-08-07 05:30:41 UTC
Updated doc text looks good.

Comment 28 errata-xmlrpc 2023-08-16 01:13:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.1 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2023:4577

