Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2226963

Summary: CephMon and CephMgr are wrongly deployed on DistributedComputeHCIScaleOut nodes
Product: Red Hat OpenStack
Reporter: yatanaka
Component: tripleo-ansible
Assignee: Manoj Katari <mkatari>
Status: CLOSED ERRATA
QA Contact: Alfredo <alfrgarc>
Severity: medium
Docs Contact:
Priority: medium
Version: 17.0 (Wallaby)
CC: dhill, eharney, fpantano, gbrinn, gfidente, johfulto, mkatari
Target Milestone: z2
Keywords: Triaged
Target Release: 17.1
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: tripleo-ansible-3.3.1-17.1.20230816000827.bd032f7.el9ost
Doc Type: Bug Fix
Doc Text:
Before this update, if a DCN site had 3 `DistributedComputeHCI` nodes and at least 1 `DistributedComputeHCIScaleOut` node, `cephadm` generated the incorrect spec. With this update, if a DCN site has a mix of `DistributedComputeHCI` and `DistributedComputeHCIScaleOut` nodes, `cephadm` generates the spec correctly.
Last Closed: 2024-01-16 14:30:02 UTC
Type: Bug

Description yatanaka 2023-07-27 07:18:48 UTC
Description of problem:

DistributedComputeHCIScaleOut should have only CephOSD, and neither CephMon nor CephMgr.

~~~
https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/roles/DistributedComputeHCIScaleOut.yaml#L29-L31
    - OS::TripleO::Services::CephClient
    - OS::TripleO::Services::CephExternal
    - OS::TripleO::Services::CephOSD
~~~

However, CephMon and CephMgr are deployed on DistributedComputeHCIScaleOut in my RHOSP 17.0.1 lab.

~~~
[root@dcn0-compute-0 ~]# podman ps |grep ceph
fa4985f495f8  undercloud.ctlplane.yatanaka.example.com:8787/rhceph/rhceph-5-rhel8@sha256:b25f6178c91483c5248f9794122f1f6731e42cbc8ddba8402c7a9e2911e0e874  -n client.crash.d...  3 hours ago   Up 3 hours ago                        ceph-d961401d-50e0-50ac-a40f-ef07cbc752a6-crash-dcn0-compute-0
30bcde9123aa  undercloud.ctlplane.yatanaka.example.com:8787/openshift4/ose-prometheus-node-exporter:v4.6                                                   --no-collector.ti...  3 hours ago   Up 3 hours ago                        ceph-d961401d-50e0-50ac-a40f-ef07cbc752a6-node-exporter-dcn0-compute-0
9be8e742b6b4  undercloud.ctlplane.yatanaka.example.com:8787/rhceph/rhceph-5-rhel8@sha256:b25f6178c91483c5248f9794122f1f6731e42cbc8ddba8402c7a9e2911e0e874  -n mon.dcn0-compu...  3 hours ago   Up 3 hours ago                        ceph-d961401d-50e0-50ac-a40f-ef07cbc752a6-mon-dcn0-compute-0
28b3bd3654dc  undercloud.ctlplane.yatanaka.example.com:8787/rhceph/rhceph-5-rhel8@sha256:b25f6178c91483c5248f9794122f1f6731e42cbc8ddba8402c7a9e2911e0e874  -n mgr.dcn0-compu...  3 hours ago   Up 3 hours ago                        ceph-d961401d-50e0-50ac-a40f-ef07cbc752a6-mgr-dcn0-compute-0-efkzri

(undercloud) [stack@undercloud ~]$ ssh heat-admin sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring /etc/ceph/dcn0.client.admin.keyring -- ceph orch ps
Inferring fsid d961401d-50e0-50ac-a40f-ef07cbc752a6
crash.dcn0-compute-0              dcn0-compute-0                             running (2h)     3m ago   2d    6627k        -  16.2.10-187.el8cp  72d512a15e58  fa4985f495f8 <=========(*)DistributedComputeHCIScaleOut
crash.dcn0-computehci-0           dcn0-computehci-0                          running (2h)     3m ago   2d    6627k        -  16.2.10-187.el8cp  72d512a15e58  97d594b3eb84  
crash.dcn0-computehci-1           dcn0-computehci-1                          running (2h)     3m ago   2d    6627k        -  16.2.10-187.el8cp  72d512a15e58  22e6d923190d  
crash.dcn0-computehci-2           dcn0-computehci-2                          running (2h)     3m ago   2d    6627k        -  16.2.10-187.el8cp  72d512a15e58  a499d142239f  
mgr.dcn0-compute-0.efkzri         dcn0-compute-0                             running (2h)     3m ago   2d     395M        -  16.2.10-187.el8cp  72d512a15e58  28b3bd3654dc <=========(*)DistributedComputeHCIScaleOut
mgr.dcn0-computehci-0.nlakit      dcn0-computehci-0  *:9283                  running (2h)     3m ago   2d     470M        -  16.2.10-187.el8cp  72d512a15e58  ea200fa686bb  
mgr.dcn0-computehci-1.rioxdc      dcn0-computehci-1                          running (2h)     3m ago   2d     403M        -  16.2.10-187.el8cp  72d512a15e58  f3a9b33ebb6c  
mgr.dcn0-computehci-2.xuhuhm      dcn0-computehci-2                          running (2h)     3m ago   2d     403M        -  16.2.10-187.el8cp  72d512a15e58  e74abfdf8bde  
mon.dcn0-compute-0                dcn0-compute-0                             running (2h)     3m ago   2d     128M    2048M  16.2.10-187.el8cp  72d512a15e58  9be8e742b6b4 <=========(*)DistributedComputeHCIScaleOut
mon.dcn0-computehci-0             dcn0-computehci-0                          running (2h)     3m ago   2d     273M    2048M  16.2.10-187.el8cp  72d512a15e58  25cedb06cdfd  
mon.dcn0-computehci-1             dcn0-computehci-1                          running (2h)     3m ago   2d     229M    2048M  16.2.10-187.el8cp  72d512a15e58  18ae06f39ea1  
mon.dcn0-computehci-2             dcn0-computehci-2                          running (2h)     3m ago   2d     187M    2048M  16.2.10-187.el8cp  72d512a15e58  5f86d45a7f0a  
node-exporter.dcn0-compute-0      dcn0-compute-0     172.16.1.100:9100       running (2h)     3m ago   2d    22.1M        -  1.0.1              c8af8d642c9a  30bcde9123aa <=========(*)DistributedComputeHCIScaleOut
node-exporter.dcn0-computehci-0   dcn0-computehci-0  172.16.1.34:9100        running (2h)     3m ago   2d    23.0M        -  1.0.1              c8af8d642c9a  e7606e183ced  
node-exporter.dcn0-computehci-1   dcn0-computehci-1  172.16.1.104:9100       running (2h)     3m ago   2d    23.5M        -  1.0.1              c8af8d642c9a  1bde9c8b9b9b  
node-exporter.dcn0-computehci-2   dcn0-computehci-2  172.16.1.96:9100        running (2h)     3m ago   2d    21.6M        -  1.0.1              c8af8d642c9a  a2a77f39a9c1  
~~~

When I check the cephadm spec file generated by TripleO, I can see that Mon, Mgr, and OSD are all scheduled on the DistributedComputeHCIScaleOut node as well as on the DistributedComputeHCI nodes.

~~~
(undercloud) [stack@undercloud ~]$ cat overcloud-deploy/dcn0/generated_ceph_spec.yaml
   :
placement:
  hosts:
  - dcn0-computehci-0  <==============(*) DistributedComputeHCI
  - dcn0-computehci-1  <==============(*) DistributedComputeHCI
  - dcn0-computehci-2  <==============(*) DistributedComputeHCI
  - dcn0-compute-0     <==============(*) DistributedComputeHCIScaleOut
service_id: mon
service_name: mon
service_type: mon
---
placement:
  hosts:
  - dcn0-computehci-0  <==============(*) DistributedComputeHCI
  - dcn0-computehci-1  <==============(*) DistributedComputeHCI
  - dcn0-computehci-2  <==============(*) DistributedComputeHCI
  - dcn0-compute-0     <==============(*) DistributedComputeHCIScaleOut
service_id: mgr
service_name: mgr
service_type: mgr
---
data_devices:
  all: true
placement:
  hosts:
  - dcn0-computehci-0  <==============(*) DistributedComputeHCI
  - dcn0-computehci-1  <==============(*) DistributedComputeHCI
  - dcn0-computehci-2  <==============(*) DistributedComputeHCI
  - dcn0-compute-0     <==============(*) DistributedComputeHCIScaleOut
service_id: default_drive_group
service_name: osd.default_drive_group
service_type: osd
~~~

This bug comes from the following code.
The regular expression it builds is too permissive.

~~~
https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/ansible_plugins/modules/ceph_spec_bootstrap.py#L280
                        pat = host_fmt.replace('%stackname%', '.*').replace('-%index%', '')
                        reg = re.compile(pat)
                        matching_hosts = []
                        for host in name_map:
                            if reg.match(host):
                                matching_hosts.append(name_map[host])
~~~

I debugged this code with rpdb, and the result is shown below.
The regex for the DistributedComputeHCI role is '.*-distributedcomputehci'.
Ideally, this should match only DistributedComputeHCI nodes.
However, it matches both DistributedComputeHCI nodes and DistributedComputeHCIScaleOut nodes.

~~~
(Pdb) p reg
re.compile('.*-distributedcomputehci')  <=============(*) wrong regex

(Pdb) print(json.dumps(name_map, indent=2))
{
  "dcn0-distributedcomputehci-0": "dcn0-computehci-0",
  "dcn0-distributedcomputehci-1": "dcn0-computehci-1",
  "dcn0-distributedcomputehci-2": "dcn0-computehci-2",
  "dcn0-distributedcomputehciscaleout-0": "dcn0-compute-0"  <=============(*) This regex also matches DistributedComputeHCIScaleOut node wrongly.
}
~~~

That's why Mon/Mgr are deployed on DistributedComputeHCIScaleOut as well as on DistributedComputeHCI.

I think this regex should be '.*-distributedcomputehci-' (with the trailing hyphen), not '.*-distributedcomputehci'.
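
The prefix match described above can be reproduced outside the module. The following is only a sketch: the `HOST_FMT` value is an assumption inferred from the compiled pattern in the debug output, the helper names are hypothetical, and the "proposed" variant simply keeps the hyphen as suggested in this report. It is not necessarily the fix that was merged in tripleo-ansible.

```python
import re

# Assumed hostname format for the DistributedComputeHCI role (inferred from
# the compiled pattern shown in the rpdb output; not copied from a real
# deployment).
HOST_FMT = "%stackname%-distributedcomputehci-%index%"

# name_map as printed in the rpdb session above.
NAME_MAP = {
    "dcn0-distributedcomputehci-0": "dcn0-computehci-0",
    "dcn0-distributedcomputehci-1": "dcn0-computehci-1",
    "dcn0-distributedcomputehci-2": "dcn0-computehci-2",
    "dcn0-distributedcomputehciscaleout-0": "dcn0-compute-0",
}


def current_pattern(host_fmt):
    """Pattern as built by the quoted code: the '-%index%' suffix is dropped."""
    return re.compile(host_fmt.replace("%stackname%", ".*").replace("-%index%", ""))


def proposed_pattern(host_fmt):
    """Hypothetical variant: drop only '%index%' so the trailing hyphen stays."""
    return re.compile(host_fmt.replace("%stackname%", ".*").replace("%index%", ""))


def matching_hosts(reg, name_map):
    """Return the mapped hostnames whose key matches the role pattern."""
    return [name_map[host] for host in name_map if reg.match(host)]


if __name__ == "__main__":
    print(current_pattern(HOST_FMT).pattern, matching_hosts(current_pattern(HOST_FMT), NAME_MAP))
    print(proposed_pattern(HOST_FMT).pattern, matching_hosts(proposed_pattern(HOST_FMT), NAME_MAP))
```

With the current pattern, `dcn0-compute-0` is included in the DistributedComputeHCI host list; with the trailing hyphen kept, only the three `dcn0-computehci-*` hosts match.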


Version-Release number of selected component (if applicable):
RHOSP 17.0.1

How reproducible:

Steps to Reproduce:
1. Run `openstack overcloud ceph deploy` for a DCN site with 3 DistributedComputeHCI nodes and at least 1 DistributedComputeHCIScaleOut node.


Actual results:
DistributedComputeHCIScaleOut has MGR/MON as well as OSD


Expected results:
DistributedComputeHCIScaleOut only has OSD, not MGR/MON

Comment 1 yatanaka 2023-07-27 08:27:16 UTC
JFYI, I'm pasting roles_data.yaml and overcloud-baremetal-deploy.yaml used to deploy ceph below:

~~~
(undercloud) [stack@undercloud ~]$ cat dcn0/dcn0_roles.yaml 
###############################################################################
# File generated by TripleO
###############################################################################
###############################################################################
# Role: DistributedComputeHCI                                                 #
###############################################################################
- name: DistributedComputeHCI
  description: |
    Distributed Compute Node role with Ceph, Cinder volume, and Glance.
  tags:
    - compute
  networks:
    InternalApi:
      subnet: internal_api_subnet
    Tenant:
      subnet: tenant_subnet
    Storage:
      subnet: storage_subnet
    StorageMgmt:
      subnet: storage_mgmt_subnet
  RoleParametersDefault:
    FsAioMaxNumber: 1048576
    TunedProfileName: "throughput-performance"
  # CephOSD present so serial has to be 1
  update_serial: 1
  ServicesDefault:
    - OS::TripleO::Services::Aide
    - OS::TripleO::Services::AuditD
    - OS::TripleO::Services::BarbicanClient
    - OS::TripleO::Services::BootParams
    - OS::TripleO::Services::CACerts
    - OS::TripleO::Services::CephClient
    - OS::TripleO::Services::CephExternal
    - OS::TripleO::Services::CephGrafana
    - OS::TripleO::Services::CephMds
    - OS::TripleO::Services::CephMgr
    - OS::TripleO::Services::CephMon
    - OS::TripleO::Services::CephRbdMirror
    - OS::TripleO::Services::CephRgw
    - OS::TripleO::Services::CephOSD
    - OS::TripleO::Services::CinderVolumeEdge
    - OS::TripleO::Services::Collectd
    - OS::TripleO::Services::ComputeCeilometerAgent
    - OS::TripleO::Services::ComputeNeutronCorePlugin
    - OS::TripleO::Services::ComputeNeutronL3Agent
    - OS::TripleO::Services::ComputeNeutronMetadataAgent
    - OS::TripleO::Services::ComputeNeutronOvsAgent
    - OS::TripleO::Services::Etcd
    - OS::TripleO::Services::Frr
    - OS::TripleO::Services::GlanceApiEdge
    - OS::TripleO::Services::IpaClient
    - OS::TripleO::Services::Ipsec
    - OS::TripleO::Services::Iscsid
    - OS::TripleO::Services::Kernel
    - OS::TripleO::Services::LoginDefs
    - OS::TripleO::Services::MetricsQdr
    - OS::TripleO::Services::Multipathd
    - OS::TripleO::Services::MySQLClient
    - OS::TripleO::Services::NeutronBgpVpnBagpipe
    - OS::TripleO::Services::NeutronLinuxbridgeAgent
    - OS::TripleO::Services::NeutronVppAgent
    - OS::TripleO::Services::NovaAZConfig
    - OS::TripleO::Services::NovaCompute
    - OS::TripleO::Services::NovaLibvirt
    - OS::TripleO::Services::NovaLibvirtGuests
    - OS::TripleO::Services::NovaMigrationTarget
    - OS::TripleO::Services::ContainersLogrotateCrond
    - OS::TripleO::Services::Podman
    - OS::TripleO::Services::Rhsm
    - OS::TripleO::Services::Rsyslog
    - OS::TripleO::Services::RsyslogSidecar
    - OS::TripleO::Services::Securetty
    - OS::TripleO::Services::Snmp
    - OS::TripleO::Services::Sshd
    - OS::TripleO::Services::Timesync
    - OS::TripleO::Services::Timezone
    - OS::TripleO::Services::TripleoFirewall
    - OS::TripleO::Services::TripleoPackages
    - OS::TripleO::Services::Tuned
    - OS::TripleO::Services::Vpp
    - OS::TripleO::Services::OVNController
    - OS::TripleO::Services::OVNMetadataAgent
###############################################################################
# Role: DistributedComputeHCIScaleOut                                         #
###############################################################################
- name: DistributedComputeHCIScaleOut
  description: |
    Distributed Compute Node role with CephOSD and HAproxy for Glance.
  tags:
    - compute
  networks:
    InternalApi:
      subnet: internal_api_subnet
    Tenant:
      subnet: tenant_subnet
    Storage:
      subnet: storage_subnet
    StorageMgmt:
      subnet: storage_mgmt_subnet
  RoleParametersDefault:
    FsAioMaxNumber: 1048576
    TunedProfileName: "throughput-performance"
  # CephOSD present so serial has to be 1
  update_serial: 1
  ServicesDefault:
    - OS::TripleO::Services::Aide
    - OS::TripleO::Services::AuditD
    - OS::TripleO::Services::BarbicanClient
    - OS::TripleO::Services::BootParams
    - OS::TripleO::Services::CACerts
    - OS::TripleO::Services::CephClient
    - OS::TripleO::Services::CephExternal
    - OS::TripleO::Services::CephOSD
    - OS::TripleO::Services::Collectd
    - OS::TripleO::Services::ComputeCeilometerAgent
    - OS::TripleO::Services::ComputeNeutronCorePlugin
    - OS::TripleO::Services::ComputeNeutronL3Agent
    - OS::TripleO::Services::ComputeNeutronMetadataAgent
    - OS::TripleO::Services::ComputeNeutronOvsAgent
    - OS::TripleO::Services::Frr
    - OS::TripleO::Services::HAproxyEdge
    - OS::TripleO::Services::IpaClient
    - OS::TripleO::Services::Ipsec
    - OS::TripleO::Services::Iscsid
    - OS::TripleO::Services::Kernel
    - OS::TripleO::Services::LoginDefs
    - OS::TripleO::Services::MetricsQdr
    - OS::TripleO::Services::Multipathd
    - OS::TripleO::Services::MySQLClient
    - OS::TripleO::Services::NeutronBgpVpnBagpipe
    - OS::TripleO::Services::NeutronLinuxbridgeAgent
    - OS::TripleO::Services::NeutronVppAgent
    - OS::TripleO::Services::NovaAZConfig
    - OS::TripleO::Services::NovaCompute
    - OS::TripleO::Services::NovaLibvirt
    - OS::TripleO::Services::NovaLibvirtGuests
    - OS::TripleO::Services::NovaMigrationTarget
    - OS::TripleO::Services::ContainersLogrotateCrond
    - OS::TripleO::Services::Podman
    - OS::TripleO::Services::Rhsm
    - OS::TripleO::Services::Rsyslog
    - OS::TripleO::Services::RsyslogSidecar
    - OS::TripleO::Services::Securetty
    - OS::TripleO::Services::Snmp
    - OS::TripleO::Services::Sshd
    - OS::TripleO::Services::Timesync
    - OS::TripleO::Services::Timezone
    - OS::TripleO::Services::TripleoFirewall
    - OS::TripleO::Services::TripleoPackages
    - OS::TripleO::Services::Tuned
    - OS::TripleO::Services::Vpp
    - OS::TripleO::Services::OVNController
    - OS::TripleO::Services::OVNMetadataAgent


(undercloud) [stack@undercloud ~]$ cat dcn0/overcloud-baremetal-deploy.yaml 
- name: DistributedComputeHCI
  count: 3
  defaults:
    networks:
    - network: ctlplane
      vif: true
    - network: external
      subnet: external_subnet
    - network: internal_api
      subnet: internal_api_subnet
    - network: storage
      subnet: storage_subnet
    - network: storage_mgmt
      subnet: storage_mgmt_subnet
    - network: tenant
      subnet: tenant_subnet
    network_config:
      template: /home/stack/dcn0/two_interfaces.j2
      default_route_network:
      - external
  instances:
  - hostname: dcn0-computehci-0
    name: dcn0_computehci0
  - hostname: dcn0-computehci-1
    name: dcn0_computehci1
  - hostname: dcn0-computehci-2
    name: dcn0_computehci2
- name: DistributedComputeHCIScaleOut
  count: 1
  defaults:
    networks:
    - network: ctlplane
      vif: true
    - network: external
      subnet: external_subnet
    - network: internal_api
      subnet: internal_api_subnet
    - network: storage
      subnet: storage_subnet
    - network: storage_mgmt
      subnet: storage_mgmt_subnet
    - network: tenant
      subnet: tenant_subnet
    network_config:
      template: /home/stack/dcn0/two_interfaces.j2
      default_route_network:
      - external
  instances:
  - hostname: dcn0-compute-0
    name: dcn0_compute0
~~~

Comment 5 Manoj Katari 2023-08-16 05:49:22 UTC
How to test:

Use a DCN site with 3 DistributedComputeHCI nodes and at least 1 DistributedComputeHCIScaleOut node.

Run the Ceph deployment with the command 'openstack overcloud ceph deploy'.

The DistributedComputeHCIScaleOut node should have only the OSD service, not the MON/MGR services.

Also, any node configured for a role should have only the services listed for that role in [1].

[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/roles/
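
The check on the ScaleOut node can be scripted against saved `ceph orch ps` text output. This is a minimal sketch, not part of any official test plan: the function names are hypothetical, the parsing assumes the whitespace-separated layout shown in the description (daemon name first, host second), and the sample lines are abbreviated from that output.

```python
# Abbreviated sample taken from the `ceph orch ps` output in the description.
SAMPLE = """\
Inferring fsid d961401d-50e0-50ac-a40f-ef07cbc752a6
crash.dcn0-compute-0         dcn0-compute-0     running (2h)  3m ago  2d
mgr.dcn0-compute-0.efkzri    dcn0-compute-0     running (2h)  3m ago  2d
mgr.dcn0-computehci-0.nlakit dcn0-computehci-0  *:9283  running (2h)  3m ago  2d
mon.dcn0-compute-0           dcn0-compute-0     running (2h)  3m ago  2d
mon.dcn0-computehci-0        dcn0-computehci-0  running (2h)  3m ago  2d
"""


def daemons_by_host(orch_ps_output):
    """Map each host to the set of daemon types listed for it."""
    result = {}
    for line in orch_ps_output.splitlines():
        fields = line.split()
        # Skip headers and informational lines: daemon names contain a dot.
        if len(fields) < 2 or "." not in fields[0]:
            continue
        daemon_type = fields[0].split(".", 1)[0]
        result.setdefault(fields[1], set()).add(daemon_type)
    return result


def unexpected_daemons(orch_ps_output, scaleout_hosts, forbidden=("mon", "mgr")):
    """Return {host: [daemon types]} for forbidden daemons on ScaleOut hosts."""
    by_host = daemons_by_host(orch_ps_output)
    found = {}
    for host in scaleout_hosts:
        hits = by_host.get(host, set()) & set(forbidden)
        if hits:
            found[host] = sorted(hits)
    return found


if __name__ == "__main__":
    # The list of ScaleOut hostnames is supplied by the tester.
    print(unexpected_daemons(SAMPLE, ["dcn0-compute-0"]))
```

An empty result means no MON or MGR daemon was found on the listed ScaleOut hosts; on the output from the description, `dcn0-compute-0` is reported with both.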

Comment 11 Manoj Katari 2023-12-01 05:16:25 UTC
Hi Gareth,

Updated doc text works for me.

Comment 20 errata-xmlrpc 2024-01-16 14:30:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 17.1.2 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:0209

Comment 21 Manoj Katari 2024-02-12 05:52:46 UTC
*** Bug 2257414 has been marked as a duplicate of this bug. ***