Bug 2226963

Summary: CephMon and CephMgr are wrongly deployed on DistributedComputeHCIScaleOut nodes
Product: Red Hat OpenStack
Reporter: yatanaka
Component: tripleo-ansible
Assignee: Manoj Katari <mkatari>
Status: MODIFIED
QA Contact: Alfredo <alfrgarc>
Severity: medium
Priority: medium
Version: 17.0 (Wallaby)
CC: eharney, fpantano, gfidente, johfulto, tkajinam
Target Milestone: z2
Keywords: Triaged
Target Release: 17.1
Hardware: x86_64
OS: Linux
Fixed In Version: tripleo-ansible-3.3.1-17.1.20230816000827.bd032f7.el9ost
Doc Type: Bug Fix
Doc Text:
Before this update, on a DCN site with 3 DistributedComputeHCI nodes and at least 1 DistributedComputeHCIScaleOut node, an incorrect spec (roles-to-hosts map) was generated for cephadm. With this update, the spec is generated correctly on a DCN site with a mix of DistributedComputeHCI and DistributedComputeHCIScaleOut nodes.
Type: Bug

Description yatanaka 2023-07-27 07:18:48 UTC
Description of problem:

DistributedComputeHCIScaleOut should have only CephOSD, and neither CephMon nor CephMgr.

~~~
https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/roles/DistributedComputeHCIScaleOut.yaml#L29-L31
    - OS::TripleO::Services::CephClient
    - OS::TripleO::Services::CephExternal
    - OS::TripleO::Services::CephOSD
~~~

However, CephMon and CephMgr are deployed on DistributedComputeHCIScaleOut in my RHOSP 17.0.1 lab.

~~~
[root@dcn0-compute-0 ~]# podman ps |grep ceph
fa4985f495f8  undercloud.ctlplane.yatanaka.example.com:8787/rhceph/rhceph-5-rhel8@sha256:b25f6178c91483c5248f9794122f1f6731e42cbc8ddba8402c7a9e2911e0e874  -n client.crash.d...  3 hours ago   Up 3 hours ago                        ceph-d961401d-50e0-50ac-a40f-ef07cbc752a6-crash-dcn0-compute-0
30bcde9123aa  undercloud.ctlplane.yatanaka.example.com:8787/openshift4/ose-prometheus-node-exporter:v4.6                                                   --no-collector.ti...  3 hours ago   Up 3 hours ago                        ceph-d961401d-50e0-50ac-a40f-ef07cbc752a6-node-exporter-dcn0-compute-0
9be8e742b6b4  undercloud.ctlplane.yatanaka.example.com:8787/rhceph/rhceph-5-rhel8@sha256:b25f6178c91483c5248f9794122f1f6731e42cbc8ddba8402c7a9e2911e0e874  -n mon.dcn0-compu...  3 hours ago   Up 3 hours ago                        ceph-d961401d-50e0-50ac-a40f-ef07cbc752a6-mon-dcn0-compute-0
28b3bd3654dc  undercloud.ctlplane.yatanaka.example.com:8787/rhceph/rhceph-5-rhel8@sha256:b25f6178c91483c5248f9794122f1f6731e42cbc8ddba8402c7a9e2911e0e874  -n mgr.dcn0-compu...  3 hours ago   Up 3 hours ago                        ceph-d961401d-50e0-50ac-a40f-ef07cbc752a6-mgr-dcn0-compute-0-efkzri

(undercloud) [stack@undercloud ~]$ ssh heat-admin sudo cephadm shell --config /etc/ceph/dcn0.conf --keyring /etc/ceph/dcn0.client.admin.keyring -- ceph orch ps
Inferring fsid d961401d-50e0-50ac-a40f-ef07cbc752a6
crash.dcn0-compute-0              dcn0-compute-0                             running (2h)     3m ago   2d    6627k        -  16.2.10-187.el8cp  72d512a15e58  fa4985f495f8 <=========(*)DistributedComputeHCIScaleOut
crash.dcn0-computehci-0           dcn0-computehci-0                          running (2h)     3m ago   2d    6627k        -  16.2.10-187.el8cp  72d512a15e58  97d594b3eb84  
crash.dcn0-computehci-1           dcn0-computehci-1                          running (2h)     3m ago   2d    6627k        -  16.2.10-187.el8cp  72d512a15e58  22e6d923190d  
crash.dcn0-computehci-2           dcn0-computehci-2                          running (2h)     3m ago   2d    6627k        -  16.2.10-187.el8cp  72d512a15e58  a499d142239f  
mgr.dcn0-compute-0.efkzri         dcn0-compute-0                             running (2h)     3m ago   2d     395M        -  16.2.10-187.el8cp  72d512a15e58  28b3bd3654dc <=========(*)DistributedComputeHCIScaleOut
mgr.dcn0-computehci-0.nlakit      dcn0-computehci-0  *:9283                  running (2h)     3m ago   2d     470M        -  16.2.10-187.el8cp  72d512a15e58  ea200fa686bb  
mgr.dcn0-computehci-1.rioxdc      dcn0-computehci-1                          running (2h)     3m ago   2d     403M        -  16.2.10-187.el8cp  72d512a15e58  f3a9b33ebb6c  
mgr.dcn0-computehci-2.xuhuhm      dcn0-computehci-2                          running (2h)     3m ago   2d     403M        -  16.2.10-187.el8cp  72d512a15e58  e74abfdf8bde  
mon.dcn0-compute-0                dcn0-compute-0                             running (2h)     3m ago   2d     128M    2048M  16.2.10-187.el8cp  72d512a15e58  9be8e742b6b4 <=========(*)DistributedComputeHCIScaleOut
mon.dcn0-computehci-0             dcn0-computehci-0                          running (2h)     3m ago   2d     273M    2048M  16.2.10-187.el8cp  72d512a15e58  25cedb06cdfd  
mon.dcn0-computehci-1             dcn0-computehci-1                          running (2h)     3m ago   2d     229M    2048M  16.2.10-187.el8cp  72d512a15e58  18ae06f39ea1  
mon.dcn0-computehci-2             dcn0-computehci-2                          running (2h)     3m ago   2d     187M    2048M  16.2.10-187.el8cp  72d512a15e58  5f86d45a7f0a  
node-exporter.dcn0-compute-0      dcn0-compute-0     172.16.1.100:9100       running (2h)     3m ago   2d    22.1M        -  1.0.1              c8af8d642c9a  30bcde9123aa <=========(*)DistributedComputeHCIScaleOut
node-exporter.dcn0-computehci-0   dcn0-computehci-0  172.16.1.34:9100        running (2h)     3m ago   2d    23.0M        -  1.0.1              c8af8d642c9a  e7606e183ced  
node-exporter.dcn0-computehci-1   dcn0-computehci-1  172.16.1.104:9100       running (2h)     3m ago   2d    23.5M        -  1.0.1              c8af8d642c9a  1bde9c8b9b9b  
node-exporter.dcn0-computehci-2   dcn0-computehci-2  172.16.1.96:9100        running (2h)     3m ago   2d    21.6M        -  1.0.1              c8af8d642c9a  a2a77f39a9c1  
~~~

The cephadm spec file generated by TripleO shows that Mon, Mgr, and OSD are all scheduled on the DistributedComputeHCIScaleOut node as well as on the DistributedComputeHCI nodes:

~~~
(undercloud) [stack@undercloud ~]$ cat overcloud-deploy/dcn0/generated_ceph_spec.yaml
   :
placement:
  hosts:
  - dcn0-computehci-0  <==============(*) DistributedComputeHCI
  - dcn0-computehci-1  <==============(*) DistributedComputeHCI
  - dcn0-computehci-2  <==============(*) DistributedComputeHCI
  - dcn0-compute-0     <==============(*) DistributedComputeHCIScaleOut
service_id: mon
service_name: mon
service_type: mon
---
placement:
  hosts:
  - dcn0-computehci-0  <==============(*) DistributedComputeHCI
  - dcn0-computehci-1  <==============(*) DistributedComputeHCI
  - dcn0-computehci-2  <==============(*) DistributedComputeHCI
  - dcn0-compute-0     <==============(*) DistributedComputeHCIScaleOut
service_id: mgr
service_name: mgr
service_type: mgr
---
data_devices:
  all: true
placement:
  hosts:
  - dcn0-computehci-0  <==============(*) DistributedComputeHCI
  - dcn0-computehci-1  <==============(*) DistributedComputeHCI
  - dcn0-computehci-2  <==============(*) DistributedComputeHCI
  - dcn0-compute-0     <==============(*) DistributedComputeHCIScaleOut
service_id: default_drive_group
service_name: osd.default_drive_group
service_type: osd
~~~

The bug comes from the following code.
The regular expression it builds is too permissive.

~~~
https://github.com/openstack/tripleo-ansible/blob/master/tripleo_ansible/ansible_plugins/modules/ceph_spec_bootstrap.py#L280
                        pat = host_fmt.replace('%stackname%', '.*').replace('-%index%', '')
                        reg = re.compile(pat)
                        matching_hosts = []
                        for host in name_map:
                            if reg.match(host):
                                matching_hosts.append(name_map[host])
~~~

I debugged this code with rpdb; the result is shown below.
The regex for the DistributedComputeHCI role is '.*-distributedcomputehci'.
Ideally, this should match only DistributedComputeHCI nodes.
However, it matches both DistributedComputeHCI nodes and DistributedComputeHCIScaleOut nodes, because "distributedcomputehci" is a prefix of "distributedcomputehciscaleout".

~~~
(Pdb) p reg
re.compile('.*-distributedcomputehci')  <=============(*) wrong regex

(Pdb) print(json.dumps(name_map, indent=2))
{
  "dcn0-distributedcomputehci-0": "dcn0-computehci-0",
  "dcn0-distributedcomputehci-1": "dcn0-computehci-1",
  "dcn0-distributedcomputehci-2": "dcn0-computehci-2",
  "dcn0-distributedcomputehciscaleout-0": "dcn0-compute-0"  <=============(*) This regex also matches DistributedComputeHCIScaleOut node wrongly.
}
~~~

That's why Mon/Mgr are deployed on DistributedComputeHCIScaleOut as well as DistributedComputeHCI.

I think this regex should be '.*-distributedcomputehci-' (keeping the trailing hyphen), not '.*-distributedcomputehci'.
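
The over-match can be reproduced outside the module with a minimal sketch like the one below. The name_map is copied from the rpdb output above; the host_fmt value is an assumption inferred from the compiled regex, and the "proposed" pattern is only the suggestion from this report, not necessarily the fix that was merged upstream.

```python
import re

# name_map copied from the rpdb output in this report
name_map = {
    "dcn0-distributedcomputehci-0": "dcn0-computehci-0",
    "dcn0-distributedcomputehci-1": "dcn0-computehci-1",
    "dcn0-distributedcomputehci-2": "dcn0-computehci-2",
    "dcn0-distributedcomputehciscaleout-0": "dcn0-compute-0",
}

# Assumption: host_fmt for the DistributedComputeHCI role, inferred from
# the regex '.*-distributedcomputehci' seen in the debugger.
host_fmt = "%stackname%-distributedcomputehci-%index%"


def matching_hosts(pattern):
    """Apply the same match loop as the quoted module code."""
    reg = re.compile(pattern)
    return [name_map[host] for host in name_map if reg.match(host)]


# Current behaviour: '-%index%' is dropped, so the role name has no terminator.
current = host_fmt.replace("%stackname%", ".*").replace("-%index%", "")
# Suggested behaviour: keep the hyphen that precedes the index.
proposed = host_fmt.replace("%stackname%", ".*").replace("-%index%", "-")

print(current, matching_hosts(current))
print(proposed, matching_hosts(proposed))
```

With the trailing hyphen kept, "dcn0-distributedcomputehciscaleout-0" no longer matches, because the character after "distributedcomputehci" is "s" rather than "-".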


Version-Release number of selected component (if applicable):
RHOSP 17.0.1

How reproducible:

Steps to Reproduce:
1. Run `openstack overcloud ceph deploy` for a DCN site with 3 DistributedComputeHCI nodes and at least 1 DistributedComputeHCIScaleOut.


Actual results:
DistributedComputeHCIScaleOut has MGR/MON as well as OSD.


Expected results:
DistributedComputeHCIScaleOut has only OSD, not MGR/MON.

Comment 1 yatanaka 2023-07-27 08:27:16 UTC
For reference, the roles file and overcloud-baremetal-deploy.yaml used to deploy Ceph are pasted below:

~~~
(undercloud) [stack@undercloud ~]$ cat dcn0/dcn0_roles.yaml 
###############################################################################
# File generated by TripleO
###############################################################################
###############################################################################
# Role: DistributedComputeHCI                                                 #
###############################################################################
- name: DistributedComputeHCI
  description: |
    Distributed Compute Node role with Ceph, Cinder volume, and Glance.
  tags:
    - compute
  networks:
    InternalApi:
      subnet: internal_api_subnet
    Tenant:
      subnet: tenant_subnet
    Storage:
      subnet: storage_subnet
    StorageMgmt:
      subnet: storage_mgmt_subnet
  RoleParametersDefault:
    FsAioMaxNumber: 1048576
    TunedProfileName: "throughput-performance"
  # CephOSD present so serial has to be 1
  update_serial: 1
  ServicesDefault:
    - OS::TripleO::Services::Aide
    - OS::TripleO::Services::AuditD
    - OS::TripleO::Services::BarbicanClient
    - OS::TripleO::Services::BootParams
    - OS::TripleO::Services::CACerts
    - OS::TripleO::Services::CephClient
    - OS::TripleO::Services::CephExternal
    - OS::TripleO::Services::CephGrafana
    - OS::TripleO::Services::CephMds
    - OS::TripleO::Services::CephMgr
    - OS::TripleO::Services::CephMon
    - OS::TripleO::Services::CephRbdMirror
    - OS::TripleO::Services::CephRgw
    - OS::TripleO::Services::CephOSD
    - OS::TripleO::Services::CinderVolumeEdge
    - OS::TripleO::Services::Collectd
    - OS::TripleO::Services::ComputeCeilometerAgent
    - OS::TripleO::Services::ComputeNeutronCorePlugin
    - OS::TripleO::Services::ComputeNeutronL3Agent
    - OS::TripleO::Services::ComputeNeutronMetadataAgent
    - OS::TripleO::Services::ComputeNeutronOvsAgent
    - OS::TripleO::Services::Etcd
    - OS::TripleO::Services::Frr
    - OS::TripleO::Services::GlanceApiEdge
    - OS::TripleO::Services::IpaClient
    - OS::TripleO::Services::Ipsec
    - OS::TripleO::Services::Iscsid
    - OS::TripleO::Services::Kernel
    - OS::TripleO::Services::LoginDefs
    - OS::TripleO::Services::MetricsQdr
    - OS::TripleO::Services::Multipathd
    - OS::TripleO::Services::MySQLClient
    - OS::TripleO::Services::NeutronBgpVpnBagpipe
    - OS::TripleO::Services::NeutronLinuxbridgeAgent
    - OS::TripleO::Services::NeutronVppAgent
    - OS::TripleO::Services::NovaAZConfig
    - OS::TripleO::Services::NovaCompute
    - OS::TripleO::Services::NovaLibvirt
    - OS::TripleO::Services::NovaLibvirtGuests
    - OS::TripleO::Services::NovaMigrationTarget
    - OS::TripleO::Services::ContainersLogrotateCrond
    - OS::TripleO::Services::Podman
    - OS::TripleO::Services::Rhsm
    - OS::TripleO::Services::Rsyslog
    - OS::TripleO::Services::RsyslogSidecar
    - OS::TripleO::Services::Securetty
    - OS::TripleO::Services::Snmp
    - OS::TripleO::Services::Sshd
    - OS::TripleO::Services::Timesync
    - OS::TripleO::Services::Timezone
    - OS::TripleO::Services::TripleoFirewall
    - OS::TripleO::Services::TripleoPackages
    - OS::TripleO::Services::Tuned
    - OS::TripleO::Services::Vpp
    - OS::TripleO::Services::OVNController
    - OS::TripleO::Services::OVNMetadataAgent
###############################################################################
# Role: DistributedComputeHCIScaleOut                                         #
###############################################################################
- name: DistributedComputeHCIScaleOut
  description: |
    Distributed Compute Node role with CephOSD and HAproxy for Glance.
  tags:
    - compute
  networks:
    InternalApi:
      subnet: internal_api_subnet
    Tenant:
      subnet: tenant_subnet
    Storage:
      subnet: storage_subnet
    StorageMgmt:
      subnet: storage_mgmt_subnet
  RoleParametersDefault:
    FsAioMaxNumber: 1048576
    TunedProfileName: "throughput-performance"
  # CephOSD present so serial has to be 1
  update_serial: 1
  ServicesDefault:
    - OS::TripleO::Services::Aide
    - OS::TripleO::Services::AuditD
    - OS::TripleO::Services::BarbicanClient
    - OS::TripleO::Services::BootParams
    - OS::TripleO::Services::CACerts
    - OS::TripleO::Services::CephClient
    - OS::TripleO::Services::CephExternal
    - OS::TripleO::Services::CephOSD
    - OS::TripleO::Services::Collectd
    - OS::TripleO::Services::ComputeCeilometerAgent
    - OS::TripleO::Services::ComputeNeutronCorePlugin
    - OS::TripleO::Services::ComputeNeutronL3Agent
    - OS::TripleO::Services::ComputeNeutronMetadataAgent
    - OS::TripleO::Services::ComputeNeutronOvsAgent
    - OS::TripleO::Services::Frr
    - OS::TripleO::Services::HAproxyEdge
    - OS::TripleO::Services::IpaClient
    - OS::TripleO::Services::Ipsec
    - OS::TripleO::Services::Iscsid
    - OS::TripleO::Services::Kernel
    - OS::TripleO::Services::LoginDefs
    - OS::TripleO::Services::MetricsQdr
    - OS::TripleO::Services::Multipathd
    - OS::TripleO::Services::MySQLClient
    - OS::TripleO::Services::NeutronBgpVpnBagpipe
    - OS::TripleO::Services::NeutronLinuxbridgeAgent
    - OS::TripleO::Services::NeutronVppAgent
    - OS::TripleO::Services::NovaAZConfig
    - OS::TripleO::Services::NovaCompute
    - OS::TripleO::Services::NovaLibvirt
    - OS::TripleO::Services::NovaLibvirtGuests
    - OS::TripleO::Services::NovaMigrationTarget
    - OS::TripleO::Services::ContainersLogrotateCrond
    - OS::TripleO::Services::Podman
    - OS::TripleO::Services::Rhsm
    - OS::TripleO::Services::Rsyslog
    - OS::TripleO::Services::RsyslogSidecar
    - OS::TripleO::Services::Securetty
    - OS::TripleO::Services::Snmp
    - OS::TripleO::Services::Sshd
    - OS::TripleO::Services::Timesync
    - OS::TripleO::Services::Timezone
    - OS::TripleO::Services::TripleoFirewall
    - OS::TripleO::Services::TripleoPackages
    - OS::TripleO::Services::Tuned
    - OS::TripleO::Services::Vpp
    - OS::TripleO::Services::OVNController
    - OS::TripleO::Services::OVNMetadataAgent


(undercloud) [stack@undercloud ~]$ cat dcn0/overcloud-baremetal-deploy.yaml 
- name: DistributedComputeHCI
  count: 3
  defaults:
    networks:
    - network: ctlplane
      vif: true
    - network: external
      subnet: external_subnet
    - network: internal_api
      subnet: internal_api_subnet
    - network: storage
      subnet: storage_subnet
    - network: storage_mgmt
      subnet: storage_mgmt_subnet
    - network: tenant
      subnet: tenant_subnet
    network_config:
      template: /home/stack/dcn0/two_interfaces.j2
      default_route_network:
      - external
  instances:
  - hostname: dcn0-computehci-0
    name: dcn0_computehci0
  - hostname: dcn0-computehci-1
    name: dcn0_computehci1
  - hostname: dcn0-computehci-2
    name: dcn0_computehci2
- name: DistributedComputeHCIScaleOut
  count: 1
  defaults:
    networks:
    - network: ctlplane
      vif: true
    - network: external
      subnet: external_subnet
    - network: internal_api
      subnet: internal_api_subnet
    - network: storage
      subnet: storage_subnet
    - network: storage_mgmt
      subnet: storage_mgmt_subnet
    - network: tenant
      subnet: tenant_subnet
    network_config:
      template: /home/stack/dcn0/two_interfaces.j2
      default_route_network:
      - external
  instances:
  - hostname: dcn0-compute-0
    name: dcn0_compute0
~~~
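
For comparison, the placement that these two files imply can be worked out by hand. The sketch below is illustrative only: the role and host data are abbreviated from the files above (only the CephMon, CephMgr, and CephOSD entries are kept), and the service_type mapping and the expected_placement helper are hypothetical names, not part of tripleo-ansible.

```python
# Ceph daemon services per role, abbreviated from dcn0_roles.yaml
role_services = {
    "DistributedComputeHCI": ["CephMgr", "CephMon", "CephOSD"],
    "DistributedComputeHCIScaleOut": ["CephOSD"],
}

# Hostnames per role, from overcloud-baremetal-deploy.yaml
role_hosts = {
    "DistributedComputeHCI": [
        "dcn0-computehci-0",
        "dcn0-computehci-1",
        "dcn0-computehci-2",
    ],
    "DistributedComputeHCIScaleOut": ["dcn0-compute-0"],
}

# Assumption: TripleO service name -> cephadm service_type
service_type = {"CephMon": "mon", "CephMgr": "mgr", "CephOSD": "osd"}


def expected_placement(services_by_role, hosts_by_role):
    """Build service_type -> hosts from an exact role-name lookup."""
    placement = {}
    for role, services in services_by_role.items():
        for service in services:
            hosts = placement.setdefault(service_type[service], [])
            hosts.extend(hosts_by_role[role])
    return placement


for stype, hosts in expected_placement(role_services, role_hosts).items():
    print(stype, hosts)
```

Because the lookup uses the exact role name rather than a prefix match, dcn0-compute-0 appears only under osd.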

Comment 5 Manoj Katari 2023-08-16 05:49:22 UTC
How to test:

On a DCN site with 3 DistributedComputeHCI nodes and at least 1 DistributedComputeHCIScaleOut node, deploy Ceph with the command 'openstack overcloud ceph deploy'.

The DistributedComputeHCIScaleOut node should run only the OSD service, not the MON or MGR services.

Also, any node configured for a role should run only the services listed for that role in [1].

[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/wallaby/roles/
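
One way to automate the check is to scan saved `ceph orch ps` output for MON/MGR daemons on unexpected hosts. This is a minimal sketch, assuming the first two whitespace-separated columns are the daemon name and the host, as in the output in the description; misplaced_daemons is a hypothetical helper, and the sample lines are abbreviated from that output.

```python
def misplaced_daemons(orch_ps_lines, allowed):
    """Return (daemon, host) pairs for restricted daemon types on hosts not allowed."""
    bad = []
    for line in orch_ps_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        daemon, host = fields[0], fields[1]
        # Daemon names look like 'mon.<host>' or 'mgr.<host>.<suffix>'
        daemon_type = daemon.split(".", 1)[0]
        if daemon_type in allowed and host not in allowed[daemon_type]:
            bad.append((daemon, host))
    return bad


# Hosts of the DistributedComputeHCI role, from the deployment in this report
hci_hosts = {"dcn0-computehci-0", "dcn0-computehci-1", "dcn0-computehci-2"}
allowed = {"mon": hci_hosts, "mgr": hci_hosts}

# Sample lines abbreviated from the 'ceph orch ps' output in the description
sample = [
    "Inferring fsid d961401d-50e0-50ac-a40f-ef07cbc752a6",
    "crash.dcn0-compute-0 dcn0-compute-0 running (2h)",
    "mgr.dcn0-compute-0.efkzri dcn0-compute-0 running (2h)",
    "mgr.dcn0-computehci-0.nlakit dcn0-computehci-0 *:9283 running (2h)",
    "mon.dcn0-compute-0 dcn0-compute-0 running (2h)",
    "mon.dcn0-computehci-0 dcn0-computehci-0 running (2h)",
]

for daemon, host in misplaced_daemons(sample, allowed):
    print("unexpected:", daemon, "on", host)
```

On a fixed deployment the function should return an empty list for the same allowed map.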