Bug 2082734

Summary: ceph dashboard proxy is unable to bind on a socket due to conflict with ceph-mgr
Product: Red Hat OpenStack
Reporter: Marian Krcmarik <mkrcmari>
Component: tripleo-ansible
Assignee: Francesco Pantano <fpantano>
Status: CLOSED ERRATA
QA Contact: Yogev Rabl <yrabl>
Severity: high
Docs Contact:
Priority: high
Version: 17.0 (Wallaby)
CC: alfrgarc, fpantano, jschluet
Target Milestone: ga
Keywords: Triaged
Target Release: 17.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: tripleo-ansible-3.3.1-0.20220701161440.c410227.el9ost
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-09-21 12:21:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Marian Krcmarik 2022-05-07 00:34:51 UTC
Description of problem:
Overcloud deployment fails if the ceph dashboard is deployed using cephadm. The exact error is:
FATAL | Run pacemaker restart if the config file for the service changed | central-controller-0 | error={"changed": false, "error": "Failed running command", "msg": "Error running /var/lib/container-config-scripts/pacemaker_restart_bundle.sh haproxy haproxy-bundle haproxy-bundle Started. rc: 1, stdout: , stderr: Waiting for the cluster to apply configuration changes (timeout: 600 seconds)...\nresource 'haproxy-bundle' is not running on any node\nWaiting for the cluster to apply configuration changes (timeout: 600 seconds)...\nError: resource 'haproxy-bundle' is not running on any node\nError: Errors have occurred, therefore pcs is unable to continue\n"}

This happens because haproxy fails to bind the socket for the ceph dashboard proxy:
Starting proxy ceph_dashboard: cannot bind socket (Address already in use) [192.168.24.82:8444]

The bind fails because ceph-mgr is already bound to that port on all interfaces:
# netstat -putna | grep 8444
tcp6       0      0 :::8444                 :::* LISTEN      6444/ceph-mgr
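
The check above can be automated. The following is a minimal Python sketch, assuming the column layout shown in the netstat output above; the function name wildcard_listeners is hypothetical and not part of any tool mentioned in this report. It reports processes listening on a given port on all interfaces.

```python
def wildcard_listeners(netstat_output, port):
    """Return (local_address, pid/program) pairs listening on `port` on all interfaces."""
    # Hosts that mean "all interfaces" in netstat output (assumed set).
    wildcard_hosts = {"", "*", "0.0.0.0", "::"}
    found = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local foreign state pid/program
        if len(fields) < 7 or fields[5] != "LISTEN":
            continue
        # Split on the last colon so IPv6 addresses such as "::" stay intact.
        host, _, local_port = fields[3].rpartition(":")
        if local_port == str(port) and host in wildcard_hosts:
            found.append((fields[3], fields[6]))
    return found


sample = "tcp6       0      0 :::8444                 :::* LISTEN      6444/ceph-mgr"
print(wildcard_listeners(sample, 8444))
```

A non-empty result for port 8444 means haproxy cannot bind the same port on a specific address of that host.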

ceph-mgr should not listen on all interfaces (the default unless configured otherwise); a specific IP address should be set for each ceph-mgr host.

The specific IP addresses on which the ceph dashboard module should listen are in fact set for each host in the ceph cluster:
# ceph config dump | grep server_addr
  mgr                                                                       advanced  mgr/dashboard/central-controller-0-pwaxfm/server_addr      172.23.1.11                                                                                                                          * 
  mgr                                                                       advanced  mgr/dashboard/central-controller-1-iczifz/server_addr      172.23.1.42                                                                                                                          * 
  mgr                                                                       advanced  mgr/dashboard/central-controller-2-qxfada/server_addr      172.23.1.25
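
To compare these settings against the running daemons, the instance names can be extracted from the dump. This is a rough Python sketch that assumes the key format mgr/dashboard/<instance>/server_addr seen in the output above; the helper name dashboard_server_addrs is hypothetical.

```python
import re

# Matches a per-instance key such as mgr/dashboard/<instance>/server_addr
# followed by its value; the pattern is inferred from the dump shown above.
SERVER_ADDR_RE = re.compile(r"mgr/dashboard/(\S+)/server_addr\s+(\S+)")


def dashboard_server_addrs(config_dump):
    """Map each dashboard instance name to its configured server_addr."""
    addrs = {}
    for line in config_dump.splitlines():
        match = SERVER_ADDR_RE.search(line)
        if match:
            addrs[match.group(1)] = match.group(2)
    return addrs


sample_dump = """\
  mgr  advanced  mgr/dashboard/central-controller-0-pwaxfm/server_addr  172.23.1.11  *
  mgr  advanced  mgr/dashboard/central-controller-1-iczifz/server_addr  172.23.1.42  *
  mgr  advanced  mgr/dashboard/central-controller-2-qxfada/server_addr  172.23.1.25
"""
for name, addr in dashboard_server_addrs(sample_dump).items():
    print(name, addr)
```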


The problem is that the host names in these keys do not match the name with which ceph-mgr was started:
# ps ax| grep ceph-mgr
   6442 ?        Ss     0:00 /dev/init -- /usr/bin/ceph-mgr -n mgr.central-controller-0.pwaxfm -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug

The difference is the last "-" versus ".":
central-controller-0-pwaxfm vs. central-controller-0.pwaxfm

It probably comes from this code:
https://opendev.org/openstack/tripleo-ansible/src/branch/master/tripleo_ansible/roles/tripleo_cephadm/tasks/dashboard/configure_dashboard_backends.yml#L17-L33
where the name of the ceph-mgr node is taken from the container name, which has a dash rather than a dot, whereas the ceph-mgr process is started with a name that includes a dot.
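
The mismatch can be illustrated with a short Python sketch. It assumes, as this report suggests, that the dashboard looks up its address under a key built from the daemon name that follows the "mgr." prefix; the helper dashboard_addr_key and the configured dictionary are hypothetical and only restate values quoted above.

```python
def dashboard_addr_key(daemon_name):
    """Build the per-instance server_addr key from a daemon name like 'mgr.<id>'."""
    prefix = "mgr."
    if not daemon_name.startswith(prefix):
        raise ValueError("expected a name starting with 'mgr.': %r" % daemon_name)
    # Assumption: the key uses the daemon id exactly as passed to ceph-mgr -n.
    return "mgr/dashboard/{}/server_addr".format(daemon_name[len(prefix):])


# Key actually present in the cluster configuration (dash before the suffix).
configured = {
    "mgr/dashboard/central-controller-0-pwaxfm/server_addr": "172.23.1.11",
}

# Key derived from the name the process was started with (dot before the suffix).
expected = dashboard_addr_key("mgr.central-controller-0.pwaxfm")
print(expected)
print(expected in configured)
```

Because the derived key is absent from the configuration, the daemon would fall back to its default of listening on all interfaces, which matches the netstat output earlier in this report.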

Version-Release number of selected component (if applicable):
tripleo-ansible-3.3.1-0.20220407091528.0bc2994.el9ost.noarch

How reproducible:
Always

Comment 9 errata-xmlrpc 2022-09-21 12:21:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543