Bug 2082734 - ceph dashboard proxy is unable to bind on a socket due to conflict with ceph-mgr
Summary: ceph dashboard proxy is unable to bind on a socket due to conflict with ceph-mgr
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 17.0 (Wallaby)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ga
Target Release: 17.0
Assignee: Francesco Pantano
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-05-07 00:34 UTC by Marian Krcmarik
Modified: 2022-09-21 12:21 UTC (History)

Fixed In Version: tripleo-ansible-3.3.1-0.20220701161440.c410227.el9ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-21 12:21:27 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 841081 0 None master: MERGED tripleo-ansible: Normalize the server_addr dashboard backend (Ifc9a0a9ac2f13c891ccde826aef2ab7cbdb5d690) 2022-06-23 19:09:22 UTC
OpenStack gerrit 841875 0 None stable/wallaby: MERGED tripleo-ansible: Normalize the server_addr dashboard backend (Ifc9a0a9ac2f13c891ccde826aef2ab7cbdb5d690) 2022-06-23 19:09:27 UTC
OpenStack gerrit 845115 0 None master: MERGED tripleo-ansible: Set current ceph dashboard mgr backend fact (I56d8862522fe82cb9f2373574ee067fbe4cae98d) 2022-06-23 19:09:32 UTC
OpenStack gerrit 846256 0 None stable/wallaby: MERGED tripleo-ansible: Set current ceph dashboard mgr backend fact (I56d8862522fe82cb9f2373574ee067fbe4cae98d) 2022-06-23 19:09:37 UTC
Red Hat Issue Tracker OSP-15071 0 None None None 2022-05-07 00:40:20 UTC
Red Hat Product Errata RHEA-2022:6543 0 None None None 2022-09-21 12:21:52 UTC

Description Marian Krcmarik 2022-05-07 00:34:51 UTC
Description of problem:
Overcloud deployment fails when the ceph dashboard is deployed using cephadm. The exact error is:
FATAL | Run pacemaker restart if the config file for the service changed | central-controller-0 | error={"changed": false, "error": "Failed running command", "msg": "Error running /var/lib/container-config-scripts/pacemaker_restart_bundle.sh haproxy haproxy-bundle haproxy-bundle Started. rc: 1, stdout: , stderr: Waiting for the cluster to apply configuration changes (timeout: 600 seconds)...\nresource 'haproxy-bundle' is not running on any node\nWaiting for the cluster to apply configuration changes (timeout: 600 seconds)...\nError: resource 'haproxy-bundle' is not running on any node\nError: Errors have occurred, therefore pcs is unable to continue\n"}

This happens because haproxy fails to bind the socket for the ceph dashboard proxy:
Starting proxy ceph_dashboard: cannot bind socket (Address already in use) [192.168.24.82:8444]

The bind fails because ceph-mgr is already bound to that port on all interfaces:
# netstat -putna | grep 8444
tcp6       0      0 :::8444                 :::* LISTEN      6444/ceph-mgr
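The conflict can be reproduced outside of haproxy and ceph-mgr. The following is a minimal sketch, not taken from either project: it binds one socket to the wildcard address and then tries to bind a second socket to a specific address on the same port. The function name is hypothetical, and the sketch uses an IPv4 wildcard on an ephemeral port for simplicity, whereas ceph-mgr here holds the IPv6 wildcard on port 8444.

```python
import errno
import socket


def specific_bind_conflicts_with_wildcard() -> bool:
    """Return True if binding a specific address fails with EADDRINUSE
    while another socket already holds the wildcard address on that port."""
    wildcard = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    specific = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Plays the role of ceph-mgr: listen on all interfaces.
        # Port 0 asks the kernel for any free port.
        wildcard.bind(("0.0.0.0", 0))
        wildcard.listen(1)
        port = wildcard.getsockname()[1]
        try:
            # Plays the role of haproxy: bind one address on the same port.
            specific.bind(("127.0.0.1", port))
        except OSError as exc:
            return exc.errno == errno.EADDRINUSE
        return False
    finally:
        specific.close()
        wildcard.close()
```

Neither socket sets SO_REUSEADDR, so the second bind is expected to be rejected, which corresponds to the "Address already in use" message from haproxy.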

ceph-mgr should not listen on all interfaces (which is the default unless configured otherwise); a specific IP address should be set for each ceph-mgr host.

The specific IP addresses on which the ceph dashboard module should listen are in fact set for each host in the ceph cluster:
# ceph config dump | grep server_addr
  mgr                                                                       advanced  mgr/dashboard/central-controller-0-pwaxfm/server_addr      172.23.1.11                                                                                                                          * 
  mgr                                                                       advanced  mgr/dashboard/central-controller-1-iczifz/server_addr      172.23.1.42                                                                                                                          * 
  mgr                                                                       advanced  mgr/dashboard/central-controller-2-qxfada/server_addr      172.23.1.25


The problem is that the host names in these keys do not match the name with which ceph-mgr was started:
# ps ax| grep ceph-mgr
   6442 ?        Ss     0:00 /dev/init -- /usr/bin/ceph-mgr -n mgr.central-controller-0.pwaxfm -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug

The difference is the last separator, "-" versus ".":
central-controller-0-pwaxfm vs. central-controller-0.pwaxfm
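The mismatch can be checked mechanically. The sketch below is illustrative only and assumes the `ceph config dump` output has the column layout shown above; the function names and the regular expression are hypothetical and not part of ceph or tripleo-ansible. It extracts the per-instance `server_addr` keys and tests whether one exists for the id that ceph-mgr was started with.

```python
import re
from typing import Dict

# Matches "mgr/dashboard/<name>/server_addr" followed by its value.
SERVER_ADDR_RE = re.compile(
    r"mgr/dashboard/(?P<name>[^/\s]+)/server_addr\s+(?P<addr>\S+)"
)


def parse_server_addrs(config_dump: str) -> Dict[str, str]:
    """Map each mgr name found in the dump to its server_addr value."""
    addrs = {}
    for line in config_dump.splitlines():
        match = SERVER_ADDR_RE.search(line)
        if match:
            addrs[match.group("name")] = match.group("addr")
    return addrs


def has_server_addr(addrs: Dict[str, str], daemon_name: str) -> bool:
    """daemon_name is the value passed to `ceph-mgr -n`, e.g. mgr.<id>."""
    prefix = "mgr."
    mgr_id = daemon_name[len(prefix):] if daemon_name.startswith(prefix) else daemon_name
    return mgr_id in addrs


SAMPLE_DUMP = """\
  mgr  advanced  mgr/dashboard/central-controller-0-pwaxfm/server_addr  172.23.1.11  *
  mgr  advanced  mgr/dashboard/central-controller-1-iczifz/server_addr  172.23.1.42  *
"""
```

With the sample above, `has_server_addr(parse_server_addrs(SAMPLE_DUMP), "mgr.central-controller-0.pwaxfm")` is false: no key matches the dotted id, so that instance falls back to the default of listening on all interfaces.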

It probably comes from this code:
https://opendev.org/openstack/tripleo-ansible/src/branch/master/tripleo_ansible/roles/tripleo_cephadm/tasks/dashboard/configure_dashboard_backends.yml#L17-L33
where the name of the ceph-mgr node is taken from the container name, which uses a dash rather than a dot, whereas the ceph-mgr process is started with a name that contains a dot.
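One possible normalization is sketched below. It is an assumption for illustration and not necessarily what the merged tripleo-ansible patches do: it supposes the container-derived name always ends in "-<suffix>" and that only this last dash should become a dot. Both function names are hypothetical.

```python
def container_name_to_mgr_id(name: str) -> str:
    """Turn 'host-suffix' (as derived from the container name) into
    'host.suffix' (as used by the ceph-mgr process) by replacing only
    the last dash. Names without a dash are returned unchanged."""
    host, sep, suffix = name.rpartition("-")
    if not sep:
        return name
    return f"{host}.{suffix}"


def server_addr_key(mgr_id: str) -> str:
    """Build the per-instance dashboard config key for a mgr id."""
    return f"mgr/dashboard/{mgr_id}/server_addr"
```

For example, `server_addr_key(container_name_to_mgr_id("central-controller-0-pwaxfm"))` yields `mgr/dashboard/central-controller-0.pwaxfm/server_addr`, which matches the name the process was started with.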

Version-Release number of selected component (if applicable):
tripleo-ansible-3.3.1-0.20220407091528.0bc2994.el9ost.noarch

How reproducible:
Always

Comment 9 errata-xmlrpc 2022-09-21 12:21:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543

