Description of problem:

Hi team,

Seems we're hitting an issue when enabling the Ceph dashboard in an OpenStack cluster. It works fine on an initial OSP deployment, but fails on a subsequent deploy (e.g. we decided to enable the Ceph dashboard on an already running cluster → redeployed the OSP cluster including the 'ceph-dashboard.yaml' env file). In this case ceph-ansible fails at the following step:

~~~
2020-06-23 15:45:13,008 p=979186 u=root | TASK [ceph-dashboard : enable mgr dashboard module (restart)] ******************
2020-06-23 15:45:13,008 p=979186 u=root | Tuesday 23 June 2020 15:45:13 +0200 (0:00:01.596) 0:18:39.686 **********
2020-06-23 15:45:14,729 p=979186 u=root | ok: [control0001-cdm -> 10.240.16.12] => changed=false
  cmd:
  - podman
  - exec
  - ceph-mon-control0001-cdm
  - ceph
  - --cluster
  - ceph
  - mgr
  - module
  - enable
  - dashboard
  delta: '0:00:01.393589'
  end: '2020-06-23 15:45:14.702988'
  rc: 0
  start: '2020-06-23 15:45:13.309399'
  stderr: ''
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
2020-06-23 15:45:14,992 p=979186 u=root | TASK [ceph-dashboard : set or update dashboard admin username and password] ****
2020-06-23 15:45:14,993 p=979186 u=root | Tuesday 23 June 2020 15:45:14 +0200 (0:00:01.984) 0:18:41.671 **********
2020-06-23 15:45:27,262 p=979186 u=root | FAILED - RETRYING: set or update dashboard admin username and password (6 retries left).
2020-06-23 15:45:34,116 p=979186 u=root | FAILED - RETRYING: set or update dashboard admin username and password (5 retries left).
2020-06-23 15:45:41,072 p=979186 u=root | FAILED - RETRYING: set or update dashboard admin username and password (4 retries left).
2020-06-23 15:45:48,076 p=979186 u=root | FAILED - RETRYING: set or update dashboard admin username and password (3 retries left).
2020-06-23 15:45:54,963 p=979186 u=root | FAILED - RETRYING: set or update dashboard admin username and password (2 retries left).
2020-06-23 15:46:01,902 p=979186 u=root | FAILED - RETRYING: set or update dashboard admin username and password (1 retries left).
2020-06-23 15:46:08,783 p=979186 u=root | fatal: [control0001-cdm -> 10.240.16.12]: FAILED! => changed=false
  attempts: 6
  cmd: |-
    if podman exec ceph-mon-control0001-cdm ceph --cluster ceph dashboard ac-user-show admin; then
      podman exec ceph-mon-control0001-cdm ceph --cluster ceph dashboard ac-user-set-password admin foobar
    else
      podman exec ceph-mon-control0001-cdm ceph --cluster ceph dashboard ac-user-create admin foobar read-only
    fi
  delta: '0:00:01.613421'
  end: '2020-06-23 15:46:08.750524'
  msg: non-zero return code
  rc: 5
  start: '2020-06-23 15:46:07.137103'
  stderr: |-
    Error EIO: Module 'dashboard' has experienced an error and cannot handle commands: OSError("No socket could be created -- (('::', 8444, 0, 0): [Errno 98] Address already in use)",)
    Error: non zero exit code: 5: OCI runtime error
    Error EIO: Module 'dashboard' has experienced an error and cannot handle commands: OSError("No socket could be created -- (('::', 8444, 0, 0): [Errno 98] Address already in use)",)
    Error: non zero exit code: 5: OCI runtime error
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
~~~

Checking '/usr/share/ceph-ansible/roles/ceph-dashboard/tasks/configure_dashboard.yml', we see this task at lines 82-95, preceded by the dashboard module restart:

~~~
 57 - name: "set the dashboard port ({{ dashboard_port }})"
 58   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} config set mgr mgr/dashboard/server_port {{ dashboard_port }}"
 59   changed_when: false
 60   delegate_to: "{{ groups[mon_group_name][0] }}"
 61   run_once: true
 62
 63 - name: "set the dashboard SSL port ({{ dashboard_port }})"
 64   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} config set mgr mgr/dashboard/ssl_server_port {{ dashboard_port }}"
 65   delegate_to: "{{ groups[mon_group_name][0] }}"
 66   run_once: true
 67   changed_when: false
 68   failed_when: false # Do not fail if
the option does not exist, it only exists post-14.2.0
 69
 70 - name: disable mgr dashboard module (restart)
 71   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} mgr module disable dashboard"
 72   delegate_to: "{{ groups[mon_group_name][0] }}"
 73   run_once: true
 74   changed_when: false
 75
 76 - name: enable mgr dashboard module (restart)
 77   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} mgr module enable dashboard"
 78   delegate_to: "{{ groups[mon_group_name][0] }}"
 79   run_once: true
 80   changed_when: false
 81
 82 - name: set or update dashboard admin username and password
 83   shell: |
 84     if {{ container_exec_cmd }} ceph --cluster {{ cluster }} dashboard ac-user-show {{ dashboard_admin_user | quote }}; then
 85       {{ container_exec_cmd }} ceph --cluster {{ cluster }} dashboard ac-user-set-password {{ dashboard_admin_user | quote }} {{ dashboard_admin_password | quote }}
 86     else
 87       {{ container_exec_cmd }} ceph --cluster {{ cluster }} dashboard ac-user-create {{ dashboard_admin_user | quote }} {{ dashboard_admin_password | quote }} {{ 'read-only' if dashboard_admin_user_ro | bool else 'administrator' }}
 88     fi
 89   retries: 6
 90   delay: 5
 91   register: ac_result
 92   delegate_to: "{{ groups[mon_group_name][0] }}"
 93   run_once: true
 94   changed_when: false
 95   until: ac_result.rc == 0
~~~

The error basically shows that the dashboard module is trying to bind to all addresses on the controller on port 8444 (if I read it correctly), and that port is already in use by HAProxy in a running cluster:

    (('::', 8444, 0, 0): [Errno 98] Address already in use)

The same is confirmed by the 'ceph status' output:

~~~
  cluster:
    id:     949aa89f-c155-4baf-8ffc-5c633914e9f4
    health: HEALTH_ERR
            Module 'dashboard' has failed: OSError("No socket could be created -- (('::', 8444, 0, 0): [Errno 98] Address already in use)",)
...
~~~

Here is what we assume is happening:

- Initial OSP deploy (ok): HAProxy is not yet running when the Ceph dashboard is being provisioned, so the listen-address conflict does not happen; by the time Pacemaker brings up HAProxy, the dashboard backends have already been moved to their proper listen addresses.
- Subsequent deployments (not ok): HAProxy is already running with a listener on port 8444, so dashboard provisioning fails, because the module is not yet configured with a specific bind address when it is first started up.

Checking '/usr/share/ceph-ansible/roles/ceph-dashboard/tasks/configure_dashboard.yml' again, we see that the correct bind addresses are set much later, after the attempt to start the dashboard module, when the dashboard backends are configured (lines 133-136):

~~~
133 - include_tasks: configure_dashboard_backends.yml
134   with_items: '{{ groups[mgr_group_name] | default(groups[mon_group_name]) }}'
135   vars:
136     dashboard_backend: '{{ item }}'
~~~

Contents of 'configure_dashboard_backends.yml':

~~~
---
- name: get current mgr backend - ipv4
  set_fact:
    mgr_server_addr: "{{ hostvars[dashboard_backend]['ansible_all_ipv4_addresses'] | ips_in_ranges(public_network.split(',')) | first }}"
  when: ip_version == 'ipv4'

- name: get current mgr backend - ipv6
  set_fact:
    mgr_server_addr: "{{ hostvars[dashboard_backend]['ansible_all_ipv6_addresses'] | ips_in_ranges(public_network.split(',')) | last }}"
  when: ip_version == 'ipv6'

- name: config the current dashboard backend
  command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} config set mgr mgr/dashboard/{{ hostvars[dashboard_backend]['ansible_hostname'] }}/server_addr {{ mgr_server_addr }}"
  delegate_to: "{{ groups[mon_group_name][0] }}"
  changed_when: false
  run_once: true
~~~

Probably it makes sense to run 'configure_dashboard_backends.yml' *before* starting the dashboard for the first time (lines 70-80)?
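The per-host `server_addr` settings that 'configure_dashboard_backends.yml' applies can also be generated by hand. A minimal dry-run sketch, assuming the same `mgr/dashboard/<host>/server_addr` config keys used by the role; the hostname/IP pairs are hypothetical placeholders, and the script only prints the commands rather than executing them:

```shell
#!/bin/sh
# Dry-run sketch: print the per-host dashboard bind-address commands that
# configure_dashboard_backends.yml would issue from the first mon node.
# The hostname/IP pairs below are hypothetical placeholders -- substitute
# values from your own inventory, then run each printed command.
set -eu
while read -r host addr; do
  echo "podman exec ceph-mon-${host} ceph --cluster ceph config set mgr mgr/dashboard/${host}/server_addr ${addr}"
done <<'EOF'
control0001-cdm 10.240.16.12
control0002-cdm 10.240.16.13
EOF
```

Printing first and executing only after review keeps a typo from pointing a dashboard backend at the wrong interface.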
The ports which are configured prior to starting the module (lines 57-68) are already set up correctly:

~~~
 57 - name: "set the dashboard port ({{ dashboard_port }})"
 58   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} config set mgr mgr/dashboard/server_port {{ dashboard_port }}"
 59   changed_when: false
 60   delegate_to: "{{ groups[mon_group_name][0] }}"
 61   run_once: true
 62
 63 - name: "set the dashboard SSL port ({{ dashboard_port }})"
 64   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} config set mgr mgr/dashboard/ssl_server_port {{ dashboard_port }}"
 65   delegate_to: "{{ groups[mon_group_name][0] }}"
 66   run_once: true
 67   changed_when: false
 68   failed_when: false # Do not fail if the option does not exist, it only exists post-14.2.0
 69
 70 - name: disable mgr dashboard module (restart)
 71   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} mgr module disable dashboard"
 72   delegate_to: "{{ groups[mon_group_name][0] }}"
 73   run_once: true
 74   changed_when: false
 75
 76 - name: enable mgr dashboard module (restart)
 77   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} mgr module enable dashboard"
 78   delegate_to: "{{ groups[mon_group_name][0] }}"
 79   run_once: true
 80   changed_when: false
~~~

Actually, performing these steps manually before re-running the overcloud deployment allows it to go through without failing, so this can be considered a workaround:

~~~
1. On each controller:
   podman exec ceph-mon-$(hostname -f) ceph --cluster ceph config set mgr mgr/dashboard/<controller-shortname>/server_addr <ip-address-of-the-controller-on-the-storage-network>
2. Restart the dashboard module
3. Redeploy OSP
~~~

Version-Release number of selected component (if applicable):
ceph-ansible-4.0.23-1.el8cp.noarch

Thanks in advance for having a look into this.

Sergii
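Before applying the workaround, it may be worth confirming on a controller that something (HAProxy in this case) already holds the dashboard port, since `[Errno 98] Address already in use` is exactly that condition. A minimal check, assuming a Linux host with iproute2's `ss` available:

```shell
#!/bin/sh
# Report whether any listener is already bound to the dashboard port
# (8444 by default in this deployment; pass another port as $1 to override).
PORT="${1:-8444}"
if ss -ltn 2>/dev/null | grep -q ":${PORT} "; then
  echo "port ${PORT} is already in use"
else
  echo "port ${PORT} appears free"
fi
```

If the port shows as in use on a running cluster, the mgr dashboard module will fail to bind until the per-host `server_addr` values are set.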
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 4.1 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4144