Description of problem:

Hi team,

Seems we're hitting an issue when enabling the Ceph dashboard in an OpenStack cluster. It works fine on an initial OSP deployment, but fails on a subsequent deploy (e.g. we decided to enable the Ceph dashboard on an already running cluster → redeployed the OSP cluster including the 'ceph-dashboard.yaml' env file). In this case ceph-ansible fails at the following step:

~~~
2020-06-23 15:45:13,008 p=979186 u=root | TASK [ceph-dashboard : enable mgr dashboard module (restart)] ******************
2020-06-23 15:45:13,008 p=979186 u=root | Tuesday 23 June 2020 15:45:13 +0200 (0:00:01.596) 0:18:39.686 **********
2020-06-23 15:45:14,729 p=979186 u=root | ok: [control0001-cdm -> 10.240.16.12] => changed=false
  cmd:
  - podman
  - exec
  - ceph-mon-control0001-cdm
  - ceph
  - --cluster
  - ceph
  - mgr
  - module
  - enable
  - dashboard
  delta: '0:00:01.393589'
  end: '2020-06-23 15:45:14.702988'
  rc: 0
  start: '2020-06-23 15:45:13.309399'
  stderr: ''
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
2020-06-23 15:45:14,992 p=979186 u=root | TASK [ceph-dashboard : set or update dashboard admin username and password] ****
2020-06-23 15:45:14,993 p=979186 u=root | Tuesday 23 June 2020 15:45:14 +0200 (0:00:01.984) 0:18:41.671 **********
2020-06-23 15:45:27,262 p=979186 u=root | FAILED - RETRYING: set or update dashboard admin username and password (6 retries left).
2020-06-23 15:45:34,116 p=979186 u=root | FAILED - RETRYING: set or update dashboard admin username and password (5 retries left).
2020-06-23 15:45:41,072 p=979186 u=root | FAILED - RETRYING: set or update dashboard admin username and password (4 retries left).
2020-06-23 15:45:48,076 p=979186 u=root | FAILED - RETRYING: set or update dashboard admin username and password (3 retries left).
2020-06-23 15:45:54,963 p=979186 u=root | FAILED - RETRYING: set or update dashboard admin username and password (2 retries left).
2020-06-23 15:46:01,902 p=979186 u=root | FAILED - RETRYING: set or update dashboard admin username and password (1 retries left).
2020-06-23 15:46:08,783 p=979186 u=root | fatal: [control0001-cdm -> 10.240.16.12]: FAILED! => changed=false
  attempts: 6
  cmd: |-
    if podman exec ceph-mon-control0001-cdm ceph --cluster ceph dashboard ac-user-show admin; then
      podman exec ceph-mon-control0001-cdm ceph --cluster ceph dashboard ac-user-set-password admin foobar
    else
      podman exec ceph-mon-control0001-cdm ceph --cluster ceph dashboard ac-user-create admin foobar read-only
    fi
  delta: '0:00:01.613421'
  end: '2020-06-23 15:46:08.750524'
  msg: non-zero return code
  rc: 5
  start: '2020-06-23 15:46:07.137103'
  stderr: |-
    Error EIO: Module 'dashboard' has experienced an error and cannot handle commands: OSError("No socket could be created -- (('::', 8444, 0, 0): [Errno 98] Address already in use)",)
    Error: non zero exit code: 5: OCI runtime error
    Error EIO: Module 'dashboard' has experienced an error and cannot handle commands: OSError("No socket could be created -- (('::', 8444, 0, 0): [Errno 98] Address already in use)",)
    Error: non zero exit code: 5: OCI runtime error
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
~~~

Checking '/usr/share/ceph-ansible/roles/ceph-dashboard/tasks/configure_dashboard.yml', we see this task at lines 82-95, preceded by the dashboard module restart:

~~~
 57 - name: "set the dashboard port ({{ dashboard_port }})"
 58   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} config set mgr mgr/dashboard/server_port {{ dashboard_port }}"
 59   changed_when: false
 60   delegate_to: "{{ groups[mon_group_name][0] }}"
 61   run_once: true
 62
 63 - name: "set the dashboard SSL port ({{ dashboard_port }})"
 64   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} config set mgr mgr/dashboard/ssl_server_port {{ dashboard_port }}"
 65   delegate_to: "{{ groups[mon_group_name][0] }}"
 66   run_once: true
 67   changed_when: false
 68   failed_when: false # Do not fail if
the option does not exist, it only exists post-14.2.0
 69
 70 - name: disable mgr dashboard module (restart)
 71   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} mgr module disable dashboard"
 72   delegate_to: "{{ groups[mon_group_name][0] }}"
 73   run_once: true
 74   changed_when: false
 75
 76 - name: enable mgr dashboard module (restart)
 77   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} mgr module enable dashboard"
 78   delegate_to: "{{ groups[mon_group_name][0] }}"
 79   run_once: true
 80   changed_when: false
 81
 82 - name: set or update dashboard admin username and password
 83   shell: |
 84     if {{ container_exec_cmd }} ceph --cluster {{ cluster }} dashboard ac-user-show {{ dashboard_admin_user | quote }}; then
 85       {{ container_exec_cmd }} ceph --cluster {{ cluster }} dashboard ac-user-set-password {{ dashboard_admin_user | quote }} {{ dashboard_admin_password | quote }}
 86     else
 87       {{ container_exec_cmd }} ceph --cluster {{ cluster }} dashboard ac-user-create {{ dashboard_admin_user | quote }} {{ dashboard_admin_password | quote }} {{ 'read-only' if dashboard_admin_user_ro | bool else 'administrator' }}
 88     fi
 89   retries: 6
 90   delay: 5
 91   register: ac_result
 92   delegate_to: "{{ groups[mon_group_name][0] }}"
 93   run_once: true
 94   changed_when: false
 95   until: ac_result.rc == 0
~~~

The error basically shows that the dashboard module is trying to bind to all addresses on the controller on port 8444 (if I read it correctly), and that port is already in use by HAProxy in a running cluster:

    (('::', 8444, 0, 0): [Errno 98] Address already in use)

The same is confirmed by the 'ceph status' output:

~~~
  cluster:
    id:     949aa89f-c155-4baf-8ffc-5c633914e9f4
    health: HEALTH_ERR
            Module 'dashboard' has failed: OSError("No socket could be created -- (('::', 8444, 0, 0): [Errno 98] Address already in use)",)
...
~~~

Here is what we assume is happening:

- Initial OSP deploy (ok): HAProxy is not yet running when the Ceph dashboard is being provisioned, so the listen-address conflict does not happen; by the time Pacemaker brings up HAProxy, the dashboard backends have already been moved to their proper listen addresses.
- Subsequent deployments (not ok): HAProxy is already running with a listener on port 8444, so dashboard provisioning fails, because the module is not yet configured with a specific bind address when it is first started up.

Checking '/usr/share/ceph-ansible/roles/ceph-dashboard/tasks/configure_dashboard.yml' again, we see that the correct bind addresses are set much later, after the attempt to start the dashboard module, when the dashboard backends are configured (lines 133-136):

~~~
133 - include_tasks: configure_dashboard_backends.yml
134   with_items: '{{ groups[mgr_group_name] | default(groups[mon_group_name]) }}'
135   vars:
136     dashboard_backend: '{{ item }}'
~~~

Contents of 'configure_dashboard_backends.yml':

~~~
---
- name: get current mgr backend - ipv4
  set_fact:
    mgr_server_addr: "{{ hostvars[dashboard_backend]['ansible_all_ipv4_addresses'] | ips_in_ranges(public_network.split(',')) | first }}"
  when: ip_version == 'ipv4'

- name: get current mgr backend - ipv6
  set_fact:
    mgr_server_addr: "{{ hostvars[dashboard_backend]['ansible_all_ipv6_addresses'] | ips_in_ranges(public_network.split(',')) | last }}"
  when: ip_version == 'ipv6'

- name: config the current dashboard backend
  command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} config set mgr mgr/dashboard/{{ hostvars[dashboard_backend]['ansible_hostname'] }}/server_addr {{ mgr_server_addr }}"
  delegate_to: "{{ groups[mon_group_name][0] }}"
  changed_when: false
  run_once: true
~~~

Probably it makes sense to run 'configure_dashboard_backends.yml' *before* starting the dashboard for the first time (lines 70-80)?
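The per-host `server_addr` settings that 'configure_dashboard_backends.yml' applies can also be generated by hand. A minimal dry-run sketch, assuming the same `mgr/dashboard/<host>/server_addr` config keys used by the role; the hostname/IP pairs are hypothetical placeholders, and the script only prints the commands rather than executing them:

```shell
#!/bin/sh
# Dry-run sketch: print the per-host dashboard bind-address commands that
# configure_dashboard_backends.yml would issue from the first mon node.
# The hostname/IP pairs below are hypothetical placeholders -- substitute
# values from your own inventory, then run each printed command.
set -eu
while read -r host addr; do
  echo "podman exec ceph-mon-${host} ceph --cluster ceph config set mgr mgr/dashboard/${host}/server_addr ${addr}"
done <<'EOF'
control0001-cdm 10.240.16.12
control0002-cdm 10.240.16.13
EOF
```

Printing first and executing only after review keeps a typo from pointing a dashboard backend at the wrong interface.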
The ports which are configured prior to starting the module (lines 57-68) are already set up correctly:

~~~
 57 - name: "set the dashboard port ({{ dashboard_port }})"
 58   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} config set mgr mgr/dashboard/server_port {{ dashboard_port }}"
 59   changed_when: false
 60   delegate_to: "{{ groups[mon_group_name][0] }}"
 61   run_once: true
 62
 63 - name: "set the dashboard SSL port ({{ dashboard_port }})"
 64   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} config set mgr mgr/dashboard/ssl_server_port {{ dashboard_port }}"
 65   delegate_to: "{{ groups[mon_group_name][0] }}"
 66   run_once: true
 67   changed_when: false
 68   failed_when: false # Do not fail if the option does not exist, it only exists post-14.2.0
 69
 70 - name: disable mgr dashboard module (restart)
 71   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} mgr module disable dashboard"
 72   delegate_to: "{{ groups[mon_group_name][0] }}"
 73   run_once: true
 74   changed_when: false
 75
 76 - name: enable mgr dashboard module (restart)
 77   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} mgr module enable dashboard"
 78   delegate_to: "{{ groups[mon_group_name][0] }}"
 79   run_once: true
 80   changed_when: false
~~~

Actually, performing these steps manually before re-running the overcloud deployment allows it to go through without failing, so this can be considered a workaround:

~~~
1. On each controller:
   podman exec ceph-mon-$(hostname -f) ceph --cluster ceph config set mgr mgr/dashboard/<controller-shortname>/server_addr <ip-address-of-the-controller-on-the-storage-network>
2. Restart the dashboard module
3. Redeploy OSP
~~~

Version-Release number of selected component (if applicable):
ceph-ansible-4.0.23-1.el8cp.noarch

Thanks in advance for having a look into this.

Sergii
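Before applying the workaround, it may be worth confirming on a controller that something (HAProxy in this case) already holds the dashboard port, since `[Errno 98] Address already in use` is exactly that condition. A minimal check, assuming a Linux host with iproute2's `ss` available:

```shell
#!/bin/sh
# Report whether any listener is already bound to the dashboard port
# (8444 by default in this deployment; pass another port as $1 to override).
PORT="${1:-8444}"
if ss -ltn 2>/dev/null | grep -q ":${PORT} "; then
  echo "port ${PORT} is already in use"
else
  echo "port ${PORT} appears free"
fi
```

If the port shows as in use on a running cluster, the mgr dashboard module will fail to bind until the per-host `server_addr` values are set.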
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 4.1 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4144