Bug 1851455

Summary: Enabling Ceph dashboard fails on running OpenStack cluster
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Sergii Mykhailushko <smykhail>
Component: Ceph-Ansible    Assignee: Dimitri Savineau <dsavinea>
Status: CLOSED ERRATA QA Contact: Vasishta <vashastr>
Severity: medium Docs Contact: Aron Gunn <agunn>
Priority: medium    
Version: 4.0    CC: agunn, aschoen, ceph-eng-bugs, dsavinea, fpantano, gabrioux, gmeno, nthomas, pasik, tserlin, vereddy, ykaul
Target Milestone: z2   
Target Release: 4.1   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ceph-ansible-4.0.29-1.el8cp, ceph-ansible-4.0.29-1.el7cp Doc Type: Bug Fix
Doc Text:
.Enabling the Ceph Dashboard fails on an existing OpenStack environment
In an existing OpenStack environment, configuring the Ceph Dashboard's IP address and port after the Ceph Manager dashboard module had been enabled caused a conflict with the HAProxy configuration. To avoid this conflict, configure the Ceph Dashboard's IP address and port before enabling the Ceph Manager dashboard module.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-09-30 17:26:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1816167    

Description Sergii Mykhailushko 2020-06-26 14:56:31 UTC
Description of problem:

Hi team,

Seems we're hitting an issue when enabling the Ceph dashboard in an OpenStack cluster. It works fine on an initial OSP deployment, but fails on a subsequent deploy (e.g. we decided to enable the Ceph dashboard on an already running cluster → redeploy the OSP cluster including the 'ceph-dashboard.yaml' env file). In this case ceph-ansible fails at the following step:

~~~
2020-06-23 15:45:13,008 p=979186 u=root |  TASK [ceph-dashboard : enable mgr dashboard module (restart)] ******************
2020-06-23 15:45:13,008 p=979186 u=root |  Tuesday 23 June 2020  15:45:13 +0200 (0:00:01.596)       0:18:39.686 ********** 
2020-06-23 15:45:14,729 p=979186 u=root |  ok: [control0001-cdm -> 10.240.16.12] => changed=false 
  cmd:
  - podman
  - exec
  - ceph-mon-control0001-cdm
  - ceph
  - --cluster
  - ceph
  - mgr
  - module
  - enable
  - dashboard
  delta: '0:00:01.393589'
  end: '2020-06-23 15:45:14.702988'
  rc: 0
  start: '2020-06-23 15:45:13.309399'
  stderr: ''
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>

2020-06-23 15:45:14,992 p=979186 u=root |  TASK [ceph-dashboard : set or update dashboard admin username and password] ****
2020-06-23 15:45:14,993 p=979186 u=root |  Tuesday 23 June 2020  15:45:14 +0200 (0:00:01.984)       0:18:41.671 ********** 
2020-06-23 15:45:27,262 p=979186 u=root |  FAILED - RETRYING: set or update dashboard admin username and password (6 retries left).
2020-06-23 15:45:34,116 p=979186 u=root |  FAILED - RETRYING: set or update dashboard admin username and password (5 retries left).
2020-06-23 15:45:41,072 p=979186 u=root |  FAILED - RETRYING: set or update dashboard admin username and password (4 retries left).
2020-06-23 15:45:48,076 p=979186 u=root |  FAILED - RETRYING: set or update dashboard admin username and password (3 retries left).
2020-06-23 15:45:54,963 p=979186 u=root |  FAILED - RETRYING: set or update dashboard admin username and password (2 retries left).
2020-06-23 15:46:01,902 p=979186 u=root |  FAILED - RETRYING: set or update dashboard admin username and password (1 retries left).
2020-06-23 15:46:08,783 p=979186 u=root |  fatal: [control0001-cdm -> 10.240.16.12]: FAILED! => changed=false 
  attempts: 6
  cmd: |-
    if podman exec ceph-mon-control0001-cdm ceph --cluster ceph dashboard ac-user-show admin; then
      podman exec ceph-mon-control0001-cdm ceph --cluster ceph dashboard ac-user-set-password admin foobar
    else
      podman exec ceph-mon-control0001-cdm ceph --cluster ceph dashboard ac-user-create admin foobar read-only
    fi
  delta: '0:00:01.613421'
  end: '2020-06-23 15:46:08.750524'
  msg: non-zero return code
  rc: 5
  start: '2020-06-23 15:46:07.137103'
  stderr: |-
    Error EIO: Module 'dashboard' has experienced an error and cannot handle commands: OSError("No socket could be created -- (('::', 8444, 0, 0): [Errno 98] Address already in use)",)
    Error: non zero exit code: 5: OCI runtime error
    Error EIO: Module 'dashboard' has experienced an error and cannot handle commands: OSError("No socket could be created -- (('::', 8444, 0, 0): [Errno 98] Address already in use)",)
    Error: non zero exit code: 5: OCI runtime error
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>
~~~
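
For reference, the failure is triggered by a redeploy along these lines; the template list is abbreviated and the environment file paths are the usual tripleo-heat-templates locations, so treat the command as illustrative rather than exact:

~~~
# Re-run the overcloud deploy, this time including the Ceph dashboard environment file
# (template list abbreviated; paths may differ per environment)
openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-dashboard.yaml
~~~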

Checking '/usr/share/ceph-ansible/roles/ceph-dashboard/tasks/configure_dashboard.yml', we see this step at lines 82-95, preceded by the dashboard module restart:

~~~
57 - name: "set the dashboard port ({{ dashboard_port }})"
58   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} config set mgr mgr/dashboard/server_port {{ dashboard_port }}"
59   changed_when: false
60   delegate_to: "{{ groups[mon_group_name][0] }}"
61   run_once: true
62 
63 - name: "set the dashboard SSL port ({{ dashboard_port }})"
64   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} config set mgr mgr/dashboard/ssl_server_port {{ dashboard_port }}"
65   delegate_to: "{{ groups[mon_group_name][0] }}"
66   run_once: true
67   changed_when: false
68   failed_when: false # Do not fail if the option does not exist, it only exists post-14.2.0
69 
70 - name: disable mgr dashboard module (restart)
71   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} mgr module disable dashboard"
72   delegate_to: "{{ groups[mon_group_name][0] }}"
73   run_once: true
74   changed_when: false
75 
76 - name: enable mgr dashboard module (restart)
77   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} mgr module enable dashboard"
78   delegate_to: "{{ groups[mon_group_name][0] }}"
79   run_once: true
80   changed_when: false
81
82 - name: set or update dashboard admin username and password
83   shell: |
84     if {{ container_exec_cmd }} ceph --cluster {{ cluster }} dashboard ac-user-show {{ dashboard_admin_user | quote }}; then
85       {{ container_exec_cmd }} ceph --cluster {{ cluster }} dashboard ac-user-set-password {{ dashboard_admin_user | quote }} {{ dashboard_admin_password | quote }}
86     else
87       {{ container_exec_cmd }} ceph --cluster {{ cluster }} dashboard ac-user-create {{ dashboard_admin_user | quote }} {{ dashboard_admin_password | quote }} {{ 'read-only' if dashboard_admin_user_ro | bool else 'administrator' }}
88     fi
89   retries: 6
90   delay: 5
91   register: ac_result
92   delegate_to: "{{ groups[mon_group_name][0] }}"
93   run_once: true
94   changed_when: false
95   until: ac_result.rc == 0
~~~

The error basically shows that the dashboard module is trying to bind to all addresses on the controller on port 8444 (if I read it correctly), and that port is already in use by HAProxy in a running cluster:

(('::', 8444, 0, 0): [Errno 98] Address already in use)",)
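
A quick way to confirm who already holds the port on the controller (assuming 'ss' is available on the host and HAProxy uses host networking, as is usual for OSP controllers) is something like:

~~~
# Show what is already listening on the dashboard port (8444) on this controller
ss -tlnp | grep ':8444'
~~~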

The same is confirmed by the 'ceph status' output:

~~~
  cluster:
    id:     949aa89f-c155-4baf-8ffc-5c633914e9f4
    health: HEALTH_ERR
            Module 'dashboard' has failed: OSError("No socket could be created -- (('::', 8444, 0, 0): [Errno 98] Address already in use)",)
...
~~~

Here is what we assume is happening:

Initial OSP deploy (ok)
- HAProxy is not yet running when the Ceph dashboard is being provisioned, so the listen address conflict does not occur: by the time Pacemaker brings up HAProxy, the dashboard backends have already been moved to their proper listen addresses.


Subsequent deployments (not ok)
- HAProxy is already running with a listener on port 8444, so dashboard provisioning fails because the module is not yet configured with a specific bind address when it is first started.

Checking '/usr/share/ceph-ansible/roles/ceph-dashboard/tasks/configure_dashboard.yml' again, we see that the correct bind addresses are only set later, after the attempt to start the dashboard module, when the dashboard backends are configured (lines 133-136):

~~~
133 - include_tasks: configure_dashboard_backends.yml
134   with_items: '{{ groups[mgr_group_name] | default(groups[mon_group_name]) }}'
135   vars:
136     dashboard_backend: '{{ item }}'
~~~

Contents of 'configure_dashboard_backends.yml':
~~~
---
- name: get current mgr backend - ipv4
  set_fact:
    mgr_server_addr: "{{ hostvars[dashboard_backend]['ansible_all_ipv4_addresses'] | ips_in_ranges(public_network.split(',')) | first }}"
  when: ip_version == 'ipv4'

- name: get current mgr backend - ipv6
  set_fact:
    mgr_server_addr: "{{ hostvars[dashboard_backend]['ansible_all_ipv6_addresses'] | ips_in_ranges(public_network.split(',')) | last }}"
  when: ip_version == 'ipv6'

- name: config the current dashboard backend
  command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} config set mgr mgr/dashboard/{{ hostvars[dashboard_backend]['ansible_hostname'] }}/server_addr {{ mgr_server_addr }}"
  delegate_to: "{{ groups[mon_group_name][0] }}"
  changed_when: false
  run_once: true
~~~
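
For what it's worth, once these backend tasks have run, the per-host bind addresses should be visible in the Ceph config store; a rough way to check (the container name pattern matches the workaround below, and the output is environment-specific):

~~~
# List the dashboard-related options that the playbook pushed into the config store
podman exec ceph-mon-$(hostname -f) ceph --cluster ceph config dump | grep 'mgr/dashboard'
~~~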

It probably makes sense to run 'configure_dashboard_backends.yml' *before* starting the dashboard for the first time (lines 70-80), since the ports that are configured prior to starting the module (lines 57-68) are already set up correctly (a rough sketch of the reordering follows the excerpt below):

~~~
57 - name: "set the dashboard port ({{ dashboard_port }})"
58   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} config set mgr mgr/dashboard/server_port {{ dashboard_port }}"
59   changed_when: false
60   delegate_to: "{{ groups[mon_group_name][0] }}"
61   run_once: true
62 
63 - name: "set the dashboard SSL port ({{ dashboard_port }})"
64   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} config set mgr mgr/dashboard/ssl_server_port {{ dashboard_port }}"
65   delegate_to: "{{ groups[mon_group_name][0] }}"
66   run_once: true
67   changed_when: false
68   failed_when: false # Do not fail if the option does not exist, it only exists post-14.2.0
69 
70 - name: disable mgr dashboard module (restart)
71   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} mgr module disable dashboard"
72   delegate_to: "{{ groups[mon_group_name][0] }}"
73   run_once: true
74   changed_when: false
75 
76 - name: enable mgr dashboard module (restart)
77   command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} mgr module enable dashboard"
78   delegate_to: "{{ groups[mon_group_name][0] }}"
79   run_once: true
80   changed_when: false
~~~
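
A rough sketch of the reordering we have in mind, reusing the existing tasks from 'configure_dashboard.yml' (exact placement and details would obviously need review):

~~~
# Sketch only: configure the per-host bind addresses *before* the module restart
- include_tasks: configure_dashboard_backends.yml
  with_items: '{{ groups[mgr_group_name] | default(groups[mon_group_name]) }}'
  vars:
    dashboard_backend: '{{ item }}'

- name: disable mgr dashboard module (restart)
  command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} mgr module disable dashboard"
  delegate_to: "{{ groups[mon_group_name][0] }}"
  run_once: true
  changed_when: false

- name: enable mgr dashboard module (restart)
  command: "{{ container_exec_cmd }} ceph --cluster {{ cluster }} mgr module enable dashboard"
  delegate_to: "{{ groups[mon_group_name][0] }}"
  run_once: true
  changed_when: false
~~~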

Performing the equivalent backend-address configuration manually before re-running the overcloud deployment allows it to complete without failing, so this can be considered a workaround:

~~~
1. On each controller: "podman exec ceph-mon-$(hostname -f) ceph --cluster ceph config set mgr mgr/dashboard/<controller-shortname>/server_addr <ip-address-of-the-controller-on-the-storage-network>"
2. Restart the dashboard module
3. Redeploy OSP
~~~
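
Spelled out a bit more, with the hostname and IP taken from the log above purely as an illustration (the disable/enable commands mirror the ones in 'configure_dashboard.yml'):

~~~
# 1. On each controller, pin the dashboard backend to that controller's storage-network IP
podman exec ceph-mon-$(hostname -f) ceph --cluster ceph \
  config set mgr mgr/dashboard/control0001-cdm/server_addr 10.240.16.12

# 2. Restart the dashboard module (from any one monitor)
podman exec ceph-mon-$(hostname -f) ceph --cluster ceph mgr module disable dashboard
podman exec ceph-mon-$(hostname -f) ceph --cluster ceph mgr module enable dashboard

# 3. Re-run the overcloud deployment with the 'ceph-dashboard.yaml' environment file included
~~~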


Version-Release number of selected component (if applicable):

ceph-ansible-4.0.23-1.el8cp.noarch


Thanks in advance for having a look into this.
Sergii

Comment 9 errata-xmlrpc 2020-09-30 17:26:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 4.1 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4144