Bug 1859872 - During upgrade moving from One RGW per node to Two RGW per node fails with ceph-ansible multisite configured
Summary: During upgrade moving from One RGW per node to Two RGW per node fails with ce...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Ceph-Ansible
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.2
Assignee: Guillaume Abrioux
QA Contact: Tejas
Docs Contact: Aron Gunn
URL:
Whiteboard:
Depends On:
Blocks: 1890121 1816167
 
Reported: 2020-07-23 08:38 UTC by Raul Mahiques
Modified: 2021-06-09 16:16 UTC (History)
CC List: 14 users

Fixed In Version: ceph-ansible-4.0.34-1.el8cp, ceph-ansible-4.0.34-1.el7cp
Doc Type: Bug Fix
Doc Text:
.Adding a new Ceph Object Gateway instance when upgrading fails
The `radosgw_frontend_port` option did not consider more than one Ceph Object Gateway instance, and configured port `8080` for all instances. With this release, the `radosgw_frontend_port` option is increased for each additional Ceph Object Gateway instance, allowing you to use more than one Ceph Object Gateway instance per node.
Clone Of:
Environment:
Last Closed: 2021-01-12 14:56:02 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-ansible pull 5613 0 None closed rgw: support 1+ rgw instance in `radosgw_frontend_port` 2021-01-26 15:22:04 UTC
Github ceph ceph-ansible pull 5804 0 None closed [skip ci] facts: fix 'set_fact rgw_instances with rgw multisite' 2021-01-26 15:22:03 UTC
Github ceph ceph-ansible pull 5806 0 None closed facts: fix 'set_fact rgw_instances with rgw multisite' (bp #5804) 2021-01-26 15:22:03 UTC
Github ceph ceph-ansible pull 5807 0 None closed facts: fix 'set_fact rgw_instances with rgw multisite' (bp #5804) 2021-01-26 15:22:03 UTC
Red Hat Product Errata RHSA-2021:0081 0 None None None 2021-01-12 14:56:35 UTC

Description Raul Mahiques 2020-07-23 08:38:36 UTC
Description of problem:

We are upgrading from RHCS 3.3 to RHCS 4.1; once on RHCS 4.1 we want to start running two RGWs per node.

We have multisite configured in ceph-ansible:

[root@xxx]# cat group_vars/all.yml | grep rgw_multisite
rgw_multisite: True

Because we have multiple realms, the configuration of each RGW node is specified in host_vars/nodeX.

To configure 2 RGW per node, we add the following info to the all.yml file:

$ cat group_vars/all.yml | grep radosgw_num
radosgw_num_instances: 2

We also have a per-node RGW entry in ceph_conf_overrides, for example:

  client.rgw.cepha.rgw0:
    host: cepha 
    keyring: /var/lib/ceph/radosgw/ceph-rgw.cepha.rgw0/keyring
    log file: /var/log/ceph/ceph-rgw-0-cepha.log
    log_to_file: true
    rgw frontends: beast endpoint=x.x.x.x:8080
    rgw_dynamic_resharding: false
    rgw_enable_apis: s3,admin
    rgw_zone: ieec1
    rgw_zonegroup: produccion
    rgw_realm: xxxx
  client.rgw.cepha.rgw1:
    host: cepha
    keyring: /var/lib/ceph/radosgw/ceph-rgw.cepha.rgw1/keyring
    log file: /var/log/ceph/ceph-rgw-1-cepha.log
    log_to_file: true
    rgw frontends: beast endpoint=x.x.x.x:8081
    rgw_dynamic_resharding: false
    rgw_enable_apis: s3,admin
    rgw_zone: ieec1
    rgw_zonegroup: produccion
    rgw_realm: xxxx

With this config we then run the site-container.yml playbook with the limit set to the rgws group:

#ansible-playbook -vv -i inventory site-container.yml --limit rgws

The play finishes OK, but no changes are made; we still have the same count of RGW services. This is because the task in the file roles/ceph-facts/tasks/set_radosgw_address.yml never matches the conditional '- rgw_instances is undefined' (rgw_instances is already defined), so the task is always skipped.

- name: set_fact rgw_instances with rgw multisite
  set_fact:
    rgw_instances: "{{ rgw_instances|default([]) | union([{'instance_name': 'rgw' + item | string, 'radosgw_address': _radosgw_address, 'radosgw_frontend_port': radosgw_frontend_port | int, 'rgw_realm': rgw_realm | string, 'rgw_zonegroup': rgw_zonegroup | string, 'rgw_zone': rgw_zone | string, 'system_access_key': system_access_key, 'system_secret_key': system_secret_key, 'rgw_zone_user': rgw_zone_user, 'rgw_zone_user_display_name': rgw_zone_user_display_name, 'endpoint': (rgw_pull_proto + '://' + rgw_pullhost + ':' + rgw_pull_port | string) if not rgw_zonemaster | bool and rgw_zonesecondary | bool else omit }]) }}"
  with_sequence: start=0 end={{ radosgw_num_instances|int - 1 }}
  when:
    - inventory_hostname in groups.get(rgw_group_name, [])
    - rgw_instances is undefined
    - rgw_multisite | bool

Ansible log of the task that gets skipped:

TASK [ceph-facts : set_fact rgw_instances with rgw multisite] *****************************************************************************************************************************************************************************************************************************************************************
task path: /root/danip/ceph-ansible-dc1/roles/ceph-facts/tasks/set_radosgw_address.yml:53
Monday 20 July 2020  07:16:00 -0400 (0:00:00.221)       0:12:57.310 *********** 
skipping: [cepha] => (item=0)  => changed=false 
  ansible_loop_var: item
  item: '0'
  skip_reason: Conditional result was False
skipping: [cepha] => (item=1)  => changed=false 
  ansible_loop_var: item
  item: '1'
  skip_reason: Conditional result was False
skipping: [cephb] => (item=0)  => changed=false 
  ansible_loop_var: item
  item: '0'
  skip_reason: Conditional result was False
skipping: [cephc] => (item=0)  => changed=false 
  ansible_loop_var: item
  item: '0'
  skip_reason: Conditional result was False
skipping: [cephb] => (item=1)  => changed=false 
  ansible_loop_var: item
  item: '1'
  skip_reason: Conditional result was False
skipping: [cephc] => (item=1)  => changed=false 
  ansible_loop_var: item
  item: '1'
  skip_reason: Conditional result was False
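A minimal Python sketch (not ceph-ansible code; names and semantics are assumptions that mirror the task above) of why the guard produces the skips shown in the log: once `rgw_instances` is already defined, e.g. via the per-node host_vars, the conditional '- rgw_instances is undefined' evaluates False for every loop item, so the fact is never extended.

```python
# Hypothetical simulation of Ansible's per-item `when` evaluation for
# 'set_fact rgw_instances with rgw multisite'. This is NOT ceph-ansible
# code; the function name and dict keys only mirror the task above.

def set_fact_rgw_instances(facts, radosgw_num_instances,
                           radosgw_frontend_port=8080,
                           guard_on_undefined=True):
    """Build the rgw_instances fact the way the task's loop would."""
    for item in range(radosgw_num_instances):  # with_sequence: 0..n-1
        # The problematic conditional: '- rgw_instances is undefined'
        if guard_on_undefined and "rgw_instances" in facts:
            continue  # skip_reason: Conditional result was False
        facts.setdefault("rgw_instances", []).append({
            "instance_name": f"rgw{item}",
            # Note: the same base port for every instance (the second bug).
            "radosgw_frontend_port": radosgw_frontend_port,
        })
    return facts

# rgw_instances is already defined (e.g. from host_vars), so every
# iteration is skipped and the second instance is never added:
facts = {"rgw_instances": [{"instance_name": "rgw0",
                            "radosgw_frontend_port": 8080}]}
set_fact_rgw_instances(facts, radosgw_num_instances=2)
print(len(facts["rgw_instances"]))  # still 1
```

Dropping the guard (the change tried next) lets both instances be added, but they still share one port.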



If we remove the conditional check '- rgw_instances is undefined' and run it like this:

- name: set_fact rgw_instances with rgw multisite
  set_fact:
    rgw_instances: "{{ rgw_instances|default([]) | union([{'instance_name': 'rgw' + item | string, 'radosgw_address': _radosgw_address, 'radosgw_frontend_port': radosgw_frontend_port | int, 'rgw_realm': rgw_realm | string, 'rgw_zonegroup': rgw_zonegroup | string, 'rgw_zone': rgw_zone | string, 'system_access_key': system_access_key, 'system_secret_key': system_secret_key, 'rgw_zone_user': rgw_zone_user, 'rgw_zone_user_display_name': rgw_zone_user_display_name, 'endpoint': (rgw_pull_proto + '://' + rgw_pullhost + ':' + rgw_pull_port | string) if not rgw_zonemaster | bool and rgw_zonesecondary | bool else omit }]) }}"
  with_sequence: start=0 end={{ radosgw_num_instances|int - 1 }}
  when:
    - inventory_hostname in groups.get(rgw_group_name, [])
    - rgw_multisite | bool

The number of RadosGW instances gets configured to two per node, but the systemd units fail to start because both RadosGW instances try to use the same port: the EnvironmentFile of each systemd unit sets the same port for every RGW instance.

[root@cepha ~]$ cat /var/lib/ceph/radosgw/ceph-rgw.cepha.rgw?/EnvironmentFile
INST_NAME=rgw0
INST_PORT=8080
INST_NAME=rgw1
INST_PORT=8080
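The failure mode behind the duplicated INST_PORT can be reproduced outside of Ceph: the second daemon's bind to an already-bound address:port fails with EADDRINUSE. A minimal sketch using plain Python sockets (not radosgw; a free port is used here instead of 8080 so the sketch runs anywhere):

```python
import errno
import socket

# First "instance" grabs a free port (stand-in for rgw0 on 8080).
rgw0 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
rgw0.bind(("127.0.0.1", 0))          # let the OS pick a free port
port = rgw0.getsockname()[1]
rgw0.listen(1)

# Second "instance" tries the same port (stand-in for rgw1, whose
# EnvironmentFile carries the same INST_PORT value).
rgw1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    rgw1.bind(("127.0.0.1", port))
    second_bind_ok = True
except OSError as e:
    second_bind_ok = False
    print("second bind failed:", errno.errorcode[e.errno])
finally:
    rgw1.close()
    rgw0.close()
```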


To work around this issue we had to modify the rgw_instances fact in the file /root/danip/ceph-ansible-dc1/roles/ceph-facts/tasks/set_radosgw_address.yml so that the port number is increased by one for each instance:

 'radosgw_frontend_port': radosgw_frontend_port | int + item|int,
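The Jinja expression above is just the base port plus the loop index. Sketched in Python (variable names mirror the playbook; the values are the ones from this report):

```python
radosgw_frontend_port = 8080   # base port from the config
radosgw_num_instances = 2

# 'radosgw_frontend_port | int + item | int' for item in 0..n-1
ports = [radosgw_frontend_port + item
         for item in range(radosgw_num_instances)]
print(ports)  # [8080, 8081] -> rgw0 on 8080, rgw1 on 8081
```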

This is the full task with the modifications:


##ORIGINAL
- name: set_fact rgw_instances with rgw multisite
  set_fact:
    rgw_instances: "{{ rgw_instances|default([]) | union([{'instance_name': 'rgw' + item | string, 'radosgw_address': _radosgw_address, 'radosgw_frontend_port': radosgw_frontend_port | int, 'rgw_realm': rgw_realm | string, 'rgw_zonegroup': rgw_zonegroup | string, 'rgw_zone': rgw_zone | string, 'system_access_key': system_access_key, 'system_secret_key': system_secret_key, 'rgw_zone_user': rgw_zone_user, 'rgw_zone_user_display_name': rgw_zone_user_display_name, 'endpoint': (rgw_pull_proto + '://' + rgw_pullhost + ':' + rgw_pull_port | string) if not rgw_zonemaster | bool and rgw_zonesecondary | bool else omit }]) }}"
  with_sequence: start=0 end={{ radosgw_num_instances|int - 1 }}
  when:
    - inventory_hostname in groups.get(rgw_group_name, [])
    - rgw_instances is undefined
    - rgw_multisite | bool

## MODIFIED TO WORK.
- name: set_fact rgw_instances with rgw multisite
  set_fact:
    rgw_instances: "{{ rgw_instances|default([]) | union([{'instance_name': 'rgw' + item | string, 'radosgw_address': _radosgw_address, 'radosgw_frontend_port': radosgw_frontend_port | int + item|int, 'rgw_realm': rgw_realm | string, 'rgw_zonegroup': rgw_zonegroup | string, 'rgw_zone': rgw_zone | string, 'system_access_key': system_access_key, 'system_secret_key': system_secret_key, 'rgw_zone_user': rgw_zone_user, 'rgw_zone_user_display_name': rgw_zone_user_display_name, 'endpoint': (rgw_pull_proto + '://' + rgw_pullhost + ':' + rgw_pull_port | string) if not rgw_zonemaster | bool and rgw_zonesecondary | bool else omit }]) }}"
  with_sequence: start=0 end={{ radosgw_num_instances|int - 1 }}
  when:
    - inventory_hostname in groups.get(rgw_group_name, [])
    - rgw_multisite | bool



Note: All credit to Daniel Parkes.

Comment 1 RHEL Program Management 2020-07-23 08:38:45 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 Raul Mahiques 2020-07-23 08:40:31 UTC
There is a pull request with a fix:
https://github.com/ceph/ceph-ansible/pull/5583

Comment 22 errata-xmlrpc 2021-01-12 14:56:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0081

