Bug 1859872

Summary: During upgrade, moving from one RGW per node to two RGWs per node fails with ceph-ansible multisite configured
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Raul Mahiques <rmahique>
Component: Ceph-Ansible
Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED ERRATA
QA Contact: Tejas <tchandra>
Severity: high
Docs Contact: Aron Gunn <agunn>
Priority: unspecified
Version: 4.1
CC: agunn, anharris, aschoen, ceph-eng-bugs, dsavinea, gabrioux, gmeno, knortema, mmuench, nthomas, tchandra, tserlin, vereddy, ykaul
Target Milestone: ---   
Target Release: 4.2   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: ceph-ansible-4.0.34-1.el8cp, ceph-ansible-4.0.34-1.el7cp
Doc Type: Bug Fix
Doc Text:
.Adding a new Ceph Object Gateway instance when upgrading fails
The `radosgw_frontend_port` option did not account for more than one Ceph Object Gateway instance and configured port `8080` for all instances. With this release, the `radosgw_frontend_port` value is increased for each additional Ceph Object Gateway instance, allowing you to use more than one Ceph Object Gateway instance per node.
Story Points: ---
Last Closed: 2021-01-12 14:56:02 UTC
Type: Bug
Bug Blocks: 1816167, 1890121    

Description Raul Mahiques 2020-07-23 08:38:36 UTC
Description of problem:

We are moving from RHCS 3.3 to RHCS 4.1; once on RHCS 4.1 we want to start running 2 RGWs per node.

We have multisite configured in ceph-ansible:

[root@xxx]# cat group_vars/all.yml | grep rgw_multisite
rgw_multisite: True

Because we have multiple realms, the configuration of each RGW node is specified in host_vars/nodeX, as sketched below.
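
For context, a minimal host_vars sketch might look like the following. This is illustrative only: the keys shown are the variables consumed by the set_fact task quoted further down, the zone/zonegroup/realm values are taken from our ceph_conf_overrides, and the user and key values are placeholders.

# host_vars/cepha -- illustrative sketch, placeholder values
rgw_zonemaster: true
rgw_zonesecondary: false
rgw_zone: ieec1
rgw_zonegroup: produccion
rgw_realm: xxxx
rgw_zone_user: zone.user                   # placeholder
rgw_zone_user_display_name: "Zone User"    # placeholder
system_access_key: <ACCESS_KEY>            # placeholder
system_secret_key: <SECRET_KEY>            # placeholder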

To configure 2 RGWs per node, we add the following to the all.yml file:

$ cat group_vars/all.yml | grep radosgw_num
radosgw_num_instances: 2

We also have a per-node RGW entry in ceph_conf_overrides, for example:

  client.rgw.cepha.rgw0:
    host: cepha 
    keyring: /var/lib/ceph/radosgw/ceph-rgw.cepha.rgw0/keyring
    log file: /var/log/ceph/ceph-rgw-0-cepha.log
    log_to_file: true
    rgw frontends: beast endpoint=x.x.x.x:8080
    rgw_dynamic_resharding: false
    rgw_enable_apis: s3,admin
    rgw_zone: ieec1
    rgw_zonegroup: produccion
    rgw_realm: xxxx
  client.rgw.cepha.rgw1:
    host: cepha
    keyring: /var/lib/ceph/radosgw/ceph-rgw.cepha.rgw1/keyring
    log file: /var/log/ceph/ceph-rgw-1-cepha.log
    log_to_file: true
    rgw frontends: beast endpoint=x.x.x.x:8081
    rgw_dynamic_resharding: false
    rgw_enable_apis: s3,admin
    rgw_zone: ieec1
    rgw_zonegroup: produccion
    rgw_realm: xxxx

With this config in place, we then run site-container.yml limited to the rgws group:

# ansible-playbook -vv -i inventory site-container.yml --limit rgws

The play finishes OK, but no changes are made: we end up with the same number of RGW services as before. This is because the conditional '- rgw_instances is undefined' in the task below, from roles/ceph-facts/tasks/set_radosgw_address.yml, never evaluates to true (the rgw_instances fact is already defined by the time the task runs), so the task is always skipped.

- name: set_fact rgw_instances with rgw multisite
  set_fact:
    rgw_instances: "{{ rgw_instances|default([]) | union([{'instance_name': 'rgw' + item | string, 'radosgw_address': _radosgw_address, 'radosgw_frontend_port': radosgw_frontend_port | int, 'rgw_realm': rgw_realm | string, 'rgw_zonegroup': rgw_zonegroup | string, 'rgw_zone': rgw_zone | string, 'system_access_key': system_access_key, 'system_secret_key': system_secret_key, 'rgw_zone_user': rgw_zone_user, 'rgw_zone_user_display_name': rgw_zone_user_display_name, 'endpoint': (rgw_pull_proto + '://' + rgw_pullhost + ':' + rgw_pull_port | string) if not rgw_zonemaster | bool and rgw_zonesecondary | bool else omit }]) }}"
  with_sequence: start=0 end={{ radosgw_num_instances|int - 1 }}
  when:
    - inventory_hostname in groups.get(rgw_group_name, [])
    - rgw_instances is undefined
    - rgw_multisite | bool

Ansible log of the task that gets skipped:

TASK [ceph-facts : set_fact rgw_instances with rgw multisite] *****************************************************************************************************************************************************************************************************************************************************************
task path: /root/danip/ceph-ansible-dc1/roles/ceph-facts/tasks/set_radosgw_address.yml:53
Monday 20 July 2020  07:16:00 -0400 (0:00:00.221)       0:12:57.310 *********** 
skipping: [cepha] => (item=0)  => changed=false 
  ansible_loop_var: item
  item: '0'
  skip_reason: Conditional result was False
skipping: [cepha] => (item=1)  => changed=false 
  ansible_loop_var: item
  item: '1'
  skip_reason: Conditional result was False
skipping: [cephb] => (item=0)  => changed=false 
  ansible_loop_var: item
  item: '0'
  skip_reason: Conditional result was False
skipping: [cephc] => (item=0)  => changed=false 
  ansible_loop_var: item
  item: '0'
  skip_reason: Conditional result was False
skipping: [cephb] => (item=1)  => changed=false 
  ansible_loop_var: item
  item: '1'
  skip_reason: Conditional result was False
skipping: [cephc] => (item=1)  => changed=false 
  ansible_loop_var: item
  item: '1'
  skip_reason: Conditional result was False



If we remove the conditional check '- rgw_instances is undefined' and run it like this:

- name: set_fact rgw_instances with rgw multisite
  set_fact:
    rgw_instances: "{{ rgw_instances|default([]) | union([{'instance_name': 'rgw' + item | string, 'radosgw_address': _radosgw_address, 'radosgw_frontend_port': radosgw_frontend_port | int, 'rgw_realm': rgw_realm | string, 'rgw_zonegroup': rgw_zonegroup | string, 'rgw_zone': rgw_zone | string, 'system_access_key': system_access_key, 'system_secret_key': system_secret_key, 'rgw_zone_user': rgw_zone_user, 'rgw_zone_user_display_name': rgw_zone_user_display_name, 'endpoint': (rgw_pull_proto + '://' + rgw_pullhost + ':' + rgw_pull_port | string) if not rgw_zonemaster | bool and rgw_zonesecondary | bool else omit }]) }}"
  with_sequence: start=0 end={{ radosgw_num_instances|int - 1 }}
  when:
    - inventory_hostname in groups.get(rgw_group_name, [])
    - rgw_multisite | bool
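
With the guard removed the task now runs. Assuming the default radosgw_frontend_port of 8080 and two instances, the fact it builds would look roughly like this (abbreviated sketch, field values taken from our configuration above):

rgw_instances:
  - instance_name: rgw0
    radosgw_address: x.x.x.x
    radosgw_frontend_port: 8080
    rgw_realm: xxxx
    rgw_zonegroup: produccion
    rgw_zone: ieec1
  - instance_name: rgw1
    radosgw_address: x.x.x.x
    radosgw_frontend_port: 8080    # same port as rgw0 -- the root of the failure below
    rgw_realm: xxxx
    rgw_zonegroup: produccion
    rgw_zone: ieec1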

The number of radosgw instances now gets configured to 2 per node, but the systemd units fail to start because both RadosGW instances try to bind the same port: the EnvironmentFile of each systemd unit sets the same port for every RGW instance (the port comes from the rgw_instances fact, not from ceph_conf_overrides).

[root@cepha ~]$ cat /var/lib/ceph/radosgw/ceph-rgw.cepha.rgw?/EnvironmentFile
INST_NAME=rgw0
INST_PORT=8080
INST_NAME=rgw1
INST_PORT=8080
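
For comparison, with per-instance ports (8080 and 8081, matching the ceph_conf_overrides above) those two files would be expected to contain:

INST_NAME=rgw0
INST_PORT=8080
INST_NAME=rgw1
INST_PORT=8081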


To work around this issue we had to modify the rgw_instances fact in the /root/danip/ceph-ansible-dc1/roles/ceph-facts/tasks/set_radosgw_address.yml file so that the port number is increased by the instance index:

 'radosgw_frontend_port': radosgw_frontend_port | int + item|int,
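
With the default radosgw_frontend_port of 8080, this expression evaluates per loop item as:

  item=0  ->  8080 + 0 = 8080   (rgw0)
  item=1  ->  8080 + 1 = 8081   (rgw1)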

This is the full task with the modifications:


##ORIGINAL
- name: set_fact rgw_instances with rgw multisite
  set_fact:
    rgw_instances: "{{ rgw_instances|default([]) | union([{'instance_name': 'rgw' + item | string, 'radosgw_address': _radosgw_address, 'radosgw_frontend_port': radosgw_frontend_port | int, 'rgw_realm': rgw_realm | string, 'rgw_zonegroup': rgw_zonegroup | string, 'rgw_zone': rgw_zone | string, 'system_access_key': system_access_key, 'system_secret_key': system_secret_key, 'rgw_zone_user': rgw_zone_user, 'rgw_zone_user_display_name': rgw_zone_user_display_name, 'endpoint': (rgw_pull_proto + '://' + rgw_pullhost + ':' + rgw_pull_port | string) if not rgw_zonemaster | bool and rgw_zonesecondary | bool else omit }]) }}"
  with_sequence: start=0 end={{ radosgw_num_instances|int - 1 }}
  when:
    - inventory_hostname in groups.get(rgw_group_name, [])
    - rgw_instances is undefined
    - rgw_multisite | bool

## MODIFIED TO WORK.
- name: set_fact rgw_instances with rgw multisite
  set_fact:
    rgw_instances: "{{ rgw_instances|default([]) | union([{'instance_name': 'rgw' + item | string, 'radosgw_address': _radosgw_address, 'radosgw_frontend_port': radosgw_frontend_port | int + item|int, 'rgw_realm': rgw_realm | string, 'rgw_zonegroup': rgw_zonegroup | string, 'rgw_zone': rgw_zone | string, 'system_access_key': system_access_key, 'system_secret_key': system_secret_key, 'rgw_zone_user': rgw_zone_user, 'rgw_zone_user_display_name': rgw_zone_user_display_name, 'endpoint': (rgw_pull_proto + '://' + rgw_pullhost + ':' + rgw_pull_port | string) if not rgw_zonemaster | bool and rgw_zonesecondary | bool else omit }]) }}"
  with_sequence: start=0 end={{ radosgw_num_instances|int - 1 }}
  when:
    - inventory_hostname in groups.get(rgw_group_name, [])
    - rgw_multisite | bool
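
After applying this change we rerun the playbook and check the units; a quick sanity check (the systemd unit names are assumed to follow the usual ceph-ansible rgw.<hostname>.rgw<N> pattern) would be:

# ansible-playbook -vv -i inventory site-container.yml --limit rgws
# systemctl status ceph-radosgw@rgw.cepha.rgw0 ceph-radosgw@rgw.cepha.rgw1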



Note: All credit to Daniel Parkes.

Comment 1 RHEL Program Management 2020-07-23 08:38:45 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 Raul Mahiques 2020-07-23 08:40:31 UTC
There is a pull request with a fix:
https://github.com/ceph/ceph-ansible/pull/5583

Comment 22 errata-xmlrpc 2021-01-12 14:56:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0081