Bug 1859872 - During upgrade moving from One RGW per node to Two RGW per node fails with ceph-ansible multisite configured
Summary: During upgrade moving from One RGW per node to Two RGW per node fails with ce...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Ceph-Ansible
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.2
Assignee: Guillaume Abrioux
QA Contact: Tejas
Docs Contact: Aron Gunn
URL:
Whiteboard:
Depends On:
Blocks: 1890121 1816167
 
Reported: 2020-07-23 08:38 UTC by Raul Mahiques
Modified: 2021-06-09 16:16 UTC (History)
CC List: 14 users

Fixed In Version: ceph-ansible-4.0.34-1.el8cp, ceph-ansible-4.0.34-1.el7cp
Doc Type: Bug Fix
Doc Text:
.Adding a new Ceph Object Gateway instance when upgrading fails
The `radosgw_frontend_port` option did not consider more than one Ceph Object Gateway instance, and configured port `8080` for all instances. With this release, the `radosgw_frontend_port` option is increased for each additional Ceph Object Gateway instance, allowing you to use more than one Ceph Object Gateway instance per node.
Clone Of:
Environment:
Last Closed: 2021-01-12 14:56:02 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-ansible pull 5613 0 None closed rgw: support 1+ rgw instance in `radosgw_frontend_port` 2021-01-26 15:22:04 UTC
Github ceph ceph-ansible pull 5804 0 None closed [skip ci] facts: fix 'set_fact rgw_instances with rgw multisite' 2021-01-26 15:22:03 UTC
Github ceph ceph-ansible pull 5806 0 None closed facts: fix 'set_fact rgw_instances with rgw multisite' (bp #5804) 2021-01-26 15:22:03 UTC
Github ceph ceph-ansible pull 5807 0 None closed facts: fix 'set_fact rgw_instances with rgw multisite' (bp #5804) 2021-01-26 15:22:03 UTC
Red Hat Product Errata RHSA-2021:0081 0 None None None 2021-01-12 14:56:35 UTC

Description Raul Mahiques 2020-07-23 08:38:36 UTC
Description of problem:

We are upgrading from RHCS 3.3 to RHCS 4.1; once on RHCS 4.1 we want to start running two RGWs per node.

We have multisite configured in ceph-ansible:

[root@xxx]# cat group_vars/all.yml | grep rgw_multisite
rgw_multisite: True

Because we have multiple realms, the configuration of each RGW node is specified in host_vars/nodeX.

To configure 2 RGW per node, we add the following info to the all.yml file:

$ cat group_vars/all.yml | grep radosgw_num
radosgw_num_instances: 2

We also have a per-node RGW entry in ceph_conf_overrides, for example:

  client.rgw.cepha.rgw0:
    host: cepha 
    keyring: /var/lib/ceph/radosgw/ceph-rgw.cepha.rgw0/keyring
    log file: /var/log/ceph/ceph-rgw-0-cepha.log
    log_to_file: true
    rgw frontends: beast endpoint=x.x.x.x:8080
    rgw_dynamic_resharding: false
    rgw_enable_apis: s3,admin
    rgw_zone: ieec1
    rgw_zonegroup: produccion
    rgw_realm: xxxx
  client.rgw.cepha.rgw1:
    host: cepha
    keyring: /var/lib/ceph/radosgw/ceph-rgw.cepha.rgw1/keyring
    log file: /var/log/ceph/ceph-rgw-1-cepha.log
    log_to_file: true
    rgw frontends: beast endpoint=x.x.x.x:8081
    rgw_dynamic_resharding: false
    rgw_enable_apis: s3,admin
    rgw_zone: ieec1
    rgw_zonegroup: produccion
    rgw_realm: xxxx

With this config we then run the site-container.yml playbook with the limit set to the rgws group:

#ansible-playbook -vv -i inventory site-container.yml --limit rgws

The play finishes OK, but no changes are made; we still have the same count of RGW services. This is because the task in the file roles/ceph-facts/tasks/set_radosgw_address.yml never matches the conditional '- rgw_instances is undefined' (rgw_instances is already defined), so the task is always skipped.

- name: set_fact rgw_instances with rgw multisite
  set_fact:
    rgw_instances: "{{ rgw_instances|default([]) | union([{'instance_name': 'rgw' + item | string, 'radosgw_address': _radosgw_address, 'radosgw_frontend_port': radosgw_frontend_port | int, 'rgw_realm': rgw_realm | string, 'rgw_zonegroup': rgw_zonegroup | string, 'rgw_zone': rgw_zone | string, 'system_access_key': system_access_key, 'system_secret_key': system_secret_key, 'rgw_zone_user': rgw_zone_user, 'rgw_zone_user_display_name': rgw_zone_user_display_name, 'endpoint': (rgw_pull_proto + '://' + rgw_pullhost + ':' + rgw_pull_port | string) if not rgw_zonemaster | bool and rgw_zonesecondary | bool else omit }]) }}"
  with_sequence: start=0 end={{ radosgw_num_instances|int - 1 }}
  when:
    - inventory_hostname in groups.get(rgw_group_name, [])
    - rgw_instances is undefined
    - rgw_multisite | bool

Ansible log of the task that gets skipped:

TASK [ceph-facts : set_fact rgw_instances with rgw multisite] *****************************************************************************************************************************************************************************************************************************************************************
task path: /root/danip/ceph-ansible-dc1/roles/ceph-facts/tasks/set_radosgw_address.yml:53
Monday 20 July 2020  07:16:00 -0400 (0:00:00.221)       0:12:57.310 *********** 
skipping: [cepha] => (item=0)  => changed=false 
  ansible_loop_var: item
  item: '0'
  skip_reason: Conditional result was False
skipping: [cepha] => (item=1)  => changed=false 
  ansible_loop_var: item
  item: '1'
  skip_reason: Conditional result was False
skipping: [cephb] => (item=0)  => changed=false 
  ansible_loop_var: item
  item: '0'
  skip_reason: Conditional result was False
skipping: [cephc] => (item=0)  => changed=false 
  ansible_loop_var: item
  item: '0'
  skip_reason: Conditional result was False
skipping: [cephb] => (item=1)  => changed=false 
  ansible_loop_var: item
  item: '1'
  skip_reason: Conditional result was False
skipping: [cephc] => (item=1)  => changed=false 
  ansible_loop_var: item
  item: '1'
  skip_reason: Conditional result was False
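A minimal Python sketch (not ceph-ansible code; names and semantics are assumptions that mirror the task above) of why the guard produces the skips shown in the log: once `rgw_instances` is already defined, e.g. via the per-node host_vars, the conditional '- rgw_instances is undefined' evaluates False for every loop item, so the fact is never extended.

```python
# Hypothetical simulation of Ansible's per-item `when` evaluation for
# 'set_fact rgw_instances with rgw multisite'. This is NOT ceph-ansible
# code; the function name and dict keys only mirror the task above.

def set_fact_rgw_instances(facts, radosgw_num_instances,
                           radosgw_frontend_port=8080,
                           guard_on_undefined=True):
    """Build the rgw_instances fact the way the task's loop would."""
    for item in range(radosgw_num_instances):  # with_sequence: 0..n-1
        # The problematic conditional: '- rgw_instances is undefined'
        if guard_on_undefined and "rgw_instances" in facts:
            continue  # skip_reason: Conditional result was False
        facts.setdefault("rgw_instances", []).append({
            "instance_name": f"rgw{item}",
            # Note: the same base port for every instance (the second bug).
            "radosgw_frontend_port": radosgw_frontend_port,
        })
    return facts

# rgw_instances is already defined (e.g. from host_vars), so every
# iteration is skipped and the second instance is never added:
facts = {"rgw_instances": [{"instance_name": "rgw0",
                            "radosgw_frontend_port": 8080}]}
set_fact_rgw_instances(facts, radosgw_num_instances=2)
print(len(facts["rgw_instances"]))  # still 1
```

Dropping the guard (the change tried next) lets both instances be added, but they still share one port.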



If we remove the conditional check '- rgw_instances is undefined' and run it like this:

- name: set_fact rgw_instances with rgw multisite
  set_fact:
    rgw_instances: "{{ rgw_instances|default([]) | union([{'instance_name': 'rgw' + item | string, 'radosgw_address': _radosgw_address, 'radosgw_frontend_port': radosgw_frontend_port | int, 'rgw_realm': rgw_realm | string, 'rgw_zonegroup': rgw_zonegroup | string, 'rgw_zone': rgw_zone | string, 'system_access_key': system_access_key, 'system_secret_key': system_secret_key, 'rgw_zone_user': rgw_zone_user, 'rgw_zone_user_display_name': rgw_zone_user_display_name, 'endpoint': (rgw_pull_proto + '://' + rgw_pullhost + ':' + rgw_pull_port | string) if not rgw_zonemaster | bool and rgw_zonesecondary | bool else omit }]) }}"
  with_sequence: start=0 end={{ radosgw_num_instances|int - 1 }}
  when:
    - inventory_hostname in groups.get(rgw_group_name, [])
    - rgw_multisite | bool

The number of RadosGW instances gets configured to two per node, but the systemd units fail to start because both RadosGW instances try to use the same port: the EnvironmentFile of each systemd unit sets the same port for every RGW instance.

[root@cepha ~]$ cat /var/lib/ceph/radosgw/ceph-rgw.cepha.rgw?/EnvironmentFile
INST_NAME=rgw0
INST_PORT=8080
INST_NAME=rgw1
INST_PORT=8080
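The failure mode behind the duplicated INST_PORT can be reproduced outside of Ceph: the second daemon's bind to an already-bound address:port fails with EADDRINUSE. A minimal sketch using plain Python sockets (not radosgw; a free port is used here instead of 8080 so the sketch runs anywhere):

```python
import errno
import socket

# First "instance" grabs a free port (stand-in for rgw0 on 8080).
rgw0 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
rgw0.bind(("127.0.0.1", 0))          # let the OS pick a free port
port = rgw0.getsockname()[1]
rgw0.listen(1)

# Second "instance" tries the same port (stand-in for rgw1, whose
# EnvironmentFile carries the same INST_PORT value).
rgw1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    rgw1.bind(("127.0.0.1", port))
    second_bind_ok = True
except OSError as e:
    second_bind_ok = False
    print("second bind failed:", errno.errorcode[e.errno])
finally:
    rgw1.close()
    rgw0.close()
```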


To work around this issue we had to modify the rgw_instances fact in the file /root/danip/ceph-ansible-dc1/roles/ceph-facts/tasks/set_radosgw_address.yml so that the port number is increased by one for each instance:

 'radosgw_frontend_port': radosgw_frontend_port | int + item|int,
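The Jinja expression above is just the base port plus the loop index. Sketched in Python (variable names mirror the playbook; the values are the ones from this report):

```python
radosgw_frontend_port = 8080   # base port from the config
radosgw_num_instances = 2

# 'radosgw_frontend_port | int + item | int' for item in 0..n-1
ports = [radosgw_frontend_port + item
         for item in range(radosgw_num_instances)]
print(ports)  # [8080, 8081] -> rgw0 on 8080, rgw1 on 8081
```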

This is the full task with the modifications:


##ORIGINAL
- name: set_fact rgw_instances with rgw multisite
  set_fact:
    rgw_instances: "{{ rgw_instances|default([]) | union([{'instance_name': 'rgw' + item | string, 'radosgw_address': _radosgw_address, 'radosgw_frontend_port': radosgw_frontend_port | int, 'rgw_realm': rgw_realm | string, 'rgw_zonegroup': rgw_zonegroup | string, 'rgw_zone': rgw_zone | string, 'system_access_key': system_access_key, 'system_secret_key': system_secret_key, 'rgw_zone_user': rgw_zone_user, 'rgw_zone_user_display_name': rgw_zone_user_display_name, 'endpoint': (rgw_pull_proto + '://' + rgw_pullhost + ':' + rgw_pull_port | string) if not rgw_zonemaster | bool and rgw_zonesecondary | bool else omit }]) }}"
  with_sequence: start=0 end={{ radosgw_num_instances|int - 1 }}
  when:
    - inventory_hostname in groups.get(rgw_group_name, [])
    - rgw_instances is undefined
    - rgw_multisite | bool

## MODIFIED TO WORK.
- name: set_fact rgw_instances with rgw multisite
  set_fact:
    rgw_instances: "{{ rgw_instances|default([]) | union([{'instance_name': 'rgw' + item | string, 'radosgw_address': _radosgw_address, 'radosgw_frontend_port': radosgw_frontend_port | int + item|int, 'rgw_realm': rgw_realm | string, 'rgw_zonegroup': rgw_zonegroup | string, 'rgw_zone': rgw_zone | string, 'system_access_key': system_access_key, 'system_secret_key': system_secret_key, 'rgw_zone_user': rgw_zone_user, 'rgw_zone_user_display_name': rgw_zone_user_display_name, 'endpoint': (rgw_pull_proto + '://' + rgw_pullhost + ':' + rgw_pull_port | string) if not rgw_zonemaster | bool and rgw_zonesecondary | bool else omit }]) }}"
  with_sequence: start=0 end={{ radosgw_num_instances|int - 1 }}
  when:
    - inventory_hostname in groups.get(rgw_group_name, [])
    - rgw_multisite | bool



Note: All credit to Daniel Parkes.

Comment 1 RHEL Program Management 2020-07-23 08:38:45 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 Raul Mahiques 2020-07-23 08:40:31 UTC
There is a pull request with a fix:
https://github.com/ceph/ceph-ansible/pull/5583

Comment 22 errata-xmlrpc 2021-01-12 14:56:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 4.2 Security and Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0081

