Bug 1622505

Summary: Playbook for cluster setup with multiple RGW instances fails.
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: shilpa <smanjara>
Component: Ceph-Ansible
Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED ERRATA
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: urgent
Docs Contact:
Priority: high
Version: 3.1
CC: agunn, aschoen, ceph-eng-bugs, dondavis, flucifre, gfidente, gmeno, hnallurv, kdreyer, nobody+410372, nthomas, sankarshan, shan, tchandra, tserlin, vakulkar
Target Milestone: rc
Keywords: Automation
Target Release: 3.1
Flags: vakulkar: automate_bug+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: RHEL: ceph-ansible-3.1.5-1.el7cp; Ubuntu: ceph-ansible_3.1.5-2redhat1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-26 18:24:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1578730, 1640093

Description shilpa 2018-08-27 11:06:51 UTC
Description of problem:

This is related to https://bugzilla.redhat.com/show_bug.cgi?id=1618678. Raised this new BZ because it is not a blocker for 3.1 testing.

The playbook fails to set up the cluster when multiple RGW instances are involved.

 "error: 'dict object' has no attribute 'rgw_hostname'" 


Version-Release number of selected component (if applicable):

3.1.0-0.1.rc21.el7cp 

How reproducible:
Always


Actual results:

INFO:teuthology.orchestra.run.clara012.stdout:TASK [ceph-config : generate ceph configuration file: c1.conf] *****************
INFO:teuthology.orchestra.run.clara012.stdout:task path: /home/ubuntu/ceph-ansible/roles/ceph-config/tasks/main.yml:12
INFO:teuthology.orchestra.run.clara012.stdout:Friday 17 August 2018  06:36:51 +0000 (0:00:00.351)       0:05:00.432 *********

INFO:teuthology.orchestra.run.clara012.stdout:fatal: [clara012.ceph.redhat.com]: FAILED! => {}
 MSG:
'dict object' has no attribute 'rgw_hostname'


INFO:teuthology.orchestra.run.clara012.stdout:PLAY RECAP *********************************************************************
INFO:teuthology.orchestra.run.clara012.stdout:clara012.ceph.redhat.com   : ok=44   changed=12   unreachable=0    failed=1
 INFO:teuthology.orchestra.run.clara012.stdout:pluto004.ceph.redhat.com   : ok=1    changed=0    unreachable=0    failed=0


Expected results:
Cluster configuration with a single RGW node works with this version.

Additional info:

Config parameters:

    ceph_ansible:
      rhbuild: '3.1'
      vars:
        ceph_conf_overrides:
          global:
            mon_max_pg_per_osd: 1024
            osd default pool size: 2
            osd pool default pg num: 64
            osd pool default pgp num: 64
        ceph_origin: distro
        ceph_repository: rhcs
        ceph_stable: true
        ceph_stable_release: luminous
        ceph_stable_rh_storage: true
        ceph_test: true
        journal_size: 1024
        osd_auto_discovery: true
        osd_scenario: collocated

Logs @ http://magna002.ceph.redhat.com/smanjara-2018-08-23_05:37:58-rgw:multisite-ansible-luminous-distro-basic-multi/307495/teuthology.log
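
For completeness, the failing runs had more than one host in the [rgws] inventory group. A minimal sketch of an inventory of that shape (hostnames here are hypothetical placeholders; the real ones used in testing appear in comment 15) is:

---------------------
[mons]
mon0.example.com
[rgws]
rgw0.example.com
rgw1.example.com
---------------------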

Comment 5 Sébastien Han 2018-08-31 08:08:34 UTC
Thomas, no, this is not fixed yet.

Comment 6 Guillaume Abrioux 2018-09-11 09:18:27 UTC
Fixed in v3.1.4.

Comment 9 Christina Meno 2018-09-17 20:48:17 UTC
This regression was caught by the downstream OSP automation tests.
The setup required to trigger it is installing or upgrading Ceph on a cluster using the FQDN option.

Any re-run of ceph-ansible (v3.1.3) on an existing cluster configured to "used_fqdn" is going to break the ceph.conf on the RGW nodes.

The effect is that the ceph.conf on an RGW node will have missing fields.

It is not in stable-3.0.
It was introduced in August.

The fix is already released as v3.1.4.
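
For reference, "the FQDN option" here maps to the *_use_fqdn settings in ceph-ansible's group_vars (the same options named in comment 13 below). A minimal sketch of an all.yml that puts a cluster into that configuration, assuming the stock variable names, would be:

    # group_vars/all.yml -- sketch only; these options are no longer
    # supported in 3.1 (see bug 1613155 referenced in comment 13)
    mon_use_fqdn: true    # register MONs under their fully qualified names
    mds_use_fqdn: true    # same for MDS daemons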

Comment 10 Federico Lucifredi 2018-09-17 21:15:11 UTC
Do I understand correctly that ceph-ansible 3.1.4 is a newer version than the one included in 3.1?

Best -F

Comment 13 Harish NV Rao 2018-09-18 07:03:18 UTC
(In reply to Gregory Meno from comment #9)
> This regression was caught by the downstream OSP automation tests.
> The required setup to trigger it is install or upgrade ceph on a cluster
> using FQDN option.
> 
> any re-run of ceph-ansible (v3.1.3) on an existing cluster configured to
> "used_fqdn" is going to break the ceph.conf on rgw node

Gregory, are you referring to the options 'mon_use_fqdn' and 'mds_use_fqdn'? If yes, then they are not supported in 3.1 as per bz https://bugzilla.redhat.com/show_bug.cgi?id=1613155.

We have hit the original issue both with and without FQDNs when more than one RGW is mentioned in the inventory file.

Comment 14 Guillaume Abrioux 2018-09-18 07:25:20 UTC
Just to clarify a bit:

As Harish mentioned in c13, we no longer support these options.
Therefore, we had to keep backward compatibility with existing clusters.
The commit that was supposed to provide this backward compatibility missed something and introduced the current bug, which was finally fixed in v3.1.4.

Comment 15 Persona non grata 2018-09-18 11:25:22 UTC
We tried the following scenarios with ceph-ansible 3.1.3-2redhat1 on Ubuntu.

With inventory file (full names):
---------------------
[mons]
magna006.ceph.redhat.com
[mgrs]
magna006.ceph.redhat.com
[osds]
magna064.ceph.redhat.com devices="['/dev/sdb','/dev/sdc','/dev/sdd']" osd_scenario="collocated" osd_objectstore="bluestore" dmcrypt="true"
magna111.ceph.redhat.com dedicated_devices="['/dev/sdb', '/dev/sdb']" devices="['/dev/sdc','/dev/sdd']" osd_scenario="non-collocated" osd_objectstore="bluestore" dmcrypt="true"
magna117.ceph.redhat.com devices="['/dev/sdb','/dev/sdc','/dev/sdd']" osd_scenario="collocated" osd_objectstore="bluestore"
[rgws]
magna053.ceph.redhat.com
magna061.ceph.redhat.com
---------------------
Cluster was up, the playbook did not fail, but RGWs were not installed.

With inventory file (short names):
--------------------
[mons]
magna006
[mgrs]
magna006
[osds]
magna064 devices="['/dev/sdb','/dev/sdc','/dev/sdd']" osd_scenario="collocated" osd_objectstore="bluestore" dmcrypt="true"
magna111 dedicated_devices="['/dev/sdb', '/dev/sdb']" devices="['/dev/sdc','/dev/sdd']" osd_scenario="non-collocated" osd_objectstore="bluestore" dmcrypt="true"
magna117 devices="['/dev/sdb','/dev/sdc','/dev/sdd']" osd_scenario="collocated" osd_objectstore="bluestore"
[rgws]
magna053
magna061
--------------------
Cluster was up, RGWs were installed. Ansible logs are kept at magna002:/home/sshreeka/ansible_logs.

Comment 16 Harish NV Rao 2018-09-18 11:27:01 UTC
(In reply to Harish NV Rao from comment #13)
> (In reply to Gregory Meno from comment #9)
> > This regression was caught by the downstream OSP automation tests.
> > The required setup to trigger it is install or upgrade ceph on a cluster
> > using FQDN option.
> > 
> > any re-run of ceph-ansible (v3.1.3) on an existing cluster configured to
> > "used_fqdn" is going to break the ceph.conf on rgw node
> 
> Gregory are you referring to the options 'mon_use_fqdn' and 'mds_use_fqdn'?
> If yes, then they are not supported in 3.1 as per bz
> https://bugzilla.redhat.com/show_bug.cgi?id=1613155. 
> 
> We have hit the original issue with and without FQDN mentioned for more than
> one RGWs in the inventory file.

^^ The above is w.r.t. RHEL 7.5.

Comment 26 Christina Meno 2018-09-19 00:26:08 UTC
OK,

This latest build (3.1.5) should address c18.
Here's the plan we discussed this morning:

We'd like Giulio to run it through the OSP automation that caught this as a blocker.

AND

Ceph QE will check that they cannot reproduce the error.

THEN

We'll produce an RC and move forward with 3.1.

cheers,
G

Comment 28 Giulio Fidente 2018-09-19 20:11:27 UTC
3.1.5 passed; thanks a lot for caring and for the special effort!

Comment 29 Harish NV Rao 2018-09-20 13:01:54 UTC
Ceph QE's tests for this fix have passed.

Comment 30 Harish NV Rao 2018-09-20 13:03:21 UTC
Based on comments 28 and 29, moving this BZ to the verified state.

Comment 32 errata-xmlrpc 2018-09-26 18:24:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2819

Comment 33 Donny Davis 2018-10-09 02:41:13 UTC
This happens in OSP13 as well.

Comment 34 Giulio Fidente 2018-10-09 10:33:01 UTC
(In reply to Donny Davis from comment #33)
> This happens in OSP13 as well.

Donny, are you seeing the issue with ceph-ansible-3.1.5-1.el7cp or an older version?

Comment 35 Donny Davis 2018-10-09 13:01:58 UTC
I have the latest available package for OSP13 installed. 

ceph-ansible-3.1.3-1.el7cp.noarch

I patched the offending file with this:
https://github.com/ceph/ceph-ansible/blob/4ce11a84938bb5377f422f01dbf3477bd0f607a9/roles/ceph-config/templates/ceph.conf.j2

And all seems to be well. It also seems to have corrected another issue I was going to raise, which is RGW not working from the OSP Dashboard (horizon). It would just throw an error before.
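
For anyone applying the same stopgap before upgrading, one possible way to drop in the fixed template is sketched below. This is only an illustration: the destination path assumes the RPM installs ceph-ansible under /usr/share/ceph-ansible, the URL is the raw counterpart of the GitHub link above, and the supported fix remains upgrading to 3.1.5 as noted in comment 36.

    # Hypothetical stopgap task -- not the supported fix (upgrade to 3.1.5 instead).
    - name: replace ceph.conf.j2 with the fixed upstream template
      get_url:
        url: https://raw.githubusercontent.com/ceph/ceph-ansible/4ce11a84938bb5377f422f01dbf3477bd0f607a9/roles/ceph-config/templates/ceph.conf.j2
        dest: /usr/share/ceph-ansible/roles/ceph-config/templates/ceph.conf.j2  # assumed RPM install path
        backup: yes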

Comment 36 Giulio Fidente 2018-10-10 11:33:19 UTC
The bug was fixed in 3.1.5; you should enable the Ceph Tools repos to get the newer version installed instead of the version included in the OSP repos.