Bug 1569290

Summary: Ceph upgrade fails during OSP10 to OSP13 fast forward upgrade while running TASK [ceph-client : chmod cephx key(s)]
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Marius Cornea <mcornea>
Component: Ceph-Ansible
Assignee: Sébastien Han <shan>
Status: CLOSED CURRENTRELEASE
QA Contact: Yogev Rabl <yrabl>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.0
CC: adeza, aschoen, ceph-eng-bugs, gfidente, gmeno, johfulto, kdreyer, nthomas, pgrist, sankarshan, sasha, yprokule, yrabl
Target Milestone: rc
Target Release: 3.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: RHEL: ceph-ansible-3.1.0-0.1.beta8.el7cp
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-01-09 08:52:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1548353, 1571947
Attachments:
Description Flags
ceph-install-workflow.log none

Description Marius Cornea 2018-04-19 00:27:30 UTC
Created attachment 1423819
ceph-install-workflow.log

Description of problem:
Ceph upgrade fails during OSP10 to OSP13 fast forward upgrade while running TASK [ceph-client : chmod cephx key(s)]

Snippet from /var/log/mistral/ceph-install-workflow.log:

2018-04-18 20:01:41,306 p=26738 u=mistral |  TASK [ceph-client : chmod cephx key(s)] ****************************************
2018-04-18 20:01:41,307 p=26738 u=mistral |  task path: /usr/share/ceph-ansible/roles/ceph-client/tasks/create_users_keys.yml:99
2018-04-18 20:01:41,307 p=26738 u=mistral |  Wednesday 18 April 2018  20:01:41 -0400 (0:00:00.072)       0:11:19.185 ******* 
2018-04-18 20:01:41,573 p=26738 u=mistral |  changed: [192.168.24.8] => (item={'caps': {'mds': u"''", 'osd': u"'allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=backups, allow rwx pool=vms, allow rwx poo
l=images, allow rwx pool=metrics'", 'mon': u"'allow r'", 'mgr': u"'allow *'"}, 'mode': u'0600', 'key': u'AQDroddaAAAAABAA64fr7tW0dXZue2Pl9Wi8Qg==', 'name': u'client.openstack'}) => {"changed": true, "gid": 0, "group": "root", "item": {"ca
ps": {"mds": "''", "mgr": "'allow *'", "mon": "'allow r'", "osd": "'allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=backups, allow rwx pool=vms, allow rwx pool=images, allow rwx pool=metrics'"}, "key": 
"AQDroddaAAAAABAA64fr7tW0dXZue2Pl9Wi8Qg==", "mode": "0600", "name": "client.openstack"}, "mode": "0600", "owner": "root", "path": "/etc/ceph/ceph.client.openstack.keyring", "secontext": "system_u:object_r:container_file_t:s0", "size": 262
, "state": "file", "uid": 0}
2018-04-18 20:01:41,818 p=26738 u=mistral |  failed: [192.168.24.8] (item={'caps': {'mds': u"'allow *'", 'osd': u"'allow rw'", 'mon': u'\'allow r, allow command \\\\\\"auth del\\\\\\", allow command \\\\\\"auth caps\\\\\\", allow command 
\\\\\\"auth get\\\\\\", allow command \\\\\\"auth get-or-create\\\\\\"\'', 'mgr': u"'allow *'"}, 'name': u'client.manila', 'key': u'AQC+t9daAAAAABAA0zzoLz1NwZbBxrHNk14I4g==', 'mode': u'0600'}) => {"changed": false, "item": {"caps": {"mds"
: "'allow *'", "mgr": "'allow *'", "mon": "'allow r, allow command \\\\\\\"auth del\\\\\\\", allow command \\\\\\\"auth caps\\\\\\\", allow command \\\\\\\"auth get\\\\\\\", allow command \\\\\\\"auth get-or-create\\\\\\\"'", "osd": "'all
ow rw'"}, "key": "AQC+t9daAAAAABAA0zzoLz1NwZbBxrHNk14I4g==", "mode": "0600", "name": "client.manila"}, "msg": "file (/etc/ceph/ceph.client.manila.keyring) is absent, cannot continue", "path": "/etc/ceph/ceph.client.manila.keyring", "state
": "absent"}
2018-04-18 20:01:42,066 p=26738 u=mistral |  failed: [192.168.24.8] (item={'caps': {'mds': u"''", 'osd': u"'allow rwx'", 'mon': u"'allow rw'", 'mgr': u"'allow *'"}, 'mode': u'0600', 'key': u'AQDroddaAAAAABAAtrsj06ioGk1GRO2T4XUgOw==', 'nam
e': u'client.radosgw'}) => {"changed": false, "item": {"caps": {"mds": "''", "mgr": "'allow *'", "mon": "'allow rw'", "osd": "'allow rwx'"}, "key": "AQDroddaAAAAABAAtrsj06ioGk1GRO2T4XUgOw==", "mode": "0600", "name": "client.radosgw"}, "ms
g": "file (/etc/ceph/ceph.client.radosgw.keyring) is absent, cannot continue", "path": "/etc/ceph/ceph.client.radosgw.keyring", "state": "absent"}

/etc/ceph/ceph.client.radosgw.keyring and /etc/ceph/ceph.client.manila.keyring do not exist on the compute node (192.168.24.8) because these services were not enabled in the initial OSP10 deployment.
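
To illustrate the failure mode, here is a minimal Python sketch. It is not the ceph-ansible task itself. It walks the same three client names seen in the log and applies mode 0600 only to keyrings that exist, collecting the absent ones instead of aborting. The function name chmod_existing_keyrings, the demo() helper, and the temporary directory standing in for /etc/ceph are assumptions made for illustration.

```python
import os
import tempfile


def chmod_existing_keyrings(names, conf_dir, mode=0o600):
    """Apply `mode` to each keyring that exists; collect the absent ones."""
    changed, missing = [], []
    for name in names:
        path = os.path.join(conf_dir, f"ceph.{name}.keyring")
        if os.path.isfile(path):
            os.chmod(path, mode)
            changed.append(name)
        else:
            # The task in the log fails at this point with
            # "file (...) is absent, cannot continue".
            missing.append(name)
    return changed, missing


def demo():
    # A temporary directory stands in for /etc/ceph (assumption).
    with tempfile.TemporaryDirectory() as conf_dir:
        # Only the openstack keyring exists, as on the compute node above.
        open(os.path.join(conf_dir, "ceph.client.openstack.keyring"), "w").close()
        names = ["client.openstack", "client.manila", "client.radosgw"]
        return chmod_existing_keyrings(names, conf_dir)


if __name__ == "__main__":
    changed, missing = demo()
    print("changed:", changed)
    print("skipped (absent):", missing)
```

With only the openstack keyring present, the sketch changes that one file and reports the manila and radosgw keyrings as absent, which matches the one "changed" and two "failed" items in the log.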

Version-Release number of selected component (if applicable):
ceph-ansible-3.1.0-0.1.beta6.el7cp.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP10 with 3 controllers + 2 computes + 3 ceph osd nodes
2. Upgrade to OSP13 via the fast forward procedure

Actual results:
Ceph upgrade stage fails while running ceph-ansible.

Expected results:
Ceph upgrade succeeds.

Additional info:
Attaching /var/log/mistral/ceph-install-workflow.log and sosreports.

Comment 4 Giulio Fidente 2018-04-19 12:35:22 UTC
Starting with OSP12, Director always passes the full list of keyrings to be created, whereas previously it could selectively create keyrings based on which services were enabled.

In the long term we want Director to behave as it used to in OSP10, but after moving to ceph-ansible this became a complicated issue to solve in the Heat templates.

If we could make ceph-ansible either continue past the missing keyrings or create them, that would give us more time to work on a proper fix in Director.
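
As a rough sketch of the selective behaviour described above (not Director's or ceph-ansible's actual code), the key list could be filtered by enabled services before it is handed over. The service names, the OPTIONAL_CLIENTS mapping, and the select_keys function are hypothetical; only the three client names come from the log.

```python
# Hypothetical mapping from an optional service to the cephx client it needs.
OPTIONAL_CLIENTS = {
    "manila": "client.manila",
    "rgw": "client.radosgw",
}
# Assumption: the openstack client key is always required.
ALWAYS_CLIENTS = ["client.openstack"]


def select_keys(all_keys, enabled_services):
    """Return only the key definitions whose service is enabled."""
    wanted = set(ALWAYS_CLIENTS)
    wanted.update(
        client
        for service, client in OPTIONAL_CLIENTS.items()
        if service in enabled_services
    )
    # Preserve the original ordering of the key list.
    return [key for key in all_keys if key["name"] in wanted]


ALL_KEYS = [
    {"name": "client.openstack", "mode": "0600"},
    {"name": "client.manila", "mode": "0600"},
    {"name": "client.radosgw", "mode": "0600"},
]

if __name__ == "__main__":
    # Neither manila nor rgw was enabled in the initial OSP10 deployment.
    for key in select_keys(ALL_KEYS, enabled_services=set()):
        print(key["name"])
```

With no optional services enabled, only client.openstack remains, so nothing would try to chmod a keyring that was never created.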

Comment 5 Sébastien Han 2018-04-20 08:51:25 UTC
ceph-ansible is doing its job correctly; it's just Director passing incorrect information. So ceph-ansible behaves as expected.

Giulio, how long is the Director fix going to take?

Comment 6 Giulio Fidente 2018-04-20 09:28:16 UTC
(In reply to leseb from comment #5)
> ceph-ansible is doing its job correctly, it's just director passing the
> false information. So ceph-ansible behaves as expected.

agreed, but it's also behaving differently than it used to
 
> Giulio, how long is the Director fix going to take?

I am not sure this is going to happen anytime soon, and it is more a generic limitation in the framework than a bug; to start with I filed an RFE for Director, #1569920, but it's targeted for Rocky.

Comment 7 Sébastien Han 2018-04-20 09:35:16 UTC
(In reply to Giulio Fidente from comment #6)
> (In reply to leseb from comment #5)
> > ceph-ansible is doing its job correctly, it's just director passing the
> > false information. So ceph-ansible behaves as expected.
> 
> agreed, but it's also behaving differently than it used to

Can you be more specific?

>  
> > Giulio, how long is the Director fix going to take?
> 
> I am not sure this is going to happen anytime soon and it is more a generic
> limitation in the framework than a bug; to start with I filed an RFE for
> Director #1569920 but it's targeted for Rocky.

OK, so if I understand correctly, ceph-ansible has to somehow work around this because it's blocking the release?

Comment 8 Giulio Fidente 2018-04-20 09:43:00 UTC
(In reply to leseb from comment #7)
> (In reply to Giulio Fidente from comment #6)
> > (In reply to leseb from comment #5)
> > > ceph-ansible is doing its job correctly, it's just director passing the
> > > false information. So ceph-ansible behaves as expected.
> > 
> > agreed, but it's also behaving differently than it used to
> 
> Can you be more specific?

previously (beta5) it wouldn't fail on the same task, even though we passed the same parameters we are passing now
 
> > > Giulio, how long is the Director fix going to take?
> > 
> > I am not sure this is going to happen anytime soon and it is more a generic
> > limitation in the framework than a bug; to start with I filed an RFE for
> > Director #1569920 but it's targeted for Rocky.
> 
> Ok so if I understand correctly, ceph-ansible has to somehow work around this
> because it's blocking the release?

yes, and it is not great, I agree; I do want to address this in Director, but I don't think it can happen quickly enough, so we'll have to carry some technical debt until it is resolved

Comment 9 Sébastien Han 2018-04-23 17:09:19 UTC
Will be in beta8

Comment 12 Yogev Rabl 2018-05-18 18:58:20 UTC
Verified