Bug 1568157

Summary: ceph-client role fails to create key because the ceph-create-keys container is not running
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: John Fulton <johfulto>
Component: Ceph-Ansible
Assignee: Sébastien Han <shan>
Status: CLOSED CURRENTRELEASE
QA Contact: Vasishta <vashastr>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 3.0
CC: adeza, ahrechan, aschoen, ceph-eng-bugs, gfidente, gmeno, mcornea, nthomas, ohochman, psedlak, sankarshan, sasha, tserlin, yrabl
Target Milestone: rc
Target Release: 3.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-08-27 05:10:59 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1548353, 1571947

Description John Fulton 2018-04-16 21:22:26 UTC
While using ceph-ansible-3.1.0.0-0.beta6.1.el7.noarch to configure a server to communicate with an external ceph cluster, the deployment fails on task: "create cephx key(s)" [1]. 

I was able to reproduce the problem by running the same command that ceph-ansible tried to run [2].

When I examined the server where the task failed, I found that the ceph-create-keys container (with the same container ID) had been running at some point but had since stopped.

[root@rhosp-ctrl0 ~]# docker ps -a
CONTAINER ID        IMAGE                                                             COMMAND                  CREATED             STATUS                         PORTS               NAMES
6d391bc9842c        registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest           "sleep 300"              About an hour ago   Exited (0) About an hour ago                       ceph-create-keys

Maybe there is a race condition where the dummy container is not available when it is needed? (Just a theory.) I'll attach more logs.
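To make the theory easier to test, here is a minimal sketch of a pre-check that could be run on the affected node before the exec. It is not part of ceph-ansible; the function name `container_is_running` and the injectable `run` parameter are made up for illustration. It relies only on `docker inspect --format '{{.State.Running}}'`, which prints `true` or `false` for an existing container.

```python
import subprocess


def container_is_running(name, run=subprocess.run):
    """Return True only if `docker inspect` reports the container as running.

    `run` is injectable (hypothetical parameter) so the check can be
    exercised without a Docker daemon.
    """
    result = run(
        ["docker", "inspect", "--format", "{{.State.Running}}", name],
        capture_output=True,
        text=True,
    )
    # A non-zero return code means the container does not exist at all;
    # "false" means it exists but has exited (as in the `docker ps -a` output).
    return result.returncode == 0 and result.stdout.strip() == "true"
```

Calling `container_is_running("ceph-create-keys")` right before the "create cephx key(s)" task would show whether the dummy container had already exited at that point.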

https://github.com/ceph/ceph-ansible/blob/f711c51f395df181c5f821fca7ef879af79fdb64/roles/ceph-client/tasks/create_users_keys.yml#L15-L26

[1] 
2018-04-16 19:54:55,600 p=2591 u=mistral |  TASK [ceph-client : create cephx key(s)] ***************************************
2018-04-16 19:54:55,601 p=2591 u=mistral |  Monday 16 April 2018  19:54:55 +0000 (0:00:00.048)       0:01:06.258 ********** 
2018-04-16 19:59:55,593 p=2591 u=mistral |  failed: [192.168.213.214] (item={'caps': {'mds': u'', 'osd': u'allow class-read object_prefix rbd_children, allow rwx pool=rhosp-volumes, allow rwx pool=rhosp-backup, allow rwx pool=rhosp-vms, allow rwx pool=rhosp-images, allow rwx pool=rhosp-metrics', 'mon': u'allow r', 'mgr': u'allow *'}, 'mode': u'0600', 'key': u'AQC8ZSlakbFkMBAAAZFQjIgWVZQ+HVnKc3FpTw==', 'name': u'client.rhosp'}) => {"changed": true, "cmd": ["docker", "exec", "ceph-create-keys", "ceph-authtool", "--create-keyring", "/etc/ceph/ceph.client.rhosp.keyring", "--name", "client.rhosp", "--add-key", "AQC8ZSlakbFkMBAAAZFQjIgWVZQ+HVnKc3FpTw==", "--cap", "mds", "", "--cap", "osd", "allow class-read object_prefix rbd_children, allow rwx pool=rhosp-volumes, allow rwx pool=rhosp-backup, allow rwx pool=rhosp-vms, allow rwx pool=rhosp-images, allow rwx pool=rhosp-metrics", "--cap", "mon", "allow r", "--cap", "mgr", "allow *"], "delta": "0:04:59.598733", "end": "2018-04-16 15:59:55.574075", "item": {"caps": {"mds": "", "mgr": "allow *", "mon": "allow r", "osd": "allow class-read object_prefix rbd_children, allow rwx pool=rhosp-volumes, allow rwx pool=rhosp-backup, allow rwx pool=rhosp-vms, allow rwx pool=rhosp-images, allow rwx pool=rhosp-metrics"}, "key": "AQC8ZSlakbFkMBAAAZFQjIgWVZQ+HVnKc3FpTw==", "mode": "0600", "name": "client.rhosp"}, "msg": "non-zero return code", "rc": 126, "start": "2018-04-16 15:54:55.975342", "stderr": "", "stderr_lines": [], "stdout": "rpc error: code = 2 desc = oci runtime error: exec failed: container \"6d391bc9842c993ae6123f023d24da305a96dfe8d64e3607973c665cd8880129\" does not exist", "stdout_lines": ["rpc error: code = 2 desc = oci runtime error: exec failed: container \"6d391bc9842c993ae6123f023d24da305a96dfe8d64e3607973c665cd8880129\" does not exist"]}
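One detail in the log above may be relevant: the "start" and "end" timestamps of the failed command are almost exactly five minutes apart, and the dummy container's command is "sleep 300". A possible reading, not confirmed anywhere in this report, is that the exec blocked until the container's sleep ended. The sketch below only recomputes the elapsed time from the logged timestamps; the helper name `elapsed_seconds` is made up for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start, end):
    """Seconds between two timestamps in the format Ansible logged above."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# "start" and "end" values copied from the failed task output.
elapsed = elapsed_seconds("2018-04-16 15:54:55.975342", "2018-04-16 15:59:55.574075")

# Lifetime of the dummy container, taken from its "sleep 300" command.
SLEEP_SECONDS = 300

print(f"exec ran for {elapsed:.3f}s; container lifetime is {SLEEP_SECONDS}s")
```

The result agrees with the "delta" field that Ansible reported (0:04:59.598733).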

[2] 
[root@rhosp-ctrl0 ~]# docker exec ceph-create-keys ceph-authtool --create-keyring /etc/ceph/ceph.client.rhosp.keyring --name client.rhosp --add-key AQC8ZSlakbFkMBAAAZFQjIgWVZQ+HVnKc3FpTw== --cap mds  --cap osd allow class-read object_prefix rbd_children, allow rwx pool=rhosp-volumes, allow rwx pool=rhosp-backup, allow rwx pool=rhosp-vms, allow rwx pool=rhosp-images, allow rwx pool=rhosp-metrics --cap mon allow r --cap mgr allow * 
Error response from daemon: Container 6d391bc9842c993ae6123f023d24da305a96dfe8d64e3607973c665cd8880129 is not running
[root@rhosp-ctrl0 ~]#
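Note that the command as pasted in [2] has lost the argument boundaries that Ansible used: the empty mds cap and the multi-word caps would need shell quoting to reach ceph-authtool as single arguments. That does not change the result here, since the daemon rejected the exec because the container was not running. For anyone re-running the reproduction against a live container, this sketch rebuilds a correctly quoted command line from the "cmd" list in [1] using Python's `shlex.join` (Python 3.8+).

```python
import shlex

# Argument list copied from the "cmd" field of the failed task in [1].
argv = [
    "docker", "exec", "ceph-create-keys",
    "ceph-authtool", "--create-keyring", "/etc/ceph/ceph.client.rhosp.keyring",
    "--name", "client.rhosp",
    "--add-key", "AQC8ZSlakbFkMBAAAZFQjIgWVZQ+HVnKc3FpTw==",
    "--cap", "mds", "",
    "--cap", "osd",
    "allow class-read object_prefix rbd_children, "
    "allow rwx pool=rhosp-volumes, allow rwx pool=rhosp-backup, "
    "allow rwx pool=rhosp-vms, allow rwx pool=rhosp-images, "
    "allow rwx pool=rhosp-metrics",
    "--cap", "mon", "allow r",
    "--cap", "mgr", "allow *",
]

# shlex.join quotes each argument so a POSIX shell splits it back
# into exactly the same list.
quoted = shlex.join(argv)
print(quoted)
```

Pasting the printed line into a shell passes the same 22 arguments that Ansible passed, including the empty string for the mds cap.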

Comment 3 John Fulton 2018-04-16 21:41:18 UTC
I also reproduced this for the Ceph clients when deploying a new Ceph cluster with TripleO. Here is the output of ansible-playbook -vvv: https://ptpb.pw/-QC_

Here is the ceph-install log from the original report, though ansible-playbook was not run with -vvv: http://ix.io/17Y7

Comment 4 John Fulton 2018-04-17 13:30:21 UTC
*** Bug 1568234 has been marked as a duplicate of this bug. ***

Comment 5 John Fulton 2018-04-18 23:44:02 UTC
*** Bug 1569258 has been marked as a duplicate of this bug. ***

Comment 6 Artem Hrechanychenko 2018-04-19 10:00:23 UTC
Reproduced in OSP13 puddle 2018-04-13.1 with the TLS everywhere scenario and SELinux disabled for OC nodes

Comment 7 Artem Hrechanychenko 2018-04-19 10:00:56 UTC
Reproduced in OSP13 puddle 2018-04-13.1 with the TLS everywhere scenario and SELinux disabled for OC nodes and the UC node

Comment 9 Omri Hochman 2018-04-19 13:12:32 UTC
RHOS OSP13 deployment with the latest puddle is failing because of this issue.

Comment 14 John Fulton 2018-04-25 18:09:11 UTC
Part of the fix for this required the following commit:

https://github.com/ceph/ceph-ansible/commit/90e47c5fb0c95f4b1a17cdf2a019bdcebc77a773

which landed in

https://github.com/ceph/ceph-ansible/releases/tag/v3.1.0beta8

Comment 15 Yogev Rabl 2018-05-16 14:11:14 UTC
Verified on ceph-ansible-3.1.0-0.1.rc2.el7cp.noarch