While using ceph-ansible-3.1.0.0-0.beta6.1.el7.noarch to configure a server to communicate with an external Ceph cluster, the deployment fails on the task "create cephx key(s)" [1]. I was able to reproduce the problem by running the same command that ceph-ansible tried to run [2]. When examining the server where the task failed, I found that the ceph-create-keys container (with the same container ID) had been running at some point but had since exited:

[root@rhosp-ctrl0 ~]# docker ps -a
CONTAINER ID        IMAGE                                                     COMMAND             CREATED             STATUS                         PORTS               NAMES
6d391bc9842c        registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest   "sleep 300"         About an hour ago   Exited (0) About an hour ago                       ceph-create-keys

Maybe there is a race condition and the dummy container is no longer available by the time it is needed? (just a theory) I'll attach more logs.

The task in question:
https://github.com/ceph/ceph-ansible/blob/f711c51f395df181c5f821fca7ef879af79fdb64/roles/ceph-client/tasks/create_users_keys.yml#L15-L26

[1]
2018-04-16 19:54:55,600 p=2591 u=mistral | TASK [ceph-client : create cephx key(s)] ***************************************
2018-04-16 19:54:55,601 p=2591 u=mistral | Monday 16 April 2018 19:54:55 +0000 (0:00:00.048)       0:01:06.258 **********
2018-04-16 19:59:55,593 p=2591 u=mistral | failed: [192.168.213.214] (item={'caps': {'mds': u'', 'osd': u'allow class-read object_prefix rbd_children, allow rwx pool=rhosp-volumes, allow rwx pool=rhosp-backup, allow rwx pool=rhosp-vms, allow rwx pool=rhosp-images, allow rwx pool=rhosp-metrics', 'mon': u'allow r', 'mgr': u'allow *'}, 'mode': u'0600', 'key': u'AQC8ZSlakbFkMBAAAZFQjIgWVZQ+HVnKc3FpTw==', 'name': u'client.rhosp'}) => {"changed": true, "cmd": ["docker", "exec", "ceph-create-keys", "ceph-authtool", "--create-keyring", "/etc/ceph/ceph.client.rhosp.keyring", "--name", "client.rhosp", "--add-key", "AQC8ZSlakbFkMBAAAZFQjIgWVZQ+HVnKc3FpTw==", "--cap", "mds", "", "--cap", "osd", "allow class-read object_prefix rbd_children, allow rwx pool=rhosp-volumes, allow rwx pool=rhosp-backup, allow rwx pool=rhosp-vms, allow rwx pool=rhosp-images, allow rwx pool=rhosp-metrics", "--cap", "mon", "allow r", "--cap", "mgr", "allow *"], "delta": "0:04:59.598733", "end": "2018-04-16 15:59:55.574075", "item": {"caps": {"mds": "", "mgr": "allow *", "mon": "allow r", "osd": "allow class-read object_prefix rbd_children, allow rwx pool=rhosp-volumes, allow rwx pool=rhosp-backup, allow rwx pool=rhosp-vms, allow rwx pool=rhosp-images, allow rwx pool=rhosp-metrics"}, "key": "AQC8ZSlakbFkMBAAAZFQjIgWVZQ+HVnKc3FpTw==", "mode": "0600", "name": "client.rhosp"}, "msg": "non-zero return code", "rc": 126, "start": "2018-04-16 15:54:55.975342", "stderr": "", "stderr_lines": [], "stdout": "rpc error: code = 2 desc = oci runtime error: exec failed: container \"6d391bc9842c993ae6123f023d24da305a96dfe8d64e3607973c665cd8880129\" does not exist", "stdout_lines": ["rpc error: code = 2 desc = oci runtime error: exec failed: container \"6d391bc9842c993ae6123f023d24da305a96dfe8d64e3607973c665cd8880129\" does not exist"]}

[2]
[root@rhosp-ctrl0 ~]# docker exec ceph-create-keys ceph-authtool --create-keyring /etc/ceph/ceph.client.rhosp.keyring --name client.rhosp --add-key AQC8ZSlakbFkMBAAAZFQjIgWVZQ+HVnKc3FpTw== --cap mds --cap osd allow class-read object_prefix rbd_children, allow rwx pool=rhosp-volumes, allow rwx pool=rhosp-backup, allow rwx pool=rhosp-vms, allow rwx pool=rhosp-images, allow rwx pool=rhosp-metrics --cap mon allow r --cap mgr allow *
Error response from daemon: Container 6d391bc9842c993ae6123f023d24da305a96dfe8d64e3607973c665cd8880129 is not running
[root@rhosp-ctrl0 ~]#
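The suspected race can be sketched without Docker at all. Below is a hypothetical, minimal demo (not the actual ceph-ansible task) that assumes only a POSIX shell: a background sleep stands in for the containerized "sleep 300" helper, and anything that tries to use the helper after its fixed lifetime fails, analogous to the "docker exec" above.

```shell
#!/bin/sh
# Start a short-lived helper; this plays the role of:
#   docker run -d --name ceph-create-keys ... sleep 300
sleep 1 &
helper=$!

# Simulate the playbook taking longer to reach the exec step than the
# helper's fixed lifetime (300 seconds in the real deployment).
sleep 2

# Signal 0 only checks whether the process still exists, like probing
# the container before "docker exec".
if kill -0 "$helper" 2>/dev/null; then
    status="helper still running"
else
    status="helper already exited"   # analogous to "container ... does not exist"
fi
echo "$status"
```

Under this theory, any "create cephx key(s)" run that starts more than 300 seconds after the dummy container was launched would hit the same rc 126 failure.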
I also reproduced this for the ceph clients when deploying a new Ceph cluster with TripleO. Here is the output of ansible-playbook -vvv: https://ptpb.pw/-QC_ And here is the ceph-install log from the original report, though that run did not use -vvv: http://ix.io/17Y7
*** Bug 1568234 has been marked as a duplicate of this bug. ***
*** Bug 1569258 has been marked as a duplicate of this bug. ***
Reproduced in OSP13 puddle 2018-04-13.1 with the TLS everywhere scenario and SELinux disabled on the overcloud nodes.
Also reproduced in OSP13 puddle 2018-04-13.1 with the TLS everywhere scenario and SELinux disabled on both the overcloud nodes and the undercloud node.
An RHOS OSP13 deployment with the latest puddle is failing due to this issue.
Part of the fix for this is the following change: https://github.com/ceph/ceph-ansible/commit/90e47c5fb0c95f4b1a17cdf2a019bdcebc77a773, which landed in https://github.com/ceph/ceph-ansible/releases/tag/v3.1.0beta8
Verified on ceph-ansible-3.1.0-0.1.rc2.el7cp.noarch.