.`ceph-ansible` does not properly check for running containers
In an environment where the Docker application is not preinstalled, the `ceph-ansible` utility fails to deploy a Ceph Storage Cluster because it tries to restart `ceph-mgr` containers when deploying the `ceph-mon` role. This attempt fails because the `ceph-mgr` container is not deployed yet. In addition, the `docker ps` command returns the following error:
either you don't have docker-client or docker-client-common installed
Because `ceph-ansible` only checks whether `docker ps` produced any output, and not what that output contains, `ceph-ansible` misinterprets this error message as a running container. When the `ceph-ansible` handler is run later during Monitor deployment, the script it executes fails because no `ceph-mgr` container is found.
To work around this problem, make sure that Docker is installed before using `ceph-ansible`. For details, see the https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_atomic_host/7/html-single/getting_started_with_containers/#getting_docker_in_rhel_7[Getting Docker in RHEL 7] section in the Getting Started with Containers guide for Red Hat Enterprise Linux Atomic Host 7.
Description of problem:
In a vanilla environment with no docker pre-installed, ceph-ansible fails to deploy because it tries to restart ceph-mgr containers when deploying the ceph-mon role. This attempt fails because no ceph-mgr container is there yet.
I believe the problem is here: https://github.com/ceph/ceph-ansible/blob/6320f3aac79e817e766fdcee66108d2204e182bd/roles/ceph-defaults/handlers/main.yml#L293
When docker is not installed, the docker ps command will yield "either you don't have docker-client or docker-client-common installed" as stdout. Since ceph-ansible is not checking the content of the output, but only whether there is any output at all, it misinterprets this result as a running container.
When the handler is run later during mon deployment, the script it executes fails because no ceph-mgr container is found.
Version-Release number of selected component (if applicable):
Deploy on systems in vanilla state - no docker installed.
During the deployment the handler called "restart ceph mgr daemon(s) - container" fails.
The handler should not execute merely because there is a line of output in the result of the docker ps -q --filter=... query. It should only execute if that output actually identifies a running container.
I believe it would be best if the check for a running container happened after docker is installed and were more thorough, i.e. if it checked whether the output is actually the ID of a container or something else.
I believe that this should be a known issue.
The text should be something like: Docker is required for ceph-ansible use with containers. See https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_atomic_host/7/html-single/getting_started_with_containers/#getting_docker_in_rhel_7 for information on how to install docker.
Seb, please confirm.
The issue does not seem valid to me. As you can see here, the role ceph-config is passing: https://2.jenkins.ceph.com/job/ceph-ansible-prs-luminous-ansible2.4-docker_cluster/147/console
This is running on master branch, so tag 3.0.15.
I confirm what Sebastien said. I couldn't reproduce this issue.
Here's how to reproduce:
1. vanilla RHEL 7 install on all nodes
2. on all nodes install cockpit-docker (which will install docker-common as a dependency)
3. run the installer with containerized install
When https://github.com/ceph/ceph-ansible/blob/5a10b048b01697a213e4516861fd5e39a6dea461/roles/ceph-defaults/tasks/check_socket_container.yml#L38 is run, the output from the command
docker ps -q --filter='name=ceph-mgr-...'
is:
"Cannot connect to the Docker daemon. Is the docker daemon running on this host?"
This line is misinterpreted as a running container by:
later. The check "ceph_mgr_container_stat.get('stdout_lines', [])|length != 0" needs to be rewritten to check for the existence of an actual container based on the content of stdout_lines.
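To illustrate the difference between the two kinds of check, here is a minimal Python sketch. It is not the ceph-ansible code and not the eventual fix. The function names are hypothetical, and it assumes that `docker ps -q` prints one lowercase hexadecimal container ID per line (12 characters by default, 64 with `--no-trunc`).

```python
import re

# Assumption: a container ID as printed by `docker ps -q` is 12 to 64
# lowercase hexadecimal characters and nothing else on the line.
CONTAINER_ID = re.compile(r"[0-9a-f]{12,64}")


def naive_check(stdout_lines):
    """Length-only condition: any line of output counts as a container."""
    return len(stdout_lines) != 0


def running_container_ids(stdout_lines):
    """Keep only the lines that look like container IDs."""
    stripped = (line.strip() for line in stdout_lines)
    return [line for line in stripped if CONTAINER_ID.fullmatch(line)]


def content_check(stdout_lines):
    """Content-based condition: at least one line must be a container ID."""
    return len(running_container_ids(stdout_lines)) != 0
```

With the message quoted in this report as the only line of output, `naive_check` returns True while `content_check` returns False. A hypothetical ID such as `3f2a9c1b7d4e` satisfies both.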
Indeed, I confirm this can cause issues.
Initially, we went with `ceph_mgr_container_stat.get('stdout_lines', [])|length != 0` because the `docker ps ...` CLI always returns 0, even when the name of the container doesn't exist.
In this particular case, I think we can get around this by adding a condition in roles/ceph-defaults/handlers/main.yml
upstream PR: https://github.com/ceph/ceph-ansible/pull/2214
Daniel, if you have still your environment, do you mind testing with this branch of ceph-ansible?
Bara, the doc points to an Atomic host OS, I don't think this is a general use case. Don't we have a doc that explains how to install Docker?
Thanks, the rest of the explanation looks good.
My bad, read too fast! lgtm
I deployed on systems in a vanilla state (no docker installed) and it works without failure. There were no issues with the handler called "restart ceph mgr daemon(s) - container". Deploying Red Hat Ceph Storage 2 as a Container Image was successful and the cluster health is HEALTH_OK.
However, if I follow C11 of this BZ and C3 of clone BZ 1528432 (for 3.0 Z1) to verify, I run into an error in the initial part of the playbook itself: TASK [ceph-defaults : generate cluster fsid] gets skipped. The two comments are more or less the same: both say to install the docker-common package on all nodes and then use ceph-ansible to initialize a containerized cluster, to check that it gets deployed without trying to restart the mgrs. I'm not sure what is causing the error, so I'm moving this back to the ASSIGNED state. Attaching the logs and details of the setup for reference.
(ansible admin node -- magna108)
*** all.yml ***
Version-Release number of selected component in both scenarios:
Created attachment 1388799 [details]
File contains the ansible-playbook log, all.yml, osds.yml, and hosts content
The last error mentioned isn't related to the initial issue reported here. I think a dedicated BZ should have been filed for this.
However, this should be fixed in https://github.com/ceph/ceph-ansible/pull/2365
By the way, I'm quite surprised to see this error only occurs on one node (magna108), the other nodes are fine. This brings me to the conclusion that there is something different on this node that is causing this.
Actually, we were testing with the wrong package installed.
We should have tested this with docker-client installed and not just docker-common.
We tested this with Veera and it passed.
I think we can move it to VERIFIED.
(In reply to Guillaume Abrioux from comment #30)
Following C11 of this BZ and C3/C5 of clone BZ 1528432 results in the playbook failing at the fsid generation task. The logs show that it needs the docker-client package installed instead of just the docker-common package.
So the steps to verify this scenario should be:
1. vanilla RHEL 7 install on all nodes
2. on all nodes install the docker-client package (which will install docker-common
as a dependency) instead of the cockpit-docker package or the docker-common package alone
3. configure ceph-ansible to initialize containerized cluster, run ansible
Following the above steps, the playbook ran with no issues and the cluster was deployed without trying to restart the mgrs. Hence moving to the VERIFIED state.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.