Bug 1510555
| Summary: | ceph-ansible does not properly check for running containers | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Daniel Messer <dmesser> |
| Component: | Ceph-Ansible | Assignee: | Guillaume Abrioux <gabrioux> |
| Status: | CLOSED ERRATA | QA Contact: | subhash <vpoliset> |
| Severity: | high | Docs Contact: | Erin Donnelly <edonnell> |
| Priority: | unspecified | | |
| Version: | 2.5 | CC: | adeza, aschoen, ceph-eng-bugs, dmesser, edonnell, gabrioux, gmeno, hnallurv, kdreyer, nthomas, sankarshan, shan, tserlin |
| Target Milestone: | rc | | |
| Target Release: | 2.5 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | RHEL: ceph-ansible-3.0.22-1.el7cp; Ubuntu: ceph-ansible_3.0.22-2redhat1 | Doc Type: | Known Issue |
| Doc Text: |
.`ceph-ansible` does not properly check for running containers
In an environment where the Docker application is not preinstalled, the `ceph-ansible` utility fails to deploy a Ceph Storage Cluster because it tries to restart `ceph-mgr` containers when deploying the `ceph-mon` role. This attempt fails because the `ceph-mgr` container is not deployed yet. In addition, the `docker ps` command returns the following error:
----
either you don't have docker-client or docker-client-common installed
----
Because `ceph-ansible` only checks whether the output of `docker ps` exists, and not what it contains, `ceph-ansible` misinterprets this error message as a running container. When the `ceph-ansible` handler is run later during Monitor deployment, the script it executes fails because no `ceph-mgr` container is found.
To work around this problem, make sure that Docker is installed before using `ceph-ansible`. For details, see the https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_atomic_host/7/html-single/getting_started_with_containers/#getting_docker_in_rhel_7[Getting Docker in RHEL 7] section in the Getting Started with Containers guide for Red Hat Enterprise Linux Atomic Host 7. (A minimal pre-flight sketch follows the summary table below.)
|
| Story Points: | --- | | |
| Clone Of: | | | |
| Clones: | 1528432 (view as bug list) | Environment: | |
| Last Closed: | 2018-02-21 19:44:55 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1528432 | | |
| Bug Blocks: | 1494421 | | |
| Attachments: | ansible-playbook log (all.yml, osds.yml, hosts content; see comments) | | |
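As a rough illustration of the workaround in the Doc Text above (install Docker before running `ceph-ansible`), a pre-flight play along the following lines could be run against all nodes first. This is only a sketch under the assumption of plain RHEL 7 hosts; the play is not part of ceph-ansible, and package and service names may differ elsewhere.

----
# sketch of a pre-flight play; assumes RHEL 7 hosts and is not part of ceph-ansible
- hosts: all
  become: true
  tasks:
    - name: ensure docker is installed
      package:
        name: docker
        state: present

    - name: ensure the docker daemon is running and enabled
      service:
        name: docker
        state: started
        enabled: true
----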
Description
Daniel Messer
2017-11-07 16:32:49 UTC
I believe that this should be a known issue. The text should be something like: "Docker is required for `ceph-ansible` use with containers; see https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_atomic_host/7/html-single/getting_started_with_containers/#getting_docker_in_rhel_7 for information on how to install Docker." Seb, please confirm.

The issue does not seem valid to me, as you can see here: https://2.jenkins.ceph.com/job/ceph-ansible-prs-luminous-ansible2.4-docker_cluster/147/console. The role ceph-config is passing. This is running on the master branch, so tag 3.0.15.

I confirm what Sebastien said. I couldn't reproduce this issue.

Here's how to reproduce:

1. Vanilla RHEL 7 install on all nodes.
2. On all nodes, install cockpit-docker (which will install docker-common as a dependency).
3. Run the installer with a containerized install.

When https://github.com/ceph/ceph-ansible/blob/5a10b048b01697a213e4516861fd5e39a6dea461/roles/ceph-defaults/tasks/check_socket_container.yml#L38 is run, the output from the command `docker ps -q --filter='name=ceph-mgr-...'` is:

"Cannot connect to the Docker daemon. Is the docker daemon running on this host?"

This line is later misinterpreted as a running container by https://github.com/ceph/ceph-ansible/blob/6320f3aac79e817e766fdcee66108d2204e182bd/roles/ceph-defaults/handlers/main.yml#L293. The check `ceph_mgr_container_stat.get('stdout_lines', [])|length != 0` needs to be rewritten to check for the existence of an actual container based on the content of stdout_lines.

Hi Daniel,
indeed, I confirm this can cause issues.
Initially, we went with `ceph_mgr_container_stat.get('stdout_lines', [])|length != 0` because the `docker ps ...` CLI always returns 0, even when no container with that name exists.
In this particular case, I think we can get around this by adding a condition in roles/ceph-defaults/handlers/main.yml; a rough sketch of what such a condition could look like is shown after this comment.
upstream PR: https://github.com/ceph/ceph-ansible/pull/2214
Daniel, if you still have your environment, would you mind testing with this branch of ceph-ansible?
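Purely as an illustration of that idea, and not the actual change proposed in the PR above, a content-aware version of the check and the handler condition might look roughly like the sketch below. The task layout, the hostname-based filter, and the `/tmp/ceph-mgr-restart.sh` script path are assumptions made up for this example; only the `ceph_mgr_container_stat` variable, the `docker ps -q --filter` command, and the handler name "restart ceph mgr daemon(s) - container" come from this bug report.

----
# sketch only: the check task and the handler condition are compressed into
# one place here, which is not the real ceph-ansible layout
- name: check for a running ceph-mgr container
  command: "docker ps -q --filter=name=ceph-mgr-{{ ansible_hostname }}"
  register: ceph_mgr_container_stat
  changed_when: false
  failed_when: false

- name: restart ceph mgr daemon(s) - container
  command: /tmp/ceph-mgr-restart.sh   # hypothetical restart script path
  when:
    # the return code is typically non-zero when the Docker client or daemon
    # is unavailable
    - ceph_mgr_container_stat.get('rc', 1) == 0
    # `docker ps -q` prints bare container IDs; an error message such as
    # "Cannot connect to the Docker daemon ..." does not match this pattern.
    # (`docker ps` itself exits 0 even when no container matches the filter,
    # so the content check is what matters for the "no container yet" case.)
    - ceph_mgr_container_stat.get('stdout', '') is match('^[0-9a-f]+$')
----

Compared with checking only `stdout_lines | length`, an error message printed when Docker is missing no longer counts as a running container.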
Bara, the doc points to an Atomic Host OS; I don't think this is a general use case. Don't we have a doc that explains how to install Docker? Thanks, the rest of the explanation looks good.

My bad, read too fast! lgtm

Hi, I've deployed on systems in a vanilla state (no Docker installed) and it works fine without failures. There are no issues with the handler called "restart ceph mgr daemon(s) - container". Deploying Red Hat Ceph Storage 2 as a container image was successful and the cluster reports HEALTH_OK. Whereas if I follow C11 of this bz and C3 of clone BZ 1528432 (for 3.0 z1) to verify (which are more or less the same; both mention having the docker-common package installed on all nodes and then running ceph-ansible to initialize a containerized cluster, to check that it gets deployed without trying to restart the mgrs), I run into an error in the initial part of the playbook itself: TASK [ceph-defaults : generate cluster fsid] gets skipped. Not sure what is causing it; moving back to the assigned state. Attaching the logs and details of the setup for reference. (ansible admin node: magna108)

*** inventory file ***

[mons]
magna004
magna009
magna011

[osds]
magna034
magna083
magna014

*** all.yml ***

fetch_directory: ~/ceph-ansible-keys
ceph_origin: distro
ceph_repository: rhcs
monitor_interface: eno1
public_network: 10.8.128.0/21
docker: true
ceph_docker_image: "rhceph"
ceph_docker_image_tag: "ceph-2-rhel-7-docker-candidate-56015-20180126175803"
ceph_docker_registry: "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888"
containerized_deployment: true

Version-Release number of selected component in both scenarios: ceph-ansible-3.0.21-1.el7cp.noarch, ceph-2-rhel-7-docker-candidate-56015-20180126175803

Created attachment 1388799 [details]
File contains the ansible-playbook log, all.yml, osds.yml, and hosts content
ansible-playbook log
The last error mentioned isn't related to the initial issue reported here; I think a dedicated BZ should have been filed for it. However, this should be fixed in https://github.com/ceph/ceph-ansible/pull/2365. By the way, I'm quite surprised to see that this error only occurs on one node (magna108); the other nodes are fine. This brings me to the conclusion that there is something different on this node that is causing it.

Hi all,

actually we were testing with the wrong package installed. We should have tested this with ceph-client installed and not just ceph-common.

We tested this with Veera and it passed.

I think we can move it to VERIFIED.

(In reply to Guillaume Abrioux from comment #30)
> Hi all,
>
> actually we were testing with the wrong package installed.
> We should have tested this with ceph-client installed and not just
> ceph-common.
>
> We tested this with Veera and it passed.
>
> I think we can move it to VERIFIED.

Following C11 of this bz and C3/C5 of clone BZ 1528432 will result in the playbook failing at the fsid generation task. From the logs it was found that it needs the docker-client package installed instead of just the docker-common package. So the right way to test this scenario is:

1. Vanilla RHEL 7 install on all nodes.
2. On all nodes, install the docker-client package (which will install docker-common as a dependency) instead of the cockpit-docker package or the docker-common package alone. (A minimal sketch of this step is included at the end of this report.)
3. Configure ceph-ansible to initialize a containerized cluster and run the ansible playbook site-docker.yml.

Following the above steps, the playbook ran with no issues and the cluster gets deployed without trying to restart the mgrs. Hence moving to the VERIFIED state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0340
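For reference, here is a minimal sketch of step 2 of the verification steps above (pre-installing the docker-client package on all nodes before running site-docker.yml). The play itself is an assumption for illustration only and is not part of ceph-ansible or the official test procedure.

----
# sketch: install docker-client (which pulls in docker-common as a dependency)
# on all nodes before running site-docker.yml
- hosts: all
  become: true
  tasks:
    - name: install the docker-client package
      yum:
        name: docker-client
        state: present
----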