1510555 – ceph-ansible does not properly check for running containers

Summary: ceph-ansible does not properly check for running containers

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Ceph Storage
Classification:	Red Hat Storage
Component:	Ceph-Ansible
Sub Component:
Version:	2.5
Hardware:	All
OS:	All
Priority:	unspecified
Severity:	high
Target Milestone:	rc
Target Release:	2.5
Assignee:	Guillaume Abrioux
QA Contact:	subhash
Docs Contact:	Erin Donnelly
URL:
Whiteboard:
Depends On:	1528432
Blocks:	1494421

Reported:	2017-11-07 16:32 UTC by Daniel Messer
Modified:	2018-02-21 19:44 UTC
CC List:	13 users
Fixed In Version:	RHEL: ceph-ansible-3.0.22-1.el7cp Ubuntu: ceph-ansible_3.0.22-2redhat1
Doc Type:	Known Issue
Doc Text:	.`ceph-ansible` does not properly check for running containers

In an environment where the Docker application is not preinstalled, the `ceph-ansible` utility fails to deploy a Ceph Storage Cluster because it tries to restart `ceph-mgr` containers when deploying the `ceph-mon` role. This attempt fails because the `ceph-mgr` container is not deployed yet. In addition, the `docker ps` command returns the following error:

----
either you don't have docker-client or docker-client-common installed
----

Because `ceph-ansible` only checks whether `docker ps` produced any output, and not what that output contains, `ceph-ansible` mistakes this result for a running container. When the `ceph-ansible` handler is run later during Monitor deployment, the script it executes fails because no `ceph-mgr` container is found.

To work around this problem, make sure that Docker is installed before using `ceph-ansible`. For details, see the https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_atomic_host/7/html-single/getting_started_with_containers/#getting_docker_in_rhel_7[Getting Docker in RHEL 7] section in the Getting Started with Containers guide for Red Hat Enterprise Linux Atomic Host 7.
Clone Of:
Clones:	1528432
Environment:
Last Closed:	2018-02-21 19:44:55 UTC
Embargoed:
Dependent Products:

Attachments
File contains ansible-playbook log, all.yml, osds.yml, hosts content (72.71 KB, text/plain) 2018-01-31 08:29 UTC, subhash

Links
System	ID	Priority	Status	Summary	Last Updated
Github	ceph ceph-ansible pull 2214	'None'	closed	handlers: restart daemons only if docker is running	2020-12-08 01:50:28 UTC
Github	ceph ceph-ansible pull 2365	'None'	closed	Fix 1510555	2020-12-08 01:50:26 UTC
Red Hat Product Errata	RHBA-2018:0340	normal	SHIPPED_LIVE	Red Hat Ceph Storage 2.5 bug fix and enhancement update	2018-02-22 00:50:32 UTC

Description Daniel Messer 2017-11-07 16:32:49 UTC

Description of problem:

In a vanilla environment with no docker preinstalled, ceph-ansible fails to deploy because it tries to restart ceph-mgr containers when deploying the ceph-mon role. This attempt fails because no ceph-mgr container exists yet.

I believe the problem is here: https://github.com/ceph/ceph-ansible/blob/6320f3aac79e817e766fdcee66108d2204e182bd/roles/ceph-defaults/handlers/main.yml#L293

and here:
https://github.com/ceph/ceph-ansible/blob/5a10b048b01697a213e4516861fd5e39a6dea461/roles/ceph-defaults/tasks/check_socket_container.yml#L38

When docker is not installed, the docker ps command yields "either you don't have docker-client or docker-client-common installed" on stdout. Because ceph-ansible checks only whether there is any output, and not what the output contains, it mistakes this message for a running container.
When the handler runs later during mon deployment, the script it executes fails because no ceph-mgr container is found.

Version-Release number of selected component (if applicable):

ceph-ansible-3.0.9-1.el7cp
ceph-3.0-rhel-7-docker-candidate-61072-20171104225422

How reproducible:

Deploy on systems in vanilla state - no docker installed.

Actual results:

During the deployment the handler called "restart ceph mgr daemon(s) - container" fails.

Expected results:

The handler should not execute merely because the docker ps -q --filter=... query returned a line of output. It should execute only if that output actually identifies a running container.

Additional info:

I believe it would be best if the check for a running container happened after docker is installed and were more thorough, i.e. if it checked whether the output is actually the ID of a container or something else.
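
A minimal sketch of the more thorough check suggested above, written in Python rather than as an Ansible condition. It assumes that `docker ps -q` prints one container ID per line, 12 hexadecimal characters by default or 64 with `--no-trunc`; the function name and the sample ID are hypothetical and do not come from ceph-ansible.

```python
import re
from typing import List

# Assumption: a container ID as printed by `docker ps -q` is 12 hex
# characters (short form) or 64 hex characters (with --no-trunc).
_CONTAINER_ID = re.compile(r"^(?:[0-9a-f]{12}|[0-9a-f]{64})$")


def running_container_ids(stdout: str) -> List[str]:
    """Keep only the output lines that have the shape of a container ID."""
    stripped = (line.strip() for line in stdout.splitlines())
    return [line for line in stripped if _CONTAINER_ID.match(line)]


# The message quoted in this report is not mistaken for a container.
error_output = "either you don't have docker-client or docker-client-common installed"
print(running_container_ids(error_output))      # []

# A made-up short ID is accepted.
print(running_container_ids("3f2a9c1b7d4e\n"))  # ['3f2a9c1b7d4e']
```

With a filter like this, a restart would be considered only when the returned list is non-empty, which is a stricter test than "any output at all".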

Comment 5 Christina Meno 2017-11-23 00:06:27 UTC

I believe that this should be a known issue.
The text should be something like: Docker is required for ceph-ansible use with containers; see https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_atomic_host/7/html-single/getting_started_with_containers/#getting_docker_in_rhel_7 for information on how to install docker.

Seb. please confirm

Comment 7 Sébastien Han 2017-11-23 12:38:19 UTC

The issue does not seem valid to me. As you can see here, the ceph-config role is passing: https://2.jenkins.ceph.com/job/ceph-ansible-prs-luminous-ansible2.4-docker_cluster/147/console
This is running on master branch, so tag 3.0.15.

Comment 9 Guillaume Abrioux 2017-11-23 13:34:45 UTC

I confirm what Sebastien said. I couldn't reproduce this issue.

Comment 11 Daniel Messer 2017-11-27 12:28:01 UTC

Here's how to reproduce:

1. vanilla RHEL 7 install on all nodes
2. on all nodes install cockpit-docker (which will install docker-common as a dependency)
3. run the installer with containerized install

When https://github.com/ceph/ceph-ansible/blob/5a10b048b01697a213e4516861fd5e39a6dea461/roles/ceph-defaults/tasks/check_socket_container.yml#L38 is run the output from the command:

docker ps -q --filter='name=ceph-mgr-...'

is:

"Cannot connect to the Docker daemon. Is the docker daemon running on this host?"

This line is misinterpreted as a running container by: 

https://github.com/ceph/ceph-ansible/blob/6320f3aac79e817e766fdcee66108d2204e182bd/roles/ceph-defaults/handlers/main.yml#L293

later. The check "ceph_mgr_container_stat.get('stdout_lines', [])|length != 0" needs to be rewritten to check for the existence of an actual container based on the content of stdout_lines.
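
To illustrate why the quoted condition misfires, here is a small Python model of it. The dictionaries are hypothetical stand-ins for the result that Ansible registers as `ceph_mgr_container_stat`, and the container ID is made up; only the `stdout_lines` key and the error message are taken from this report.

```python
def handler_would_fire(stat: dict) -> bool:
    """Python equivalent of: stat.get('stdout_lines', [])|length != 0"""
    return len(stat.get("stdout_lines", [])) != 0


# Hypothetical registered results for three situations.
cases = {
    "no matching container": {"stdout_lines": []},
    "daemon not reachable": {
        "stdout_lines": [
            "Cannot connect to the Docker daemon. "
            "Is the docker daemon running on this host?"
        ]
    },
    "container running": {"stdout_lines": ["3f2a9c1b7d4e"]},
}

for label, stat in cases.items():
    print(label, "->", handler_would_fire(stat))
# no matching container -> False
# daemon not reachable -> True
# container running -> True
```

The second case is the one described in this comment: a single line of error text is enough to satisfy a length-only test.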

Comment 12 Guillaume Abrioux 2017-11-27 14:17:31 UTC

Hi Daniel,

indeed, I confirm this can cause issues.

Initially, we went with `ceph_mgr_container_stat.get('stdout_lines', [])|length != 0` because the `docker ps ...` CLI always returns 0, even when no container with that name exists.
In this particular case, I think we can get around this by adding a condition in roles/ceph-defaults/handlers/main.yml
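
One way to combine both signals, sketched in Python under stated assumptions: because the exit status alone cannot distinguish "no such container" from "container running", the sketch checks the exit status and then also inspects the output. This is an illustration only and is not claimed to match what the upstream PR does. The function names, the `run` parameter, and the `ceph-mgr-example` filter value are hypothetical.

```python
import subprocess

HEX_DIGITS = set("0123456789abcdef")


def is_container_id(line):
    """True if the line has the shape of a container ID (12 or 64 hex chars)."""
    line = line.strip()
    return len(line) in (12, 64) and set(line) <= HEX_DIGITS


def container_running(name_filter, run=subprocess.run):
    """Check exit status and output content, not just whether output exists.

    `run` is injectable so the logic can be exercised without docker.
    """
    cmd = ["docker", "ps", "-q", "--filter", "name=" + name_filter]
    try:
        proc = run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        return False  # no docker binary on this host
    if proc.returncode != 0:
        return False  # docker itself reported a failure
    return any(is_container_id(line) for line in proc.stdout.splitlines())


# Stub runner: exit status 0 (an assumption made for this example) with a
# message on stdout instead of a container ID.
def fake_run(cmd, **kwargs):
    message = ("Cannot connect to the Docker daemon. "
               "Is the docker daemon running on this host?\n")
    return subprocess.CompletedProcess(cmd, 0, stdout=message, stderr="")


print(container_running("ceph-mgr-example", run=fake_run))  # False
```

Passing the runner as a parameter keeps the decision logic testable on hosts where docker is absent, which is exactly the situation this report is about.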

upstream PR: https://github.com/ceph/ceph-ansible/pull/2214

Daniel, if you still have your environment, do you mind testing with this branch of ceph-ansible?

Comment 15 Sébastien Han 2017-11-27 16:54:11 UTC

Bara, the doc points to an Atomic host OS, I don't think this is a general use case. Don't we have a doc that explains how to install Docker?

Thanks, the rest of the explanation looks good.

Comment 17 Sébastien Han 2017-11-28 09:29:34 UTC

My bad, read too fast! lgtm

Comment 21 subhash 2018-01-30 19:19:23 UTC

Hi

I've deployed on systems in a vanilla state (no docker installed) and it works without failure. There are no issues with the handler called "restart ceph mgr daemon(s) - container". Deploying Red Hat Ceph Storage 2 as a container image was successful and the cluster reports HEALTH_OK.

However, if I follow C11 of this BZ and C3 of clone BZ 1528432 (for 3.0 z1) to verify, I run into an error in the initial part of the playbook itself: TASK [ceph-defaults : generate cluster fsid] gets skipped. (The two sets of steps are more or less the same: both call for the docker-common package to be installed on all nodes, and then for ceph-ansible to initialize a containerized cluster, to check that it gets deployed without trying to restart mgrs.) I'm not sure what is causing it, so I'm moving this back to the ASSIGNED state. Attaching the logs and details of the setup for reference.

(ansible admin node -- magna108)

*** inventory file ***

[mons]
magna004
magna009
magna011

[osds]
magna034
magna083
magna014

*** all.yml ***

fetch_directory: ~/ceph-ansible-keys
ceph_origin: distro
ceph_repository: rhcs
monitor_interface: eno1
public_network: 10.8.128.0/21
docker: true
ceph_docker_image: "rhceph"
ceph_docker_image_tag: "ceph-2-rhel-7-docker-candidate-56015-20180126175803"
ceph_docker_registry: "brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888"
containerized_deployment: true 

Version-Release number of selected component in both scenarios:
ceph-ansible-3.0.21-1.el7cp.noarch,
ceph-2-rhel-7-docker-candidate-56015-20180126175803

Comment 22 subhash 2018-01-31 08:29:40 UTC

Created attachment 1388799
File contains ansible-playbook log, all.yml, osds.yml, hosts content

ansible-playbook log

Comment 23 Guillaume Abrioux 2018-01-31 08:47:45 UTC

The last error mentioned isn't related to the initial issue reported here. I think a dedicated BZ should have been filed for it.
However, this should be fixed in https://github.com/ceph/ceph-ansible/pull/2365

Comment 24 Guillaume Abrioux 2018-01-31 09:05:34 UTC

By the way, I'm quite surprised to see this error only occurs on one node (magna108), the other nodes are fine. This brings me to the conclusion that there is something different on this node that is causing this.

Comment 30 Guillaume Abrioux 2018-02-01 11:42:46 UTC

Hi all,

actually we were testing with the wrong package installed.
We should have tested this with ceph-client installed and not just ceph-common.

We tested this with Veera and it passed.

I think we can move it to VERIFIED.

Comment 31 subhash 2018-02-01 12:09:58 UTC

(In reply to Guillaume Abrioux from comment #30)
> Hi all,
> 
> actually we were testing with the wrong package installed.
> We should have tested this with ceph-client installed and not just
> ceph-common.
> 
> We tested this with Veera and it passed.
> 
> I think we can move it to VERIFIED.

Following C11 of this BZ and C3/C5 of clone BZ 1528432 results in the playbook failing at the fsid generation task. The logs show that it needs the docker-client package installed instead of just the docker-common package.

So the steps to verify this scenario should be:

1. vanilla RHEL 7 install on all nodes
2. on all nodes, install the docker-client package (which will install docker-common
   as a dependency) instead of the cockpit-docker package or the docker-common package alone
3. configure ceph-ansible to initialize a containerized cluster and run the ansible
   playbook site-docker.yml

Following the above steps, the playbook ran with no issues and the cluster was deployed without trying to restart mgrs. Hence moving to VERIFIED state.

Comment 38 errata-xmlrpc 2018-02-21 19:44:55 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0340
