Description of problem:
The cluster was installed using the Ubuntu ISO. When I run purge-cluster.yml it fails with:

TASK: [check for anything running ceph] ***************************************
failed: [magna046] => {"changed": true, "cmd": "ps awux | grep -v grep | grep -q -- ceph-", "delta": "0:00:00.022172", "end": "2016-05-25 10:52:20.059321", "failed": true, "failed_when_result": true, "rc": 0, "start": "2016-05-25 10:52:20.037149", "stdout_lines": [], "warnings": []}
stderr: grep: write error

Version-Release number of selected component (if applicable):
ceph-ansible: 1.0.5-16.el7scon.noarch
ceph: 10.2.1-4redhat1xenial

How reproducible:
Always

Steps to Reproduce:
1. Install a one-node cluster using the Ubuntu ISO.
2. Run the purge-cluster.yml playbook.
3. The purge fails.

Actual results:
The purge fails.

Expected results:
The purge is expected to pass.

Additional info:
The group_vars files are located at magna006:/root/bz/purgefail. The playbook log is pasted below.

root@magna046:~# ceph -s --cluster ceph-cluster12
    cluster 1ed9d25b-c637-4fae-b443-dd7c687615a6
     health HEALTH_ERR
            320 pgs are stuck inactive for more than 300 seconds
            320 pgs degraded
            320 pgs stuck inactive
            320 pgs undersized
     monmap e1: 1 mons at {magna046=10.8.128.46:6789/0}
            election epoch 3, quorum 0 magna046
      fsmap e2: 0/0/1 up
     osdmap e12: 2 osds: 2 up, 2 in
            flags sortbitwise
      pgmap v27: 320 pgs, 3 pools, 0 bytes data, 0 objects
            71984 kB used, 1852 GB / 1852 GB avail
                 320 undersized+degraded+peered

root@magna046:~# dpkg -l | grep ceph
ii  ceph-base       10.2.1-4redhat1xenial  amd64  common ceph daemon libraries and management tools
ii  ceph-common     10.2.1-4redhat1xenial  amd64  common utilities to mount and interact with a ceph storage cluster
ii  ceph-fs-common  10.2.1-4redhat1xenial  amd64  common utilities to mount and interact with a ceph file system
ii  ceph-fuse       10.2.1-4redhat1xenial  amd64  FUSE-based client for the Ceph distributed file system
ii  ceph-mds        10.2.1-4redhat1xenial  amd64  metadata server for the ceph distributed file system
ii  ceph-mon        10.2.1-4redhat1xenial  amd64  monitor server for the ceph storage system
ii  ceph-osd        10.2.1-4redhat1xenial  amd64  OSD server for the ceph storage system
ii  libcephfs1     10.2.1-4redhat1xenial  amd64  Ceph distributed file system client library
ii  python-cephfs  10.2.1-4redhat1xenial  amd64  Python libraries for the Ceph libcephfs library

root@magna046:~# ps -ef | grep ceph
ceph  16523      1  0 05:50 ?      00:00:08 /usr/bin/ceph-mon -f --cluster ceph-cluster12 --id magna046 --setuser ceph --setgroup ceph
ceph  18181      1  0 05:50 ?      00:00:42 /usr/bin/ceph-osd -f --cluster ceph-cluster12 --id 0 --setuser ceph --setgroup ceph
ceph  18741      1  0 05:51 ?      00:00:30 /usr/bin/ceph-osd -f --cluster ceph-cluster12 --id 1 --setuser ceph --setgroup ceph
root  27985  13011  0 11:17 pts/0  00:00:00 grep --color=auto ceph

[root@magna006 ceph-ansible]# ansible-playbook purge-cluster.yml
Are you sure you want to purge the cluster? [no]: yes

PLAY [confirm whether user really meant to purge the cluster] *****************

GATHERING FACTS ***************************************************************
ok: [localhost]

TASK: [exit playbook, if user did not mean to purge cluster] ******************
skipping: [localhost]

PLAY [stop ceph cluster] ******************************************************

GATHERING FACTS ***************************************************************
ok: [magna046]

TASK: [check for a device list] ***********************************************
skipping: [magna046]

TASK: [get osd numbers] *******************************************************
ok: [magna046]

TASK: [are we using systemd] **************************************************
changed: [magna046]

TASK: [stop ceph.target with systemd] *****************************************
skipping: [magna046]

TASK: [stop ceph-osd with systemd] ********************************************
skipping: [magna046] => (item=cluster12)
skipping: [magna046] => (item=cluster12)

TASK: [stop ceph mons with systemd] *******************************************
skipping: [magna046]

TASK: [stop ceph mdss with systemd] *******************************************
skipping: [magna046]

TASK: [stop ceph rgws with systemd] *******************************************
skipping: [magna046]

TASK: [stop ceph rbd mirror with systemd] *************************************
skipping: [magna046]

TASK: [stop ceph osds] ********************************************************
skipping: [magna046]

TASK: [stop ceph mons] ********************************************************
skipping: [magna046]

TASK: [stop ceph mdss] ********************************************************
skipping: [magna046]

TASK: [stop ceph rgws] ********************************************************
skipping: [magna046]

TASK: [stop ceph osds on ubuntu] **********************************************
changed: [magna046] => (item=cluster12)
changed: [magna046] => (item=cluster12)

TASK: [stop ceph mons on ubuntu] **********************************************
ok: [magna046]

TASK: [stop ceph mdss on ubuntu] **********************************************
skipping: [magna046]

TASK: [stop ceph rgws on ubuntu] **********************************************
skipping: [magna046]

TASK: [stop ceph rbd mirror on ubuntu] ****************************************
skipping: [magna046]

TASK: [check for anything running ceph] ***************************************
failed: [magna046] => {"changed": true, "cmd": "ps awux | grep -v grep | grep -q -- ceph-", "delta": "0:00:00.022172", "end": "2016-05-25 10:52:20.059321", "failed": true, "failed_when_result": true, "rc": 0, "start": "2016-05-25 10:52:20.037149", "stdout_lines": [], "warnings": []}
stderr: grep: write error

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/purge-cluster.retry

localhost  : ok=1  changed=0  unreachable=0  failed=0
magna046   : ok=6  changed=2  unreachable=0  failed=1
@tejas can you try running the grep command manually when this failure happens? I wasn't able to reproduce it locally: ps awux | grep -v grep | grep -q -- ceph-
hi Alfredo,
When I run the command manually there is no output:

root@magna052:~# ps awux | grep -v grep | grep -q -- ceph-
root@magna052:~#

However, the ceph processes are running on the system:

root@magna052:~# ps awux | grep ceph-
ceph  21422  0.1  0.1  352988 48352 ?     Ssl  05:55  0:00 /usr/bin/ceph-mon -f --cluster ceph-cluster12 --id magna052 --setuser ceph --setgroup ceph
ceph  23214  5.8  0.1  850816 45836 ?     Ssl  05:55  0:30 /usr/bin/ceph-osd -f --cluster ceph-cluster12 --id 0 --setuser ceph --setgroup ceph
ceph  23814  3.0  0.1  848780 39468 ?     Ssl  05:56  0:15 /usr/bin/ceph-osd -f --cluster ceph-cluster12 --id 1 --setuser ceph --setgroup ceph
ceph  24359  1.7  0.1  846732 38504 ?     Ssl  05:56  0:08 /usr/bin/ceph-osd -f --cluster ceph-cluster12 --id 2 --setuser ceph --setgroup ceph
root  25366  0.0  0.0   16576  2096 pts/0 S+   06:04  0:00 grep --color=auto ceph-

Maybe "ps awux | grep -v grep | grep -q -- ceph-" is not able to see the ceph processes?

Thanks,
Tejas
@tejas the "-q" flag in grep silences output; it doesn't mean that grep "is not able to see ceph processes". Did you run "ps awux | grep -v grep | grep -q -- ceph-" when the failure was triggered? (This is a requirement.)
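For context, "grep -q" prints nothing whether or not it matches; the result is reported only through the exit status, so an empty terminal after the command is expected. A quick sketch (the sample input strings are illustrative):

```shell
# `grep -q` suppresses all output; only the exit status signals a match.
printf 'ceph-mon\nceph-osd\n' | grep -q -- ceph-
echo "match: $?"      # exit status 0: a match was found
printf 'nothing\n' | grep -q -- ceph-
echo "no match: $?"   # exit status 1: no match
```

So checking $? (or Ansible's "rc") right after the command is the only way to see what it found.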
*** Bug 1359539 has been marked as a duplicate of this bug. ***
After investigating this a bit, I found that the "-q" flag in the last portion of the shell command is causing the error. If it is removed, the command does not fail and works as expected. The idea behind the "-q" flag is that grep exits immediately at the first match without sending any output to stdout. Since the flag is causing this (reproducible) issue, the code change will simply remove it.
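A sketch of the mechanism, under my reading of the thread: because "-q" makes the last grep exit at its first match, the preceding "grep -v grep" can be left writing into a closed pipe. Run interactively, SIGPIPE kills that grep silently (which would explain why the manual runs showed nothing), but when SIGPIPE is ignored, as it can be for commands spawned by Ansible, the write fails with EPIPE and grep prints "grep: write error" to stderr. The large synthetic input below is an illustrative assumption, not the real ps output:

```shell
# With -q, the final grep exits at the first match; if SIGPIPE is ignored
# in this environment, the upstream grep may print "grep: write error"
# to stderr while still writing the remaining lines:
seq 1 100000 | sed 's/^/ceph-mon line /' | grep -v grep | grep -q -- ceph-

# Without -q (the removal proposed in the fix), the pipe is drained to the
# end, so the upstream grep finishes cleanly; stdout is simply discarded:
seq 1 100000 | sed 's/^/ceph-mon line /' | grep -v grep | grep -- ceph- > /dev/null
echo "exit status without -q: $?"
```

Either variant still exits 0 when a ceph- process line is present, so the task's failed_when logic is unchanged by dropping "-q".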
*** Bug 1359537 has been marked as a duplicate of this bug. ***
Pull request opened: https://github.com/ceph/ceph-ansible/pull/900
This will ship concurrently with RHCS 2.1.
*** Bug 1354687 has been marked as a duplicate of this bug. ***
This issue was raised against the older ceph-ansible. With ceph-ansible 2.1 I have not seen this issue on Ubuntu, so I will mark this verified. Thanks, Tejas
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:0515