Description of problem:
The cluster was installed using the Ubuntu ISO. When I run purge-cluster.yml it fails with:

TASK: [check for anything running ceph] ***************************************
failed: [magna046] => {"changed": true, "cmd": "ps awux | grep -v grep | grep -q -- ceph-", "delta": "0:00:00.022172", "end": "2016-05-25 10:52:20.059321", "failed": true, "failed_when_result": true, "rc": 0, "start": "2016-05-25 10:52:20.037149", "stdout_lines": [], "warnings": []}
stderr: grep: write error

Version-Release number of selected component (if applicable):
ceph-ansible: 1.0.5-16.el7scon.noarch
ceph: 10.2.1-4redhat1xenial

How reproducible:
Always

Steps to Reproduce:
1. Install a one-node cluster using the Ubuntu ISO.
2. Run the purge-cluster.yml playbook.
3. The purge fails.

Actual results:
The purge fails.

Expected results:
The purge is expected to pass.

Additional info:
The group_vars files are located at magna006:/root/bz/purgefail. The playbook log is pasted below.

root@magna046:~# ceph -s --cluster ceph-cluster12
    cluster 1ed9d25b-c637-4fae-b443-dd7c687615a6
     health HEALTH_ERR
            320 pgs are stuck inactive for more than 300 seconds
            320 pgs degraded
            320 pgs stuck inactive
            320 pgs undersized
     monmap e1: 1 mons at {magna046=10.8.128.46:6789/0}
            election epoch 3, quorum 0 magna046
      fsmap e2: 0/0/1 up
     osdmap e12: 2 osds: 2 up, 2 in
            flags sortbitwise
      pgmap v27: 320 pgs, 3 pools, 0 bytes data, 0 objects
            71984 kB used, 1852 GB / 1852 GB avail
                 320 undersized+degraded+peered

root@magna046:~# dpkg -l | grep ceph
ii  ceph-base       10.2.1-4redhat1xenial  amd64  common ceph daemon libraries and management tools
ii  ceph-common     10.2.1-4redhat1xenial  amd64  common utilities to mount and interact with a ceph storage cluster
ii  ceph-fs-common  10.2.1-4redhat1xenial  amd64  common utilities to mount and interact with a ceph file system
ii  ceph-fuse       10.2.1-4redhat1xenial  amd64  FUSE-based client for the Ceph distributed file system
ii  ceph-mds        10.2.1-4redhat1xenial  amd64  metadata server for the ceph distributed file system
ii  ceph-mon        10.2.1-4redhat1xenial  amd64  monitor server for the ceph storage system
ii  ceph-osd        10.2.1-4redhat1xenial  amd64  OSD server for the ceph storage system
ii  libcephfs1     10.2.1-4redhat1xenial  amd64  Ceph distributed file system client library
ii  python-cephfs  10.2.1-4redhat1xenial  amd64  Python libraries for the Ceph libcephfs library

root@magna046:~# ps -ef | grep ceph
ceph  16523      1  0 05:50 ?      00:00:08 /usr/bin/ceph-mon -f --cluster ceph-cluster12 --id magna046 --setuser ceph --setgroup ceph
ceph  18181      1  0 05:50 ?      00:00:42 /usr/bin/ceph-osd -f --cluster ceph-cluster12 --id 0 --setuser ceph --setgroup ceph
ceph  18741      1  0 05:51 ?      00:00:30 /usr/bin/ceph-osd -f --cluster ceph-cluster12 --id 1 --setuser ceph --setgroup ceph
root  27985  13011  0 11:17 pts/0  00:00:00 grep --color=auto ceph

[root@magna006 ceph-ansible]# ansible-playbook purge-cluster.yml
Are you sure you want to purge the cluster? [no]: yes

PLAY [confirm whether user really meant to purge the cluster] *****************

GATHERING FACTS ***************************************************************
ok: [localhost]

TASK: [exit playbook, if user did not mean to purge cluster] ******************
skipping: [localhost]

PLAY [stop ceph cluster] ******************************************************

GATHERING FACTS ***************************************************************
ok: [magna046]

TASK: [check for a device list] ***********************************************
skipping: [magna046]

TASK: [get osd numbers] *******************************************************
ok: [magna046]

TASK: [are we using systemd] **************************************************
changed: [magna046]

TASK: [stop ceph.target with systemd] *****************************************
skipping: [magna046]

TASK: [stop ceph-osd with systemd] ********************************************
skipping: [magna046] => (item=cluster12)
skipping: [magna046] => (item=cluster12)

TASK: [stop ceph mons with systemd] *******************************************
skipping: [magna046]

TASK: [stop ceph mdss with systemd] *******************************************
skipping: [magna046]

TASK: [stop ceph rgws with systemd] *******************************************
skipping: [magna046]

TASK: [stop ceph rbd mirror with systemd] *************************************
skipping: [magna046]

TASK: [stop ceph osds] ********************************************************
skipping: [magna046]

TASK: [stop ceph mons] ********************************************************
skipping: [magna046]

TASK: [stop ceph mdss] ********************************************************
skipping: [magna046]

TASK: [stop ceph rgws] ********************************************************
skipping: [magna046]

TASK: [stop ceph osds on ubuntu] **********************************************
changed: [magna046] => (item=cluster12)
changed: [magna046] => (item=cluster12)

TASK: [stop ceph mons on ubuntu] **********************************************
ok: [magna046]

TASK: [stop ceph mdss on ubuntu] **********************************************
skipping: [magna046]

TASK: [stop ceph rgws on ubuntu] **********************************************
skipping: [magna046]

TASK: [stop ceph rbd mirror on ubuntu] ****************************************
skipping: [magna046]

TASK: [check for anything running ceph] ***************************************
failed: [magna046] => {"changed": true, "cmd": "ps awux | grep -v grep | grep -q -- ceph-", "delta": "0:00:00.022172", "end": "2016-05-25 10:52:20.059321", "failed": true, "failed_when_result": true, "rc": 0, "start": "2016-05-25 10:52:20.037149", "stdout_lines": [], "warnings": []}
stderr: grep: write error

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/purge-cluster.retry

localhost  : ok=1  changed=0  unreachable=0  failed=0
magna046   : ok=6  changed=2  unreachable=0  failed=1
@tejas can you try running the grep command manually when this failure happens? I wasn't able to reproduce it locally: ps awux | grep -v grep | grep -q -- ceph-
hi Alfredo,
When I run the command manually there is no output:

root@magna052:~# ps awux | grep -v grep | grep -q -- ceph-
root@magna052:~#

However, the ceph processes are running on the system:

root@magna052:~# ps awux | grep ceph-
ceph  21422  0.1  0.1  352988 48352 ?     Ssl  05:55  0:00 /usr/bin/ceph-mon -f --cluster ceph-cluster12 --id magna052 --setuser ceph --setgroup ceph
ceph  23214  5.8  0.1  850816 45836 ?     Ssl  05:55  0:30 /usr/bin/ceph-osd -f --cluster ceph-cluster12 --id 0 --setuser ceph --setgroup ceph
ceph  23814  3.0  0.1  848780 39468 ?     Ssl  05:56  0:15 /usr/bin/ceph-osd -f --cluster ceph-cluster12 --id 1 --setuser ceph --setgroup ceph
ceph  24359  1.7  0.1  846732 38504 ?     Ssl  05:56  0:08 /usr/bin/ceph-osd -f --cluster ceph-cluster12 --id 2 --setuser ceph --setgroup ceph
root  25366  0.0  0.0   16576  2096 pts/0 S+   06:04  0:00 grep --color=auto ceph-

Maybe "ps awux | grep -v grep | grep -q -- ceph-" is not able to see the ceph processes?

Thanks,
Tejas
@tejas the "-q" flag in grep silences output; it doesn't mean that grep "is not able to see ceph processes". Did you run "ps awux | grep -v grep | grep -q -- ceph-" when the failure was triggered? (This is a requirement.)
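For context, "grep -q" prints nothing whether or not it matches; the result is reported only through the exit status, so an empty terminal after the command is expected. A quick sketch (the sample input strings are illustrative):

```shell
# `grep -q` suppresses all output; only the exit status signals a match.
printf 'ceph-mon\nceph-osd\n' | grep -q -- ceph-
echo "match: $?"      # exit status 0: a match was found
printf 'nothing\n' | grep -q -- ceph-
echo "no match: $?"   # exit status 1: no match
```

So checking $? (or Ansible's "rc") right after the command is the only way to see what it found.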
*** Bug 1359539 has been marked as a duplicate of this bug. ***
After investigating this a bit, I found that the "-q" flag in the last portion of the shell command is causing the error. If it is removed, the command does not fail and works as expected. The idea behind the "-q" flag is that grep exits immediately at the first match without sending any output to stdout. Since the flag is causing this (reproducible) issue, the code change will simply remove it.
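A sketch of the mechanism, under my reading of the thread: because "-q" makes the last grep exit at its first match, the preceding "grep -v grep" can be left writing into a closed pipe. Run interactively, SIGPIPE kills that grep silently (which would explain why the manual runs showed nothing), but when SIGPIPE is ignored, as it can be for commands spawned by Ansible, the write fails with EPIPE and grep prints "grep: write error" to stderr. The large synthetic input below is an illustrative assumption, not the real ps output:

```shell
# With -q, the final grep exits at the first match; if SIGPIPE is ignored
# in this environment, the upstream grep may print "grep: write error"
# to stderr while still writing the remaining lines:
seq 1 100000 | sed 's/^/ceph-mon line /' | grep -v grep | grep -q -- ceph-

# Without -q (the removal proposed in the fix), the pipe is drained to the
# end, so the upstream grep finishes cleanly; stdout is simply discarded:
seq 1 100000 | sed 's/^/ceph-mon line /' | grep -v grep | grep -- ceph- > /dev/null
echo "exit status without -q: $?"
```

Either variant still exits 0 when a ceph- process line is present, so the task's failed_when logic is unchanged by dropping "-q".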
*** Bug 1359537 has been marked as a duplicate of this bug. ***
Pull request opened: https://github.com/ceph/ceph-ansible/pull/900
This will ship concurrently with RHCS 2.1.
*** Bug 1354687 has been marked as a duplicate of this bug. ***
This issue was raised against the older ceph-ansible. With ceph-ansible 2.1 I have not seen this issue on Ubuntu, so I will mark this verified. Thanks, Tejas
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:0515