Bug 1505011

Summary: [Ceph-Ansible 3.0.3-1.el7cp] don't default restart machines during purge-cluster
Product: Red Hat Ceph Storage (Red Hat Storage)
Component: Ceph-Ansible
Version: 3.0
Status: CLOSED ERRATA
Severity: medium
Priority: urgent
Reporter: Vasu Kulkarni <vakulkar>
Assignee: Sébastien Han <shan>
QA Contact: Vasu Kulkarni <vakulkar>
CC: adeza, anharris, aschoen, ceph-eng-bugs, ceph-qe-bugs, gmeno, hnallurv, kdreyer, nthomas, sankarshan
Target Milestone: rc
Target Release: 3.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: RHEL: ceph-ansible-3.0.6-1.el7cp; Ubuntu: ceph-ansible_3.0.6-2redhat1
Type: Bug
Last Closed: 2017-12-05 23:48:29 UTC

Description Vasu Kulkarni 2017-10-21 02:37:09 UTC
Description of problem:

I see new behavior in purge-cluster: it now reboots the servers (ceph-ansible version 3.0.3-1.el7cp) via this handler:

  handlers:
    - name: restart machine
This default is bad for anyone driving the playbook over ssh, because the underlying connections die when the machine reboots. Please keep the original non-reboot behavior and trigger the reboot only when an additional variable is set. Most of my tests are failing because of this.
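
For illustration, the opt-in behavior could look roughly like this sketch (the reboot_on_purge variable and the handler body are hypothetical, only to show the shape of the change):

  handlers:
    - name: restart machine
      # illustrative command only, not the actual handler body from purge-cluster
      shell: sleep 2 && shutdown -r now "purge-cluster reboot"
      async: 1
      poll: 0
      # opt in with: ansible-playbook purge-cluster.yml -e reboot_on_purge=true
      when: reboot_on_purge | default(false) | bool

Here is the failure as seen from teuthology: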

2017-10-20T20:28:59.413 INFO:teuthology.orchestra.run.clara007.stdout:RUNNING HANDLER [wait for server to boot] **************************************
2017-10-20T20:29:00.612 ERROR:teuthology.run_tasks:Manager failed: ceph-ansible
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/run_tasks.py", line 159, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/task/__init__.py", line 134, in __exit__
    self.teardown()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/task/ceph_ansible.py", line 215, in teardown
    run.Raw(str_args)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/orchestra/remote.py", line 193, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/orchestra/run.py", line 423, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/orchestra/run.py", line 155, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/orchestra/run.py", line 169, in _raise_for_status

Comment 3 Sébastien Han 2017-10-23 10:20:31 UTC
The restart is definitely not optimal and should be removed if possible. It is a bad workaround: it means we are giving up on finding the root cause of why /var/lib/ceph cannot be deleted. Our CI does not reproduce this, so it would be nice to investigate in your environment, at least to understand why that directory cannot be removed.
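
Something like this untested debug play could show what still holds files under /var/lib/ceph (assumes lsof is installed on the nodes):

  - hosts: all
    become: true
    tasks:
      # hypothetical debug play, not part of ceph-ansible
      - name: list open files under /var/lib/ceph
        command: lsof +D /var/lib/ceph
        register: holders
        failed_when: false  # lsof exits non-zero when nothing matches

      - name: show what holds the directory
        debug:
          var: holders.stdout_lines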

Thanks in advance for your help.

Comment 4 Vasu Kulkarni 2017-10-24 01:29:41 UTC
Sébastien,

I think the issue is in Ceph itself, and it pops up now and then because it is highly dependent on the order of the purge and what background tasks are running. We have seen this problem in ceph-deploy too, where stale monitor locks in /var/lib/ceph/mon prevent that directory from being cleaned up.

I believe purge-cluster could likewise just fail here when it cannot clean up /var/lib/ceph; the user can then rerun it with the reboot option enabled, which could be documented.
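
Roughly something like this at the end of the purge (a sketch only; the task names and the reboot_on_purge variable are hypothetical):

  - name: purge /var/lib/ceph
    file:
      path: /var/lib/ceph
      state: absent
    register: purge_var_lib
    ignore_errors: true

  - name: fail with guidance instead of rebooting
    fail:
      msg: >
        Could not remove /var/lib/ceph (stale locks?). Investigate, or rerun
        purge-cluster with -e reboot_on_purge=true to reboot the nodes.
    when: purge_var_lib.failed | default(false)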

A recent example from ceph-deploy where a few tests failed due to stale locks on /var/lib/ceph/mon even after a purge:

http://pulpito.ceph.com/teuthology-2017-10-21_05:55:02-ceph-deploy-luminous-distro-basic-vps/

Command failed on vpm091 with status 1: 'sudo tar cz -f /tmp/tmpPRsXUl -C /var/lib/ceph/mon -- .'

Comment 6 Drew Harris 2017-10-26 00:55:18 UTC
Moved this to 3.0 since it looks like it's blocking Vasu's testing.

Comment 7 Sébastien Han 2017-10-26 12:24:50 UTC
https://github.com/ceph/ceph-ansible/releases/tag/v3.0.6 contains the fix. Ken, please build a new package.
Thanks.

Comment 14 errata-xmlrpc 2017-12-05 23:48:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3387