Bug 1505011

Summary: [Ceph-Ansible 3.0.3-1.el7cp] don't default restart machines during purge-cluster
Product: Red Hat Ceph Storage (Red Hat Storage)
Component: Ceph-Ansible
Version: 3.0
Status: CLOSED ERRATA
Severity: medium
Priority: urgent
Reporter: Vasu Kulkarni <vakulkar>
Assignee: Sébastien Han <shan>
QA Contact: Vasu Kulkarni <vakulkar>
CC: adeza, anharris, aschoen, ceph-eng-bugs, ceph-qe-bugs, gmeno, hnallurv, kdreyer, nthomas, sankarshan
Target Milestone: rc
Target Release: 3.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: RHEL: ceph-ansible-3.0.6-1.el7cp; Ubuntu: ceph-ansible_3.0.6-2redhat1
Type: Bug
Last Closed: 2017-12-05 23:48:29 UTC

Description Vasu Kulkarni 2017-10-21 02:37:09 UTC
Description of problem:

I see new behavior in purge-cluster: it now reboots the servers (ceph-ansible version 3.0.3-1.el7cp) via this handler:

  handlers:
    - name: restart machine
This default is bad for anyone driving the playbook over ssh, because the underlying connections die when the machine reboots. Please keep the original non-reboot behavior and trigger the reboot only when an additional variable is set. Most of my tests are failing because of this.
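
For illustration, the opt-in behavior could look roughly like this sketch (the reboot_on_purge variable and the handler body are hypothetical, only to show the shape of the change):

  handlers:
    - name: restart machine
      # illustrative command only, not the actual handler body from purge-cluster
      shell: sleep 2 && shutdown -r now "purge-cluster reboot"
      async: 1
      poll: 0
      # opt in with: ansible-playbook purge-cluster.yml -e reboot_on_purge=true
      when: reboot_on_purge | default(false) | bool

Here is the failure as seen from teuthology: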

2017-10-20T20:28:59.413 INFO:teuthology.orchestra.run.clara007.stdout:RUNNING HANDLER [wait for server to boot] **************************************
2017-10-20T20:29:00.612 ERROR:teuthology.run_tasks:Manager failed: ceph-ansible
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/run_tasks.py", line 159, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/task/__init__.py", line 134, in __exit__
    self.teardown()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/task/ceph_ansible.py", line 215, in teardown
    run.Raw(str_args)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/orchestra/remote.py", line 193, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/orchestra/run.py", line 423, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/orchestra/run.py", line 155, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/orchestra/run.py", line 169, in _raise_for_status

Comment 3 Sébastien Han 2017-10-23 10:20:31 UTC
The restart is definitely not optimal and should be removed if possible. It is a bad workaround: it means we are giving up on finding the root cause of why /var/lib/ceph cannot be deleted. Our CI does not reproduce this, so it would be nice to investigate in your environment, at least to understand why that directory cannot be removed.
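
Something like this untested debug play could show what still holds files under /var/lib/ceph (assumes lsof is installed on the nodes):

  - hosts: all
    become: true
    tasks:
      # hypothetical debug play, not part of ceph-ansible
      - name: list open files under /var/lib/ceph
        command: lsof +D /var/lib/ceph
        register: holders
        failed_when: false  # lsof exits non-zero when nothing matches

      - name: show what holds the directory
        debug:
          var: holders.stdout_lines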

Thanks in advance for your help.

Comment 4 Vasu Kulkarni 2017-10-24 01:29:41 UTC
Sébastien,

I think the issue is in Ceph itself, and it pops up now and then because it is highly dependent on the order of the purge and what background tasks are running. We have seen this problem in ceph-deploy too, where stale monitor locks in /var/lib/ceph/mon prevent that directory from being cleaned up.

I believe purge-cluster could likewise just fail here when it cannot clean up /var/lib/ceph; the user can then rerun it with the reboot option enabled, which could be documented.
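
Roughly something like this at the end of the purge (a sketch only; the task names and the reboot_on_purge variable are hypothetical):

  - name: purge /var/lib/ceph
    file:
      path: /var/lib/ceph
      state: absent
    register: purge_var_lib
    ignore_errors: true

  - name: fail with guidance instead of rebooting
    fail:
      msg: >
        Could not remove /var/lib/ceph (stale locks?). Investigate, or rerun
        purge-cluster with -e reboot_on_purge=true to reboot the nodes.
    when: purge_var_lib.failed | default(false)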

A recent example from ceph-deploy where a few tests failed due to stale locks on /var/lib/ceph/mon even after a purge:

http://pulpito.ceph.com/teuthology-2017-10-21_05:55:02-ceph-deploy-luminous-distro-basic-vps/

Command failed on vpm091 with status 1: 'sudo tar cz -f /tmp/tmpPRsXUl -C /var/lib/ceph/mon -- .'

Comment 6 Drew Harris 2017-10-26 00:55:18 UTC
Moved this to 3.0 since it looks like it's blocking Vasu's testing.

Comment 7 Sébastien Han 2017-10-26 12:24:50 UTC
https://github.com/ceph/ceph-ansible/releases/tag/v3.0.6 contains the fix. Ken, please build a new package.
Thanks.

Comment 14 errata-xmlrpc 2017-12-05 23:48:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3387