Bug 1505011 - [Ceph-Ansible 3.0.3-1.el7cp] dont default restart machines during purge-cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: medium
Target Milestone: rc
Target Release: 3.0
Assignee: Sébastien Han
QA Contact: Vasu Kulkarni
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-10-21 02:37 UTC by Vasu Kulkarni
Modified: 2022-02-21 18:19 UTC (History)

Fixed In Version: RHEL: ceph-ansible-3.0.6-1.el7cp Ubuntu: ceph-ansible_3.0.6-2redhat1
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-05 23:48:29 UTC
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | ceph ceph-ansible pull 2112 | 0 | None | None | None | 2017-10-26 12:20:40 UTC
Red Hat Product Errata | RHBA-2017:3387 | 0 | normal | SHIPPED_LIVE | Red Hat Ceph Storage 3.0 bug fix and enhancement update | 2017-12-06 03:03:45 UTC

Description Vasu Kulkarni 2017-10-21 02:37:09 UTC
Description of problem:

purge-cluster now reboots the servers by default (ceph-ansible version 3.0.3-1.el7cp):

  handlers:
  - name: restart machine

This default is bad for anyone running scripts over SSH, because the underlying connections die when the machine reboots. Please keep the original non-reboot behavior as the default and make the reboot opt-in through an additional variable. Most of my tests are failing because of this:

2017-10-20T20:28:59.413 INFO:teuthology.orchestra.run.clara007.stdout:RUNNING HANDLER [wait for server to boot] **************************************
2017-10-20T20:29:00.612 ERROR:teuthology.run_tasks:Manager failed: ceph-ansible
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/run_tasks.py", line 159, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/task/__init__.py", line 134, in __exit__
    self.teardown()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/task/ceph_ansible.py", line 215, in teardown
    run.Raw(str_args)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/orchestra/remote.py", line 193, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/orchestra/run.py", line 423, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/orchestra/run.py", line 155, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/orchestra/run.py", line 169, in _raise_for_status
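
A minimal sketch of the opt-in behavior requested above, seen from a test harness that drives the playbook over SSH. The helper build_purge_command and the variable name reboot_on_purge are hypothetical and are not taken from ceph-ansible; the playbook filename is assumed, and only the -e (extra vars) flag of ansible-playbook is a real option.

```python
from typing import List


def build_purge_command(playbook: str = "purge-cluster.yml",
                        reboot: bool = False) -> List[str]:
    """Build an ansible-playbook command line for a purge run.

    `reboot_on_purge` is a hypothetical variable name used for
    illustration; it is not taken from ceph-ansible.
    """
    cmd = ["ansible-playbook", playbook]
    if reboot:
        # Request a reboot only when the caller opts in explicitly,
        # so SSH-driven callers keep their connections by default.
        cmd += ["-e", "reboot_on_purge=true"]
    return cmd


print(build_purge_command())
print(build_purge_command(reboot=True))
```

With the reboot off by default, a caller that holds an SSH session open for the whole run would have to ask for the reboot explicitly before its connection could be dropped.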

Comment 3 Sébastien Han 2017-10-23 10:20:31 UTC
The restart is definitely not optimal and should be removed if possible. It is a bad workaround: it means we are giving up on finding the root cause of not being able to delete /var/lib/ceph. Our CI does not reproduce that, so it would be good to investigate in your environment, at least to understand why this directory cannot be removed.

Thanks in advance for your help.

Comment 4 Vasu Kulkarni 2017-10-24 01:29:41 UTC
Sébastien,

I think the issue is in Ceph itself, and it pops up now and then because it is highly dependent on the order of the purge and on what background tasks are running. We have seen this problem in ceph-deploy too, where a few monitor locks in /var/lib/ceph/mon prevent that directory from being cleaned up.

I believe you could fail the purge-cluster run here as well if it fails to clean up /var/lib/ceph; users can then rerun it with reboot enabled, which could be documented.
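
The fail-then-rerun idea could be sketched roughly as follows. This is an illustration only, not ceph-ansible code: remove_data_dir and PurgeIncomplete are hypothetical names, and the rmtree parameter exists only so the failure path can be exercised without a real stale lock.

```python
import shutil
from pathlib import Path


class PurgeIncomplete(RuntimeError):
    """Raised when the data directory survives the purge."""


def remove_data_dir(path, rmtree=shutil.rmtree):
    """Try to delete `path`; fail loudly instead of rebooting."""
    target = Path(path)
    # ignore_errors=True lets the removal continue past entries it
    # cannot delete; the existence check below decides the outcome.
    rmtree(target, ignore_errors=True)
    if target.exists():
        raise PurgeIncomplete(
            f"could not remove {target}; rerun the purge with reboot enabled"
        )
```

A caller would invoke remove_data_dir("/var/lib/ceph") and treat PurgeIncomplete as the signal to rerun with the reboot option turned on.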

A recent example from ceph-deploy, where a few tests failed due to stale locks on /var/lib/ceph/mon even after purge:

http://pulpito.ceph.com/teuthology-2017-10-21_05:55:02-ceph-deploy-luminous-distro-basic-vps/

Command failed on vpm091 with status 1: 'sudo tar cz -f /tmp/tmpPRsXUl -C /var/lib/ceph/mon -- .'

Comment 6 Drew Harris 2017-10-26 00:55:18 UTC
Moved this to 3.0 since it looks like it's blocking Vasu's testing.

Comment 7 Sébastien Han 2017-10-26 12:24:50 UTC
https://github.com/ceph/ceph-ansible/releases/tag/v3.0.6 contains the fix. Ken, please build a new package.
Thanks.

Comment 14 errata-xmlrpc 2017-12-05 23:48:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3387

