Bug 1505011
| Summary: | [Ceph-Ansible 3.0.3-1.el7cp] dont default restart machines during purge-cluster | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Vasu Kulkarni <vakulkar> |
| Component: | Ceph-Ansible | Assignee: | Sébastien Han <shan> |
| Status: | CLOSED ERRATA | QA Contact: | Vasu Kulkarni <vakulkar> |
| Severity: | medium | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 3.0 | CC: | adeza, anharris, aschoen, ceph-eng-bugs, ceph-qe-bugs, gmeno, hnallurv, kdreyer, nthomas, sankarshan |
| Target Milestone: | rc | ||
| Target Release: | 3.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | RHEL: ceph-ansible-3.0.6-1.el7cp Ubuntu: ceph-ansible_3.0.6-2redhat1 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2017-12-05 23:48:29 UTC | Type: | Bug |

The restart is definitely not optimal and should be removed if possible. It is a bad workaround: it means we are giving up on finding the root cause of not being able to delete /var/lib/ceph. Our CI does not reproduce that, so it would be nice to investigate in your env, at least to understand why this directory cannot be removed. Thanks in advance for your help.

Sébastien, I think the issue is in Ceph itself, and I think it pops up now and then because it is highly dependent on the order of the purge and on what background tasks are running. We have seen this problem in ceph-deploy too, where a few monitor locks in /var/lib/ceph/mon prevent it from cleaning up that dir. I believe you could fail purge-cluster here as well if it fails to clean up /var/lib/ceph, and users can then rerun it with reboot enabled, which could be documented.

A recent example from ceph-deploy, where a few tests failed due to stale locks on /var/lib/ceph/mon even after purge:
http://pulpito.ceph.com/teuthology-2017-10-21_05:55:02-ceph-deploy-luminous-distro-basic-vps/

Command failed on vpm091 with status 1: 'sudo tar cz -f /tmp/tmpPRsXUl -C /var/lib/ceph/mon -- .'

Moved this to 3.0 since it looks like it's blocking Vasu's testing.

https://github.com/ceph/ceph-ansible/releases/tag/v3.0.6 contains the fix. Ken, please build a new package. Thanks.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2017:3387
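
The suggestion in the thread above (fail the purge when /var/lib/ceph cannot be removed, and let the user rerun it with reboot enabled) could look roughly like the sketch below. This is only an illustration in Python, not the ceph-ansible implementation: `remove_data_dir` and `PurgeIncomplete` are hypothetical names, and the wording of the error message is an assumption.

```python
import shutil
from pathlib import Path


class PurgeIncomplete(RuntimeError):
    """Hypothetical error: the data directory survived the purge."""


def remove_data_dir(path):
    """Delete `path` recursively; raise PurgeIncomplete if removal fails."""
    target = Path(path)
    if not target.exists():
        return  # nothing left to clean up
    try:
        shutil.rmtree(target)
    except OSError as exc:
        # Surface the failure instead of rebooting behind the caller's back.
        raise PurgeIncomplete(
            f"could not remove {target}: {exc}; "
            "rerun the purge with the reboot option enabled"
        ) from exc
```

A caller would pass the data directory (for example `/var/lib/ceph`) and treat `PurgeIncomplete` as a failed purge rather than as a reason to reboot unconditionally.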
Description of problem:

purge-cluster now reboots the servers by default (ceph-ansible version 3.0.3-1.el7cp):

    handlers:
      - name: restart machine

This default is bad for anyone running scripts over ssh, because the underlying connections die when the machine reboots. Please keep the original non-reboot behavior as the default and reboot only when an additional variable is set. Most of my tests are failing because of this:

    2017-10-20T20:28:59.413 INFO:teuthology.orchestra.run.clara007.stdout:RUNNING HANDLER [wait for server to boot] **************************************
    2017-10-20T20:29:00.612 ERROR:teuthology.run_tasks:Manager failed: ceph-ansible
    Traceback (most recent call last):
      File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/run_tasks.py", line 159, in run_tasks
        suppress = manager.__exit__(*exc_info)
      File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/task/__init__.py", line 134, in __exit__
        self.teardown()
      File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/task/ceph_ansible.py", line 215, in teardown
        run.Raw(str_args)
      File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/orchestra/remote.py", line 193, in run
        r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
      File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/orchestra/run.py", line 423, in run
        r.wait()
      File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/orchestra/run.py", line 155, in wait
        self._raise_for_status()
      File "/home/teuthworker/src/git.ceph.com_git_teuthology_rh/teuthology/orchestra/run.py", line 169, in _raise_for_status
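
From the test-harness side, the request in the description (no reboot by default, reboot only when an additional variable is set) might be driven as in the sketch below. It assumes the playbook gates its restart handler on an extra variable. The variable name `reboot_osd_node` is a placeholder chosen for illustration and is not confirmed by this report, and `purge_command` is a hypothetical helper. Only `ansible-playbook`, `-i`, and `-e` (extra vars, here passed as JSON so the value is a real boolean) are standard.

```python
import json


def purge_command(playbook, inventory, reboot=False, reboot_var="reboot_osd_node"):
    """Build the ansible-playbook argv; request a reboot only when asked."""
    cmd = ["ansible-playbook", "-i", inventory, playbook]
    if reboot:
        # reboot_var is a placeholder name; use whatever the playbook defines.
        cmd += ["-e", json.dumps({reboot_var: True})]
    return cmd
```

With `reboot=False` the command carries no extra variable, so ssh-based callers such as the teardown step in the traceback above would not lose their connection unless they explicitly opt in.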