Bug 1282484
| Field | Value |
|---|---|
| Summary | [Ceph-deploy]: Ceph-deploy crashed while trying to remove a mon which is added manually |
| Product | [Red Hat Storage] Red Hat Ceph Storage |
| Component | Ceph-Installer |
| Version | 1.3.1 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Reporter | shylesh <shmohan> |
| Assignee | Alfredo Deza <adeza> |
| QA Contact | ceph-qe-bugs <ceph-qe-bugs> |
| CC | adeza, aschoen, ceph-eng-bugs, flucifre, hnallurv, hyelloji, kdreyer, nthomas, sankarshan, shmohan |
| Target Milestone | rc |
| Target Release | 1.3.2 |
| Fixed In Version | RHEL: ceph-deploy-1.5.27.4-3.el7cp; Ubuntu: ceph-deploy_1.5.27.4-4redhat1 |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2016-02-29 14:44:10 UTC |
Description (reported by shylesh, 2015-11-16 14:56:50 UTC)
Shylesh, did the manual way of removing a mon work?

OK, pushing this to 1.3.2. Please determine whether we should document this in the known issues before re-targeting to 1.3.2.

This needs to be in the known issues in the release notes for 1.3.1.

Shylesh, is this issue seen only on Ubuntu? Please confirm.

(In reply to Harish NV Rao from comment #2)
> Shylesh, did manual way of removing mon work?

Harish, yes, manual removal of a mon works fine as per the document.

@Harish, this issue is reproducible on RHEL as well as Ubuntu. Here is the output from RHEL:

```
[root@cephqe3 ceph-config]# ceph-deploy mon destroy cephqe6
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/ceph-config/cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.27.3): /usr/bin/ceph-deploy mon destroy cephqe6
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username        : None
[ceph_deploy.cli][INFO  ]  verbose         : False
[ceph_deploy.cli][INFO  ]  overwrite_conf  : False
[ceph_deploy.cli][INFO  ]  subcommand      : destroy
[ceph_deploy.cli][INFO  ]  quiet           : False
[ceph_deploy.cli][INFO  ]  cd_conf         : <ceph_deploy.conf.cephdeploy.Conf instance at 0x2877638>
[ceph_deploy.cli][INFO  ]  cluster         : ceph
[ceph_deploy.cli][INFO  ]  mon             : ['cephqe6']
[ceph_deploy.cli][INFO  ]  func            : <function mon at 0x2869d70>
[ceph_deploy.cli][INFO  ]  ceph_conf       : None
[ceph_deploy.cli][INFO  ]  default_release : False
[ceph_deploy.mon][DEBUG ] Removing mon from cephqe6
[cephqe6][DEBUG ] connected to host: cephqe6
[cephqe6][DEBUG ] detect platform information from remote host
[cephqe6][DEBUG ] detect machine type
[cephqe6][DEBUG ] get remote short hostname
[cephqe6][INFO  ] Running command: ceph --cluster=ceph -n mon. -k /var/lib/ceph/mon/ceph-cephqe6/keyring mon remove cephqe6
[cephqe6][WARNIN] removed mon.cephqe6 at 10.70.44.46:6789/0, there are now 2 monitors
[cephqe6][INFO  ] polling the daemon to verify it stopped
[ceph_deploy][ERROR ] Traceback (most recent call last):
[ceph_deploy][ERROR ]   File "/usr/lib/python2.7/site-packages/ceph_deploy/util/decorators.py", line 69, in newfunc
[ceph_deploy][ERROR ]     return f(*a, **kw)
[ceph_deploy][ERROR ]   File "/usr/lib/python2.7/site-packages/ceph_deploy/cli.py", line 169, in _main
[ceph_deploy][ERROR ]     return args.func(args)
[ceph_deploy][ERROR ]   File "/usr/lib/python2.7/site-packages/ceph_deploy/mon.py", line 444, in mon
[ceph_deploy][ERROR ]     mon_destroy(args)
[ceph_deploy][ERROR ]   File "/usr/lib/python2.7/site-packages/ceph_deploy/mon.py", line 382, in mon_destroy
[ceph_deploy][ERROR ]     hostname,
[ceph_deploy][ERROR ]   File "/usr/lib/python2.7/site-packages/ceph_deploy/mon.py", line 343, in destroy_mon
[ceph_deploy][ERROR ]     if is_running(conn, status_args):
[ceph_deploy][ERROR ] UnboundLocalError: local variable 'status_args' referenced before assignment
```

This is happening because ceph-deploy doesn't have systemd support for destroying monitors. A PR has been created and is ready for review: https://github.com/ceph/ceph-deploy/pull/375
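[Editor's note: the sketch below reproduces the failure pattern from the traceback in isolation. It is a minimal illustration, not the actual ceph-deploy source; `is_running` and the command lists are stand-ins, and the real code runs these checks over a remote connection. The point is that a local variable bound only inside recognized init-system branches is referenced unconditionally afterwards, so an unrecognized init system (systemd here) hits `UnboundLocalError`.]

```python
# Minimal sketch of the failure pattern, not the actual ceph-deploy code.

def is_running(status_args):
    # Stub standing in for ceph-deploy's remote "is the daemon up?" check.
    return False

def destroy_mon(init_system, hostname):
    if init_system == 'upstart':
        status_args = ['initctl', 'status', 'ceph-mon', 'id=%s' % hostname]
    elif init_system == 'sysvinit':
        status_args = ['service', 'ceph', 'status', 'mon.%s' % hostname]
    # No branch binds status_args for 'systemd', yet it is used below:
    if is_running(status_args):
        print('monitor still running, stopping it')

# Raises: UnboundLocalError: local variable 'status_args' referenced before assignment
destroy_mon('systemd', 'cephqe6')
```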
See 1255495, need to use systemd/upstart to launch MONs.

*** This bug has been marked as a duplicate of bug 1255497 ***

I actually don't think this is a duplicate. Bug 1255497 is about starting monitors that were created with the `ceph` tool directly. That is not the case for *destroying* monitors (this ticket). The code paths are distinct, and the pull request that is open addresses *this* issue but not the other one.

I would much rather keep these separate so that the work can be traced to a specific fix: cannot destroy monitors on a systemd server.

Upstream ticket: http://tracker.ceph.com/issues/14049

Let's see if we can get this fix into RHCS 1.3.2.

Change to cherry-pick to ceph-1.3-rhel-patches in Gerrit: https://github.com/ceph/ceph-deploy/pull/375

Created attachment 1120059 [details]
Attaching the mon logs

On the Ubuntu setup, though the mon was removed, there was an error while running the command.

Ceph cluster status before destroying: http://pastebin.test.redhat.com/345144

When removing with the ceph-deploy method:

```
ubuntu@magna012:~/install/ubuntu/u130/ceph-config$ ceph-deploy mon destroy magna051
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ubuntu/install/ubuntu/u130/ceph-config/cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.27.4): /usr/bin/ceph-deploy mon destroy magna051
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username        : None
[ceph_deploy.cli][INFO  ]  verbose         : False
[ceph_deploy.cli][INFO  ]  overwrite_conf  : False
[ceph_deploy.cli][INFO  ]  subcommand      : destroy
[ceph_deploy.cli][INFO  ]  quiet           : False
[ceph_deploy.cli][INFO  ]  cd_conf         : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7fdc53a7abd8>
[ceph_deploy.cli][INFO  ]  cluster         : ceph
[ceph_deploy.cli][INFO  ]  mon             : ['magna051']
[ceph_deploy.cli][INFO  ]  func            : <function mon at 0x7fdc53a552a8>
[ceph_deploy.cli][INFO  ]  ceph_conf       : None
[ceph_deploy.cli][INFO  ]  default_release : False
[ceph_deploy.mon][DEBUG ] Removing mon from magna051
[magna051][DEBUG ] connection detected need for sudo
[magna051][DEBUG ] connected to host: magna051
[magna051][DEBUG ] detect platform information from remote host
[magna051][DEBUG ] detect machine type
[magna051][DEBUG ] get remote short hostname
[magna051][INFO  ] Running command: sudo ceph --cluster=ceph -n mon. -k /var/lib/ceph/mon/ceph-magna051/keyring mon remove magna051
[magna051][WARNIN] removed mon.magna051 at 10.8.128.51:6789/0, there are now 1 monitors
[ceph_deploy.mon][ERROR ] unsupported init system detected, cannot continue
[ceph_deploy][ERROR ] GenericError: Failed to destroy 1 monitors
```

Ceph cluster state after removing the monitor:

```
ubuntu@magna051:~/tmp$ sudo ceph quorum_status --format json-pretty
{
    "election_epoch": 3,
    "quorum": [
        0
    ],
    "quorum_names": [
        "magna015"
    ],
    "quorum_leader_name": "magna015",
    "monmap": {
        "epoch": 7,
        "fsid": "6b95f328-0a99-44ea-82cc-ef5f811868d4",
        "modified": "2016-02-01 10:10:53.734977",
        "created": "0.000000",
        "mons": [
            {
                "rank": 0,
                "name": "magna015",
                "addr": "10.8.128.15:6789\/0"
            }
        ]
    }
}
```

```
ubuntu@magna012:~/install/ubuntu/u130/ceph-config$ ceph -s
    cluster 6b95f328-0a99-44ea-82cc-ef5f811868d4
     health HEALTH_OK
     monmap e7: 1 mons at {magna015=10.8.128.15:6789/0}
            election epoch 3, quorum 0 magna015
     osdmap e18189: 12 osds: 12 up, 12 in
      pgmap v325696: 768 pgs, 11 pools, 1098 MB data, 22383 kobjects
            670 GB used, 10441 GB / 11112 GB avail
                 768 active+clean
  client io 7536 B/s rd, 25267 B/s wr, 39 op/s
```

Attaching the mon logs for the monitor that was added and removed.

We are back again in a particular situation where ceph-deploy depends on the monitor having been added via the Ceph command-line tools (ceph-mon in this case), which are supposed to leave a file indicating which init system was used to boot the monitor. On the server under test, that path should be:

/var/lib/ceph/mon/ceph-magna051/

But there is no 'sysvinit' or 'upstart' marker file there, and because the server is not systemd, the tool reports that it can't determine what to use to stop the monitor. Those files would be placed there when/if the monitor was started using the ceph-mon tool.

This will require a bit of work in ceph-deploy to get right. Possible workarounds are:

* do not add a monitor without ceph-deploy
* 'touch' the needed file to hint which init system should be used (this sounds the worst)
* add documentation on issuing the right command to stop the monitor, because ceph-deploy may not be able to do so

Bottom line is that ceph-deploy needs some work here to be able to handle this properly.
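[Editor's note: a minimal sketch of the marker-file lookup described above, assuming the directory layout named in the comment. `detect_init_marker` and the local `GenericError` class are illustrative stand-ins, not the actual ceph-deploy code; the real tool raises its error after also failing to detect systemd on the host.]

```python
import os

class GenericError(Exception):
    # Stand-in for the error type seen in the log above.
    pass

def detect_init_marker(mon_dir):
    """Look for the marker file that ceph-mon leaves behind to record
    which init system started the monitor. Returns None when the
    monitor was created outside ceph-deploy and no hint exists."""
    for marker in ('upstart', 'sysvinit'):
        if os.path.exists(os.path.join(mon_dir, marker)):
            return marker
    return None

init_system = detect_init_marker('/var/lib/ceph/mon/ceph-magna051')
if init_system is None:
    # With no marker file (and no systemd), ceph-deploy cannot tell how
    # to stop the daemon, hence the clean error in the Ubuntu log above.
    raise GenericError('unsupported init system detected, cannot continue')
```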
(In reply to Alfredo Deza from comment #16)
> * do not add a monitor without ceph-deploy

This is my preference. Let's update the docs to reflect the new error string.

The previous error string was:

UnboundLocalError: local variable 'status_args' referenced before assignment

The new error string is:

unsupported init system detected, cannot continue

The docs currently read:

> The monitor is removed despite the error, however, ceph-deploy fails to
> remove the monitor's configuration directory located in the
> /var/lib/ceph/mon/ directory. To work around this issue, remove the
> monitor's directory manually.

Alfredo, if that's correct, let's just go with that for RHCS 1.3.2.

Stepping through the code, there is one notable addition: the mon process (daemon) is still hanging around. Not sure if this is a problem worth noting. Adding a fallback, which would mean a small code change *just for this operation*, shouldn't be that hard, and I could have something ready for review today.

Decided to go ahead and implement this. Pull request is available here: https://github.com/ceph/ceph-deploy/pull/385

Once that is merged we can cherry-pick and cut a new ceph-deploy release.

Merged commit 379dfd2 into ceph-deploy/master. This involves more than one commit but fixes this problem.

Today in #ceph-devel Sage mentioned that this might have caused a regression on Ubuntu Trusty. Alfredo, can you please verify that this is really working on Trusty (or create an additional PR if appropriate)?

You're way ahead of me. Looks like https://github.com/ceph/ceph-deploy/pull/386 is what we need? Could you please cherry-pick that to ceph-1.3-rhel-patches in Gerrit?

Today's build has all the necessary patches for this issue. It works perfectly without any errors. Moving to verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:0313