Bug 1282484

Summary: [Ceph-deploy]: Ceph-deploy crashed while trying to remove a mon which is added manually
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: shylesh <shmohan>
Component: Ceph-Installer
Assignee: Alfredo Deza <adeza>
Status: CLOSED ERRATA
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: high
Priority: unspecified
Version: 1.3.1
CC: adeza, aschoen, ceph-eng-bugs, flucifre, hnallurv, hyelloji, kdreyer, nthomas, sankarshan, shmohan
Target Milestone: rc
Target Release: 1.3.2
Hardware: x86_64
OS: Linux
Fixed In Version: RHEL: ceph-deploy-1.5.27.4-3.el7cp; Ubuntu: ceph-deploy_1.5.27.4-4redhat1
Doc Type: Bug Fix
Last Closed: 2016-02-29 14:44:10 UTC
Type: Bug
Attachments: Attaching the Mon Logs

Description shylesh 2015-11-16 14:56:50 UTC
Description of problem:
I added a monitor manually. While trying to remove it with "ceph-deploy mon destroy", ceph-deploy crashed, but the mon was removed. /var/lib/ceph/mon/ceph-{id} was not cleaned up.

 ceph@magna059:~/ceph-config$ dpkg -l | grep ceph-deploy
ii  ceph-deploy                         1.5.27.3trusty                        all          Ceph-deploy is an easy to use configuration tool


ceph@magna059:~/ceph-config$ dpkg -l | grep ceph
ii  ceph-common                         0.94.3.3-1trusty                      amd64        common utilities to mount and interact with a ceph storage cluster

ceph@magna059:~/ceph-config$ ceph-deploy mon destroy magna110
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/ceph-config/cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.27.3): /usr/bin/ceph-deploy mon destroy magna110
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  overwrite_conf                : False
[ceph_deploy.cli][INFO  ]  subcommand                    : destroy
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f355e8b7cb0>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  mon                           : ['magna110']
[ceph_deploy.cli][INFO  ]  func                          : <function mon at 0x7f355e8a12a8>
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.mon][DEBUG ] Removing mon from magna110
ceph@magna110's password:
[magna110][DEBUG ] connection detected need for sudo
ceph@magna110's password:
[magna110][DEBUG ] connected to host: magna110
[magna110][DEBUG ] detect platform information from remote host
[magna110][DEBUG ] detect machine type
[magna110][DEBUG ] get remote short hostname
[magna110][INFO  ] Running command: sudo ceph --cluster=ceph -n mon. -k /var/lib/ceph/mon/ceph-magna110/keyring mon remove magna110
[magna110][WARNIN] removed mon.magna110 at 10.8.128.110:6789/0, there are now 2 monitors
[magna110][INFO  ] polling the daemon to verify it stopped
[ceph_deploy][ERROR ] Traceback (most recent call last):
[ceph_deploy][ERROR ]   File "/usr/lib/python2.7/dist-packages/ceph_deploy/util/decorators.py", line 69, in newfunc
[ceph_deploy][ERROR ]     return f(*a, **kw)
[ceph_deploy][ERROR ]   File "/usr/lib/python2.7/dist-packages/ceph_deploy/cli.py", line 169, in _main
[ceph_deploy][ERROR ]     return args.func(args)
[ceph_deploy][ERROR ]   File "/usr/lib/python2.7/dist-packages/ceph_deploy/mon.py", line 444, in mon
[ceph_deploy][ERROR ]     mon_destroy(args)
[ceph_deploy][ERROR ]   File "/usr/lib/python2.7/dist-packages/ceph_deploy/mon.py", line 382, in mon_destroy
[ceph_deploy][ERROR ]     hostname,
[ceph_deploy][ERROR ]   File "/usr/lib/python2.7/dist-packages/ceph_deploy/mon.py", line 343, in destroy_mon
[ceph_deploy][ERROR ]     if is_running(conn, status_args):
[ceph_deploy][ERROR ] UnboundLocalError: local variable 'status_args' referenced before assignment
[ceph_deploy][ERROR ]
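
For readers hitting the same trace, the crash is the standard Python fall-through pattern: a variable assigned only in some branches of an if/elif chain and then used unconditionally. A minimal, hypothetical sketch of that failure mode (illustrative only, not the actual ceph-deploy source):

    # Hypothetical sketch of the failure pattern behind the traceback above;
    # this is an illustration, not the real ceph-deploy code.
    def build_status_args(init_system, hostname):
        if init_system == 'upstart':
            status_args = ['initctl', 'status', 'ceph-mon', 'id={0}'.format(hostname)]
        elif init_system == 'sysvinit':
            status_args = ['service', 'ceph', 'status', 'mon.{0}'.format(hostname)]
        # No branch handles 'systemd', so on a systemd host status_args is never
        # assigned and the return below raises UnboundLocalError.
        return status_args

    try:
        build_status_args('systemd', 'magna110')
    except UnboundLocalError as err:
        print(err)  # local variable 'status_args' referenced before assignment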

Comment 2 Harish NV Rao 2015-11-16 22:08:16 UTC
Shylesh, did the manual way of removing the mon work?

Comment 3 Federico Lucifredi 2015-11-16 22:31:27 UTC
OK, pushing this to 1.3.2.

Please determine if we should document this in the known issues before re-targeting to 1.3.2.

Comment 4 Harish NV Rao 2015-11-16 22:40:39 UTC
This needs to be in known issues in release notes for 1.3.1.

Comment 5 Harish NV Rao 2015-11-16 22:42:25 UTC
Shylesh, is this issue seen only on Ubuntu? Please confirm.

Comment 6 shylesh 2015-11-17 07:01:34 UTC
(In reply to Harish NV Rao from comment #2)
> Shylesh, did the manual way of removing the mon work?

Harish,
Yes, manual removal of the mon works fine, as per the documentation.

Comment 7 shylesh 2015-11-17 10:38:42 UTC
@Harish,

This issue is reproducible on RHEL as well as Ubuntu.


Here is the output from RHEL:

[root@cephqe3 ceph-config]# ceph-deploy mon destroy cephqe6
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/ceph-config/cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.27.3): /usr/bin/ceph-deploy mon destroy cephqe6
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  overwrite_conf                : False
[ceph_deploy.cli][INFO  ]  subcommand                    : destroy
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x2877638>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  mon                           : ['cephqe6']
[ceph_deploy.cli][INFO  ]  func                          : <function mon at 0x2869d70>
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.mon][DEBUG ] Removing mon from cephqe6
[cephqe6][DEBUG ] connected to host: cephqe6 
[cephqe6][DEBUG ] detect platform information from remote host
[cephqe6][DEBUG ] detect machine type
[cephqe6][DEBUG ] get remote short hostname
[cephqe6][INFO  ] Running command: ceph --cluster=ceph -n mon. -k /var/lib/ceph/mon/ceph-cephqe6/keyring mon remove cephqe6
[cephqe6][WARNIN] removed mon.cephqe6 at 10.70.44.46:6789/0, there are now 2 monitors
[cephqe6][INFO  ] polling the daemon to verify it stopped
[ceph_deploy][ERROR ] Traceback (most recent call last):
[ceph_deploy][ERROR ]   File "/usr/lib/python2.7/site-packages/ceph_deploy/util/decorators.py", line 69, in newfunc
[ceph_deploy][ERROR ]     return f(*a, **kw)
[ceph_deploy][ERROR ]   File "/usr/lib/python2.7/site-packages/ceph_deploy/cli.py", line 169, in _main
[ceph_deploy][ERROR ]     return args.func(args)
[ceph_deploy][ERROR ]   File "/usr/lib/python2.7/site-packages/ceph_deploy/mon.py", line 444, in mon
[ceph_deploy][ERROR ]     mon_destroy(args)
[ceph_deploy][ERROR ]   File "/usr/lib/python2.7/site-packages/ceph_deploy/mon.py", line 382, in mon_destroy
[ceph_deploy][ERROR ]     hostname,
[ceph_deploy][ERROR ]   File "/usr/lib/python2.7/site-packages/ceph_deploy/mon.py", line 343, in destroy_mon
[ceph_deploy][ERROR ]     if is_running(conn, status_args):
[ceph_deploy][ERROR ] UnboundLocalError: local variable 'status_args' referenced before assignment
[ceph_deploy][ERROR ]

Comment 8 Alfredo Deza 2015-12-10 21:22:49 UTC
This is happening because ceph-deploy doesn't have systemd support for destroying monitors.

A PR has been created and is ready for review: https://github.com/ceph/ceph-deploy/pull/375
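
As a rough illustration of what "systemd support" means on this code path, the stop/status command has to be chosen per init system instead of assuming upstart or sysvinit. A hedged sketch follows; the ceph-mon@<hostname> unit name and the exact commands are assumptions for illustration, not ceph-deploy's actual implementation:

    # Illustrative only: pick a stop command per init system instead of
    # falling through.  Unit/service names here are assumptions.
    def stop_mon_command(init_system, hostname):
        if init_system == 'systemd':
            return ['systemctl', 'stop', 'ceph-mon@{0}'.format(hostname)]
        if init_system == 'upstart':
            return ['initctl', 'stop', 'ceph-mon', 'id={0}'.format(hostname)]
        if init_system == 'sysvinit':
            return ['service', 'ceph', 'stop', 'mon.{0}'.format(hostname)]
        raise RuntimeError('unsupported init system: {0}'.format(init_system))

    print(stop_mon_command('systemd', 'magna110'))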

Comment 9 Federico Lucifredi 2015-12-10 23:50:09 UTC
See 1255495, need to use systemd/upstart to launch MONs.

*** This bug has been marked as a duplicate of bug 1255497 ***

Comment 10 Federico Lucifredi 2015-12-10 23:50:28 UTC
See 1255495, need to use systemd/upstart to launch MONs.

Comment 11 Alfredo Deza 2015-12-11 12:13:21 UTC
I actually don't think this is a duplicate. Bug 1255497 is about starting monitors that were created with the `ceph` tool directly; it does not cover *destroying* monitors, which is what this ticket is about.

The code paths are distinct, and the pull request that was opened addresses *this* issue but not the other one.

I would prefer to keep these separate so that the work can be traced to a specific fix: cannot destroy monitors on a systemd server.

Upstream ticket: http://tracker.ceph.com/issues/14049

Comment 12 Ken Dreyer (Red Hat) 2015-12-14 16:08:19 UTC
Let's see if we can get this fix into RHCS 1.3.2.

Comment 13 Ken Dreyer (Red Hat) 2016-01-20 18:37:21 UTC
Change to cherry-pick to ceph-1.3-rhel-patches in Gerrit: https://github.com/ceph/ceph-deploy/pull/375

Comment 15 Hemanth Kumar 2016-02-01 10:22:49 UTC
Created attachment 1120059 [details]
Attaching the Mon Logs

On the Ubuntu setup, although the mon was removed, there was an error while running the command:

Ceph Cluster status before destroying : http://pastebin.test.redhat.com/345144

When removing with the ceph-deploy method:

ubuntu@magna012:~/install/ubuntu/u130/ceph-config$ ceph-deploy mon destroy magna051
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ubuntu/install/ubuntu/u130/ceph-config/cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.27.4): /usr/bin/ceph-deploy mon destroy magna051
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  overwrite_conf                : False
[ceph_deploy.cli][INFO  ]  subcommand                    : destroy
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7fdc53a7abd8>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  mon                           : ['magna051']
[ceph_deploy.cli][INFO  ]  func                          : <function mon at 0x7fdc53a552a8>
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.mon][DEBUG ] Removing mon from magna051
[magna051][DEBUG ] connection detected need for sudo
[magna051][DEBUG ] connected to host: magna051 
[magna051][DEBUG ] detect platform information from remote host
[magna051][DEBUG ] detect machine type
[magna051][DEBUG ] get remote short hostname
[magna051][INFO  ] Running command: sudo ceph --cluster=ceph -n mon. -k /var/lib/ceph/mon/ceph-magna051/keyring mon remove magna051
[magna051][WARNIN] removed mon.magna051 at 10.8.128.51:6789/0, there are now 1 monitors
[ceph_deploy.mon][ERROR ] unsupported init system detected, cannot continue
[ceph_deploy][ERROR ] GenericError: Failed to destroy 1 monitors
-------------------------------------------------------------------------------

Ceph cluster state after removing the monitor:

ubuntu@magna051:~/tmp$ sudo ceph quorum_status --format json-pretty

{
    "election_epoch": 3,
    "quorum": [
        0
    ],
    "quorum_names": [
        "magna015"
    ],
    "quorum_leader_name": "magna015",
    "monmap": {
        "epoch": 7,
        "fsid": "6b95f328-0a99-44ea-82cc-ef5f811868d4",
        "modified": "2016-02-01 10:10:53.734977",
        "created": "0.000000",
        "mons": [
            {
                "rank": 0,
                "name": "magna015",
                "addr": "10.8.128.15:6789\/0"
            }
        ]
    }
}
--------------------------------------------------------------------------------
ubuntu@magna012:~/install/ubuntu/u130/ceph-config$ ceph -s
    cluster 6b95f328-0a99-44ea-82cc-ef5f811868d4
     health HEALTH_OK
     monmap e7: 1 mons at {magna015=10.8.128.15:6789/0}
            election epoch 3, quorum 0 magna015
     osdmap e18189: 12 osds: 12 up, 12 in
      pgmap v325696: 768 pgs, 11 pools, 1098 MB data, 22383 kobjects
            670 GB used, 10441 GB / 11112 GB avail
                 768 active+clean
  client io 7536 B/s rd, 25267 B/s wr, 39 op/s
--------------------------------------------------------------------------------

Attaching the logs of the mon that was added and removed.

Comment 16 Alfredo Deza 2016-02-01 12:21:51 UTC
We are back again to a situation where ceph-deploy depends on a monitor that was added via the ceph command-line tools (ceph-mon in this case), which is supposed to leave a file indicating what system was used to boot the monitor.

In the case of the server under test, that path should be:

    /var/lib/ceph/mon/ceph-magna051/

But there isn't a 'sysvinit' or an 'upstart' file there, and because the server is not running systemd, the tool reports that it can't determine what to use to stop the monitor.

Those files would be placed there when/if the monitor was started using the ceph-mon tool.

This will require a bit of work in ceph-deploy to get right. Possible workarounds are:

* do not add a monitor without ceph-deploy
* 'touch' the needed file to hint at which init system should be used (this sounds like the worst option)
* add documentation on issuing the right command to stop the monitor, because ceph-deploy may not be able to do so

The bottom line is that ceph-deploy needs some work here to handle this properly.
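
To make the detection described above concrete, here is a hedged sketch of marker-file based init detection with an optional fallback; the marker names, paths, and fallback behaviour are illustrative, not the exact ceph-deploy logic:

    import os

    # Hedged sketch: look for an init-system marker file in the mon data dir,
    # as described in this comment; not the exact ceph-deploy implementation.
    def detect_mon_init(mon_dir, remote_default=None):
        for init in ('systemd', 'upstart', 'sysvinit'):
            if os.path.exists(os.path.join(mon_dir, init)):
                return init
        # A manually added mon has none of the marker files, so without a
        # fallback this ends in the "unsupported init system" error.
        if remote_default is not None:
            return remote_default
        raise RuntimeError('unsupported init system detected, cannot continue')

    print(detect_mon_init('/var/lib/ceph/mon/ceph-magna051', remote_default='upstart'))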

Comment 17 Ken Dreyer (Red Hat) 2016-02-01 23:10:24 UTC
(In reply to Alfredo Deza from comment #16)
> * do not add a monitor without ceph-deploy

This is my preference.

Let's update the docs to reflect the new error string. The previous error string was:

  UnboundLocalError: local variable 'status_args' referenced before assignment

New error string:

  unsupported init system detected, cannot continue

The docs currently read:

> The monitor is removed despite the error, however, ceph-deploy fails to
> remove the monitor’s configuration directory located in the
> /var/lib/ceph/mon/ directory. To work around this issue, remove the
> monitor’s directory manually.

Alfredo, if that's correct, let's just go with that for RHCS 1.3.2.

Comment 18 Alfredo Deza 2016-02-02 11:53:18 UTC
Stepping through the code, there is one notable observation: the mon process (daemon) is still hanging around. I'm not sure if this is a problem worth noting.

Adding a fallback, which would mean a small code change *just for this operation*, shouldn't be that hard, and I could have something ready for review today.

Comment 19 Alfredo Deza 2016-02-02 13:04:03 UTC
I decided to go ahead and implement this. The pull request is available here:

https://github.com/ceph/ceph-deploy/pull/385

Once that is merged, we can cherry-pick and cut a new ceph-deploy release.

Comment 20 Alfredo Deza 2016-02-03 18:36:25 UTC
merged commit 379dfd2 into ceph-deploy/master

This involves more than one commit but fixes this problem.

Comment 22 Ken Dreyer (Red Hat) 2016-02-04 20:40:18 UTC
Today in #ceph-devel Sage mentioned that this might have caused a regression on Ubuntu Trusty?

Alfredo, can you please verify that this is really working on Trusty (or create an additional PR if appropriate)?

Comment 23 Ken Dreyer (Red Hat) 2016-02-04 20:45:29 UTC
You're way ahead of me. Looks like https://github.com/ceph/ceph-deploy/pull/386 is what we need? Could you please cherry-pick that to ceph-1.3-rhel-patches in Gerrit?

Comment 24 Ken Dreyer (Red Hat) 2016-02-06 02:47:33 UTC
Today's build has all the necessary patches for this issue.

Comment 26 Hemanth Kumar 2016-02-08 18:10:42 UTC
Works perfectly without any errors.

Moving to verified state.

Comment 28 errata-xmlrpc 2016-02-29 14:44:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:0313