Bug 1564117 - [ceph-ansible] [ceph-container] : shrink mon - monitor is not removed from the cluster
Status: ASSIGNED
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Ceph-Ansible
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: z2
Target Release: 3.1
Assignee: Guillaume Abrioux
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Keywords:
Depends On:
Blocks: 1557269
Reported: 2018-04-05 12:22 UTC by Vasishta
Modified: 2019-05-31 02:24 UTC (History)

Doc Text:
.Sometimes the `shrink-mon` Ansible playbook fails to remove a monitor from the monmap

The `shrink-mon` Ansible playbook sometimes fails to remove a monitor from the monmap even though the playbook completes its run successfully. The cluster status then shows the monitor that was meant to be removed as down. To work around this issue, run the `shrink-mon` playbook again against the same monitor, or remove the monitor from the monmap manually.
Clone Of:
Last Closed:



Description Vasishta 2018-04-05 12:22:05 UTC
Description of problem:
The monitor is not removed from the cluster even though shrink-mon completes successfully. Only the monitor service was observed to be down.

Version-Release number of selected component (if applicable):
ceph-ansible-3.0.28-1.el7cp.noarch

How reproducible:
Always [1]

Steps to Reproduce:
1. Configure a containerized cluster with at least 3 monitors.
2. Remove one monitor using shrink-mon playbook


Actual results:
The monitor is not removed even though the shrink-mon playbook completes its run successfully.

Expected results:
The monitor should be removed from the cluster.

Additional info:
As a workaround, the shrink-mon playbook can be run again against the same monitor.

[1] - It was observed that after the same mon was added back and a shrink was attempted again on any mon, the monitor was removed from the cluster. Based on these observations, the behavior might not repeat on the same cluster.
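The workaround can be sketched as a small retry loop. This is only an illustration, not what ceph-ansible does: the command has the same form as the one the playbook logged in this report (`docker exec <container> ceph --cluster <cluster> mon remove <mon>`), while the function names, the `still_present` callback, and the retry and delay values are hypothetical.

```python
import subprocess
import time


def remove_cmd(container, cluster, mon_name):
    """Build the removal command in the form the playbook logged."""
    return ["docker", "exec", container,
            "ceph", "--cluster", cluster, "mon", "remove", mon_name]


def remove_with_retry(container, cluster, mon_name, still_present,
                      attempts=3, delay=5, run=subprocess.check_call):
    """Re-issue the removal until still_present() reports False.

    still_present is a caller-supplied callable (hypothetical) that
    returns True while the monitor is still listed in the monmap.
    """
    for _ in range(attempts):
        run(remove_cmd(container, cluster, mon_name))
        # Give the remaining monitors a moment to commit the new monmap.
        time.sleep(delay)
        if not still_present():
            return True
    return False
```

The `run` parameter exists only so the loop can be exercised without a live cluster; by default it raises if the `docker exec` call returns a non-zero exit code.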

CLI log snippet -

TASK [remove monitor from the quorum] 

changed: [localhost -> argo022] => {"changed": true, "cmd": ["docker", "exec", "ceph-mon-argo022", "ceph", "--cluster", "ceph1", "mon", "remove", "argo021"], "delta": "0:00:00.296740", "end": "2018-04-05 11:30:59.183353", "failed_when_result": false, "rc": 0, "start": "2018-04-05 11:30:58.886613", "stderr": "removing mon.argo021 at 10.8.128.221:6789/0, there will be 2 monitors", "stderr_lines": ["removing mon.argo021 at 10.8.128.221:6789/0, there will be 2 monitors"], "stdout": "", "stdout_lines": []}

TASK [verify the monitor is out of the cluster] 

changed: [localhost -> argo022] => {"attempts": 2, "changed": true, "cmd": "docker exec ceph-mon-argo022 ceph --cluster ceph1 -s -f json | python -c 'import sys, json; print(json.load(sys.stdin)[\"quorum_names\"])'", "delta": "0:00:02.267049", "end": "2018-04-05 11:31:12.785793", "failed_when_result": false, "rc": 0, "start": "2018-04-05 11:31:10.518744", "stderr": "", "stderr_lines": [], "stdout": "[u'argo020', u'argo022']", "stdout_lines": ["[u'argo020', u'argo022']"]}

$ sudo docker exec ceph-mon-argo020 ceph -s --cluster ceph1
  cluster:
    id:     9e97c343-8e0e-4499-9b02-21221646dbbf
    health: HEALTH_WARN
            1/3 mons down, quorum argo020,argo022
 
  services:
    mon: 3 daemons, quorum argo020,argo022, out of quorum: argo021
    mgr: argo020(active)
    osd: 8 osds: 8 up, 8 in
 
  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 bytes
    usage:   858 MB used, 11802 GB / 11803 GB avail
    pgs:
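
The "verify" task in the log above prints `quorum_names`, which lists only the monitors currently in quorum. A monitor that is down but still in the monmap does not appear there, so that check can pass even when the removal did not take effect, as the `ceph -s` output shows. A minimal sketch of a stricter check follows, assuming the JSON from `ceph mon dump -f json` carries a `mons` list whose entries have a `name` field; the helper names are hypothetical and not part of ceph-ansible.

```python
import json


def monmap_names(mon_dump_json):
    """Return the monitor names listed in a `ceph mon dump -f json` payload."""
    monmap = json.loads(mon_dump_json)
    # Assumption: each entry under "mons" has a "name" key.
    return [mon["name"] for mon in monmap.get("mons", [])]


def is_removed(mon_dump_json, mon_name):
    """True only if mon_name no longer appears in the monmap."""
    return mon_name not in monmap_names(mon_dump_json)


if __name__ == "__main__":
    # Hand-written sample shaped like the cluster in this report, where
    # argo021 is out of quorum but still present in the monmap.
    sample = json.dumps({
        "mons": [
            {"rank": 0, "name": "argo020"},
            {"rank": 1, "name": "argo021"},
            {"rank": 2, "name": "argo022"},
        ]
    })
    print(is_removed(sample, "argo021"))
```

Checking monmap membership rather than quorum membership separates "removed" from "merely down", which is the distinction this bug turns on.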

Comment 5 Guillaume Abrioux 2018-04-12 09:24:27 UTC
I can reproduce this issue but it seems this occurs randomly.
In any case, the playbook runs fine and the command to remove the mon from the cluster replies correctly [1], but curiously the cluster sometimes keeps the shrunk mon in the monmap and reports it as 'down'.
As Vasishta mentioned, manually launching the same command as the one in the playbook actually removes the mon.

I'm still trying to figure out what's wrong, not sure yet whether this is something at ceph-ansible level.

[1] "removing mon.argo021 at 10.8.128.221:6789/0, there will be 2 monitors"

Comment 7 seb 2018-07-27 12:41:52 UTC
What's the target for this? Drew?

