Bug 1564117 - [ceph-ansible] [ceph-container] : shrink mon - monitor is not removed from the cluster
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Ceph-Ansible
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: z2
Target Release: 3.3
Assignee: Guillaume Abrioux
QA Contact: Vasishta
Depends On:
Blocks: 1557269
Reported: 2018-04-05 12:22 UTC by Vasishta
Modified: 2019-10-03 13:55 UTC

Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Sometimes the `shrink-mon` Ansible playbook fails to remove a monitor from the monmap
The `shrink-mon` Ansible playbook sometimes fails to remove a monitor from the monmap even though the playbook completes its run successfully. The cluster status shows the monitor intended to be deleted as down. To work around this issue, launch the `shrink-mon` playbook again to remove the same monitor, or remove the monitor from the monmap manually.
Clone Of:
Last Closed: 2019-10-03 13:55:22 UTC
Target Upstream Version:


Description Vasishta 2018-04-05 12:22:05 UTC
Description of problem:
The monitor does not get removed even though the shrink-mon playbook completes successfully. It was observed that only the monitor service was down.

Version-Release number of selected component (if applicable):

How reproducible:
Always [1]

Steps to Reproduce:
1. Configure a containerized cluster with at least 3 monitors.
2. Remove one monitor using the shrink-mon playbook.

Actual results:
The monitor does not get removed even though the shrink-mon playbook completes its run successfully.

Expected results:
The monitor should be removed from the cluster.

Additional info:
As a workaround, the shrink-mon playbook can be initiated again with the same monitor to be killed.

[1] - It was observed that after the same mon was added back, shrinking any mon removed the monitor from the cluster as expected. Based on these observations, this behavior might not be reproducible on the same cluster.
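For reference, a hedged sketch of the two workaround paths described above. The manual removal command is taken verbatim from this report's log; the playbook path and the `mon_to_kill` variable follow ceph-ansible's shrink-mon playbook conventions and may differ in your version, so treat them as assumptions and adjust host names and the cluster name for your environment.

```shell
# Option 1: re-run the shrink-mon playbook against the same monitor.
# (mon_to_kill is the variable the shrink-mon playbook expects; path assumed.)
ansible-playbook infrastructure-playbooks/shrink-mon.yml -e mon_to_kill=argo021

# Option 2: remove the monitor from the monmap manually, running the same
# command the playbook runs, via a surviving mon container:
docker exec ceph-mon-argo022 ceph --cluster ceph1 mon remove argo021
```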

CLI log snippet -

TASK [remove monitor from the quorum] 

changed: [localhost -> argo022] => {"changed": true, "cmd": ["docker", "exec", "ceph-mon-argo022", "ceph", "--cluster", "ceph1", "mon", "remove", "argo021"], "delta": "0:00:00.296740", "end": "2018-04-05 11:30:59.183353", "failed_when_result": false, "rc": 0, "start": "2018-04-05 11:30:58.886613", "stderr": "removing mon.argo021 at, there will be 2 monitors", "stderr_lines": ["removing mon.argo021 at, there will be 2 monitors"], "stdout": "", "stdout_lines": []}

TASK [verify the monitor is out of the cluster] 

changed: [localhost -> argo022] => {"attempts": 2, "changed": true, "cmd": "docker exec ceph-mon-argo022 ceph --cluster ceph1 -s -f json | python -c 'import sys, json; print(json.load(sys.stdin)[\"quorum_names\"])'", "delta": "0:00:02.267049", "end": "2018-04-05 11:31:12.785793", "failed_when_result": false, "rc": 0, "start": "2018-04-05 11:31:10.518744", "stderr": "", "stderr_lines": [], "stdout": "[u'argo020', u'argo022']", "stdout_lines": ["[u'argo020', u'argo022']"]}
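The verify task above pipes `ceph -s -f json` into a one-line Python filter over the `quorum_names` field. A minimal standalone sketch of the same check (the sample JSON is abridged from the cluster status shown in this report; a real run would feed in the full `ceph -s -f json` output):

```python
import json

def mon_in_quorum(status_json, mon_name):
    """Return True if mon_name appears in quorum_names of `ceph -s -f json` output."""
    status = json.loads(status_json)
    return mon_name in status["quorum_names"]

# Abridged status after the shrink: argo021 is out of quorum but, per this
# bug, still present in the monmap (the cluster reports it as down).
sample = '{"quorum_names": ["argo020", "argo022"]}'
print(mon_in_quorum(sample, "argo021"))  # False: out of quorum
print(mon_in_quorum(sample, "argo022"))  # True: still in quorum
```

Note that this check only proves the monitor left the quorum; as this bug shows, the monitor can still remain in the monmap, so a stricter verification would also inspect the monmap itself (for example, the `monmap` section of `ceph -s -f json`).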

$ sudo docker exec ceph-mon-argo020 ceph -s --cluster ceph1
    id:     9e97c343-8e0e-4499-9b02-21221646dbbf
    health: HEALTH_WARN
            1/3 mons down, quorum argo020,argo022
    mon: 3 daemons, quorum argo020,argo022, out of quorum: argo021
    mgr: argo020(active)
    osd: 8 osds: 8 up, 8 in
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 bytes
    usage:   858 MB used, 11802 GB / 11803 GB avail

Comment 5 Guillaume Abrioux 2018-04-12 09:24:27 UTC
I can reproduce this issue but it seems this occurs randomly.
In any case, the playbook runs fine and the command to remove the mon from the cluster replies correctly [1], but curiously, sometimes the cluster keeps the shrunk mon in the monmap and reports it as 'down'.
As Vasishta mentioned, manually launching the same command as the one in the playbook actually removes the mon.

I'm still trying to figure out what's wrong, not sure yet whether this is something at ceph-ansible level.

[1] "removing mon.argo021 at, there will be 2 monitors"

Comment 7 seb 2018-07-27 12:41:52 UTC
What's the target for this? Drew?

Comment 8 Giridhar Ramaraju 2019-08-05 13:10:29 UTC
Updating the QA Contact to Hemant. Hemant will reroute these to the appropriate QE Associate.


Comment 9 Giridhar Ramaraju 2019-08-05 13:11:32 UTC
Updating the QA Contact to Hemant. Hemant will reroute these to the appropriate QE Associate.

