Description of problem:
Monitors do not get removed even though the shrink-mon playbook completes successfully. It was observed that only the monitor service was down.

Version-Release number of selected component (if applicable):
ceph-ansible-3.0.28-1.el7cp.noarch

How reproducible:
Always [1]

Steps to Reproduce:
1. Configure a containerized cluster with at least 3 monitors.
2. Remove one monitor using the shrink-mon playbook.

Actual results:
The monitor is not removed, although the shrink-mon playbook run completes successfully.

Expected results:
The mon should be removed from the cluster.

Additional info:
As a workaround, the shrink-mon playbook can be initiated again with the same monitor to be killed (see the command sketch after the log snippet below).

[1] - It was observed that when the same mon was added back and another shrink was attempted, the monitor was removed from the cluster. Based on these observations, the behavior might not reproduce again on the same cluster.

CLI log snippet -

TASK [remove monitor from the quorum]
changed: [localhost -> argo022] => {"changed": true, "cmd": ["docker", "exec", "ceph-mon-argo022", "ceph", "--cluster", "ceph1", "mon", "remove", "argo021"], "delta": "0:00:00.296740", "end": "2018-04-05 11:30:59.183353", "failed_when_result": false, "rc": 0, "start": "2018-04-05 11:30:58.886613", "stderr": "removing mon.argo021 at 10.8.128.221:6789/0, there will be 2 monitors", "stderr_lines": ["removing mon.argo021 at 10.8.128.221:6789/0, there will be 2 monitors"], "stdout": "", "stdout_lines": []}

TASK [verify the monitor is out of the cluster]
changed: [localhost -> argo022] => {"attempts": 2, "changed": true, "cmd": "docker exec ceph-mon-argo022 ceph --cluster ceph1 -s -f json | python -c 'import sys, json; print(json.load(sys.stdin)[\"quorum_names\"])'", "delta": "0:00:02.267049", "end": "2018-04-05 11:31:12.785793", "failed_when_result": false, "rc": 0, "start": "2018-04-05 11:31:10.518744", "stderr": "", "stderr_lines": [], "stdout": "[u'argo020', u'argo022']", "stdout_lines": ["[u'argo020', u'argo022']"]}

$ sudo docker exec ceph-mon-argo020 ceph -s --cluster ceph1
  cluster:
    id:     9e97c343-8e0e-4499-9b02-21221646dbbf
    health: HEALTH_WARN
            1/3 mons down, quorum argo020,argo022

  services:
    mon: 3 daemons, quorum argo020,argo022, out of quorum: argo021
    mgr: argo020(active)
    osd: 8 osds: 8 up, 8 in

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 bytes
    usage:   858 MB used, 11802 GB / 11803 GB avail
    pgs:
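For reference, a minimal sketch of the reproduction and workaround commands. The ceph-ansible install path, the mon_to_kill variable, and the ireallymeanit confirmation variable are assumptions based on the version noted above; host names and the container name are taken from the log snippet, and the cluster name ceph1 is assumed to already be set in group_vars.

# 1. Remove one monitor (argo021) using the shrink-mon playbook shipped with
#    ceph-ansible (copy it to the top-level dir so roles resolve):
cd /usr/share/ceph-ansible
cp infrastructure-playbooks/shrink-mon.yml .
ansible-playbook shrink-mon.yml -e mon_to_kill=argo021 -e ireallymeanit=yes

# 2. Verify quorum from a surviving monitor container; argo021 should be gone:
sudo docker exec ceph-mon-argo022 ceph --cluster ceph1 -s -f json | \
    python -c 'import sys, json; print(json.load(sys.stdin)["quorum_names"])'

# 3. Workaround observed here: re-run the playbook with the same mon_to_kill,
#    or remove the stale monitor manually:
sudo docker exec ceph-mon-argo022 ceph --cluster ceph1 mon remove argo021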
I can reproduce this issue, but it seems to occur randomly. In any case, the playbook runs fine and the command to remove the mon from the cluster replies correctly [1], but curiously, the cluster sometimes keeps the shrunk mon in the monmap and reports it as 'down'. As Vasishta mentioned, manually running the same command as the one in the playbook actually removes the mon. I'm still trying to figure out what's wrong; I'm not sure yet whether this is something at the ceph-ansible level.

[1] "removing mon.argo021 at 10.8.128.221:6789/0, there will be 2 monitors"
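A minimal sketch of the manual check and removal mentioned above, assuming the container and host names from the log in this report:

# The shrunk mon still appears in the monmap and is reported as down:
sudo docker exec ceph-mon-argo022 ceph --cluster ceph1 mon dump

# Running the same command as the playbook, by hand, removes it:
sudo docker exec ceph-mon-argo022 ceph --cluster ceph1 mon remove argo021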
What's the target for this? Drew?
Updating the QA Contact to Hemant. Hemant will reroute these to the appropriate QE Associate. Regards, Giri