Created attachment 1324431 [details]
File contains contents of ansible-playbook log

Description of problem:
When more than one OSD is to be removed, the operation works only on the first OSD. The others only get marked out, it seems.

Version-Release number of selected component (if applicable):
ceph-ansible-3.0.0-0.1.rc6.el7cp.noarch

How reproducible:
Always (2/2)

Steps to Reproduce:
Run ansible-playbook shrink-osd.yml -e osd_to_kill=<OSD-id-1>,<OSD-id-2>

Actual results:
Only the first OSD is removed; the other one is not.

failed: [localhost -> ] (item=[u'7', None]) => {"changed": true, "cmd": ["ceph-disk", "destroy", "--cluster", "12_3a", "--destroy-by-id", "7", "--zap"], "delta": "0:00:00.213187", "end": "2017-09-11 11:46:36.468725", "failed": true, "item": ["7", null], "rc": 1, "start": "2017-09-11 11:46:36.255538", "stderr": "ceph-disk: Error: found no device matching : osd id 7", "stderr_lines": ["ceph-disk: Error: found no device matching : osd id 7"], "stdout": "", "stdout_lines": []}

Expected results:
All the OSDs whose ids were passed to the shrink-osd operation must be removed.

Additional info:
1) Both osd 7 and osd 9 in the scenario below were passed as the second argument to "osd_to_kill" in two different runs.

# ceph osd tree --cluster 12_3a
ID CLASS WEIGHT  TYPE NAME          STATUS REWEIGHT PRI-AFF
-1       6.33008 root default
-5       2.71289     host magna068
 0   hdd 0.90430         osd.0          up  1.00000 1.00000
 5   hdd 0.90430         osd.5          up  1.00000 1.00000
 8   hdd 0.90430         osd.8          up  1.00000 1.00000
-3       2.71289     host magna069
 2   hdd 0.90430         osd.2          up  1.00000 1.00000
 3   hdd 0.90430         osd.3          up  1.00000 1.00000
 6   hdd 0.90430         osd.6          up  1.00000 1.00000
-7       0.90430     host magna093
 1   hdd 0.90430         osd.1          up  1.00000 1.00000
-9             0     host magna097
 7             0         osd.7          up        0 1.00000
 9             0         osd.9          up        0 1.00000

# ceph -s --cluster 12_3a
----------
  services:
---------
    osd: 9 osds: 9 up, 7 in; 3 remapped pgs
---------

Judging from the log, from the second osd-id onward ansible fails to recognise the corresponding OSD node's hostname. On the two different runs both osd-ids given to the osd_to_kill argument were on a single node, and it still failed to recognise the hostname.
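For context, "found no device matching : osd id 7" is what ceph-disk reports when it is invoked on a host that does not actually carry that OSD, which is consistent with the delegation going to the wrong node for every id after the first. Below is a minimal sketch of the direction a fix could take, not the actual shrink-osd.yml code: variable names such as cluster, osd_to_kill and mon_group_name are ceph-ansible-style placeholders, and the crush_location.host field returned by 'ceph osd find' may vary across Ceph releases.

- name: find the host carrying each osd to be killed (sketch)
  command: ceph --cluster {{ cluster }} osd find {{ item }} --format json
  with_items: "{{ osd_to_kill.split(',') }}"
  delegate_to: "{{ groups[mon_group_name][0] }}"
  register: find_osd_hosts

- name: destroy each osd on its own node, not on the first item's node (sketch)
  command: ceph-disk destroy --cluster {{ cluster }} --destroy-by-id {{ item.item }} --zap
  delegate_to: "{{ (item.stdout | from_json).crush_location.host }}"
  with_items: "{{ find_osd_hosts.results }}"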
Found the bug, working on a fix.
Please let me know if https://github.com/ceph/ceph-ansible/pull/1885/commits fixes your issue. Thanks!
Created attachment 1325313 [details]
File contains contents of ansible-playbook log

Hi Sebastien,

The 'shrink-osd' branch didn't work for me. I have attached the ansible log (verbose mode enabled). I observed that this time removing both OSDs failed. This time I had tried OSDs on two different nodes.

The playbook searched for osd_id_1 on the node where osd_id_2 was running, so the task 'deactivating osd(s)' failed for osd_id_1, but the task 'set osd(s) out when ceph-disk deactivating fail' succeeded since the command was executed from magna097. The playbook then failed on osd_id_2 with "'dict object' has no attribute 'stderr'" in the task 'set osd(s) out when ceph-disk deactivating fail'. So in the end osd_id_1 was only marked out and neither OSD was removed.

Regards,
Vasishta
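The "'dict object' has no attribute 'stderr'" failure is the follow-on problem: the fallback task dereferences .stderr on per-item results that never produced one (for example items that were skipped or succeeded). A minimal sketch of a guarded fallback, assuming the deactivate task registered its per-item results under a hypothetical deactivate_results variable and looped over (osd id, device) pairs as the log's item=[u'7', None] suggests; task and variable names are illustrative, not the playbook's actual ones.

- name: set osd(s) out when ceph-disk deactivate failed (sketch)
  # item.item.0 is the osd id from the original (osd id, device) loop pair
  command: ceph --cluster {{ cluster }} osd out osd.{{ item.item.0 }}
  delegate_to: "{{ groups[mon_group_name][0] }}"
  with_items: "{{ deactivate_results.results }}"
  when:
    - item.stderr is defined
    - "'ceph-disk: Error' in item.stderr"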
Created attachment 1325347 [details]
File contains contents of ansible-playbook log

I tried removing two OSDs on a single node. This time too osd_id_2 was not searched for on the appropriate host; the playbook considered 'localhost' instead. osd_id_1, however, was searched for on the appropriate node and its OSD service was stopped as part of the task 'deactivating osd(s)'.

Regards,
Vasishta
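This again points at missing per-item delegation: each osd id needs its own target host, otherwise the lookup falls back to localhost (or to the host resolved for the first id). A minimal sketch of per-item delegation, assuming a hypothetical osd_hosts list holding the resolved hostname for each entry in osd_to_kill; the --deactivate-by-id option mirrors the --destroy-by-id call visible in the log, and all names here are placeholders.

- name: deactivate each osd on the node that actually hosts it (sketch)
  command: ceph-disk deactivate --cluster {{ cluster }} --deactivate-by-id {{ item.0 }}
  delegate_to: "{{ item.1 }}"
  with_together:
    - "{{ osd_to_kill.split(',') }}"
    - "{{ osd_hosts }}"
  register: deactivate_results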
OK, I have the right fix now. Moving this to POST soon, but feel free to test https://github.com/ceph/ceph-ansible/pull/1885 one more time. Thanks!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:3387