Bug 1490355

Summary: [ceph-ansible] - shrink-osd failing when more than one osd is to be removed
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Vasishta <vashastr>
Component: Ceph-Ansible
Assignee: Sébastien Han <shan>
Status: CLOSED ERRATA
QA Contact: Vasishta <vashastr>
Severity: high
Docs Contact:
Priority: high
Version: 3.0
CC: adeza, aschoen, ceph-eng-bugs, gmeno, hnallurv, kdreyer, nthomas, sankarshan, seb
Target Milestone: rc
Target Release: 3.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: RHEL: ceph-ansible-3.0.0-0.1.rc8.1.el7cp Ubuntu: ceph-ansible_3.0.0~rc8-2redhat1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-12-05 23:42:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                                   Flags
File contains contents ansible-playbook log   none
File contains contents ansible-playbook log   none
File contains contents ansible-playbook log   none

Description Vasishta 2017-09-11 12:06:45 UTC
Created attachment 1324431 [details]
File contains contents ansible-playbook log

Description of problem:
When more than one OSD is passed for removal, the operation completes for only the first OSD. The others appear to be only marked out.

Version-Release number of selected component (if applicable):
ceph-ansible-3.0.0-0.1.rc6.el7cp.noarch

How reproducible:
Always (2/2)

Steps to Reproduce:
Run ansible-playbook shrink-osd.yml -e osd_to_kill=<OSD-id-1>,<OSD-id-2>

Actual results:
Only the first OSD is removed; the other one is not.

failed: [localhost -> ] (item=[u'7', None]) => {"changed": true, "cmd": ["ceph-disk", "destroy", "--cluster", "12_3a", "--destroy-by-id", "7", "--zap"], "delta": "0:00:00.213187", "end": "2017-09-11 11:46:36.468725", "failed": true, "item": ["7", null], "rc": 1, "start": "2017-09-11 11:46:36.255538", "stderr": "ceph-disk: Error: found no device matching : osd id 7", "stderr_lines": ["ceph-disk: Error: found no device matching : osd id 7"], "stdout": "", "stdout_lines": []}

Expected results:
All OSDs whose IDs were passed to the shrink-osd operation must be removed.

Additional info:
1) In the scenario below, OSDs 7 and 9 were each passed as the second value of the "osd_to_kill" argument, in two different runs.

# ceph osd tree --cluster 12_3a
ID CLASS WEIGHT  TYPE NAME         STATUS REWEIGHT PRI-AFF 
-1       6.33008 root default                              
-5       2.71289     host magna068                         
 0   hdd 0.90430         osd.0         up  1.00000 1.00000 
 5   hdd 0.90430         osd.5         up  1.00000 1.00000 
 8   hdd 0.90430         osd.8         up  1.00000 1.00000 
-3       2.71289     host magna069                         
 2   hdd 0.90430         osd.2         up  1.00000 1.00000 
 3   hdd 0.90430         osd.3         up  1.00000 1.00000 
 6   hdd 0.90430         osd.6         up  1.00000 1.00000 
-7       0.90430     host magna093                         
 1   hdd 0.90430         osd.1         up  1.00000 1.00000 
-9             0     host magna097                         
 7             0 osd.7                 up        0 1.00000 
 9             0 osd.9                 up        0 1.00000 

# ceph -s --cluster 12_3a
  ----------
 
  services:
  ---------
    osd: 9 osds: 9 up, 7 in; 3 remapped pgs
  ---------

From the log, it seems that from the second OSD ID onward, Ansible failed to recognise the hostname of the corresponding OSD node. In two different runs, both OSD IDs given for the osd_to_kill argument were on a single node, and it still failed to recognise the hostname.
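
As an illustration of the per-OSD host lookup that appears to be going wrong, here is a minimal Python sketch. It is not the playbook's code. It assumes a dict shaped like the output of `ceph osd tree --format json` (a "nodes" list in which host entries list their OSD IDs under "children"); the function names and the sample data are hypothetical.

```python
def parse_osd_ids(osd_to_kill):
    """Split a comma-separated osd_to_kill value such as "7,9" into integer IDs."""
    return [int(part) for part in osd_to_kill.split(",") if part.strip()]


def map_osds_to_hosts(tree):
    """Return {osd_id: hostname} from a dict shaped like `ceph osd tree --format json`."""
    hosts = {}
    for node in tree.get("nodes", []):
        if node.get("type") == "host":
            for child in node.get("children", []):
                hosts[child] = node["name"]
    return hosts


def resolve_hosts(osd_to_kill, tree):
    """Resolve every requested OSD ID to its own host; fail loudly on unknown IDs."""
    hosts = map_osds_to_hosts(tree)
    resolved = {}
    for osd_id in parse_osd_ids(osd_to_kill):
        if osd_id not in hosts:
            raise KeyError(f"osd.{osd_id} not found under any host")
        resolved[osd_id] = hosts[osd_id]
    return resolved


# Hypothetical sample mirroring part of the tree shown above (assumed JSON shape).
sample_tree = {
    "nodes": [
        {"id": -7, "name": "magna093", "type": "host", "children": [1]},
        {"id": -9, "name": "magna097", "type": "host", "children": [7, 9]},
        {"id": 1, "name": "osd.1", "type": "osd"},
        {"id": 7, "name": "osd.7", "type": "osd"},
        {"id": 9, "name": "osd.9", "type": "osd"},
    ]
}

print(resolve_hosts("7,9", sample_tree))  # prints {7: 'magna097', 9: 'magna097'}
```

The point of the sketch is that each ID gets its own lookup, so the second ID cannot inherit the first ID's host or fall back to a default.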

Comment 2 seb 2017-09-12 23:58:18 UTC
Found the bug; working on a fix.

Comment 3 seb 2017-09-13 04:22:23 UTC
Please let me know if https://github.com/ceph/ceph-ansible/pull/1885/commits fixes your issue. Thanks!

Comment 4 Vasishta 2017-09-13 11:06:53 UTC
Created attachment 1325313 [details]
File contains contents ansible-playbook log

Hi Sebastien,

The 'shrink-osd' branch didn't work for me. I have attached the Ansible log (verbose mode enabled).
 
This time, removal failed for both OSDs. I had chosen OSDs on two different nodes. The playbook searched for osd_id_1 on the node where osd_id_2 was running, so the task 'deactivating osd(s)' failed for osd_id_1; the task 'set osd(s) out when ceph-disk deactivating fail' still succeeded for it because the command was executed from magna097. The playbook then failed on osd_id_2 in the task 'set osd(s) out when ceph-disk deactivating fail', reporting that 'dict object' has no attribute 'stderr'.
In the end, osd_id_1 was marked out and neither OSD was removed.

Regards,
Vasishta
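
The "'dict object' has no attribute 'stderr'" error is the kind of failure that occurs when a follow-up task reads a field that some per-item results do not carry. The report does not confirm the exact cause in the playbook, so the following Python sketch only illustrates the defensive pattern. It assumes a registered loop result shaped as a "results" list of per-item dicts, where a skipped item has no "stderr" key; the function name and sample data are hypothetical.

```python
def failed_deactivations(registered):
    """Return the items whose deactivation reported an error on stderr."""
    failed = []
    for result in registered.get("results", []):
        # Skipped items may carry no "stderr" key at all, so default to an
        # empty string instead of indexing the key directly.
        if result.get("stderr", ""):
            failed.append(result["item"])
    return failed


# Hypothetical registered result: one failed item, one skipped item.
registered = {
    "results": [
        {"item": "7", "rc": 1, "stderr": "example error text"},
        {"item": "9", "skipped": True},
    ]
}

print(failed_deactivations(registered))  # prints ['7']
```

Reading the key with a default lets the follow-up step handle failed and skipped items in the same loop without raising.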

Comment 5 Vasishta 2017-09-13 11:47:09 UTC
Created attachment 1325347 [details]
File contains contents ansible-playbook log

I tried removing two OSDs on a single node. This time too, osd_id_2 was not searched for on the appropriate host; the playbook treated its host as 'localhost'.
osd_id_1 was searched for on the appropriate node, and its OSD service was stopped as part of the task [deactivating osd(s)].

Regards,
Vasishta

Comment 6 seb 2017-09-13 21:22:20 UTC
Ok I have the right fix now. Moving this to POST soon, but feel free to test https://github.com/ceph/ceph-ansible/pull/1885 one more time.

Thanks!

Comment 11 errata-xmlrpc 2017-12-05 23:42:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3387