Bug 1490355 - [ceph-ansible] - shrink-osd failing when more than one osd is to be removed
Summary: [ceph-ansible] - shrink-osd failing when more than one osd is to be removed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 3.0
Assignee: Sébastien Han
QA Contact: Vasishta
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-09-11 12:06 UTC by Vasishta
Modified: 2017-12-05 23:42 UTC
CC List: 9 users

Fixed In Version: RHEL: ceph-ansible-3.0.0-0.1.rc8.1.el7cp Ubuntu: ceph-ansible_3.0.0~rc8-2redhat1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-05 23:42:56 UTC
Embargoed:


Attachments
ansible-playbook log (59.78 KB, text/plain) - 2017-09-11 12:06 UTC, Vasishta
ansible-playbook log (296.51 KB, text/plain) - 2017-09-13 11:06 UTC, Vasishta
ansible-playbook log (296.85 KB, text/plain) - 2017-09-13 11:47 UTC, Vasishta


Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-ansible pull 1885 0 None closed shrink-osd: fix when multiple osds 2019-11-25 03:50:57 UTC
Red Hat Product Errata RHBA-2017:3387 0 normal SHIPPED_LIVE Red Hat Ceph Storage 3.0 bug fix and enhancement update 2017-12-06 03:03:45 UTC

Description Vasishta 2017-09-11 12:06:45 UTC
Created attachment 1324431 [details]
File contains contents of the ansible-playbook log

Description of problem:
When more than one OSD is to be removed, the operation works only on the first OSD. The others only appear to be marked out.

Version-Release number of selected component (if applicable):
ceph-ansible-3.0.0-0.1.rc6.el7cp.noarch

How reproducible:
Always (2/2)

Steps to Reproduce:
Run ansible-playbook shrink-osd.yml -e osd_to_kill=<OSD-id-1>,<OSD-id-2>

Actual results:
Only the first OSD is removed; the other one is not.

failed: [localhost -> ] (item=[u'7', None]) => {"changed": true, "cmd": ["ceph-disk", "destroy", "--cluster", "12_3a", "--destroy-by-id", "7", "--zap"], "delta": "0:00:00.213187", "end": "2017-09-11 11:46:36.468725", "failed": true, "item": ["7", null], "rc": 1, "start": "2017-09-11 11:46:36.255538", "stderr": "ceph-disk: Error: found no device matching : osd id 7", "stderr_lines": ["ceph-disk: Error: found no device matching : osd id 7"], "stdout": "", "stdout_lines": []}

Expected results:
All the OSDs whose ids were passed to the shrink-osd operation must be removed.

Additional info:
1) In the scenario below, osd 7 and osd 9 were each passed as the second id in the "osd_to_kill" argument in two different runs

# ceph osd tree --cluster 12_3a
ID CLASS WEIGHT  TYPE NAME         STATUS REWEIGHT PRI-AFF 
-1       6.33008 root default                              
-5       2.71289     host magna068                         
 0   hdd 0.90430         osd.0         up  1.00000 1.00000 
 5   hdd 0.90430         osd.5         up  1.00000 1.00000 
 8   hdd 0.90430         osd.8         up  1.00000 1.00000 
-3       2.71289     host magna069                         
 2   hdd 0.90430         osd.2         up  1.00000 1.00000 
 3   hdd 0.90430         osd.3         up  1.00000 1.00000 
 6   hdd 0.90430         osd.6         up  1.00000 1.00000 
-7       0.90430     host magna093                         
 1   hdd 0.90430         osd.1         up  1.00000 1.00000 
-9             0     host magna097                         
 7             0 osd.7                 up        0 1.00000 
 9             0 osd.9                 up        0 1.00000 

# ceph -s --cluster 12_3a
  ...
  services:
    osd: 9 osds: 9 up, 7 in; 3 remapped pgs
  ...

From the log, it seems that for the second osd id onward, Ansible failed to resolve the corresponding OSD node's hostname. In both runs, the two osd ids given for the osd_to_kill argument were on a single node, and the hostname still was not resolved.
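
For context, the playbook works by resolving each osd id to the host it lives on and then delegating the ceph-disk calls to that host. The sketch below only illustrates that pattern; the task names, variable names, lookup mechanism, and structure are assumptions, not the actual shrink-osd.yml contents. If the resolved host list ends up holding only the first OSD's host, or an empty value, every later id is destroyed against the wrong node, which matches the "found no device matching : osd id 7" error above.

# Illustrative sketch only -- names and structure are assumptions,
# not the actual shrink-osd.yml contents.
- name: find the host holding each osd to kill
  command: ceph --cluster {{ cluster }} osd find {{ item }}
  register: find_osd_hosts
  with_items: "{{ osd_to_kill.split(',') }}"

- name: build one host entry per osd id
  set_fact:
    osd_hosts: "{{ osd_hosts | default([]) + [(item.stdout | from_json).crush_location.host] }}"
  with_items: "{{ find_osd_hosts.results }}"

- name: destroy each osd on its own host
  command: ceph-disk destroy --cluster {{ cluster }} --destroy-by-id {{ item.0 }} --zap
  with_together:
    - "{{ osd_to_kill.split(',') }}"
    - "{{ osd_hosts }}"
  delegate_to: "{{ item.1 }}"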

Comment 2 seb 2017-09-12 23:58:18 UTC
Found the bug; working on a fix.

Comment 3 seb 2017-09-13 04:22:23 UTC
Please let me know if https://github.com/ceph/ceph-ansible/pull/1885/commits fixes your issue. Thanks!

Comment 4 Vasishta 2017-09-13 11:06:53 UTC
Created attachment 1325313 [details]
File contains contents of the ansible-playbook log

Hi Sebastien,

The 'shrink-osd' branch didn't work for me. I have attached the ansible log (verbose mode enabled).
 
This time, removing both OSDs failed; I had tried OSDs on two different nodes. The playbook searched for osd_id_1 on the node where osd_id_2 was running, so the task 'deactivating osd(s)' failed for osd_id_1, but the task 'set osd(s) out when ceph-disk deactivating fail' succeeded for it because the command was executed from magna097. The playbook then failed on osd_id_2 with 'dict object' has no attribute 'stderr' in the task 'set osd(s) out when ceph-disk deactivating fail'.
So in the end osd_id_1 was only marked out, and neither OSD was removed.
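
For illustration, the 'dict object' has no attribute 'stderr' message is typically what Ansible produces when a when/failed_when condition references item.stderr on a registered per-item result that never produced a stderr field (for example, an item that was skipped). A minimal sketch of that pattern, with purely illustrative names rather than the actual playbook text:

# Illustrative sketch only -- names and structure are assumptions,
# not the actual shrink-osd.yml contents.
- name: deactivating osd(s)
  command: ceph-disk deactivate --cluster {{ cluster }} --deactivate-by-id {{ item.0 }}
  register: deactivate
  ignore_errors: yes
  with_together:
    - "{{ osd_to_kill.split(',') }}"
    - "{{ osd_hosts }}"
  delegate_to: "{{ item.1 }}"

- name: set osd(s) out when ceph-disk deactivating fail
  command: ceph --cluster {{ cluster }} osd out osd.{{ item.item.0 }}
  with_items: "{{ deactivate.results }}"
  # If an entry in deactivate.results has no 'stderr' key (e.g. the item was
  # skipped or never ran the command), evaluating this condition fails with
  # "'dict object' has no attribute 'stderr'".
  when: "'ERROR' in item.stderr"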

Regards,
Vasishta

Comment 5 Vasishta 2017-09-13 11:47:09 UTC
Created attachment 1325347 [details]
File contains contents of the ansible-playbook log

I tried removing two OSDs on a single node. This time too, osd_id_2 was not searched for on the appropriate host; the playbook used 'localhost' instead.
osd_id_1 was searched for on the appropriate node, and its osd service was stopped as part of the task [deactivating osd(s)].

Regards,
Vasishta

Comment 6 seb 2017-09-13 21:22:20 UTC
Ok I have the right fix now. Moving this to POST soon, but feel free to test https://github.com/ceph/ceph-ansible/pull/1885 one more time.

Thanks!

Comment 11 errata-xmlrpc 2017-12-05 23:42:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3387

