Bug 1333399 - OSD addition failed for [HOST1:map[/dev/vdc:/dev/vdb] HOST2:map[/dev/vdc:/dev/vdb]]
Summary: OSD addition failed for [HOST1:map[/dev/vdc:/dev/vdb] HOST2:map[/dev/vdc:/dev...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Storage Console
Classification: Red Hat Storage
Component: core
Version: 2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Nishanth Thomas
QA Contact: sds-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-05-05 12:21 UTC by Daniel Horák
Modified: 2017-03-23 04:04 UTC (History)
CC: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-23 04:04:03 UTC
Embargoed:


Attachments
Create cluster task (126.36 KB, image/png)
2016-05-05 12:21 UTC, Daniel Horák
Task Create Cluster 1 (106.48 KB, image/png)
2016-05-09 11:31 UTC, Daniel Horák
Task Create Cluster 2 (125.88 KB, image/png)
2016-05-09 11:31 UTC, Daniel Horák
POST request data (3.21 KB, text/plain)
2016-05-09 11:32 UTC, Daniel Horák
Cluster OSDs detail page (89.94 KB, image/png)
2016-05-09 11:35 UTC, Daniel Horák

Description Daniel Horák 2016-05-05 12:21:29 UTC
Created attachment 1154187 [details]
Create cluster task

Description of problem:
  OSD addition fails in the cluster creation task for some nodes/devices without a clear reason.
  The cluster task nevertheless ends as "Success".

Version-Release number of selected component (if applicable):
  ceph-ansible-1.0.5-5.el7scon.noarch
  ceph-base-10.2.0-1.el7cp.x86_64
  ceph-common-10.2.0-1.el7cp.x86_64
  ceph-installer-1.0.6-1.el7scon.noarch
  ceph-osd-10.2.0-1.el7cp.x86_64
  ceph-selinux-10.2.0-1.el7cp.x86_64
  rhscon-agent-0.0.5-1.el7scon.noarch
  rhscon-ceph-0.0.12-1.el7scon.x86_64
  rhscon-core-0.0.15-1.el7scon.x86_64
  rhscon-ui-0.0.28-1.el7scon.noarch
  salt-2015.5.5-1.el7.noarch
  salt-master-2015.5.5-1.el7.noarch
  salt-minion-2015.5.5-1.el7.noarch

How reproducible:
  Often on some types of deployments. For me it's usually on VMs with more spare disks, but not only there.

Steps to Reproduce:
1. Prepare nodes for a cluster with multiple spare disks (according to the documentation).
2. Create cluster.
3. Check the details of the task related to the cluster creation.

Actual results:
  The subtask contains the error:

    OSD addition failed for [host1.example.com:map[/dev/vdc:/dev/vdb] host2.example.com:map[/dev/vdc:/dev/vdb]]

  And some OSDs were not properly added, but the task result is 'Success' (see the attachment).

Expected results:
  All the selected OSDs are properly created and added.
  Any problem is properly described for easier debugging.
  If there is any problem, the task result should not be 'Success'.

Additional info:
  See the attachment with all logs and information from all the nodes.
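
  Note (illustration): the bracketed part of the error looks like Go's default
  formatting of a nested map (rhscon-core is a Go project). A minimal,
  hypothetical sketch of how such a message could be produced; the variable
  names and the exact data structure are assumptions, not the actual
  rhscon-core code:

    package main

    import "fmt"

    func main() {
        // Assumed shape of the data behind the message: for each host,
        // a map of data disk -> journal device.
        failed := map[string]map[string]string{
            "host1.example.com": {"/dev/vdc": "/dev/vdb"},
            "host2.example.com": {"/dev/vdc": "/dev/vdb"},
        }
        // fmt's default %v formatting of nested maps produces the
        // "host:map[/dev/vdc:/dev/vdb]" fragments seen in the error above.
        fmt.Printf("OSD addition failed for %v\n", failed)
    }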

Comment 3 Nishanth Thomas 2016-05-07 12:48:18 UTC
Returning the task status as success is by design. The idea is that if at least one monitor is successfully added, the cluster is created. If OSD creation fails, it will be reported, but the task status will still be success. Think of a scenario where, out of 50 OSDs, a few fail: do you want to fail the whole cluster creation?

As we talked about the different cases where OSD creation fails, I am not sure this belongs to one of those cases. If you can provide a setup where this is reproducible, it will be a great help. Also, we need /var/log/messages to look at the ceph-installer log messages and see the cause of the failure.
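
For illustration only, a minimal sketch of the status logic described above; the function and variable names are hypothetical and this is not the actual rhscon-core code:

    package main

    import "fmt"

    // clusterCreateStatus sketches the behaviour described above: the task is
    // marked "Success" as soon as at least one monitor is added; OSD failures
    // are only appended to the task's messages.
    func clusterCreateStatus(monsAdded int, failedOSDs map[string]map[string]string) (string, []string) {
        if monsAdded == 0 {
            return "Failed", []string{"no monitors could be added"}
        }
        var messages []string
        if len(failedOSDs) > 0 {
            messages = append(messages, fmt.Sprintf("OSD addition failed for %v", failedOSDs))
        }
        return "Success", messages
    }

    func main() {
        status, msgs := clusterCreateStatus(3, map[string]map[string]string{
            "host1.example.com": {"/dev/vdc": "/dev/vdb"},
        })
        fmt.Println(status, msgs)
    }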

Comment 4 Martin Kudlej 2016-05-09 06:48:15 UTC
I think it should be obvious to the user (in the UI) or in the REST API that the cluster creation has not failed in general, but that some sub-tasks failed, e.g. an OSD was not created on one node. I think a third state should be introduced, and it should be clearly visible in the UI that something has gone wrong.

Comment 5 Daniel Horák 2016-05-09 07:39:00 UTC
(In reply to Nishanth Thomas from comment #3)
> Returning task status as success is by design. The idea is that if at least
> one monitor is successfully added the cluster is created. if OSD creation
> fails it will be reported but task status still be success. Think of a
> scenario where out of 50 OSDs if fails , do you want to fail the cluster
> creation?

If one, two or twenty OSDs fail, it's definitely not a success, and the same applies to MONs. Here are a few questions/notes:

1) The user should be clearly notified that something failed/is broken.

2) Is it possible (for the user) to discover the root cause of why the OSD/MON creation fails (to be able to fix it)?

3) Is it possible to recover from this state (relaunch the OSD/MON creation on the failed tasks)?

4) My opinion is that if the answer to the two previous questions is yes (it is possible for the user to fix the issue and relaunch the failed process), then the result state can be something like "WARNING". If it is not possible to recover from the failed state, then the task result should be FAILED/ERROR (maybe with some explanation that the cluster is created but in a disabled/limited state).
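
For illustration, a minimal sketch of the three-state task result proposed here and in comment 4; the type and function names are hypothetical, not part of the rhscon API:

    package main

    import "fmt"

    // TaskResult is a hypothetical three-state task result.
    type TaskResult int

    const (
        TaskFailed  TaskResult = iota // unrecoverable failure, or nothing usable was created
        TaskWarning                   // cluster created, but some sub-tasks failed and can be retried
        TaskSuccess                   // everything requested was created
    )

    // resolveResult maps sub-task outcomes to an overall result following the
    // rule suggested above: recoverable partial failures become TaskWarning,
    // unrecoverable ones become TaskFailed.
    func resolveResult(monsAdded, osdsRequested, osdsAdded int, recoverable bool) TaskResult {
        switch {
        case monsAdded == 0:
            return TaskFailed
        case osdsAdded < osdsRequested && recoverable:
            return TaskWarning
        case osdsAdded < osdsRequested:
            return TaskFailed
        default:
            return TaskSuccess
        }
    }

    func main() {
        fmt.Println(resolveResult(3, 4, 2, true)) // prints 1 (TaskWarning)
    }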

Comment 8 Daniel Horák 2016-05-09 11:31:05 UTC
Created attachment 1155249 [details]
Task Create Cluster 1

Comment 9 Daniel Horák 2016-05-09 11:31:31 UTC
Created attachment 1155250 [details]
Task Create Cluster 2

Comment 10 Daniel Horák 2016-05-09 11:32:06 UTC
Created attachment 1155251 [details]
POST request data

Comment 11 Daniel Horák 2016-05-09 11:35:59 UTC
Created attachment 1155253 [details]
Cluster OSDs detail page

Comment 13 Nishanth Thomas 2016-05-11 06:00:38 UTC
As discussed yesterday, the requirement from QE is to have a separate state for partially successful tasks. As we all agreed, it is substantial work and it is a risk to bring this in so late in the cycle, with feature freeze a couple of weeks away. So we decided to move this bug to the next release.

Comment 14 Nishanth Thomas 2016-05-11 06:01:54 UTC
Daniel, please raise any specific issues separately. Moving this out of 2.0.

