Bug 1299997 - not clear how to recover from failure of "Create Cluster" operation
Summary: not clear how to recover from failure of "Create Cluster" operation
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Storage Console
Classification: Red Hat Storage
Component: core
Version: 2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 2
Assignee: Shubhendu Tripathi
QA Contact: sds-qe-bugs
URL:
Whiteboard:
Depends On: 1349458
Blocks: Console-2-DevFreeze
 
Reported: 2016-01-19 17:24 UTC by Martin Bukatovic
Modified: 2016-07-29 15:34 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-19 11:20:47 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1296172 0 unspecified CLOSED task details not available 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1296464 0 high CLOSED Create Cluster: ceph-disk activate-all fails in non deterministic way because of missing device file 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1296985 0 unspecified CLOSED task order 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1343677 0 unspecified CLOSED Unassigned host cannot be used in new cluster 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1361620 0 unspecified CLOSED RFE possibility to stop a task 2021-02-22 00:41:40 UTC


Description Martin Bukatovic 2016-01-19 17:24:29 UTC
Description of problem
======================

When "Create Cluster" operation fails, it's not clear how to recover from
this state, eg. How to retry or continue? How to check the current state
of the hosts? How to restore to a previous state if possible?

Version-Release number of selected component
============================================

Console server:

~~~
# rpm -qa 'rhscon*'
rhscon-core-0.0.6-0.1.alpha1.el7.x86_64
rhscon-ui-0.0.6-0.1.alpha1.el7.noarch
rhscon-ceph-0.0.4-0.1.alpha1.el7.x86_64
~~~

Storage nodes:

~~~
# rpm -qa 'rhscon*'
rhscon-agent-0.0.3-0.2.alpha1.el7.noarch
~~~

How reproducible
================

100 % (assuming that the "Create Cluster" operation fails)

Steps to Reproduce
==================

1. Install skyring on the server and prepare a few hosts for cluster setup
2. Accept all nodes
3. Use the Create Cluster wizard to create a new cluster in such a way
   that the operation fails
   
Note: To make cluster creation fail as required in step #3, I don't need to
do anything special right now, as it consistently fails for me because of
BZ 1296464. Once that BZ is fixed, one would need to *make it fail on
purpose* (see the sketch below for one possible way).
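
One possible way to force such a failure (once BZ 1296464 is fixed) might be
to break the console's channel to one of the accepted nodes before running
the wizard. This is only a sketch; the salt-minion service name is an
assumption based on how the rhscon-agent setup bootstraps nodes:

~~~
# On one of the accepted storage nodes, *before* starting the wizard
# (hypothetical; assumes the node is managed via salt-minion):
systemctl stop salt-minion

# Alternatively, simply power the node off while the task is running:
poweroff
~~~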

Actual results
==============

When "Create Cluster" task fails, the failure is reported on the "Task" page.
Particular task item on "Tasks" page has a red icon and it's status reads
"Failed".

So far so good.

Note that there are a few related BZs which I'm not going to discuss here:

 * BZ 1296172 - it's not possible to click the details button of a task,
   so no further details can be reached from the web interface
 * BZ 1296985 - the order of task items on the Tasks page sometimes seems
   random, which makes it harder to notice or reach information about the
   failure

The problem is that one doesn't know what state each machine ended up in and
what can be done about it. When I check the "Hosts" page, I see no errors or
details reported there.

I can't simply rerun the Create Cluster wizard, because I no longer see any
machines available.

For example, in my case setup of the 1st node of the cluster succeeded, but
setup failed on the 2nd node, so that I have:

 * 1st node configured properly
 * 2nd node stuck in the middle of the setup, where the failure occurred
 * 3rd and 4th nodes untouched

I can't find this information via the console UI. With just 5 test virtual
machines it's easy to check the state myself and redeploy, but what is an
admin deploying a 100-node cluster supposed to do when setup fails somewhere
in the middle?
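
For the record, a manual check over SSH is roughly what an admin is left with
today. This is a rough sketch only; the node names, package names and service
units below are assumptions for a Ceph-based deployment, not something the
console provides:

~~~
# Check each node that was part of the failed Create Cluster task
for node in node1 node2 node3 node4; do
    echo "=== $node ==="
    ssh "$node" 'rpm -q ceph-mon ceph-osd;
                 systemctl is-active ceph-mon.target ceph-osd.target;
                 ls /var/lib/ceph/osd/ 2>/dev/null'
done

# On a node where the setup succeeded, cluster health can be checked with:
#   ceph -s
~~~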

Expected results
================

When "Create Cluster" task fails, the failure is reported on the "Task" page.
Particular task item on "Tasks" page has a red icon and it's status reads
"Failed".

Additional details of the state of the machines are provided via the skyring
web UI. An overview classifying hosts into particular states might be useful,
so that it's clear which nodes were configured successfully and which ones
failed. Moreover, based on the actual state of the cluster, it should be
possible to either:

 * retry or continue the Create Cluster operation
 * restore the original state (I'm not sure if this makes sense)
 * do something else which would be appropriate

Comment 3 Ju Lim 2016-06-17 13:23:42 UTC
We will need to document what to do when this happens for the 1.0 release, for supportability.

Comment 4 Shubhendu Tripathi 2016-07-11 06:09:47 UTC
Doc BZ#1349458 has been created to document the troubleshooting section in the admin guide.

