Bug 1299997

Summary: not clear how to recover from failure of "Create Cluster" operation
Product: [Red Hat Storage] Red Hat Storage Console Reporter: Martin Bukatovic <mbukatov>
Component: core Assignee: Shubhendu Tripathi <shtripat>
Status: CLOSED WONTFIX QA Contact: sds-qe-bugs
Severity: medium Docs Contact:
Priority: unspecified    
Version: 2 CC: julim, mkudlej, nthomas, rnachimu, sankarshan, shtripat, vsarmila
Target Milestone: ---   
Target Release: 2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-07-19 11:20:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1349458    
Bug Blocks: 1344195    

Description Martin Bukatovic 2016-01-19 17:24:29 UTC
Description of problem
======================

When the "Create Cluster" operation fails, it's not clear how to recover from
this state, e.g.: How to retry or continue? How to check the current state
of the hosts? How to restore a previous state, if possible?

Version-Release number of selected component
============================================

Console server:

~~~
# rpm -qa 'rhscon*'
rhscon-core-0.0.6-0.1.alpha1.el7.x86_64
rhscon-ui-0.0.6-0.1.alpha1.el7.noarch
rhscon-ceph-0.0.4-0.1.alpha1.el7.x86_64
~~~

Storage nodes:

~~~
# rpm -qa 'rhscon*'
rhscon-agent-0.0.3-0.2.alpha1.el7.noarch
~~~

How reproducible
================

100 % (assuming that the "Create Cluster" operation fails)

Steps to Reproduce
==================

1. Install skyring on the server and prepare a few hosts for cluster setup
2. Accept all nodes
3. Use the Create Cluster wizard to create a new cluster in such a way that
   the operation fails
   
Note: To make cluster creation fail as required in step #3, I don't need to
do anything special right now, as it consistently fails for me because of
BZ 1296464. Once that BZ is fixed, one would need to
*make it fail on purpose*.

Actual results
==============

When the "Create Cluster" task fails, the failure is reported on the "Task"
page. The particular task item on the "Tasks" page has a red icon and its
status reads "Failed".

So far so good.

Note that there are a few related BZs which I'm not going to discuss in this BZ:

 * BZ 1296172 - it's not possible to click on the details button
   of a task, so no further details can be reached from the web interface
 * BZ 1296985 - the order of task items on the Task page sometimes seems to be
   random, which makes it harder to notice or reach information about the
   failure

The problem is that one doesn't know what state each machine ended up in and
what can be done about it. When I check the "Hosts" page, I see no errors or
details reported there.

I can't simply rerun the Create Cluster wizard, because I no longer see any
machines available.

For example, in my case the setup of the 1st node of the cluster succeeded,
but the setup failed on the 2nd node, so that I have:

 * 1st node configured properly
 * 2nd node stuck in the middle of the setup, which failed there
 * 3rd and 4th nodes not touched at all

I can't find this information via the console UI. With just 5 testing virtual
machines it's easy to check myself and redeploy, but what is an admin deploying
a 100-node cluster supposed to do when the setup fails somewhere in the middle?

Expected results
================

When the "Create Cluster" task fails, the failure is reported on the "Task"
page. The particular task item on the "Tasks" page has a red icon and its
status reads "Failed".

Additional details about the state of the machines are provided via the skyring
web UI. Maybe some overview which would classify hosts into particular states
would be useful, so that it's clear which nodes were configured successfully
and which ones failed. Moreover, based on the actual state of the cluster, it
would be possible to either:

 * retry or continue the Create Cluster operation
 * restore the original state (/me is not sure if this makes sense)
 * do something else which would be appropriate
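
To illustrate the kind of per-host overview suggested above, here is a minimal
sketch in Python. It is not skyring code: the `HostState` values, the
`summarize` and `retry_candidates` helpers and the host names are all
hypothetical, chosen only to mirror the 4-node failure described in
"Actual results".

```python
from collections import defaultdict
from enum import Enum


class HostState(Enum):
    # Hypothetical states for illustration; not taken from skyring.
    CONFIGURED = "configured"
    FAILED = "failed"
    UNTOUCHED = "untouched"


def summarize(hosts):
    """Group host names by state so an admin can see where setup stopped."""
    overview = defaultdict(list)
    for name, state in hosts.items():
        overview[state].append(name)
    return dict(overview)


def retry_candidates(hosts):
    """Hosts a retry would still have to process: failed or never touched."""
    return sorted(
        name for name, state in hosts.items()
        if state is not HostState.CONFIGURED
    )


# Made-up host names mirroring the failure described in this report.
hosts = {
    "node1": HostState.CONFIGURED,
    "node2": HostState.FAILED,
    "node3": HostState.UNTOUCHED,
    "node4": HostState.UNTOUCHED,
}

overview = summarize(hosts)
for state in HostState:
    names = ", ".join(overview.get(state, [])) or "-"
    print(f"{state.value}: {names}")
```

With such a classification available, a "retry or continue" action would only
need to process the hosts returned by `retry_candidates`, leaving the already
configured ones alone.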

Comment 3 Ju Lim 2016-06-17 13:23:42 UTC
Will need to document what to do when this happens for the 1.0 release, for supportability.

Comment 4 Shubhendu Tripathi 2016-07-11 06:09:47 UTC
Doc BZ#1349458 has been created to document the troubleshooting section in the admin guide.