Bug 1299997 - not clear how to recover from failure of "Create Cluster" operation
Status: CLOSED WONTFIX
Product: Red Hat Storage Console
Classification: Red Hat
Component: core
2
Unspecified Unspecified
Priority: unspecified
Severity: medium
: 2
Assigned To: Shubhendu Tripathi
sds-qe-bugs
Depends On: 1349458
Blocks: Console-2-DevFreeze
Reported: 2016-01-19 12:24 EST by Martin Bukatovic
Modified: 2016-07-29 11:34 EDT
Doc Type: Bug Fix
Last Closed: 2016-07-19 07:20:47 EDT
Type: Bug


Description Martin Bukatovic 2016-01-19 12:24:29 EST
Description of problem
======================

When the "Create Cluster" operation fails, it's not clear how to recover from
this state, e.g. how to retry or continue, how to check the current state
of the hosts, or how to restore a previous state if possible.

Version-Release number of selected component
============================================

Console server:

~~~
# rpm -qa 'rhscon*'
rhscon-core-0.0.6-0.1.alpha1.el7.x86_64
rhscon-ui-0.0.6-0.1.alpha1.el7.noarch
rhscon-ceph-0.0.4-0.1.alpha1.el7.x86_64
~~~

Storage nodes:

~~~
# rpm -qa 'rhscon*'
rhscon-agent-0.0.3-0.2.alpha1.el7.noarch
~~~

How reproducible
================

100 % (assuming that the "Create Cluster" operation fails)

Steps to Reproduce
==================

1. Install skyring on the server and prepare a few hosts for cluster setup
2. Accept all nodes
3. Use the Create Cluster wizard to create a new cluster in such a way that
   the operation fails
   
Note: To make cluster creation fail as required in step 3, I don't need to
do anything special right now, as it consistently fails for me because of
BZ 1296464. Once that BZ is fixed, one would need to
*make it fail on purpose*.

Actual results
==============

When the "Create Cluster" task fails, the failure is reported on the "Tasks"
page. The particular task item on the "Tasks" page has a red icon and its
status reads "Failed".

So far so good.

Note that there are a few related BZs that I'm not going to discuss in this BZ:

 * BZ 1296172 - it's not possible to click on the details button
   of a task, so no further details can be reached from the web interface
 * BZ 1296985 - the order of task items on the Tasks page sometimes seems
   to be random, which makes it harder to notice or reach information about
   the failure

The problem is that one doesn't know what state each machine ended up in and
what can be done about it. When I check the "Hosts" page, I see no errors or
details reported there.

I can't simply try to rerun the Create Cluster wizard, because I don't see any
machines available any more.

For example, in my case setup of the 1st node of the cluster succeeded, but
setup failed on the 2nd node, so I have:

 * 1st node configured properly
 * 2nd node stuck in the middle of the setup, which failed there
 * 3rd and 4th nodes not touched

I can't find this information via the console UI. With just 5 testing virtual
machines it's easy to check myself and redeploy, but what is an admin deploying
a 100-node cluster supposed to do when setup fails somewhere in the middle?
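
To make the scenario above concrete, here is a minimal Python sketch of the
kind of per-host summary that is missing from the console. The state names
(`configured`, `failed`, `untouched`), the host names and the
`summarize_hosts` helper are all hypothetical; none of them come from skyring
or its API. The data simply mirrors the four-node example above.

```python
from collections import defaultdict

# Hypothetical per-host outcomes mirroring the example above. These state
# names are assumptions made for illustration, not values skyring reports.
EXAMPLE_HOSTS = {
    "node1": "configured",  # setup succeeded
    "node2": "failed",      # setup stopped partway through
    "node3": "untouched",   # setup never started
    "node4": "untouched",
}


def summarize_hosts(host_states):
    """Group host names by state so it is clear where the setup stopped."""
    summary = defaultdict(list)
    for host, state in sorted(host_states.items()):
        summary[state].append(host)
    return dict(summary)


if __name__ == "__main__":
    for state, hosts in summarize_hosts(EXAMPLE_HOSTS).items():
        print(f"{state}: {', '.join(hosts)}")
```

With 100 nodes, a grouping like this would show at a glance which hosts need
attention, instead of requiring a login to each machine.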

Expected results
================

When the "Create Cluster" task fails, the failure is reported on the "Tasks"
page. The particular task item on the "Tasks" page has a red icon and its
status reads "Failed".

Additional details about the state of the machines are provided via the skyring
web UI. An overview that classifies hosts into particular states might be
useful, so that it's clear which nodes were configured successfully and which
ones failed. Moreover, based on the actual state of the cluster, it would be
possible to either:

 * retry or continue the Create Cluster operation
 * restore the original state (/me is not sure if this makes sense)
 * do something else that would be appropriate
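
As a rough illustration of how the options above could depend on the state of
the hosts, here is a minimal Python sketch. The state names (`configured`,
`failed`, `untouched`), the option strings and the `recovery_options` function
are hypothetical, and the rules are only one possible choice; this is not how
skyring behaves.

```python
def recovery_options(host_states):
    """Suggest recovery actions for a failed cluster creation.

    host_states maps a host name to one of the hypothetical states
    "configured", "failed" or "untouched".
    """
    states = set(host_states.values())
    if states <= {"configured"}:
        return []  # every host is configured (or there are none): nothing to do

    options = []
    if "failed" in states:
        options.append("retry failed hosts")
    if "untouched" in states:
        options.append("continue with untouched hosts")
    if "configured" in states or "failed" in states:
        # Restoring only makes sense if setup changed at least one host.
        options.append("restore original state")
    return options


if __name__ == "__main__":
    example = {"node1": "configured", "node2": "failed", "node3": "untouched"}
    for option in recovery_options(example):
        print(option)
```

Whether restoring the original state is feasible at all is left open here, as
it is in the list above.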
Comment 3 Ju Lim 2016-06-17 09:23:42 EDT
Will need to document what to do when this happens for the 1.0 release, for supportability.
Comment 4 Shubhendu Tripathi 2016-07-11 02:09:47 EDT
Doc BZ#1349458 was created to document the troubleshooting section in the admin guide.
