Bug 1299997 - not clear how to recover from failure of "Create Cluster" operation
Summary: not clear how to recover from failure of "Create Cluster" operation
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Storage Console
Classification: Red Hat Storage
Component: core
Version: 2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 2
Assignee: Shubhendu Tripathi
QA Contact: sds-qe-bugs
URL:
Whiteboard:
Depends On: 1349458
Blocks: Console-2-DevFreeze
 
Reported: 2016-01-19 17:24 UTC by Martin Bukatovic
Modified: 2016-07-29 15:34 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-19 11:20:47 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1296172 0 unspecified CLOSED task details not available 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1296464 0 high CLOSED Create Cluster: ceph-disk activate-all fails in non deterministic way because of missing device file 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1296985 0 unspecified CLOSED task order 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1343677 0 unspecified CLOSED Unassigned host cannot be used in new cluster 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1361620 0 unspecified CLOSED RFE possibility to stop a task 2021-02-22 00:41:40 UTC


Description Martin Bukatovic 2016-01-19 17:24:29 UTC
Description of problem
======================

When "Create Cluster" operation fails, it's not clear how to recover from
this state, eg. How to retry or continue? How to check the current state
of the hosts? How to restore to a previous state if possible?

Version-Release number of selected component
============================================

Console server:

~~~
# rpm -qa 'rhscon*'
rhscon-core-0.0.6-0.1.alpha1.el7.x86_64
rhscon-ui-0.0.6-0.1.alpha1.el7.noarch
rhscon-ceph-0.0.4-0.1.alpha1.el7.x86_64
~~~

Storage nodes:

~~~
# rpm -qa 'rhscon*'
rhscon-agent-0.0.3-0.2.alpha1.el7.noarch
~~~

How reproducible
================

100 % (assuming that the "Create Cluster" operation fails)

Steps to Reproduce
==================

1. Install skyring on the server and prepare a few hosts for cluster setup
2. Accept all nodes
3. Use the Create Cluster wizard to create a new cluster in such a way
   that the operation fails
   
Note: To make cluster creation fail as required in step #3, I don't need to
do anything special right now, as it consistently fails for me because of
BZ 1296464. Once that BZ is fixed, one would need to *make it fail on
purpose* (see the sketch below for one possible way).
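
One possible way to force such a failure (once BZ 1296464 is fixed) might be
to break the console's channel to one of the accepted nodes before running
the wizard. This is only a sketch; the salt-minion service name is an
assumption based on how the rhscon-agent setup bootstraps nodes:

~~~
# On one of the accepted storage nodes, *before* starting the wizard
# (hypothetical; assumes the node is managed via salt-minion):
systemctl stop salt-minion

# Alternatively, simply power the node off while the task is running:
poweroff
~~~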

Actual results
==============

When "Create Cluster" task fails, the failure is reported on the "Task" page.
Particular task item on "Tasks" page has a red icon and it's status reads
"Failed".

So far so good.

Note that there are a few related BZs which I'm not going to discuss here:

 * BZ 1296172 - it's not possible to click the details button of a task,
   so no further details can be reached from the web interface
 * BZ 1296985 - the order of task items on the Tasks page sometimes seems
   random, which makes it harder to notice or reach information about the
   failure

The problem is that one doesn't know what state each machine ended up in and
what can be done about it. When I check the "Hosts" page, I see no errors or
details reported there.

I can't simply rerun the Create Cluster wizard, because I no longer see any
machines available.

For example, in my case setup of the 1st node of the cluster succeeded, but
setup failed on the 2nd node, so that I have:

 * 1st node configured properly
 * 2nd node stuck in the middle of the setup, where the failure occurred
 * 3rd and 4th nodes untouched

I can't find this information via the console UI. With just 5 test virtual
machines it's easy to check the state myself and redeploy, but what is an
admin deploying a 100-node cluster supposed to do when setup fails somewhere
in the middle?
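
For the record, a manual check over SSH is roughly what an admin is left with
today. This is a rough sketch only; the node names, package names and service
units below are assumptions for a Ceph-based deployment, not something the
console provides:

~~~
# Check each node that was part of the failed Create Cluster task
for node in node1 node2 node3 node4; do
    echo "=== $node ==="
    ssh "$node" 'rpm -q ceph-mon ceph-osd;
                 systemctl is-active ceph-mon.target ceph-osd.target;
                 ls /var/lib/ceph/osd/ 2>/dev/null'
done

# On a node where the setup succeeded, cluster health can be checked with:
#   ceph -s
~~~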

Expected results
================

When "Create Cluster" task fails, the failure is reported on the "Task" page.
Particular task item on "Tasks" page has a red icon and it's status reads
"Failed".

Additional details of the state of the machines are provided via the skyring
web UI. An overview classifying hosts into particular states might be useful,
so that it's clear which nodes were configured successfully and which ones
failed. Moreover, based on the actual state of the cluster, it should be
possible to either:

 * retry or continue the Create Cluster operation
 * restore the original state (I'm not sure if this makes sense)
 * do something else which would be appropriate

Comment 3 Ju Lim 2016-06-17 13:23:42 UTC
We will need to document what to do when this happens for the 1.0 release, for supportability.

Comment 4 Shubhendu Tripathi 2016-07-11 06:09:47 UTC
Doc BZ#1349458 has been created to document the troubleshooting section in the admin guide.

