Description of problem
======================

When the "Create Cluster" operation fails, it's not clear how to recover from this state, e.g.:

* How to retry or continue?
* How to check the current state of the hosts?
* How to restore a previous state, if that is possible?

Version-Release number of selected component
============================================

Console server:

~~~
# rpm -qa 'rhscon*'
rhscon-core-0.0.6-0.1.alpha1.el7.x86_64
rhscon-ui-0.0.6-0.1.alpha1.el7.noarch
rhscon-ceph-0.0.4-0.1.alpha1.el7.x86_64
~~~

Storage nodes:

~~~
# rpm -qa 'rhscon*'
rhscon-agent-0.0.3-0.2.alpha1.el7.noarch
~~~

How reproducible
================

100 % (assuming that the "Create Cluster" operation fails)

Steps to Reproduce
==================

1. Install skyring on the server and prepare a few hosts for cluster setup.
2. Accept all nodes.
3. Use the Create Cluster wizard to create a new cluster so that the operation fails.

Note: To make cluster creation fail as requested in step #3, I don't need to do anything special right now, as it consistently fails for me because of BZ 1296464. Once that BZ is fixed, one would need to *make it fail on purpose*.

Actual results
==============

When the "Create Cluster" task fails, the failure is reported on the "Tasks" page: the task item has a red icon and its status reads "Failed". So far so good.

Note that there are a few related BZs I'm not going to discuss here:

* BZ 1296172 - it's not possible to click on the details button of a task, so no further details can be reached from the web interface
* BZ 1296985 - the order of task items on the Tasks page sometimes seems to be random, which makes it harder to notice or reach information about the failure

The problem is that one doesn't know what state each machine ended up in and what can be done about it. When I check the "Hosts" page, I see no errors or details reported there. I can't simply rerun the Create Cluster wizard, because I no longer see any machines available. For example, in my case setup of the 1st node of the cluster succeeded, but setup failed on the 2nd node, so that I have:

* 1st node configured properly
* 2nd node stuck in the middle of the setup, where it failed
* 3rd and 4th nodes not touched at all

I can't find this information via the console UI. With just 5 testing virtual machines it's easy to check myself and redeploy, but what is an admin deploying a 100-node cluster supposed to do when setup fails somewhere in the middle?

Expected results
================

When the "Create Cluster" task fails, the failure is reported on the "Tasks" page: the task item has a red icon and its status reads "Failed". In addition, details of the state of the machines are provided via the skyring web UI. Some overview classifying hosts into particular states would be useful, so that it's clear which nodes were configured successfully and which ones failed.

Moreover, based on the actual state of the cluster, it should be possible to either:

* retry or continue the Create Cluster operation,
* restore the original state (/me is not sure if this makes sense), or
* do something else which would be appropriate.
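
Until the UI provides such an overview, the only way I can see to assess what state each node ended up in is manual inspection. Below is a minimal sketch, assuming skyring's Salt-based node management; the exact package globs to check for on the nodes are an assumption on my part:

~~~
# On the console server: list the minion keys skyring knows about and
# check which nodes still respond (nodes stuck mid-setup may not reply).
salt-key -L
salt '*' test.ping

# On each storage node: check whether the agent transport is running and
# how far package installation got (package globs are an assumption).
systemctl status salt-minion
rpm -qa 'rhscon*' 'ceph*'
~~~

This is obviously not something an admin of a 100-node cluster should have to do by hand, which is the point of this BZ.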
We will need to document what to do when this happens for the 1.0 release, for supportability.
Doc BZ 1349458 has been created to cover this in the troubleshooting section of the admin guide.