Description of problem:

There can be issues/timeouts during cluster import. Currently, even if some nodes are not imported, the import task is marked as successful. If any node is not imported, the import should be marked as "failed", and there should be a way to fix the problem, i.e. documented steps to complete the import for the timed-out nodes. The list of nodes that were not imported should be visible in the "import" task, with a clear reason why each node was not imported.

Version-Release number of selected component (if applicable):
ceph-ansible-1.0.5-5.el7scon.noarch
ceph-installer-1.0.6-1.el7scon.noarch
rhscon-ceph-0.0.11-1.el7scon.x86_64
rhscon-core-0.0.14-1.el7scon.x86_64
rhscon-ui-0.0.28-1.el7scon.noarch

How reproducible:
Whenever a node import times out.

Steps to Reproduce:
1. Create a cluster.
2. Unmanage the cluster.
3. Forget the cluster.
4. Wait until all nodes are visible in USM as available.
5. Import the cluster.

Actual results:
Even if a node is not imported, the import is marked as successful. The user does not know about the issue and cannot fix it.

Expected results:
If any node is not imported, the import is marked as failed. The user knows about the issue and sees the list of nodes that were not imported, with a proper message. The user can fix the issue by pushing a "Retry import" button.

Additional info:

bigfin.log:
2016-05-04T13:47:15.869+02:00 ERROR import_cluster.go:455 PopulateNodeNetworkDetails] Node mkudlej-usm1-node3.os1.phx2.redhat.com still in initializing state. Continuing to other
2016-05-04T13:49:26+0000 INFO saltwrapper.py:49 saltwrapper.wrapper] args=(<salt.client.LocalClient object at 0x33a3810>, ['mkudlej-usm1-node1.os1.phx2.redhat.com'], 'cmd.run', ["lsblk --all --bytes --noheadings --output='NAME,KNAME,FSTYPE,MOUNTPOINT,UUID,PARTUUID,MODEL,SIZE,TYPE,PKNAME,VENDOR' --path --raw"]), kwargs={'expr_form': 'list'}
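As a side note, before step 5 it is possible to check from the salt master that every storage node actually responds, so slow or unreachable nodes show up in advance instead of timing out inside PopulateNodeNetworkDetails. This is only a minimal sketch, not part of USM; the node names are examples taken from the log above:

# Hedged sketch, not USM code: ping the storage nodes from the salt master
# before starting the import. Node names below are examples from the log.
import salt.client

nodes = [
    'mkudlej-usm1-node1.os1.phx2.redhat.com',
    'mkudlej-usm1-node3.os1.phx2.redhat.com',
]

client = salt.client.LocalClient()
# test.ping returns True for every responding minion; a node missing from the
# result is down, unaccepted, or still initializing. expr_form='list' matches
# the salt usage shown in the log above.
result = client.cmd(nodes, 'test.ping', expr_form='list', timeout=30)
for node in nodes:
    status = 'responding' if result.get(node) else 'NOT responding'
    print('{0}: {1}'.format(node, status))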
Created attachment 1160973 [details] Import cluster flow with failed nodes
Below are the steps to verify the issue:
1. Invoke import cluster from the UI.
2. Select a mon node from the nodes list.
3. It shows the list of participating nodes of the cluster.
4. Now bring down salt-minion on one of the nodes of the cluster (say an OSD node); see the sketch below.
5. Submit the import cluster request.
6. While importing, the failed node is listed in the task steps as an error.

Attached is a screenshot of the import cluster flow with one failed node, for reference.
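For step 4, one way to simulate the failure is simply running `systemctl stop salt-minion` on the chosen node; the following is a hedged sketch of doing the same from the salt master (the node name is an example, not from this verification run):

# Hedged sketch: stop salt-minion on one storage node to simulate the failure
# described in step 4. Because the minion stops itself, the job return for
# this command may never arrive on the master.
import salt.client

client = salt.client.LocalClient()
client.cmd(['mkudlej-usm1-node3.os1.phx2.redhat.com'],
           'cmd.run',
           ['systemctl stop salt-minion'],
           expr_form='list')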
I still see this issue. I've tried the reproducer from comment #3 with one monitor and one other node; the monitor was reinitialized, but the node cannot be reinitialized even though it is part of the cluster.

ceph-ansible-1.0.5-25.el7scon.noarch
ceph-installer-1.0.12-4.el7scon.noarch
rhscon-ceph-0.0.31-1.el7scon.x86_64
rhscon-core-0.0.32-1.el7scon.x86_64
rhscon-core-selinux-0.0.32-1.el7scon.noarch
rhscon-ui-0.0.46-1.el7scon.noarch
Martin, the feature is developed in such a way that:
- If nodes are not already accepted, they get accepted as part of the import cluster flow.
- If some node fails to be accepted during the import, for example due to a slow network, it is clearly marked as a failed node in the import cluster task steps.

Now, while importing a cluster, a few nodes may fail to accept/initialize and are reported as failed in the task. Triggering only a re-initialize from the UI is not going to help, because even though the node might get accepted/initialized, there is no way it could then be marked as participating in the cluster. To work around this situation, there are a few options:

1. Forget the cluster in USM and try to import it again after restarting the salt-minion services on the storage nodes.
2. OR, prior to starting the import, if the number of nodes in the cluster is large, the nodes are already listed as unaccepted in USM. Accept the nodes before starting the import cluster flow and then trigger the import (see the sketch below).

In option 2, if the number of nodes is really large, say hundreds, it is better to accept them in batches, because due to the salt-master-level limitation of serial key acceptance, a few nodes might otherwise time out. This way the chance of a node accept/initialize failing during the import is removed, and the cluster import should be smooth.

Reverting to ON_QA; I will raise a doc BZ for adding a troubleshooting section for import cluster node failures.
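For option 2, a rough sketch of accepting the pending minion keys in small batches on the salt master before triggering the import. This is not part of USM; the wheel interface usage, config path, and batch size are assumptions for illustration:

# Hedged sketch for option 2: accept pending salt keys in small batches so the
# serial key-accept limitation on the master does not time out individual
# nodes. Batch size and master config path are assumptions.
import salt.config
import salt.wheel

opts = salt.config.master_config('/etc/salt/master')
wheel = salt.wheel.WheelClient(opts)

# 'pre' lists the unaccepted (pending) minion keys.
pending = wheel.cmd('key.list', ['pre']).get('minions_pre', [])

BATCH_SIZE = 10  # arbitrary; keep batches small on clusters with hundreds of nodes
for i in range(0, len(pending), BATCH_SIZE):
    batch = pending[i:i + BATCH_SIZE]
    for minion_id in batch:
        wheel.cmd('key.accept', [minion_id])
    print('Accepted batch: {0}'.format(batch))

The same can be done with the salt-key CLI (e.g. accepting a handful of minions at a time); once all keys are accepted, trigger the import from USM as usual.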
Created doc BZ#1356503 to document the troubleshooting section in the admin guide for this.
Tested with
ceph-ansible-1.0.5-32.el7scon.noarch
ceph-installer-1.0.14-1.el7scon.noarch
rhscon-ceph-0.0.40-1.el7scon.x86_64
rhscon-core-0.0.41-1.el7scon.x86_64
rhscon-core-selinux-0.0.41-1.el7scon.noarch
rhscon-ui-0.0.52-1.el7scon.noarch
and it works as described in comment #5.