Created attachment 1542128 [details]
screenshot of RHGSWA with failed ImportCluster task

Description of problem
======================

When the tendrl-node-agent service is not running on at least one storage
server, import of the cluster fails after a 5 minute timeout with a
non-descriptive error message that doesn't indicate what went wrong at all.
A subsequent attempt to unmanage the cluster fails in the same way.

Version-Release number of selected component
============================================

RHGSWA server:

# rpm -qa | grep tendrl
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-3.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
tendrl-api-1.6.3-13.el7rhgs.noarch
tendrl-api-httpd-1.6.3-13.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-21.el7rhgs.noarch
tendrl-ansible-1.6.3-11.el7rhgs.noarch
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-ui-1.6.3-15.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-21.el7rhgs.noarch

Storage servers:

# rpm -qa | grep tendrl
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-3.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-15.el7rhgs.noarch

How reproducible
================

100 %

Steps to Reproduce
==================

1. Create a gluster trusted storage pool with at least one volume
2. Install RHGSWA via tendrl-ansible
3. Open a browser and go to RHGSWA to check that the cluster is visible and ready for import
4. Stop tendrl-node-agent.service on one storage node
5. Try to import the cluster

Actual results
==============

The ImportCluster task quickly becomes stuck and fails after a 5 minute timeout.
The error message doesn't provide any useful indication of where the problem might be:

~~~
error

Failure in Job 8d75bd1b-af8b-47a2-97df-53affe14077d Flow tendrl.flows.ImportCluster with error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 240, in process_job
    the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 131, in run
    exc_traceback)
FlowExecutionFailedError: ['Traceback (most recent call last):\n', '  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 98, in run\n    super(ImportCluster, self).run()\n', '  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/__init__.py", line 186, in run\n    (atom_fqn, self._defs[\'help\'])\n', 'AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.ImportCluster on flow: Import existing Gluster Cluster\n']

08 Mar 2019 04:23:43
~~~

~~~
error

Failed atom: tendrl.objects.Cluster.atoms.ImportCluster on flow: Import existing Gluster Cluster

08 Mar 2019 04:23:43
~~~

Expected results
================

The ImportCluster task fails and clearly reports the root cause of the problem: that RHGSWA is not able to talk to tendrl-node-agent on the affected storage node.

Additional info
===============

When one tries to unmanage the cluster, the task fails in the same way (waiting on a timeout, then failing without a descriptive error message). When one starts the tendrl-node-agent service on the affected node again and retries the unmanage, the task finishes fine and it's then possible to import the cluster successfully.

Combined with BZ 1686855, this could lead to further confusion (while RHGSWA is stuck on the timeout waiting for the missing node agent, the last info event in the web UI states "Job ... for ImportCluster finished").

This BZ is about gaps in self monitoring, error checking and error reporting.
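The expected behavior above amounts to a pre-flight reachability check that names the failing node in the error message instead of letting the flow time out generically. A minimal sketch of that idea, not tendrl's actual implementation; the `wait_for_agents` function and the `ping_agent` callable are hypothetical, and the simulated cluster state stands in for real node-agent connectivity:

```python
import time

def wait_for_agents(nodes, ping_agent, timeout=300.0, interval=5.0):
    """Poll node agents until all respond or the timeout expires.

    Returns the sorted list of nodes that never responded, so the
    caller can name them in the failure message instead of raising
    a generic flow-execution error after the timeout.
    """
    pending = set(nodes)
    deadline = time.time() + timeout
    while pending and time.time() < deadline:
        # Drop every node whose agent answered this round.
        pending = {node for node in pending if not ping_agent(node)}
        if pending:
            time.sleep(interval)
    return sorted(pending)

# Simulated cluster state: the agent on "gl2" is stopped.
agent_up = {"gl1": True, "gl2": False, "gl3": True}
unreachable = wait_for_agents(agent_up, agent_up.get, timeout=0.5, interval=0.1)
if unreachable:
    print("ImportCluster aborted: tendrl-node-agent unreachable on: "
          + ", ".join(unreachable))
```

With a check like this performed before the flow starts, both ImportCluster and UnmanageCluster could fail fast with a message pointing at the affected node.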
The described scenario is not expected to happen under normal circumstances, but when it does, RHGSWA doesn't help with debugging.
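Until the task reports the root cause itself, the stopped agent has to be found by hand. A hypothetical diagnostic sketch; the hostnames are placeholders for the actual storage servers, and it assumes passwordless SSH access to them:

```shell
# Check the tendrl-node-agent service on each storage node.
# Hostnames below are placeholders for your storage servers.
NODES="gl1.example.com gl2.example.com gl3.example.com"
for node in $NODES; do
    status=$(ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" \
        systemctl is-active tendrl-node-agent 2>/dev/null)
    # An empty result means the SSH connection itself failed.
    [ -n "$status" ] || status=unreachable
    echo "$node: $status"
done
```

Any node reporting something other than `active` is a candidate for the stuck ImportCluster or UnmanageCluster job.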
Created attachment 1577782 [details]
ImportCluster failure with correct message
Created attachment 1577784 [details]
Unmanage cluster failure with correct message
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:3251