Bug 1686888

Summary: When tendrl-node-agent service is not running, import cluster fails after timeout without clear indication what went wrong
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Martin Bukatovic <mbukatov>
Component: web-admin-tendrl-node-agent Assignee: Timothy Asir <tjeyasin>
Status: CLOSED ERRATA QA Contact: Sweta Anandpara <sanandpa>
Severity: medium Docs Contact:
Priority: unspecified    
Version: rhgs-3.4 CC: amukherj, nthomas, rcyriac, rhinduja, rhs-bugs, sanandpa, storage-qa-internal, tjeyasin
Target Milestone: ---   
Target Release: RHGS 3.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: tendrl-commons-1.6.3-18.el7rhgs.noarch Doc Type: Bug Fix
Doc Text:
The node-agent service is responsible for import and remove (stop managing) operations. These operations timed out with a generic log message when the node-agent service was not running. This issue is now logged more clearly when it occurs.
Last Closed: 2019-10-30 12:23:13 UTC Type: Bug
Bug Blocks: 1688630, 1696807    
Attachments (description, flags):
screenshot of RHGSWA with failed ImportCluster task, none
ImportCluster failure with correct message, none
Unmanage cluster failure with correct message, none

Description Martin Bukatovic 2019-03-08 15:32:42 UTC
Created attachment 1542128
screenshot of RHGSWA with failed ImportCluster task

Description of problem
======================

When the tendrl-node-agent service is not running on at least one storage
server, cluster import fails after a 5-minute timeout with a nondescriptive
error message that gives no indication of what went wrong. A subsequent
attempt to unmanage the cluster fails in the same way.

Version-Release number of selected component
============================================

RHGSWA server:

# rpm -qa | grep tendrl
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-3.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
tendrl-api-1.6.3-13.el7rhgs.noarch
tendrl-api-httpd-1.6.3-13.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-21.el7rhgs.noarch
tendrl-ansible-1.6.3-11.el7rhgs.noarch
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-ui-1.6.3-15.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-21.el7rhgs.noarch

Storage servers:

# rpm -qa | grep tendrl
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-3.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-15.el7rhgs.noarch

How reproducible
================

100 %

Steps to Reproduce
==================

1. Create a Gluster trusted storage pool with at least one volume.
2. Install RHGSWA via tendrl-ansible.
3. Open a browser and go to RHGSWA to check that the cluster is visible
   and ready for import.
4. Stop tendrl-node-agent.service on one storage node.
5. Try to import the cluster.

Actual results
==============

The ImportCluster task quickly becomes stuck and fails after a 5-minute
timeout.

The error message gives no useful indication of where the problem might be:

~~~
error
Failure in Job 8d75bd1b-af8b-47a2-97df-53affe14077d Flow tendrl.flows.ImportCluster with error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 240, in process_job
    the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 131, in run
    exc_traceback)
FlowExecutionFailedError: ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 98, in run\n super(ImportCluster, self).run()\n', ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/__init__.py", line 186, in run\n (atom_fqn, self._defs[\'help\'])\n', 'AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.ImportCluster on flow: Import existing Gluster Cluster\n']
08 Mar 2019 04:23:43
~~~

~~~
error
Failed atom: tendrl.objects.Cluster.atoms.ImportCluster on flow: Import existing Gluster Cluster
08 Mar 2019 04:23:43
~~~

Expected results
================

The ImportCluster task fails and clearly reports the root cause of the
problem: RHGSWA is not able to talk to tendrl-node-agent on the affected
storage node.
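
To illustrate the kind of check that would produce such a message, here is a
minimal sketch. It is not the fix shipped in tendrl-commons and does not use
Tendrl's real API. It assumes (hypothetically) that the server can read a
"last heartbeat" timestamp for each node agent; the function names, the
60-second threshold, and the message wording are all made up for
illustration.

```python
import time

# Hypothetical threshold: maximum accepted heartbeat age, in seconds.
MAX_HEARTBEAT_AGE = 60


def find_unresponsive_nodes(last_seen, now=None, max_age=MAX_HEARTBEAT_AGE):
    """Return sorted node names whose heartbeat is missing or too old.

    last_seen maps a node name to the UNIX timestamp of its last
    heartbeat, or to None if the node agent never reported.
    """
    if now is None:
        now = time.time()
    return sorted(
        node for node, ts in last_seen.items()
        if ts is None or now - ts > max_age
    )


def import_precheck_error(last_seen, now=None):
    """Return a descriptive error string, or None if all agents responded."""
    stale = find_unresponsive_nodes(last_seen, now)
    if not stale:
        return None
    return (
        "ImportCluster cannot proceed: tendrl-node-agent is not responding "
        "on node(s): {}. Check that tendrl-node-agent.service is running "
        "there.".format(", ".join(stale))
    )
```

Running a check like this before the flow starts waiting would let the task
fail immediately and name the affected node, instead of timing out after
5 minutes with a generic traceback.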

Additional info
===============

When one tries to unmanage the cluster, the task fails in the same way
(waiting on a timeout, then failing without a descriptive error message).

When one starts the tendrl-node-agent service on the affected node again and
retries the unmanage, the task finishes fine, and it is then possible to
import the cluster successfully.
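
Until the error reporting improves, the service state has to be checked by
hand on each storage node. A minimal sketch of such a check follows. It
relies only on the standard `systemctl is-active` command; the helper name
and the injectable `run` parameter are my own choices for illustration, not
part of Tendrl.

```python
import subprocess

UNIT = "tendrl-node-agent.service"


def node_agent_is_active(run=subprocess.run):
    """Return True if systemd reports the node-agent unit as active.

    The run parameter exists only so the check can be exercised
    without a real systemd (hypothetical design choice).
    """
    result = run(
        ["systemctl", "is-active", UNIT],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # is-active prints the unit state and exits with 0 only when active.
    return result.returncode == 0 and result.stdout.strip() == "active"
```

This has to be run on the storage node itself. If it returns False, start the
service there before retrying the unmanage or import.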

Combined with BZ 1686855, this could lead to further confusion (while RHGSWA
is stuck waiting for the missing node agent until the timeout, the last info
event in the web UI states "Job ... for ImportCluster finished").

This BZ is about gaps in self-monitoring, error checking and error reporting.
The described scenario is not expected to happen under normal circumstances,
but when it does happen, RHGSWA doesn't help with debugging.

Comment 6 Sweta Anandpara 2019-06-06 06:31:24 UTC
Created attachment 1577782
ImportCluster failure with correct message

Comment 7 Sweta Anandpara 2019-06-06 06:32:12 UTC
Created attachment 1577784
Unmanage cluster failure with correct message

Comment 13 errata-xmlrpc 2019-10-30 12:23:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3251