Bug 1686888 - When tendrl-node-agent service is not running, import cluster fails after timeout without clear indication what went wrong
Summary: When tendrl-node-agent service is not running, import cluster fails after timeout without clear indication what went wrong
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: web-admin-tendrl-node-agent
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: RHGS 3.5.0
Assignee: Timothy Asir
QA Contact: Sweta Anandpara
URL:
Whiteboard:
Depends On:
Blocks: 1688630 1696807
 
Reported: 2019-03-08 15:32 UTC by Martin Bukatovic
Modified: 2019-10-30 12:23 UTC
CC: 8 users

Fixed In Version: tendrl-commons-1.6.3-18.el7rhgs.noarch
Doc Type: Bug Fix
Doc Text:
The node-agent service is responsible for the import and remove (stop managing) operations. Previously, these operations timed out with only a generic log message when the node-agent service was not running. The root cause is now logged clearly when this situation occurs.
Clone Of:
Environment:
Last Closed: 2019-10-30 12:23:13 UTC
Embargoed:


Attachments (Terms of Use)
screenshot of RHGSWA with failed ImportCluster task (141.40 KB, image/png)
2019-03-08 15:32 UTC, Martin Bukatovic
ImportCluster failure with correct message (134.43 KB, image/png)
2019-06-06 06:31 UTC, Sweta Anandpara
Unmanage cluster failure with correct message (113.93 KB, image/png)
2019-06-06 06:32 UTC, Sweta Anandpara


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:3251 0 None None None 2019-10-30 12:23:34 UTC

Description Martin Bukatovic 2019-03-08 15:32:42 UTC
Created attachment 1542128 [details]
screenshot of RHGSWA with failed ImportCluster task

Description of problem
======================

When the tendrl-node-agent service is not running on at least one storage
server, cluster import fails after a 5 minute timeout with a non-descriptive
error message that doesn't indicate what went wrong at all. A subsequent
attempt to unmanage the cluster fails in the same way.

Version-Release number of selected component
============================================

RHGSWA server:

# rpm -qa | grep tendrl
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-3.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
tendrl-api-1.6.3-13.el7rhgs.noarch
tendrl-api-httpd-1.6.3-13.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-21.el7rhgs.noarch
tendrl-ansible-1.6.3-11.el7rhgs.noarch
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-ui-1.6.3-15.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-21.el7rhgs.noarch

Storage servers:

# rpm -qa | grep tendrl
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-3.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-15.el7rhgs.noarch

How reproducible
================

100 %

Steps to Reproduce
==================

1. Create gluster trusted storage pool, with at least one volume
2. Install RHGSWA via tendrl-ansible
3. Open browser and go to RHGSWA to check that the cluster is visible
   and ready for import
4. Stop tendrl-node-agent.service on one storage node
5. Try to import the cluster
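Step 4 amounts to stopping the unit on one storage node and confirming it is down; a minimal sketch (the `agent_state` helper is illustrative, not part of tendrl):

```shell
# Step 4: stop the agent on one storage node.
# "|| true" keeps the sketch from aborting on hosts without systemd.
systemctl stop tendrl-node-agent 2>/dev/null || true

# Illustrative helper: report the unit state as a single word.
agent_state() {
    if systemctl is-active --quiet tendrl-node-agent 2>/dev/null; then
        echo "running"
    else
        echo "stopped"
    fi
}
agent_state    # after the stop above, this should print "stopped"
```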

Actual results
==============

The ImportCluster task quickly becomes stuck and fails after a 5 minute
timeout.

The error message doesn't provide any useful indication of where the problem
might be:

~~~
error
Failure in Job 8d75bd1b-af8b-47a2-97df-53affe14077d Flow tendrl.flows.ImportCluster with error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 240, in process_job
    the_flow.run()
  File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 131, in run
    exc_traceback)
FlowExecutionFailedError: ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 98, in run\n super(ImportCluster, self).run()\n', ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/__init__.py", line 186, in run\n (atom_fqn, self._defs[\'help\'])\n', 'AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.ImportCluster on flow: Import existing Gluster Cluster\n']
08 Mar 2019 04:23:43
~~~

~~~
error
Failed atom: tendrl.objects.Cluster.atoms.ImportCluster on flow: Import existing Gluster Cluster
08 Mar 2019 04:23:43
~~~

Expected results
================

The ImportCluster task fails and clearly reports the root cause of the
problem: RHGSWA is not able to talk to the tendrl-node-agent on the affected
storage node.
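Until RHGSWA reports this itself, the root cause can be confirmed manually from the RHGSWA server; a hedged sketch (the `node_agent_state` helper and the hostnames are placeholders, not part of tendrl):

```shell
# Illustrative helper: ask one node for the agent's unit state over ssh;
# prints "unreachable" when the host or the unit cannot be queried.
node_agent_state() {
    local state
    state=$(ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" \
        systemctl is-active tendrl-node-agent 2>/dev/null)
    echo "${state:-unreachable}"
}

# Placeholder hostnames; substitute your actual storage nodes.
for node in storage1.example.com storage2.example.com; do
    echo "$node: $(node_agent_state "$node")"
done
```

Any node reporting something other than "active" is a candidate for the stuck import.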

Additional info
===============

When one tries to unmanage the cluster, the task fails in the same way (it
waits for the timeout, then fails without a descriptive error message).

When one starts the tendrl-node-agent service on the affected node again and
tries to unmanage, the task finishes fine, and it is then possible to import
the cluster successfully.
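The recovery described above is just restarting the unit on the affected node; a minimal sketch (the `start_node_agent` helper is illustrative):

```shell
# Recovery: start the agent again on the affected node, then retry
# the unmanage and import tasks from the RHGSWA web UI.
start_node_agent() {
    # "|| true" keeps the sketch from aborting on hosts without systemd.
    systemctl start tendrl-node-agent 2>/dev/null || true
    # Report the resulting unit state ("active" once the agent is back up).
    local state
    state=$(systemctl is-active tendrl-node-agent 2>/dev/null)
    echo "${state:-unknown}"
}
start_node_agent
```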

Combined with BZ 1686855, this could lead to further confusion (while RHGSWA
is stuck on the timeout waiting for the missing node agent, the last info
event in the web UI states "Job ... for ImportCluster finished").

This BZ is about gaps in self-monitoring, error checking, and error reporting.
The described scenario is not expected to happen under normal circumstances,
but when it does, RHGSWA doesn't help with debugging.

Comment 6 Sweta Anandpara 2019-06-06 06:31:24 UTC
Created attachment 1577782 [details]
ImportCluster failure with correct message

Comment 7 Sweta Anandpara 2019-06-06 06:32:12 UTC
Created attachment 1577784 [details]
Unmanage cluster failure with correct message

Comment 13 errata-xmlrpc 2019-10-30 12:23:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3251

