Bug 1686888

Summary:

When tendrl-node-agent service is not running, import cluster fails after timeout without clear indication what went wrong

Product:

[Red Hat Storage] Red Hat Gluster Storage

Reporter:

Martin Bukatovic <mbukatov>

Component:

web-admin-tendrl-node-agent

Assignee:

Timothy Asir <tjeyasin>

Status:

CLOSED ERRATA

QA Contact:

Sweta Anandpara <sanandpa>

Severity:

medium

Docs Contact:

Priority:

unspecified

Version:

rhgs-3.4

CC:

amukherj, nthomas, rcyriac, rhinduja, rhs-bugs, sanandpa, storage-qa-internal, tjeyasin

Target Milestone:

---

Target Release:

RHGS 3.5.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

tendrl-commons-1.6.3-18.el7rhgs.noarch

Doc Type:

Bug Fix

Doc Text:

The node-agent service is responsible for import and remove (stop managing) operations. These operations timed out with a generic log message when the node-agent service was not running. This issue is now logged more clearly when it occurs.

Story Points:

---

Clone Of:

Environment:

Last Closed:

2019-10-30 12:23:13 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1688630, 1696807

Attachments:

Description	Flags
screenshot of RHGSWA with failed ImportCluster task	none
ImportCluster failure with correct message	none
Unamange cluster failure with correct message	none

Description Martin Bukatovic 2019-03-08 15:32:42 UTC

Created attachment 1542128 [details]
screenshot of RHGSWA with failed ImportCluster task

Description of problem
======================

When the tendrl-node-agent service is not running on at least one storage
server, cluster import fails after a 5-minute timeout with a nondescriptive
error message that gives no indication of what went wrong. A subsequent
attempt to unmanage the cluster fails in the same way.

Version-Release number of selected component
============================================

RHGSWA server:

# rpm -qa | grep tendrl
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-3.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
tendrl-api-1.6.3-13.el7rhgs.noarch
tendrl-api-httpd-1.6.3-13.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-21.el7rhgs.noarch
tendrl-ansible-1.6.3-11.el7rhgs.noarch
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-ui-1.6.3-15.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-21.el7rhgs.noarch

Storage servers:

# rpm -qa | grep tendrl
tendrl-commons-1.6.3-17.el7rhgs.noarch
tendrl-node-agent-1.6.3-18.el7rhgs.noarch
tendrl-selinux-1.5.4-3.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-3.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-15.el7rhgs.noarch

How reproducible
================

100 %

Steps to Reproduce
==================

1. Create a Gluster trusted storage pool with at least one volume
2. Install RHGSWA via tendrl-ansible
3. Open a browser and go to RHGSWA to check that the cluster is visible
   and ready for import
4. Stop tendrl-node-agent.service on one storage node
5. Try to import the cluster
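
Before step 5, it can help to confirm that the agent really is stopped on
the chosen node. A minimal sketch, assuming systemd manages the service and
the check runs locally on the storage node. The helper name
`node_agent_state` and its injectable `run` parameter are illustrative and
not part of Tendrl.

```python
import subprocess


def node_agent_state(unit="tendrl-node-agent.service", run=subprocess.run):
    """Return the systemd state string for `unit` (e.g. "active", "inactive").

    `run` is injectable so the helper can be exercised without systemd.
    """
    # `systemctl is-active` prints the unit state on stdout. Its exit code is
    # non-zero for anything other than "active", so check=True is not used.
    result = run(
        ["systemctl", "is-active", unit],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout.strip() or "unknown"
```

Calling `node_agent_state()` on the node from step 4 should report something
other than "active" once the service has been stopped.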

Actual results
==============

The ImportCluster task quickly becomes stuck and fails after a 5-minute
timeout.

The error message doesn't provide any useful indication of where the problem
might be:

~~~
 error
Failure in Job 8d75bd1b-af8b-47a2-97df-53affe14077d Flow tendrl.flows.ImportCluster with error: Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py", line 240, in process_job the_flow.run() File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 131, in run exc_traceback) FlowExecutionFailedError: ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py", line 98, in run\n super(ImportCluster, self).run()\n', ' File "/usr/lib/python2.7/site-packages/tendrl/commons/flows/__init__.py", line 186, in run\n (atom_fqn, self._defs[\'help\'])\n', 'AtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.ImportCluster on flow: Import existing Gluster Cluster\n']
08 Mar 2019 04:23:43 
~~~

~~~
error
Failed atom: tendrl.objects.Cluster.atoms.ImportCluster on flow: Import existing Gluster Cluster
08 Mar 2019 04:23:43
~~~
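
The flattened traceback above is hard to scan. As a triage aid, the job ID,
flow name, and failing atom can be pulled out with regular expressions. This
is a sketch that assumes the message keeps the "Failure in Job ... Flow ...
with error" shape shown above; `summarize_failure` is a hypothetical helper,
not a Tendrl API.

```python
import re

# Patterns modelled on the single error message quoted in this report.
JOB_RE = re.compile(
    r"Failure in Job (?P<job>[0-9a-f-]{36}) Flow (?P<flow>[\w.]+) with error"
)
ATOM_RE = re.compile(
    r"Error executing atom: (?P<atom>[\w.]+) on flow: (?P<help>[^\\\n']+)"
)


def summarize_failure(message):
    """Extract job ID, flow, and failing atom from a job failure message."""
    summary = {}
    job = JOB_RE.search(message)
    if job:
        summary.update(job.groupdict())
    atom = ATOM_RE.search(message)
    if atom:
        summary["atom"] = atom.group("atom")
        # The flow description ends at the first backslash, newline, or quote.
        summary["flow_help"] = atom.group("help").strip()
    return summary
```

Note that even the extracted fields only say which atom failed, not why,
which is the gap this report describes.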

Expected results
================

The ImportCluster task fails and clearly reports the root cause of the
problem: RHGSWA is not able to talk to tendrl-node-agent on the affected
storage node.
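
One way to picture this behaviour: when the wait for per-node results times
out, the error names the nodes that never answered instead of raising a
generic failure. The sketch below is illustrative only and is not the fix
shipped in tendrl-commons. `wait_for_nodes`, `poll_done`, and
`NodeAgentUnreachableError` are hypothetical names, and the 300-second
default simply mirrors the 5-minute timeout reported above.

```python
import time


class NodeAgentUnreachableError(Exception):
    """Raised when some nodes never report back before the timeout."""


def wait_for_nodes(nodes, poll_done, timeout=300.0, interval=5.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Wait until poll_done(node) is true for every node, or fail descriptively.

    poll_done is a caller-supplied callable (hypothetical) that returns True
    once the given node has finished its part of the job.
    """
    pending = set(nodes)
    deadline = clock() + timeout
    while pending:
        # Keep only the nodes that still have not reported completion.
        pending = {node for node in pending if not poll_done(node)}
        if not pending:
            break
        if clock() >= deadline:
            raise NodeAgentUnreachableError(
                "Timed out after %.0f s waiting for node(s): %s. "
                "Check that tendrl-node-agent is running there."
                % (timeout, ", ".join(sorted(pending)))
            )
        sleep(interval)
    return True
```

The `clock` and `sleep` parameters exist only so the loop can be exercised
without waiting in real time.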

Additional info
===============

When one tries to unmanage the cluster, the task fails in the same way
(waiting for a timeout, then failing without a descriptive error message).

When one starts the tendrl-node-agent service on the affected node again and
tries to unmanage, the task finishes fine, and it is then possible to import
the cluster successfully.

Combined with BZ 1686855, this could lead to further confusion (while RHGSWA
is stuck waiting for the missing node agent until the timeout, the last info
event in the web UI states "Job ... for ImportCluster finished").

This BZ is about gaps in self-monitoring, error checking, and error
reporting. The described scenario is not expected to happen under normal
circumstances, but when it does, RHGSWA doesn't help with debugging.

Comment 6 Sweta Anandpara 2019-06-06 06:31:24 UTC

Created attachment 1577782 [details]
ImportCluster failure with correct message

Comment 7 Sweta Anandpara 2019-06-06 06:32:12 UTC

Created attachment 1577784 [details]
Unamange cluster failure with correct message

Comment 13 errata-xmlrpc 2019-10-30 12:23:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3251