Description of problem:

When the cluster expand task fails for some reason, there is no way to rerun or fix it. Per Bug 1582465, the tooltip for the "Expansion Failed" cluster state is:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"If cluster expansion fails, check if tendrl-ansible was executed successfully and ensure the node agents are correctly configured."
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

which doesn't make much sense, as there is no way to rerun the expand cluster task.

Version-Release number of selected component (if applicable):

* RHGS WA Server:

Red Hat Enterprise Linux Server release 7.5 (Maipo)
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
grafana-4.3.2-3.el7rhgs.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
libcollection-0.7.0-29.el7.x86_64
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch

* Gluster Storage Server:

Red Hat Enterprise Linux Server release 7.5 (Maipo)
Red Hat Gluster Storage Server 3.4.0
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
glusterfs-3.12.2-13.el7rhgs.x86_64
glusterfs-api-3.12.2-13.el7rhgs.x86_64
glusterfs-cli-3.12.2-13.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-13.el7rhgs.x86_64
glusterfs-events-3.12.2-13.el7rhgs.x86_64
glusterfs-fuse-3.12.2-13.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-13.el7rhgs.x86_64
glusterfs-libs-3.12.2-13.el7rhgs.x86_64
glusterfs-rdma-3.12.2-13.el7rhgs.x86_64
glusterfs-server-3.12.2-13.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
libcollection-0.7.0-29.el7.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
python2-gluster-3.12.2-13.el7rhgs.x86_64
python-debtcollector-1.8.0-1.el7ost.noarch
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Prepare, install and configure a Gluster cluster (Gluster Trusted Storage Pool) plus one or more additional Gluster Storage nodes which are not part of the Gluster Trusted Storage Pool.
2. Install and configure the RHGS WA Server and Node Agents on the nodes in the Gluster Trusted Storage Pool.
3. Import the cluster into RHGS WA.
4. Extend the Gluster Trusted Storage Pool using the additional hosts.
5. Rerun the tendrl-ansible playbook to configure Node Agents on the new nodes.
6. Disable the RHGS WA repos on one of the added nodes (or take any other action that ensures the Expand Cluster task will fail).
7. Launch the Expand Cluster process.

Actual results:
The Expand Cluster process fails, because of the expected failure during installation of the tendrl-gluster-integration package. The problem is that there is no way to relaunch the Expand Cluster process once the issue is resolved (in our case, once the affected repositories are re-enabled).

Expected results:
It should be possible to restart/relaunch a failed Expand Cluster task.

Additional info:
It is possible to unmanage the whole cluster and import it again, but this would lead to the loss of all historical data in Grafana (it is not easily accessible from the archive created by the unmanage cluster task).

It might be related to or depend on Bug 1583590.
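Steps 6 and 7 above can be sketched with standard RHEL tooling. This is a hypothetical command sequence, not taken from the bug report: the repo id shown is an example (list the actual ids on your node with `yum repolist all`), and the commands must run on the affected storage node.

```shell
# On one of the newly added nodes: disable the RHGS WA repo so that the
# tendrl-gluster-integration package install during Expand Cluster fails.
# (Repo id below is an assumed example; check `yum repolist all` for real ids.)
yum-config-manager --disable rh-gluster-3-web-admin-agent-rpms

# ... now launch Expand Cluster from the WA UI; the task fails as expected ...

# Resolve the simulated issue by re-enabling the repo. The bug is that at
# this point the UI offers no way to re-initiate the failed Expand Cluster task.
yum-config-manager --enable rh-gluster-3-web-admin-agent-rpms
```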
This was filed as a bug, but is now being tracked as an RFE.
In reviewing the suggested text, I made some minor edits. Try this one: "If cluster expansion fails, check if tendrl-ansible was executed successfully and ensure the node agents are correctly configured. If cluster expansion failed due to errors, resolve the errors on affected nodes and re-initiate the Expand Cluster action."
The QE team will try to inflict two different errors (e.g., breaking yum repos as described in this BZ, and cutting one machine off) during expansion and verify that it's possible to recover by following the tooltip text (see comment 7). Any problem beyond that would require a separate Bugzilla with a description of the particular expand error.
Created attachment 1474988 [details]
Expand Cluster button on Hosts page is disabled when Expansion task failed

Moving back to ASSIGNED, because it is not possible to relaunch a previously failed Expansion task from the "Hosts" page. The "Expand Cluster" button is visible but disabled (see attached screenshot).

Version-Release number of selected component (if applicable):

Red Hat Gluster Web Administration Server:
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-11.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-8.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-8.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-9.el7rhgs.noarch

Red Hat Gluster Storage Server:
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-11.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

Note: It is possible to relaunch the failed Expansion from the Clusters page, from the menu under the three dots on the right side of the particular cluster's line.

>> ASSIGNED
PR: https://github.com/Tendrl/ui/pull/1038
Tested and verified with two scenarios:

* disabling the RHGS WA repo(s) on one of the expanded Gluster Storage Servers
* stopping tendrl-node-agent on one of the expanded Gluster Storage Servers

In both cases, it was possible to relaunch the Expand Cluster task, and once the simulated issue was corrected, the expand job passed.

Version-Release number of selected component (if applicable):

Red Hat Gluster Web Administration Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
etcd-3.2.7-1.el7.x86_64
grafana-4.3.2-3.el7rhgs.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
python-etcd-0.4.5-2.el7rhgs.noarch
rubygem-etcd-0.3.0-2.el7rhgs.noarch
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-10.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-10.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-10.el7rhgs.noarch

Red Hat Gluster Storage Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
Red Hat Gluster Storage Server 3.4.0
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
glusterfs-3.12.2-16.el7rhgs.x86_64
glusterfs-api-3.12.2-16.el7rhgs.x86_64
glusterfs-cli-3.12.2-16.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-16.el7rhgs.x86_64
glusterfs-events-3.12.2-16.el7rhgs.x86_64
glusterfs-fuse-3.12.2-16.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-16.el7rhgs.x86_64
glusterfs-libs-3.12.2-16.el7rhgs.x86_64
glusterfs-rdma-3.12.2-16.el7rhgs.x86_64
glusterfs-server-3.12.2-16.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
python2-gluster-3.12.2-16.el7rhgs.x86_64
python-etcd-0.4.5-2.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

>> VERIFIED
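The second verification scenario (stopping the node agent) can be sketched as a hypothetical command sequence on one of the expanded storage nodes; the service name tendrl-node-agent matches the installed package, but the exact sequence is illustrative, not taken from the test logs.

```shell
# Simulate a failure: stop the node agent on one of the expanded nodes
# so that the Expand Cluster task cannot reach it.
systemctl stop tendrl-node-agent

# ... re-initiate Expand Cluster from the WA UI; the task fails ...

# Correct the simulated issue, then re-initiate the task from the UI;
# with the fix verified here, the relaunched expand job passes.
systemctl start tendrl-node-agent
systemctl status tendrl-node-agent --no-pager
```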
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2616