Bug 1596655

Summary:

Unable to fix (rerun) failed cluster expand task

Product:

[Red Hat Storage] Red Hat Gluster Storage

Reporter:

Daniel Horák <dahorak>

Component:

web-admin-tendrl-gluster-integration

Assignee:

Shubhendu Tripathi <shtripat>

Status:

CLOSED ERRATA

QA Contact:

Daniel Horák <dahorak>

Severity:

high

Docs Contact:

Priority:

unspecified

Version:

unspecified

CC:

apaladug, julim, mbukatov, negupta, nthomas, rhs-bugs, sankarshan

Target Milestone:

---

Target Release:

RHGS 3.4.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

tendrl-commons-1.6.3-12.el7rhgs

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2018-09-04 07:08:24 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1503137

Attachments:

Description	Flags
Expand Cluster button on Hosts page is disabled when Expansion task failed	none

Description Daniel Horák 2018-06-29 11:58:52 UTC

Description of problem:
  When the cluster expand task fails for some reason, there is no way to rerun
  or fix it.

  As per Bug 1582465, the tooltip for the "Expansion Failed" cluster state is:
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    "If cluster expansion fails, check if tendrl-ansible was executed
    successfully and ensure the node agents are correctly configured."
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  which doesn't make much sense, as there is no way to rerun the expand
  cluster task.

Version-Release number of selected component (if applicable):
* RHGS WA Server
  Red Hat Enterprise Linux Server release 7.5 (Maipo)
  collectd-5.7.2-3.1.el7rhgs.x86_64
  collectd-ping-5.7.2-3.1.el7rhgs.x86_64
  grafana-4.3.2-3.el7rhgs.x86_64
  libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
  libcollection-0.7.0-29.el7.x86_64
  tendrl-ansible-1.6.3-5.el7rhgs.noarch
  tendrl-api-1.6.3-3.el7rhgs.noarch
  tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
  tendrl-commons-1.6.3-7.el7rhgs.noarch
  tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
  tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
  tendrl-node-agent-1.6.3-7.el7rhgs.noarch
  tendrl-notifier-1.6.3-4.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-ui-1.6.3-4.el7rhgs.noarch

* Gluster Storage Server:
  Red Hat Enterprise Linux Server release 7.5 (Maipo)
  Red Hat Gluster Storage Server 3.4.0
  collectd-5.7.2-3.1.el7rhgs.x86_64
  collectd-ping-5.7.2-3.1.el7rhgs.x86_64
  glusterfs-3.12.2-13.el7rhgs.x86_64
  glusterfs-api-3.12.2-13.el7rhgs.x86_64
  glusterfs-cli-3.12.2-13.el7rhgs.x86_64
  glusterfs-client-xlators-3.12.2-13.el7rhgs.x86_64
  glusterfs-events-3.12.2-13.el7rhgs.x86_64
  glusterfs-fuse-3.12.2-13.el7rhgs.x86_64
  glusterfs-geo-replication-3.12.2-13.el7rhgs.x86_64
  glusterfs-libs-3.12.2-13.el7rhgs.x86_64
  glusterfs-rdma-3.12.2-13.el7rhgs.x86_64
  glusterfs-server-3.12.2-13.el7rhgs.x86_64
  gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
  gluster-nagios-common-0.2.4-1.el7rhgs.noarch
  libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
  libcollection-0.7.0-29.el7.x86_64
  libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
  python2-gluster-3.12.2-13.el7rhgs.x86_64
  python-debtcollector-1.8.0-1.el7ost.noarch
  tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-commons-1.6.3-7.el7rhgs.noarch
  tendrl-gluster-integration-1.6.3-5.el7rhgs.noarch
  tendrl-node-agent-1.6.3-7.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  vdsm-gluster-4.19.43-2.3.el7rhgs.noarch
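
For reference, a minimal sketch of checking whether an installed tendrl-commons build is at or past the Fixed In Version recorded on this bug (tendrl-commons-1.6.3-12.el7rhgs). The helper names split_nvr and has_fix are made up for this illustration. The comparison is a simplification that only looks at the dotted numeric parts of version and release; it is not RPM's full version comparison algorithm.

```python
import re
from typing import Tuple

# Taken from the "Fixed In Version" field of this bug.
FIXED_IN = "tendrl-commons-1.6.3-12.el7rhgs"


def split_nvr(nvr: str) -> Tuple[str, Tuple[int, ...], Tuple[int, ...]]:
    """Split 'name-version-release[.dist][.arch]' into comparable parts.

    Hypothetical helper: assumes a purely numeric dotted version and a
    release that starts with digits, as in the package lists of this bug.
    """
    name, version, release = nvr.rsplit("-", 2)
    version_key = tuple(int(part) for part in version.split("."))
    # Keep only the leading numeric part, e.g. '12' from '12.el7rhgs.noarch'.
    match = re.match(r"\d+(?:\.\d+)*", release)
    if match is None:
        raise ValueError("release does not start with a number: %r" % release)
    release_key = tuple(int(part) for part in match.group(0).split("."))
    return name, version_key, release_key


def has_fix(installed: str, fixed_in: str = FIXED_IN) -> bool:
    """Return True if the installed build is at or newer than fixed_in."""
    name, version, release = split_nvr(installed)
    fixed_name, fixed_version, fixed_release = split_nvr(fixed_in)
    if name != fixed_name:
        raise ValueError("cannot compare %s with %s" % (name, fixed_name))
    return (version, release) >= (fixed_version, fixed_release)
```

With the build listed above (tendrl-commons-1.6.3-7.el7rhgs.noarch), has_fix returns False, which is consistent with the problem being reproducible on these packages.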

How reproducible:
  100%

Steps to Reproduce:
1. Prepare, install and configure a Gluster cluster (Gluster Trusted Storage Pool)
  plus one or more additional Gluster Storage nodes which are not part of
  the Gluster Trusted Storage Pool.
2. Install and configure RHGS WA Server and Node Agents on the nodes in the
  Gluster Trusted Storage Pool.
3. Import the cluster into RHGS WA.
4. Extend the Gluster Trusted Storage Pool using the additional hosts.
5. Rerun the tendrl-ansible playbook to configure Node Agents on the new nodes.
6. Disable RHGS WA repos on one of the added nodes (or do any other action to
  ensure that the Expand Cluster task will fail).
7. Launch the Expand Cluster process.

Actual results:
  The Expand Cluster process fails, because of the expected failure during
  installation of the tendrl-gluster-integration package.
  The problem is that there is no way to relaunch the Expand Cluster process
  once the issue is resolved (the affected repositories are re-enabled in our
  case).

Expected results:
  It should be possible to restart/relaunch a failed Expand Cluster task.

Additional info:
  It is possible to unmanage the whole cluster and import it again, but
  this would lead to the loss of all the historical data in Grafana (it is
  not easily accessible from the archive created by the unmanage cluster task).

It might be related to or depend on Bug 1583590.

Comment 3 Martin Bukatovic 2018-07-04 12:02:58 UTC

This is a bug, not an RFE.

Comment 7 Ju Lim 2018-07-23 21:33:38 UTC

In reviewing the suggested text, I made some minor edits.  Try this one:

"If cluster expansion fails, check if tendrl-ansible was executed successfully and ensure the node agents are correctly configured.  If cluster expansion failed due to errors, resolve the errors on affected nodes and re-initiate the Expand Cluster action."

Comment 9 Martin Bukatovic 2018-07-27 08:28:47 UTC

The QE team will try to inflict 2 different errors (e.g. breaking yum repos as
described in this BZ and cutting one machine off) during expand and see that
it's possible to recover following the tooltip text (see comment 7).

Any problem beyond that would require a separate bugzilla, with a description
of the particular expand error.

Comment 12 Daniel Horák 2018-08-10 11:37:04 UTC

Created attachment 1474988
Expand Cluster button on Hosts page is disabled when Expansion task failed

Moving back to ASSIGNED, because it is not possible to relaunch a previously
failed Expansion task from the "Hosts" page. The "Expand Cluster" button is
visible but disabled (see attached screenshot).

Version-Release number of selected component (if applicable):
  Red Hat Gluster Web Administration Server:
  tendrl-ansible-1.6.3-6.el7rhgs.noarch
  tendrl-api-1.6.3-5.el7rhgs.noarch
  tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
  tendrl-commons-1.6.3-11.el7rhgs.noarch
  tendrl-grafana-plugins-1.6.3-8.el7rhgs.noarch
  tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-monitoring-integration-1.6.3-8.el7rhgs.noarch
  tendrl-node-agent-1.6.3-9.el7rhgs.noarch
  tendrl-notifier-1.6.3-4.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-ui-1.6.3-9.el7rhgs.noarch

  Red Hat Gluster Storage Server:
  tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-commons-1.6.3-11.el7rhgs.noarch
  tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
  tendrl-node-agent-1.6.3-9.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch

Note: It is possible to relaunch the failed Expansion from the Clusters page,
from the menu under the three dots on the right side of the particular cluster line.

>> ASSIGNED

Comment 13 Nishanth Thomas 2018-08-13 03:24:06 UTC

PR: https://github.com/Tendrl/ui/pull/1038

Comment 14 Daniel Horák 2018-08-15 14:31:53 UTC

Tested and Verified on two scenarios:
* disabling RHGS WA Repo(s) on one of the expanded Gluster Storage Servers
* stopping tendrl-node-agent on one of the expanded Gluster Storage Servers

In both cases, it was possible to relaunch the "expand" cluster task and,
once the simulated issue was corrected, the expand job passed.
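
The two scenarios could be written down roughly as follows. This is only a dry-run sketch: it builds the break/repair command pairs and does not execute anything. The repository id passed in is a placeholder, the function name failure_scenarios is made up for this illustration, and the systemd unit name is assumed to match the tendrl-node-agent package name. It shows one plausible way to simulate the failures, not necessarily the exact commands used during verification.

```python
from typing import Dict, List

# Assumption: the systemd unit is named after the tendrl-node-agent package.
NODE_AGENT_UNIT = "tendrl-node-agent"


def failure_scenarios(repo_id: str) -> Dict[str, Dict[str, List[str]]]:
    """Return break/repair commands per scenario without running them.

    repo_id is a placeholder for the RHGS WA repository id configured on
    the storage node; this bug does not state the real id.
    """
    return {
        "repo-disabled": {
            # yum-config-manager is provided by the yum-utils package.
            "break": ["yum-config-manager", "--disable", repo_id],
            "repair": ["yum-config-manager", "--enable", repo_id],
        },
        "node-agent-stopped": {
            "break": ["systemctl", "stop", NODE_AGENT_UNIT],
            "repair": ["systemctl", "start", NODE_AGENT_UNIT],
        },
    }


if __name__ == "__main__":
    # "example-rhgs-wa-repo" is a made-up repository id.
    for name, commands in failure_scenarios("example-rhgs-wa-repo").items():
        print(name)
        print("  break: ", " ".join(commands["break"]))
        print("  repair:", " ".join(commands["repair"]))
```

In each scenario, the "break" command would be run on one expanded storage node before launching Expand Cluster, and the "repair" command before relaunching the failed task.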

Version-Release number of selected component (if applicable):
  Red Hat Gluster Web Administration Server:
  Red Hat Enterprise Linux Server release 7.5 (Maipo)
  collectd-5.7.2-3.1.el7rhgs.x86_64
  collectd-ping-5.7.2-3.1.el7rhgs.x86_64
  etcd-3.2.7-1.el7.x86_64
  grafana-4.3.2-3.el7rhgs.x86_64
  libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
  python-etcd-0.4.5-2.el7rhgs.noarch
  rubygem-etcd-0.3.0-2.el7rhgs.noarch
  tendrl-ansible-1.6.3-6.el7rhgs.noarch
  tendrl-api-1.6.3-5.el7rhgs.noarch
  tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
  tendrl-commons-1.6.3-12.el7rhgs.noarch
  tendrl-grafana-plugins-1.6.3-10.el7rhgs.noarch
  tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-monitoring-integration-1.6.3-10.el7rhgs.noarch
  tendrl-node-agent-1.6.3-10.el7rhgs.noarch
  tendrl-notifier-1.6.3-4.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-ui-1.6.3-10.el7rhgs.noarch

  Red Hat Gluster Storage Server:
  Red Hat Enterprise Linux Server release 7.5 (Maipo)
  Red Hat Gluster Storage Server 3.4.0
  collectd-5.7.2-3.1.el7rhgs.x86_64
  collectd-ping-5.7.2-3.1.el7rhgs.x86_64
  glusterfs-3.12.2-16.el7rhgs.x86_64
  glusterfs-api-3.12.2-16.el7rhgs.x86_64
  glusterfs-cli-3.12.2-16.el7rhgs.x86_64
  glusterfs-client-xlators-3.12.2-16.el7rhgs.x86_64
  glusterfs-events-3.12.2-16.el7rhgs.x86_64
  glusterfs-fuse-3.12.2-16.el7rhgs.x86_64
  glusterfs-geo-replication-3.12.2-16.el7rhgs.x86_64
  glusterfs-libs-3.12.2-16.el7rhgs.x86_64
  glusterfs-rdma-3.12.2-16.el7rhgs.x86_64
  glusterfs-server-3.12.2-16.el7rhgs.x86_64
  gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
  gluster-nagios-common-0.2.4-1.el7rhgs.noarch
  libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
  libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
  python2-gluster-3.12.2-16.el7rhgs.x86_64
  python-etcd-0.4.5-2.el7rhgs.noarch
  tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
  tendrl-commons-1.6.3-12.el7rhgs.noarch
  tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
  tendrl-node-agent-1.6.3-10.el7rhgs.noarch
  tendrl-selinux-1.5.4-2.el7rhgs.noarch
  vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

>> VERIFIED

Comment 16 errata-xmlrpc 2018-09-04 07:08:24 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616