Description of problem:

When the cluster expand task fails for some reason, there is no way to rerun or fix it. Per Bug 1582465, the tooltip for the "Expansion Failed" cluster state is:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"If cluster expansion fails, check if tendrl-ansible was executed successfully and ensure the node agents are correctly configured."
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

which doesn't make much sense, as there is no way to rerun the expand cluster task.

Version-Release number of selected component (if applicable):

* RHGS WA Server:

Red Hat Enterprise Linux Server release 7.5 (Maipo)
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
grafana-4.3.2-3.el7rhgs.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
libcollection-0.7.0-29.el7.x86_64
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch

* Gluster Storage Server:

Red Hat Enterprise Linux Server release 7.5 (Maipo)
Red Hat Gluster Storage Server 3.4.0
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
glusterfs-3.12.2-13.el7rhgs.x86_64
glusterfs-api-3.12.2-13.el7rhgs.x86_64
glusterfs-cli-3.12.2-13.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-13.el7rhgs.x86_64
glusterfs-events-3.12.2-13.el7rhgs.x86_64
glusterfs-fuse-3.12.2-13.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-13.el7rhgs.x86_64
glusterfs-libs-3.12.2-13.el7rhgs.x86_64
glusterfs-rdma-3.12.2-13.el7rhgs.x86_64
glusterfs-server-3.12.2-13.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
libcollection-0.7.0-29.el7.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
python2-gluster-3.12.2-13.el7rhgs.x86_64
python-debtcollector-1.8.0-1.el7ost.noarch
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Prepare, install and configure a Gluster cluster (Gluster Trusted Storage Pool) plus one or more additional Gluster Storage nodes which are not part of the Gluster Trusted Storage Pool.
2. Install and configure the RHGS WA Server and Node Agents on the nodes in the Gluster Trusted Storage Pool.
3. Import the cluster into RHGS WA.
4. Extend the Gluster Trusted Storage Pool using the additional hosts.
5. Rerun the tendrl-ansible playbook to configure Node Agents on the new nodes.
6. Disable the RHGS WA repos on one of the added nodes (or take any other action that ensures the Expand Cluster task will fail).
7. Launch the Expand Cluster process.

Actual results:
The Expand Cluster process fails, because of the expected failure during installation of the tendrl-gluster-integration package. The problem is that there is no way to relaunch the Expand Cluster process once the issue is resolved (in our case, once the affected repositories are re-enabled).

Expected results:
It should be possible to restart/relaunch a failed Expand Cluster task.

Additional info:
It is possible to unmanage the whole cluster and import it again, but this would lead to the loss of all historical data in Grafana (it is not easily accessible from the archive created by the unmanage cluster task).

It might be related to or depend on Bug 1583590.
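Steps 6 and 7 above can be sketched with standard RHEL tooling. This is a hypothetical command sequence, not taken from the bug report: the repo id shown is an example (list the actual ids on your node with `yum repolist all`), and the commands must run on the affected storage node.

```shell
# On one of the newly added nodes: disable the RHGS WA repo so that the
# tendrl-gluster-integration package install during Expand Cluster fails.
# (Repo id below is an assumed example; check `yum repolist all` for real ids.)
yum-config-manager --disable rh-gluster-3-web-admin-agent-rpms

# ... now launch Expand Cluster from the WA UI; the task fails as expected ...

# Resolve the simulated issue by re-enabling the repo. The bug is that at
# this point the UI offers no way to re-initiate the failed Expand Cluster task.
yum-config-manager --enable rh-gluster-3-web-admin-agent-rpms
```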
This was filed as a bug, but is now being tracked as an RFE.
In reviewing the suggested text, I made some minor edits. Try this one: "If cluster expansion fails, check if tendrl-ansible was executed successfully and ensure the node agents are correctly configured. If cluster expansion failed due to errors, resolve the errors on affected nodes and re-initiate the Expand Cluster action."
The QE team will try to inflict two different errors (e.g., breaking yum repos as described in this BZ, and cutting one machine off) during expansion and verify that it's possible to recover by following the tooltip text (see comment 7). Any problem beyond that would require a separate Bugzilla with a description of the particular expand error.
Created attachment 1474988 [details]
Expand Cluster button on Hosts page is disabled when Expansion task failed

Moving back to ASSIGNED, because it is not possible to relaunch a previously failed Expansion task from the "Hosts" page. The "Expand Cluster" button is visible but disabled (see attached screenshot).

Version-Release number of selected component (if applicable):

Red Hat Gluster Web Administration Server:
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-11.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-8.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-8.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-9.el7rhgs.noarch

Red Hat Gluster Storage Server:
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-11.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch

Note: It is possible to relaunch the failed Expansion from the Clusters page, from the menu under the three dots on the right side of the particular cluster's line.

>> ASSIGNED
PR: https://github.com/Tendrl/ui/pull/1038
Tested and verified with two scenarios:

* disabling the RHGS WA repo(s) on one of the expanded Gluster Storage Servers
* stopping tendrl-node-agent on one of the expanded Gluster Storage Servers

In both cases, it was possible to relaunch the Expand Cluster task, and once the simulated issue was corrected, the expand job passed.

Version-Release number of selected component (if applicable):

Red Hat Gluster Web Administration Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
etcd-3.2.7-1.el7.x86_64
grafana-4.3.2-3.el7rhgs.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
python-etcd-0.4.5-2.el7rhgs.noarch
rubygem-etcd-0.3.0-2.el7rhgs.noarch
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-10.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-10.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-10.el7rhgs.noarch

Red Hat Gluster Storage Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
Red Hat Gluster Storage Server 3.4.0
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
glusterfs-3.12.2-16.el7rhgs.x86_64
glusterfs-api-3.12.2-16.el7rhgs.x86_64
glusterfs-cli-3.12.2-16.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-16.el7rhgs.x86_64
glusterfs-events-3.12.2-16.el7rhgs.x86_64
glusterfs-fuse-3.12.2-16.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-16.el7rhgs.x86_64
glusterfs-libs-3.12.2-16.el7rhgs.x86_64
glusterfs-rdma-3.12.2-16.el7rhgs.x86_64
glusterfs-server-3.12.2-16.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
python2-gluster-3.12.2-16.el7rhgs.x86_64
python-etcd-0.4.5-2.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

>> VERIFIED
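The second verification scenario (stopping the node agent) can be sketched as a hypothetical command sequence on one of the expanded storage nodes; the service name tendrl-node-agent matches the installed package, but the exact sequence is illustrative, not taken from the test logs.

```shell
# Simulate a failure: stop the node agent on one of the expanded nodes
# so that the Expand Cluster task cannot reach it.
systemctl stop tendrl-node-agent

# ... re-initiate Expand Cluster from the WA UI; the task fails ...

# Correct the simulated issue, then re-initiate the task from the UI;
# with the fix verified here, the relaunched expand job passes.
systemctl start tendrl-node-agent
systemctl status tendrl-node-agent --no-pager
```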
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2616