Bug 1514442

Summary: Successive attempts to import the same cluster on the same webadmin server fail
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Sweta Anandpara <sanandpa>
Component: web-admin-tendrl-monitoring-integration
Assignee: Shubhendu Tripathi <shtripat>
Status: CLOSED ERRATA
QA Contact: Filip Balák <fbalak>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.3
CC: fbalak, nthomas, rghatvis, rhs-bugs, shtripat
Target Milestone: ---
Target Release: RHGS 3.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: tendrl-ansible-1.6.1-2.el7rhgs.noarch.rpm, tendrl-api-1.6.1-1.el7rhgs.noarch.rpm, tendrl-commons-1.6.1-1.el7rhgs.noarch.rpm, tendrl-monitoring-integration-1.6.1-1.el7rhgs.noarch.rpm, tendrl-node-agent-1.6.1-1.el7, tendrl-ui-1.6.1-1.el7rhgs.noarch.rpm
Doc Type: Bug Fix
Doc Text:
Cause: Previously, if a cluster import failed in RHGS-WA, there was no way to trigger the import again from the UI. The user had to clean up the etcd details and fire the import again.
Consequence: Successive attempts to import the same cluster would fail repeatedly.
Fix: With the latest changes around import cluster, and the new feature to un-manage a cluster and import it again, if a cluster import fails (for example, due to invalid repos configured on the storage nodes for installation of components), the user now has a chance to correct the issues on the underlying nodes and then trigger a re-import of the cluster. The un-manage cluster flow also helps here, as the cluster can be un-managed and then re-imported. The import job now succeeds only if every node reports that all required components are installed, up, and running, and that the first round of sync is done for the node. If any node fails to report this, the import cluster job fails; the user can then correct the reported issues and re-import the cluster from the RHGS-WA UI.
Result: If an import fails now, the user can correct the issues on the underlying nodes and re-import the cluster.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-04 06:58:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1502877, 1503134, 1516135    

Description Sweta Anandpara 2017-11-17 12:39:23 UTC
Description of problem:
=======================
While trying to verify bugs 1504702 and 1507984, the unavailability of repos was simulated to validate the import failure and the messages that are captured. The steps related to the above bugs were validated successfully.

To proceed, we cleaned up the etcd database and restarted tendrl-node-agent on the storage nodes and the webadmin server (as directed by Nishanth). We then tried the import again, this time in a healthy environment, but it failed again with a traceback mentioning 'Atom execution failed. Error executing post run function: tendrl.objects.Cluster.atoms.ConfigureMonitoring'.

Screenshot of the error messages and /var/log/messages is copied to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>

One possibility is that stale entries are causing things to go haywire.

Either way, there is a high probability that a cluster import will fail at a customer site for some config/setup-related reason, or because not all the right steps were followed. The customer would then take corrective steps and try the cluster import another time. The second (or nth) attempt at cluster import should not fail because of things that went wrong during attempts 1 through n-1.

Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.8.4-52
tendrl-node-agent-1.5.4-2.el7rhgs.noarch
tendrl-ansible-1.5.4-1.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch
tendrl-commons-1.5.4-2.el7rhgs.noarch
tendrl-api-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-3.el7rhgs.noarch
tendrl-grafana-selinux-1.5.3-2.el7rhgs.noarch
tendrl-api-httpd-1.5.4-2.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-3.el7rhgs.noarch
tendrl-notifier-1.5.4-2.el7rhgs.noarch


How reproducible:
=================
Seen multiple times on the same setup.
Nishanth did a cleanup another time, and it failed again.

Comment 2 Rohan Kanade 2017-11-20 07:25:37 UTC
Tendrl does not currently support retrying the import of a cluster whose import has already failed. This will be supported in the next release. For now, the user needs to clean up the Tendrl central store (etcd) and retry the import.
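For reference, the manual cleanup this comment refers to (documented in full at the wiki link below) amounts to something like the following sketch. This is not the official procedure; the service names and the etcd data directory are assumptions based on a default Tendrl/etcd installation and may differ per setup.

```shell
# Hedged sketch of the manual central-store cleanup; service names and
# paths assume a default installation -- verify against the wiki guide.

# 1. On the webadmin server, stop the Tendrl services and etcd.
systemctl stop tendrl-api tendrl-monitoring-integration tendrl-node-agent etcd

# 2. Wipe the central store. /var/lib/etcd is etcd's default data
#    directory; adjust if etcd is configured with a different data-dir.
rm -rf /var/lib/etcd/*

# 3. Bring etcd and the Tendrl services back up.
systemctl start etcd tendrl-api tendrl-monitoring-integration tendrl-node-agent

# 4. On every storage node, restart the local node agent so it
#    re-registers with the now-empty central store.
systemctl restart tendrl-node-agent
```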

Steps documented here: https://github.com/Tendrl/documentation/wiki/Tendrl-release-v1.5.4-(install-guide)#uninstall-tendrl

Comment 3 Rohan Kanade 2018-01-22 13:07:24 UTC
Tendrl will provide flows called "unmanage cluster" and "delete cluster". In case of a failed import, the user must "delete cluster" and then try to import the cluster again.

Comment 5 Shubhendu Tripathi 2018-03-05 02:37:24 UTC
Now, if an import fails due to issues such as wrong repos set for glusterfs (required for the glusterfs-events dependency), the import action is still allowed after the failure. The user can correct the underlying error, and then un-manage + import works from the Tendrl UI.

Comment 6 Filip Balák 2018-06-26 08:11:25 UTC
When the reproducer from BZ 1570048 is applied and the user fixes the problems and tries to import the cluster again, the import succeeds.
The unmanage function (BZ 1526338) seems to work correctly as a way to correct cluster misconfiguration. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch

Comment 8 Shubhendu Tripathi 2018-09-04 02:56:06 UTC
Looks fine

Comment 10 errata-xmlrpc 2018-09-04 06:58:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616