Bug 1616215
| Summary: | All alerts `Service: glustershd is disconnected in cluster` are cleared when service starts on one node | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Filip Balák <fbalak> |
| Component: | web-admin-tendrl-notifier | Assignee: | gowtham <gshanmug> |
| Status: | CLOSED ERRATA | QA Contact: | Filip Balák <fbalak> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.4 | CC: | apaladug, mbukatov, nthomas, rhs-bugs, sankarshan |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | tendrl-gluster-integration-1.6.3-10.el7rhgs | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-09-04 07:09:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1503137 | | |
| Attachments: | | | |
PR is under review: https://github.com/Tendrl/gluster-integration/pull/695

While raising the SVC_CONNECTED and SVC_DISCONNECTED alerts we are not assigning the peer host_name to identify each node's alert uniquely, so the alerts from all nodes overwrite the same alert record again and again. That is why only one SVC-related alert is shown in the UI, and it is also cleared when glustershd on any one node comes back to normal. So I have added the peer host_name while raising the alert; each node will now raise its own SVC-related alert, and a clear will affect only that node's alert (see the illustrative sketch further below).

The QE team will test this bug as noted in the description.

All acks provided, attaching to the tracker.

Service alerts seem to be cleared correctly. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-7.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-10.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-10.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-10.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-11.el7rhgs.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616
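The following is a minimal illustrative sketch, not the actual code from the PR linked above, of the idea described in the fix comment: keying each service alert on the peer host name so that a clear from one node matches only that node's alert. The function names, dictionary layout, and `dedup_key()` helper are hypothetical.

```python
# Illustrative sketch only -- not the code from the PR referenced above.
# It shows the idea of the fix: include the peer host name in the alert so
# that each node's SVC_DISCONNECTED alert is a distinct record, and a
# SVC_CONNECTED clear only matches the alert of the node that recovered.
# The dict layout and dedup_key() helper are hypothetical.

def build_svc_alert(cluster_id, service, state, peer_host_name):
    """Build a service alert unique per (cluster, service, peer)."""
    return {
        "alert_type": "SVC_CONNECTED" if state == "up" else "SVC_DISCONNECTED",
        "resource": service,                  # e.g. "glustershd"
        "cluster_id": cluster_id,
        "tags": {
            # Without this field, alerts from all nodes collapse into a
            # single record -- the behaviour reported in this bug.
            "peer_host_name": peer_host_name,
        },
    }


def dedup_key(alert):
    """Key used to match a clearing alert against the alert it clears."""
    return (
        alert["cluster_id"],
        alert["resource"],
        alert["tags"].get("peer_host_name"),
    )
```

With the host name part of the key, a SVC_CONNECTED event from one node clears only the alert carrying that node's host name, leaving the alerts for the still-affected nodes in place.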
Created attachment 1476107 [details]
Events page with cleared alerts and terminal indicating that glustershd runs only on first node

Description of problem:
If the glustershd service is stopped on multiple machines and then started on one of them, the alert `Service: glustershd is disconnected in cluster <cluster>` is cleared and the UI reports no alerts related to the other nodes.

Version-Release number of selected component (if applicable):
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-10.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-10.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-10.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Import a cluster with a distributed-replicated volume.
2. Connect to more than one node of the volume and get the pid of the glustershd process:
   $ cat /var/run/gluster/glustershd/glustershd.pid
   <glustershd-pid>
3. Run `kill <glustershd-pid>` on all connected machines.
4. Wait for the alert in the UI.
5. Restart the glusterd service on one of the nodes with killed glustershd. This should start glustershd. Don't restart the service on the other nodes. (A scripted sketch of these steps is included below.)

Actual results:
`Service: glustershd is disconnected in cluster <cluster>` is cleared, but glustershd is still not running on some nodes.

Expected results:
Alerts that the glustershd service is disconnected should remain for the other nodes, because the problems with glustershd are not resolved there.

Additional info:
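The manual steps above can also be scripted. Below is a hypothetical helper, assuming passwordless ssh as root to the storage nodes; it is not part of the product or of any QE suite, and the node names are placeholders.

```python
# Hypothetical reproduction helper -- not part of the product or a QE suite.
# Assumes passwordless ssh as root to the storage nodes; node names below
# are placeholders. It wraps the manual steps from the description: kill
# glustershd on every node, then restart glusterd on a single node.

import subprocess

PID_FILE = "/var/run/gluster/glustershd/glustershd.pid"


def ssh(node, command):
    """Run a command on a remote node and return its stdout."""
    result = subprocess.run(
        ["ssh", f"root@{node}", command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def kill_glustershd(nodes):
    """Steps 2-3: read the glustershd pid and kill the process on each node."""
    for node in nodes:
        pid = ssh(node, f"cat {PID_FILE}")
        ssh(node, f"kill {pid}")


def restart_glusterd(node):
    """Step 5: restarting glusterd brings glustershd back on this node only."""
    ssh(node, "systemctl restart glusterd")


if __name__ == "__main__":
    nodes = ["node1.example.com", "node2.example.com", "node3.example.com"]
    kill_glustershd(nodes)
    # Step 4: wait for the alert in the UI, then recover just one node.
    restart_glusterd(nodes[0])
```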