Created attachment 1472696 [details]
Service: glustershd is disconnected in cluster notification

Description of problem:
When the glustershd process is killed, the alert `Service: glustershd is disconnected in cluster <cluster>` is generated. This alert is not cleared from the UI when the process is started again.

Version-Release number of selected component (if applicable):
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-9.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-7.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-7.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-8.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Import a cluster with a distributed replicated volume.
2. Connect to one of the volume nodes and get the pid of the glustershd process:
   $ cat /var/run/gluster/glustershd/glustershd.pid
   <glustershd-pid>
3. kill <glustershd-pid>
4. Wait for the alert in the UI.
5. Restart the glusterd service on the node where glustershd was killed. This should start glustershd again (see the consolidated command sequence under Additional info).

Actual results:
The alert `Service: glustershd is disconnected in cluster <cluster>` remains in the UI after glustershd is started again.

Expected results:
The alert should be cleared.

Additional info:
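For convenience, the whole reproducer as a single command sequence (a sketch; `systemctl restart glusterd` is the standard way to restart glusterd on a RHEL 7 node, and glusterd respawns glustershd when it starts):

  # on one of the volume nodes
  $ kill "$(cat /var/run/gluster/glustershd/glustershd.pid)"
  # wait for the disconnect alert to appear in the UI, then:
  $ systemctl restart glusterd
  $ cat /var/run/gluster/glustershd/glustershd.pid   # new pid => glustershd is running again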
What is the behaviour:
1. When you shut down glustershd again, does the existing alert get overwritten, or is a new alert generated?
2. Did you get the clear event on the Events panel?
Created attachment 1473621 [details]
Events page with multiple alerts generated

1. A new alert is generated and the old one remains. So when I killed and started glustershd on one machine, multiple alerts were generated for that machine.
2. When glustershd is killed, there is an event: `Service: glustershd is disconnected in cluster <cluster>`. When glustershd is started, there is an event: `Service: glustershd is connected in cluster <cluster>`.

The attachment shows the situation after I killed/started glustershd three times on one machine and once on another machine in the cluster.
The PR is under review: https://github.com/Tendrl/commons/pull/1049
The bug has been acked properly; adding it to the tracker.
One more reason for this issue: sometimes the node-agent message socket read fails, so the alert message is not received properly. The problem is that when the message socket read comes back empty we raise a RuntimeError, but we do not handle that exception properly, so the message socket stops receiving alert messages altogether. Fix PR: https://github.com/Tendrl/node-agent/pull/846
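To illustrate the failure mode (a minimal sketch, not the actual node-agent code; `read_message`, `handle_alert`, and the loop structure are hypothetical stand-ins): if the RuntimeError raised on an empty read propagates out of the receive loop, the loop dies and no further alerts arrive; catching it keeps the socket consuming messages.

  import logging
  import socket

  log = logging.getLogger(__name__)

  def read_message(conn: socket.socket) -> bytes:
      """Read one chunk from the message socket; an empty read raises RuntimeError."""
      data = conn.recv(4096)
      if not data:
          raise RuntimeError("empty read on message socket")
      return data

  def handle_alert(msg: bytes) -> None:
      """Hypothetical stand-in for the alert-processing logic."""
      log.info("received alert message: %r", msg)

  def receive_loop(conn: socket.socket) -> None:
      while True:
          try:
              msg = read_message(conn)
          except RuntimeError:
              # Before the fix, this exception escaped the loop and the socket
              # stopped receiving alert messages entirely. Handling it here
              # keeps the loop alive so later alerts are still processed.
              log.warning("empty read on message socket, retrying")
              continue
          handle_alert(msg)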
With the given reproducer scenario this issue is fixed --> VERIFIED. During testing, BZ 1616208 and BZ 1616215 were filed.

Tested with:
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-10.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-10.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-10.el7rhgs.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2616