Created attachment 1358621 [details]
stale alert

Description of problem:
The 'Service: glustershd is disconnected in cluster' warning alert is not cleared by its clearing event. It stays in the Alerts drawer forever; no matching clearing event removes it.

Version-Release number of selected component (if applicable):
tendrl-ansible-1.5.4-1.el7rhgs.noarch
tendrl-ui-1.5.4-4.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-5.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch
tendrl-commons-1.5.4-4.el7rhgs.noarch
tendrl-api-1.5.4-2.el7rhgs.noarch
tendrl-api-httpd-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.3-2.el7rhgs.noarch
tendrl-node-agent-1.5.4-5.el7rhgs.noarch
tendrl-notifier-1.5.4-3.el7rhgs.noarch

How reproducible:
30%

Steps to Reproduce:
1. Restart the glusterd service while 'another transaction is in progress' is reported for a volume. In my case this state was caused by a stale lock.

Actual results:
In the first scenario RHGSWA generates two events with the same timestamp: the 'Service: glustershd is disconnected in cluster' warning and its clearing counterpart. For the first one an alert is created (mail and SNMP trap are sent if configured). The second event is ignored and no clear alert is generated.

Expected results:
The clearing event should be processed even when it arrives at almost the same time as the warning.

Additional info:
Any clearing event that arrives at almost the same time as the original alert does not clear it. I was also able to keep 'Status of peer: <hostname> in cluster <cluster_ID> changed from Connected to Disconnected' in the Alerts drawer, because the 'Disconnected to Connected' event arrives at almost the same time.

Used scenario (it has a higher reproducibility):
1. Switch off one gluster node
2. Load some data onto the gluster volume
3. Start the node
4. Restart the glusterd service on some other node (several times if needed)

However, for this particular alert another stop-start of the glusterd service clears the warning.
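This is not Tendrl's actual code, only a minimal sketch (hypothetical `Event` and `AlertStore` names, invented fields) of how a clearing event can end up ignored when it carries the same timestamp as the warning it should clear: if processing requires each event to be strictly newer than the last one seen for the resource, the clearing event is silently dropped and the warning stays active.

```python
# Hypothetical illustration only -- not Tendrl code. Shows how a clearing
# event that shares its warning's timestamp can be dropped if events are
# de-duplicated by a "must be strictly newer" check.

from dataclasses import dataclass

@dataclass
class Event:
    resource: str      # e.g. "glustershd"
    severity: str      # "WARNING" or "INFO" (clearing)
    timestamp: float   # seconds since epoch

class AlertStore:
    def __init__(self):
        self.active = {}     # resource -> open warning Event
        self.last_seen = {}  # resource -> timestamp of last processed event

    def process(self, event: Event):
        # Buggy guard: events that are not strictly newer are ignored,
        # so a clearing event with the warning's timestamp is dropped.
        if event.timestamp <= self.last_seen.get(event.resource, -1.0):
            return
        self.last_seen[event.resource] = event.timestamp
        if event.severity == "WARNING":
            self.active[event.resource] = event
        else:
            self.active.pop(event.resource, None)

store = AlertStore()
store.process(Event("glustershd", "WARNING", 1510000000.0))
store.process(Event("glustershd", "INFO", 1510000000.0))  # same timestamp
print(store.active)  # warning still active -> stale alert in the drawer
```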
The probability of this occurring under normal conditions is very low, and I couldn't reproduce it in my setup. A workaround is also mentioned in case it does occur. I don't think it's a blocker for the current release. Moving this out.
The service clearing alert is not matched with the warning alert, so the alert is never cleared. This is fixed now: https://github.com/Tendrl/gluster-integration/pull/543, https://github.com/Tendrl/commons/pull/801. A rough sketch of the matching idea follows below.
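The linked PRs are the authoritative fix; the following is only a rough sketch, with invented field and function names, of the general idea: match a clearing alert to the open warning by a stable key (alert type, resource, cluster) rather than by message text or timestamp comparison.

```python
# Rough sketch of the matching idea -- invented names, not the code from
# the linked PRs. A clearing alert is matched to the open warning by a
# stable key, never by message text or by comparing timestamps.

def alert_key(alert: dict) -> tuple:
    # The key fields used here are assumptions for illustration.
    return (alert["alert_type"], alert["resource"], alert["cluster_id"])

def apply_clearing_alert(active_alerts: dict, clearing: dict) -> bool:
    """Remove the matching open warning; return True if one was cleared."""
    key = alert_key(clearing)
    if key in active_alerts:
        del active_alerts[key]
        return True
    return False

# Usage: the warning stays open until a clearing alert with the same key
# arrives, regardless of how close in time the two events are.
active = {("service_status", "glustershd", "c1"): {"severity": "WARNING"}}
cleared = apply_clearing_alert(
    active,
    {"alert_type": "service_status", "resource": "glustershd",
     "cluster_id": "c1", "severity": "INFO"},
)
print(cleared, active)  # True {}
```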
I was not able to reproduce the issue with the original build, nor was I able to reproduce it with the current version. It seems very difficult to get gluster into the `Another transaction is in progress` state with the new gluster version using the given reproducer.

I came up with a different scenario that leads to a similar error: repeatedly calling `gluster volume start <volume> force` from several nodes at once (see the sketch after this comment). However, when I then restarted glusterd I didn't see any alert related to glustershd. Turning glustershd off does lead to the described behaviour: BZ 1611601. I propose to close this BZ 1517233 and track progress in the new BZ 1611601.

Tested with:
glusterfs-3.12.2-15.el7rhgs.x86_64
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-11.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-8.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-8.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-9.el7rhgs.noarch
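For reference, a rough reproduction sketch (not an exact transcript of what was run; hostnames and the volume name are placeholders): fire `gluster volume start <volume> force` on several nodes at once over ssh so the volume operations race and glusterd reports `Another transaction is in progress`.

```python
# Rough reproduction sketch -- placeholders throughout, adjust for your setup.
# Launches `gluster volume start <volume> force` concurrently on several
# nodes to race the volume operations.

import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = ["gluster-node1", "gluster-node2", "gluster-node3"]  # placeholders
VOLUME = "testvol"                                           # placeholder
ROUNDS = 20

def start_force(node: str) -> str:
    result = subprocess.run(
        ["ssh", node, "gluster", "volume", "start", VOLUME, "force"],
        capture_output=True, text=True,
    )
    return f"{node}: rc={result.returncode} {result.stderr.strip()}"

with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
    for _ in range(ROUNDS):
        # Run the same command on all nodes at the same time.
        for line in pool.map(start_force, NODES):
            print(line)
```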
PM ack is already set on this BZ for it to be dropped.
I'm closing this BZ (see comment 4 for the details), as discussed at the program meeting on 2018-08-14. Both development (Nishant) and product management (Anand) agree.