Created attachment 1458705 [details]
Peer disconnected alert

Description of problem:
When all Gluster nodes are shut down and then started again after a while, an alert remains for one of the nodes:
`Peer <node-id> in cluster <cluster-id> is Disconnected`
All other alerts are cleared correctly.

Version-Release number of selected component (if applicable):
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-8.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-6.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-6.el7rhgs.noarch
tendrl-node-agent-1.6.3-8.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-6.el7rhgs.noarch

How reproducible:
50%

Steps to Reproduce:
1. Install WA.
2. Import cluster with 2 volumes, set cluster name.
3. Shut down all gluster nodes.
4. Wait for volumes to disappear from UI, at least 20 minutes (the issue seems to happen more often with a longer wait).
5. Start all nodes.
6. Check alerts in UI.

Actual results:
One alert remains:
`Peer <node-id> in cluster <cluster-id> is Disconnected`

Expected results:
There should be no alerts if the nodes started correctly.
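The check in step 6 could be scripted roughly as below. This is only a sketch: it assumes the alert messages have already been collected into a list of strings (how they are fetched from the UI or API is not shown), and the helper name `leftover_peer_alerts` is made up for illustration. The message pattern follows the alert text quoted above.

```python
import re

# Pattern based on the alert text quoted in this report:
#   Peer <node-id> in cluster <cluster-id> is Disconnected
PEER_DISCONNECTED = re.compile(r"^Peer (\S+) in cluster (\S+) is Disconnected$")


def leftover_peer_alerts(messages):
    """Return (node_id, cluster_id) pairs for peer-disconnected alerts.

    `messages` is a hypothetical list of alert message strings; fetching
    them from the UI or API is outside the scope of this sketch.
    """
    found = []
    for message in messages:
        match = PEER_DISCONNECTED.match(message.strip())
        if match:
            found.append((match.group(1), match.group(2)))
    return found
```

After all nodes are back up, an empty result from `leftover_peer_alerts(...)` would match the expected results; a non-empty result would indicate the stale alert described here.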
When you shut down all the nodes, a peer disconnected alert won't be raised; to raise a peer disconnect alert, at least one storage node has to be up.
Filip, I followed the same steps you mentioned above but I can't reproduce this issue; it works fine. I have tried a few times, shutting down all the nodes at the same time as well as one by one, and in all cases it works fine.
I see this issue (alert 'Node <node> is DOWN') almost always in the current version when I stop the tendrl related services, wait 10 minutes and start the services again. I use playbooks [1] and [2] for this.

Tested with:
tendrl-ansible-1.6.3-10.el7rhgs.noarch
tendrl-api-1.6.3-8.el7rhgs.noarch
tendrl-api-httpd-1.6.3-8.el7rhgs.noarch
tendrl-commons-1.6.3-13.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-15.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-15.el7rhgs.noarch
tendrl-node-agent-1.6.3-11.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-14.el7rhgs.noarch

[1] https://github.com/usmqe/usmqe-setup/blob/master/test_setup.tendrl_services_stopped_on_nodes.yml
[2] https://github.com/usmqe/usmqe-setup/blob/master/test_teardown.tendrl_services_stopped_on_nodes.yml
I used the same script but I can't reproduce this; it works fine.
Also, this issue is about peer disconnect; node down should be discussed in a separate issue, I think.
This is consistently reproduced by running our automated tests for alerting. We will need to add an xfail check there for this bug.
Affected test cases are in usmqe_tests.alerting.test_status_alerting:
- test_host_status_api_alert
- test_host_status_mail_alert
- test_host_status_snmp_alert
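For illustration, marking a test as an expected failure for this bug might look like the sketch below. It uses the standard library's `unittest.expectedFailure` so that it is self-contained; in a pytest-based suite the analogous marker would be `pytest.mark.xfail`. The test class, the test name and the `get_active_alerts` stub are hypothetical and are not taken from the real usmqe tests.

```python
import unittest


def get_active_alerts():
    """Hypothetical stand-in for fetching alerts after the nodes restart.

    It returns one stale alert to simulate the behaviour reported here.
    """
    return ["Peer node-1 in cluster cluster-1 is Disconnected"]


class HostStatusAlertTest(unittest.TestCase):
    # Known issue tracked in this bug: a stale alert may remain after restart.
    @unittest.expectedFailure
    def test_no_alerts_after_restart(self):
        # Fails while the bug is present; the decorator records that
        # failure as expected instead of failing the whole run.
        self.assertEqual(get_active_alerts(), [])


def run_suite():
    """Run the sketch test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(HostStatusAlertTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Once the bug is fixed, the test starts passing and is reported as an unexpected success, which is the signal to remove the marker.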
This issue was fixed as part of the fix for: https://bugzilla.redhat.com/show_bug.cgi?id=1687333