Bug 1600910 - Peer is Disconnected alert not cleared properly
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: web-admin-tendrl-notifier
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Timothy Asir
QA Contact: sds-qe-bugs
URL:
Whiteboard:
Depends On: 1515276
Blocks:
 
Reported: 2018-07-13 10:50 UTC by Filip Balák
Modified: 2020-02-07 08:15 UTC
CC List: 4 users

Fixed In Version: tendrl-commons-1.6.3-18.el7rhgs.noarch
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-07 08:15:06 UTC
Embargoed:


Attachments
Peer disconnected alert (119.67 KB, image/png)
2018-07-13 10:50 UTC, Filip Balák


Links
Red Hat Bugzilla 1580385 (unspecified, CLOSED): Node is DOWN alert not cleared properly (last updated 2021-02-22 00:41:40 UTC)

Internal Links: 1580385

Description Filip Balák 2018-07-13 10:50:40 UTC
Created attachment 1458705 [details]
Peer disconnected alert

Description of problem:
When all Gluster nodes are shut down and then started again after a while, an alert remains for one of the nodes:
`Peer <node-id> in cluster <cluster-id> is Disconnected`
All other alerts are cleared correctly.

Version-Release number of selected component (if applicable):
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-8.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-6.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-6.el7rhgs.noarch
tendrl-node-agent-1.6.3-8.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-6.el7rhgs.noarch

How reproducible:
50%

Steps to Reproduce:
1. Install WA.
2. Import a cluster with 2 volumes and set the cluster name.
3. Shut down all gluster nodes.
4. Wait for the volumes to disappear from the UI, at least 20 minutes (the issue seems to occur more often with a longer wait).
5. Start all nodes.
6. Check alerts in UI.
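
For step 6, the alerts can also be checked against the Tendrl API instead of the UI. A minimal sketch, assuming the server URL, the `/api/1.0/alerts` endpoint path, and the response shape shown (all placeholders here, not confirmed from this report):

```python
# Hedged sketch: list alerts from the WA server after the nodes come back.
# Endpoint path, auth header, and response fields are assumptions.
import requests

WA_SERVER = "http://wa-server.example.com"            # hypothetical server
HEADERS = {"Authorization": "Bearer <access-token>"}  # placeholder token

resp = requests.get(f"{WA_SERVER}/api/1.0/alerts", headers=HEADERS)
resp.raise_for_status()

# After a clean restart there should be no lingering peer alerts.
stale = [a for a in resp.json()
         if "is Disconnected" in a.get("tags", {}).get("message", "")]
print(f"{len(stale)} stale 'Peer ... is Disconnected' alert(s)")
```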

Actual results:
One alert remains:
`Peer <node-id> in cluster <cluster-id> is Disconnected`

Expected results:
There should be no alerts if the nodes started correctly.

Comment 1 gowtham 2018-07-13 11:33:58 UTC
When you shut down all the nodes, a peer disconnected alert won't be raised; to raise a peer disconnect alert, at least one storage node has to be up.
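
For context, the raise/clear bookkeeping under discussion looks roughly like the following. This is a minimal illustrative sketch, not the actual Tendrl notifier code; all names are made up:

```python
# Illustrative sketch of peer-alert bookkeeping (not actual Tendrl code).
def update_peer_alert(alerts, cluster_id, peer_id, status):
    """Raise a Disconnected alert, or clear it when the peer reconnects."""
    key = ("peer_disconnected", cluster_id, peer_id)
    if status == "Disconnected":
        # Per comment 1: only observable while at least one node is up.
        alerts[key] = f"Peer {peer_id} in cluster {cluster_id} is Disconnected"
    elif status == "Connected":
        # The bug reported here: for one peer this clear path is sometimes
        # not reached after a full-cluster restart, leaving a stale alert.
        alerts.pop(key, None)
```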

Comment 2 gowtham 2018-07-20 11:16:52 UTC
Filip, I followed the same steps you mentioned above, but I can't reproduce this issue; it works fine. I tried a few times, taking down all the nodes at the same time as well as one by one, and it worked fine in every case.

Comment 4 Filip Balák 2018-12-05 13:18:46 UTC
In the current version I almost always see this issue (alert `Node <node> is DOWN`) when I stop the tendrl-related services, wait 10 minutes, and start the services again. I use playbooks [1] and [2] for this.

Tested with:
tendrl-ansible-1.6.3-10.el7rhgs.noarch
tendrl-api-1.6.3-8.el7rhgs.noarch
tendrl-api-httpd-1.6.3-8.el7rhgs.noarch
tendrl-commons-1.6.3-13.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-15.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-15.el7rhgs.noarch
tendrl-node-agent-1.6.3-11.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-14.el7rhgs.noarch

[1] https://github.com/usmqe/usmqe-setup/blob/master/test_setup.tendrl_services_stopped_on_nodes.yml
[2] https://github.com/usmqe/usmqe-setup/blob/master/test_teardown.tendrl_services_stopped_on_nodes.yml
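
The two playbooks stop the tendrl-related services and start them again. A minimal sketch of the equivalent stop/wait/start cycle, assuming standard systemd service names (an assumption; check the playbooks for the authoritative list):

```python
# Hedged sketch of the cycle from this comment; the service list is an
# assumption and differs between the WA server and the storage nodes.
import subprocess
import time

SERVICES = ["tendrl-node-agent"]  # storage node; the WA server would also
                                  # run tendrl-api, tendrl-notifier, etc.

def set_services(action):
    """Run `systemctl <action> <service>` for each tendrl service."""
    for svc in SERVICES:
        subprocess.run(["systemctl", action, svc], check=True)

set_services("stop")
time.sleep(10 * 60)    # wait 10 minutes, as described above
set_services("start")  # then check the UI/API for a stale DOWN alert
```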

Comment 5 gowtham 2018-12-06 06:02:01 UTC
I used the same script but I can't reproduce this; it works fine.

Comment 6 gowtham 2018-12-06 06:03:06 UTC
Also, this issue is about the peer disconnect alert; I think the node down alert should be discussed in a separate issue.

Comment 7 Martin Bukatovic 2019-04-03 07:48:55 UTC
This is consistently reproduced by running our automated tests for alerting. We will need to add an xfail check there for this bug.

Comment 8 Martin Bukatovic 2019-04-03 09:31:45 UTC
Affected test cases are in usmqe_tests.alerting.test_status_alerting:
- test_host_status_api_alert
- test_host_status_mail_alert
- test_host_status_snmp_alert
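
The xfail check mentioned in comment 7 could look like the following. A minimal sketch assuming the usmqe suite uses pytest; the test body is a placeholder, not the real assertion:

```python
# Illustrative xfail marker for this known bug (placeholder test body).
import pytest

@pytest.mark.xfail(
    reason="BZ 1600910: 'Peer ... is Disconnected' alert is not "
           "cleared after all nodes are restarted")
def test_host_status_api_alert():
    alerts = []  # in the real test, fetched via the Tendrl alerts API
    assert all("is Disconnected" not in a for a in alerts)
```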

Comment 13 Nishanth Thomas 2020-02-07 08:15:50 UTC
This issue was fixed as part of the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1687333

