1600910 – Peer is Disconnected alert not cleared properly

Summary: Peer is Disconnected alert not cleared properly

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	web-admin-tendrl-notifier
Sub Component:
Version:	rhgs-3.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	low
Target Milestone:	---
Target Release:	---
Assignee:	Timothy Asir
QA Contact:	sds-qe-bugs
Docs Contact:
URL:
Whiteboard:
Depends On:	1515276
Blocks:

Reported:	2018-07-13 10:50 UTC by Filip Balák
Modified:	2020-02-07 08:15 UTC
CC List:	4 users
Fixed In Version:	tendrl-commons-1.6.3-18.el7rhgs.noarch
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-02-07 08:15:06 UTC
Embargoed:
Dependent Products:

Attachments
Peer disconnected alert (119.67 KB, image/png) 2018-07-13 10:50 UTC, Filip Balák	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1580385	0	unspecified	CLOSED	Node is DOWN alert not cleared properly	2021-02-22 00:41:40 UTC

Internal Links: 1580385

Description Filip Balák 2018-07-13 10:50:40 UTC

Created attachment 1458705
Peer disconnected alert

Description of problem:
When all Gluster nodes are shut down and then started again after a while, an alert remains for one of the nodes:
`Peer <node-id> in cluster <cluster-id> is Disconnected`
All other alerts are cleared correctly.

Version-Release number of selected component (if applicable):
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-8.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-6.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-6.el7rhgs.noarch
tendrl-node-agent-1.6.3-8.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-6.el7rhgs.noarch

How reproducible:
50%

Steps to Reproduce:
1. Install WA.
2. Import cluster with 2 volumes, set cluster name.
3. Shut down all gluster nodes.
4. Wait at least 20 minutes for the volumes to disappear from the UI (the issue seems to occur more often with a longer wait).
5. Start all nodes.
6. Check alerts in UI.

Actual results:
There remains one alert:
`Peer <node-id> in cluster <cluster-id> is Disconnected`

Expected results:
No alerts should remain once the nodes have started correctly.
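
The check in step 6 can also be scripted. The following is a minimal Python sketch, not part of the original report: it assumes the alert messages have already been collected as plain strings (how they are fetched is left out) and matches them against the message format quoted above. The names `PEER_DISCONNECTED` and `stale_peer_alerts`, and the sample messages, are hypothetical.

```python
import re

# Message format taken from the report; the group names are illustrative.
PEER_DISCONNECTED = re.compile(
    r"^Peer (?P<node_id>\S+) in cluster (?P<cluster_id>\S+) is Disconnected$"
)


def stale_peer_alerts(messages):
    """Return (node_id, cluster_id) for each 'Peer ... is Disconnected' message."""
    stale = []
    for message in messages:
        match = PEER_DISCONNECTED.match(message.strip())
        if match:
            stale.append((match.group("node_id"), match.group("cluster_id")))
    return stale


# Hypothetical sample input: one leftover alert plus an unrelated placeholder line.
sample = [
    "Peer node-2 in cluster cluster-1 is Disconnected",
    "placeholder for any other alert text",
]
print(stale_peer_alerts(sample))
```

An empty result after all nodes are back up would correspond to the expected behaviour; any returned pair identifies an alert that was not cleared.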

Comment 1 gowtham 2018-07-13 11:33:58 UTC

When you shut down all the nodes, the peer disconnected alert won't be raised. To raise the peer disconnected alert, at least one storage node has to be up.
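
The condition described in this comment can be sketched as a simplified model. This is not Tendrl's implementation; it only assumes that a down peer can be reported when some other node is still up to observe it. The function name and node names are hypothetical.

```python
def reportable_disconnected_peers(node_up):
    """Return the down nodes for which a peer disconnected alert could be raised.

    node_up maps a node name to True if that node is running.
    Simplified assumption: with no node up, nobody can report peer state.
    """
    observers = [name for name, up in node_up.items() if up]
    if not observers:
        return []
    return sorted(name for name, up in node_up.items() if not up)


# All nodes down: no peer disconnected alert can be raised.
print(reportable_disconnected_peers({"node-1": False, "node-2": False}))
# One node still up: the other node can be reported as disconnected.
print(reportable_disconnected_peers({"node-1": True, "node-2": False}))
```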

Comment 2 gowtham 2018-07-20 11:16:52 UTC

Filip, I followed the same steps you described above, but I can't reproduce this issue; it works fine. I tried a few times, both shutting down all the nodes at the same time and shutting them down one by one, and in every case it works fine.

Comment 4 Filip Balák 2018-12-05 13:18:46 UTC

I see this issue (alert 'Node <node> is DOWN') almost always in the current version when I stop the Tendrl-related services, wait 10 minutes, and start the services again. I use playbooks [1] and [2] for this.

Tested with:
tendrl-ansible-1.6.3-10.el7rhgs.noarch
tendrl-api-1.6.3-8.el7rhgs.noarch
tendrl-api-httpd-1.6.3-8.el7rhgs.noarch
tendrl-commons-1.6.3-13.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-15.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-15.el7rhgs.noarch
tendrl-node-agent-1.6.3-11.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-14.el7rhgs.noarch

[1] https://github.com/usmqe/usmqe-setup/blob/master/test_setup.tendrl_services_stopped_on_nodes.yml
[2] https://github.com/usmqe/usmqe-setup/blob/master/test_teardown.tendrl_services_stopped_on_nodes.yml

Comment 5 gowtham 2018-12-06 06:02:01 UTC

I used the same script, but I can't reproduce this; it works fine.

Comment 6 gowtham 2018-12-06 06:03:06 UTC

Also, this issue is about the Peer Disconnected alert; I think the Node DOWN alert should be discussed in a separate issue.

Comment 7 Martin Bukatovic 2019-04-03 07:48:55 UTC

This reproduces consistently when running our automated alerting tests. We will need to add an xfail check there for this bug.

Comment 8 Martin Bukatovic 2019-04-03 09:31:45 UTC

Affected test cases are usmqe_tests.alerting.test_status_alerting:
- test_host_status_api_alert
- test_host_status_mail_alert
- test_host_status_snmp_alert
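
Marking one of the affected tests could look roughly like this. It is a minimal sketch using `pytest.mark.xfail`; the helper `leftover_host_alerts` is a hypothetical stand-in that simulates the uncleared alert, and the real test bodies in usmqe_tests are not shown here.

```python
import pytest


def leftover_host_alerts():
    """Hypothetical stand-in for the real alert lookup; simulates the bug."""
    return ["Node <node> is DOWN"]


# strict=False: an unexpected pass is reported as XPASS instead of failing the run.
@pytest.mark.xfail(reason="BZ 1600910: alert not cleared properly", strict=False)
def test_host_status_api_alert():
    # Expected behaviour: no alert remains once the services are running again.
    assert leftover_host_alerts() == []
```

With `strict=False`, the test is reported as XFAIL while the bug is present and as XPASS once it is fixed, without failing the run in either case.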

Comment 13 Nishanth Thomas 2020-02-07 08:15:50 UTC

This issue was fixed as part of the fix for: https://bugzilla.redhat.com/show_bug.cgi?id=1687333
