Created attachment 1472696 [details]
Service: glustershd is disconnected in cluster notification

Description of problem:
When the glustershd process is killed, the alert `Service: glustershd is disconnected in cluster <cluster>` is generated. This alert is not cleared from the UI when the process is started again.

Version-Release number of selected component (if applicable):
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-4.el7rhgs.noarch
tendrl-api-httpd-1.6.3-4.el7rhgs.noarch
tendrl-commons-1.6.3-9.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-7.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-7.el7rhgs.noarch
tendrl-node-agent-1.6.3-9.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-8.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Import a cluster with a distributed replicated volume.
2. Connect to one of the volume nodes and get the pid of the glustershd process:
   $ cat /var/run/gluster/glustershd/glustershd.pid
   <glustershd-pid>
3. kill <glustershd-pid>
4. Wait for the alert in the UI.
5. Restart the glusterd service on the node where glustershd was killed. This should start glustershd again (see the consolidated command sequence under Additional info).

Actual results:
The alert `Service: glustershd is disconnected in cluster <cluster>` remains in the UI after glustershd is started again.

Expected results:
The alert should be cleared.

Additional info:
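For convenience, the whole reproducer as a single command sequence (a sketch; `systemctl restart glusterd` is the standard way to restart glusterd on a RHEL 7 node, and glusterd respawns glustershd when it starts):

  # on one of the volume nodes
  $ kill "$(cat /var/run/gluster/glustershd/glustershd.pid)"
  # wait for the disconnect alert to appear in the UI, then:
  $ systemctl restart glusterd
  $ cat /var/run/gluster/glustershd/glustershd.pid   # new pid => glustershd is running again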
What is the behaviour:
1. When you shut down glustershd again, does the existing alert get overwritten, or is a new alert generated?
2. Did you get the clear event on the Events panel?
Created attachment 1473621 [details]
Events page with multiple alerts generated

1. A new alert is generated and the old one remains. So when I killed and started glustershd on one machine, multiple alerts were generated for that machine.
2. When glustershd is killed, there is an event: `Service: glustershd is disconnected in cluster <cluster>`. When glustershd is started, there is an event: `Service: glustershd is connected in cluster <cluster>`.

The attachment shows the situation after I killed/started glustershd three times on one machine and once on another machine in the cluster.
The PR is under review: https://github.com/Tendrl/commons/pull/1049
The bug has been acked properly; adding it to the tracker.
One more reason for this issue: sometimes the node-agent message socket read fails, so the alert message is not received properly. The problem is that when the message socket read comes back empty we raise a RuntimeError, but we do not handle that exception properly, so the message socket stops receiving alert messages altogether. Fix PR: https://github.com/Tendrl/node-agent/pull/846
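To illustrate the failure mode (a minimal sketch, not the actual node-agent code; `read_message`, `handle_alert`, and the loop structure are hypothetical stand-ins): if the RuntimeError raised on an empty read propagates out of the receive loop, the loop dies and no further alerts arrive; catching it keeps the socket consuming messages.

  import logging
  import socket

  log = logging.getLogger(__name__)

  def read_message(conn: socket.socket) -> bytes:
      """Read one chunk from the message socket; an empty read raises RuntimeError."""
      data = conn.recv(4096)
      if not data:
          raise RuntimeError("empty read on message socket")
      return data

  def handle_alert(msg: bytes) -> None:
      """Hypothetical stand-in for the alert-processing logic."""
      log.info("received alert message: %r", msg)

  def receive_loop(conn: socket.socket) -> None:
      while True:
          try:
              msg = read_message(conn)
          except RuntimeError:
              # Before the fix, this exception escaped the loop and the socket
              # stopped receiving alert messages entirely. Handling it here
              # keeps the loop alive so later alerts are still processed.
              log.warning("empty read on message socket, retrying")
              continue
          handle_alert(msg)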
With the given reproducer scenario this issue is fixed --> VERIFIED. During testing, BZ 1616208 and BZ 1616215 were filed.

Tested with:
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-10.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-10.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-10.el7rhgs.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2616