Created attachment 1417778 [details]
UI page with bricks and alerts

Description of problem:
During testing of BZ 1531139 I pushed utilization of all bricks above 90%. I received correct alerts that utilization breached 90%, but for some bricks I also received alerts that they were back to normal, although they were not.

Version-Release number of selected component (if applicable):
tendrl-ansible-1.6.1-3.el7rhgs.noarch
tendrl-api-1.6.1-3.el7rhgs.noarch
tendrl-api-httpd-1.6.1-3.el7rhgs.noarch
tendrl-commons-1.6.1-3.el7rhgs.noarch
tendrl-grafana-plugins-1.6.1-3.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.1-3.el7rhgs.noarch
tendrl-node-agent-1.6.1-3.el7rhgs.noarch
tendrl-notifier-1.6.0-1.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.1-3.el7rhgs.noarch
glusterfs-3.12.2-7.el7rhgs.x86_64

How reproducible:
Not sure, but it happened for multiple bricks.

Steps to Reproduce:
1. Import a cluster with a disperse volume.
2. Fill the volume bricks with data.
3. Check alerts in mail, UI, and SNMP after utilization breaches 75% and 90%.

Actual results:
There are correct alerts about breaching 75% and 90% utilization, but there are also false alerts that utilization for some bricks is back to normal.

Expected results:
When no data was deleted, there should be only alerts that utilization breached the thresholds.

Additional info:
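Step 2 of the reproduction can be driven from a client mount with a small helper like the following (the mount point, file size, and helper name are assumptions for illustration, not part of the reported setup):

```shell
# Hypothetical helper: keep writing fixed-size files into a mounted volume
# until `df` reports utilization at or above the target percentage, so the
# 75% and 90% alert thresholds are crossed one after the other.
fill_until_pct() {
    mount_point=$1
    target_pct=$2
    i=0
    # df --output=pcent prints e.g. " 42%"; strip everything but digits.
    while [ "$(df --output=pcent "$mount_point" | tail -n1 | tr -dc '0-9')" -lt "$target_pct" ]; do
        dd if=/dev/zero of="$mount_point/fill_$i" bs=1M count=64 status=none || break
        i=$((i + 1))
    done
}

# Usage (assumed mount point): fill_until_pct /mnt/volume_gama_disperse 91
```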
I can't reproduce it
I also can't download the attached image.
As per the alert dashboard implementation, we actually maintain two different dashboards for each alert: one for warning and one for critical. Because Grafana raises an alert only once, and raises it again only after it has returned to the OK state, we set the warning condition on the 75 to 90 range. When utilization goes above 90, the warning dashboard raises an OK alert (it only checks within its own range) and the critical dashboard raises a critical alert. When utilization comes back below 90, the critical dashboard raises an OK alert and the warning dashboard raises a warning alert. So the alert is briefly replaced by INFO, but after a few seconds it is replaced again by warning or critical, based on the utilization percentage. Please wait a few minutes and then check again.

There is also a guard mechanism: only a warning OK alert can clear a warning alert, and only a critical OK alert can clear a critical alert. When utilization goes from the 75 to 90 band to above 90, the critical dashboard raises its alert first and the warning is replaced by critical; the warning OK alert that follows is then ignored, because the active alert is critical. I have tested this and it works fine; I tried to reproduce the issue, but it works fine for me.
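The two-dashboard scheme described above can be sketched as a small state machine (a hypothetical illustration of the described behavior; the names and function signatures are assumptions, not Tendrl's actual code):

```python
# Sketch of the two-dashboard alert scheme: a "critical" dashboard watching
# utilization > 90% and a "warning" dashboard watching only the 75-90% band.
# Grafana fires each alert once and fires an "ok" when the value leaves the
# watched range, so crossing ABOVE 90% also makes the warning dashboard fire
# an "ok" ("back to normal"); the guard below prevents it from clearing the
# active critical alert.

WARNING_LOW, WARNING_HIGH = 75.0, 90.0

def dashboard_events(utilization):
    """Return the (severity, state) events both dashboards emit for a value."""
    events = []
    # Critical dashboard: alerting while utilization > 90%.
    if utilization > WARNING_HIGH:
        events.append(("critical", "alerting"))
    else:
        events.append(("critical", "ok"))
    # Warning dashboard: alerting only while 75% <= utilization <= 90%,
    # so going above 90% also produces a warning "ok" event.
    if WARNING_LOW <= utilization <= WARNING_HIGH:
        events.append(("warning", "alerting"))
    else:
        events.append(("warning", "ok"))
    return events

def apply_event(active, severity, state):
    """Guard rule: an 'ok' may only clear an alert of its own severity."""
    if state == "alerting":
        return severity
    if state == "ok" and active == severity:
        return None  # genuinely back to normal
    return active    # e.g. a warning "ok" must not clear an active critical
```

Feeding a rising utilization through this (50% → 80% → 95%) leaves the active alert at critical, because the warning "ok" emitted at 95% is ignored by the guard.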
I was not able to fully reproduce it with the new version, but I have a notification in the UI: `Brick utilization of fbalak-usm1-gl4:|mnt|brick_gama_disperse_2|2 in volume_gama_disperse_4_plus_2x2 back to normal` and a few more in mails related to bricks that got back to normal. The bricks failed and are currently down, so it is not the same setup as the one I originally reported, but it might be useful for you, so I sent you a PM with access to the machines so that you can check it out.

Tested with:
tendrl-ansible-1.6.3-3.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-4.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-4.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch
I was able to reproduce this with a configuration containing a cluster with 2 volumes. I filled the disperse 4 plus 2x2 volume and the issue appeared. --> ASSIGNED

Tested with:
glusterfs-3.12.2-11.el7rhgs.x86_64
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-5.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-3.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-3.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-3.el7rhgs.noarch
tendrl-node-agent-1.6.3-5.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-2.el7rhgs.noarch
I found the root cause of this problem: https://github.com/Tendrl/monitoring-integration/pull/467
This seems to be fixed in the UI but not in SNMP and mail. I received mails and SNMP messages several times that contained information about getting back to normal when utilization was not normal. It happened right before I received a critical alert about full capacity.

```
Jun 04 10:46:13 fbalak-usm1-client.usmqe snmptrapd[804]: 2018-06-04 10:46:13 fbalak-usm1-server.usmqe.lab.eng.brq.redhat.com [UDP: [10.37.169.17]:59518->[10.37.169.118]:162]: DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (0) 0:00:00.00 SNMPv2-MIB::snmpTrapOID.0 = OID: SNMPv2-MIB::coldStart SNMPv2-SMI::private.2312.19.1.0 = STRING: "[INFO], Brick Utilization: threshold breached-Brick utilization of fbalak-usm1-gl2.usmqe.lab.eng.brq.redhat.com:|mnt|brick_gama_disperse_1|1 in volume_gama_disperse_4_plus_2x2 back to normal"
Jun 04 10:46:24 fbalak-usm1-client.usmqe snmptrapd[804]: 2018-06-04 10:46:24 fbalak-usm1-server.usmqe.lab.eng.brq.redhat.com [UDP: [10.37.169.17]:59569->[10.37.169.118]:162]: DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (0) 0:00:00.00 SNMPv2-MIB::snmpTrapOID.0 = OID: SNMPv2-MIB::coldStart SNMPv2-SMI::private.2312.19.1.0 = STRING: "[CRITICAL], Brick Utilization: threshold breached-Brick utilization on fbalak-usm1-gl2.usmqe.lab.eng.brq.redhat.com:|mnt|brick_gama_disperse_1|1 in volume_gama_disperse_4_plus_2x2 at 92.19 % and nearing full capacity"
```

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-3.el7rhgs.noarch
The PR is under review: https://github.com/Tendrl/notifier/pull/178
I tested it several times and I haven't noticed the issue. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616