Bug 1564175 - False alerts when brick utilization breached 90%
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: web-admin-tendrl-notifier
Version: 3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.4.0
Assigned To: gowtham
QA Contact: Filip Balák
Docs Contact:
Depends On:
Blocks: 1503137
Reported: 2018-04-05 10:39 EDT by Filip Balák
Modified: 2018-09-04 03:04 EDT
CC List: 4 users

See Also:
Fixed In Version: tendrl-notifier-1.6.3-4.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-04 03:03:46 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
UI page with bricks and alerts (160.80 KB, image/png), 2018-04-05 10:39 EDT, Filip Balák


External Trackers
Tracker ID Priority Status Summary Last Updated
Github Tendrl/monitoring-integration/issues/458 None None None 2018-05-14 12:59 EDT
Github /Tendrl/monitoring-integration/pull/467 None None None 2018-05-23 15:14 EDT
Github Tendrl/notifier/issues/173 None None None 2018-05-14 12:59 EDT
Github Tendrl/notifier/issues/177 None None None 2018-06-07 12:59 EDT
Red Hat Product Errata RHSA-2018:2616 None None None 2018-09-04 03:04 EDT

Description Filip Balák 2018-04-05 10:39:26 EDT
Created attachment 1417778
UI page with bricks and alerts

Description of problem:
While testing BZ 1531139 I pushed the utilization of all bricks above 90%. I received correct alerts that utilization breached 90%, but for some bricks I also received alerts that they are back to normal although they are not.

Version-Release number of selected component (if applicable):
tendrl-ansible-1.6.1-3.el7rhgs.noarch
tendrl-api-1.6.1-3.el7rhgs.noarch
tendrl-api-httpd-1.6.1-3.el7rhgs.noarch
tendrl-commons-1.6.1-3.el7rhgs.noarch
tendrl-grafana-plugins-1.6.1-3.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.1-3.el7rhgs.noarch
tendrl-node-agent-1.6.1-3.el7rhgs.noarch
tendrl-notifier-1.6.0-1.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.1-3.el7rhgs.noarch
glusterfs-3.12.2-7.el7rhgs.x86_64

How reproducible:
Not sure, but it happened for several bricks.

Steps to Reproduce:
1. Import cluster with disperse volume.
2. Fill volume bricks with data (one possible way is sketched after this list).
3. Check alerts in mail, UI and SNMP after utilization breaches 75% and 90%.
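
A minimal sketch of one way to do step 2 (not part of the original report; the mount path, file name and chunk size are assumptions), assuming the volume or a brick filesystem is mounted locally:

```python
# Hypothetical helper for step 2: write data until the filesystem behind
# the given mount point crosses a target utilization percentage.
import os

def fill_until(mount_point, target_pct, chunk_mb=1):
    chunk = b"\0" * (chunk_mb * 1024 * 1024)
    path = os.path.join(mount_point, "filler.bin")  # assumed scratch file
    with open(path, "ab") as f:
        while True:
            st = os.statvfs(mount_point)
            used_pct = 100.0 * (1 - st.f_bavail / st.f_blocks)
            if used_pct >= target_pct:
                return used_pct
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())

# Example: breach the 90 % threshold on a hypothetical volume mount.
# fill_until("/mnt/volume_gama_disperse_4_plus_2x2", 90.0)
```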

Actual results:
There are correct alerts about breaching 75% and 90% of utilization but there are also false alerts that utilization for some bricks is back to normal.

Expected results:
There should be only alerts that utilization breached the thresholds, because no data was deleted.

Additional info:
Comment 1 gowtham 2018-04-24 06:07:29 EDT
I can't reproduce it.
Comment 2 gowtham 2018-04-24 06:08:21 EDT
I can't download the image either.
Comment 3 gowtham 2018-05-08 09:46:01 EDT
As per the alert dashboard implementation, we actually maintain two different alert dashboards for each alert: one for warning and one for critical. Grafana raises an alert only once and raises it again only after it returns to the OK state, so we set the warning condition to the 75 to 90 range. When utilization goes above 90, the warning dashboard raises an OK alert (because it only checks within its own range) and the critical dashboard raises an alert. When utilization drops back below 90, the critical dashboard raises an OK alert and the warning dashboard raises an alert again.

So the alert is briefly replaced by INFO, but after a few seconds it is replaced again by warning or critical based on the utilization percentage.

I have tested this and it works fine; please wait a few minutes and then check again.

There is already a mechanism so that only a warning OK alert can clear a warning alert and only a critical OK alert can clear a critical alert. When utilization goes from the 75 to 90 range to above 90, the critical dashboard raises its alert first, so the warning alert is replaced by critical; when the warning dashboard then raises its OK alert, it is not taken, because the active alert is critical.
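
A minimal sketch of that clearing rule (my own illustration, not the actual tendrl-notifier code; all names are hypothetical), assuming severities rank WARNING < CRITICAL:

```python
# Hypothetical illustration of the clearing rule described above;
# this is not the actual tendrl-notifier implementation.
SEVERITY_RANK = {"WARNING": 1, "CRITICAL": 2}

class BrickAlertState:
    """Tracks the active alert per brick and applies the clearing rule."""

    def __init__(self):
        self.active = {}  # brick -> active severity ("WARNING" or "CRITICAL")

    def handle_event(self, brick, severity, is_ok):
        """severity is the dashboard that fired ("WARNING" or "CRITICAL");
        is_ok is True when that dashboard returned to the OK state."""
        current = self.active.get(brick)
        if not is_ok:
            # A firing alert becomes the active one if it is at least as
            # severe as whatever is currently active.
            if current is None or SEVERITY_RANK[severity] >= SEVERITY_RANK[current]:
                self.active[brick] = severity
            return self.active[brick]
        # An OK event clears only an alert of the same severity; a warning OK
        # must not clear an active critical alert, otherwise a false
        # "back to normal" notification would be sent.
        if current == severity:
            del self.active[brick]
            return "INFO"  # genuinely back to normal
        return current  # ignore the OK event, the higher alert stays active
```

For example, when utilization crosses from 85 % to 92 %, the critical dashboard fires first (the active alert becomes CRITICAL) and the warning dashboard's OK event is ignored, so no "back to normal" notification should be produced.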


I tried to reproduce but it works fine.
Comment 4 Filip Balák 2018-05-09 04:05:38 EDT
I was not able to fully reproduce it with the new version, but I have a notification in the UI:
`Brick utilization of fbalak-usm1-gl4:|mnt|brick_gama_disperse_2|2 in volume_gama_disperse_4_plus_2x2 back to normal` and a few more in mails related to bricks that got back to normal. The bricks failed and are currently down, so it is not the same setup as the one I originally reported, but it might be useful for you, so I sent you a PM with access to the machines so that you can check it out.

Tested with:
tendrl-ansible-1.6.3-3.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-4.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-4.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch
Comment 7 Filip Balák 2018-05-21 10:05:39 EDT
I was able to reproduce this with a configuration containing a cluster with 2 volumes. I filled the disperse 4 plus 2x2 volume and the issue appeared.
--> ASSIGNED

Tested with:
glusterfs-3.12.2-11.el7rhgs.x86_64
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-5.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-3.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-3.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-3.el7rhgs.noarch
tendrl-node-agent-1.6.3-5.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-2.el7rhgs.noarch
Comment 8 gowtham 2018-05-23 15:17:32 EDT
I found the root cause of this problem: https://github.com/Tendrl/monitoring-integration/pull/467
Comment 9 Filip Balák 2018-06-04 12:21:49 EDT
This seems to be fixed in the UI but not in SNMP and mail. I several times received mails and SNMP messages saying that utilization is back to normal when it was not. It happened right before I received a critical alert about nearing full capacity.

```
Jun 04 10:46:13 fbalak-usm1-client.usmqe snmptrapd[804]: 2018-06-04 10:46:13 fbalak-usm1-server.usmqe.lab.eng.brq.redhat.com [UDP: [10.37.169.17]:59518->[10.37.169.118]:162]: DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (0) 0:00:00.00  SNMPv2-MIB::snmpTrapOID.0 = OID: SNMPv2-MIB::coldStart  SNMPv2-SMI::private.2312.19.1.0 = STRING: "[INFO], Brick Utilization: threshold breached-Brick utilization of fbalak-usm1-gl2.usmqe.lab.eng.brq.redhat.com:|mnt|brick_gama_disperse_1|1 in volume_gama_disperse_4_plus_2x2 back to normal"

Jun 04 10:46:24 fbalak-usm1-client.usmqe snmptrapd[804]: 2018-06-04 10:46:24 fbalak-usm1-server.usmqe.lab.eng.brq.redhat.com [UDP: [10.37.169.17]:59569->[10.37.169.118]:162]: DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (0) 0:00:00.00  SNMPv2-MIB::snmpTrapOID.0 = OID: SNMPv2-MIB::coldStart  SNMPv2-SMI::private.2312.19.1.0 = STRING: "[CRITICAL], Brick Utilization: threshold breached-Brick utilization on fbalak-usm1-gl2.usmqe.lab.eng.brq.redhat.com:|mnt|brick_gama_disperse_1|1 in volume_gama_disperse_4_plus_2x2 at 92.19 % and nearing full capacity"
```

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-3.el7rhgs.noarch
Comment 11 gowtham 2018-06-07 12:59:39 EDT
PR is under review: https://github.com/Tendrl/notifier/pull/178
Comment 12 Filip Balák 2018-06-26 02:28:39 EDT
I tested it several times and I haven't noticed an issue. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch
Comment 14 errata-xmlrpc 2018-09-04 03:03:46 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616
