Created attachment 1417778 [details]
UI page with bricks and alerts

Description of problem:
During testing of BZ 1531139 I pushed utilization of all bricks above 90%. I received correct alerts that utilization breached 90%, but for some bricks I also received alerts that they were back to normal, although they were not.

Version-Release number of selected component (if applicable):
tendrl-ansible-1.6.1-3.el7rhgs.noarch
tendrl-api-1.6.1-3.el7rhgs.noarch
tendrl-api-httpd-1.6.1-3.el7rhgs.noarch
tendrl-commons-1.6.1-3.el7rhgs.noarch
tendrl-grafana-plugins-1.6.1-3.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.1-3.el7rhgs.noarch
tendrl-node-agent-1.6.1-3.el7rhgs.noarch
tendrl-notifier-1.6.0-1.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.1-3.el7rhgs.noarch
glusterfs-3.12.2-7.el7rhgs.x86_64

How reproducible:
Not sure, but it happened for multiple bricks.

Steps to Reproduce:
1. Import a cluster with a disperse volume.
2. Fill the volume bricks with data.
3. Check alerts in mail, UI, and SNMP after utilization breaches 75% and 90%.

Actual results:
There are correct alerts about breaching 75% and 90% utilization, but there are also false alerts that utilization for some bricks is back to normal.

Expected results:
When no data was deleted, there should be only alerts that utilization breached the thresholds.

Additional info:
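Step 2 of the reproduction can be driven from a client mount with a small helper like the following (the mount point, file size, and helper name are assumptions for illustration, not part of the reported setup):

```shell
# Hypothetical helper: keep writing fixed-size files into a mounted volume
# until `df` reports utilization at or above the target percentage, so the
# 75% and 90% alert thresholds are crossed one after the other.
fill_until_pct() {
    mount_point=$1
    target_pct=$2
    i=0
    # df --output=pcent prints e.g. " 42%"; strip everything but digits.
    while [ "$(df --output=pcent "$mount_point" | tail -n1 | tr -dc '0-9')" -lt "$target_pct" ]; do
        dd if=/dev/zero of="$mount_point/fill_$i" bs=1M count=64 status=none || break
        i=$((i + 1))
    done
}

# Usage (assumed mount point): fill_until_pct /mnt/volume_gama_disperse 91
```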
I can't reproduce it
I also can't download the attached image.
As per the alert dashboard implementation, we actually maintain two different dashboards for each alert: one for warning and one for critical. Because Grafana raises an alert only once, and raises it again only after it has returned to the OK state, we set the warning condition on the 75 to 90 range. When utilization goes above 90, the warning dashboard raises an OK alert (it only checks within its own range) and the critical dashboard raises a critical alert. When utilization comes back below 90, the critical dashboard raises an OK alert and the warning dashboard raises a warning alert. So the alert is briefly replaced by INFO, but after a few seconds it is replaced again by warning or critical, based on the utilization percentage. Please wait a few minutes and then check again.

There is also a guard mechanism: only a warning OK alert can clear a warning alert, and only a critical OK alert can clear a critical alert. When utilization goes from the 75 to 90 band to above 90, the critical dashboard raises its alert first and the warning is replaced by critical; the warning OK alert that follows is then ignored, because the active alert is critical. I have tested this and it works fine; I tried to reproduce the issue, but it works fine for me.
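The two-dashboard scheme described above can be sketched as a small state machine (a hypothetical illustration of the described behavior; the names and function signatures are assumptions, not Tendrl's actual code):

```python
# Sketch of the two-dashboard alert scheme: a "critical" dashboard watching
# utilization > 90% and a "warning" dashboard watching only the 75-90% band.
# Grafana fires each alert once and fires an "ok" when the value leaves the
# watched range, so crossing ABOVE 90% also makes the warning dashboard fire
# an "ok" ("back to normal"); the guard below prevents it from clearing the
# active critical alert.

WARNING_LOW, WARNING_HIGH = 75.0, 90.0

def dashboard_events(utilization):
    """Return the (severity, state) events both dashboards emit for a value."""
    events = []
    # Critical dashboard: alerting while utilization > 90%.
    if utilization > WARNING_HIGH:
        events.append(("critical", "alerting"))
    else:
        events.append(("critical", "ok"))
    # Warning dashboard: alerting only while 75% <= utilization <= 90%,
    # so going above 90% also produces a warning "ok" event.
    if WARNING_LOW <= utilization <= WARNING_HIGH:
        events.append(("warning", "alerting"))
    else:
        events.append(("warning", "ok"))
    return events

def apply_event(active, severity, state):
    """Guard rule: an 'ok' may only clear an alert of its own severity."""
    if state == "alerting":
        return severity
    if state == "ok" and active == severity:
        return None  # genuinely back to normal
    return active    # e.g. a warning "ok" must not clear an active critical
```

Feeding a rising utilization through this (50% → 80% → 95%) leaves the active alert at critical, because the warning "ok" emitted at 95% is ignored by the guard.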
I was not able to fully reproduce it with the new version, but I have a notification in the UI: `Brick utilization of fbalak-usm1-gl4:|mnt|brick_gama_disperse_2|2 in volume_gama_disperse_4_plus_2x2 back to normal` and a few more in mails related to bricks that got back to normal. The bricks failed and are currently down, so it is not the same setup as the one I originally reported, but it might be useful for you, so I sent you a PM with access to the machines so that you can check it out.

Tested with:
tendrl-ansible-1.6.3-3.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-4.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-2.el7rhgs.noarch
tendrl-node-agent-1.6.3-4.el7rhgs.noarch
tendrl-notifier-1.6.3-2.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-1.el7rhgs.noarch
I was able to reproduce this with a configuration containing a cluster with 2 volumes. I filled the disperse 4 plus 2x2 volume and the issue appeared. --> ASSIGNED

Tested with:
glusterfs-3.12.2-11.el7rhgs.x86_64
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-5.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-3.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-3.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-3.el7rhgs.noarch
tendrl-node-agent-1.6.3-5.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-2.el7rhgs.noarch
I found the root cause of this problem: https://github.com/Tendrl/monitoring-integration/pull/467
This seems to be fixed in the UI but not in SNMP and mail. I received mails and SNMP messages several times that contained information about getting back to normal when utilization was not normal. It happened right before I received a critical alert about full capacity.

```
Jun 04 10:46:13 fbalak-usm1-client.usmqe snmptrapd[804]: 2018-06-04 10:46:13 fbalak-usm1-server.usmqe.lab.eng.brq.redhat.com [UDP: [10.37.169.17]:59518->[10.37.169.118]:162]: DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (0) 0:00:00.00 SNMPv2-MIB::snmpTrapOID.0 = OID: SNMPv2-MIB::coldStart SNMPv2-SMI::private.2312.19.1.0 = STRING: "[INFO], Brick Utilization: threshold breached-Brick utilization of fbalak-usm1-gl2.usmqe.lab.eng.brq.redhat.com:|mnt|brick_gama_disperse_1|1 in volume_gama_disperse_4_plus_2x2 back to normal"
Jun 04 10:46:24 fbalak-usm1-client.usmqe snmptrapd[804]: 2018-06-04 10:46:24 fbalak-usm1-server.usmqe.lab.eng.brq.redhat.com [UDP: [10.37.169.17]:59569->[10.37.169.118]:162]: DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (0) 0:00:00.00 SNMPv2-MIB::snmpTrapOID.0 = OID: SNMPv2-MIB::coldStart SNMPv2-SMI::private.2312.19.1.0 = STRING: "[CRITICAL], Brick Utilization: threshold breached-Brick utilization on fbalak-usm1-gl2.usmqe.lab.eng.brq.redhat.com:|mnt|brick_gama_disperse_1|1 in volume_gama_disperse_4_plus_2x2 at 92.19 % and nearing full capacity"
```

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-3.el7rhgs.noarch
The PR is under review: https://github.com/Tendrl/notifier/pull/178
I tested it several times and I haven't noticed the issue. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616