Bug 1516968 - tendrl-notifier doesn't send alerts for gluster native events for POSIX_HEALTH_CHECK_FAILED
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: web-admin-tendrl-notifier
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Shubhendu Tripathi
QA Contact: Martin Kudlej
Depends On: 1517468
Reported: 2017-11-23 17:21 UTC by Martin Bukatovic
Modified: 2017-12-18 04:37 UTC (History)

Fixed In Version: tendrl-gluster-integration-1.5.4-6.el7rhgs.noarch tendrl-notifier-1.5.4-5.el7rhgs.noarch
Last Closed: 2017-12-18 04:37:58 UTC

Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | https://github.com/Tendrl/gluster-integration/pull/510 | 0 | None | None | None | 2017-11-27 11:39:34 UTC
Red Hat Product Errata | RHEA-2017:3478 | 0 | normal | SHIPPED_LIVE | RHGS Web Administration packages | 2017-12-18 09:34:49 UTC

Description Martin Bukatovic 2017-11-23 17:21:41 UTC
Description of problem
======================

Tendrl notifier doesn't send alerts for POSIX_HEALTH_CHECK_FAILED
gluster native events.

Version-Release
===============

tendrl-notifier-1.5.4-3.el7rhgs.noarch

[root@usm1-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.5.4-1.el7rhgs.noarch
tendrl-api-1.5.4-2.el7rhgs.noarch
tendrl-api-httpd-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.5.4-4.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.3-2.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-5.el7rhgs.noarch
tendrl-node-agent-1.5.4-5.el7rhgs.noarch
tendrl-notifier-1.5.4-3.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch
tendrl-ui-1.5.4-4.el7rhgs.noarch

[root@usm1-gl1 ~]# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.3-2.el7rhgs.noarch
tendrl-commons-1.5.4-4.el7rhgs.noarch
tendrl-gluster-integration-1.5.4-4.el7rhgs.noarch
tendrl-node-agent-1.5.4-5.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch

Steps to Reproduce
==================

1. Install RHGS WA using tendrl-ansible
2. Configure alerting to send events via both smtp and snmp
3. Import gluster trusted storage pool with a volume
4. Watch for alerts/events shown in the UI and sent via smtp and snmp
   while you perform the following steps
5. On one storage node, locate all brick mount points,
   pick one and run a lazy unmount on it, e.g.:
   [root@usm1-gl1 ~]# umount -l /mnt/brick_VOLNAME_1
6. Check incoming tendrl alerts via:
   * RHGS WA UI
   * email
   * snmp trap message

When qe playbooks for alerting test setup are used:

* https://github.com/usmqe/usmqe-setup/blob/master/test_setup.smtp.yml
* https://github.com/usmqe/usmqe-setup/blob/master/test_setup.snmp.yml

one can check incoming snmp trap messages via:

# journalctl -u snmptrapd -fe

and email messages via:

# mutt

Actual results
==============

I haven't received any alert (via Tendrl UI, email, or snmp) for
this action.

Expected results
================

I receive an alert for the failed posix health check via all supported channels
(Tendrl UI, email and snmp).

Additional info
===============

Checking further on the machine where I did the lazy umount of a brick, I see
the following error in the event log:

```
[root@usm1-gl1 ~]# grep POSIX_HEALTH_CHECK_FAILED /var/log/glusterfs/events.log
[2017-11-23 11:09:01,087] WARNING [utils - 198:publish_to_webhook] - Event push failed to URL: http://0.0.0.0:8697/listen, Event: {"event": "POSIX_HEALTH_CHECK_FAILED", "message": {"brick": "usm1-gl1.example.com:/mnt/brick_gama_disperse_1/1", "error": "No such file or directory", "op": "open", "path": "/mnt/brick_gama_disperse_1/1/.glusterfs/health_check"}, "nodeid": "45d388d5-3979-4ee2-bdf6-e2a0fbf4ac7d", "ts": 1511453341}, Status Code: 500
```

Does it mean that RHGS WA failed to set up Gluster native eventing during
cluster import?
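
To see how many pushes failed and with which status code, the log line above can be picked apart programmatically. This is a minimal sketch, assuming every failed push is logged in exactly the format of the single line quoted above (other glusterfs versions may differ); the function names are hypothetical and not part of gluster or Tendrl.

```python
import json
import re

# Pattern modelled on the one events.log line quoted in this report.
# Assumption: other failed-push lines follow the same layout.
FAILED_PUSH = re.compile(
    r"Event push failed to URL: (?P<url>\S+), "
    r"Event: (?P<event>\{.*\}), "
    r"Status Code: (?P<status>\d+)"
)


def parse_failed_push(line):
    """Return (url, event_dict, status_code), or None if the line does not match."""
    match = FAILED_PUSH.search(line)
    if match is None:
        return None
    return match["url"], json.loads(match["event"]), int(match["status"])


def failed_pushes(path="/var/log/glusterfs/events.log"):
    """Yield one parsed tuple per failed webhook push found in the log file."""
    with open(path) as log:
        for line in log:
            parsed = parse_failed_push(line)
            if parsed is not None:
                yield parsed
```

Grouping the results of `failed_pushes()` by event name would show whether only POSIX_HEALTH_CHECK_FAILED is affected or every native event fails to push.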

Comment 2 Martin Bukatovic 2017-11-23 17:24:28 UTC
> Does it mean that RHGS WA failed to setup Gluster native eventing during
> cluster import?

Asking for evaluation. If that is the case, the severity of this BZ would be
higher as no gluster native events would be processed and forwarded by tendrl
notifier, not just this single one.

Comment 3 Nishanth Thomas 2017-11-24 12:58:29 UTC
Were these events raised by gluster?
Whatever events are raised by gluster will be logged in /var/log/glusterfs/events.log
Can you verify this and get us the required information?

Comment 4 Martin Bukatovic 2017-11-24 13:32:29 UTC
(In reply to Nishanth Thomas from comment #3)
> Were these events raised by gluster?
> Whatever events are raised by gluster will be logged in
> /var/log/glusterfs/events.log
> Can you verify this and get us the required information?

Yes, gluster tried to send the event, but it seems that the push failed.

The evidence you are asking for is listed in the description of this BZ. There
you can see that gluster indeed tried to send the posix health event, but it
seems that the push failed, which may be the root cause of this BZ.

In comment 2, I was asking you to recheck that.

This was tested based on the description in "List of Alerts and Notifications in
Tendrl"[1]: RHGS WA should be able to receive native gluster events, process
them and resend some of them (as described in the list) as alerts. The event in
question, "Posix health check failed for brick", is listed there.

[1] https://github.com/Tendrl/documentation/wiki/List-of-Alerts-and-Notifications-in-Tendrl
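
For illustration, the receiving side of a gluster webhook is an HTTP endpoint that accepts JSON POSTs. The following is a minimal stand-in listener, assuming the event body has the shape shown in the events.log line in the description. It is not Tendrl's listener; `EventHandler`, `make_server` and `RECEIVED` are hypothetical names, and the default port is simply the one that appears in this report.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

RECEIVED = []  # events seen so far, kept in memory for inspection


class EventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the announced number of bytes and decode them as JSON.
        length = int(self.headers.get("Content-Length", 0))
        try:
            event = json.loads(self.rfile.read(length))
        except ValueError:
            # Body was not valid JSON: reject it instead of crashing.
            self.send_response(400)
            self.end_headers()
            return
        RECEIVED.append(event)
        self.send_response(200)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def make_server(port=8697, host="0.0.0.0"):
    """Build (but do not start) a listener; call serve_forever() to run it."""
    return HTTPServer((host, port), EventHandler)
```

Running `make_server().serve_forever()` on a test machine and pointing a gluster webhook at it would show whether events leave gluster at all, independently of any Tendrl component.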

Comment 5 Martin Bukatovic 2017-11-25 17:27:00 UTC
Also see this output from gluster side:

```
# gluster-eventsapi status
Webhooks: 
http://0.0.0.0:8697/listen

+-------------------------------+-------------+-----------------------+
|     NODE                      | NODE STATUS | GLUSTEREVENTSD STATUS |
+-------------------------------+-------------+-----------------------+
| mbukatov-usm1-gl1.example.com |          UP |                    OK |
| mbukatov-usm1-gl2.example.com |          UP |                    OK |
| mbukatov-usm1-gl3.example.com |          UP |                    OK |
| mbukatov-usm1-gl4.example.com |          UP |                    OK |
| mbukatov-usm1-gl6.example.com |          UP |                    OK |
| localhost                     |          UP |                    OK |
+-------------------------------+-------------+-----------------------+
```

It seems that the webhook is misconfigured. Shouldn't I see some tendrl
component there, listening for these events?
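
When comparing this output across several clusters, it can help to extract the webhook list and the node table mechanically. A minimal sketch, assuming the output layout is exactly the one pasted above; `parse_eventsapi_status` is a hypothetical helper, not part of gluster.

```python
def parse_eventsapi_status(text):
    """Split `gluster-eventsapi status` output into (webhooks, node_rows)."""
    webhooks, nodes = [], []
    in_webhooks = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Webhooks:"):
            in_webhooks = True
        elif line.startswith("+"):
            in_webhooks = False  # table border ends the webhook list
        elif line.startswith("|"):
            in_webhooks = False
            cells = [cell.strip() for cell in line.strip("|").split("|")]
            if cells[0] != "NODE":  # skip the header row
                nodes.append(tuple(cells))
        elif in_webhooks and line:
            webhooks.append(line)
    return webhooks, nodes
```

Each node row comes back as a `(node, node_status, glustereventsd_status)` tuple, so a check such as "every node is UP and OK" becomes a one-line comprehension over the second return value.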

Comment 6 Martin Bukatovic 2017-11-25 17:52:59 UTC
Nishanth, I have created a separate BZ 1517468 to make my point clear.

The new BZ 1517468 is not a duplicate of this one, as it affects all native
events. When BZ 1517468 is fixed, we will need to retest the scenario in this
BZ again.

Comment 7 Nishanth Thomas 2017-11-27 07:13:42 UTC
@martin, are you saying you are not receiving any events mentioned in https://github.com/Tendrl/documentation/wiki/List-of-Alerts-and-Notifications-in-Tendrl ?

Comment 8 Shubhendu Tripathi 2017-11-27 11:39:34 UTC
https://github.com/Tendrl/gluster-integration/pull/510 fixes this issue for native event POSIX_HEALTH_CHECK_FAILED

Comment 9 Martin Bukatovic 2017-11-28 07:21:36 UTC
(In reply to Nishanth Thomas from comment #7)
> @martin, are you saying you are not receiving any events mentioned in
> https://github.com/Tendrl/documentation/wiki/List-of-Alerts-and-Notifications-in-Tendrl ?

So far, I haven't been able to receive any gluster native event listed in
the last section of the wiki page you linked. Ask Martin K. for the up-to-date
testing status.

My impression is that it could be caused by errors or misconfiguration of
sending messages from gluster to tendrl, as described in BZ 1517468. 

I have redrafted the test cases based on the wiki page you linked when it became
available, and I really appreciate this documentation effort upstream, because
without it I would not have been able to create a test case which covers gluster
native events, nor to file this BZ.

Comment 11 Martin Kudlej 2017-11-29 15:28:40 UTC
I verified this bug with:
tendrl-gluster-integration-1.5.4-6.el7rhgs.noarch
tendrl-notifier-1.5.4-5.el7rhgs.noarch
--> VERIFIED

Comment 13 errata-xmlrpc 2017-12-18 04:37:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3478

