Bug 1516968

Summary: tendrl-notifier doesn't send alerts for gluster native events for POSIX_HEALTH_CHECK_FAILED
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Martin Bukatovic <mbukatov>
Component: web-admin-tendrl-notifier
Assignee: Shubhendu Tripathi <shtripat>
Status: CLOSED ERRATA
QA Contact: Martin Kudlej <mkudlej>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: rhgs-3.3
CC: mbukatov, mkudlej, nthomas, rhs-bugs, sanandpa, sankarshan, shtripat
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: tendrl-gluster-integration-1.5.4-6.el7rhgs.noarch tendrl-notifier-1.5.4-5.el7rhgs.noarch
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-12-18 04:37:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1517468
Bug Blocks:

Description Martin Bukatovic 2017-11-23 17:21:41 UTC
Description of problem
======================

Tendrl notifier doesn't send alerts for POSIX_HEALTH_CHECK_FAILED
gluster native events.

Version-Release
===============

tendrl-notifier-1.5.4-3.el7rhgs.noarch

[root@usm1-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.5.4-1.el7rhgs.noarch
tendrl-api-1.5.4-2.el7rhgs.noarch
tendrl-api-httpd-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.5.4-4.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.3-2.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-5.el7rhgs.noarch
tendrl-node-agent-1.5.4-5.el7rhgs.noarch
tendrl-notifier-1.5.4-3.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch
tendrl-ui-1.5.4-4.el7rhgs.noarch

[root@usm1-gl1 ~]# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.3-2.el7rhgs.noarch
tendrl-commons-1.5.4-4.el7rhgs.noarch
tendrl-gluster-integration-1.5.4-4.el7rhgs.noarch
tendrl-node-agent-1.5.4-5.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch

Steps to Reproduce
==================

1. Install RHGS WA using tendrl-ansible
2. Configure alerting to send events via both smtp and snmp
3. Import gluster trusted storage pool with a volume
4. Watch for alerts/events shown in the UI and sent via smtp and snmp
   while you perform the following steps
5. On one storage node, locate all brick mount points,
   pick one and run a lazy unmount on it, e.g.:
   [root@usm1-gl1 ~]# umount -l /mnt/brick_VOLNAME_1
6. Check incoming tendrl alerts via:
   * RHGS WA UI
   * email
   * snmp trap message

When the QE playbooks for the alerting test setup are used:

* https://github.com/usmqe/usmqe-setup/blob/master/test_setup.smtp.yml
* https://github.com/usmqe/usmqe-setup/blob/master/test_setup.snmp.yml

one can check incoming snmp trap messages via:

# journalctl -u snmptrapd -fe

and email messages via:

# mutt

Actual results
==============

I haven't received any alert (via Tendrl UI, email, or snmp) for
this action.

Expected results
================

I receive an alert for the failed posix health check via all supported channels
(Tendrl UI, email and snmp).

Additional info
===============

Checking further on the machine where I did the lazy umount of a brick, I see
the following error in the event log:

```
[root@usm1-gl1 ~]# grep POSIX_HEALTH_CHECK_FAILED /var/log/glusterfs/events.log                                                              
[2017-11-23 11:09:01,087] WARNING [utils - 198:publish_to_webhook] - Event push failed to URL: http://0.0.0.0:8697/listen, Event: {"event": "POSIX_HEALTH_CHECK_FAILED", "message": {"brick": "usm1-gl1.example.com:/mnt/brick_gama_disperse_1/1", "error": "No such file or directory", "op": "open", "path": "/mnt/brick_gama_disperse_1/1/.glusterfs/health_check"}, "nodeid": "45d388d5-3979-4ee2-bdf6-e2a0fbf4ac7d", "ts": 1511453341}, Status Code: 500
```

Does it mean that RHGS WA failed to set up Gluster native eventing during
cluster import?
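
To help evaluate that question, the warning above can be taken apart mechanically. What follows is a minimal Python sketch, not code from gluster or tendrl: the name `parse_failed_push` and the regular expression are assumptions modelled only on the single log line quoted above. Since an HTTP 5xx status means the request reached a listener which then answered with a server error, the sketch reports that case separately.

```python
import json
import re

# Pattern modelled on the one events.log line quoted in this report
# (an assumption, not an official description of the log format).
FAILED_PUSH = re.compile(
    r"Event push failed to URL: (?P<url>\S+), "
    r"Event: (?P<event>\{.*\}), "
    r"Status Code: (?P<status>\d+)"
)


def parse_failed_push(line):
    """Return details of a failed webhook push, or None if the line differs."""
    match = FAILED_PUSH.search(line)
    if match is None:
        return None
    event = json.loads(match.group("event"))
    status = int(match.group("status"))
    return {
        "url": match.group("url"),
        "event": event["event"],
        "brick": event.get("message", {}).get("brick"),
        "status": status,
        # HTTP 5xx: a listener answered, but failed to handle the request.
        "listener_reached": 500 <= status <= 599,
    }


# The warning line from the description, copied verbatim.
SAMPLE = (
    '[2017-11-23 11:09:01,087] WARNING [utils - 198:publish_to_webhook] - '
    'Event push failed to URL: http://0.0.0.0:8697/listen, Event: '
    '{"event": "POSIX_HEALTH_CHECK_FAILED", "message": {"brick": '
    '"usm1-gl1.example.com:/mnt/brick_gama_disperse_1/1", "error": '
    '"No such file or directory", "op": "open", "path": '
    '"/mnt/brick_gama_disperse_1/1/.glusterfs/health_check"}, "nodeid": '
    '"45d388d5-3979-4ee2-bdf6-e2a0fbf4ac7d", "ts": 1511453341}, '
    'Status Code: 500'
)

result = parse_failed_push(SAMPLE)
print(result)
```

For the line above, `listener_reached` comes out true, which only says that something answered on that URL with a server error; it does not say which component answered or why it failed.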

Comment 2 Martin Bukatovic 2017-11-23 17:24:28 UTC
> Does it mean that RHGS WA failed to setup Gluster native eventing during
> cluster import?

Asking for evaluation. If that is the case, the severity of this BZ would be
higher as no gluster native events would be processed and forwarded by tendrl
notifier, not just this single one.

Comment 3 Nishanth Thomas 2017-11-24 12:58:29 UTC
Were these events raised by gluster?
Whatever events are raised by gluster will be logged at /var/log/glusterfs/events.log
Can you verify this and get us the required information?

Comment 4 Martin Bukatovic 2017-11-24 13:32:29 UTC
(In reply to Nishanth Thomas from comment #3)
> Were these events raised by gluster?
> Whatever events are raised by gluster will be logged at
> /var/log/glusterfs/events.log
> Can you verify this and get us the required information?

Yes, gluster tried to send the event, but it seems that the push failed.

The evidence you are asking for is listed in the description of this BZ. There
you can see that gluster indeed tried to send the posix health event, but it
seems that the push failed, which may be the root cause of this BZ.

In comment 2, I was asking you to recheck that.

This was tested based on the description in "List of Alerts and Notifications in
Tendrl"[1]: RHGS WA should be able to receive native gluster events, process
them and resend some of them (as described in the list) as alerts. The event in
question, "Posix health check failed for brick", is listed there.

[1] https://github.com/Tendrl/documentation/wiki/List-of-Alerts-and-Notifications-in-Tendrl

Comment 5 Martin Bukatovic 2017-11-25 17:27:00 UTC
Also see this output from gluster side:

```
# gluster-eventsapi status
Webhooks: 
http://0.0.0.0:8697/listen

+-------------------------------+-------------+-----------------------+
|     NODE                      | NODE STATUS | GLUSTEREVENTSD STATUS |
+-------------------------------+-------------+-----------------------+
| mbukatov-usm1-gl1.example.com |          UP |                    OK |
| mbukatov-usm1-gl2.example.com |          UP |                    OK |
| mbukatov-usm1-gl3.example.com |          UP |                    OK |
| mbukatov-usm1-gl4.example.com |          UP |                    OK |
| mbukatov-usm1-gl6.example.com |          UP |                    OK |
| localhost                     |          UP |                    OK |
+-------------------------------+-------------+-----------------------+
```

It seems that the webhook is misconfigured. Shouldn't I see some tendrl
component there, listening for these events?
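
For repeated checks across nodes, the status report above can be read programmatically. This is a rough Python sketch under stated assumptions: the helper names `parse_eventsapi_status` and `uses_unspecified_address` are hypothetical, and the parsing is based only on the layout of the output pasted in this comment, not on any documented format. The second helper merely flags a webhook whose host is the unspecified address (such as 0.0.0.0) so a human can look at it; whether that address is actually wrong depends on where the tendrl listener runs.

```python
import ipaddress
from urllib.parse import urlsplit


def parse_eventsapi_status(text):
    """Split the status report into webhook URLs and per-node table rows."""
    webhooks, nodes = [], []
    in_webhooks = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Webhooks:"):
            in_webhooks = True
        elif line.startswith("+"):
            # Table border: the webhook list is over.
            in_webhooks = False
        elif line.startswith("|"):
            in_webhooks = False
            cells = [cell.strip() for cell in line.strip("|").split("|")]
            # Skip the header row, keep the three-column data rows.
            if len(cells) == 3 and cells[0] != "NODE":
                nodes.append(
                    {"node": cells[0], "status": cells[1], "eventsd": cells[2]}
                )
        elif in_webhooks and line:
            webhooks.append(line)
    return webhooks, nodes


def uses_unspecified_address(url):
    """True if the URL host is an unspecified IP address such as 0.0.0.0."""
    host = urlsplit(url).hostname
    try:
        return ipaddress.ip_address(host).is_unspecified
    except (TypeError, ValueError):
        # Host is missing or is a name rather than an IP address.
        return False
```

Feeding it the output above yields one webhook and six node rows, and `uses_unspecified_address` returns true for that webhook.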

Comment 6 Martin Bukatovic 2017-11-25 17:52:59 UTC
Nishanth, I have created a separate BZ 1517468 to make my point clear.

The new BZ 1517468 is not a duplicate of this one, as it affects all native
events. When BZ 1517468 is fixed, we will need to retest the scenario in this
BZ again.

Comment 7 Nishanth Thomas 2017-11-27 07:13:42 UTC
@martin, are you saying you are not receiving any events mentioned in https://github.com/Tendrl/documentation/wiki/List-of-Alerts-and-Notifications-in-Tendrl ?

Comment 8 Shubhendu Tripathi 2017-11-27 11:39:34 UTC
https://github.com/Tendrl/gluster-integration/pull/510 fixes this issue for native event POSIX_HEALTH_CHECK_FAILED

Comment 9 Martin Bukatovic 2017-11-28 07:21:36 UTC
(In reply to Nishanth Thomas from comment #7)
> @martin, are you saying you are not receiving any events mentioned in
> https://github.com/Tendrl/documentation/wiki/List-of-Alerts-and-Notifications-in-Tendrl ?

So far, I wasn't able to receive any gluster native event listed in the
last section of the wiki page you linked. Ask Martin K. for the up-to-date
testing status.

My impression is that it could be caused by errors or misconfiguration of
sending messages from gluster to tendrl, as described in BZ 1517468.

I redrafted the test cases based on the wiki page you linked when it became
available, and I really appreciate this documentation effort upstream, because
without it I wouldn't have been able to create test cases covering gluster
native events, nor would I have been able to file this BZ.

Comment 11 Martin Kudlej 2017-11-29 15:28:40 UTC
I verified this bug with:
tendrl-gluster-integration-1.5.4-6.el7rhgs.noarch
tendrl-notifier-1.5.4-5.el7rhgs.noarch
--> VERIFIED

Comment 13 errata-xmlrpc 2017-12-18 04:37:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3478