Description of problem
======================

Tendrl notifier doesn't send alerts for POSIX_HEALTH_CHECK_FAILED gluster
native events.

Version-Release
===============

tendrl-notifier-1.5.4-3.el7rhgs.noarch

[root@usm1-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.5.4-1.el7rhgs.noarch
tendrl-api-1.5.4-2.el7rhgs.noarch
tendrl-api-httpd-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.5.4-4.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.3-2.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-5.el7rhgs.noarch
tendrl-node-agent-1.5.4-5.el7rhgs.noarch
tendrl-notifier-1.5.4-3.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch
tendrl-ui-1.5.4-4.el7rhgs.noarch

[root@usm1-gl1 ~]# rpm -qa | grep tendrl | sort
tendrl-collectd-selinux-1.5.3-2.el7rhgs.noarch
tendrl-commons-1.5.4-4.el7rhgs.noarch
tendrl-gluster-integration-1.5.4-4.el7rhgs.noarch
tendrl-node-agent-1.5.4-5.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch

Steps to Reproduce
==================

1. Install RHGS WA using tendrl-ansible
2. Configure alerting to send events via both smtp and snmp
3. Import gluster trusted storage pool with a volume
4. Watch for alerts/events shown in the UI and sent via smtp and snmp while
   you perform the following steps
5. On one storage node, locate all brick mount points, pick one and run lazy
   unmount on it, e.g.:

   [root@usm1-gl1 ~]# umount -l /mnt/brick_VOLNAME_1

6. Check incoming tendrl alerts via:

   * RHGS WA UI
   * email
   * snmp trap message

When qe playbooks for alerting test setup are used:

* https://github.com/usmqe/usmqe-setup/blob/master/test_setup.smtp.yml
* https://github.com/usmqe/usmqe-setup/blob/master/test_setup.snmp.yml

one can check incoming snmp trap messages via:

# journalctl -u snmptrapd -fe

and email messages via:

# mutt

Actual results
==============

I haven't received any alert (via Tendrl UI, email, or snmp) for this action.

Expected results
================

I receive an alert for the failed posix health check via all supported
channels (Tendrl UI, email and snmp).

Additional info
===============

Checking further on the machine where I did the lazy umount of a brick, I see
the following error in the event log:

```
[root@usm1-gl1 ~]# grep POSIX_HEALTH_CHECK_FAILED /var/log/glusterfs/events.log
[2017-11-23 11:09:01,087] WARNING [utils - 198:publish_to_webhook] - Event push failed to URL: http://0.0.0.0:8697/listen, Event: {"event": "POSIX_HEALTH_CHECK_FAILED", "message": {"brick": "usm1-gl1.example.com:/mnt/brick_gama_disperse_1/1", "error": "No such file or directory", "op": "open", "path": "/mnt/brick_gama_disperse_1/1/.glusterfs/health_check"}, "nodeid": "45d388d5-3979-4ee2-bdf6-e2a0fbf4ac7d", "ts": 1511453341}, Status Code: 500
```

Does it mean that RHGS WA failed to set up Gluster native eventing during
cluster import?
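
For triage, the failed-push warning can be picked apart programmatically. The following is only a sketch: the regular expression is modelled on the single log line quoted in this report, the helper name `parse_push_failure` is made up for illustration, and other glustereventsd versions may format the message differently.

```python
import json
import re

# Pattern modelled on the one "Event push failed" line quoted in this report.
# This is an assumption about the log format, not a documented contract.
PUSH_FAILED = re.compile(
    r"Event push failed to URL: (?P<url>\S+), "
    r"Event: (?P<event>\{.*\}), "
    r"Status Code: (?P<status>\d+)"
)


def parse_push_failure(line):
    """Return (url, event_dict, status_code), or None if the line does not match."""
    match = PUSH_FAILED.search(line)
    if match is None:
        return None
    event = json.loads(match.group("event"))
    return match.group("url"), event, int(match.group("status"))


# The log line from /var/log/glusterfs/events.log as quoted in this report.
sample = (
    '[2017-11-23 11:09:01,087] WARNING [utils - 198:publish_to_webhook] - '
    'Event push failed to URL: http://0.0.0.0:8697/listen, '
    'Event: {"event": "POSIX_HEALTH_CHECK_FAILED", "message": '
    '{"brick": "usm1-gl1.example.com:/mnt/brick_gama_disperse_1/1", '
    '"error": "No such file or directory", "op": "open", '
    '"path": "/mnt/brick_gama_disperse_1/1/.glusterfs/health_check"}, '
    '"nodeid": "45d388d5-3979-4ee2-bdf6-e2a0fbf4ac7d", "ts": 1511453341}, '
    'Status Code: 500'
)

if __name__ == "__main__":
    url, event, status = parse_push_failure(sample)
    print(url, event["event"], event["message"]["brick"], status)
```

Running such a helper over the whole events.log would show whether only POSIX_HEALTH_CHECK_FAILED pushes fail or whether every event pushed to the webhook is rejected.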
> Does it mean that RHGS WA failed to setup Gluster native eventing during
> cluster import?

Asking for evaluation. If that is the case, the severity of this BZ would be
higher, as no gluster native events would be processed and forwarded by tendrl
notifier, not just this single one.
Were these events raised by gluster?
Whatever events are raised by gluster will be logged at
/var/log/glusterfs/events.log.
Can you verify this and get us the required information?
(In reply to Nishanth Thomas from comment #3)
> Did this events are raised by gluster?
> Whatever events are raised by gluster will be logged at
> /var/log/glusterfs/events.log
> Can you verify this and get us the required information

Yes, gluster tried to send the event, but it seems that the push failed.

The evidence you are asking for is listed in the description of this BZ. The
events.log excerpt there shows that gluster did try to send the posix health
check event and that the push failed, which may be the root cause of this BZ.
In comment 2, I was asking you to recheck that.

This was tested based on the description in "List of Alerts and Notifications
in Tendrl"[1], according to which RHGS WA should be able to receive native
gluster events, process them and resend some of them (as described in the
list) as alerts. The event in question, "Posix health check failed for brick",
is listed there.

[1] https://github.com/Tendrl/documentation/wiki/List-of-Alerts-and-Notifications-in-Tendrl
Also see this output from the gluster side:

```
# gluster-eventsapi status
Webhooks:
http://0.0.0.0:8697/listen

+-------------------------------+-------------+-----------------------+
| NODE                          | NODE STATUS | GLUSTEREVENTSD STATUS |
+-------------------------------+-------------+-----------------------+
| mbukatov-usm1-gl1.example.com | UP          | OK                    |
| mbukatov-usm1-gl2.example.com | UP          | OK                    |
| mbukatov-usm1-gl3.example.com | UP          | OK                    |
| mbukatov-usm1-gl4.example.com | UP          | OK                    |
| mbukatov-usm1-gl6.example.com | UP          | OK                    |
| localhost                     | UP          | OK                    |
+-------------------------------+-------------+-----------------------+
```

It seems that the webhook is misconfigured. Shouldn't I see some tendrl
component there, listening for these events?
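
The suspicion about the registered webhook can be expressed as a small check. 0.0.0.0 is the IPv4 unspecified address, which is normally used as a bind address rather than as a destination, so a webhook registered with it is worth flagging. This is a sketch only: the function name `webhook_host_is_wildcard` and the hostname in the second example are hypothetical, and the check says nothing about what Tendrl is supposed to register.

```python
import ipaddress
from urllib.parse import urlsplit


def webhook_host_is_wildcard(url):
    """Return True if the URL's host is an unspecified address (0.0.0.0 or ::)."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        return ipaddress.ip_address(host).is_unspecified
    except ValueError:
        # The host is a DNS name rather than an IP literal.
        return False


if __name__ == "__main__":
    # Webhook URL as reported by `gluster-eventsapi status` in this BZ.
    print(webhook_host_is_wildcard("http://0.0.0.0:8697/listen"))
    # Hypothetical URL pointing at a named host, for comparison only.
    print(webhook_host_is_wildcard("http://usm1-server.example.com:8697/listen"))
```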
Nishanth, I have created a separate BZ 1517468 to make my point clear.

The new BZ 1517468 is not a duplicate of this one, as it affects all native
events. When BZ 1517468 is fixed, we will need to retest the scenario in this
BZ again.
@martin, are you saying you are not receiving any events mentioned in https://github.com/Tendrl/documentation/wiki/List-of-Alerts-and-Notifications-in-Tendrl ?
https://github.com/Tendrl/gluster-integration/pull/510 fixes this issue for native event POSIX_HEALTH_CHECK_FAILED
(In reply to Nishanth Thomas from comment #7)
> @martin, are you saying you are not receiving any events mentioned in
> https://github.com/Tendrl/documentation/wiki/List-of-Alerts-and-Notifications-in-Tendrl ?

So far, I wasn't able to receive any gluster native event listed in the last
section of the wiki page you linked. Ask Martin K. for up-to-date testing
status.

My impression is that it could be caused by errors or misconfiguration of
sending messages from gluster to tendrl, as described in BZ 1517468.

I redrafted the test cases based on the wiki page you linked when it became
available, and I really appreciate this documentation effort upstream: without
it I wouldn't have been able to create a test case which covers gluster native
events, nor would I have been able to file this BZ.
I verified this bug with:

tendrl-gluster-integration-1.5.4-6.el7rhgs.noarch
tendrl-notifier-1.5.4-5.el7rhgs.noarch

--> VERIFIED
Since the problem described in this bug report should be resolved in a recent
advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow
the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3478