Bug 1388862

Summary: [Eventing]: Events not seen when command is triggered from one of the peer nodes
Product: [Community] GlusterFS
Component: eventsapi
Version: mainline
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Reporter: Aravinda VK <avishwan>
Assignee: Aravinda VK <avishwan>
QA Contact:
Docs Contact:
CC: amukherj, avishwan, sanandpa, storage-qa-internal, vbellur
Target Milestone: ---
Target Release: ---
Whiteboard:
Fixed In Version: glusterfs-3.10.0
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1384316
: 1399482 (view as bug list)
Environment:
Last Closed: 2017-03-06 17:31:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1384316
Bug Blocks: 1399482

Description Aravinda VK 2016-10-26 10:27:37 UTC
+++ This bug was initially created as a clone of Bug #1384316 +++

Description of problem:
=======================
Have a 4-node cluster with eventing enabled. Log in to N3's console/terminal and create a distribute volume of 2 bricks residing on N1 and N2, and another 1x2 distribute-replicate volume, again residing on N1 and N2. Execute bitrot-related commands and monitor the events that are seen. The bitrot commands, when triggered from N1, N2, or N3, successfully generate an event; however, any command executed on N4 results in no events.
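
(For context: the events referred to here are plain JSON payloads that glustereventsd POSTs to the registered webhook URL, http://10.70.35.109:9000/listen in this setup. A minimal listener such as the hypothetical Python 2 sketch below, which is not the actual listener used in this setup, is enough to capture and verify them per node.)

# listen.py -- hypothetical webhook listener sketch, not the one used in this setup.
# Prints every event POSTed by glustereventsd so per-node event delivery can be verified.
import json
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer  # Python 2

class ListenHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.getheader('content-length', 0))
        body = self.rfile.read(length)
        try:
            print(json.loads(body))  # e.g. {u'event': u'BITROT_SCRUB_FREQ', ...}
        except ValueError:
            print(body)
        self.send_response(200)
        self.end_headers()

if __name__ == '__main__':
    HTTPServer(('0.0.0.0', 9000), ListenHandler).serve_forever()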


How reproducible:
================
Consistently; seen across both volumes in the present setup.


Steps to Reproduce:
===================
1. Have a 4-node cluster and enable eventing.
2. Log in to N3 and create a distribute volume 'dist' with B1 of N1 and B2 of N2. Create another 1x2 distribute-replicate volume 'distrep' with B2 of N1 and B2 of N2.
3. Enable bitrot and exercise the scrub options from the console of N1/N2/N3.
4. Log in to N4 and execute the same commands as in step 3 on either of the volumes 'dist' or 'distrep'.

Actual results:
==============
Events are seen as expected in step 3. NO events are seen in step 4.

Expected results:
=================
Events should be seen irrespective of the peer from which the command is executed.


Additional info:
================

----
N2 
----

[root@dhcp35-100 ~]# gluster v bitrot distrep scrub-throttle normal
volume bitrot: success
[root@dhcp35-100 ~]# gluster v bitrot dist scrub-frequency weekly
volume bitrot: success
[root@dhcp35-100 ~]#


{u'message': {u'name': u'distrep', u'value': u'normal'}, u'event': u'BITROT_SCRUB_THROTTLE', u'ts': 1476336724, u'nodeid': u'fcfacf2e-57fb-45ba-b1e1-e4ba640a4de5'}

{u'message': {u'name': u'dist', u'value': u'weekly'}, u'event': u'BITROT_SCRUB_FREQ', u'ts': 1476336747, u'nodeid': u'fcfacf2e-57fb-45ba-b1e1-e4ba640a4de5'}

======================================================================================================================


-----
N1
-----
[root@dhcp35-115 ~]# gluster v bitrot dist scrub pause
volume bitrot: success
[root@dhcp35-115 ~]# 

{u'message': {u'name': u'dist', u'value': u'pause'}, u'event': u'BITROT_SCRUB_OPTION', u'ts': 1476336842, u'nodeid': u'6ac165c0-317f-42ad-8262-953995171dbb'}

======================================================================================================================

-----
N3
-----
[root@dhcp35-101 ~]# gluster v bitrot dist scrub resume
volume bitrot: success
[root@dhcp35-101 ~]#

{u'message': {u'name': u'dist', u'value': u'resume'}, u'event': u'BITROT_SCRUB_OPTION', u'ts': 1476336858, u'nodeid': u'a3bd23b9-f70a-47f5-9c95-7a271f5f1e18'}

======================================================================================================================

----
N4
----
[root@dhcp35-104 ~]# gluster v bitrot dist scrub pause
volume bitrot: success
[root@dhcp35-104 ~]# gluster v bitrot distrep scrub pause
volume bitrot: success
[root@dhcp35-104 ~]#
[root@dhcp35-104 ~]# gluster v bitrot dist scrub resume
volume bitrot: success
[root@dhcp35-104 ~]# 
[root@dhcp35-104 ~]# gluster v bitrot dist scrub status

Volume name : dist

State of scrub: Active (Idle)

Scrub impact: aggressive

Scrub frequency: weekly

Bitrot error log location: /var/log/glusterfs/bitd.log

Scrubber error log location: /var/log/glusterfs/scrub.log


=========================================================

Node: dhcp35-115.lab.eng.blr.redhat.com

Number of Scrubbed files: 0

Number of Skipped files: 0

Last completed scrub time: 2016-10-13 05:32:47

Duration of last scrub (D:M:H:M:S): 0:0:0:0

Error count: 0


=========================================================

Node: 10.70.35.100

Number of Scrubbed files: 0

Number of Skipped files: 0

Last completed scrub time: 2016-10-13 05:32:45

Duration of last scrub (D:M:H:M:S): 0:0:0:0

Error count: 0

=========================================================

[root@dhcp35-104 ~]# 

<<<<<<<<<          No events seen         >>>>>>>>



[root@dhcp35-104 ~]# gluster-eventsapi status
Webhooks: 
http://10.70.35.109:9000/listen

+-----------------------------------+-------------+-----------------------+
|                NODE               | NODE STATUS | GLUSTEREVENTSD STATUS |
+-----------------------------------+-------------+-----------------------+
| dhcp35-115.lab.eng.blr.redhat.com |          UP |                    UP |
| dhcp35-101.lab.eng.blr.redhat.com |          UP |                    UP |
|            10.70.35.100           |          UP |                    UP |
|             localhost             |          UP |                    UP |
+-----------------------------------+-------------+-----------------------+
[root@dhcp35-104 ~]# gluster peer tsatus
unrecognized word: tsatus (position 1)
[root@dhcp35-104 ~]# gluster peer status
Number of Peers: 3

Hostname: dhcp35-115.lab.eng.blr.redhat.com
Uuid: 6ac165c0-317f-42ad-8262-953995171dbb
State: Peer in Cluster (Connected)

Hostname: dhcp35-101.lab.eng.blr.redhat.com
Uuid: a3bd23b9-f70a-47f5-9c95-7a271f5f1e18
State: Peer in Cluster (Connected)

Hostname: 10.70.35.100
Uuid: fcfacf2e-57fb-45ba-b1e1-e4ba640a4de5
State: Peer in Cluster (Connected)
[root@dhcp35-104 ~]# 
[root@dhcp35-104 ~]# 
[root@dhcp35-104 ~]#

--- Additional comment from Sweta Anandpara on 2016-10-13 02:55:44 EDT ---

Added the debuginfo package, and Atin figured out that the event IS actually being sent.

Did a glustereventsd reload on the affected node N4, and started receiving events. node-reload is one of the programs called when we do a webhook-add, which in turn does a glustereventsd reload. For some reason, when I did the webhook-add in this setup, the glustereventsd reload would have failed. That is just a hypothesis as of now.

Will create a new webhook and add it to this same setup. Will observe the success/failure/errors seen while doing so, and will update.

Until then, anyone seeing a similar issue can work around it by running 'service glustereventsd reload' on the impacted node, after which the cluster and its events should work as expected.

--- Additional comment from Sweta Anandpara on 2016-10-13 03:38:34 EDT ---

Deleted the said webhook and tried to add the same webhook back to the cluster. That did throw an exception, where it failed to run 'gluster system:: execute eventsapi.py node-reload'.

It fails on the same node N4 every time, and I am unable to figure out why. It works on all the other nodes of the cluster.

[root@dhcp35-101 yum.repos.d]# gluster-eventsapi webhook-del http://10.70.35.109:9000/listen
Traceback (most recent call last):
  File "/usr/sbin/gluster-eventsapi", line 459, in <module>
    runcli()
  File "/usr/lib/python2.6/site-packages/gluster/cliutils/cliutils.py", line 212, in runcli
    cls.run(args)
  File "/usr/sbin/gluster-eventsapi", line 274, in run
    sync_to_peers()
  File "/usr/sbin/gluster-eventsapi", line 129, in sync_to_peers
    out = execute_in_peers("node-reload")
  File "/usr/lib/python2.6/site-packages/gluster/cliutils/cliutils.py", line 125, in execute_in_peers
    raise GlusterCmdException((rc, out, err, " ".join(cmd)))
gluster.cliutils.cliutils.GlusterCmdException: (1, '', 'Commit failed on 10.70.35.104. Error: Unable to end. Error : Success\n', 'gluster system:: execute eventsapi.py node-reload')
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# gluster-eventsapi status
Webhooks: None

+-----------------------------------+-------------+-----------------------+
|                NODE               | NODE STATUS | GLUSTEREVENTSD STATUS |
+-----------------------------------+-------------+-----------------------+
|            10.70.35.100           |          UP |                    UP |
|            10.70.35.104           |          UP |                    UP |
| dhcp35-115.lab.eng.blr.redhat.com |          UP |                    UP |
|             localhost             |          UP |                    UP |
+-----------------------------------+-------------+-----------------------+
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# gluster-eventsapi webhook-test http://10.70.35.109:9000/listen
+-----------------------------------+-------------+----------------+
|                NODE               | NODE STATUS | WEBHOOK STATUS |
+-----------------------------------+-------------+----------------+
|            10.70.35.100           |          UP |             OK |
|            10.70.35.104           |          UP |             OK |
| dhcp35-115.lab.eng.blr.redhat.com |          UP |             OK |
|             localhost             |          UP |             OK |
+-----------------------------------+-------------+----------------+
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# gluster-eventsapi webhook-add http://10.70.35.109:9000/listen
Traceback (most recent call last):
  File "/usr/sbin/gluster-eventsapi", line 459, in <module>
    runcli()
  File "/usr/lib/python2.6/site-packages/gluster/cliutils/cliutils.py", line 212, in runcli
    cls.run(args)
  File "/usr/sbin/gluster-eventsapi", line 232, in run
    sync_to_peers()
  File "/usr/sbin/gluster-eventsapi", line 129, in sync_to_peers
    out = execute_in_peers("node-reload")
  File "/usr/lib/python2.6/site-packages/gluster/cliutils/cliutils.py", line 125, in execute_in_peers
    raise GlusterCmdException((rc, out, err, " ".join(cmd)))
gluster.cliutils.cliutils.GlusterCmdException: (1, '', 'Commit failed on 10.70.35.104. Error: Unable to end. Error : Success\n', 'gluster system:: execute eventsapi.py node-reload')
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# gluster-eventsapi status
Webhooks: 
http://10.70.35.109:9000/listen

+-----------------------------------+-------------+-----------------------+
|                NODE               | NODE STATUS | GLUSTEREVENTSD STATUS |
+-----------------------------------+-------------+-----------------------+
|            10.70.35.100           |          UP |                    UP |
|            10.70.35.104           |          UP |                    UP |
| dhcp35-115.lab.eng.blr.redhat.com |          UP |                    UP |
|             localhost             |          UP |                    UP |
+-----------------------------------+-------------+-----------------------+
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]#

--- Additional comment from Aravinda VK on 2016-10-17 08:36:14 EDT ---

This is similar to BZ 1379963. `glustereventsd` on one node is not reloaded, so it does not know about the newly added Webhook.

Comment 1 Worker Ant 2016-10-26 10:29:03 UTC
REVIEW: http://review.gluster.org/15731 (eventsapi: Auto reload Webhooks data when modified) posted (#1) for review on master by Aravinda VK (avishwan)

Comment 2 Worker Ant 2016-11-17 06:28:21 UTC
REVIEW: http://review.gluster.org/15731 (eventsapi: Auto reload Webhooks data when modified) posted (#2) for review on master by Aravinda VK (avishwan)

Comment 3 Worker Ant 2016-11-17 11:10:31 UTC
COMMIT: http://review.gluster.org/15731 committed in master by Aravinda VK (avishwan) 
------
commit b7ebffbda9ba784ccfae6d1a90766d5310cdaa15
Author: Aravinda VK <avishwan>
Date:   Wed Oct 26 15:51:17 2016 +0530

    eventsapi: Auto reload Webhooks data when modified
    
    glustereventsd depends on reload signal to reload the
    Webhooks configurations. But if reload signal missed, no
    events will be sent to newly added Webhook.
    
    Added auto reload based on webhooks file mtime. Before pushing
    events to Webhooks, reloads webhooks configurations if previously
    recorded mtime is different than current mtime.
    
    BUG: 1388862
    Change-Id: I83a41d6a52d8fa1d70e88294298f4a5c396d4158
    Signed-off-by: Aravinda VK <avishwan>
    Reviewed-on: http://review.gluster.org/15731
    Reviewed-by: Prashanth Pai <ppai>
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
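
    The fix, reduced to a sketch (a minimal illustration of the approach described in the
    commit message above; names such as WEBHOOKS_FILE, load_webhooks() and publish() are
    assumptions for illustration, not the actual glustereventsd code):

    # Auto reload the webhooks configuration based on file mtime, so that a missed
    # reload signal no longer means a peer silently drops events for a new webhook.
    import os
    import json

    WEBHOOKS_FILE = "/var/lib/glusterd/events/webhooks.json"  # assumed path

    _webhooks = {}
    _last_mtime = 0

    def load_webhooks():
        # (Re)load the webhook configuration from disk.
        global _webhooks
        with open(WEBHOOKS_FILE) as f:
            _webhooks = json.load(f)

    def auto_reload_if_modified():
        # Compare the file's current mtime with the previously recorded one and
        # reload only when it has changed; called before every event push.
        global _last_mtime
        mtime = os.path.getmtime(WEBHOOKS_FILE)
        if mtime != _last_mtime:
            load_webhooks()
            _last_mtime = mtime

    def publish(event):
        # Before pushing an event to the webhooks, pick up any configuration
        # change, even if the explicit reload signal never reached this node.
        auto_reload_if_modified()
        for url in _webhooks:
            pass  # POST the JSON-encoded event to each registered webhook URL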

Comment 4 Shyamsundar 2017-03-06 17:31:48 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.0, please open a new bug report.

glusterfs-3.10.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-February/030119.html
[2] https://www.gluster.org/pipermail/gluster-users/