Bug 1384316

Summary:	[Eventing]: Events not seen when command is triggered from one of the peer nodes
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Sweta Anandpara <sanandpa>
Component:	glusterfs	Assignee:	Aravinda VK <avishwan>
Status:	CLOSED ERRATA	QA Contact:	Sweta Anandpara <sanandpa>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	rhgs-3.2	CC:	amukherj, avishwan, rhinduja, vbellur
Target Milestone:	---
Target Release:	RHGS 3.2.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	glusterfs-3.8.4-6	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1388862		Environment:
Last Closed:	2017-03-23 06:09:39 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1351528, 1388862, 1399482

Description Sweta Anandpara 2016-10-13 05:55:40 UTC

Description of problem:
=======================
Have a 4-node cluster with eventing enabled. Log in to N3's console/terminal and create a distribute volume of 2 bricks residing on N1 and N2, and another distribute-replicate volume (1*2), again residing on N1 and N2. Execute bitrot-related commands and monitor the events that are seen. Bitrot commands triggered from N1, N2, or N3 successfully generate an event; however, any command executed on N4 results in no events.


Version-Release number of selected component (if applicable):
============================================================
3.8.4-2

How reproducible:
================
Seeing it across 2 volumes in the present setup


Steps to Reproduce:
===================
1. Have a 4-node cluster and enable eventing.
2. Log in to N3, create 'dist' with B1 of N1 and B2 of N2. Create another volume 'distrep' 1*2, with B2 of N1 and B2 of N2.
3. Enable bitrot and play around with the scrub options from the console of N1/N2/N3.
4. Log in to N4 and execute the same commands as in step 3 on either of the volumes 'dist' or 'distrep'.
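
For anyone reproducing this, a webhook listener along the following lines can be used to watch the events. This is only a minimal sketch, not the listener used in this report. It assumes glustereventsd delivers each event as a JSON body in an HTTP POST, and it reuses the /listen path and port 9000 from the webhook URL registered in this setup (http://10.70.35.109:9000/listen). The names summarize and ListenHandler are made up for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(event):
    """Return a one-line summary of a decoded event payload (a dict)."""
    return "{ts} {nodeid} {event} {message}".format(
        ts=event.get("ts"),
        nodeid=event.get("nodeid"),
        event=event.get("event"),
        message=json.dumps(event.get("message"), sort_keys=True),
    )


class ListenHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: prints one line per event POSTed to /listen."""

    def do_POST(self):
        if self.path != "/listen":
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            # Assumption: the body is a JSON object with the keys
            # ts, nodeid, event and message, as in the samples in this report.
            print(summarize(json.loads(body.decode("utf-8"))))
            status = 200
        except (ValueError, AttributeError):
            # Not valid UTF-8/JSON, or not a JSON object.
            status = 400
        self.send_response(status)
        self.end_headers()


if __name__ == "__main__":
    # Port 9000 matches the webhook URL used in this setup.
    HTTPServer(("0.0.0.0", 9000), ListenHandler).serve_forever()
```

Printing the nodeid on every line makes it easy to see which peers are, and are not, producing events.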

Actual results:
==============
Events seen as expected in step 3. NO events seen in step 4.

Expected results:
=================
Events should be seen irrespective of the peer from which the command is executed.


Additional info:
================

----
N2 
----

[root@dhcp35-100 ~]# gluster v bitrot distrep scrub-throttle normal
volume bitrot: success
[root@dhcp35-100 ~]# gluster v bitrot dist scrub-frequency weekly
volume bitrot: success
[root@dhcp35-100 ~]#


{u'message': {u'name': u'distrep', u'value': u'normal'}, u'event': u'BITROT_SCRUB_THROTTLE', u'ts': 1476336724, u'nodeid': u'fcfacf2e-57fb-45ba-b1e1-e4ba640a4de5'}

{u'message': {u'name': u'dist', u'value': u'weekly'}, u'event': u'BITROT_SCRUB_FREQ', u'ts': 1476336747, u'nodeid': u'fcfacf2e-57fb-45ba-b1e1-e4ba640a4de5'}

======================================================================================================================


-----
N1
-----
[root@dhcp35-115 ~]# gluster v bitrot dist scrub pause
volume bitrot: success
[root@dhcp35-115 ~]# 

{u'message': {u'name': u'dist', u'value': u'pause'}, u'event': u'BITROT_SCRUB_OPTION', u'ts': 1476336842, u'nodeid': u'6ac165c0-317f-42ad-8262-953995171dbb'}

======================================================================================================================

-----
N3
-----
[root@dhcp35-101 ~]# gluster v bitrot dist scrub resume
volume bitrot: success
[root@dhcp35-101 ~]#

{u'message': {u'name': u'dist', u'value': u'resume'}, u'event': u'BITROT_SCRUB_OPTION', u'ts': 1476336858, u'nodeid': u'a3bd23b9-f70a-47f5-9c95-7a271f5f1e18'}

======================================================================================================================

----
N4
----
[root@dhcp35-104 ~]# gluster v bitrot dist scrub pause
volume bitrot: success
[root@dhcp35-104 ~]# gluster v bitrot distrep scrub pause
volume bitrot: success
[root@dhcp35-104 ~]#
[root@dhcp35-104 ~]# gluster v bitrot dist scrub resume
volume bitrot: success
[root@dhcp35-104 ~]# 
[root@dhcp35-104 ~]# gluster v bitrot dist scrub status

Volume name : dist

State of scrub: Active (Idle)

Scrub impact: aggressive

Scrub frequency: weekly

Bitrot error log location: /var/log/glusterfs/bitd.log

Scrubber error log location: /var/log/glusterfs/scrub.log


=========================================================

Node: dhcp35-115.lab.eng.blr.redhat.com

Number of Scrubbed files: 0

Number of Skipped files: 0

Last completed scrub time: 2016-10-13 05:32:47

Duration of last scrub (D:M:H:M:S): 0:0:0:0

Error count: 0


=========================================================

Node: 10.70.35.100

Number of Scrubbed files: 0

Number of Skipped files: 0

Last completed scrub time: 2016-10-13 05:32:45

Duration of last scrub (D:M:H:M:S): 0:0:0:0

Error count: 0

=========================================================

[root@dhcp35-104 ~]# 

<<<<<<<<<          No events seen         >>>>>>>>



[root@dhcp35-104 ~]# gluster-eventsapi status
Webhooks: 
http://10.70.35.109:9000/listen

+-----------------------------------+-------------+-----------------------+
|                NODE               | NODE STATUS | GLUSTEREVENTSD STATUS |
+-----------------------------------+-------------+-----------------------+
| dhcp35-115.lab.eng.blr.redhat.com |          UP |                    UP |
| dhcp35-101.lab.eng.blr.redhat.com |          UP |                    UP |
|            10.70.35.100           |          UP |                    UP |
|             localhost             |          UP |                    UP |
+-----------------------------------+-------------+-----------------------+
[root@dhcp35-104 ~]# gluster peer tsatus
unrecognized word: tsatus (position 1)
[root@dhcp35-104 ~]# gluster peer status
Number of Peers: 3

Hostname: dhcp35-115.lab.eng.blr.redhat.com
Uuid: 6ac165c0-317f-42ad-8262-953995171dbb
State: Peer in Cluster (Connected)

Hostname: dhcp35-101.lab.eng.blr.redhat.com
Uuid: a3bd23b9-f70a-47f5-9c95-7a271f5f1e18
State: Peer in Cluster (Connected)

Hostname: 10.70.35.100
Uuid: fcfacf2e-57fb-45ba-b1e1-e4ba640a4de5
State: Peer in Cluster (Connected)
[root@dhcp35-104 ~]# 
[root@dhcp35-104 ~]# 
[root@dhcp35-104 ~]# rpm -qa | grep gluster
glusterfs-server-3.8.4-2.el6rhs.x86_64
python-gluster-3.8.4-2.el6rhs.noarch
glusterfs-events-3.8.4-2.el6rhs.x86_64
glusterfs-3.8.4-2.el6rhs.x86_64
glusterfs-fuse-3.8.4-2.el6rhs.x86_64
glusterfs-ganesha-3.8.4-2.el6rhs.x86_64
gluster-nagios-common-0.2.4-1.el6rhs.noarch
glusterfs-debuginfo-3.8.4-1.el6rhs.x86_64
glusterfs-client-xlators-3.8.4-2.el6rhs.x86_64
glusterfs-cli-3.8.4-2.el6rhs.x86_64
glusterfs-geo-replication-3.8.4-2.el6rhs.x86_64
gluster-nagios-addons-0.2.8-1.el6rhs.x86_64
vdsm-gluster-4.16.30-1.5.el6rhs.noarch
glusterfs-api-3.8.4-2.el6rhs.x86_64
glusterfs-api-devel-3.8.4-2.el6rhs.x86_64
nfs-ganesha-gluster-2.3.1-8.el6rhs.x86_64
glusterfs-libs-3.8.4-2.el6rhs.x86_64
glusterfs-devel-3.8.4-2.el6rhs.x86_64
glusterfs-rdma-3.8.4-2.el6rhs.x86_64
[root@dhcp35-104 ~]#
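
The event payloads and peer UUIDs above are enough to confirm mechanically which node is silent. The following is a small sketch, assuming the nodeid field carries the UUID of the node on which the command was run (this matches the N1/N2/N3 samples above). N4's own UUID does not appear in this report, so the key used for it is a placeholder, and silent_nodes is a name made up for illustration.

```python
# Peer UUIDs as printed by `gluster peer status` on N4.
# N4's own UUID is not shown in the report; the key below is a placeholder.
PEERS = {
    "6ac165c0-317f-42ad-8262-953995171dbb": "N1 (dhcp35-115)",
    "fcfacf2e-57fb-45ba-b1e1-e4ba640a4de5": "N2 (dhcp35-100)",
    "a3bd23b9-f70a-47f5-9c95-7a271f5f1e18": "N3 (dhcp35-101)",
    "<uuid-of-N4>": "N4 (dhcp35-104)",
}

# The four events received on the webhook, trimmed to the relevant fields.
EVENTS = [
    {"event": "BITROT_SCRUB_THROTTLE", "ts": 1476336724,
     "nodeid": "fcfacf2e-57fb-45ba-b1e1-e4ba640a4de5"},
    {"event": "BITROT_SCRUB_FREQ", "ts": 1476336747,
     "nodeid": "fcfacf2e-57fb-45ba-b1e1-e4ba640a4de5"},
    {"event": "BITROT_SCRUB_OPTION", "ts": 1476336842,
     "nodeid": "6ac165c0-317f-42ad-8262-953995171dbb"},
    {"event": "BITROT_SCRUB_OPTION", "ts": 1476336858,
     "nodeid": "a3bd23b9-f70a-47f5-9c95-7a271f5f1e18"},
]


def silent_nodes(peers, events):
    """Return the names of peers whose UUID never appears as an event nodeid."""
    seen = {event["nodeid"] for event in events}
    return sorted(name for uuid, name in peers.items() if uuid not in seen)


if __name__ == "__main__":
    print(silent_nodes(PEERS, EVENTS))
```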

Comment 2 Sweta Anandpara 2016-10-13 06:55:44 UTC

Added debuginfo package, and Atin figured out that the event IS actually being sent. 

Did a glustereventsd reload on the affected node N4, and started receiving events. node-reload is one of the programs called when we do a webhook-add, and it in turn does a glustereventsd reload. For some reason, when I did a webhook-add in this setup, the glustereventsd reload presumably failed. Just a hypothesis as of now.

Will create a new webhook and add it in this same setup. Will observe the success/failure/errors seen while doing so, and will update. 

Until then, anyone seeing a similar issue can work around it by running 'service glustereventsd reload' on the impacted node, after which the cluster and its events should work as expected.

Comment 3 Sweta Anandpara 2016-10-13 07:38:34 UTC

Deleted the said webhook and tried to add the same webhook to the cluster again. That raised an exception: it failed to run 'gluster system:: execute eventsapi.py node-reload'.

It fails on the same node, N4, every time, and I am unable to figure out why. It works on all the other nodes of the cluster.

[root@dhcp35-101 yum.repos.d]# gluster-eventsapi webhook-del http://10.70.35.109:9000/listen
Traceback (most recent call last):
  File "/usr/sbin/gluster-eventsapi", line 459, in <module>
    runcli()
  File "/usr/lib/python2.6/site-packages/gluster/cliutils/cliutils.py", line 212, in runcli
    cls.run(args)
  File "/usr/sbin/gluster-eventsapi", line 274, in run
    sync_to_peers()
  File "/usr/sbin/gluster-eventsapi", line 129, in sync_to_peers
    out = execute_in_peers("node-reload")
  File "/usr/lib/python2.6/site-packages/gluster/cliutils/cliutils.py", line 125, in execute_in_peers
    raise GlusterCmdException((rc, out, err, " ".join(cmd)))
gluster.cliutils.cliutils.GlusterCmdException: (1, '', 'Commit failed on 10.70.35.104. Error: Unable to end. Error : Success\n', 'gluster system:: execute eventsapi.py node-reload')
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# gluster-eventsapi status
Webhooks: None

+-----------------------------------+-------------+-----------------------+
|                NODE               | NODE STATUS | GLUSTEREVENTSD STATUS |
+-----------------------------------+-------------+-----------------------+
|            10.70.35.100           |          UP |                    UP |
|            10.70.35.104           |          UP |                    UP |
| dhcp35-115.lab.eng.blr.redhat.com |          UP |                    UP |
|             localhost             |          UP |                    UP |
+-----------------------------------+-------------+-----------------------+
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# gluster-eventsapi webhook-test http://10.70.35.109:9000/listen
+-----------------------------------+-------------+----------------+
|                NODE               | NODE STATUS | WEBHOOK STATUS |
+-----------------------------------+-------------+----------------+
|            10.70.35.100           |          UP |             OK |
|            10.70.35.104           |          UP |             OK |
| dhcp35-115.lab.eng.blr.redhat.com |          UP |             OK |
|             localhost             |          UP |             OK |
+-----------------------------------+-------------+----------------+
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# gluster-eventsapi webhook-add http://10.70.35.109:9000/listen
Traceback (most recent call last):
  File "/usr/sbin/gluster-eventsapi", line 459, in <module>
    runcli()
  File "/usr/lib/python2.6/site-packages/gluster/cliutils/cliutils.py", line 212, in runcli
    cls.run(args)
  File "/usr/sbin/gluster-eventsapi", line 232, in run
    sync_to_peers()
  File "/usr/sbin/gluster-eventsapi", line 129, in sync_to_peers
    out = execute_in_peers("node-reload")
  File "/usr/lib/python2.6/site-packages/gluster/cliutils/cliutils.py", line 125, in execute_in_peers
    raise GlusterCmdException((rc, out, err, " ".join(cmd)))
gluster.cliutils.cliutils.GlusterCmdException: (1, '', 'Commit failed on 10.70.35.104. Error: Unable to end. Error : Success\n', 'gluster system:: execute eventsapi.py node-reload')
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]# gluster-eventsapi status
Webhooks: 
http://10.70.35.109:9000/listen

+-----------------------------------+-------------+-----------------------+
|                NODE               | NODE STATUS | GLUSTEREVENTSD STATUS |
+-----------------------------------+-------------+-----------------------+
|            10.70.35.100           |          UP |                    UP |
|            10.70.35.104           |          UP |                    UP |
| dhcp35-115.lab.eng.blr.redhat.com |          UP |                    UP |
|             localhost             |          UP |                    UP |
+-----------------------------------+-------------+-----------------------+
[root@dhcp35-101 yum.repos.d]# 
[root@dhcp35-101 yum.repos.d]#
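
When this traceback shows up in scripted runs, the failing peer can be pulled out of the error text. The snippet below is a sketch only: the "Commit failed on <node>. Error" wording is taken from the single message in the traceback above and is an assumption about the format in general, and failed_nodes is a made-up helper name.

```python
import re

# Pattern based solely on the error string seen in the traceback above.
COMMIT_FAILED = re.compile(r"Commit failed on (\S+?)\. Error")


def failed_nodes(stderr_text):
    """Return the node names/addresses reported as 'Commit failed on ...'."""
    return COMMIT_FAILED.findall(stderr_text)


if __name__ == "__main__":
    err = ("Commit failed on 10.70.35.104. "
           "Error: Unable to end. Error : Success\n")
    print(failed_nodes(err))
```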

Comment 4 Aravinda VK 2016-10-17 12:36:14 UTC

This is similar to BZ 1379963. `glustereventsd` on one node was not reloaded, so it does not know about the newly added webhook.

Comment 5 Aravinda VK 2016-10-26 10:34:52 UTC

Upstream patch sent to automatically reload the webhooks configuration when the file changes:
http://review.gluster.org/15731
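
To illustrate the idea of reloading on file change (this is not the code from the patch), one common approach is to compare the file's modification time before each use and re-read it only when it differs. The sketch below assumes the webhooks are stored as a JSON object in a single file. The class name ReloadingWebhooks, the file name, and the file layout are hypothetical.

```python
import json
import os


class ReloadingWebhooks(object):
    """Hypothetical helper: re-reads a JSON webhooks file when its mtime changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None      # mtime (ns) of the version currently loaded
        self._webhooks = {}

    def get(self):
        """Return the current webhooks, reloading from disk if the file changed."""
        try:
            mtime = os.stat(self.path).st_mtime_ns
        except OSError:
            # File missing or unreadable: treat as "no webhooks configured".
            self._mtime, self._webhooks = None, {}
            return self._webhooks
        if mtime != self._mtime:
            with open(self.path) as f:
                self._webhooks = json.load(f)
            self._mtime = mtime
        return self._webhooks


if __name__ == "__main__":
    # Hypothetical path, for demonstration only.
    hooks = ReloadingWebhooks("webhooks.json")
    print(hooks.get())
```

With a check like this on the publishing path, a daemon would pick up a newly added webhook even if an explicit reload request never reached it, which is the failure seen on N4 here.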

Comment 8 Aravinda VK 2016-11-29 07:12:41 UTC

Upstream patches: (master) http://review.gluster.org/15731
                  (   3.9) http://review.gluster.org/15963

Downstream patch: https://code.engineering.redhat.com/gerrit/91515

Comment 10 Sweta Anandpara 2016-12-19 09:31:57 UTC

Will not be able to verify this until the selinux issue wrt events is fixed (BZ 1379963).

Comment 11 Sweta Anandpara 2017-02-07 10:22:40 UTC

Tested this on glusterfs-3.8.4-13 build with selinux-policy-3.7.19-292.el6_8.3

BZ 1419869 covers a new AVC that is seen; with the workaround mentioned there, no traceback occurs. Commands executed from any of the peer nodes result in events being seen on the registered webhook.

Moving this BZ to verified in 3.2.

Comment 13 errata-xmlrpc 2017-03-23 06:09:39 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html