Description of problem:
=======================
Have a 4-node cluster with eventing enabled. Log in to N3's console/terminal, create a distribute volume of 2 bricks residing on N1 and N2, and another 1x2 distribute-replicate volume, again residing on N1 and N2. Execute bitrot-related commands and monitor the events that are generated. The bitrot commands, when triggered from N1, N2 or N3, successfully generate an event; however, any command executed on N4 results in no events.

Version-Release number of selected component (if applicable):
=============================================================
3.8.4-2

How reproducible:
=================
Seen across both volumes in the present setup

Steps to Reproduce:
===================
1. Have a 4-node cluster, enable eventing.
2. Log in to N3, create volume 'dist' with B1 of N1 and B2 of N2. Create another 1x2 volume 'distrep', with B2 of N1 and B2 of N2.
3. Enable bitrot and play around with the scrub options from the console of N1/N2/N3.
4. Log in to N4, and execute the same commands as in step 3 on either of the volumes 'dist' or 'distrep'.

Actual results:
===============
Events seen as expected in step 3. NO events seen in step 4.

Expected results:
=================
Events should be seen irrespective of the peer from which the command is executed.

Additional info:
================

---- N2 ----
[root@dhcp35-100 ~]# gluster v bitrot distrep scrub-throttle normal
volume bitrot: success
[root@dhcp35-100 ~]# gluster v bitrot dist scrub-frequency weekly
volume bitrot: success
[root@dhcp35-100 ~]#

{u'message': {u'name': u'distrep', u'value': u'normal'}, u'event': u'BITROT_SCRUB_THROTTLE', u'ts': 1476336724, u'nodeid': u'fcfacf2e-57fb-45ba-b1e1-e4ba640a4de5'}
{u'message': {u'name': u'dist', u'value': u'weekly'}, u'event': u'BITROT_SCRUB_FREQ', u'ts': 1476336747, u'nodeid': u'fcfacf2e-57fb-45ba-b1e1-e4ba640a4de5'}
======================================================================================================================

----- N1 -----
[root@dhcp35-115 ~]# gluster v bitrot dist scrub pause
volume bitrot: success
[root@dhcp35-115 ~]#

{u'message': {u'name': u'dist', u'value': u'pause'}, u'event': u'BITROT_SCRUB_OPTION', u'ts': 1476336842, u'nodeid': u'6ac165c0-317f-42ad-8262-953995171dbb'}
======================================================================================================================

----- N3 -----
[root@dhcp35-101 ~]# gluster v bitrot dist scrub resume
volume bitrot: success
[root@dhcp35-101 ~]#

{u'message': {u'name': u'dist', u'value': u'resume'}, u'event': u'BITROT_SCRUB_OPTION', u'ts': 1476336858, u'nodeid': u'a3bd23b9-f70a-47f5-9c95-7a271f5f1e18'}
======================================================================================================================

---- N4 ----
[root@dhcp35-104 ~]# gluster v bitrot dist scrub pause
volume bitrot: success
[root@dhcp35-104 ~]# gluster v bitrot distrep scrub pause
volume bitrot: success
[root@dhcp35-104 ~]#
[root@dhcp35-104 ~]# gluster v bitrot dist scrub resume
volume bitrot: success
[root@dhcp35-104 ~]#
[root@dhcp35-104 ~]# gluster v bitrot dist scrub status

Volume name : dist
State of scrub: Active (Idle)
Scrub impact: aggressive
Scrub frequency: weekly
Bitrot error log location: /var/log/glusterfs/bitd.log
Scrubber error log location: /var/log/glusterfs/scrub.log
=========================================================
Node: dhcp35-115.lab.eng.blr.redhat.com
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: 2016-10-13 05:32:47
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
Node: 10.70.35.100
Number of Scrubbed files: 0
Number of Skipped files: 0
Last completed scrub time: 2016-10-13 05:32:45
Duration of last scrub (D:M:H:M:S): 0:0:0:0
Error count: 0
=========================================================
[root@dhcp35-104 ~]#

<<<<<<<<< No events seen >>>>>>>>

[root@dhcp35-104 ~]# gluster-eventsapi status
Webhooks:
http://10.70.35.109:9000/listen

+-----------------------------------+-------------+-----------------------+
|               NODE                | NODE STATUS | GLUSTEREVENTSD STATUS |
+-----------------------------------+-------------+-----------------------+
| dhcp35-115.lab.eng.blr.redhat.com |          UP |                    UP |
| dhcp35-101.lab.eng.blr.redhat.com |          UP |                    UP |
| 10.70.35.100                      |          UP |                    UP |
| localhost                         |          UP |                    UP |
+-----------------------------------+-------------+-----------------------+

[root@dhcp35-104 ~]# gluster peer tsatus
unrecognized word: tsatus (position 1)
[root@dhcp35-104 ~]# gluster peer status
Number of Peers: 3

Hostname: dhcp35-115.lab.eng.blr.redhat.com
Uuid: 6ac165c0-317f-42ad-8262-953995171dbb
State: Peer in Cluster (Connected)

Hostname: dhcp35-101.lab.eng.blr.redhat.com
Uuid: a3bd23b9-f70a-47f5-9c95-7a271f5f1e18
State: Peer in Cluster (Connected)

Hostname: 10.70.35.100
Uuid: fcfacf2e-57fb-45ba-b1e1-e4ba640a4de5
State: Peer in Cluster (Connected)

[root@dhcp35-104 ~]# rpm -qa | grep gluster
glusterfs-server-3.8.4-2.el6rhs.x86_64
python-gluster-3.8.4-2.el6rhs.noarch
glusterfs-events-3.8.4-2.el6rhs.x86_64
glusterfs-3.8.4-2.el6rhs.x86_64
glusterfs-fuse-3.8.4-2.el6rhs.x86_64
glusterfs-ganesha-3.8.4-2.el6rhs.x86_64
gluster-nagios-common-0.2.4-1.el6rhs.noarch
glusterfs-debuginfo-3.8.4-1.el6rhs.x86_64
glusterfs-client-xlators-3.8.4-2.el6rhs.x86_64
glusterfs-cli-3.8.4-2.el6rhs.x86_64
glusterfs-geo-replication-3.8.4-2.el6rhs.x86_64
gluster-nagios-addons-0.2.8-1.el6rhs.x86_64
vdsm-gluster-4.16.30-1.5.el6rhs.noarch
glusterfs-api-3.8.4-2.el6rhs.x86_64
glusterfs-api-devel-3.8.4-2.el6rhs.x86_64
nfs-ganesha-gluster-2.3.1-8.el6rhs.x86_64
glusterfs-libs-3.8.4-2.el6rhs.x86_64
glusterfs-devel-3.8.4-2.el6rhs.x86_64
glusterfs-rdma-3.8.4-2.el6rhs.x86_64
[root@dhcp35-104 ~]#
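For reference, the event dicts quoted above are what a webhook listener registered at http://10.70.35.109:9000/listen would print on receiving each POSTed event. A minimal sketch of such a listener, written against the Python 2 stdlib to match the EL6 environment here (hypothetical illustration code, not the listener actually used in this setup):

# Hypothetical webhook listener sketch; NOT part of glusterfs-events.
# Python 2 stdlib, matching the EL6 environment in this report.
import json
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

class EventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # glustereventsd delivers each event as a JSON body in a POST
        length = int(self.headers.getheader('content-length') or 0)
        body = self.rfile.read(length)
        if body:
            print(json.loads(body))  # yields dicts like the ones quoted above
        self.send_response(200)
        self.end_headers()

if __name__ == '__main__':
    # Same host:port as the webhook registered in this setup
    HTTPServer(('0.0.0.0', 9000), EventHandler).serve_forever()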
Added the debuginfo package, and Atin figured out that the event IS actually being sent. Did a glustereventsd reload on the affected node N4, and events started being received. node-reload is one of the programs called when we do a webhook-add, and it in turn triggers a glustereventsd reload. For some reason, when I did the webhook-add in this setup, the glustereventsd reload must have failed on N4; that is just a hypothesis as of now. Will create a new webhook, add it in this same setup, observe the success/failure/errors seen while doing so, and update. Until then, anyone seeing a similar issue can work around it by running 'service glustereventsd reload' on the impacted node, after which the cluster and its events should work as expected.
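To make the failure mode concrete: glustereventsd keeps its webhook list in memory and only re-reads the config on an explicit reload, so a node whose reload silently failed keeps dispatching events against a stale (here, empty) webhook list. A rough sketch of that pattern, with an assumed config path and an assumed SIGHUP reload signal (illustrative only, not the actual daemon source):

# Illustrative-only sketch of the reload pattern; not glustereventsd source.
# Assumptions: webhooks live in a JSON file at the path below, and the init
# script's "reload" action delivers SIGHUP to the daemon.
import json
import signal

WEBHOOKS_FILE = "/var/lib/glusterd/events/webhooks.json"  # assumed path
webhooks = {}

def load_webhooks():
    global webhooks
    with open(WEBHOOKS_FILE) as f:
        webhooks = json.load(f)

def reload_handler(signum, frame):
    # Without this reload firing, the in-memory list stays stale and the
    # node publishes events to nobody, which is the symptom seen on N4
    load_webhooks()

signal.signal(signal.SIGHUP, reload_handler)
load_webhooks()
# main loop would follow: for each gluster event, POST it to every
# URL present in the in-memory `webhooks` dict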
Deleted the said webhook, and tried to add the same webhook back to the cluster. That did throw an exception, where it failed to run 'gluster system:: execute eventsapi.py node-reload'. It fails on the same node N4 every time, and I am unable to figure out why; it works on all the other nodes of the cluster.

[root@dhcp35-101 yum.repos.d]# gluster-eventsapi webhook-del http://10.70.35.109:9000/listen
Traceback (most recent call last):
  File "/usr/sbin/gluster-eventsapi", line 459, in <module>
    runcli()
  File "/usr/lib/python2.6/site-packages/gluster/cliutils/cliutils.py", line 212, in runcli
    cls.run(args)
  File "/usr/sbin/gluster-eventsapi", line 274, in run
    sync_to_peers()
  File "/usr/sbin/gluster-eventsapi", line 129, in sync_to_peers
    out = execute_in_peers("node-reload")
  File "/usr/lib/python2.6/site-packages/gluster/cliutils/cliutils.py", line 125, in execute_in_peers
    raise GlusterCmdException((rc, out, err, " ".join(cmd)))
gluster.cliutils.cliutils.GlusterCmdException: (1, '', 'Commit failed on 10.70.35.104. Error: Unable to end. Error : Success\n', 'gluster system:: execute eventsapi.py node-reload')

[root@dhcp35-101 yum.repos.d]# gluster-eventsapi status
Webhooks: None

+-----------------------------------+-------------+-----------------------+
|               NODE                | NODE STATUS | GLUSTEREVENTSD STATUS |
+-----------------------------------+-------------+-----------------------+
| 10.70.35.100                      |          UP |                    UP |
| 10.70.35.104                      |          UP |                    UP |
| dhcp35-115.lab.eng.blr.redhat.com |          UP |                    UP |
| localhost                         |          UP |                    UP |
+-----------------------------------+-------------+-----------------------+

[root@dhcp35-101 yum.repos.d]# gluster-eventsapi webhook-test http://10.70.35.109:9000/listen

+-----------------------------------+-------------+----------------+
|               NODE                | NODE STATUS | WEBHOOK STATUS |
+-----------------------------------+-------------+----------------+
| 10.70.35.100                      |          UP |             OK |
| 10.70.35.104                      |          UP |             OK |
| dhcp35-115.lab.eng.blr.redhat.com |          UP |             OK |
| localhost                         |          UP |             OK |
+-----------------------------------+-------------+----------------+

[root@dhcp35-101 yum.repos.d]# gluster-eventsapi webhook-add http://10.70.35.109:9000/listen
Traceback (most recent call last):
  File "/usr/sbin/gluster-eventsapi", line 459, in <module>
    runcli()
  File "/usr/lib/python2.6/site-packages/gluster/cliutils/cliutils.py", line 212, in runcli
    cls.run(args)
  File "/usr/sbin/gluster-eventsapi", line 232, in run
    sync_to_peers()
  File "/usr/sbin/gluster-eventsapi", line 129, in sync_to_peers
    out = execute_in_peers("node-reload")
  File "/usr/lib/python2.6/site-packages/gluster/cliutils/cliutils.py", line 125, in execute_in_peers
    raise GlusterCmdException((rc, out, err, " ".join(cmd)))
gluster.cliutils.cliutils.GlusterCmdException: (1, '', 'Commit failed on 10.70.35.104. Error: Unable to end. Error : Success\n', 'gluster system:: execute eventsapi.py node-reload')

[root@dhcp35-101 yum.repos.d]# gluster-eventsapi status
Webhooks:
http://10.70.35.109:9000/listen

+-----------------------------------+-------------+-----------------------+
|               NODE                | NODE STATUS | GLUSTEREVENTSD STATUS |
+-----------------------------------+-------------+-----------------------+
| 10.70.35.100                      |          UP |                    UP |
| 10.70.35.104                      |          UP |                    UP |
| dhcp35-115.lab.eng.blr.redhat.com |          UP |                    UP |
| localhost                         |          UP |                    UP |
+-----------------------------------+-------------+-----------------------+
[root@dhcp35-101 yum.repos.d]#
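Reading the traceback, the failure is not in the Python layer itself: execute_in_peers() shells out to the gluster CLI, which fans the node-reload script out to all peers, and the commit phase fails on 10.70.35.104. A rough reconstruction of what the frames above show execute_in_peers() doing (based only on the traceback, not the verbatim cliutils source):

# Rough reconstruction of what the traceback frames show; not the verbatim
# cliutils source. The exception tuple shape matches line 125 above.
import subprocess

class GlusterCmdException(Exception):
    pass

def execute_in_peers(command):
    # sync_to_peers() calls this with "node-reload"; glusterd runs the
    # eventsapi.py script on every peer, and the commit phase can fail
    # on a single node (here 10.70.35.104)
    cmd = ["gluster", "system::", "execute", "eventsapi.py", command]
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    if p.returncode != 0:
        raise GlusterCmdException((p.returncode, out, err, " ".join(cmd)))
    return out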
This is similar to BZ 1379963: `glustereventsd` on one node is not reloaded, so it does not know about the newly added webhook.
Upstream patch sent to auto-reload the webhooks configuration if the file changes: http://review.gluster.org/15731
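The idea of the patch, as described above, is to stop depending on an explicit reload: the daemon checks the webhooks file for changes and re-reads it before dispatching. A sketch of that mtime-polling approach (illustrative; the file path and names are assumptions, not the patch's verbatim code):

# Sketch of mtime-based auto-reload; illustrative, with an assumed path
# and assumed names, not the patch's verbatim code.
import json
import os

WEBHOOKS_FILE = "/var/lib/glusterd/events/webhooks.json"  # assumed path

_last_mtime = None
_webhooks = {}

def get_webhooks():
    # Called before dispatching each event: re-read the file only when
    # its mtime changes, so a missed explicit reload no longer matters
    global _last_mtime, _webhooks
    try:
        mtime = os.path.getmtime(WEBHOOKS_FILE)
    except OSError:
        return _webhooks
    if mtime != _last_mtime:
        with open(WEBHOOKS_FILE) as f:
            _webhooks = json.load(f)
        _last_mtime = mtime
    return _webhooks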
Upstream patches:
(master) http://review.gluster.org/15731
(3.9) http://review.gluster.org/15963

Downstream patch:
https://code.engineering.redhat.com/gerrit/91515
Will not be able to verify this until the selinux issue wrt events is fixed (BZ 1379963).
Tested this on the glusterfs-3.8.4-13 build with selinux-policy-3.7.19-292.el6_8.3. BZ 1419869 talks about a new AVC that was seen, and the workaround mentioned there results in no traceback. Commands executed from any of the peer nodes now result in events being seen on the registered webhook. Moving this BZ to verified in 3.2.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html