Bug 1403732

Summary: [Eventing]: CLIENT_CONNECT/DISCONNECT events repeatedly seen in an idle cluster
Product: Red Hat Gluster Storage
Reporter: Sweta Anandpara <sanandpa>
Component: eventsapi
Assignee: Mohammed Rafi KC <rkavunga>
Status: CLOSED WONTFIX
QA Contact: Sweta Anandpara <sanandpa>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.2
CC: amukherj, rcyriac, rhinduja
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-11-08 12:42:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Sweta Anandpara 2016-12-12 09:54:32 UTC
Description of problem:
======================
In an idle, steady-state cluster with eventing enabled and a webhook configured as the listener, multiple CLIENT_CONNECT and CLIENT_DISCONNECT events are seen, and when timed they repeat every minute. No bricks are down, and no service of any volume is being stopped or started; an untouched cluster is generating CLIENT_CONNECT/DISCONNECT events roughly once a minute. A sketch of the listener used to timestamp the incoming events follows.
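
Webhook listener sketch (for reference only):
---------------------------------------------
This is a minimal, hypothetical sketch of the kind of listener used to observe and timestamp the events. The port, URL path and the JSON field names ("event" etc.) are assumptions about the eventsapi payload, not details taken from this report.

# Minimal webhook listener sketch (hypothetical): glustereventsd POSTs a JSON
# payload to the registered webhook; the field names used here are assumptions.
import json
from datetime import datetime
from http.server import BaseHTTPRequestHandler, HTTPServer

class EventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            event = json.loads(body.decode("utf-8"))
        except ValueError:
            event = {"raw": body.decode("utf-8", "replace")}
        # Print wall-clock time alongside each event to measure the interval
        # between successive CLIENT_CONNECT / CLIENT_DISCONNECT notifications.
        print("%s  %s" % (datetime.utcnow().isoformat(), event.get("event", event)))
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # Port 9000 is an arbitrary choice; the same URL would be registered with
    # `gluster-eventsapi webhook-add http://<host>:9000/` (or equivalent).
    HTTPServer(("0.0.0.0", 9000), EventHandler).serve_forever()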

Some service appears to be getting restarted internally, but I have been unable to identify it: the pid shown in the event has no matching process when checked on the corresponding node (a sketch of that check is below). After enabling debug-level logs, we do see 'EOF on socket' messages in the brick logs. As per Rafi, this behaviour looks similar to BZ 1397314 and BZ 1401162.
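
PID check sketch (for reference only):
--------------------------------------
A hypothetical sketch of the pid lookup mentioned above: given the pid reported in an event, check on the named node whether any such process still exists and what its command line is. The specific pid value is only an example; 23712 appears in the client identifier in the brick log below.

# Hypothetical sketch: given the pid reported in an event payload, check on the
# corresponding node whether any such process still exists.
import os

def describe_pid(pid):
    """Return the /proc cmdline for pid, or None if no such process exists."""
    cmdline_path = "/proc/%d/cmdline" % pid
    if not os.path.exists(cmdline_path):
        return None  # matches the "no match" observation in this report
    with open(cmdline_path, "rb") as f:
        return f.read().replace(b"\x00", b" ").decode("utf-8", "replace").strip()

if __name__ == "__main__":
    print(describe_pid(23712) or "no such process")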

At present, I am seeing these events for distribute-replicate and arbiter volumes, but not for disperse volumes; further testing is required before claiming that with confidence.


Version-Release number of selected component (if applicable):
===========================================================
3.8.4-7


How reproducible:
================
I have 3 setups, and all of them show this behaviour.


Additional info:
================

[root@dhcp47-60 ~]# rpm -qa | grep gluster
glusterfs-server-3.8.4-7.el7rhgs.x86_64
glusterfs-3.8.4-7.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
vdsm-gluster-4.17.33-1.el7rhgs.noarch
glusterfs-libs-3.8.4-7.el7rhgs.x86_64
glusterfs-fuse-3.8.4-7.el7rhgs.x86_64
python-gluster-3.8.4-7.el7rhgs.noarch
glusterfs-events-3.8.4-7.el7rhgs.x86_64
glusterfs-api-3.8.4-7.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-7.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-7.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-cli-3.8.4-7.el7rhgs.x86_64
[root@dhcp47-60 ~]# 
[root@dhcp47-60 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.47.61
Uuid: f4b259db-7add-4d01-bb5e-3c7f9c077bb4
State: Peer in Cluster (Connected)

Hostname: 10.70.47.26
Uuid: 95c24075-02aa-49c1-a1e4-c7e0775e7128
State: Peer in Cluster (Connected)

Hostname: 10.70.47.27
Uuid: 8d1aaf3a-059e-41c2-871b-6c7f5c0dd90b
State: Peer in Cluster (Connected)
[root@dhcp47-60 ~]# 
[root@dhcp47-60 ~]# 


Brick logs:
----------

[2016-12-12 09:51:29.896808] D [socket.c:559:__socket_rwv] 0-tcp.testvol-server: EOF on socket
[2016-12-12 09:51:29.896859] D [socket.c:2403:socket_event_handler] 0-transport: disconnecting now
[2016-12-12 09:51:29.896898] D [logging.c:1953:_gf_msg_internal] 0-logging-infra: Buffer overflow of a buffer whose size limit is 5. About to flush least recently used log message to disk
[2016-12-12 09:51:29.829919] D [MSGID: 0] [io-threads.c:353:iot_schedule] 0-testvol-io-threads: LOOKUP scheduled as fast fop
[2016-12-12 09:51:29.896896] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-testvol-server: disconnecting connection from dhcp47-27.lab.eng.blr.redhat.com-23712-2016/12/12-09:51:29:685953-testvol-client-0-0-0
[2016-12-12 09:51:29.896987] D [client_t.c:328:gf_client_ref] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_disconnect+0x10f) [0x7f1643c1c4cf] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x7bce) [0x7f162f11dbce] -->/lib64/libglusterfs.so.0(gf_client_ref+0x6b) [0x7f1643eb124b] ) 0-client_t: dhcp47-27.lab.eng.blr.redhat.com-23712-2016/12/12-09:51:29:685953-testvol-client-0-0-0: ref-count 2
[2016-12-12 09:51:29.897151] D [client_t.c:305:gf_client_put] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_disconnect+0x10f) [0x7f1643c1c4cf] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x7bdb) [0x7f162f11dbdb] -->/lib64/libglusterfs.so.0(gf_client_put+0x93) [0x7f1643eb17d3] ) 0-client_t: dhcp47-27.lab.eng.blr.redhat.com-23712-2016/12/12-09:51:29:685953-testvol-client-0-0-0: bind_ref: 0, ref: 2, unref: 1
[2016-12-12 09:51:29.897270] D [client_t.c:412:gf_client_unref] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_disconnect+0x10f) [0x7f1643c1c4cf] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x7bdb) [0x7f162f11dbdb] -->/lib64/libglusterfs.so.0(gf_client_unref+0x7c) [0x7f1643eb13cc] ) 0-client_t: dhcp47-27.lab.eng.blr.redhat.com-23712-2016/12/12-09:51:29:685953-testvol-client-0-0-0: ref-count 1
[2016-12-12 09:51:29.897314] D [MSGID: 0] [server-helpers.c:367:server_connection_cleanup] 0-testvol-server: Performing cleanup on 128 fdentries
[2016-12-12 09:51:29.897628] D [client_t.c:412:gf_client_unref] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_disconnect+0x10f) [0x7f1643c1c4cf] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x7bef) [0x7f162f11dbef] -->/lib64/libglusterfs.so.0(gf_client_unref+0x7c) [0x7f1643eb13cc] ) 0-client_t: dhcp47-27.lab.eng.blr.redhat.com-23712-2016/12/12-09:51:29:685953-testvol-client-0-0-0: ref-count 0
[2016-12-12 09:51:29.897664] I [MSGID: 101055] [client_t.c:415:gf_client_unref] 0-testvol-server: Shutting down connection dhcp47-27.lab.eng.blr.redhat.com-23712-2016/12/12-09:51:29:685953-testvol-client-0-0-0
[2016-12-12 09:51:29.897750] D [MSGID: 0] [event-epoll.c:587:event_dispatch_epoll_handler] 0-epoll: generation bumped on idx=14 from gen=649 to slot->gen=650, fd=23, slot->fd=23
[2016-12-12 09:51:59.676123] D [logging.c:1829:gf_log_flush_timeout_cbk] 0-logging-infra: Log timer timed out. About to flush outstanding messages if present
[2016-12-12 09:51:59.676272] D [logging.c:1791:__gf_log_inject_timer_event] 0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5

Comment 2 Sweta Anandpara 2016-12-12 09:56:21 UTC
This was not seen on earlier builds (< 3.8.4-5).

Comment 5 Sweta Anandpara 2016-12-14 09:07:13 UTC
As requested (to confirm that this was not seen before), I downgraded my cluster setup to a build older than 3.8.4-5 (3.8.4-3, to be precise).

I enabled eventing on all the nodes and performed some random user-driven operations on the volume; the corresponding events are seen as expected. When the cluster is left idle, no CLIENT_CONNECT/DISCONNECT events are seen at all. This confirms that the issue raised in this BZ was introduced in one of the interim builds; further testing would be required to ascertain the exact build.