+++ This bug was initially created as a clone of Bug #1409472 +++

Description of problem:
========================
I created my systemic setup on Friday as part of regression cycle testing.
Over the weekend I noticed that two of the bricks in a 4x2 (8-brick) volume had crashed (each in a different dht-subvol).

Following is the volume configuration:

bash-4.3$ ssh root.35.20
root.35.20's password:
Last login: Mon Jan 2 11:55:17 2017 from dhcp35-226.lab.eng.blr.redhat.com
[root@dhcp35-20 ~]# gluster v status
Status of volume: sysvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.20:/rhs/brick1/sysvol        49154     0          Y       2404
Brick 10.70.37.86:/rhs/brick1/sysvol        49154     0          Y       26789
Brick 10.70.35.156:/rhs/brick1/sysvol       N/A       N/A        N       N/A
Brick 10.70.37.154:/rhs/brick1/sysvol       49154     0          Y       2192
Brick 10.70.35.20:/rhs/brick2/sysvol        49155     0          Y       2424
Brick 10.70.37.86:/rhs/brick2/sysvol        N/A       N/A        N       N/A
Brick 10.70.35.156:/rhs/brick2/sysvol       49155     0          Y       26793
Brick 10.70.37.154:/rhs/brick2/sysvol       49155     0          Y       2212
Snapshot Daemon on localhost                49156     0          Y       2449
Self-heal Daemon on localhost               N/A       N/A        Y       3131
Quota Daemon on localhost                   N/A       N/A        Y       3139
Snapshot Daemon on 10.70.37.86              49156     0          Y       26832
Self-heal Daemon on 10.70.37.86             N/A       N/A        Y       2187
Quota Daemon on 10.70.37.86                 N/A       N/A        Y       2195
Snapshot Daemon on 10.70.35.156             49156     0          Y       26816
Self-heal Daemon on 10.70.35.156            N/A       N/A        Y       1376
Quota Daemon on 10.70.35.156                N/A       N/A        Y       1384
Snapshot Daemon on 10.70.37.154             49156     0          Y       2235
Self-heal Daemon on 10.70.37.154            N/A       N/A        Y       9321
Quota Daemon on 10.70.37.154                N/A       N/A        Y       9329

Task Status of Volume sysvol
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp35-20 ~]# gluster v info

Volume Name: sysvol
Type: Distributed-Replicate
Volume ID: 4efd4f77-85c7-4eb9-b958-6769d31d84c8
Status: Started
Snapshot Count: 0
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: 10.70.35.20:/rhs/brick1/sysvol
Brick2: 10.70.37.86:/rhs/brick1/sysvol
Brick3: 10.70.35.156:/rhs/brick1/sysvol
Brick4: 10.70.37.154:/rhs/brick1/sysvol
Brick5: 10.70.35.20:/rhs/brick2/sysvol
Brick6: 10.70.37.86:/rhs/brick2/sysvol
Brick7: 10.70.35.156:/rhs/brick2/sysvol
Brick8: 10.70.37.154:/rhs/brick2/sysvol
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
features.uss: enable
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

More about the testing is available at https://docs.google.com/spreadsheets/d/1iP5Mi1TewBFVh8HTmlcBm9072Bgsbgkr3CLcGmawDys/edit#gid=632186609

I will add more information, but below is the trace I found in the brick log:

Brick 10.70.35.156:/rhs/brick1/sysvol       N/A       N/A        N       N/A

[2017-01-01 23:48:37.960916] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-sysvol-server: disconnecting connection from rhs-client14.lab.eng.blr.redhat.com-4193-2016/12/30-10:12:13:880804-sysvol-client-2-0-133
[2017-01-01 23:48:37.961019] W [entrylk.c:752:pl_entrylk_log_cleanup] 0-sysvol-server: releasing lock on fdd77d52-2461-424f-be26-58b3213e2916 held by {client=0x7fbd97e4def0, pid=-6 lk-owner=64de9764667f0000}
[2017-01-01 23:48:37.961044] W [entrylk.c:752:pl_entrylk_log_cleanup] 0-sysvol-server: releasing lock on 2b86c5a0-1c1e-4316-a73b-23599b41eb6c held by {client=0x7fbd97e4def0, pid=-6 lk-owner=e8f3a464667f0000}
[2017-01-01 23:48:37.961063] W [entrylk.c:752:pl_entrylk_log_cleanup] 0-sysvol-server: releasing lock on fdd77d52-2461-424f-be26-58b3213e2916 held by {client=0x7fbd97e4def0, pid=-6 lk-owner=64de9764667f0000}
[2017-01-01 23:48:37.961080] W [entrylk.c:752:pl_entrylk_log_cleanup] 0-sysvol-server: releasing lock on 7a287317-1ad3-4a73-8c6c-d0337606c287 held by {client=0x7fbd97e4def0, pid=-6 lk-owner=64b8a264667f0000}
[2017-01-01 23:48:37.961095] W [entrylk.c:752:pl_entrylk_log_cleanup] 0-sysvol-server: releasing lock on 2b86c5a0-1c1e-4316-a73b-23599b41eb6c held by {client=0x7fbd97e4def0, pid=-6 lk-owner=e8f3a464667f0000}
[2017-01-01 23:48:37.961107] W [entrylk.c:752:pl_entrylk_log_cleanup] 0-sysvol-server: releasing lock on 7a287317-1ad3-4a73-8c6c-d0337606c287 held by {client=0x7fbd97e4def0, pid=-6 lk-owner=64b8a264667f0000}
[2017-01-01 23:48:37.961193] W [socket.c:590:__socket_rwv] 0-tcp.sysvol-server: writev on 10.70.37.72:1018 failed (Broken pipe)
[2017-01-01 23:48:37.969562] I [socket.c:3513:socket_submit_reply] 0-tcp.sysvol-server: not connected (priv->connected = -1)
[2017-01-01 23:48:37.969597] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x10a7e6, Program: GlusterFS 3.3, ProgVers: 330, Proc: 31) to rpc-transport (tcp.sysvol-server)
[2017-01-01 23:48:37.980615] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x18502) [0x7fbd91e05502] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x18d36) [0x7fbd919a5d36] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9186) [0x7fbd91996186] ) 0-: Reply submission failed
pending frames:
frame : type(0) op(0) frame : type(0) op(27) frame : type(0) op(27) frame : type(0) op(27) frame : type(0) op(27) frame : type(0) op(27) frame : type(0) op(27) frame : type(0) op(11) frame : type(0) op(27) frame : type(0) op(27) frame : type(0) op(1) frame : type(0) op(27) frame : type(0) op(27) frame : type(0) op(27) frame : type(0) op(27) frame : type(0) op(27) frame : type(0) op(27) frame : type(0) op(27) frame : type(0) op(31) frame : type(0) op(33) frame : type(0) op(31) frame : type(0) op(31) frame : type(0) op(31) frame : type(0) op(31) frame : type(0) op(31) frame : type(0) op(31) frame : type(0) op(11) frame : type(0) op(29) frame : 
type(0) op(31) frame : type(0) op(18) frame : type(0) op(31) frame : type(0) op(11) frame : type(0) op(11) frame : type(0) op(11) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) [2017-01-01 23:48:37.980670] W [socket.c:590:__socket_rwv] 0-tcp.sysvol-server: writev on 10.70.36.36:1010 failed (Broken pipe) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) 
frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) 
op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) [2017-01-01 23:48:37.991515] I [socket.c:3513:socket_submit_reply] 0-tcp.sysvol-server: not connected (priv->connected = -1) [2017-01-01 23:48:38.002516] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x43f158, Program: GlusterFS 3.3, ProgVers: 330, Proc: 1) to rpc-transport (tcp.sysvol-server) [2017-01-01 23:48:38.002566] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x11b2e) [0x7fbd91dfeb2e] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x20b9b) [0x7fbd919adb9b] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9186) [0x7fbd91996186] ) 0-: Reply submission failed frame : type(0) op(5) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(31) frame : type(0) op(0) frame : type(0) op(8) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) 
op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(8) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0) frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash:
2017-01-01 23:48:50
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.8.4
[2017-01-01 23:48:38.070045] W [socket.c:590:__socket_rwv] 0-tcp.sysvol-server: writev on 10.70.36.45:999 failed (Broken pipe)
[2017-01-01 23:48:38.070631] W [socket.c:590:__socket_rwv] 0-tcp.sysvol-server: writev on 10.70.36.60:998 failed (Broken pipe)
[2017-01-01 23:48:38.070766] W [socket.c:590:__socket_rwv] 0-tcp.sysvol-server: writev on 10.70.36.56:1001 failed (Broken pipe)
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xc2)[0x7fbda6772c32]
/lib64/libglusterfs.so.0(gf_print_trace+0x324)[0x7fbda677c6c4]
/lib64/libc.so.6(+0x35250)[0x7fbda4e56250]
/lib64/libglusterfs.so.0(_gf_event+0x137)[0x7fbda67e7ad7]
/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x7d9d)[0x7fbd91994d9d]
/lib64/libgfrpc.so.0(rpcsvc_handle_disconnect+0x10f)[0x7fbda65344cf]
/lib64/libgfrpc.so.0(rpcsvc_notify+0xc0)[0x7fbda6536910]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fbda6538893]
/usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so(+0x9714)[0x7fbd9b02c714]
/lib64/libglusterfs.so.0(+0x83650)[0x7fbda67cc650]
/lib64/libpthread.so.0(+0x7dc5)[0x7fbda55d3dc5]
/lib64/libc.so.6(clone+0x6d)[0x7fbda4f1873d]
---------
[2017-01-01 23:48:38.072069] I [socket.c:3513:socket_submit_reply] 0-tcp.sysvol-server: not connected (priv->connected = -1)
[2017-01-01 23:48:50.395573] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xa847, Program: GlusterFS 3.3, ProgVers: 330, Proc: 27) to rpc-transport (tcp.sysvol-server)
[2017-01-01 23:48:38.080453] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x93205c, Program: GlusterFS 3.3, ProgVers: 330, Proc: 8) to rpc-transport (tcp.sysvol-server)
[2017-01-01 23:48:50.395766] I [MSGID: 101055] [client_t.c:415:gf_client_unref] 0-sysvol-server: Shutting down connection rhs-client23.lab.eng.blr.redhat.com-3206-2016/12/30-10:15:07:168605-sysvol-client-2-0-134

The gdb backtrace from the core file follows:

[root@dhcp35-156 ~]# file /core.26773
/core.26773: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/glusterfsd -s 10.70.35.156 --volfile-id sysvol.10.70.35.156.rhs-brick', real uid: 0, effective uid: 0, real gid: 0, effective gid: 0, execfn: '/usr/sbin/glusterfsd', platform: 'x86_64'

#0  0x00007fbda67e7ad7 in _gf_event () from /lib64/libglusterfs.so.0
#1  0x00007fbd91994d9d in server_rpc_notify () from /usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so
#2  0x00007fbda65344cf in rpcsvc_handle_disconnect () from /lib64/libgfrpc.so.0
#3  0x00007fbda6536910 in rpcsvc_notify () from /lib64/libgfrpc.so.0
#4  0x00007fbda6538893 in rpc_transport_notify () from /lib64/libgfrpc.so.0
#5  0x00007fbd9b02c714 in socket_event_handler () from /usr/lib64/glusterfs/3.8.4/rpc-transport/socket.so
#6  0x00007fbda67cc650 in event_dispatch_epoll_worker () from /lib64/libglusterfs.so.0
#7  0x00007fbda55d3dc5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007fbda4f1873d in clone () from /lib64/libc.so.6
(gdb)

Version-Release number of selected component (if applicable):
===================
3.8.4-10

--- Additional comment from nchilaka on 2017-01-02 01:56:11 EST ---

I also checked whether this is the same bz as "1385606 - 4 of 8 bricks (2 dht subvols) crashed on systemic setup", which is ON_QA. I found the trace to be different, hence I am raising a new bug. This also means bz#1385606 is blocked until this one gets fixed.

--- Additional comment from Aravinda VK on 2017-01-03 02:09:22 EST ---

Reproduced this issue locally with an example program. The issue is that `gethostbyname` is not thread safe: multiple calls to gf_event at the same time from different threads can corrupt the hostent struct. The crashes in both of these bugs are the same.

Solution: Replace `gethostbyname` with `getaddrinfo`, which is thread safe.
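
For readers unfamiliar with the two calls, the following is a minimal sketch of the kind of replacement described above. It is not the code from the actual patch. The helper name `resolve_ipv4`, the IPv4-only restriction, and the `SOCK_DGRAM` hint are assumptions made for illustration. The relevant difference is that `gethostbyname` may return a pointer to static storage shared by all callers, while `getaddrinfo` returns a freshly allocated result list per call.

```c
#define _POSIX_C_SOURCE 200112L

#include <arpa/inet.h>
#include <assert.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not taken from the GlusterFS sources): resolve
 * `host` to an IPv4 address and fill in `out` for the given port.
 * Returns 0 on success, or the non-zero getaddrinfo() error code.
 */
int
resolve_ipv4(const char *host, unsigned short port, struct sockaddr_in *out)
{
    struct addrinfo hints;
    struct addrinfo *result = NULL;
    int ret;

    assert(host != NULL && out != NULL);

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;      /* IPv4 only, to keep the sketch small */
    hints.ai_socktype = SOCK_DGRAM; /* assumption: a datagram sender */

    /* Each call gets its own result list, so concurrent callers do not
     * share a static hostent buffer as they can with gethostbyname(). */
    ret = getaddrinfo(host, NULL, &hints, &result);
    if (ret != 0) {
        fprintf(stderr, "getaddrinfo(%s): %s\n", host, gai_strerror(ret));
        return ret;
    }

    /* Copy the first address into caller-owned storage, then free the
     * list; nothing from the resolver outlives this function. */
    memcpy(out, result->ai_addr, sizeof(*out));
    out->sin_port = htons(port);

    freeaddrinfo(result);
    return 0;
}
```

A caller would declare a `struct sockaddr_in` on its own stack and pass its address in, so each thread works only on memory it owns. The list returned by `getaddrinfo` must always be released with `freeaddrinfo`, which is why the address is copied out first.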
REVIEW: http://review.gluster.org/16327 (eventsapi: Use `getaddrinfo` instead of `gethostbyname`) posted (#1) for review on master by Aravinda VK (avishwan)
REVIEW: http://review.gluster.org/16327 (eventsapi: Use `getaddrinfo` instead of `gethostbyname`) posted (#2) for review on master by Aravinda VK (avishwan)
COMMIT: http://review.gluster.org/16327 committed in master by Raghavendra G (rgowdapp)
------
commit aa053b228e01ab079f86d24f3444b2389895effd
Author: Aravinda VK <avishwan>
Date:   Thu Jan 5 11:28:44 2017 +0530

    eventsapi: Use `getaddrinfo` instead of `gethostbyname`

    `gethostbyname` is not thread safe. Use `getaddrinfo` to avoid
    any race or segfault while sending events

    BUG: 1410313
    Change-Id: I164af1f8eb72501fb0ed47445e68d896f7c3e908
    Signed-off-by: Aravinda VK <avishwan>
    Reviewed-on: http://review.gluster.org/16327
    Reviewed-by: Atin Mukherjee <amukherj>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Raghavendra G <rgowdapp>
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.10.0, please open a new bug report.

glusterfs-3.10.0 has been announced on the Gluster mailing lists [1]. Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-February/030119.html
[2] https://www.gluster.org/pipermail/gluster-users/