Bug 765432 (GLUSTER-3700)

Summary: [glusterfs-3.3.0qa14] glusterd crashed when volume stop command is issued
Product: [Community] GlusterFS
Reporter: M S Vishwanath Bhat <vbhat>
Component: glusterd
Assignee: krishnan parthasarathi <kparthas>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: medium
Docs Contact:
Priority: medium
Version: pre-release
CC: gluster-bugs, nsathyan, vijay
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:

Description M S Vishwanath Bhat 2011-10-05 07:19:44 EDT
Created a volume with tcp,rdma transport type and mounted it via kNFS on another client machine. The NFS mount point was hanging for some time, so with the intention of restarting the volume at TRACE log level, I issued the volume stop command. glusterd crashed with the following backtrace.
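The reproduction steps above can be sketched roughly as follows (the volume name and brick paths come from the logs below; mount options and brick count are assumptions about this setup):

```shell
# Create and start a volume with both tcp and rdma transports.
# "hosdu" and the /tmp/brickN paths match the logs; the exact brick
# list here is illustrative.
gluster volume create hosdu transport tcp,rdma \
    10.1.10.21:/tmp/brick3 10.1.10.21:/tmp/brick4
gluster volume start hosdu

# On another client: mount the volume's NFS export with the kernel
# NFS client (kNFS). NFSv3 is assumed.
mount -t nfs -o vers=3 10.1.10.21:/hosdu /mnt/hosdu

# The NFS mount point hangs; attempt to stop the volume in order to
# restart it at TRACE log level -- glusterd crashes here.
gluster volume stop hosdu
```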

Core was generated by `glusterd'.
Program terminated with signal 6, Aborted.
#0  0x0000003283a30265 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003283a30265 in raise () from /lib64/libc.so.6
#1  0x0000003283a31d10 in abort () from /lib64/libc.so.6
#2  0x0000003283a296e6 in __assert_fail () from /lib64/libc.so.6
#3  0x00002b056efe48ba in __gf_free (free_ptr=0x16ca4a90) at ../../../libglusterfs/src/mem-pool.c:273
#4  0x00002b056efcb41a in gf_timer_call_cancel (ctx=0x1515f010, event=0x16ca4a90) at ../../../libglusterfs/src/timer.c:131
#5  0x00002aaaaaae7a8e in glusterd_op_brick_disconnect (data=0x16ca4a30) at ../../../../../xlators/mgmt/glusterd/src/glusterd-op-sm.c:1872
#6  0x00002aaaaaadf33b in glusterd_brick_rpc_notify (rpc=0x16bdcab0, mydata=0x191a3020, event=RPC_CLNT_DISCONNECT, data=0x0) at ../../../../../xlators/mgmt/glusterd/src/glusterd-handler.c:2565
#7  0x00002b056f23b510 in rpc_clnt_notify (trans=0x16bdcce0, mydata=0x16bdcae0, event=RPC_TRANSPORT_DISCONNECT, data=0x16bdcce0) at ../../../../rpc/rpc-lib/src/rpc-clnt.c:879
#8  0x00002b056f2379f3 in rpc_transport_notify (this=0x16bdcce0, event=RPC_TRANSPORT_DISCONNECT, data=0x16bdcce0) at ../../../../rpc/rpc-lib/src/rpc-transport.c:498
#9  0x00002aaaaae07006 in socket_event_poll_err (this=0x16bdcce0) at ../../../../../rpc/rpc-transport/socket/src/socket.c:694
#10 0x00002aaaaae0b4b8 in socket_event_handler (fd=21, idx=10, data=0x16bdcce0, poll_in=1, poll_out=0, poll_err=16) at ../../../../../rpc/rpc-transport/socket/src/socket.c:1797
#11 0x00002b056efe3988 in event_dispatch_epoll_handler (event_pool=0x151617a0, events=0x167cc070, i=0) at ../../../libglusterfs/src/event.c:794
#12 0x00002b056efe3b8d in event_dispatch_epoll (event_pool=0x151617a0) at ../../../libglusterfs/src/event.c:856
#13 0x00002b056efe3ee7 in event_dispatch (event_pool=0x151617a0) at ../../../libglusterfs/src/event.c:956
#14 0x000000000040784d in main (argc=1, argv=0x7fff989efb38) at ../../../glusterfsd/src/glusterfsd.c:1592
(gdb) f 3
#3  0x00002b056efe48ba in __gf_free (free_ptr=0x16ca4a90) at ../../../libglusterfs/src/mem-pool.c:273
273                     GF_ASSERT (0);
(gdb) f 5
#5  0x00002aaaaaae7a8e in glusterd_op_brick_disconnect (data=0x16ca4a30) at ../../../../../xlators/mgmt/glusterd/src/glusterd-op-sm.c:1872
1872                    gf_timer_call_cancel (THIS->ctx, brickinfo->timer);

I saw the following errors in the glusterd logs before it crashed.

[2011-10-05 04:00:20.53453] I [glusterd-utils.c:214:glusterd_is_local_addr] 0-glusterd: 10.1.10.21 is local
[2011-10-05 04:00:20.53491] I [glusterd-utils.c:214:glusterd_is_local_addr] 0-glusterd: 10.1.10.21 is local
[2011-10-05 04:00:20.150984] E [rdma.c:4447:rdma_event_handler] 0-rpc-transport/rdma: rdma.management: pollin received on tcp socket (peer: 10.1.10.21:999) after handshake is complete
[2011-10-05 04:00:20.162281] E [rdma.c:4447:rdma_event_handler] 0-rpc-transport/rdma: rdma.management: pollin received on tcp socket (peer: 10.1.10.21:1000) after handshake is complete
[2011-10-05 04:00:20.179142] E [rdma.c:4447:rdma_event_handler] 0-rpc-transport/rdma: rdma.management: pollin received on tcp socket (peer: 10.1.10.21:1001) after handshake is complete
[2011-10-05 04:00:20.190451] E [rdma.c:4447:rdma_event_handler] 0-rpc-transport/rdma: rdma.management: pollin received on tcp socket (peer: 10.1.10.21:1002) after handshake is complete
[2011-10-05 04:00:20.196186] E [rdma.c:4447:rdma_event_handler] 0-rpc-transport/rdma: rdma.management: pollin received on tcp socket (peer: 10.1.10.21:1003) after handshake is complete
[2011-10-05 04:00:20.201945] E [rdma.c:4447:rdma_event_handler] 0-rpc-transport/rdma: rdma.management: pollin received on tcp socket (peer: 10.1.10.21:1004) after handshake is complete
[2011-10-05 04:00:20.212742] E [rdma.c:4447:rdma_event_handler] 0-rpc-transport/rdma: rdma.management: pollin received on tcp socket (peer: 10.1.10.21:1005) after handshake is complete
[2011-10-05 04:00:20.218558] E [rdma.c:4447:rdma_event_handler] 0-rpc-transport/rdma: rdma.management: pollin received on tcp socket (peer: 10.1.10.21:998) after handshake is complete
[2011-10-05 04:01:21.137325] E [rdma.c:4447:rdma_event_handler] 0-rpc-transport/rdma: rdma.management: pollin received on tcp socket (peer: 10.1.10.21:993) after handshake is complete
[2011-10-05 04:01:21.322276] E [rdma.c:4447:rdma_event_handler] 0-rpc-transport/rdma: rdma.management: pollin received on tcp socket (peer: 10.1.10.21:992) after handshake is complete
[2011-10-05 04:02:26.583109] E [rdma.c:4447:rdma_event_handler] 0-rpc-transport/rdma: rdma.management: pollin received on tcp socket (peer: 10.1.10.21:1015) after handshake is complete
[2011-10-05 04:02:26.670842] E [rdma.c:4447:rdma_event_handler] 0-rpc-transport/rdma: rdma.management: pollin received on tcp socket (peer: 10.1.10.21:1014) after handshake is complete
[2011-10-05 04:03:39.191342] I [glusterd-volume-ops.c:305:glusterd_handle_cli_stop_volume] 0-glusterd: Received stop vol reqfor volume hosdu
[2011-10-05 04:03:39.218057] I [glusterd-utils.c:258:glusterd_lock] 0-glusterd: Cluster lock held by 2040c7b1-4fba-4435-b47a-683ce18e9c42
[2011-10-05 04:03:39.218093] I [glusterd-handler.c:438:glusterd_op_txn_begin] 0-management: Acquired local lock
[2011-10-05 04:03:39.294191] I [glusterd-op-sm.c:1611:glusterd_op_ac_send_stage_op] 0-glusterd: Sent op req to 0 peers
[2011-10-05 04:03:39.404534] I [glusterd-pmap.c:282:pmap_registry_remove] 0-pmap: removing brick /tmp/brick8 on port 24023
[2011-10-05 04:03:39.543404] I [glusterd-utils.c:214:glusterd_is_local_addr] 0-glusterd: 10.1.10.21 is local
[2011-10-05 04:03:39.543453] I [glusterd-utils.c:214:glusterd_is_local_addr] 0-glusterd: 10.1.10.21 is local
[2011-10-05 04:03:39.543509] I [glusterd-utils.c:214:glusterd_is_local_addr] 0-glusterd: 10.1.10.21 is local
[2011-10-05 04:03:39.543550] I [glusterd-utils.c:214:glusterd_is_local_addr] 0-glusterd: 10.1.10.21 is local
[2011-10-05 04:03:39.543594] I [glusterd-utils.c:214:glusterd_is_local_addr] 0-glusterd: 10.1.10.21 is local
[2011-10-05 04:03:39.543635] I [glusterd-utils.c:214:glusterd_is_local_addr] 0-glusterd: 10.1.10.21 is local
[2011-10-05 04:03:39.543675] I [glusterd-utils.c:214:glusterd_is_local_addr] 0-glusterd: 10.1.10.21 is local
[2011-10-05 04:03:39.543714] I [glusterd-utils.c:214:glusterd_is_local_addr] 0-glusterd: 10.1.10.21 is local
[2011-10-05 04:03:39.723750] I [glusterd-pmap.c:282:pmap_registry_remove] 0-pmap: removing brick /tmp/brick4 on port 24015
[2011-10-05 04:03:39.723800] I [glusterd-utils.c:214:glusterd_is_local_addr] 0-glusterd: 10.1.10.21 is local
[2011-10-05 04:03:39.723843] I [glusterd-utils.c:214:glusterd_is_local_addr] 0-glusterd: 10.1.10.21 is local
[2011-10-05 04:03:39.723882] I [glusterd-utils.c:214:glusterd_is_local_addr] 0-glusterd: 10.1.10.21 is local
[2011-10-05 04:03:39.723922] I [glusterd-utils.c:214:glusterd_is_local_addr] 0-glusterd: 10.1.10.21 is local
[2011-10-05 04:03:39.743246] I [glusterd-pmap.c:282:pmap_registry_remove] 0-pmap: removing brick /tmp/brick3 on port 24013
[2011-10-05 04:03:39.743294] I [glusterd-utils.c:214:glusterd_is_local_addr] 0-glusterd: 10.1.10.21 is local
[2011-10-05 04:03:39.743334] I [glusterd-utils.c:214:glusterd_is_local_addr] 0-glusterd: 10.1.10.21 is local
[2011-10-05 04:03:39.743374] I [glusterd-utils.c:214:glusterd_is_local_addr] 0-glusterd: 10.1.10.21 is local
[2011-10-05 04:03:39.863767] W [socket.c:204:__socket_rwv] 0-socket.management: readv failed (Connection reset by peer)
[2011-10-05 04:03:39.863897] W [socket.c:1510:__socket_proto_state_machine] 0-management: reading from socket failed. Error (Transport endpoint is not connected), peer (/tmp/8789d7b89b2809e8991e5ae1fe789aaf.socket)
[2011-10-05 04:03:39.976645] E [socket.c:2110:socket_connect] 0-management: connection attempt failed (No such file or directory)

I have archived the core file and glusterd log file.
Comment 1 Anand Avati 2011-10-14 05:43:00 EDT
CHANGE: http://review.gluster.com/575 (Earlier we waited for brick disconnect or 5s whichever is smaller, before we) merged in master by Vijay Bellur (vijay@gluster.com)
Comment 2 Anand Avati 2011-10-28 04:58:21 EDT
CHANGE: http://review.gluster.com/626 (Earlier we would wait until brick disconnect or 5s whichever was earlier,) merged in release-3.2 by Vijay Bellur (vijay@gluster.com)