Description of problem:

Distributed-Replicate 8x2 volume on 3.3.1. Moving a healthy brick to another server. While the source brick was overloaded with I/O, making the whole volume unavailable (I'll file another bug for that), the glusterfs process for the destination brick hangs itself in __inode_find in inode.c:

20130409_114419 root@stor3-idc1-lga:~/ gdb //sbin/glusterfs
(gdb) attach 24685
(gdb) bt
#0  0x00007fa947885b40 in uuid_compare@plt () from //lib/libglusterfs.so.0
#1  0x00007fa9478a008d in __inode_find (table=0x23cf660, gfid=0x2687410 "\003\267\245~\246\321@\024\274\372\230\310\002\365\306,") at inode.c:765
#2  0x00007fa9478a0649 in inode_find (table=0x23cf660, gfid=0x2687410 "\003\267\245~\246\321@\024\274\372\230\310\002\365\306,") at inode.c:788
#3  0x00007fa94372ab5f in resolve_entry_simple (frame=<optimized out>) at server-resolve.c:251
#4  0x00007fa94372b32e in resolve_continue (frame=0x7fa94590679c) at server-resolve.c:215
#5  0x00007fa94372b71c in resolve_gfid_entry_cbk (frame=0x7fa94590679c, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, inode=<optimized out>, buf=0x7fffd7390820, xdata=0x0, postparent=0x7fffd7390890) at server-resolve.c:107
#6  0x00007fa943964886 in pl_lookup_cbk (frame=0x7fa945b102dc, cookie=<optimized out>, this=<optimized out>, op_ret=-1, op_errno=0, inode=0x7fa9424fd7bc, buf=0x7fffd7390820, xdata=0x0, postparent=0x7fffd7390890) at posix.c:1611
#7  0x00007fa943b7b17e in posix_lookup (frame=0x7fa945b10230, this=<optimized out>, loc=0x2687438, xdata=<optimized out>) at posix.c:188
#8  0x00007fa94395f464 in pl_lookup (frame=<optimized out>, this=0x23c0050, loc=0x2687438, xdata=0x0) at posix.c:1653
#9  0x00007fa94372b573 in resolve_gfid_cbk (frame=0x7fa94590679c, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, inode=0x7fa9424fd7bc, buf=0x7fffd7390b40, xdata=0x0, postparent=0x7fffd7390bb0) at server-resolve.c:158
#10 0x00007fa943964886 in pl_lookup_cbk (frame=0x7fa945b1002c, cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0, inode=0x7fa9424fd7bc, buf=0x7fffd7390b40, xdata=0x0, postparent=0x7fffd7390bb0) at posix.c:1611
#11 0x00007fa943b7b17e in posix_lookup (frame=0x7fa945b100d8, this=<optimized out>, loc=0x2687438, xdata=<optimized out>) at posix.c:188
#12 0x00007fa94395f464 in pl_lookup (frame=<optimized out>, this=0x23c0050, loc=0x2687438, xdata=0x0) at posix.c:1653
#13 0x00007fa94372aa91 in resolve_gfid (frame=0x7fa94590679c) at server-resolve.c:190
#14 0x00007fa94372b160 in server_resolve_entry (frame=0x7fa94590679c) at server-resolve.c:325
#15 0x00007fa94372b218 in server_resolve (frame=0x7fa94590679c) at server-resolve.c:502
#16 0x00007fa94372af0e in server_resolve_all (frame=<optimized out>) at server-resolve.c:559
#17 0x00007fa94372b7f4 in resolve_and_resume (frame=<optimized out>, fn=<optimized out>) at server-resolve.c:589
#18 0x00007fa943745d41 in server_lookup (req=<optimized out>) at server3_1-fops.c:5571
#19 0x00007fa9476641c8 in rpcsvc_handle_rpc_call (svc=0x23c2ec0, trans=<optimized out>, msg=<optimized out>) at rpcsvc.c:513
#20 0x00007fa9476647fb in rpcsvc_notify (trans=0x24a5970, mydata=<optimized out>, event=<optimized out>, data=0x23cebf0) at rpcsvc.c:612
#21 0x00007fa947668367 in rpc_transport_notify (this=<optimized out>, event=<optimized out>, data=<optimized out>) at rpc-transport.c:489
#22 0x00007fa942df1c24 in socket_event_poll_in (this=0x24a5970) at socket.c:1677
#23 0x00007fa942df1f77 in socket_event_handler (fd=<optimized out>, idx=1, data=0x24a5970, poll_in=1, poll_out=0, poll_err=<optimized out>) at socket.c:1792
#24 0x00007fa9478b2917 in event_dispatch_epoll_handler (i=<optimized out>, events=0x23cd620, event_pool=0x23b5d30) at event.c:785
#25 event_dispatch_epoll (event_pool=0x23b5d30) at event.c:847
#26 0x000000000040475d in main (argc=<optimized out>, argv=0x7fffd7391598) at glusterfsd.c:1689

After some stepping I found that __inode_find never finishes: the process never exits the "list_for_each_entry (tmp, &table->inode_hash[hash], hash) {" loop of __inode_find in inode.c. The process was still busy writing out 2M directories to the new brick; it had gotten to 1%.
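For context on where the spin happens, below is a minimal standalone sketch of the kind of hash-bucket walk that frame #1 is stuck in. The names (sketch_table, sketch_inode, sketch_inode_find), the bucket count, and the uuid type are illustrative assumptions, not copies of the shipped 3.3.1 source; only the loop shape mirrors the list_for_each_entry walk quoted above, built on the kernel-style intrusive list macros that libglusterfs uses.

/* Sketch only: names, bucket count and uuid type are assumptions,
 * not the actual GlusterFS 3.3.1 code. */
#include <stdio.h>
#include <stddef.h>
#include <string.h>

/* Kernel-style intrusive doubly linked list. */
struct list_head {
        struct list_head *next;
        struct list_head *prev;
};

#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof (type, member)))

#define list_for_each_entry(pos, head, member)                              \
        for (pos = container_of ((head)->next, __typeof__ (*pos), member);  \
             &pos->member != (head);                                        \
             pos = container_of (pos->member.next, __typeof__ (*pos), member))

typedef unsigned char sketch_uuid_t[16];   /* stand-in for libuuid's uuid_t */

struct sketch_inode {
        sketch_uuid_t     gfid;
        struct list_head  hash;    /* links the inode into one hash bucket */
};

#define SKETCH_BUCKETS 4096        /* illustrative; not the real table size */

struct sketch_table {
        struct list_head  inode_hash[SKETCH_BUCKETS];
};

/*
 * Walk one hash bucket looking for a matching gfid.  The loop terminates
 * only when the traversal wraps back around to the bucket's list head; if
 * the next/prev links of an entry are corrupted, or an entry is re-linked
 * while a walker is traversing the bucket, the walk can cycle forever.
 */
static struct sketch_inode *
sketch_inode_find (struct sketch_table *table, sketch_uuid_t gfid, int bucket)
{
        struct sketch_inode *tmp   = NULL;
        struct sketch_inode *found = NULL;

        list_for_each_entry (tmp, &table->inode_hash[bucket], hash) {
                if (memcmp (tmp->gfid, gfid, sizeof (sketch_uuid_t)) == 0) {
                        found = tmp;
                        break;
                }
        }
        return found;
}

int
main (void)
{
        static struct sketch_table table;   /* static to keep it off the stack */
        sketch_uuid_t gfid = {0};
        int i;

        /* Empty buckets point back at themselves, so the walk ends at once. */
        for (i = 0; i < SKETCH_BUCKETS; i++) {
                table.inode_hash[i].next = &table.inode_hash[i];
                table.inode_hash[i].prev = &table.inode_hash[i];
        }

        printf ("found: %p\n", (void *) sketch_inode_find (&table, gfid, 0));
        return 0;
}

The point of the sketch: the walk has no counter or bound, it ends only when it comes back around to the bucket's list head. So if an entry's links stop forming a proper cycle through the head (corruption, or concurrent re-linking without the table lock), the traversal spins among entries indefinitely, which would match the observed 100% CPU with uuid_compare at the top of the stack.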
The user has been stuck for 24h; raising the priority accordingly.
Created attachment 733672 [details]
The -etc-glusterfs-glusterd.vol.log

This is the -etc-glusterfs-glusterd.vol.log of the destination node that this bug is about. The glusterfs process is still spinning at 100% CPU. Bug 950024, among others, holds the source brick log and commands.
Any update? It is now 7 days later, and I have killed the glusterfs process that was spinning at 100% CPU to free up resources.
The version that this bug has been reported against does not get any updates from the Gluster Community anymore. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug. If there has been no update before 9 December 2014, this bug will be closed automatically.