On a dist-rep set-up I was running 'teragen' and 'randomtextwriter' simultaneously, and the glusterfs client process crashed with the following backtrace.

(gdb) bt
#0  0x00000038fbe30265 in raise () from /lib64/libc.so.6
#1  0x00000038fbe31d10 in abort () from /lib64/libc.so.6
#2  0x00000038fbe296e6 in __assert_fail () from /lib64/libc.so.6
#3  0x00002b77db9bbdff in __gf_free (free_ptr=0x11c32a00) at mem-pool.c:297
#4  0x00002aaaada8c61c in dht_pathinfo_getxattr_cbk (frame=0x2b77dc8a5d14, cookie=0x2b77dc8a52d4, this=0x11c08660, op_ret=0, op_errno=0, xattr=0x11c31910) at dht-common.c:1764
#5  0x00002aaaad80d98a in afr_getxattr_pathinfo_cbk (frame=0x2b77dc8a52d4, cookie=0x0, this=0x11c07a10, op_ret=0, op_errno=0, dict=0x11c398e0) at afr-inode-read.c:742
#6  0x00002aaaad5d0490 in client3_1_getxattr_cbk (req=0x2aaaaebf6710, iov=0x2aaaaebf6750, count=1, myframe=0x2b77dc8a6568) at client3_1-fops.c:892
#7  0x00002b77dbc0a752 in rpc_clnt_handle_reply (clnt=0x11c1bb70, pollin=0x11c381e0) at rpc-clnt.c:747
#8  0x00002b77dbc0aa89 in rpc_clnt_notify (trans=0x11c1bcb0, mydata=0x11c1bba0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x11c381e0) at rpc-clnt.c:860
#9  0x00002b77dbc07170 in rpc_transport_notify (this=0x11c1bcb0, event=RPC_TRANSPORT_MSG_RECEIVED, data=0x11c381e0) at rpc-transport.c:931
#10 0x00002aaaaad6dea7 in socket_event_poll_in (this=0x11c1bcb0) at socket.c:1676
#11 0x00002aaaaad6e3e9 in socket_event_handler (fd=10, idx=3, data=0x11c1bcb0, poll_in=1, poll_out=0, poll_err=0) at socket.c:1791
#12 0x00002b77db9bae30 in event_dispatch_epoll_handler (event_pool=0x11bf7920, events=0x11bfc650, i=0) at event.c:794
#13 0x00002b77db9bb035 in event_dispatch_epoll (event_pool=0x11bf7920) at event.c:856
#14 0x00002b77db9bb38f in event_dispatch (event_pool=0x11bf7920) at event.c:956
#15 0x0000000000407222 in main (argc=4, argv=0x7fffc15b33a8) at glusterfsd.c:1557
(gdb) f 4
#4  0x00002aaaada8c61c in dht_pathinfo_getxattr_cbk (frame=0x2b77dc8a5d14, cookie=0x2b77dc8a52d4, this=0x11c08660, op_ret=0, op_errno=0, xattr=0x11c31910) at dht-common.c:1764
1764            GF_FREE (local->pathinfo);
(gdb) f 5
#5  0x00002aaaad80d98a in afr_getxattr_pathinfo_cbk (frame=0x2b77dc8a52d4, cookie=0x0, this=0x11c07a10, op_ret=0, op_errno=0, dict=0x11c398e0) at afr-inode-read.c:742
742             AFR_STACK_UNWIND (getxattr, frame, op_ret, op_errno, xattr);
(gdb) f 3
#3  0x00002b77db9bbdff in __gf_free (free_ptr=0x11c32a00) at mem-pool.c:297
297             GF_ASSERT (0);
(gdb)
Vishwanath is doing another run with the fix. The crash was caused by a memory overrun: the GF_REALLOC calls in the DHT pathinfo xattr callback used an incorrect length.
CHANGE: http://review.gluster.com/236 (size of the allocated length is incorrectly calculated which could) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.com/249 (We use strcat to concat pathinfo strings. strcat appends a \0 at) merged in master by Anand Avati (avati)
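
For anyone reading this later, here is a minimal sketch of the kind of length calculation involved when a buffer is grown and then extended with strcat. This is not the GlusterFS patch: it uses plain realloc rather than GF_REALLOC, and the helper name append_pathinfo and the single-space separator are assumptions made only for illustration. The point is that the new size has to cover the existing string, the separator, the appended string, and the terminating \0 that strcat writes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not GlusterFS code): append `piece` to the
 * heap-allocated string `*buf`, separated by a single space.
 * Returns 0 on success, -1 on allocation failure, in which case
 * `*buf` is left untouched. */
static int append_pathinfo(char **buf, const char *piece)
{
    assert(buf != NULL && piece != NULL);

    size_t old_len = (*buf != NULL) ? strlen(*buf) : 0;
    size_t sep_len = (old_len > 0) ? 1 : 0;

    /* Existing text + separator + new text + 1 byte for the '\0'
     * that strcat writes. Leaving out the final +1 overruns the
     * allocation by one byte. */
    size_t new_size = old_len + sep_len + strlen(piece) + 1;

    char *tmp = realloc(*buf, new_size);
    if (tmp == NULL)
        return -1;

    if (old_len == 0)
        tmp[0] = '\0';      /* make the buffer a valid empty string */
    else
        strcat(tmp, " ");   /* separator between entries */
    strcat(tmp, piece);

    *buf = tmp;
    return 0;
}
```

Starting from a NULL pointer and calling append_pathinfo twice builds up a space-separated string; the caller frees it with free() when done. An allocator that checks guard bytes on free can catch an overrun of this kind only at free time, which would be consistent with the assert firing in __gf_free in the backtrace above.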
I observed this crash when I ran two map-reduce jobs simultaneously. After the fix, I ran two map-reduce jobs simultaneously again: the crash no longer occurs and both jobs ran to completion.