Description of problem:
Kernel untar and bonnie were running on the NFS mount point when rebalance was started. After some time, one of the children of the distributed-replicate volume was brought down and then brought up again. Some time later the rebalance process crashed.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Created a 2x2 distribute-replicate volume
2. Started a kernel untar on the NFS mount
3. Started bonnie on a second mount of the same volume
4. Brought down the first child of the first replica pair in the volume

Actual results:
After some time the rebalance process crashed.

Expected results:

Additional info:
===============
Program terminated with signal 6, Aborted.
#0  0x0000003739032885 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.9.x86_64 libgcc-4.4.6-3.el6.x86_64 openssl-1.0.0-20.el6_2.4.x86_64 zlib-1.2.3-27.el6.x86_64

bt
====
(gdb) bt
#0  0x0000003739032885 in raise () from /lib64/libc.so.6
#1  0x0000003739034065 in abort () from /lib64/libc.so.6
#2  0x000000373906f977 in __libc_message () from /lib64/libc.so.6
#3  0x0000003739075296 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007f0a92e79fbc in __gf_free (free_ptr=0x1fd6b20) at mem-pool.c:258
#5  0x00007f0a8e3b1418 in gf_defrag_start_crawl (data=0x1fbf3b0) at dht-rebalance.c:1494
#6  0x00007f0a92e89a96 in synctask_wrap (old_task=0x7f0a74200d70) at syncop.c:120
#7  0x0000003739043610 in ?? () from /lib64/libc.so.6
#8  0x0000000000000000 in ?? ()
Created attachment 585627
rebalance logs
Went through the log:

[2012-05-20 10:24:44.623944] E [dht-rebalance.c:1369:gf_defrag_fix_layout] 0-dis-rep-dht: Fix layout failed for /run6157
[2012-05-20 10:24:44.624088] I [dht-rebalance.c:1614:gf_defrag_status_get] 0-glusterfs: Rebalance is completed
[2012-05-20 10:24:44.624102] I [dht-rebalance.c:1617:gf_defrag_status_get] 0-glusterfs: Files migrated: 165350080, size: 218959092, lookups: 159521, failures: 26899
[2012-05-20 10:24:44.668152] W [glusterfsd.c:816:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x37390e5ccd] (-->/lib64/libpthread.so.0() [0x37398077f1] (-->/usr/local/sbin/glusterfs(glusterfs_sigwaiter+0xfc) [0x407ac2]))) 0-: received signum (15), shutting down
pending frames:
pending frames:
patchset: git://git.gluster.com/glusterfs.git
patchset: git://git.gluster.com/glusterfs.git
signal received: 6
signal received: 6
time of crash: 2012-05-20 10:24:44
configuration details:
argp 1

So this is a case of 'rebalance stop' being triggered (signum 15), followed by a race with another thread that is freeing mem-pools and other resources. Since this happens in the 'cleanup_and_exit()' path, I am reducing the priority.
*** This bug has been marked as a duplicate of bug 826584 ***