Created attachment 589744 [details]
glustershd log

Description of problem:
-------------------------
The self-heal daemon process crashed during self-heal of 50K files.

Core was generated by `/usr/local/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000003271f14f99 in xdrmem_getlong () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6.x86_64 libgcc-4.4.6-3.el6.x86_64 openssl-1.0.0-20.el6.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt
#0  0x0000003271f14f99 in xdrmem_getlong () from /lib64/libc.so.6
#1  0x0000003271f141f3 in xdr_u_int_internal () from /lib64/libc.so.6
#2  0x0000003271f148e8 in xdr_string_internal () from /lib64/libc.so.6
#3  0x00007f3af77c5c67 in xdr_gd1_mgmt_brick_op_req (xdrs=0x132c5a0, objp=0x132ce40) at glusterd1-xdr.c:478
#4  0x00007f3af77bd71f in xdr_to_generic (inmsg=..., args=0x132ce40, proc=0x7f3af77c5c3e <xdr_gd1_mgmt_brick_op_req>) at xdr-generic.c:60
#5  0x000000000040a4b7 in glusterfs_handle_translator_op (data=0xd5175c) at glusterfsd-mgmt.c:655
#6  0x00007f3af7c4a882 in synctask_wrap (old_task=0xf2dbf0) at syncop.c:120
#7  0x0000003271e43610 in ?? () from /lib64/libc.so.6
#8  0x0000000000000000 in ?? ()

create_files.sh (creates 50K files of 1M - 10M each; about 270GB of disk space is required):
-----------------
#!/bin/bash

path=$1
count_value=1

# Create 50,000 files, cycling the dd count (and hence the file size)
# from 1M through 10M.
for i in {1..50000}; do
    if [ $count_value -gt 10 ]; then
        count_value=1
    fi
    echo -e "Creating File : file.$i\n"
    dd if=/dev/urandom of=$path/file.$i bs=1M count=$count_value
    echo -e "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
    let "count_value = $count_value + 1"
done

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
3.3.0qa45

Steps to Reproduce:
---------------------
1. Create a distribute-replicate volume (3x3).
2. Bring down 2 bricks from each replicate sub-volume.
3. Create a fuse mount.
4. On the mount, execute the script "create_files.sh".
5. Once the script has completed, bring the bricks back online (gluster v start <vol_name> force).
6. On a storage node execute:
   -------------------------
   a. gluster v heal vol full
      Heal operation on volume vol has been successful

   b. gluster v heal vol info healed
      Self-heal daemon is not running. Check self-heal daemon log file.

Actual results:
----------------
After some time the glusterfs self-heal daemon crashed on one of the storage nodes.

Additional info:
---------------
[06/06/12 - 02:18:58 root@AFR-Server1 ~]# gluster v info

Volume Name: vol
Type: Distributed-Replicate
Volume ID: b2f7f458-598e-456f-af7c-aa5af0036393
Status: Started
Number of Bricks: 3 x 3 = 9
Transport-type: tcp
Bricks:
Brick1: 10.16.159.184:/export_b1/dir1
Brick2: 10.16.159.188:/export_b1/dir1
Brick3: 10.16.159.196:/export_b1/dir1
Brick4: 10.16.159.184:/export_c1/dir1
Brick5: 10.16.159.188:/export_c1/dir1
Brick6: 10.16.159.196:/export_c1/dir1
Brick7: 10.16.159.184:/export_d1/dir1
Brick8: 10.16.159.188:/export_d1/dir1
Brick9: 10.16.159.196:/export_d1/dir1

Found this bug while verifying Bug 798907.
Amar will be fixing the XDR encoding/decoding problems. Amar, there is still a bug in brick-op which blows up the request size because of the brick-op response on the originator. We will have to fix that as well once the XDR problems go away.
Did some checking on this. The failure is happening inside xdrmem_getlong(), which is outside the scope of glusterfs. Did some searching about possible failures of xdrmem_getlong() and did find an issue from as far back as 1995, but that should presumably be fixed by now (http://mail-index.netbsd.org/netbsd-bugs/1995/06/05/0003.html). Lowering the priority of this bug for now.
Not happening anymore with the 3.4.0 qa series. Please re-open if seen again.
(In reply to Amar Tumballi from comment #4)
> not happening anymore with 3.4.0 qa series. Please re-open if seen again.

When can we expect 3.4.0 to be released?

Thank you in advance,
Kind regards,
Elvir / Red Hat GSS EMEA