Created attachment 392
I've set up distributed/replicated storage across 4 servers, each with 2 bricks. During a copy of some VM image files, both glusterfsd processes on the fourth node crashed.
That was probably a hardware issue; it hasn't occurred again after fixing a faulty CPU fan. Sorry for that, feel free to close this issue.
Sorry, but after some more testing the bug is still there. I ran memtest on the fourth machine with no errors reported, so the hardware should be fine. Is there any test I can run for you?
Hi,

Could you give the exact steps you performed? Also, if it is easily reproducible, could you reproduce it with tracing enabled and provide the client log files?

Gaurav
Created attachment 422
Created attachment 423
Created attachment 424
Created attachment 425
The volume was created with:

    gluster volume create vms replica 4 transport tcp \
        10.10.10.1:/srv/vol1 10.10.10.2:/srv/vol1 10.10.10.3:/srv/vol1 10.10.10.4:/srv/vol1 \
        10.10.10.1:/srv/vol2 10.10.10.2:/srv/vol2 10.10.10.3:/srv/vol2 10.10.10.4:/srv/vol2

While rsyncing from a local folder to the NFS-mounted gluster volume on the first node, I got a crash within a few minutes (the hardware should be fine).

    # gluster --version
    glusterfs 3.1.2 built on Jan 16 2011 18:14:56
    Repository revision: v3.1.1-64-gf2a067c

    # cat /etc/fstab
    # /etc/fstab: static file system information.
    #
    # <file system> <mount point> <type> <options> <dump> <pass>
    proc /proc proc defaults 0 0
    /dev/mapper/vg0-host--root / ext3 relatime,errors=remount-ro 0 1
    /dev/md0 /boot ext3 defaults 0 2
    /dev/mapper/vg0-host--swap none swap sw 0 0
    /dev/scd0 /media/cdrom0 udf,iso9660 user,noauto 0 0
    UUID=852b2401-a64b-4c38-bd67-d90eb9721f17 /srv/vol1 ext4 user_xattr 0 0
    UUID=547e764b-d9e7-4a42-9022-cf2c66ea4ded /srv/vol2 ext4 user_xattr 0 0
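For reference, a hedged note not stated in the report itself: with "replica 4", gluster groups consecutive bricks from the create command into replica sets, so the layout above gives two 4-way mirrored sets, one per backend filesystem.

    # Assumed grouping for the command above (brick order matters):
    #   replica set 1: 10.10.10.1-4:/srv/vol1
    #   replica set 2: 10.10.10.1-4:/srv/vol2
    gluster volume info vms    # should report: Number of Bricks: 2 x 4 = 8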
Hi,

Could you give details such as the number and size of the files you were copying? Please also provide memory info ('free -k'), the exact 'rsync' command you ran, and the stack size ('ulimit -s').

Regards,
Gaurav
I was restoring an OpenNebula VM image backend from a backup (rsync -av). There was a bunch of very small files, each in a folder with the qemu XML configuration, plus a folder with 7-8 disk images of 8 GB each.

    nebula01:~# ulimit -s
    8192
    nebula01:~# free -k
                 total       used       free     shared    buffers     cached
    Mem:       4054976    4021928      33048          0     291068    3414012
    -/+ buffers/cache:      316848    3738128
    Swap:      3903480          0    3903480

The only interesting thing in all this story is that with replica 4, glusterfsd on the fourth node (nebula04) crashed *every time*, anywhere from a few minutes after launching rsync to a few hours after starting some KVM virtual machines. It never happened on nebula01, nebula02 or nebula03. Another interesting thing is that with replica 3 (using only 3 nodes instead of 4), I never got a glusterfsd crash. Initially I suspected a hardware problem on nebula04, but I had no problems setting up a replica 3 using nebula01-03-04 or nebula01-02-04. On the client side, I remember clearly that I tried both the NFS mount and the native FUSE client. Sorry that I cannot run more tests for you: I'm no longer using GlusterFS because of this problem.
The most likely reason for the crash is in the function "server_finodelk": somehow "args.volume" is getting an invalid address.

    =====> args.volume = alloca (4096);
    if (!xdr_to_finodelk_req (req->msg[0], &args)) {
            //failed to decode msg;
            req->rpc_err = GARBAGE_ARGS;
            goto out;
    }

One possible reason is that "alloca" is unable to allocate 4k of memory on the stack and returns an invalid address; "alloca" behaviour in that case is undefined. I was able to reproduce the same issue by explicitly assigning an invalid address to "args.volume". One possible solution is to use malloc instead of allocating 4k of memory on the stack.

Generally we don't test with more than a 2/3 replica count. Are you willing to test with a patch provided by us?
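A minimal sketch of the heap-based alternative mentioned above (illustrative only, not the actual patch; the memory-type enum and the error path chosen here are assumptions, while GF_CALLOC/GF_FREE are the GlusterFS allocation wrappers):

    /* Sketch: allocate args.volume on the heap instead of via alloca(),
       so a failed allocation is detectable rather than undefined. */
    args.volume = GF_CALLOC (1, 4096, gf_common_mt_char);  /* memory type chosen for illustration */
    if (!args.volume) {
            req->rpc_err = SYSTEM_ERR;  /* assumed error path */
            goto out;
    }

    if (!xdr_to_finodelk_req (req->msg[0], &args)) {
            /* failed to decode msg */
            req->rpc_err = GARBAGE_ARGS;
            goto out;
    }

    /* ... rest of the handler ... */
    GF_FREE (args.volume);  /* must now be freed on every exit path */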
Yes, I will test with every patch you will provide, but in a week or two, not before. Thanks.
Hi,

Please try with this patch: http://patches.gluster.com/patch/6096/

Regards,
Gaurav
Hi Gaurav, unfortunately the behavior hasn't changed with your patch: rsync stalled after a few minutes. This time glusterfsd didn't crash, but I got the same errors.

nebula04-brick2:

    [2011-02-07 16:06:40.568602] I [server-handshake.c:535:server_setvolume] vms-server: accepted client from 10.10.10.4:1021
    [2011-02-07 16:06:40.589622] I [server-handshake.c:535:server_setvolume] vms-server: accepted client from 10.10.10.2:1005
    [2011-02-07 16:06:44.226718] I [server-handshake.c:535:server_setvolume] vms-server: accepted client from 10.10.10.1:1001
    [2011-02-07 16:06:44.436381] I [server-handshake.c:535:server_setvolume] vms-server: accepted client from 10.10.10.3:1003
    [2011-02-07 16:08:37.595070] E [rpcsvc.c:874:rpcsvc_request_create] rpc-service: RPC call decoding failed
    [2011-02-07 16:08:37.595233] I [server.c:428:server_rpc_notify] vms-server: disconnected connection from 10.10.10.1:1001
    [2011-02-07 16:08:37.595275] I [server-helpers.c:670:server_connection_destroy] vms-server: destroyed connection of nebula01-26537-2011/02/07-16:06:36:194704-vms-client-7
    [2011-02-07 16:08:48.259397] I [server-handshake.c:535:server_setvolume] vms-server: accepted client from 10.10.10.1:1017
    [2011-02-07 16:08:59.554105] E [rpcsvc.c:874:rpcsvc_request_create] rpc-service: RPC call decoding failed
    [2011-02-07 16:08:59.554255] I [server.c:428:server_rpc_notify] vms-server: disconnected connection from 10.10.10.1:1017
    [..]
    [2011-02-07 16:09:41.167949] I [server-handshake.c:535:server_setvolume] vms-server: accepted client from 10.10.10.1:1018
    [2011-02-07 16:09:41.202843] E [server3_1-fops.c:4427:server_finodelk] : iov_len invalid 65572
    [2011-02-07 16:09:41.202925] E [socket.c:1368:__socket_read_frag] rpc: wrong MSG-TYPE (542930224) received from 10.10.10.1:1018
    [2011-02-07 16:09:41.202957] E [socket.c:1368:__socket_read_frag] rpc: wrong MSG-TYPE (542930224) received from 10.10.10.1:1018
    [2011-02-07 16:09:41.202979] E [socket.c:1368:__socket_read_frag] rpc: wrong MSG-TYPE (542930224) received from 10.10.10.1:1018

And on the NFS client:

    [2011-02-07 16:08:37.495145] E [rpcsvc.c:1693:nfs_rpcsvc_submit_generic] nfsrpc: Failed to submit message
    [2011-02-07 16:08:37.495202] E [nfs3.c:522:nfs3svc_submit_reply] nfs-nfsv3: Reply submission failed
    [2011-02-07 16:08:37.498037] E [rpcsvc.c:1693:nfs_rpcsvc_submit_generic] nfsrpc: Failed to submit message
    [2011-02-07 16:08:37.498064] E [nfs3.c:522:nfs3svc_submit_reply] nfs-nfsv3: Reply submission failed
    [2011-02-07 16:08:37.505520] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x77) [0x7f5f22a80d17] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x7f5f22a804ae] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f5f22a8040e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(WRITE(13)) called at 2011-02-07 16:08:37.496988

If you want, I can provide full logs for the NFS client and the failing node's glusterfsd.
Yes, please give full log files.
Created attachment 433
Created attachment 434
I don't see any significant differences between these logs and the first ones I attached. Here are the new NFS client log and the log from the brick that crashed. However, this bug is very easy for me to trigger; can't you set up your own 4 test bricks with replica 4 on Debian Lenny x86_64?
Hi Gaurav, I have replicated our setup under 4 virtual machines on another host, and I have the same issues. If you are interested, I can give you root access to inspect and do your tests. Thanks.
(In reply to comment #20)
> Hi Gaurav,
>
> I have replicated our setup under 4 virtual machines on another host, and I
> have the same issues.
>
> If you are interested, I can give you root access to inspect and do your tests.
>
> Thanks.

Hi,

I am able to reproduce it. It's failing for big files (8 GB) on a 4-replica setup with NFS.

Thanks,
Gaurav
In the 4-replica NFS setup, RPC message decoding is failing on the 4th gfs brick, since it is getting garbage data from the gfs client, which leads to the connection being closed. This happens too frequently and slows down write operations on the last brick, so rsync appears to hang for large data sets.

It crashes because of an unexpected data size for the finodelk request: we allocate 4k and the data is actually 64k, which is of course due to corrupted packets. Still, we have to handle that gracefully instead of crashing. Because of this corruption we also receive an unexpected RPC_TRANSPORT_MAP_XID_REQUEST event in rpcsvc_notify; we are not handling it there, so that needs to be handled gracefully as well.

The root cause of this problem is that the iob array limit (8) of the iobref structure gets exhausted for setups with more than 3 replicas. In rpc_clnt_submit we fail to add request_iob to the iobref and unref it at the end, so the iobuf gets freed and reused. This causes memory corruption, as the same iobuf ends up being used by two requests. It always fails on the last subvolume because the first 3 subvolumes always get space in the iobref array. iobref_add() is failing and unfortunately we are not checking its return value, otherwise debugging this wouldn't have been so painful. By analyzing the packets I found that the rpc header is getting corrupted for write operations, since for writes we use the same iobref for all subvolumes and pass it to rpc_clnt_submit.

So, to make it work on a 4-replica setup we can increase the array limit; I have tested this by increasing the count for a 4-replica NFS setup. But again, that won't work for more than 4 replicas, so we would have to finalize the count here.
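A minimal sketch of the defensive check described above (illustrative only and not the eventual fix, which is linked in the next comment; variable names are taken from this analysis, and iobuf_get/iobref_add/gf_log are the GlusterFS iobuf and logging helpers):

    /* In rpc_clnt_submit(): check iobref_add() instead of ignoring it, so a
       full iobref (8 slots) fails loudly rather than silently corrupting a
       shared iobuf. */
    struct iobuf *request_iob = NULL;
    int           ret         = -1;

    request_iob = iobuf_get (ctx->iobuf_pool);
    if (!request_iob)
            goto out;

    ret = iobref_add (iobref, request_iob);
    if (ret < 0) {
            gf_log ("rpc-clnt", GF_LOG_ERROR,
                    "cannot add iobuf to iobref (slot array full?)");
            goto out;  /* do not unref and reuse an iobuf another request still holds */
    }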
PATCH: http://patches.gluster.com/patch/6226 in master (rpc/rpc-clnt: don't use iobref passed from caller to hold iobufs required for building program and rpc headers.)
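The patch's one-line summary suggests the approach below: build the rpc and program headers into an iobuf tracked by a new iobref owned by rpc_clnt_submit, instead of the caller-supplied iobref whose eight slots may already be full. A rough sketch of that idea, assumed from the summary rather than copied from the literal patch (iobref_new/iobref_merge/iobref_add/iobref_unref are the GlusterFS iobref helpers):

    /* Sketch: keep the header iobuf in our own iobref rather than the caller's. */
    struct iobref *new_iobref = iobref_new ();
    if (!new_iobref)
            goto out;

    if (iobref)
            iobref_merge (new_iobref, iobref);  /* carry over the caller's payload iobufs */

    iobref_add (new_iobref, request_iob);       /* header iobuf lives in our own iobref */

    /* ... build headers and transmit ... */

    iobref_unref (new_iobref);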
Thank you for the detailed explanation. I will try the new patch and report back in the next few days.
Great, it's working now! I've successfully rsynced over 50 GB of data over NFS onto a replica-4 volume, without any problem. Thank you!
As discussed, verifying this bug with a large file over a 4-replica volume:

    [root@centos-qa-4 ~]# gluster volume info

    Volume Name: dist1
    Type: Distribute
    Status: Started
    Number of Bricks: 2
    Transport-type: tcp
    Bricks:
    Brick1: 10.1.12.109:/mnt/dist1
    Brick2: 10.1.12.110:/mnt/dist1

    Volume Name: rep1
    Type: Distributed-Replicate
    Status: Started
    Number of Bricks: 2 x 4 = 8
    Transport-type: tcp
    Bricks:
    Brick1: 10.1.12.109:/export/rep1
    Brick2: 10.1.12.110:/rep1
    Brick3: 10.1.12.109:/export/rep2
    Brick4: 10.1.12.110:/rep2
    Brick5: 10.1.12.109:/export/rep3
    Brick6: 10.1.12.110:/rep3
    Brick7: 10.1.12.109:/export/rep4
    Brick8: 10.1.12.110:/rep4
    [root@centos-qa-4 ~]#

    [root@centos-qa-3 rep]# cp /root/3.1.3/largefile .
    [root@centos-qa-3 rep]# ls -lia
    total 7820172
                      1 drwxr-xr-x 2 root root       8192 Mar  9  2011 .
                4620289 drwxr-xr-x 7 root root       4096 Mar  8 20:57 ..
     929586040937022622 -rw-r--r-- 1 root root 8000000000 Mar  9  2011 largefile
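An optional extra check, not part of the run above (brick paths are copied from the volume info; which replica set holds the file depends on the distribute hash): listing the file directly on the backend bricks of its replica set should show identical 8000000000-byte copies.

    # on 10.1.12.109 (if largefile hashed to the first replica set)
    ls -l /export/rep1/largefile /export/rep2/largefile
    # on 10.1.12.110
    ls -l /rep1/largefile /rep2/largefile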
*** Bug 2505 has been marked as a duplicate of this bug. ***