Bug 763889 (GLUSTER-2157) - glusterfsd crash
Summary: glusterfsd crash
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-2157
Product: GlusterFS
Classification: Community
Component: core
Version: 3.1.2
Hardware: x86_64
OS: Linux
low
low
Target Milestone: ---
Assignee: Gaurav
QA Contact:
URL:
Whiteboard:
Duplicates: GLUSTER-2505
Depends On:
Blocks:
 
Reported: 2010-11-26 16:16 UTC by Giovanni Toraldo (LiberSoft)
Modified: 2015-12-01 16:45 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: nfs
Documentation: ---
CRM:
Verified Versions:


Attachments
log of first brick (32.14 KB, application/x-gzip)
2010-11-26 13:16 UTC, Giovanni Toraldo (LiberSoft)
log of second brick (439.69 KB, application/x-gzip)
2010-11-26 13:17 UTC, Giovanni Toraldo (LiberSoft)
nebula04: where glusterfsd crashed (45 bytes, text/plain)
2011-01-18 11:06 UTC, Giovanni Toraldo (LiberSoft)
nebula03 (45 bytes, text/plain)
2011-01-18 11:12 UTC, Giovanni Toraldo (LiberSoft)
nebula02 (45 bytes, text/plain)
2011-01-18 11:12 UTC, Giovanni Toraldo (LiberSoft)
nebula01: where nfs client was copying files (via rsync) (45 bytes, text/plain)
2011-01-18 11:13 UTC, Giovanni Toraldo (LiberSoft)
nebula01: nfs client (40 bytes, text/plain)
2011-02-09 06:23 UTC, Giovanni Toraldo (LiberSoft)
nebula04: 2nd brick crashed (45 bytes, text/plain)
2011-02-09 06:24 UTC, Giovanni Toraldo (LiberSoft)

Description Giovanni Toraldo (LiberSoft) 2010-11-26 13:17:46 UTC
Created attachment 392

Comment 1 Giovanni Toraldo (LiberSoft) 2010-11-26 16:16:43 UTC
I've set up distributed/replicated storage across 4 servers, each with 2 bricks. During a copy of some VM image files, both glusterfsd processes on the fourth node crashed.

Comment 2 Giovanni Toraldo (LiberSoft) 2010-12-10 06:52:11 UTC
That was probably a hardware issue; it hasn't occurred again since I fixed a faulty CPU fan.

Sorry for that, feel free to close this issue.

Comment 3 Giovanni Toraldo (LiberSoft) 2010-12-24 06:59:28 UTC
Sorry, but after some further testing the bug is still there. I ran memtest on the fourth machine with no errors reported, so the hardware should be fine.

Is there any test I can run for you?

Comment 4 Gaurav 2011-01-18 01:04:06 UTC
Hi,
   Could you give the exact steps you performed? Also, if it is easily reproducible, could you reproduce it with tracing enabled and provide the client log files?

Gaurav.

Comment 5 Giovanni Toraldo (LiberSoft) 2011-01-18 11:06:14 UTC
Created attachment 422

Comment 6 Giovanni Toraldo (LiberSoft) 2011-01-18 11:12:01 UTC
Created attachment 423

Comment 7 Giovanni Toraldo (LiberSoft) 2011-01-18 11:12:37 UTC
Created attachment 424

Comment 8 Giovanni Toraldo (LiberSoft) 2011-01-18 11:13:28 UTC
Created attachment 425

Comment 9 Giovanni Toraldo (LiberSoft) 2011-01-18 11:19:49 UTC
Volume was created with:

gluster volume create vms replica 4 transport tcp 10.10.10.1:/srv/vol1 10.10.10.2:/srv/vol1 10.10.10.3:/srv/vol1 10.10.10.4:/srv/vol1 10.10.10.1:/srv/vol2 10.10.10.2:/srv/vol2 10.10.10.3:/srv/vol2 10.10.10.4:/srv/vol2


Rsyncing from a local folder to the NFS-mounted Gluster volume on the first node caused a crash within a few minutes (the hardware should be fine).

# gluster --version
glusterfs 3.1.2 built on Jan 16 2011 18:14:56
Repository revision: v3.1.1-64-gf2a067c


# cat /etc/fstab 
# /etc/fstab: static file system information.
#
# <file system> <mount point>   <type>  <options>       <dump>  <pass>
proc            /proc           proc    defaults        0       0
/dev/mapper/vg0-host--root /               ext3    relatime,errors=remount-ro 0       1
/dev/md0        /boot           ext3    defaults        0       2
/dev/mapper/vg0-host--swap none            swap    sw              0       0
/dev/scd0       /media/cdrom0   udf,iso9660 user,noauto     0       0
UUID=852b2401-a64b-4c38-bd67-d90eb9721f17 /srv/vol1 ext4 user_xattr 0 0
UUID=547e764b-d9e7-4a42-9022-cf2c66ea4ded /srv/vol2 ext4 user_xattr 0 0

Comment 10 Gaurav 2011-01-27 06:23:26 UTC
Hi,
   Could you give details such as the number and size of the files you were copying, the memory info ('free -k'), the exact 'rsync' command you ran, and the stack size ('ulimit -s')?

Regards,
Gaurav

Comment 11 Giovanni Toraldo (LiberSoft) 2011-01-27 16:53:38 UTC
I was restoring an OpenNebula VM image backend from a backup (rsync -av). There were a bunch of very small files, each in a folder together with the qemu XML configuration, plus a folder with 7-8 disk images of 8 GB each.

nebula01:~# ulimit -s
8192

nebula01:~# free -k
             total       used       free     shared    buffers     cached
Mem:       4054976    4021928      33048          0     291068    3414012
-/+ buffers/cache:     316848    3738128
Swap:      3903480          0    3903480

The only interesting thing in all this is that with replica 4, glusterfsd on the fourth node (nebula04) crashes *every time*, anywhere from a few minutes after launching rsync to a few hours after starting some KVM virtual machines. It has never happened on nebula01, nebula02, or nebula03.

Another interesting thing is that with replica 3 (using only 3 nodes instead of 4), I never got a glusterfsd crash. Initially I suspected a hardware problem on nebula04, but I had no problems setting up replica 3 using nebula01-03-04 or nebula01-02-04.

On the client side, I clearly remember trying both the NFS mount and the native FUSE client.

Sorry that I cannot run more tests for you; I'm no longer using GlusterFS because of this problem.

Comment 12 Gaurav 2011-01-28 08:56:51 UTC
The most likely reason for the crash is in the function "server_finodelk": somehow "args.volume" is getting an invalid address.

=====>args.volume = alloca (4096);
        if (!xdr_to_finodelk_req (req->msg[0], &args)) {
                //failed to decode msg;
                req->rpc_err = GARBAGE_ARGS;
                goto out;
        }


Now, one possible reason is that "alloca" cannot allocate 4k of memory on the stack and returns an invalid address; "alloca"'s behaviour in that case is undefined.

I was able to reproduce the same issue by explicitly assigning an invalid address to "args.volume".

One possible solution is to use malloc instead of allocating 4k of memory on the stack, as sketched below.
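
A minimal, self-contained sketch of that suggestion (this is not GlusterFS code and not the eventual patch; the names and error handling are illustrative only). The point is that a heap allocation can be checked, so an allocation failure turns into a rejected request instead of the undefined behaviour of a failed alloca:

#include <stdio.h>
#include <stdlib.h>

#define VOLUME_BUF_SIZE 4096

/* Stand-in for the decode step (xdr_to_finodelk_req in the real code);
 * here it simply fills in a volume name and reports success. */
static int decode_request (char *volume_buf)
{
        snprintf (volume_buf, VOLUME_BUF_SIZE, "vms-server");
        return 1;
}

int main (void)
{
        /* calloc instead of alloca (4096): failure is detectable here. */
        char *volume = calloc (1, VOLUME_BUF_SIZE);
        if (!volume) {
                fprintf (stderr, "allocation failed, rejecting request\n");
                return 1;
        }

        if (!decode_request (volume)) {
                fprintf (stderr, "failed to decode msg\n");
                free (volume);
                return 1;
        }

        printf ("finodelk on volume: %s\n", volume);
        free (volume);
        return 0;
}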

Generally we don't test with a replica count higher than 2 or 3.

Are you willing to test it with a patch provided by us?

Comment 13 Giovanni Toraldo (LiberSoft) 2011-01-28 12:09:52 UTC
Yes, I will test any patch you provide, but in a week or two, not before.

Thanks.

Comment 14 Gaurav 2011-02-04 02:34:05 UTC
Hi,
Please try with this patch, 
http://patches.gluster.com/patch/6096/

Regards,
Gaurav

Comment 15 Giovanni Toraldo (LiberSoft) 2011-02-07 12:30:19 UTC
Hi Gaurav,

unfortunately the behavior hasn't changed with your patch: rsync stalled after a few minutes.

This time glusterfsd didn't crash, but I got the same errors:

nebula04-brick2:

[2011-02-07 16:06:40.568602] I [server-handshake.c:535:server_setvolume] vms-server: accepted client from 10.10.10.4:1021
[2011-02-07 16:06:40.589622] I [server-handshake.c:535:server_setvolume] vms-server: accepted client from 10.10.10.2:1005
[2011-02-07 16:06:44.226718] I [server-handshake.c:535:server_setvolume] vms-server: accepted client from 10.10.10.1:1001
[2011-02-07 16:06:44.436381] I [server-handshake.c:535:server_setvolume] vms-server: accepted client from 10.10.10.3:1003
[2011-02-07 16:08:37.595070] E [rpcsvc.c:874:rpcsvc_request_create] rpc-service: RPC call decoding failed
[2011-02-07 16:08:37.595233] I [server.c:428:server_rpc_notify] vms-server: disconnected connection from 10.10.10.1:1001
[2011-02-07 16:08:37.595275] I [server-helpers.c:670:server_connection_destroy] vms-server: destroyed connection of nebula01-26537-2011/02/07-16:06:36:194704-vms-client-7
[2011-02-07 16:08:48.259397] I [server-handshake.c:535:server_setvolume] vms-server: accepted client from 10.10.10.1:1017
[2011-02-07 16:08:59.554105] E [rpcsvc.c:874:rpcsvc_request_create] rpc-service: RPC call decoding failed
[2011-02-07 16:08:59.554255] I [server.c:428:server_rpc_notify] vms-server: disconnected connection from 10.10.10.1:1017

[..]

[2011-02-07 16:09:41.167949] I [server-handshake.c:535:server_setvolume] vms-server: accepted client from 10.10.10.1:1018
[2011-02-07 16:09:41.202843] E [server3_1-fops.c:4427:server_finodelk] : iov_len invalid 65572
[2011-02-07 16:09:41.202925] E [socket.c:1368:__socket_read_frag] rpc: wrong MSG-TYPE (542930224) received from 10.10.10.1:1018
[2011-02-07 16:09:41.202957] E [socket.c:1368:__socket_read_frag] rpc: wrong MSG-TYPE (542930224) received from 10.10.10.1:1018
[2011-02-07 16:09:41.202979] E [socket.c:1368:__socket_read_frag] rpc: wrong MSG-TYPE (542930224) received from 10.10.10.1:1018


And on nfs-client:
[2011-02-07 16:08:37.495145] E [rpcsvc.c:1693:nfs_rpcsvc_submit_generic] nfsrpc: Failed to submit message
[2011-02-07 16:08:37.495202] E [nfs3.c:522:nfs3svc_submit_reply] nfs-nfsv3: Reply submission failed
[2011-02-07 16:08:37.498037] E [rpcsvc.c:1693:nfs_rpcsvc_submit_generic] nfsrpc: Failed to submit message
[2011-02-07 16:08:37.498064] E [nfs3.c:522:nfs3svc_submit_reply] nfs-nfsv3: Reply submission failed
[2011-02-07 16:08:37.505520] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x77) [0x7f5f22a80d17] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x7f5f22a804ae] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f5f22a8040e]))) rpc-clnt: forced unwinding frame type(GlusterFS 3.1) op(WRITE(13)) called at 2011-02-07 16:08:37.496988

If you want, I can provide the full logs for the NFS client and for glusterfsd on the failing node.

Comment 16 Gaurav 2011-02-08 00:00:23 UTC
Yes, please give full log files.

Comment 17 Giovanni Toraldo (LiberSoft) 2011-02-09 06:23:17 UTC
Created attachment 433

Comment 18 Giovanni Toraldo (LiberSoft) 2011-02-09 06:24:12 UTC
Created attachment 434

Comment 19 Giovanni Toraldo (LiberSoft) 2011-02-09 06:26:25 UTC
I don't see any significant differences between these logs and the first ones I attached.

Here are the new NFS client log and the log of the brick that crashed.

However, this bug is very easy for me to trigger; can't you set up your own 4 test bricks with replica 4 on Debian Lenny x86_64?

Comment 20 Giovanni Toraldo (LiberSoft) 2011-02-16 07:47:03 UTC
Hi Gaurav,

I have replicated our setup under 4 virtual machines on another host, and I have the same issues.

If you are interested, I can give you root access to inspect and do your tests.

Thanks.

Comment 21 Gaurav 2011-02-16 08:26:02 UTC
(In reply to comment #20)
> Hi Gaurav,
> 
> I have replicated our setup under 4 virtual machines on another host, and I
> have the same issues.
> 
> If you are interested, I can give you root access to inspect and do your tests.
> 
> Thanks.

Hi,
   I am able to reproduce it. It's failing for big files (8 GB) on a 4-replica setup with NFS.

Thanks,
Gaurav

Comment 22 Gaurav 2011-02-20 13:49:52 UTC
In a 4-replica NFS setup, RPC message decoding fails at the 4th brick because it receives garbage data from the GlusterFS client, which leads to the connection being closed. This happens so frequently that it slows down write operations on the last brick, so rsync appears to hang for large data.
It crashes because of an unexpected data size for the finodelk request: we allocate 4k while the data is actually 64k, which is of course due to corrupted packets. Still, we have to handle that gracefully instead of crashing.
Because of this corruption we also receive an unexpected RPC_TRANSPORT_MAP_XID_REQUEST event in rpcsvc_notify; we do not handle it there, so it needs to be handled gracefully as well.


So the root cause of this problem is that the iob array limit (8) of the iobref structure gets exhausted for setups with more than 3 replicas.
In rpc_clnt_submit we fail to add request_iob to the iobref and still unref it at the end, so the iob gets freed and reused. This causes memory corruption, as the same iob ends up being used by two requests.
It always fails for the last subvolume, because the first 3 subvolumes always get space in the iobref array.
iobref_add() is failing and unfortunately we are not checking its return value; otherwise debugging this wouldn't have been so painful.
By analyzing the packets I found that the RPC header gets corrupted for write operations, since for writes we use the same iobref for all subvolumes and pass it to rpc_clnt_submit.
To make it work on a 4-replica setup we could simply increase the array limit; I have tested this by increasing the count for 4-replica NFS. But again, that won't work for more than 4 replicas, so we have to finalize the count here.
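
A minimal, self-contained sketch of that failure mode (toy types only; ref_holder and holder_add are made-up names, not the real iobref/iobuf structures). It models a fixed 8-slot holder shared across subvolumes: once the slots run out, the add fails, and the return value of that add is the only place the overflow can be caught:

#include <stdio.h>

#define IOB_SLOTS 8

/* Toy fixed-capacity reference holder, mirroring the 8-slot iob array
 * described above. */
struct ref_holder {
        void *slots[IOB_SLOTS];
        int   used;
};

/* Returns 0 on success, -1 when all slots are taken (analogous to a
 * failing iobref_add ()). */
static int holder_add (struct ref_holder *h, void *buf)
{
        if (h->used >= IOB_SLOTS)
                return -1;
        h->slots[h->used++] = buf;
        return 0;
}

int main (void)
{
        struct ref_holder shared = { .used = 0 };
        int bufs[10];   /* pretend: payload buffers plus one rpc header per subvolume */

        for (int i = 0; i < 10; i++) {
                if (holder_add (&shared, &bufs[i]) != 0) {
                        /* This is the branch the unchecked code never saw:
                         * releasing the buffer here while a request still
                         * points at it is what produces the corruption. */
                        fprintf (stderr, "buf %d: shared holder full, keep a "
                                 "separate reference instead\n", i);
                        continue;
                }
                printf ("buf %d: added to shared holder\n", i);
        }
        return 0;
}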

Comment 23 Anand Avati 2011-02-22 07:11:44 UTC
PATCH: http://patches.gluster.com/patch/6226 in master (rpc/rpc-clnt: don't use iobref passed from caller to hold iobufs required for building program and rpc headers.)
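
Read next to the toy sketch in comment 22, the idea in the patch title is roughly the following (again a hedged illustration with made-up names, not the actual patch): the rpc layer tracks its header buffer in a holder it creates itself, so the caller-supplied holder that is shared across subvolumes can no longer be exhausted by per-subvolume headers:

#include <stdio.h>
#include <stdlib.h>

#define IOB_SLOTS 8

/* Same toy types as in the previous sketch. */
struct ref_holder {
        void *slots[IOB_SLOTS];
        int   used;
};

static int holder_add (struct ref_holder *h, void *buf)
{
        if (h->used >= IOB_SLOTS)
                return -1;
        h->slots[h->used++] = buf;
        return 0;
}

/* Sketch of the fixed submit path: the header buffer goes into a holder
 * owned by the rpc layer, not into the caller's shared one. */
static int submit_request (struct ref_holder *caller_ref, void *header_buf)
{
        struct ref_holder *own_ref = calloc (1, sizeof (*own_ref));
        if (!own_ref)
                return -1;

        if (holder_add (own_ref, header_buf) != 0) {
                free (own_ref);
                return -1;
        }

        /* ... transmit using caller_ref (payload) plus own_ref (header),
         * then release own_ref ... */
        (void) caller_ref;
        free (own_ref);
        return 0;
}

int main (void)
{
        struct ref_holder payload_ref = { .used = 0 };
        int header_buf;

        if (submit_request (&payload_ref, &header_buf) == 0)
                printf ("submitted without touching the shared holder\n");
        return 0;
}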

Comment 24 Giovanni Toraldo (LiberSoft) 2011-02-22 07:44:41 UTC
Thank you for the detailed explanation. I will try the new patch and report back in the next few days.

Comment 25 Giovanni Toraldo (LiberSoft) 2011-02-24 12:50:09 UTC
Great, it's working now! I've successfully rsynced over 50 GB of data over NFS onto a replica 4 volume, without any problem.

Thank you!

Comment 26 Saurabh 2011-03-09 10:32:43 UTC
As discussed, verifying this bug with a large file over a 4-replica volume:


[root@centos-qa-4 ~]# gluster volume info

Volume Name: dist1
Type: Distribute
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: 10.1.12.109:/mnt/dist1
Brick2: 10.1.12.110:/mnt/dist1

Volume Name: rep1
Type: Distributed-Replicate
Status: Started
Number of Bricks: 2 x 4 = 8
Transport-type: tcp
Bricks:
Brick1: 10.1.12.109:/export/rep1
Brick2: 10.1.12.110:/rep1
Brick3: 10.1.12.109:/export/rep2
Brick4: 10.1.12.110:/rep2
Brick5: 10.1.12.109:/export/rep3
Brick6: 10.1.12.110:/rep3
Brick7: 10.1.12.109:/export/rep4
Brick8: 10.1.12.110:/rep4
[root@centos-qa-4 ~]# 

[root@centos-qa-3 rep]# cp /root/3.1.3/largefile .

[root@centos-qa-3 rep]# ls -lia
total 7820172
                 1 drwxr-xr-x 2 root root       8192 Mar  9  2011 .
           4620289 drwxr-xr-x 7 root root       4096 Mar  8 20:57 ..
929586040937022622 -rw-r--r-- 1 root root 8000000000 Mar  9  2011 largefile

Comment 27 raf 2011-03-10 13:59:21 UTC
*** Bug 2505 has been marked as a duplicate of this bug. ***

