Bug 763231 (GLUSTER-1499)

Summary: GNFS from mainline Glusterfs-3.1-qa13 crashes while initiating SFS2008
Product: [Community] GlusterFS
Reporter: Prithu Tiwari <prithu>
Component: nfs
Assignee: Shehjar Tikoo <shehjart>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: high
Docs Contact:
Priority: low
Version: 3.1-alpha
CC: gluster-bugs, prithu
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: ---
Regression: RTP
Mount Type: nfs
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:

Description Prithu Tiwari 2010-09-01 05:20:04 EDT
It crashed again under similar circumstances.

Next we tried running with the following option:

export GLUSTERFS_DISABLE_MEM_ACCT=1 on all the bricks

GNFS crashed yet again with a similar backtrace.
Comment 1 Prithu Tiwari 2010-09-01 07:28:04 EDT
SFS crashed while initializing the test.
Tail of the sfsc003.r4k-2 file:
-------------------------------------------------------------------------------
Wed Sep  1 03:56:53 2010 Completed.
Wed Sep  1 03:56:53 2010 Sending DONE-MOUNT message to Prime Client(client3.gluster.priv).
Wed Sep  1 03:56:53 2010 Completed.
Wed Sep  1 03:56:53 2010 Waiting on DO-INIT message from Prime Client(client3.gluster.priv).
Wed Sep  1 03:56:58 2010 Received.

Wed Sep  1 03:56:58 2010 Initializing test directories.
Wed Sep  1 03:56:58 2010 Child 2 will create 1926 directories.
Wed Sep  1 03:56:58 2010 Child 3 will create 1926 directories.
Wed Sep  1 03:56:58 2010 Child 1 will create 1926 directories.
Wed Sep  1 03:56:58 2010 Child 0 will create 1926 directories.
Wed Sep  1 03:57:08 2010 Child 0 finished creating 1926 directories.
Wed Sep  1 03:57:08 2010 Child 0 will create 58925 files.
Wed Sep  1 03:57:08 2010 Child 1 finished creating 1926 directories.
Wed Sep  1 03:57:08 2010 Child 1 will create 58925 files.
lad_write() RPC call failed : RPC: Unable to receive
sfsnfs30: error in lad_write() at 168 for f0yeeeyd.d
sfsnfs30: sending Pid 0 Signal 2
sfsnfs3: caught unexpected SIGCHLD.
A child (PID: 19549) has abnormally exited with status of 65
Tell Prime to stop the benchmark.
Terminating all child processes.
sfsnfs3: sending Pid 0 Signal 2
------------------------------------------------------------------------------

Checked the GNFS server and found that GNFS had crashed with a core dump.
The backtrace of the core was:

------------------------------------------------------------------------------
(gdb) bt
#0  0x00000032df630265 in raise () from /lib64/libc.so.6
#1  0x00000032df631d10 in abort () from /lib64/libc.so.6
#2  0x00000032df6296e6 in __assert_fail () from /lib64/libc.so.6
#3  0x00002b51f378ac7d in __gf_free (free_ptr=<value optimized out>) at mem-pool.c:252
#4  0x00002aaaab362052 in xdr_free_write3args_nocopy (wa=0x2aaab42ef558) at ../../../../xlators/nfs/lib/src/xdr-nfs3.c:1894
#5  0x00002aaaab351270 in nfs3svc_write_vec (req=0x16661938, iob=<value optimized out>) at nfs3.c:1931
#6  0x00002aaaab36120d in nfs_rpcsvc_record_vectored_call_actor (conn=0x161da278) at ../../../../xlators/nfs/lib/src/rpcsvc.c:2206
#7  0x00002aaaab361a66 in nfs_rpcsvc_update_vectored_state (conn=0x161da278) at ../../../../xlators/nfs/lib/src/rpcsvc.c:2263
#8  0x00002aaaab361e38 in nfs_rpcsvc_record_update_state (conn=0x161da278, dataread=1328) at ../../../../xlators/nfs/lib/src/rpcsvc.c:2359
#9  0x00002aaaab362028 in nfs_rpcsvc_conn_data_handler (fd=<value optimized out>, idx=19119, data=0x161da278, poll_in=1, poll_out=128,
    poll_err=0) at ../../../../xlators/nfs/lib/src/rpcsvc.c:2557
#10 0x00002b51f3789d77 in event_dispatch_epoll (event_pool=0x161cf7b8) at event.c:812
#11 0x00002aaaab360492 in nfs_rpcsvc_stage_proc (arg=<value optimized out>) at ../../../../xlators/nfs/lib/src/rpcsvc.c:64
#12 0x00000032dfe06617 in start_thread () from /lib64/libpthread.so.0
#13 0x00000032df6d3c2d in clone () from /lib64/libc.so.6
-------------------------------------------------------------------------------
Comment 2 Prithu Tiwari 2010-09-01 20:28:26 EDT
The crash did not occur when I ran GNFS with the following command:

./dsh tc4 "export GLUSTERFS_DISABLE_MEM_ACCT=1;/opt/gnfs/sbin/glusterfs -f /share/shehjart/volfiles/gnfs-1v-4d.vol -l /tmp/gnnnn3"

The SFS test ran to completion with all 5 iterations finishing.

Performance is slightly lower, but more investigation is needed before that can be confirmed.
Comment 3 Shehjar Tikoo 2010-09-02 00:45:00 EDT
Prithu,

Please run another test with the following patch:

http://dev.gluster.com/~shehjart/0001-nfs3-Free-vectored-write-args-using-FREE-not-GF_FREE.patch


Do not export the environment variable; otherwise this patch will not have any effect.
Comment 4 Shehjar Tikoo 2010-09-02 03:54:49 EDT
I can confirm this happens with a simple compilebench load as well:

configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.1.0git
/lib64/libc.so.6[0x32d5c302d0]
/lib64/libc.so.6(gsignal+0x35)[0x32d5c30265]
/lib64/libc.so.6(abort+0x110)[0x32d5c31d10]
/lib64/libc.so.6(__assert_fail+0xf6)[0x32d5c296e6]
/home/shehjart/glusterfsd-master/lib/libglusterfs.so.0(__gf_free+0x11d)[0x2b9d51b5dc7d]
/home/shehjart/glusterfsd-master/lib/glusterfs/3.1.0git/xlator/nfs/server.so(xdr_free_write3args_nocopy+0x12)[0x2aaaabbda072]
/home/shehjart/glusterfsd-master/lib/glusterfs/3.1.0git/xlator/nfs/server.so(nfs3svc_write_vec+0x90)[0x2aaaabbc9290]
/home/shehjart/glusterfsd-master/lib/glusterfs/3.1.0git/xlator/nfs/server.so(nfs_rpcsvc_record_vectored_call_actor+0x6d)[0x2aaaabbd922d]
/home/shehjart/glusterfsd-master/lib/glusterfs/3.1.0git/xlator/nfs/server.so(nfs_rpcsvc_update_vectored_state+0xf6)[0x2aaaabbd9a86]
/home/shehjart/glusterfsd-master/lib/glusterfs/3.1.0git/xlator/nfs/server.so(nfs_rpcsvc_record_update_state+0x1d8)[0x2aaaabbd9e58]
/home/shehjart/glusterfsd-master/lib/glusterfs/3.1.0git/xlator/nfs/server.so(nfs_rpcsvc_conn_data_handler+0x78)[0x2aaaabbda048]
/home/shehjart/glusterfsd-master/lib/libglusterfs.so.0[0x2b9d51b5cd77]
/home/shehjart/glusterfsd-master/lib/glusterfs/3.1.0git/xlator/nfs/server.so(nfs_rpcsvc_stage_proc+0x12)[0x2aaaabbd84b2]
/lib64/libpthread.so.0[0x32d680673d]
/lib64/libc.so.6(clone+0x6d)[0x32d5cd3d1d]




Compilebench command:


The setup is a 4x3 distributed-replicated configuration.
Comment 5 Vijay Bellur 2010-09-02 05:17:09 EDT
PATCH: http://patches.gluster.com/patch/4482 in master (nfs3: Free vectored write args using FREE not GF_FREE)
Comment 6 Shehjar Tikoo 2010-09-02 05:24:02 EDT
Prithu, do a git pull on mainline. The patch in the previous message fixes this problem.
Comment 7 Prithu Tiwari 2010-09-02 23:05:04 EDT
I patched the code (qa13) with the patch sent, but GNFS crashed again with the following backtrace:
-------------------------------------------------------------------------------
#0  0x00000032df630265 in raise () from /lib64/libc.so.6
#1  0x00000032df631d10 in abort () from /lib64/libc.so.6
#2  0x00000032df66a84b in __libc_message () from /lib64/libc.so.6
#3  0x00000032df6722ef in _int_free () from /lib64/libc.so.6
#4  0x00000032df67273b in free () from /lib64/libc.so.6
#5  0x00002aaaab3512b0 in nfs3svc_write_vec (req=0x2aaaaf6b5038, iob=<value optimized out>) at nfs3.c:1931
#6  0x00002aaaab36124d in nfs_rpcsvc_record_vectored_call_actor (conn=0x1e171048) at ../../../../xlators/nfs/lib/src/rpcsvc.c:2206
#7  0x00002aaaab361aa6 in nfs_rpcsvc_update_vectored_state (conn=0x1e171048) at ../../../../xlators/nfs/lib/src/rpcsvc.c:2263
#8  0x00002aaaab361e78 in nfs_rpcsvc_record_update_state (conn=0x1e171048, dataread=1328) at ../../../../xlators/nfs/lib/src/rpcsvc.c:2359
#9  0x00002aaaab362068 in nfs_rpcsvc_conn_data_handler (fd=<value optimized out>, idx=16109, data=0x1e171048, poll_in=1, poll_out=128, 
    poll_err=0) at ../../../../xlators/nfs/lib/src/rpcsvc.c:2557
#10 0x00002b1e88667d77 in event_dispatch_epoll (event_pool=0x1e15a7b8) at event.c:812
#11 0x00002aaaab3604d2 in nfs_rpcsvc_stage_proc (arg=<value optimized out>) at ../../../../xlators/nfs/lib/src/rpcsvc.c:64
#12 0x00000032dfe06617 in start_thread () from /lib64/libpthread.so.0
#13 0x00000032df6d3c2d in clone () from /lib64/libc.so.6
---------------------------------------------------------------------------------

I will now try with a git pull from mainline.
Comment 8 Shehjar Tikoo 2010-09-02 23:07:48 EDT
(In reply to comment #7)
> I did a patch of the code(qa13) using the patch sent but the GNFS crashed again
> with following back-trace


Yes, that's expected. Mainline has it fixed.
Comment 9 Prithu Tiwari 2010-09-03 07:10:36 EDT
I used GlusterFS mainline from git. SFS still crashed GNFS with
the following backtrace:
----------------------------------------------------------------------
#0  0x00000032df630265 in raise () from /lib64/libc.so.6
#1  0x00000032df631d10 in abort () from /lib64/libc.so.6
#2  0x00000032df6296e6 in __assert_fail () from /lib64/libc.so.6
#3  0x00002b5171057130 in gf_mem_set_acct_info (xl=0x2b5171281560, alloc_ptr=0x40b63fa0, size=48, type=73) at mem-pool.c:88
#4  0x00002b5171057472 in __gf_calloc (nmemb=1, size=<value optimized out>, type=73) at mem-pool.c:140
#5  0x00002aaaab33deb4 in nfs3svc_write_vecsizer (req=0x2aaaaf82c428, readsize=0x40b64010, newbuf=0x40b6401c) at nfs3.c:1902
#6  0x00002aaaab3542db in nfs_rpcsvc_handle_vectored_rpc_call (conn=0x1019ccf8) at ../../../../xlators/nfs/lib/src//rpcsvc.c:2144
#7  0x00002aaaab3548c4 in nfs_rpcsvc_update_vectored_state (conn=0x20fd) at ../../../../xlators/nfs/lib/src//rpcsvc.c:2254
#8  0x00002aaaab354bfc in nfs_rpcsvc_record_update_state (conn=0x1019ccf8, dataread=20) at ../../../../xlators/nfs/lib/src//rpcsvc.c:2357
#9  0x00002aaaab354d68 in nfs_rpcsvc_conn_data_handler (fd=<value optimized out>, idx=8447, data=0x1019ccf8, poll_in=1, poll_out=128, 
    poll_err=0) at ../../../../xlators/nfs/lib/src//rpcsvc.c:2555
#10 0x00002b51710562d7 in event_dispatch_epoll (event_pool=0x10171548) at event.c:812
#11 0x00002aaaab3535b2 in nfs_rpcsvc_stage_proc (arg=<value optimized out>) at ../../../../xlators/nfs/lib/src//rpcsvc.c:64
#12 0x00000032dfe06617 in start_thread () from /lib64/libpthread.so.0
#13 0x00000032df6d3c2d in clone () from /lib64/libc.so.6
---------------------------------------------------------------------------
Comment 10 Shehjar Tikoo 2010-09-06 04:21:25 EDT
Re-opened. Fix is on the way.
Comment 11 Shehjar Tikoo 2010-09-06 04:33:53 EDT
Prithu, please set this to resolved after you've verified that the crash is fixed.
Comment 12 Vijay Bellur 2010-09-06 09:37:47 EDT
PATCH: http://patches.gluster.com/patch/4582 in master (nfsrpc: Set THIS before vector sizing upcall)