Bug 763105 (GLUSTER-1373)

Summary: fileop mkdir fails on 4x3 dist-repl gnfs mount
Product: [Community] GlusterFS
Reporter: Lakshmipathi G <lakshmipathi>
Component: replicate
Assignee: Pavan Vilas Sondur <pavan>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: low
Version: 3.1-alpha
CC: gluster-bugs, shehjart, vijay
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Regression: RTP
Mount Type: nfs
Attachments:
  gnfs log file (flags: none)

Description Lakshmipathi G 2010-08-16 09:47:47 UTC
Running fileop with nfs_beta_rc10 (4x3 distributed replicate) fails with all performance translators enabled, but this did not crash the server.

# /opt/qa/tools-32bit/fileop -f 30 -t

Fileop:  Working in ., File size is 1,  Output is in Ops/sec. (A=Avg, B=Best, W=Worst)
Mkdir failed

Comment 1 Lakshmipathi G 2010-08-16 10:23:43 UTC
Tested again with the same setup, nfs_beta_rc10 (4x3 distributed replicate) with all
performance translators enabled. This time it passes.

Comment 2 Shehjar Tikoo 2010-09-04 05:49:29 UTC
With 3.1qa11, it still fails, with glusterfsd crashing in replicate.

fileop -d /mnt/nfs-master-4dist-3repl/fileoptest/ -f  10000
Fileop:  Working in /mnt/nfs-master-4dist-3repl/fileoptest, File size is 1,  Output is in Ops/sec. (A=Avg, B=Best, W=Worst)
 .       mkdir   chdir   rmdir  create    open    read   write   close    stat  access   chmod readdir  link    unlink  delete  Total_files


Mkdir failed

The crash trace:
Core was generated by `/home/shehjart/glusterfsd-master/sbin/glusterfs -f /home/shehjart/volfiles/nfs-'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000032d680b722 in pthread_spin_lock () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00000032d680b722 in pthread_spin_lock () from /lib64/libpthread.so.0
#1  0x00002b28015ee20b in fd_unref (fd=0x2aaaae909ad0) at fd.c:467
#2  0x00002aaaaad15686 in afr_local_cleanup (local=0x2aaae0ed16c8, this=0x1a17c548) at afr-common.c:353
#3  0x00002aaaaacee23c in afr_fstat_cbk (frame=0x2b28024c1cb8, cookie=<value optimized out>, this=<value optimized out>, op_ret=0, 
    op_errno=0, buf=0x7fffc9d8cf10) at afr-inode-read.c:351
#4  0x00002aaaaaacac1a in client3_1_fstat_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, 
    myframe=0x2b28024c15b8) at client3_1-fops.c:1042
#5  0x00002b280182bc95 in rpc_clnt_handle_reply (clnt=<value optimized out>, pollin=<value optimized out>) at rpc-clnt.c:734
#6  0x00002b280182be78 in rpc_clnt_notify (trans=<value optimized out>, mydata=0x1a19dca8, event=<value optimized out>, data=0x1a169190)
    at rpc-clnt.c:844
#7  0x00002b28018272cc in rpc_transport_notify (this=0xaaaaaab2, event=RPC_TRANSPORT_ACCEPT, data=0x1a169190) at rpc-transport.c:1124
#8  0x00002aaaaed21c2f in socket_event_poll_in (this=0x1a19de88) at socket.c:1561
#9  0x00002aaaaed21dc0 in socket_event_handler (fd=<value optimized out>, idx=10, data=0x1a19de88, poll_in=1, poll_out=0, poll_err=0)
    at socket.c:1675
#10 0x00002b28015efd77 in event_dispatch_epoll_handler (event_pool=0x1a16ab08) at event.c:812
#11 event_dispatch_epoll (event_pool=0x1a16ab08) at event.c:876
#12 0x000000000040470d in main (argc=8, argv=0x7fffc9d8d7c8) at glusterfsd.c:1398

Comment 3 Vijay Bellur 2010-09-18 03:28:10 UTC
Can this be tested with qa26 please?

Comment 4 Shehjar Tikoo 2010-09-18 04:45:14 UTC
(In reply to comment #3)
> Can this be tested with qa26 please?

I've given a small test plan to Prithu and am working with him to re-run the nfs tests with the gfid changes. This is part of that.

Comment 5 Shehjar Tikoo 2010-09-20 04:16:08 UTC
fileop has a really bad error reporting mechanism, so it doesn't actually tell you what the error was. "Mkdir failed" doesn't tell me anything.
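For comparison, here is a minimal Python sketch of the kind of reporting that would have helped: it attempts the mkdir and, on failure, returns the symbolic errno name along with the system's message. This is not how fileop works; the function name describe_mkdir_failure is made up for illustration.

```python
import errno
import os


def describe_mkdir_failure(path):
    """Try to create `path`; return None on success, else a readable error string."""
    try:
        os.mkdir(path)
    except OSError as exc:
        # errno.errorcode maps the numeric code to its symbolic name (e.g. EIO, EEXIST).
        name = errno.errorcode.get(exc.errno, "UNKNOWN")
        return f"mkdir({path!r}) failed: {name} (errno {exc.errno}): {exc.strerror}"
    return None
```

With output like this, an EIO from an NFS timeout would be immediately distinguishable from, say, EEXIST or ENOSPC.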

I tried it, and the reason mkdir failed for me with exactly the same command is that the OOM killer killed the gluster nfs process. This points to a known memory leak filed as bug 762991.
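One way to watch for this kind of leak before the OOM killer steps in is to sample the resident set size of the gluster nfs process while fileop runs. A rough Python sketch, assuming a Linux host where /proc/<pid>/status carries a VmRSS line; the helper names are hypothetical and not part of GlusterFS or fileop:

```python
import re


def vm_rss_kb(status_text):
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid):
    """Read the current resident set size of `pid`; None if it cannot be read."""
    try:
        with open(f"/proc/{pid}/status") as fh:
            return vm_rss_kb(fh.read())
    except OSError:
        # The process may have exited (or been killed) between samples.
        return None
```

Calling read_rss_kb() on the nfs process every few seconds during the run and seeing the value climb steadily would be consistent with a leak; a None result means the process is gone.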

The OOM kill must be resulting in fileop receiving an EIO on timeout, because we must have mounted with soft,intr as mount options.
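Whether the mount really was soft can be checked rather than guessed. A small Python sketch, assuming a Linux client whose /proc/mounts lists the options for the NFS mount; the function names are made up for illustration:

```python
def mount_options(mount_point, mounts_text):
    """Return the option set for `mount_point` from /proc/mounts-style text, or None."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line is: device mount_point fstype options dump pass
        if len(fields) >= 4 and fields[1] == mount_point:
            found = set(fields[3].split(","))  # keep the last matching entry
    return found


def is_soft_mount(mount_point, mounts_text):
    """True if the mount lists 'soft', False if it does not, None if not mounted."""
    opts = mount_options(mount_point, mounts_text)
    return None if opts is None else "soft" in opts
```

Passing the contents of /proc/mounts and the mount point (for example /mnt/nfs-4dist-master) tells you whether the EIO-on-timeout explanation is even possible for that mount.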

Command line used:
fileop -d /mnt/nfs-4dist-master -f  10000

Retrying without those options to see whether mkdir still fails when no timeout is in effect.

Comment 6 Vijay Bellur 2010-09-29 14:57:14 UTC
Moving this to 3.1.1, as 3-replica tests can be done after 3.1.0.

Comment 7 Shehjar Tikoo 2010-09-30 02:00:20 UTC
Closing. All the tests that I am running and having Prithu run are 4x3 dist-repl. We haven't seen this problem at all lately. See the previous comments for the reason.