Description of problem:
nfs-ganesha-gluster is consistently crashing on a fairly straightforward distributed iozone workload.

Version-Release number of selected component (if applicable):
glusterfs-3.7dev-0.994.gitf522001.el6.x86_64
glusterfs-ganesha-3.7dev-0.994.gitf522001.el6.x86_64
nfs-ganesha-gluster-2.2_dev_23-1.el6.x86_64
nfs-ganesha-vfs-2.2_dev_23-1.el6.x86_64
nfs-ganesha-2.2_dev_23-1.el6.x86_64
kernel-2.6.32-504.3.3.el6.x86_64

How reproducible:
Consistently. Issues seen range from iozone hangs to nfs-ganesha crashes; I have not had a single successful run in more than 5 tries.

Steps to Reproduce:
1. Create a single-brick gluster volume and export it with nfs-ganesha. Mount the exported volume on the clients.
2. Run iozone in distributed mode on the clients:
   iozone -+m <distributed iozone conf file> -+h <main_client> -i 0 -w -+n -c -C -e -s 10g -r 64k -t 12

Actual results:
ganesha.nfsd crashes at some point in the test. Some lines from /var/log/messages:

Apr 23 01:48:18 gprfs046 kernel: ganesha.nfsd[7864]: segfault at 200000108 ip 00000039ad0097a6 sp 00007fab1e1f9630 error 4 in libgfapi.so.0.0.0[39ad000000+1a000]
Apr 23 01:48:29 gprfs046 abrt[8043]: Saved core dump of pid 7845 (/usr/bin/ganesha.nfsd) to /var/spool/abrt/ccpp-2015-04-23-01:48:18-7845 (2133598208 bytes)

Expected results:
The test should complete with 120g of data stored in 12 files in the volume.

Additional info:
Simplifying the steps to reproduce (eliminating multiple clients and distributed iozone), I am able to reproduce the crash in a very simple config:

1. Single-brick gluster volume, exported with nfs-ganesha.
2. Single client mounting the ganesha-exported volume (on /mnt/glustervol in the example below).
3. Run iozone on the client, like this:
   iozone -i 0 -w -+n -c -C -e -s 8G -r 64k -t 8 -F /mnt/glustervol/f{1,2,3,4,5,6,7,8}-8g.ioz

So that's 8 iozone threads, writing 8g each.
nfs version in my script was set to 3. With nfs v4, I'm seeing much better behavior. I've had multiple runs of the tests above -- both single-client iozone and distributed iozone -- complete with nfs version set to 4.
Created attachment 1022158 [details] packet_trace
I tried to reproduce the same with a v3 mount. There is no crash in my setup, but it just hangs. There are no notable errors in the ganesha and gfapi logs, and the volume is still exported via the ganesha server. In the packet trace, I found that the client keeps sending the request, but the server replies with a TCP zero window size, and this repeats in a loop. I have attached the packet trace.

If anyone needs to look into my setup:
server: heka-client-13.gdev.lab.eng.bos.redhat.com, exporting volume "vol"
client: heka-client-15.gdev.lab.eng.bos.redhat.com, mount point "/mnt/nfs/1/"
As it says in comment 0, I have seen ganesha crash many times in my testing. I saved a core dump and have made it available for analysis. This looks like a seg fault encountered in libgfapi. Any leads from the coredump analysis?
I can't analyze the core. bt gives me:

#0  0x00000039ad0097a6 in ?? ()
#1  0x00007faaf81af380 in ?? ()
#2  0x00000039ad00e0b4 in ?? ()
#3  0xa9eeb31fb78cd404 in ?? ()
#4  0x014f7beed34de417 in ?? ()
#5  0x04d48cb71fb3eea9 in ?? ()
#6  0x000000000000fd03 in ?? ()
#7  0x0001030000000001 in ?? ()
#8  0x0000000000000001 in ?? ()
#9  0x0000000000000000 in ?? ()

I am using the same version of ganesha (2.2-dev-23).

And one more thing: the iozone test completed today (after 5 days) in my setup.

iozone -i 0 -w -+n -c -C -e -s 512M -r 64k -t 4 -F /mnt/nfs/1/f{1,2,3,4}-.ioz

    Iozone: Performance Test of File I/O
            Version $Revision: 3.429 $
            Compiled for 64 bit mode.
            Build: linux-AMD64

    Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins,
                  Al Slater, Scott Rhine, Mike Wisner, Ken Goss,
                  Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
                  Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
                  Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
                  Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
                  Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
                  Vangel Bojaxhi, Ben England, Vikentsi Lapa.

    Run began: Thu Apr 30 02:38:27 2015

    Setting no_unlink
    No retest option selected
    Include close in write timing
    Include fsync in write timing
    File size set to 524288 kB
    Record Size 64 kB
    Command line used: iozone -i 0 -w -+n -c -C -e -s 512M -r 64k -t 4 -F /mnt/nfs/1/f1-.ioz /mnt/nfs/1/f2-.ioz /mnt/nfs/1/f3-.ioz /mnt/nfs/1/f4-.ioz
    Output is in kBytes/sec
    Time Resolution = 0.000001 seconds.
    Processor cache size set to 1024 kBytes.
    Processor cache line size set to 32 bytes.
    File stride size set to 17 * record size.
    Throughput test with 4 processes
    Each process writes a 524288 kByte file in 64 kByte records

    Children see throughput for 4 initial writers =       3.74 kB/sec
    Parent sees throughput for 4 initial writers  =       3.73 kB/sec
    Min throughput per process                    =       0.84 kB/sec
    Max throughput per process                    =       1.02 kB/sec
    Avg throughput per process                    =       0.94 kB/sec
    Min xfer                                      =  433920.00 kB
    Child[0] xfer count =  492992.00 kB, Throughput =       0.96 kB/sec
    Child[1] xfer count =  433920.00 kB, Throughput =       0.84 kB/sec
    Child[2] xfer count =  524288.00 kB, Throughput =       1.02 kB/sec
    Child[3] xfer count =  481792.00 kB, Throughput =       0.93 kB/sec

iozone test complete.
Upgraded to:
glusterfs*-3.7.0beta1-0.3.git7aeae00.el6.x86_64
nfs-ganesha*-2.2.0-0.el6.x86_64
Debuginfo installed as well.

Was able to reproduce the crash with single-client iozone runs, same as the one in comment 2. gdb on the coredump gives:

(gdb) bt
#0  glfs_iatt_to_stat (fs=0x0, iatt=0x7f7cab9da5e0, stat=0x7f7cab9da6b8) at glfs-fops.c:58
#1  0x00007f7ccb8bc0b4 in pub_glfs_fstat (glfd=0x7f7c84002270, stat=0x7f7cab9da6b8) at glfs-fops.c:292
#2  0x00007f7ccbacf956 in getattrs (obj_hdl=0x7f7c680022d8) at /usr/src/debug/nfs-ganesha-2.2.0-0.1.1-Source/FSAL/FSAL_GLUSTER/handle.c:661
#3  0x00000000004e79c8 in cache_inode_refresh_attrs (entry=0x7f7c680025a0) at /usr/src/debug/nfs-ganesha-2.2.0-0.1.1-Source/include/cache_inode.h:967
#4  0x00000000004e8d3b in cache_inode_rdwr_plus (entry=0x7f7c680025a0, io_direction=CACHE_INODE_WRITE, offset=104857600, io_size=1048576, bytes_moved=0x7f7cab9dbdc8, buffer=0x7f7c400c1290, eof=0x7f7cab9dbdc7, sync=0x7f7cab9dbdc6, info=0x0) at /usr/src/debug/nfs-ganesha-2.2.0-0.1.1-Source/cache_inode/cache_inode_rdwr.c:271
#5  0x00000000004e9111 in cache_inode_rdwr (entry=0x7f7c680025a0, io_direction=CACHE_INODE_WRITE, offset=104857600, io_size=1048576, bytes_moved=0x7f7cab9dbdc8, buffer=0x7f7c400c1290, eof=0x7f7cab9dbdc7, sync=0x7f7cab9dbdc6) at /usr/src/debug/nfs-ganesha-2.2.0-0.1.1-Source/cache_inode/cache_inode_rdwr.c:304
#6  0x0000000000461b22 in nfs3_write (arg=0x7f7c40000f38, worker=0x7f7ca40008c0, req=0x7f7c40000df8, res=0x7f7ca40022f0) at /usr/src/debug/nfs-ganesha-2.2.0-0.1.1-Source/Protocols/NFS/nfs3_write.c:234
#7  0x00000000004574fc in nfs_rpc_execute (req=0x7f7c40479910, worker_data=0x7f7ca40008c0) at /usr/src/debug/nfs-ganesha-2.2.0-0.1.1-Source/MainNFSD/nfs_worker_thread.c:1268
#8  0x0000000000458296 in worker_run (ctx=0x2824ed0) at /usr/src/debug/nfs-ganesha-2.2.0-0.1.1-Source/MainNFSD/nfs_worker_thread.c:1535
#9  0x000000000051f272 in fridgethr_start_routine (arg=0x2824ed0) at /usr/src/debug/nfs-ganesha-2.2.0-0.1.1-Source/support/fridgethr.c:562
#10 0x0000003cc86079d1 in start_thread () from /lib64/libpthread.so.0
#11 0x0000003cc82e89dd in clone () from /lib64/libc.so.6
(gdb)
Created attachment 1023433 [details] ganesha conf file
(In reply to Manoj Pillai from comment #9)
> Created attachment 1023433 [details]
> ganesha conf file

This is the ganesha conf file used for the test in comment 8.
Summary
========
Debugged and have the fix.

Problem technical details / Fix
===============================
There is a race between getattrs() and a close() on a cache entry object, which leads to the glfd getting wiped out from under the feet of getattrs(), causing the crash.

This is not a bug in ganesha or gfapi, but is due to an optimized way in which we try to code up getattrs() in FSAL_GLUSTER. If a given glfd is valid and the loflags do not indicate a FSAL_O_CLOSED state, we try to use glfs_fstat(), which unfortunately needs objhandle->glfd to be valid. I believe the logic in doing this was to avoid the resolve/lookup under the covers for getattrs(). However, it doesn't look feasible, as cache_inode_close() -> file_close() -> glfs_close() can sneak in and destroy the glfd.

The fix I have is to simply fall back to using glfs_h_stat() for all getattrs in FSAL_GLUSTER. (A small sketch of the shape of this change follows after this comment.)

Testbed State
==============
I have installed the fix in Manoj's testbed. It is available under /root/ganesha-bin. Before starting ganesha.nfsd, one needs to set LD_LIBRARY_PATH to /root/ganesha-bin/lib64:$LD_LIBRARY_PATH and then start the server.

Ran iozone with the config Manoj had specified for NFSv3. I'd appreciate it if Manoj could give this a spin to see if he has any comments on the fix/stability etc. I will wait for a day before pushing the fix upstream into nfs-ganesha tag V2.3-dev-2 or V2.3-dev-3.

Sanity Testing for the fix
===========================
"iozone -i 0 -w -+n -c -C -e -s 8G -r 64k -t 8 -F /mnt/glustervol/f{1,2,3,4,5,6,7,8}-8g.ioz" runs successfully for both V3 and V4.

Output (sorry, slightly long):
==============================
V3:
[root@gprfc066 ~]# /root/mpillai/bin/iozone -i 0 -w -+n -c -C -e -s 8G -r 64k -t 8 -F /mnt/glustervol/f{1,2,3,4,5,6,7,8}-8g.ioz

    Iozone: Performance Test of File I/O
            Version $Revision: 3.420 $
            Compiled for 64 bit mode.
            Build: linux-AMD64

    Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins,
                  Al Slater, Scott Rhine, Mike Wisner, Ken Goss,
                  Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
                  Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
                  Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
                  Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
                  Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
                  Vangel Bojaxhi, Ben England, Vikentsi Lapa.

    Run began: Wed May 13 04:25:03 2015

    Setting no_unlink
    No retest option selected
    Include close in write timing
    Include fsync in write timing
    File size set to 8388608 KB
    Record Size 64 KB
    Command line used: /root/mpillai/bin/iozone -i 0 -w -+n -c -C -e -s 8G -r 64k -t 8 -F /mnt/glustervol/f1-8g.ioz /mnt/glustervol/f2-8g.ioz /mnt/glustervol/f3-8g.ioz /mnt/glustervol/f4-8g.ioz /mnt/glustervol/f5-8g.ioz /mnt/glustervol/f6-8g.ioz /mnt/glustervol/f7-8g.ioz /mnt/glustervol/f8-8g.ioz
    Output is in Kbytes/sec
    Time Resolution = 0.000001 seconds.
    Processor cache size set to 1024 Kbytes.
    Processor cache line size set to 32 bytes.
    File stride size set to 17 * record size.
    Throughput test with 8 processes
    Each process writes a 8388608 Kbyte file in 64 Kbyte records

    Children see throughput for 8 initial writers =  417587.90 KB/sec
    Parent sees throughput for 8 initial writers  =  409992.15 KB/sec
    Min throughput per process                    =   51458.87 KB/sec
    Max throughput per process                    =   53289.76 KB/sec
    Avg throughput per process                    =   52198.49 KB/sec
    Min xfer                                      = 8131776.00 KB
    Child[0] xfer count = 8319488.00 KB, Throughput =   52967.62 KB/sec
    Child[1] xfer count = 8155264.00 KB, Throughput =   51922.92 KB/sec
    Child[2] xfer count = 8140864.00 KB, Throughput =   51458.87 KB/sec
    Child[3] xfer count = 8298560.00 KB, Throughput =   52468.45 KB/sec
    Child[4] xfer count = 8388608.00 KB, Throughput =   53289.76 KB/sec
    Child[5] xfer count = 8131776.00 KB, Throughput =   51777.37 KB/sec
    Child[6] xfer count = 8184896.00 KB, Throughput =   52100.61 KB/sec
    Child[7] xfer count = 8171840.00 KB, Throughput =   51602.31 KB/sec

V4:
[root@gprfc066 ~]# /root/mpillai/bin/iozone -i 0 -w -+n -c -C -e -s 8G -r 64k -t 8 -F /mnt/glustervol/f{1,2,3,4,5,6,7,8}-8g.ioz

    Iozone: Performance Test of File I/O
            Version $Revision: 3.420 $
            Compiled for 64 bit mode.
            Build: linux-AMD64

    Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins,
                  Al Slater, Scott Rhine, Mike Wisner, Ken Goss,
                  Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
                  Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
                  Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
                  Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
                  Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
                  Vangel Bojaxhi, Ben England, Vikentsi Lapa.

    Run began: Wed May 13 05:27:20 2015

    Setting no_unlink
    No retest option selected
    Include close in write timing
    Include fsync in write timing
    File size set to 8388608 KB
    Record Size 64 KB
    Command line used: /root/mpillai/bin/iozone -i 0 -w -+n -c -C -e -s 8G -r 64k -t 8 -F /mnt/glustervol/f1-8g.ioz /mnt/glustervol/f2-8g.ioz /mnt/glustervol/f3-8g.ioz /mnt/glustervol/f4-8g.ioz /mnt/glustervol/f5-8g.ioz /mnt/glustervol/f6-8g.ioz /mnt/glustervol/f7-8g.ioz /mnt/glustervol/f8-8g.ioz
    Output is in Kbytes/sec
    Time Resolution = 0.000001 seconds.
    Processor cache size set to 1024 Kbytes.
    Processor cache line size set to 32 bytes.
    File stride size set to 17 * record size.

    Throughput test with 8 processes
    Each process writes a 8388608 Kbyte file in 64 Kbyte records

    Children see throughput for 8 initial writers =  411434.73 KB/sec
    Parent sees throughput for 8 initial writers  =  405304.36 KB/sec
    Min throughput per process                    =   50801.98 KB/sec
    Max throughput per process                    =   52086.17 KB/sec
    Avg throughput per process                    =   51429.34 KB/sec
    Min xfer                                      = 8163392.00 KB
    Child[0] xfer count = 8163392.00 KB, Throughput =   50801.98 KB/sec
    Child[1] xfer count = 8373632.00 KB, Throughput =   51973.79 KB/sec
    Child[2] xfer count = 8182400.00 KB, Throughput =   50802.51 KB/sec
    Child[3] xfer count = 8242432.00 KB, Throughput =   51199.33 KB/sec
    Child[4] xfer count = 8388608.00 KB, Throughput =   52086.17 KB/sec
    Child[5] xfer count = 8382144.00 KB, Throughput =   52039.33 KB/sec
    Child[6] xfer count = 8175744.00 KB, Throughput =   50879.82 KB/sec
    Child[7] xfer count = 8303680.00 KB, Throughput =   51651.80 KB/sec

iozone test complete.
Also note that this issue is not seen with other FSALs (Malahal from IBM was kind enough to test it in his testbed, and no crash was seen with FSAL_GPFS).

Thanks,
Anand
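For reference, here is a rough sketch of the shape of the change described in comment 11. This is only an illustrative sketch, not the actual patch: the struct and its field names (fs, glhandle, glfd, openflags) are stand-ins for the FSAL_GLUSTER object handle members referred to above, while glfs_fstat() and glfs_h_stat() are the public gfapi calls involved.

    #include <sys/stat.h>
    #include <glusterfs/api/glfs.h>
    #include <glusterfs/api/glfs-handles.h>

    /* Illustrative stand-in for the FSAL_GLUSTER object handle. */
    struct gluster_handle_sketch {
            struct glfs *fs;              /* glfs instance for the export */
            struct glfs_object *glhandle; /* handle-based object reference */
            struct glfs_fd *glfd;         /* can be destroyed by a concurrent close() */
            int openflags;                /* e.g. FSAL_O_CLOSED when not open */
    };

    /* Old, optimized path: avoids the resolve/lookup, but races with
     * cache_inode_close() -> glfs_close(), which can destroy glfd while
     * glfs_fstat() is still using it. */
    static int getattrs_racy(struct gluster_handle_sketch *h, struct stat *st)
    {
            if (h->glfd != NULL && h->openflags != 0 /* != FSAL_O_CLOSED */)
                    return glfs_fstat(h->glfd, st);
            return glfs_h_stat(h->fs, h->glhandle, st);
    }

    /* Fix described in comment 11: always go through the object handle,
     * which does not depend on the glfd staying open. */
    static int getattrs_fixed(struct gluster_handle_sketch *h, struct stat *st)
    {
            return glfs_h_stat(h->fs, h->glhandle, st);
    }

The trade-off is the extra resolve/lookup that the glfd path was trying to avoid, in exchange for not depending on the lifetime of a descriptor that another thread may close.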
(In reply to Anand Subramanian from comment #11)

I'm continuing to see crashes, so as a first step can you (Anand) check whether the ganesha server that is currently running on my setup has your fix? I set LD_LIBRARY_PATH as in comment 11, but better to make sure.

[I'm wondering though: why would cache_inode_close() get called in the middle of a write test on NFSv3? The protocol itself does not have a close. Maybe this is not _the_ problem we are hitting, although it sounds like a good fix to have.]
Answering your second question first while I continue to look at your setup: cache_inode_close() has nothing to do with protocol-specific RPCs like OPEN or CLOSE. It is a piece of infrastructure specific to nfs-ganesha. Whenever a backend fs object needs to be accessed, it is accessed via an intermediate entry in the cache inode layer, and this cache inode entry needs to be maintained in conjunction with any fops on the file in the backend filesystem. If you are interested, here is the stack:

Thread 1 (Thread 0x7f7cab9dd700 (LWP 5797)):
#0  glfs_iatt_to_stat (fs=0x0, iatt=0x7f7cab9da5e0, stat=0x7f7cab9da6b8) at glfs-fops.c:58
#1  0x00007f7ccb8bc0b4 in pub_glfs_fstat (glfd=0x7f7c84002270, stat=0x7f7cab9da6b8) at glfs-fops.c:292
#2  0x00007f7ccbacf956 in getattrs (obj_hdl=0x7f7c680022d8) at /usr/src/debug/nfs-ganesha-2.2.0-0.1.1-Source/FSAL/FSAL_GLUSTER/handle.c:661
#3  0x00000000004e79c8 in cache_inode_refresh_attrs (entry=0x7f7c680025a0) at /usr/src/debug/nfs-ganesha-2.2.0-0.1.1-Source/include/cache_inode.h:967
#4  0x00000000004e8d3b in cache_inode_rdwr_plus (entry=0x7f7c680025a0, io_direction=CACHE_INODE_WRITE, offset=104857600, io_size=1048576, bytes_moved=0x7f7cab9dbdc8, buffer=0x7f7c400c1290, eof=0x7f7cab9dbdc7, sync=0x7f7cab9dbdc6, info=0x0) at /usr/src/debug/nfs-ganesha-2.2.0-0.1.1-Source/cache_inode/cache_inode_rdwr.c:271
#5  0x00000000004e9111 in cache_inode_rdwr (entry=0x7f7c680025a0, io_direction=CACHE_INODE_WRITE, offset=104857600, io_size=1048576, bytes_moved=0x7f7cab9dbdc8, buffer=0x7f7c400c1290, eof=0x7f7cab9dbdc7, sync=0x7f7cab9dbdc6) at /usr/src/debug/nfs-ganesha-2.2.0-0.1.1-Source/cache_inode/cache_inode_rdwr.c:304
#6  0x0000000000461b22 in nfs3_write (arg=0x7f7c40000f38, worker=0x7f7ca40008c0, req=0x7f7c40000df8, res=0x7f7ca40022f0) at /usr/src/debug/nfs-ganesha-2.2.0-0.1.1-Source/Protocols/NFS/nfs3_write.c:234
#7  0x00000000004574fc in nfs_rpc_execute (req=0x7f7c40479910, worker_data=0x7f7ca40008c0) at /usr/src/debug/nfs-ganesha-2.2.0-0.1.1-Source/MainNFSD/nfs_worker_thread.c:1268
#8  0x0000000000458296 in worker_run (ctx=0x2824ed0) at /usr/src/debug/nfs-ganesha-2.2.0-0.1.1-Source/MainNFSD/nfs_worker_thread.c:1535
#9  0x000000000051f272 in fridgethr_start_routine (arg=0x2824ed0) at /usr/src/debug/nfs-ganesha-2.2.0-0.1.1-Source/support/fridgethr.c:562
#10 0x0000003cc86079d1 in start_thread () from /lib64/libpthread.so.0
#11 0x0000003cc82e89dd in clone () from /lib64/libc.so.6

In cache_inode_rdwr_plus, the code path does a cache_inode_open(). That function in turn does a reopen if the filesystem supports it; otherwise it does a close() and then an open(), at the cost of losing associated locks. But that is the way it works, or has to work (the cache inode layer). This is where your close() comes from. (A small sketch of this open/reopen decision follows below.) HTH.
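A minimal sketch of the open-or-reopen decision described above, based only on the description in this comment and not on the actual cache_inode code; the types and helper names are hypothetical. The point it illustrates is that a plain NFSv3 WRITE can still end up in a close()+open() cycle on the cached entry, and that close() is what can destroy the glfd a concurrent getattrs() is using.

    /* Hypothetical types and helpers, named only for illustration. */
    struct cache_entry_sketch {
            int is_open;
            int openflags;
    };

    static int flags_compatible(int have, int want) { return (have & want) == want; }
    static int fsal_supports_reopen(void) { return 0; }
    static int fsal_reopen(struct cache_entry_sketch *e, int flags) { e->openflags = flags; return 0; }
    /* In FSAL_GLUSTER the close maps to glfs_close() underneath. */
    static void fsal_close(struct cache_entry_sketch *e) { e->is_open = 0; }
    static int fsal_open(struct cache_entry_sketch *e, int flags) { e->is_open = 1; e->openflags = flags; return 0; }

    /* Shape of the decision made on the write path
     * (cache_inode_rdwr -> cache_inode_open), as described above. */
    static int cache_inode_open_sketch(struct cache_entry_sketch *entry, int wanted)
    {
            if (entry->is_open && flags_compatible(entry->openflags, wanted))
                    return 0;                           /* already open with the right flags */

            if (fsal_supports_reopen())
                    return fsal_reopen(entry, wanted);  /* reopen in place, keeps locks */

            /* Fallback: close then open, at the cost of losing associated locks.
             * It is this close() that can invalidate the glfd another worker
             * thread has already captured for glfs_fstat(). */
            fsal_close(entry);
            return fsal_open(entry, wanted);
    }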
Just to bring the thread to a conclusion: Manoj was using the older version of ganesha.nfsd and hence was still seeing the crashes. I have updated the process to pick up the newer version; he will now be running fresh tests to try out the fix.

Thanks,
Anand.
With the new version, I've not seen any crash/hang in my runs so far. I tried both the simplified single-client test in comment 2, as well as the original multiple-client, distributed iozone test from comment 0. Both ran fine. Great!
Based on https://bugzilla.redhat.com/show_bug.cgi?id=1214616#c16 moving this BZ to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1495.html