Created attachment 1022108 [details]
core file

Description of problem:
If the NFS client tries to write to a file that exceeds the brick size, the ganesha daemon crashes and dumps core.

Version-Release number of selected component (if applicable):
mainline sources installed from latest ganesha (2.3dev-1) and gluster (mainline)

How reproducible:
Always

Steps to Reproduce:
1. Mount the exported volume using nfs-ganesha v3.
2. Perform a write operation that exceeds the brick size:
   dd if=/dev/zero of=/mnt/nfs/2/file1 bs=11000K count=1000 conv=sync
   which writes ~11GB of data to a brick of size 10GB.

Actual results:
The ganesha daemon crashes after the brick size limit is reached.

Expected results:
The write should proceed until the brick is full, then stop and return an input/output error.

Additional info:
The crash did not occur with gluster-nfs, ganesha v4, or a native mount.

Volume configuration:

Volume Name: test
Type: Distribute
Volume ID: f84a1ad8-2b19-42ae-89b6-d0b8243410be
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 10.70.42.219:/brick/test
Options Reconfigured:
features.cache-invalidation: off
nfs.disable: on

Size of brick = size of volume:
10.70.42.219:/test    9.8G  3.2G  6.7G  33% /mnt/nfs/2

NFS client: fedora20 (3.17.6-200.fc20.x86_64)

Backtrace of the core:
#0  0x00000030b540e273 in __glfs_entry_fd (glfd=<value optimized out>) at glfs-internal.h:222
#1  pub_glfs_close (glfd=<value optimized out>) at glfs-fops.c:161
#2  0x00007fdd64579b5d in file_close (obj_hdl=0x7fdd24002358) at /root/lat_nfs_ganesha/src/FSAL/FSAL_GLUSTER/handle.c:1307
#3  0x00000000004f06b2 in cache_inode_close (entry=0x7fdd24002620, flags=168) at /root/lat_nfs_ganesha/src/cache_inode/cache_inode_open_close.c:305
#4  0x00000000004e5b44 in cache_inode_rdwr_plus (entry=0x7fdd24002620, io_direction=CACHE_INODE_WRITE, offset=10112888832, io_size=1048576, bytes_moved=0x7fdd41b75e28, buffer=0x7fdd044c2430, eof=0x7fdd41b75e27, sync=0x7fdd41b75e26, info=0x0) at /root/lat_nfs_ganesha/src/cache_inode/cache_inode_rdwr.c:227
#5  0x00000000004e64cd in cache_inode_rdwr (entry=0x7fdd24002620, io_direction=CACHE_INODE_WRITE, offset=10112888832, io_size=1048576, bytes_moved=0x7fdd41b75e28, buffer=0x7fdd044c2430, eof=0x7fdd41b75e27, sync=0x7fdd41b75e26) at /root/lat_nfs_ganesha/src/cache_inode/cache_inode_rdwr.c:304
#6  0x000000000045eed2 in nfs3_write (arg=0x7fdd040c1858, worker=0x7fdd1c0008c0, req=0x7fdd040c1718, res=0x7fdd1c009fa0) at /root/lat_nfs_ganesha/src/Protocols/NFS/nfs3_write.c:234
#7  0x00000000004548ac in nfs_rpc_execute (req=0x7fdd04000f10, worker_data=0x7fdd1c0008c0) at /root/lat_nfs_ganesha/src/MainNFSD/nfs_worker_thread.c:1268
#8  0x0000000000455646 in worker_run (ctx=0x101ac60) at /root/lat_nfs_ganesha/src/MainNFSD/nfs_worker_thread.c:1535
#9  0x000000000051bf5e in fridgethr_start_routine (arg=0x101ac60) at /root/lat_nfs_ganesha/src/support/fridgethr.c:562
#10 0x00000039090079d1 in start_thread () from /lib64/libpthread.so.0
#11 0x0000003908ce88fd in clone () from /lib64/libc.so.6
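For reference, below is a hedged sketch of the libgfapi write/close sequence that FSAL_GLUSTER drives for the dd workload above. The volume name and server address are taken from the configuration in this report; everything else is illustrative. This is not a guaranteed standalone reproducer, since the reported crash goes through NFS-Ganesha's cache_inode layer, but it shows the call pattern that ends in glfs_close() (frame #1 of the backtrace) after a failed write.

/* sketch only: error handling omitted for brevity */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <glusterfs/api/glfs.h>

int
main (void)
{
        static char buf[1048576];   /* zero-filled 1MB buffer */
        ssize_t ret;

        glfs_t *fs = glfs_new ("test");
        glfs_set_volfile_server (fs, "tcp", "10.70.42.219", 24007);
        glfs_init (fs);

        glfs_fd_t *fd = glfs_creat (fs, "/file1", O_WRONLY, 0644);

        /* keep writing until the brick runs out of space (ENOSPC) */
        do {
                ret = glfs_write (fd, buf, sizeof (buf), 0);
        } while (ret > 0);
        fprintf (stderr, "write failed: %s\n", strerror (errno));

        /* in the reported crash, glfs_close() is entered with an fd
         * whose internal state is no longer valid */
        glfs_close (fd);
        glfs_fini (fs);
        return 0;
}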
Trying to reproduce this, I got a similar segfault (though it was on a getattrs/glfs_fstat call):

(gdb) f 0
#0  0x00007f732dbd827f in __glfs_entry_fd (fd=0x7f730c7a30a0) at glfs-internal.h:193
193             THIS = fd->fd->inode->table->xl->ctx->master;
(gdb) l
188
189
190     static inline void
191     __glfs_entry_fd (struct glfs_fd *fd)
192     {
193             THIS = fd->fd->inode->table->xl->ctx->master;
194     }
195
196
197     /*
(gdb) p *fd
$1 = {
  openfds = {              <--- empty list
    next = 0x7f730c7a30a0,
    prev = 0x7f730c7a30a0
  },
  fs = 0x7f732e6b4660,
  offset = 796860416,
  fd = 0x7f732ab3703c,
  entries = {
    next = 0x0,
    prev = 0x0
  },
  next = 0x0,
  readdirbuf = 0x0
}
(gdb) p *fd->fd
$2 = {
  pid = 1039,
  flags = 2,
  refcount = 0,            <--- refcount is 0!
  inode_list = {           <--- empty list
    next = 0x7f732ab3704c,
    prev = 0x7f732ab3704c
  },
  inode = 0xaaaaaaaa,      <--- invalid pointer
  lock = 1,
  _ctx = 0x7f730c559060,
  xl_count = 11,
  lk_ctx = 0x7f730c5e9e20,
  anonymous = _gf_false
}
(gdb) p *fd->fd->inode
Cannot access memory at address 0xaaaaaaaa

Upstream prevents this issue with the checks added in http://review.gluster.org/10759 (which also returns EBADFD instead of EINVAL). http://review.gluster.org/9797 replaces __glfs_entry_fs() with __GLFS_ENTRY_VALIDATE_FS().

In /tmp/gfapi.log (yes, the location changed in newer versions) the following message is repeated many times:

[2015-08-24 09:19:15.610646] W [client-rpc-fops.c:851:client3_3_writev_cbk] 0-bz1218535-client-0: remote operation failed: No space left on device

My guess is that there is a race where the error is returned by Gluster but not correctly handled by libgfapi. This would cause an incorrect error to be passed to NFS-Ganesha, which then re-uses the fd/handle that should have been marked invalid. Including the above two patches should prevent the problem from occurring.
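For context, the fix pattern in those two reviews is a validate-before-dereference check on entry to the gfapi call. A minimal sketch of the idea, written against gluster's internal types from glfs-internal.h (the function name here is illustrative, not the exact upstream code):

/* Unpatched __glfs_entry_fd() follows five pointers blindly; a stale
 * fd like the one in $2 above (inode = 0xaaaaaaaa) then segfaults.
 * The patched entry points validate first and fail with EBADFD. */
static int
glfs_fd_sketch_validate (struct glfs_fd *glfd)
{
        if (!glfd || !glfd->fd || !glfd->fd->inode ||
            !glfd->fd->inode->table) {
                errno = EBADFD;   /* per 10759: EBADFD, not EINVAL */
                return -1;
        }
        THIS = glfd->fd->inode->table->xl->ctx->master;
        return 0;
}

With a check like this in place, pub_glfs_close() and glfs_fstat() would fail cleanly with EBADFD instead of dereferencing the invalid inode pointer seen in the dump above.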
*** This bug has been marked as a duplicate of bug 1240920 ***