NFS Server crashed while running arequal from the clients. The setup was Distribute-Replicate over 4 GlusterFS servers with 48 clients. Log file in /share/tickets.

[2010-05-05 02:31:39] T [rpcsvc.c:1572:rpcsvc_submit_generic] rpc-service: Tx message: 136
[2010-05-05 02:31:39] T [rpcsvc.c:1365:rpcsvc_record_build_header] rpc-service: Reply fraglen 160, payload: 136, rpc hdr: 24
[2010-05-05 02:31:39] T [rpcsvc.c:1402:rpcsvc_conn_submit] rpc-service: Tx Header: 28, payload: 136
[2010-05-05 02:31:39] T [nfs3.c:198:nfs3_call_state_wipe] nfs-nfsv3: fd ref: 6
[2010-05-05 02:31:39] T [io-cache.c:132:ioc_inode_flush] iocache: locked inode(0x2aaab4aba040)
[2010-05-05 02:31:39] T [io-cache.c:136:ioc_inode_flush] iocache: unlocked inode(0x2aaab4aba040)
[2010-05-05 02:31:39] T [nfs.c:407:nfs_user_create] nfs: uid: 0, gid 0, gids: 0
[2010-05-05 02:31:39] T [io-cache.c:132:ioc_inode_flush] iocache: locked inode(0x2aaab4245060)
[2010-05-05 02:31:39] T [io-cache.c:136:ioc_inode_flush] iocache: unlocked inode(0x2aaab4245060)
[2010-05-05 02:31:39] T [nfs.c:407:nfs_user_create] nfs: uid: 0, gid 0, gids: 0
[2010-05-05 02:31:39] T [rbthash.c:309:rbthash_key_bucket] rbthash: HASH: 1832395053
[2010-05-05 02:31:39] T [rbthash.c:311:rbthash_key_bucket] rbthash: BUCKET: 0
[2010-05-05 02:31:39] T [io-cache.c:132:ioc_inode_flush] iocache: locked inode(0x2aaab8552de0)
[2010-05-05 02:31:39] T [io-cache.c:136:ioc_inode_flush] iocache: unlocked inode(0x2aaab8552de0)
[2010-05-05 02:31:39] T [nfs.c:407:nfs_user_create] nfs: uid: 0, gid 0, gids: 0
[2010-05-05 02:31:39] T [io-cache.c:132:ioc_inode_flush] iocache: locked inode(0x2aaab882aa00)
[2010-05-05 02:31:39] T [io-cache.c:136:ioc_inode_flush] iocache: unlocked inode(0x2aaab882aa00)
[2010-05-05 02:31:39] T [nfs.c:407:nfs_user_create] nfs: uid: 0, gid 0, gids: 0
[2010-05-05 02:31:39] T [io-cache.c:132:ioc_inode_flush] iocache: locked inode(0x4976c20)
[2010-05-05 02:31:39] T [io-cache.c:136:ioc_inode_flush] iocache: unlocked inode(0x4976c20)
[2010-05-05 02:31:39] T [nfs.c:407:nfs_user_create] nfs: uid: 0, gid 0, gids: 0
[2010-05-05 02:31:39] T [io-cache.c:132:ioc_inode_flush] iocache: locked inode(0x2aaaf58c0940)
[2010-05-05 02:31:39] T [io-cache.c:136:ioc_inode_flush] iocache: unlocked inode(0x2aaaf58c0940)
[2010-05-05 02:31:39] T [nfs.c:407:nfs_user_create] nfs: uid: 0, gid 0, gids: 0
[2010-05-05 02:31:39] T [io-cache.c:132:ioc_inode_flush] iocache: locked inode(0x2aab136f1300)
[2010-05-05 02:31:39] T [io-cache.c:136:ioc_inode_flush] iocache: unlocked inode(0x2aab136f1300)
[2010-05-05 02:31:39] D [nfs3-helpers.c:2224:nfs3_log_write_res] nfs-nfsv3: XID: 6430105e, WRITE: NFS: 0(Call completed successfully.), POSIX: 0(Success), count: 65536, UNSTABLE,wverf: 1272964805
[2010-05-05 02:31:39] T [rpcsvc.c:1572:rpcsvc_submit_generic] rpc-service: Tx message: 136
[2010-05-05 02:31:39] T [rpcsvc.c:1365:rpcsvc_record_build_header] rpc-service: Reply fraglen 160, payload: 136, rpc hdr: 24
[2010-05-05 02:31:39] T [rpcsvc.c:1402:rpcsvc_conn_submit] rpc-service: Tx Header: 28, payload: 136
[2010-05-05 02:31:39] T [nfs3.c:198:nfs3_call_state_wipe] nfs-nfsv3: fd ref: 5
[2010-05-05 02:31:39] T [io-cache.c:132:ioc_inode_flush] iocache: locked inode(0x2aab136f1300)
[2010-05-05 02:31:39] T [io-cache.c:136:ioc_inode_flush] iocache: unlocked inode(0x2aab136f1300)
[2010-05-05 02:31:39] D [nfs3-helpers.c:2224:nfs3_log_write_res] nfs-nfsv3: XID: 6530105e, WRITE: NFS: 0(Call completed successfully.), POSIX: 0(Success), count: 65536, UNSTABLE,wverf: 1272964805
[2010-05-05 02:31:39] T [rpcsvc.c:1572:rpcsvc_submit_generic] rpc-service: Tx message: 136
[2010-05-05 02:31:39] T [rpcsvc.c:1365:rpcsvc_record_build_header] rpc-service: Reply fraglen 160, payload: 136, rpc hdr: 24
[2010-05-05 02:31:39] T [rpcsvc.c:1402:rpcsvc_conn_submit] rpc-service: Tx Header: 28, payload: 136
[2010-05-05 02:31:39] T [nfs3.c:198:nfs3_call_state_wipe] nfs-nfsv3: fd ref: 4
[2010-05-05 02:31:39] T [io-cache.c:132:ioc_inode_flush] iocache: locked inode(0x2aab10830630)
[2010-05-05 02:31:39] T [io-cache.c:136:ioc_inode_flush] iocache: unlocked inode(0x2aab10830630)
[2010-05-05 02:31:39] T [nfs.c:407:nfs_user_create] nfs: uid: 0, gid 0, gids: 0
[2010-05-05 02:31:39] D [nfs3-helpers.c:2224:nfs3_log_write_res] nfs-nfsv3: XID: 162bf8e7, WRITE: NFS: 0(Call completed successfully.), POSIX: 117(Structure needs cleaning), count: 8192, STABLE,wverf: 1272964805
[2010-05-05 02:31:39] T [rpcsvc.c:1572:rpcsvc_submit_generic] rpc-service: Tx message: 136
[2010-05-05 02:31:39] T [rpcsvc.c:1365:rpcsvc_record_build_header] rpc-service: Reply fraglen 160, payload: 136, rpc hdr: 24
[2010-05-05 02:31:39] T [rpcsvc.c:1402:rpcsvc_conn_submit] rpc-service: Tx Header: 28, payload: 136
[2010-05-05 02:31:39] T [nfs3.c:198:nfs3_call_state_wipe] nfs-nfsv3: fd ref: 3
[2010-05-05 02:31:39] D [nfs3-helpers.c:2224:nfs3_log_write_res] nfs-nfsv3: XID: bbb81229, WRITE: NFS: 0(Call completed successfully.), POSIX: 117(Structure needs cleaning), count: 8192, STABLE,wverf: 1272964805
[2010-05-05 02:31:39] T [rpcsvc.c:1572:rpcsvc_submit_generic] rpc-service: Tx message: 136
[2010-05-05 02:31:39] T [rpcsvc.c:1365:rpcsvc_record_build_header] rpc-service: Reply fraglen 160, payload: 136, rpc hdr: 24
[2010-05-05 02:31:39] T [rpcsvc.c:1402:rpcsvc_conn_submit] rpc-service: Tx Header: 28, payload: 136
[2010-05-05 02:31:39] T [nfs3.c:198:nfs3_call_state_wipe] nfs-nfsv3: fd ref: 3
[2010-05-05 02:31:39] D [nfs3-helpers.c:2224:nfs3_log_write_res] nfs-nfsv3: XID: a292ad11, WRITE: NFS: 0(Call completed successfully.), POSIX: 117(Structure needs cleaning), count: 4096, STABLE,wverf: 1272964805
[2010-05-05 02:31:39] T [rpcsvc.c:1572:rpcsvc_submit_generic] rpc-service: Tx message: 136
[2010-05-05 02:31:39] T [rpcsvc.c:1365:rpcsvc_record_build_header] rpc-service: Reply fraglen 160, payload: 136, rpc hdr: 24
[2010-05-05 02:31:39] T [rpcsvc.c:1402:rpcsvc_conn_submit] rpc-service: Tx Header: 28, payload: 136
[2010-05-05 02:31:39] T [nfs3.c:198:nfs3_call_state_wipe] nfs-nfsv3: fd ref: 3
[2010-05-05 02:31:39] D [nfs3-helpers.c:2224:nfs3_log_write_res] nfs-nfsv3: XID: 7df5cc22, WRITE: NFS: 0(Call completed successfully.), POSIX: 117(Structure needs cleaning), count: 8192, STABLE,wverf: 1272964805
[2010-05-05 02:31:39] T [rpcsvc.c:1572:rpcsvc_submit_generic] rpc-service: Tx message: 136
[2010-05-05 02:31:39] T [rpcsvc.c:1365:rpcsvc_record_build_header] rpc-service: Reply fraglen 160, payload: 136, rpc hdr: 24
[2010-05-05 02:31:39] T [rpcsvc.c:1402:rpcsvc_conn_submit] rpc-service: Tx Header: 28, payload: 136
[2010-05-05 02:31:39] T [nfs3.c:198:nfs3_call_state_wipe] nfs-nfsv3: fd ref: 3
[2010-05-05 02:31:39] D [nfs3-helpers.c:2224:nfs3_log_write_res] nfs-nfsv3: XID: 5fad5b16, WRITE: NFS: 0(Call completed successfully.), POSIX: 117(Structure needs cleaning), count: 8192, STABLE,wverf: 1272964805
[2010-05-05 02:31:39] T [rpcsvc.c:1572:rpcsvc_submit_generic] rpc-service: Tx message: 136
[2010-05-05 02:31:39] T [rpcsvc.c:1365:rpcsvc_record_build_header] rpc-service: Reply fraglen 160, payload: 136, rpc hdr: 24
[2010-05-05 02:31:39] T [rpcsvc.c:1402:rpcsvc_conn_submit] rpc-service: Tx Header: 28, payload: 136
[2010-05-05 02:31:39] T [nfs3.c:198:nfs3_call_state_wipe] nfs-nfsv3: fd ref: 3
[2010-05-05 02:31:39] D [nfs3-helpers.c:2224:nfs3_log_write_res] nfs-nfsv3: XID: 8fb577f8, WRITE: NFS: 0(Call completed successfully.), POSIX: 117(Structure needs cleaning), count: 8192, STABLE,wverf: 1272964805
[2010-05-05 02:31:39] T [rpcsvc.c:1572:rpcsvc_submit_generic] rpc-service: Tx message: 136
[2010-05-05 02:31:39] T [rpcsvc.c:1365:rpcsvc_record_build_header] rpc-service: Reply fraglen 160, payload: 136, rpc hdr: 24
[2010-05-05 02:31:39] T [rpcsvc.c:1402:rpcsvc_conn_submit] rpc-service: Tx Header: 28, payload: 136
[2010-05-05 02:31:39] T [nfs3.c:198:nfs3_call_state_wipe] nfs-nfsv3: fd ref: 3
[2010-05-05 02:31:39] T [afr-self-heal-algorithm.c:522:sh_diff_loop_return] mirror-0: loop for offset 3488874496 returned
[2010-05-05 02:31:39] T [afr-self-heal-algorithm.c:997:sh_diff_loop_driver] mirror-0: spawning a loop for offset 3490971648
[2010-05-05 02:31:39] T [afr-self-heal-algorithm.c:522:sh_diff_loop_return] mirror-0: loop for offset 3489005568 returned
[2010-05-05 02:31:39] T [afr-self-heal-algorithm.c:997:sh_diff_loop_driver] mirror-0: spawning a loop for offset 3491102720
/lib64/libc.so.6[0x2aaaab39c280]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so[0x2aaaad06a650]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so(__nfs3_fdcache_update_entry+0x90)[0x2aaaad06a628]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so(nfs3_fdcache_update+0x41)[0x2aaaad06a71e]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so(nfs3_file_open_and_resume+0x126)[0x2aaaad06afd9]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so(nfs3_write_open_resume+0x6f)[0x2aaaad05bd84]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so(nfs3_fh_resolve_inode_done+0xae)[0x2aaaad06bfa6]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so(nfs3_fh_resolve_inode+0xbc)[0x2aaaad06d88b]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so(nfs3_fh_resolve_and_resume+0xc1)[0x2aaaad06d9e8]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so(nfs3_write+0x35d)[0x2aaaad05c167]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so(nfs3svc_write_vec+0xba)[0x2aaaad05c49b]
/opt/glusterfs/nfs-git/lib/libglrpcsvc.so.0(rpcsvc_record_vectored_call_actor+0xbc)[0x2aaaad28f10a]
/opt/glusterfs/nfs-git/lib/libglrpcsvc.so.0(rpcsvc_update_vectored_state+0x1bd)[0x2aaaad28f33b]
/opt/glusterfs/nfs-git/lib/libglrpcsvc.so.0(rpcsvc_handle_vectored_frag+0x85)[0x2aaaad28f425]
/opt/glusterfs/nfs-git/lib/libglrpcsvc.so.0(rpcsvc_record_update_state+0xcd)[0x2aaaad28f504]
/opt/glusterfs/nfs-git/lib/libglrpcsvc.so.0(rpcsvc_conn_data_poll_in+0xe7)[0x2aaaad28f7b9]
/opt/glusterfs/nfs-git/lib/libglrpcsvc.so.0(rpcsvc_conn_data_handler+0x87)[0x2aaaad28fc15]
/opt/glusterfs/nfs-git/lib/libglusterfs.so.0[0x2aaaaad072b9]
/opt/glusterfs/nfs-git/lib/libglusterfs.so.0[0x2aaaaad0748e]
/opt/glusterfs/nfs-git/lib/libglusterfs.so.0(event_dispatch+0x73)[0x2aaaaad077a4]
/opt/glusterfs/nfs-git/lib/libglrpcsvc.so.0(rpcsvc_stage_proc+0x32)[0x2aaaad28ac32]
/lib64/libpthread.so.0[0x2aaaab157367]
/lib64/libc.so.6(clone+0x6d)[0x2aaaab43ef7d]
Do you remember what you were arequal'ing? Kernel source?
arequal'ing /usr
There is a replicate self-heal function in the trace log. Anush, was this seen when running the availability tests with replicate and NFS?
Reproducible by running a compilebench load over a distributed-replicated setup and all perf translators. The crash occurs a few minutes after the downed node comes back up. Stack trace looks like:

#0  __nfs3_fdcache_remove_entry (nfs3=0x640f90, fde=0x0) at ../../../../libglusterfs/src/list.h:60
60              old->prev->next = old->next;
(gdb) bt
#0  __nfs3_fdcache_remove_entry (nfs3=0x640f90, fde=0x0) at ../../../../libglusterfs/src/list.h:60
#1  0x00002aaaad023fa6 in nfs3_fdcache_remove (nfs3=0x640f90, fd=0xeac860) at nfs3-helpers.c:1933
#2  0x00002aaaad014b7b in nfs3svc_remove_cbk (frame=<value optimized out>, cookie=<value optimized out>, this=<value optimized out>, op_ret=<value optimized out>, op_errno=117, preparent=0x2aaab2481d58, postparent=0x2aaab2481dc0) at nfs3.c:2951
#3  0x00002aaaad00c812 in nfs_inode_unlink_cbk (frame=0x2aaab2534088, cookie=0x61b170, this=0x6147b0, op_ret=0, op_errno=117, preparent=0x2aaab2481d58, postparent=0x2aaab2481dc0) at nfs-inodes.c:348
#4  0x00002aaaad0092d3 in nfs_fop_unlink_cbk (frame=0x2aaab2534088, cookie=0x61b170, this=0x6147b0, op_ret=0, op_errno=117, preparent=0x2aaab2481d58, postparent=0x2aaab2481dc0) at nfs-fops.c:915
#5  0x00002aaaac378278 in dht_unlink_cbk (frame=0x2aaab257e8f0, cookie=0x2aaab23e3340, this=0x6147b0, op_ret=<value optimized out>, op_errno=<value optimized out>, preparent=<value optimized out>, postparent=0x2aaab0c0a188) at dht-common.c:1248
#6  0x00002aaaac1323ab in afr_unlink_unwind (frame=<value optimized out>, this=<value optimized out>) at afr-dir-write.c:1605
#7  0x00002aaaac13435c in afr_unlink_wind_cbk (frame=0x2aaab260c618, cookie=0x1, this=0x612ba0, op_ret=0, op_errno=0, preparent=0x7fffd6ba83d0, postparent=0x7fffd6ba8360) at afr-dir-write.c:1669
#8  0x00002aaaabf14831 in client_unlink_cbk (frame=0x27d7cb0, hdr=<value optimized out>, hdrlen=<value optimized out>, iobuf=<value optimized out>) at client-protocol.c:4357
#9  0x00002aaaabefd22a in protocol_client_pollin (this=0x6104a0, trans=0x6205d0) at client-protocol.c:6435
#10 0x00002aaaabf0bba2 in notify (this=0x640f90, event=2, data=0x6205d0) at client-protocol.c:6554
#11 0x00002aaaaacdad13 in xlator_notify (xl=0x6104a0, event=2, data=0x6205d0) at xlator.c:919
#12 0x00002aaaad44f0e3 in socket_event_handler (fd=<value optimized out>, idx=3, data=0x6205d0, poll_in=1, poll_out=0, poll_err=0) at socket.c:831
#13 0x00002aaaaacf5dc5 in event_dispatch_epoll (event_pool=0x609360) at event.c:804
#14 0x0000000000404397 in main (argc=8, argv=0x7fffd6ba9318) at glusterfsd.c:1494
(gdb) fr 0
#0  __nfs3_fdcache_remove_entry (nfs3=0x640f90, fde=0x0) at ../../../../libglusterfs/src/list.h:60
60              old->prev->next = old->next;
(gdb) list
55
56
57      static inline void
58      list_del (struct list_head *old)
59      {
60              old->prev->next = old->next;
61              old->next->prev = old->prev;
62
63              old->next = (void *)0xbabebabe;
64              old->prev = (void *)0xcafecafe;
(gdb) fr 1
#1  0x00002aaaad023fa6 in nfs3_fdcache_remove (nfs3=0x640f90, fd=0xeac860) at nfs3-helpers.c:1933
1933            __nfs3_fdcache_remove_entry (nfs3, fde);
(gdb) list
1928
1929            LOCK (&nfs3->fdlrulock);
1930            {
1931                    fd_ctx_get (fd, nfs3->nfsx, &ctxaddr);
1932                    fde = (struct nfs3_fd_entry *)(long)ctxaddr;
1933                    __nfs3_fdcache_remove_entry (nfs3, fde);
1934            }
1935            UNLOCK (&nfs3->fdlrulock);
1936
1937            return 0;
(gdb) p *fd
$1 = {pid = 0, flags = 0, refcount = 1, flush_unique = 0, inode_list = {next = 0xd89058, prev = 0xd89058}, inode = 0xd89020, lock = 1, _ctx = 0xeacf40}

The @fde entry is NULL.
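The `list_del` shown in the listing above can be exercised on its own. Below is a minimal standalone copy of that function (the harness around it is made up for illustration; only the function body comes from the listing). With `fde == 0x0`, the `&fde->list` handed to it is a near-NULL address, so the very first statement, the one at line 60, dereferences it and faults exactly as the backtrace shows.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
        struct list_head *next, *prev;
};

/* Same logic as the libglusterfs/src/list.h listing above: unlink the
 * node, then poison its pointers so stale reuse is easy to spot. Calling
 * this with a (near-)NULL "old" faults on the first line. */
static inline void
list_del (struct list_head *old)
{
        old->prev->next = old->next;
        old->next->prev = old->prev;

        old->next = (void *)0xbabebabe;
        old->prev = (void *)0xcafecafe;
}
```

Deleting a valid node leaves the neighbours linked to each other and the deleted node carrying the two poison values.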
Latest crash happened 5 minutes after the node came back up. Log in dev:/share/tickets/885/fd-cache-crash.log.bz2

Program terminated with signal 11, Segmentation fault.
#0  0x00002aaaad023c08 in __nfs3_fdcache_remove_entry (nfs3=0x641230, fde=0x0) at nfs3-helpers.c:1908
1908            gf_log (GF_NFS3, GF_LOG_TRACE, "Removing fd: 0x%lx: %d",
(gdb) bt
#0  0x00002aaaad023c08 in __nfs3_fdcache_remove_entry (nfs3=0x641230, fde=0x0) at nfs3-helpers.c:1908
#1  0x00002aaaad023fa6 in nfs3_fdcache_remove (nfs3=0x641230, fd=0x2aaab0839ae0) at nfs3-helpers.c:1933
#2  0x00002aaaad014b7b in nfs3svc_remove_cbk (frame=<value optimized out>, cookie=<value optimized out>, this=<value optimized out>, op_ret=<value optimized out>, op_errno=117, preparent=0x2aaab15ab648, postparent=0x2aaab15ab6b0) at nfs3.c:2951
#3  0x00002aaaad00c812 in nfs_inode_unlink_cbk (frame=0x2aaab15be1a8, cookie=0x61b5c0, this=0x6149d0, op_ret=0, op_errno=117, preparent=0x2aaab15ab648, postparent=0x2aaab15ab6b0) at nfs-inodes.c:348
#4  0x00002aaaad0092d3 in nfs_fop_unlink_cbk (frame=0x2aaab15be1a8, cookie=0x61b5c0, this=0x6149d0, op_ret=0, op_errno=117, preparent=0x2aaab15ab648, postparent=0x2aaab15ab6b0) at nfs-fops.c:915
#5  0x00002aaaac378278 in dht_unlink_cbk (frame=0x2aaab01bf4a0, cookie=0x2aaab00e9530, this=0x6149d0, op_ret=<value optimized out>, op_errno=<value optimized out>, preparent=<value optimized out>, postparent=0x2aaab180dac8) at dht-common.c:1248
#6  0x00002aaaac1323ab in afr_unlink_unwind (frame=<value optimized out>, this=<value optimized out>) at afr-dir-write.c:1605
#7  0x00002aaaac13435c in afr_unlink_wind_cbk (frame=0x2aaab14c3d48, cookie=0x1, this=0x612e10, op_ret=0, op_errno=0, preparent=0x7fff7dc7d4a0, postparent=0x7fff7dc7d430) at afr-dir-write.c:1669
#8  0x00002aaaabf14831 in client_unlink_cbk (frame=0x2aaab497c510, hdr=<value optimized out>, hdrlen=<value optimized out>, iobuf=<value optimized out>) at client-protocol.c:4357
#9  0x00002aaaabefd22a in protocol_client_pollin (this=0x610680, trans=0x620d30) at client-protocol.c:6435
#10 0x00002aaaabf0bba2 in notify (this=0x641230, event=2, data=0x620d30) at client-protocol.c:6554
#11 0x00002aaaaacdad13 in xlator_notify (xl=0x610680, event=2, data=0x620d30) at xlator.c:919
#12 0x00002aaaad44f0e3 in socket_event_handler (fd=<value optimized out>, idx=3, data=0x620d30, poll_in=1, poll_out=0, poll_err=0) at socket.c:831
#13 0x00002aaaaacf5dc5 in event_dispatch_epoll (event_pool=0x609360) at event.c:804
#14 0x0000000000404397 in main (argc=8, argv=0x7fff7dc7e3e8) at glusterfsd.c:1494

###########################################################################
###########################################################################

Looking deeper in the segment below:

#0  0x00002aaaad023c08 in __nfs3_fdcache_remove_entry (nfs3=0x641230, fde=0x0) at nfs3-helpers.c:1908
1908            gf_log (GF_NFS3, GF_LOG_TRACE, "Removing fd: 0x%lx: %d",
(gdb) fr 1
#1  0x00002aaaad023fa6 in nfs3_fdcache_remove (nfs3=0x641230, fd=0x2aaab0839ae0) at nfs3-helpers.c:1933
1933            __nfs3_fdcache_remove_entry (nfs3, fde);
(gdb) list
1928
1929            LOCK (&nfs3->fdlrulock);
1930            {
1931                    fd_ctx_get (fd, nfs3->nfsx, &ctxaddr);
1932                    fde = (struct nfs3_fd_entry *)(long)ctxaddr;
1933                    __nfs3_fdcache_remove_entry (nfs3, fde);
1934            }
1935            UNLOCK (&nfs3->fdlrulock);
1936
1937            return 0;
(gdb) p *fd
$1 = {pid = 0, flags = 0, refcount = 1, flush_unique = 0, inode_list = {next = 0x2aaab0839848, prev = 0x2aaab0839848}, inode = 0x2aaab0839810, lock = 1, _ctx = 0x2aaab0839b50}
(gdb) p *fd->inode->fd_list
Structure has no component named operator*.
(gdb) p *fd->inode
$2 = {table = 0x61d280, lock = 1, nlookup = 0, generation = 5498616990052389063, in_attic = 0, ref = 8, ino = 32446706, ia_type = IA_IFREG, fd_list = {next = 0x2aaab0839af8, prev = 0x2aaab0839af8}, dentry_list = {next = 0x2aaab0839858, prev = 0x2aaab0839858}, hash = {next = 0x2aaaaab514f0, prev = 0x2aaab12f4908}, list = {next = 0x2aaab07c2b28, prev = 0x2aaab0839ca8}, _ctx = 0x2aaab08398a0}
(gdb) p *fd
$3 = {pid = 0, flags = 0, refcount = 1, flush_unique = 0, inode_list = {next = 0x2aaab0839848, prev = 0x2aaab0839848}, inode = 0x2aaab0839810, lock = 1, _ctx = 0x2aaab0839b50}

###########################################################################
###########################################################################

The fd_list in the inode and the inode_list in the fd are both empty, even though I got this fd here by doing an fd_lookup on the inode. It looks like a race condition between the point where we do fd_lookup in nfs3_remove_cbk and the point where nfs3_fdcache_remove is called.
Here's the problem. Suppose a file named 1 is created and some data is written to it. After this, another 512 files are newly created and written to. When the 513th file is created and an fd_t is opened for it, file 1's fd_t is replaced in the fd-lru by the 513th file's fd_t. This is the correct behaviour: the fd-lru's refs on 1's fd_t are dropped, and normally the fd and all related state are freed. But in some workloads, some refs are still pending even after the fd_t is removed from the LRU, so the fd remains bound to the inode. In nfs3svc_remove_cbk, while removing the inode state, we also ensure that any fd_ts in the cache for this inode are removed. When the fd_t has already been replaced in the cache while a ref remains on it, this fdcache-remove path in nfs3svc_remove_cbk crashes: fd_ctx_get returns a NULL value, because the ctx was already deleted when this fd_t was removed from the fd-lru earlier. This patch fixes the crash by introducing a NULL check.
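The scenario above can be sketched as a toy model (the structures and function names here are illustrative stand-ins, not the actual GlusterFS sources): LRU eviction frees the cache entry and clears the fd's ctx while callers may still hold refs on the fd_t, so a later fdcache remove on that fd gets NULL back from the ctx lookup. With the NULL check, that case becomes a harmless no-op instead of a crash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct list_head {
        struct list_head *next, *prev;
};

/* Toy stand-ins: the ctx field plays the role of fd_ctx_get's result,
 * and is cleared when the entry is evicted from the fd-lru, even though
 * pending refs can keep the fd_t itself alive. */
struct toy_fd {
        int   refcount;
        void *ctx;              /* NULL once evicted from the lru */
};

struct toy_fd_entry {
        struct list_head  list;
        struct toy_fd    *fd;
};

static void
list_del (struct list_head *old)
{
        old->prev->next = old->next;
        old->next->prev = old->prev;
}

/* The fixed removal path: an fd whose cache entry was already reaped by
 * lru eviction yields a NULL entry from the ctx lookup and is skipped. */
static int
toy_fdcache_remove (struct toy_fd *fd)
{
        struct toy_fd_entry *fde = fd->ctx;

        if (fde == NULL)        /* the fix: ctx already deleted on eviction */
                return 0;

        list_del (&fde->list);
        free (fde);
        fd->ctx = NULL;
        return 0;
}
```

Without the `fde == NULL` early return, the evicted-but-still-referenced case would hand a NULL entry to `list_del`, which is the fault in both backtraces.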
Regression test at: http://test.gluster.com/show_bug.cgi?id=90
PATCH: http://patches.gluster.com/patch/3941 in master (nfs3: NULL fdentry check before removing from fdcache)