Bug 762617 (GLUSTER-885)

Summary: [NFS Xlator] Crash in nfs3_fdcache_update
Product: [Community] GlusterFS
Component: nfs
Version: nfs-alpha
Reporter: Anush Shetty <anush>
Assignee: Shehjar Tikoo <shehjart>
CC: gluster-bugs
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: low
Hardware: All
OS: Linux
Doc Type: Bug Fix
Regression: RTP
Mount Type: nfs

Description Anush Shetty 2010-05-05 05:02:52 EDT
NFS Server crashed while running arequal from the clients. The setup was Distribute-Replicate over 4 GlusterFS servers with 48 clients. Log file in /share/tickets

[2010-05-05 02:31:39] T [rpcsvc.c:1572:rpcsvc_submit_generic] rpc-service: Tx message: 136
[2010-05-05 02:31:39] T [rpcsvc.c:1365:rpcsvc_record_build_header] rpc-service: Reply fraglen 160, payload: 136, rpc hdr: 24
[2010-05-05 02:31:39] T [rpcsvc.c:1402:rpcsvc_conn_submit] rpc-service: Tx Header: 28, payload: 136
[2010-05-05 02:31:39] T [nfs3.c:198:nfs3_call_state_wipe] nfs-nfsv3: fd ref: 6
[2010-05-05 02:31:39] T [io-cache.c:132:ioc_inode_flush] iocache: locked inode(0x2aaab4aba040)
[2010-05-05 02:31:39] T [io-cache.c:136:ioc_inode_flush] iocache: unlocked inode(0x2aaab4aba040)
[2010-05-05 02:31:39] T [nfs.c:407:nfs_user_create] nfs: uid: 0, gid 0, gids: 0
[2010-05-05 02:31:39] T [io-cache.c:132:ioc_inode_flush] iocache: locked inode(0x2aaab4245060)
[2010-05-05 02:31:39] T [io-cache.c:136:ioc_inode_flush] iocache: unlocked inode(0x2aaab4245060)
[2010-05-05 02:31:39] T [nfs.c:407:nfs_user_create] nfs: uid: 0, gid 0, gids: 0
[2010-05-05 02:31:39] T [rbthash.c:309:rbthash_key_bucket] rbthash: HASH: 1832395053
[2010-05-05 02:31:39] T [rbthash.c:311:rbthash_key_bucket] rbthash: BUCKET: 0
[2010-05-05 02:31:39] T [io-cache.c:132:ioc_inode_flush] iocache: locked inode(0x2aaab8552de0)
[2010-05-05 02:31:39] T [io-cache.c:136:ioc_inode_flush] iocache: unlocked inode(0x2aaab8552de0)
[2010-05-05 02:31:39] T [nfs.c:407:nfs_user_create] nfs: uid: 0, gid 0, gids: 0
[2010-05-05 02:31:39] T [io-cache.c:132:ioc_inode_flush] iocache: locked inode(0x2aaab882aa00)
[2010-05-05 02:31:39] T [io-cache.c:136:ioc_inode_flush] iocache: unlocked inode(0x2aaab882aa00)
[2010-05-05 02:31:39] T [nfs.c:407:nfs_user_create] nfs: uid: 0, gid 0, gids: 0
[2010-05-05 02:31:39] T [io-cache.c:132:ioc_inode_flush] iocache: locked inode(0x4976c20)
[2010-05-05 02:31:39] T [io-cache.c:136:ioc_inode_flush] iocache: unlocked inode(0x4976c20)
[2010-05-05 02:31:39] T [nfs.c:407:nfs_user_create] nfs: uid: 0, gid 0, gids: 0
[2010-05-05 02:31:39] T [io-cache.c:132:ioc_inode_flush] iocache: locked inode(0x2aaaf58c0940)
[2010-05-05 02:31:39] T [io-cache.c:136:ioc_inode_flush] iocache: unlocked inode(0x2aaaf58c0940)
[2010-05-05 02:31:39] T [nfs.c:407:nfs_user_create] nfs: uid: 0, gid 0, gids: 0
[2010-05-05 02:31:39] T [io-cache.c:132:ioc_inode_flush] iocache: locked inode(0x2aab136f1300)
[2010-05-05 02:31:39] T [io-cache.c:136:ioc_inode_flush] iocache: unlocked inode(0x2aab136f1300)
[2010-05-05 02:31:39] D [nfs3-helpers.c:2224:nfs3_log_write_res] nfs-nfsv3: XID: 6430105e, WRITE: NFS: 0(Call completed successfully.), POSIX: 0(Success), count: 65536, UNSTABLE,wverf: 1272964805
[2010-05-05 02:31:39] T [rpcsvc.c:1572:rpcsvc_submit_generic] rpc-service: Tx message: 136
[2010-05-05 02:31:39] T [rpcsvc.c:1365:rpcsvc_record_build_header] rpc-service: Reply fraglen 160, payload: 136, rpc hdr: 24
[2010-05-05 02:31:39] T [rpcsvc.c:1402:rpcsvc_conn_submit] rpc-service: Tx Header: 28, payload: 136
[2010-05-05 02:31:39] T [nfs3.c:198:nfs3_call_state_wipe] nfs-nfsv3: fd ref: 5
[2010-05-05 02:31:39] T [io-cache.c:132:ioc_inode_flush] iocache: locked inode(0x2aab136f1300)
[2010-05-05 02:31:39] T [io-cache.c:136:ioc_inode_flush] iocache: unlocked inode(0x2aab136f1300)
[2010-05-05 02:31:39] D [nfs3-helpers.c:2224:nfs3_log_write_res] nfs-nfsv3: XID: 6530105e, WRITE: NFS: 0(Call completed successfully.), POSIX: 0(Success), count: 65536, UNSTABLE,wverf: 1272964805
[2010-05-05 02:31:39] T [rpcsvc.c:1572:rpcsvc_submit_generic] rpc-service: Tx message: 136
[2010-05-05 02:31:39] T [rpcsvc.c:1365:rpcsvc_record_build_header] rpc-service: Reply fraglen 160, payload: 136, rpc hdr: 24
[2010-05-05 02:31:39] T [rpcsvc.c:1402:rpcsvc_conn_submit] rpc-service: Tx Header: 28, payload: 136
[2010-05-05 02:31:39] T [nfs3.c:198:nfs3_call_state_wipe] nfs-nfsv3: fd ref: 4
[2010-05-05 02:31:39] T [io-cache.c:132:ioc_inode_flush] iocache: locked inode(0x2aab10830630)
[2010-05-05 02:31:39] T [io-cache.c:136:ioc_inode_flush] iocache: unlocked inode(0x2aab10830630)
[2010-05-05 02:31:39] T [nfs.c:407:nfs_user_create] nfs: uid: 0, gid 0, gids: 0
[2010-05-05 02:31:39] D [nfs3-helpers.c:2224:nfs3_log_write_res] nfs-nfsv3: XID: 162bf8e7, WRITE: NFS: 0(Call completed successfully.), POSIX: 117(Structure needs cleaning), count: 8192, STABLE,wverf: 1272964805
[2010-05-05 02:31:39] T [rpcsvc.c:1572:rpcsvc_submit_generic] rpc-service: Tx message: 136
[2010-05-05 02:31:39] T [rpcsvc.c:1365:rpcsvc_record_build_header] rpc-service: Reply fraglen 160, payload: 136, rpc hdr: 24
[2010-05-05 02:31:39] T [rpcsvc.c:1402:rpcsvc_conn_submit] rpc-service: Tx Header: 28, payload: 136
[2010-05-05 02:31:39] T [nfs3.c:198:nfs3_call_state_wipe] nfs-nfsv3: fd ref: 3
[2010-05-05 02:31:39] D [nfs3-helpers.c:2224:nfs3_log_write_res] nfs-nfsv3: XID: bbb81229, WRITE: NFS: 0(Call completed successfully.), POSIX: 117(Structure needs cleaning), count: 8192, STABLE,wverf: 1272964805
[2010-05-05 02:31:39] T [rpcsvc.c:1572:rpcsvc_submit_generic] rpc-service: Tx message: 136
[2010-05-05 02:31:39] T [rpcsvc.c:1365:rpcsvc_record_build_header] rpc-service: Reply fraglen 160, payload: 136, rpc hdr: 24
[2010-05-05 02:31:39] T [rpcsvc.c:1402:rpcsvc_conn_submit] rpc-service: Tx Header: 28, payload: 136
[2010-05-05 02:31:39] T [nfs3.c:198:nfs3_call_state_wipe] nfs-nfsv3: fd ref: 3
[2010-05-05 02:31:39] D [nfs3-helpers.c:2224:nfs3_log_write_res] nfs-nfsv3: XID: a292ad11, WRITE: NFS: 0(Call completed successfully.), POSIX: 117(Structure needs cleaning), count: 4096, STABLE,wverf: 1272964805
[2010-05-05 02:31:39] T [rpcsvc.c:1572:rpcsvc_submit_generic] rpc-service: Tx message: 136
[2010-05-05 02:31:39] T [rpcsvc.c:1365:rpcsvc_record_build_header] rpc-service: Reply fraglen 160, payload: 136, rpc hdr: 24
[2010-05-05 02:31:39] T [rpcsvc.c:1402:rpcsvc_conn_submit] rpc-service: Tx Header: 28, payload: 136
[2010-05-05 02:31:39] T [nfs3.c:198:nfs3_call_state_wipe] nfs-nfsv3: fd ref: 3
[2010-05-05 02:31:39] D [nfs3-helpers.c:2224:nfs3_log_write_res] nfs-nfsv3: XID: 7df5cc22, WRITE: NFS: 0(Call completed successfully.), POSIX: 117(Structure needs cleaning), count: 8192, STABLE,wverf: 1272964805
[2010-05-05 02:31:39] T [rpcsvc.c:1572:rpcsvc_submit_generic] rpc-service: Tx message: 136
[2010-05-05 02:31:39] T [rpcsvc.c:1365:rpcsvc_record_build_header] rpc-service: Reply fraglen 160, payload: 136, rpc hdr: 24
[2010-05-05 02:31:39] T [rpcsvc.c:1402:rpcsvc_conn_submit] rpc-service: Tx Header: 28, payload: 136
[2010-05-05 02:31:39] T [nfs3.c:198:nfs3_call_state_wipe] nfs-nfsv3: fd ref: 3
[2010-05-05 02:31:39] D [nfs3-helpers.c:2224:nfs3_log_write_res] nfs-nfsv3: XID: 5fad5b16, WRITE: NFS: 0(Call completed successfully.), POSIX: 117(Structure needs cleaning), count: 8192, STABLE,wverf: 1272964805
[2010-05-05 02:31:39] T [rpcsvc.c:1572:rpcsvc_submit_generic] rpc-service: Tx message: 136
[2010-05-05 02:31:39] T [rpcsvc.c:1365:rpcsvc_record_build_header] rpc-service: Reply fraglen 160, payload: 136, rpc hdr: 24
[2010-05-05 02:31:39] T [rpcsvc.c:1402:rpcsvc_conn_submit] rpc-service: Tx Header: 28, payload: 136
[2010-05-05 02:31:39] T [nfs3.c:198:nfs3_call_state_wipe] nfs-nfsv3: fd ref: 3
[2010-05-05 02:31:39] D [nfs3-helpers.c:2224:nfs3_log_write_res] nfs-nfsv3: XID: 8fb577f8, WRITE: NFS: 0(Call completed successfully.), POSIX: 117(Structure needs cleaning), count: 8192, STABLE,wverf: 1272964805
[2010-05-05 02:31:39] T [rpcsvc.c:1572:rpcsvc_submit_generic] rpc-service: Tx message: 136
[2010-05-05 02:31:39] T [rpcsvc.c:1365:rpcsvc_record_build_header] rpc-service: Reply fraglen 160, payload: 136, rpc hdr: 24
[2010-05-05 02:31:39] T [rpcsvc.c:1402:rpcsvc_conn_submit] rpc-service: Tx Header: 28, payload: 136
[2010-05-05 02:31:39] T [nfs3.c:198:nfs3_call_state_wipe] nfs-nfsv3: fd ref: 3
[2010-05-05 02:31:39] T [afr-self-heal-algorithm.c:522:sh_diff_loop_return] mirror-0: loop for offset 3488874496 returned
[2010-05-05 02:31:39] T [afr-self-heal-algorithm.c:997:sh_diff_loop_driver] mirror-0: spawning a loop for offset 3490971648
[2010-05-05 02:31:39] T [afr-self-heal-algorithm.c:522:sh_diff_loop_return] mirror-0: loop for offset 3489005568 returned
[2010-05-05 02:31:39] T [afr-self-heal-algorithm.c:997:sh_diff_loop_driver] mirror-0: spawning a loop for offset 3491102720
/lib64/libc.so.6[0x2aaaab39c280]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so[0x2aaaad06a650]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so(__nfs3_fdcache_update_entry+0x90)[0x2aaaad06a628]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so(nfs3_fdcache_update+0x41)[0x2aaaad06a71e]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so(nfs3_file_open_and_resume+0x126)[0x2aaaad06afd9]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so(nfs3_write_open_resume+0x6f)[0x2aaaad05bd84]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so(nfs3_fh_resolve_inode_done+0xae)[0x2aaaad06bfa6]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so(nfs3_fh_resolve_inode+0xbc)[0x2aaaad06d88b]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so(nfs3_fh_resolve_and_resume+0xc1)[0x2aaaad06d9e8]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so(nfs3_write+0x35d)[0x2aaaad05c167]
/opt/glusterfs/nfs-git/lib/glusterfs/3.0.0git/xlator/nfs/server.so(nfs3svc_write_vec+0xba)[0x2aaaad05c49b]
/opt/glusterfs/nfs-git/lib/libglrpcsvc.so.0(rpcsvc_record_vectored_call_actor+0xbc)[0x2aaaad28f10a]
/opt/glusterfs/nfs-git/lib/libglrpcsvc.so.0(rpcsvc_update_vectored_state+0x1bd)[0x2aaaad28f33b]
/opt/glusterfs/nfs-git/lib/libglrpcsvc.so.0(rpcsvc_handle_vectored_frag+0x85)[0x2aaaad28f425]
/opt/glusterfs/nfs-git/lib/libglrpcsvc.so.0(rpcsvc_record_update_state+0xcd)[0x2aaaad28f504]
/opt/glusterfs/nfs-git/lib/libglrpcsvc.so.0(rpcsvc_conn_data_poll_in+0xe7)[0x2aaaad28f7b9]
/opt/glusterfs/nfs-git/lib/libglrpcsvc.so.0(rpcsvc_conn_data_handler+0x87)[0x2aaaad28fc15]
/opt/glusterfs/nfs-git/lib/libglusterfs.so.0[0x2aaaaad072b9]
/opt/glusterfs/nfs-git/lib/libglusterfs.so.0[0x2aaaaad0748e]
/opt/glusterfs/nfs-git/lib/libglusterfs.so.0(event_dispatch+0x73)[0x2aaaaad077a4]
/opt/glusterfs/nfs-git/lib/libglrpcsvc.so.0(rpcsvc_stage_proc+0x32)[0x2aaaad28ac32]
/lib64/libpthread.so.0[0x2aaaab157367]
/lib64/libc.so.6(clone+0x6d)[0x2aaaab43ef7d]
Comment 1 Shehjar Tikoo 2010-05-14 00:17:57 EDT
Do you remember what you were arequal'ing? Kernel source?
Comment 2 Anush Shetty 2010-05-14 00:19:49 EDT
arequal'ing /usr
Comment 3 Shehjar Tikoo 2010-05-16 23:17:37 EDT
There is a replicate self-heal function in the trace log.

Anush, was this seen when running the availability tests with replicate and NFS?
Comment 4 Shehjar Tikoo 2010-07-27 06:20:07 EDT
Reproducible by running a compilebench load over a distributed-replicated setup with all performance translators loaded. The crash occurs a few minutes after the downed node comes back up.

Stack trace looks like:
#0  __nfs3_fdcache_remove_entry (nfs3=0x640f90, fde=0x0) at ../../../../libglusterfs/src/list.h:60
60              old->prev->next = old->next;
(gdb) bt
#0  __nfs3_fdcache_remove_entry (nfs3=0x640f90, fde=0x0) at ../../../../libglusterfs/src/list.h:60
#1  0x00002aaaad023fa6 in nfs3_fdcache_remove (nfs3=0x640f90, fd=0xeac860) at nfs3-helpers.c:1933
#2  0x00002aaaad014b7b in nfs3svc_remove_cbk (frame=<value optimized out>, cookie=<value optimized out>, this=<value optimized out>, op_ret=<value optimized out>, op_errno=117, 
    preparent=0x2aaab2481d58, postparent=0x2aaab2481dc0) at nfs3.c:2951
#3  0x00002aaaad00c812 in nfs_inode_unlink_cbk (frame=0x2aaab2534088, cookie=0x61b170, this=0x6147b0, op_ret=0, op_errno=117, preparent=0x2aaab2481d58, postparent=0x2aaab2481dc0)
    at nfs-inodes.c:348
#4  0x00002aaaad0092d3 in nfs_fop_unlink_cbk (frame=0x2aaab2534088, cookie=0x61b170, this=0x6147b0, op_ret=0, op_errno=117, preparent=0x2aaab2481d58, postparent=0x2aaab2481dc0)
    at nfs-fops.c:915
#5  0x00002aaaac378278 in dht_unlink_cbk (frame=0x2aaab257e8f0, cookie=0x2aaab23e3340, this=0x6147b0, op_ret=<value optimized out>, op_errno=<value optimized out>, 
    preparent=<value optimized out>, postparent=0x2aaab0c0a188) at dht-common.c:1248
#6  0x00002aaaac1323ab in afr_unlink_unwind (frame=<value optimized out>, this=<value optimized out>) at afr-dir-write.c:1605
#7  0x00002aaaac13435c in afr_unlink_wind_cbk (frame=0x2aaab260c618, cookie=0x1, this=0x612ba0, op_ret=0, op_errno=0, preparent=0x7fffd6ba83d0, postparent=0x7fffd6ba8360)
    at afr-dir-write.c:1669
#8  0x00002aaaabf14831 in client_unlink_cbk (frame=0x27d7cb0, hdr=<value optimized out>, hdrlen=<value optimized out>, iobuf=<value optimized out>) at client-protocol.c:4357
#9  0x00002aaaabefd22a in protocol_client_pollin (this=0x6104a0, trans=0x6205d0) at client-protocol.c:6435
#10 0x00002aaaabf0bba2 in notify (this=0x640f90, event=2, data=0x6205d0) at client-protocol.c:6554
#11 0x00002aaaaacdad13 in xlator_notify (xl=0x6104a0, event=2, data=0x6205d0) at xlator.c:919
#12 0x00002aaaad44f0e3 in socket_event_handler (fd=<value optimized out>, idx=3, data=0x6205d0, poll_in=1, poll_out=0, poll_err=0) at socket.c:831
#13 0x00002aaaaacf5dc5 in event_dispatch_epoll (event_pool=0x609360) at event.c:804
#14 0x0000000000404397 in main (argc=8, argv=0x7fffd6ba9318) at glusterfsd.c:1494
(gdb) fr 0
#0  __nfs3_fdcache_remove_entry (nfs3=0x640f90, fde=0x0) at ../../../../libglusterfs/src/list.h:60
60              old->prev->next = old->next;
(gdb) list
55
56
57      static inline void
58      list_del (struct list_head *old)
59      {
60              old->prev->next = old->next;
61              old->next->prev = old->prev;
62
63              old->next = (void *)0xbabebabe;
64              old->prev = (void *)0xcafecafe;
(gdb) fr 1
#1  0x00002aaaad023fa6 in nfs3_fdcache_remove (nfs3=0x640f90, fd=0xeac860) at nfs3-helpers.c:1933
1933                    __nfs3_fdcache_remove_entry (nfs3, fde);
(gdb) list
1928
1929            LOCK (&nfs3->fdlrulock);
1930            {
1931                    fd_ctx_get (fd, nfs3->nfsx, &ctxaddr);
1932                    fde = (struct nfs3_fd_entry *)(long)ctxaddr;
1933                    __nfs3_fdcache_remove_entry (nfs3, fde);
1934            }
1935            UNLOCK (&nfs3->fdlrulock);
1936
1937            return 0;
(gdb) p *fd
$1 = {pid = 0, flags = 0, refcount = 1, flush_unique = 0, inode_list = {next = 0xd89058, prev = 0xd89058}, inode = 0xd89020, lock = 1, _ctx = 0xeacf40}


The @fde entry is NULL.
Comment 5 Shehjar Tikoo 2010-07-27 09:50:47 EDT
The latest crash happened 5 minutes after the node came back up. Log in dev:/share/tickets/885/fd-cache-crash.log.bz2

Program terminated with signal 11, Segmentation fault.
#0  0x00002aaaad023c08 in __nfs3_fdcache_remove_entry (nfs3=0x641230, fde=0x0) at nfs3-helpers.c:1908
1908            gf_log (GF_NFS3, GF_LOG_TRACE, "Removing fd: 0x%lx: %d",
(gdb) bt
#0  0x00002aaaad023c08 in __nfs3_fdcache_remove_entry (nfs3=0x641230, fde=0x0) at nfs3-helpers.c:1908
#1  0x00002aaaad023fa6 in nfs3_fdcache_remove (nfs3=0x641230, fd=0x2aaab0839ae0) at nfs3-helpers.c:1933
#2  0x00002aaaad014b7b in nfs3svc_remove_cbk (frame=<value optimized out>, cookie=<value optimized out>, this=<value optimized out>, op_ret=<value optimized out>, op_errno=117, 
    preparent=0x2aaab15ab648, postparent=0x2aaab15ab6b0) at nfs3.c:2951
#3  0x00002aaaad00c812 in nfs_inode_unlink_cbk (frame=0x2aaab15be1a8, cookie=0x61b5c0, this=0x6149d0, op_ret=0, op_errno=117, preparent=0x2aaab15ab648, postparent=0x2aaab15ab6b0)
    at nfs-inodes.c:348
#4  0x00002aaaad0092d3 in nfs_fop_unlink_cbk (frame=0x2aaab15be1a8, cookie=0x61b5c0, this=0x6149d0, op_ret=0, op_errno=117, preparent=0x2aaab15ab648, postparent=0x2aaab15ab6b0)
    at nfs-fops.c:915
#5  0x00002aaaac378278 in dht_unlink_cbk (frame=0x2aaab01bf4a0, cookie=0x2aaab00e9530, this=0x6149d0, op_ret=<value optimized out>, op_errno=<value optimized out>, 
    preparent=<value optimized out>, postparent=0x2aaab180dac8) at dht-common.c:1248
#6  0x00002aaaac1323ab in afr_unlink_unwind (frame=<value optimized out>, this=<value optimized out>) at afr-dir-write.c:1605
#7  0x00002aaaac13435c in afr_unlink_wind_cbk (frame=0x2aaab14c3d48, cookie=0x1, this=0x612e10, op_ret=0, op_errno=0, preparent=0x7fff7dc7d4a0, postparent=0x7fff7dc7d430)
    at afr-dir-write.c:1669
#8  0x00002aaaabf14831 in client_unlink_cbk (frame=0x2aaab497c510, hdr=<value optimized out>, hdrlen=<value optimized out>, iobuf=<value optimized out>) at client-protocol.c:4357
#9  0x00002aaaabefd22a in protocol_client_pollin (this=0x610680, trans=0x620d30) at client-protocol.c:6435
#10 0x00002aaaabf0bba2 in notify (this=0x641230, event=2, data=0x620d30) at client-protocol.c:6554
#11 0x00002aaaaacdad13 in xlator_notify (xl=0x610680, event=2, data=0x620d30) at xlator.c:919
#12 0x00002aaaad44f0e3 in socket_event_handler (fd=<value optimized out>, idx=3, data=0x620d30, poll_in=1, poll_out=0, poll_err=0) at socket.c:831
#13 0x00002aaaaacf5dc5 in event_dispatch_epoll (event_pool=0x609360) at event.c:804
#14 0x0000000000404397 in main (argc=8, argv=0x7fff7dc7e3e8) at glusterfsd.c:1494



###########################################################################
###########################################################################

Looking deeper in the segment below:


#0  0x00002aaaad023c08 in __nfs3_fdcache_remove_entry (nfs3=0x641230, fde=0x0) at nfs3-helpers.c:1908
1908            gf_log (GF_NFS3, GF_LOG_TRACE, "Removing fd: 0x%lx: %d",
(gdb) fr 1
#1  0x00002aaaad023fa6 in nfs3_fdcache_remove (nfs3=0x641230, fd=0x2aaab0839ae0) at nfs3-helpers.c:1933
1933                    __nfs3_fdcache_remove_entry (nfs3, fde);
(gdb) list
1928
1929            LOCK (&nfs3->fdlrulock);
1930            {
1931                    fd_ctx_get (fd, nfs3->nfsx, &ctxaddr);
1932                    fde = (struct nfs3_fd_entry *)(long)ctxaddr;
1933                    __nfs3_fdcache_remove_entry (nfs3, fde);
1934            }
1935            UNLOCK (&nfs3->fdlrulock);
1936
1937            return 0;
(gdb) p *fd
$1 = {pid = 0, flags = 0, refcount = 1, flush_unique = 0, inode_list = {next = 0x2aaab0839848, prev = 0x2aaab0839848}, inode = 0x2aaab0839810, lock = 1, _ctx = 0x2aaab0839b50}
(gdb) p *fd->inode
$2 = {table = 0x61d280, lock = 1, nlookup = 0, generation = 5498616990052389063, in_attic = 0, ref = 8, ino = 32446706, ia_type = IA_IFREG, fd_list = {next = 0x2aaab0839af8, 
    prev = 0x2aaab0839af8}, dentry_list = {next = 0x2aaab0839858, prev = 0x2aaab0839858}, hash = {next = 0x2aaaaab514f0, prev = 0x2aaab12f4908}, list = {next = 0x2aaab07c2b28, 
    prev = 0x2aaab0839ca8}, _ctx = 0x2aaab08398a0}
(gdb) p *fd
$3 = {pid = 0, flags = 0, refcount = 1, flush_unique = 0, inode_list = {next = 0x2aaab0839848, prev = 0x2aaab0839848}, inode = 0x2aaab0839810, lock = 1, _ctx = 0x2aaab0839b50}

###########################################################################
###########################################################################


The fd_list in the inode and the inode_list in the fd are both empty, even though I got this fd here by doing an fd_lookup on the inode.

It looks like a race condition between the point where we do fd_lookup in nfs3_remove_cbk and the point where nfs3_fdcache_remove is called.
Comment 6 Shehjar Tikoo 2010-07-28 02:19:32 EDT
Here's the problem. Suppose a file named 1 is created and some data is written to it. After this, another 512 files are newly created and written to. When the 513th file is created and an fd_t is opened for it, 1's fd_t is evicted from the fd-lru and replaced by the 513th file's fd_t. This is the correct behaviour: the eviction unrefs the fd_t of 1, and the fd and all related state are freed.

But in some workloads, some refs are still pending even after the fd_t is removed from the LRU, so the fd remains bound to the inode. In nfs3svc_remove_cbk, while removing the inode state, we also ensure that any fd_ts cached for this inode are removed. When the fd_t has already been replaced by another while a ref still remains on it, this removal path crashes: fd_ctx_get yields a NULL value because the ctx was already deleted when this fd_t was evicted from the fd-lru earlier, and that NULL is then passed into the list-removal code. The patch fixes the crash by introducing a NULL check.
Comment 7 Shehjar Tikoo 2010-07-28 02:34:47 EDT
Regression test at:
http://test.gluster.com/show_bug.cgi?id=90
Comment 8 Anand Avati 2010-07-28 03:10:05 EDT
PATCH: http://patches.gluster.com/patch/3941 in master (nfs3: NULL fdentry check before removing from fdcache)