Bug 763181 (GLUSTER-1449)

Summary: NFS crash in nfs_fop_fsync_cbk
Product: [Community] GlusterFS
Reporter: Krishna Srinivas <krishna>
Component: nfs
Assignee: Shehjar Tikoo <shehjart>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: high
Docs Contact:
Priority: urgent
Version: nfs-alpha
CC: gluster-bugs, vijay
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: ---
Regression: RTP
Mount Type: nfs
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:

Description Krishna Srinivas 2010-08-26 18:44:02 EDT
Customer crash.
backtrace:
Core was generated by `/opt/glusterfs/sbin/glusterfs -f /etc/glusterfs/nfs.vol'.
Program terminated with signal 11, Segmentation fault.
#0  0x00002b3ea92e6cc4 in nfs_fop_fsync_cbk (frame=0x1d69bd18, cookie=0x1d204850, this=0x1d201060, op_ret=0, op_errno=0, prebuf=0x7fff966c1d90, postbuf=0x7fff966c1d20)
   at nfs-fops.c:1170
1170            nfs_fop_restore_root_ino (nfl, prebuf, postbuf, NULL, NULL);



(gdb) bt
#0  0x00002b3ea92e6cc4 in nfs_fop_fsync_cbk (frame=0x1d69bd18, cookie=0x1d204850, this=0x1d201060, op_ret=0, op_errno=0, prebuf=0x7fff966c1d90, postbuf=0x7fff966c1d20)
   at nfs-fops.c:1170
#1  0x00002b3ea90c6820 in iot_fsync_cbk (frame=0x1d6bb2e0, cookie=0x2aaac00de840, this=0x1d201060, op_ret=0, op_errno=0, prebuf=0x7fff966c1d90, postbuf=0x7fff966c1d20)
   at io-threads.c:893
#2  0x00002b3ea8eb0019 in client_fsync_cbk (frame=0x2aaac00de840, hdr=0x2aaac8006e70, hdrlen=268, iobuf=0x0) at client-protocol.c:4324
#3  0x00002b3ea8eb5af8 in protocol_client_interpret (this=0x1d1f9c00, trans=0x2aaaac0048e0, hdr_p=0x2aaac8006e70 "", hdrlen=268, iobuf=0x0) at client-protocol.c:6137
#4  0x00002b3ea8eb67be in protocol_client_pollin (this=0x1d1f9c00, trans=0x2aaaac0048e0) at client-protocol.c:6435
#5  0x00002b3ea8eb6e35 in notify (this=0x1d1f9c00, event=2, data=0x2aaaac0048e0) at client-protocol.c:6554
#6  0x00002b3ea83c5b7c in xlator_notify (xl=0x1d1f9c00, event=2, data=0x2aaaac0048e0) at xlator.c:919
#7  0x00002aaaaaf09e96 in socket_event_poll_in (this=0x2aaaac0048e0) at socket.c:731
#8  0x00002aaaaaf0a18b in socket_event_handler (fd=16, idx=8, data=0x2aaaac0048e0, poll_in=1, poll_out=0, poll_err=0) at socket.c:831
#9  0x00002b3ea83ec2b9 in event_dispatch_epoll_handler (event_pool=0x1d1f18b0, events=0x2aaaac009b20, i=0) at event.c:804
#10 0x00002b3ea83ec48e in event_dispatch_epoll (event_pool=0x1d1f18b0) at event.c:867
#11 0x00002b3ea83ec7a4 in event_dispatch (event_pool=0x1d1f18b0) at event.c:975
#12 0x0000000000406344 in main (argc=3, argv=0x7fff966c29f8) at glusterfsd.c:1494


(gdb) p *nfl
Cannot access memory at address 0x1d73bc40

looking at the code:
nfs_fop_fsync_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                   int32_t op_ret, int32_t op_errno, struct iatt *prebuf,
                   struct iatt *postbuf)
{
        struct nfs_fop_local    *nfl = NULL;
        fop_fsync_cbk_t         progcbk = NULL;

        nfl_to_prog_data (nfl, progcbk, frame);
        nfs_fop_restore_root_ino (nfl, prebuf, postbuf, NULL, NULL);

nfl_to_prog_data() does a mem_put() of nfl, after which nfl is dereferenced in nfs_fop_restore_root_ino(), which can cause a segfault.
Comment 1 Shehjar Tikoo 2010-08-26 23:14:45 EDT
How to reproduce? How often does it happen?

Is this mainline or nfs-beta branch?

Try exporting from gnfs using the trusted-write option as a work-around.

Yes, you're right, it may just be the nfl access after mem-put. Checking it out..
Comment 2 Krishna Srinivas 2010-08-26 23:36:34 EDT
(In reply to comment #1)
> How to reproduce? How often does it happen?
> 

This is a customer crash. Highly critical. It happened when there was a lot of I/O. No other known trigger.

> Is this mainline or nfs-beta branch?

This is rc8.

> Try exporting from gnfs using the trusted-write option as a work-around.

Are you sure this will fix the problem? Because we will still be accessing invalid memory after mem_put().

> 
> Yes, you're right, it may just be the nfl access after mem-put. Checking it
> out..

It is definitely due to that, see this:

(gdb) p *nfl
Cannot access memory at address 0x1d73bc40
Comment 3 Shehjar Tikoo 2010-08-26 23:46:03 EDT
(In reply to comment #2)
> (In reply to comment #1)
...
...
> > Try exporting from gnfs using the trusted-write option as a work-around.
> 
> Are you sure this will fix the problem? because we will still be accessing
> invalid memory after mem_put()
> 

Yes, at least for fsync because the client will not send COMMIT requests which translate to fsync fop.
Comment 4 Vijay Bellur 2010-08-31 07:44:47 EDT
PATCH: http://patches.gluster.com/patch/4422 in master (nfs: Free fop local only after inode checks)
Comment 5 Shehjar Tikoo 2010-08-31 22:43:20 EDT
The mem-pool starts CALLOC-ing when the pool overflows. For this data structure the pool will overflow only under very high load, and only then does the dereference actually hit a freed area.

Keeping this unresolved until I figure out a way to reproduce it without the need for very high load.