Bug 763151 (GLUSTER-1419) - GNFS crashes while running SFS2008
Summary: GNFS crashes while running SFS2008
Keywords:
Status: CLOSED WONTFIX
Alias: GLUSTER-1419
Product: GlusterFS
Classification: Community
Component: protocol
Version: nfs-alpha
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Amar Tumballi
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-08-23 09:49 UTC by Prithu Tiwari
Modified: 2015-12-01 16:45 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: RTP
Mount Type: nfs
Documentation: ---
CRM:
Verified Versions:



Description Shehjar Tikoo 2010-08-23 06:59:22 UTC
Backtrace:
Program terminated with signal 11, Segmentation fault.
#0  client_readv_cbk (frame=0x2aaab0d965d0, hdr=0x2aaaac399f10, hdrlen=<value optimized out>, iobuf=0x0) at client-protocol.c:4188
4188	                        vector.iov_base = iobuf->ptr;
(gdb) bt
#0  client_readv_cbk (frame=0x2aaab0d965d0, hdr=0x2aaaac399f10, hdrlen=<value optimized out>, iobuf=0x0) at client-protocol.c:4188
#1  0x00002b9101b3e1ba in protocol_client_pollin (this=0x143c9010, trans=0x143d1500) at client-protocol.c:6435
#2  0x00002b9101b4cb52 in notify (this=0xd9, event=2, data=0x143d1500) at client-protocol.c:6554
#3  0x00002b910107b433 in xlator_notify (xl=0x143c9010, event=2, data=0x143d1500) at xlator.c:919
#4  0x00002aaaaab9b073 in socket_event_handler (fd=<value optimized out>, idx=4, data=0x143d1500, poll_in=1, poll_out=0, poll_err=0)
    at socket.c:831
#5  0x00002b91010964e5 in event_dispatch_epoll (event_pool=0x143c1350) at event.c:804
#6  0x0000000000404367 in main (argc=5, argv=0x7fff0dcf97f8) at glusterfsd.c:1494
(gdb) p *vectore
No symbol "vectore" in current context.
(gdb) p *vector
Structure has no component named operator*.
(gdb) p vector
$1 = {iov_base = 0x0, iov_len = 7893}
(gdb) fr 1
#1  0x00002b9101b3e1ba in protocol_client_pollin (this=0x143c9010, trans=0x143d1500) at client-protocol.c:6435
6435	                ret = protocol_client_interpret (this, trans, hdr, hdrlen,
(gdb) list
6430	
6431	        ret = transport_receive (trans, &hdr, &hdrlen, &iobuf);
6432	
6433	        if (ret == 0)
6434	        {
6435	                ret = protocol_client_interpret (this, trans, hdr, hdrlen,
6436	                                                 iobuf);
6437	        }
6438	
6439	        /* TODO: use mem-pool */
(gdb) list transport_receive
319	
320	
321	int32_t
322	transport_receive (transport_t *this, char **hdr_p, size_t *hdrlen_p,
323			   struct iobuf **iobuf_p)
324	{
325		int32_t ret = -1;
326	
327		GF_VALIDATE_OR_GOTO("transport", this, fail);
328	
(gdb) 
329	        if (this->peer_trans) {
330	                *hdr_p = this->handover.msg->hdr;
331	                *hdrlen_p = this->handover.msg->hdrlen;
332	                *iobuf_p = this->handover.msg->iobuf;
333	
334	                return 0;
335	        }
336	
337		ret = this->ops->receive (this, hdr_p, hdrlen_p, iobuf_p);
338	fail:
(gdb)

Comment 1 Shehjar Tikoo 2010-08-23 07:00:52 UTC
The symptom is that the transport code is receiving a NULL iobuf even though op_ret says that the readv fop has returned with over 7000 bytes of data.
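A minimal defensive guard at the crashing line would look roughly like the sketch below. This is illustrative only, not the actual patch: it assumes op_ret has already been decoded from the reply header (as the rest of client_readv_cbk does) and that the client xlator is reachable as frame->this, and the real fix still has to explain why transport_receive hands up a NULL iobuf for a readv that returned data.

        /* Sketch for client-protocol.c, client_readv_cbk: refuse to
         * dereference a NULL iobuf and fail the read with EIO instead
         * of segfaulting on vector.iov_base = iobuf->ptr. */
        if (op_ret > 0) {
                if (iobuf == NULL) {
                        gf_log (frame->this->name, GF_LOG_ERROR,
                                "readv reply carries %d bytes but iobuf is NULL",
                                op_ret);
                        op_ret   = -1;
                        op_errno = EIO;
                } else {
                        vector.iov_base = iobuf->ptr;
                        vector.iov_len  = op_ret;
                }
        }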

Comment 2 Prithu Tiwari 2010-08-23 09:49:07 UTC
GNFS crashed after running SFS2008 for some time.

The volfile is:
-------------------------------------------------
volume brick5
        type protocol/client
        option transport-type socket
        option transport.socket.remote-port 7001
        option remote-host 10.3.10.15
        option remote-subvolume posix1-locked-iot
end-volume

volume brick6
        type protocol/client
        option transport-type socket
        option transport.socket.remote-port 7001
        option remote-host 10.3.10.16
        option remote-subvolume posix1-locked-iot
end-volume

volume brick7
        type protocol/client
        option transport-type socket
        option transport.socket.remote-port 7001
        option remote-host 10.3.10.17
        option remote-subvolume posix1-locked-iot
end-volume

volume brick8
        type protocol/client
        option transport-type socket
        option transport.socket.remote-port 7001
        option remote-host 10.3.10.18
        option remote-subvolume posix1-locked-iot
end-volume

volume dist
        type cluster/distribute
        subvolumes brick5 brick6 brick7 brick8
end-volume

volume for-wb
        type performance/io-threads
        subvolumes dist
end-volume

volume distribute
        type performance/write-behind
        option window-size 1Gb
        subvolumes for-wb
end-volume

#volume distribute
#       type performance/read-ahead
#       subvolumes for-ra
#end-volume

volume nfsserver 
        type nfs/server
        subvolumes distribute
        option rpc-auth.addr.allow *
end-volume
--------------------------------------------------------------------------

The log file is:
--------------------------------------------------------------------------
[2010-08-22 22:11:21] W [xlator.c:651:validate_xlator_volume_options] distribute: option 'window-size' is deprecated, preferred is 'cache-size', continuing with correction
[2010-08-22 22:11:21] W [xlator.c:651:validate_xlator_volume_options] brick8: option 'transport.socket.remote-port' is deprecated, preferred is 'remote-port', continuing with correction
[2010-08-22 22:11:21] W [xlator.c:651:validate_xlator_volume_options] brick7: option 'transport.socket.remote-port' is deprecated, preferred is 'remote-port', continuing with correction
[2010-08-22 22:11:21] W [xlator.c:651:validate_xlator_volume_options] brick6: option 'transport.socket.remote-port' is deprecated, preferred is 'remote-port', continuing with correction
[2010-08-22 22:11:21] W [xlator.c:651:validate_xlator_volume_options] brick5: option 'transport.socket.remote-port' is deprecated, preferred is 'remote-port', continuing with correction
[2010-08-22 22:11:21] N [glusterfsd.c:1477:main] glusterfs: Successfully started
[2010-08-22 22:11:21] N [client-protocol.c:5857:client_setvolume_cbk] brick7: Connected to 10.3.10.17:7001, attached to remote volume 'posix1-locked-iot'.
[2010-08-22 22:11:21] N [client-protocol.c:5857:client_setvolume_cbk] brick8: Connected to 10.3.10.18:7001, attached to remote volume 'posix1-locked-iot'.
[2010-08-22 22:11:21] N [client-protocol.c:5857:client_setvolume_cbk] brick5: Connected to 10.3.10.15:7001, attached to remote volume 'posix1-locked-iot'.
[2010-08-22 22:11:21] N [client-protocol.c:5857:client_setvolume_cbk] brick6: Connected to 10.3.10.16:7001, attached to remote volume 'posix1-locked-iot'.
[2010-08-22 22:11:21] N [client-protocol.c:5857:client_setvolume_cbk] brick5: Connected to 10.3.10.15:7001, attached to remote volume 'posix1-locked-iot'.
[2010-08-22 22:11:21] N [client-protocol.c:5857:client_setvolume_cbk] brick7: Connected to 10.3.10.17:7001, attached to remote volume 'posix1-locked-iot'.
[2010-08-22 22:11:21] N [client-protocol.c:5857:client_setvolume_cbk] brick7: Connected to 10.3.10.17:7001, attached to remote volume 'posix1-locked-iot'.
[2010-08-22 22:11:21] N [client-protocol.c:5857:client_setvolume_cbk] brick8: Connected to 10.3.10.18:7001, attached to remote volume 'posix1-locked-iot'.
[2010-08-22 22:11:21] N [client-protocol.c:5857:client_setvolume_cbk] brick6: Connected to 10.3.10.16:7001, attached to remote volume 'posix1-locked-iot'.
pending frames:  

patchset: v3.0.0-245-g849f5ec
signal received: 11
time of crash: 2010-08-22 22:29:02
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs nfs_beta_rc10
/lib64/libc.so.6[0x32df6302d0]
/opt/gnfs/lib/glusterfs/nfs_beta_rc10/xlator/protocol/client.so(client_readv_cbk+0x2a2)[0x2b9101b56a02]
/opt/gnfs/lib/glusterfs/nfs_beta_rc10/xlator/protocol/client.so(protocol_client_pollin+0xca)[0x2b9101b3e1ba]
/opt/gnfs/lib/glusterfs/nfs_beta_rc10/xlator/protocol/client.so(notify+0x212)[0x2b9101b4cb52]
/opt/gnfs/lib/libglusterfs.so.0(xlator_notify+0x43)[0x2b910107b433]
/opt/gnfs/lib/glusterfs/nfs_beta_rc10/transport/socket.so(socket_event_handler+0xd3)[0x2aaaaab9b073]
/opt/gnfs/lib/libglusterfs.so.0[0x2b91010964e5]
/opt/gnfs/sbin/glusterfs(main+0xb17)[0x404367]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x32df61d994]
/opt/gnfs/sbin/glusterfs[0x4027a9]
---------
-------------------------------------------------------------------------------


The core file is at gluster.163.210:/gluster/pbt/core.23900.tbz.

Comment 3 Amar Tumballi 2010-08-29 08:40:05 UTC
Shehjar, in 3.1 we will no longer have this protocol code, so this bug may not be valid anymore. What do you want to do with it?

Comment 4 Shehjar Tikoo 2010-08-31 03:52:51 UTC
We ran into this again while running SFS:

Program terminated with signal 11, Segmentation fault.
#0  client_readv_cbk (frame=0x2aaab8efea00, hdr=0x2aaab611e390, hdrlen=<value optimized out>, iobuf=0x0) at client-protocol.c:4188
4188	                        vector.iov_base = iobuf->ptr;
(gdb) bt
#0  client_readv_cbk (frame=0x2aaab8efea00, hdr=0x2aaab611e390, hdrlen=<value optimized out>, iobuf=0x0) at client-protocol.c:4188
#1  0x00002ad5510d61ba in protocol_client_pollin (this=0x800e70, trans=0x80cc00) at client-protocol.c:6435
#2  0x00002ad5510e4b52 in notify (this=0x94, event=2, data=0x80cc00) at client-protocol.c:6554
#3  0x00002ad550613433 in xlator_notify (xl=0x800e70, event=2, data=0x80cc00) at xlator.c:919
#4  0x00002aaaaab9b073 in socket_event_handler (fd=<value optimized out>, idx=0, data=0x80cc00, poll_in=1, poll_out=0, poll_err=0)
    at socket.c:831
#5  0x00002ad55062e4e5 in event_dispatch_epoll (event_pool=0x7fb350) at event.c:804
#6  0x0000000000404367 in main (argc=5, argv=0x7fffc4bd2e28) at glusterfsd.c:1494

Comment 5 Shehjar Tikoo 2010-08-31 03:58:51 UTC
(In reply to comment #3)
> Shehjar, in 3.1 we will no longer have this protocol code, so this bug may
> not be valid anymore. What do you want to do with it?

Hi Amar,

This bug is turning out to be a blocker for SFS tests on the nfs-beta branch. If a quick fix is possible, let's check that out; otherwise, I think we can safely keep this at low priority, unless of course AB says that SFS over the nfs-beta branch is a priority. This crash does not happen on any other tests with nfs-beta-rcX, only with SFS.

Comment 6 Prithu Tiwari 2010-09-02 00:24:35 UTC
The crash did not occur when I ran GNFS with the following command:

./dsh tc4 "export GLUSTERFS_DISABLE_MEM_ACCT=1;/opt/gnfs/sbin/glusterfs -f /share/shehjart/volfiles/gnfs-1v-4d.vol -l /tmp/gnnnn3"

The SFS test ran to completion, with all 5 iterations finishing.

Performance is slightly lower, but more investigation is needed before that can be confirmed.

Comment 7 Prithu Tiwari 2010-09-02 00:30:28 UTC
Sorry, the last comment was meant for BUG-1499 :P

Comment 8 Amar Tumballi 2010-09-07 04:53:41 UTC
Reducing the priority, as we have removed the legacy protocol from the build and SFS testing work has started over NFS in mainline.

Comment 9 Amar Tumballi 2010-09-17 03:45:25 UTC
Shehjar/Prithu ji,

I will close this bug, as NFS has now started working on mainline and the legacy xlator has been removed from the build. This particular backtrace is no longer valid.

Please file a new bug for any mainline failures.

-Amar

