| Summary: | GNFS crashes while running SFS2008 | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Prithu Tiwari <prithu> |
| Component: | protocol | Assignee: | Amar Tumballi <amarts> |
| Status: | CLOSED WONTFIX | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | nfs-alpha | CC: | ab, gluster-bugs, shehjart, vraman |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | --- |
| Regression: | RTP | Mount Type: | nfs |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
GNFS crashed after running SFS2008 for some time. The symptom is that the transport code receives a NULL iobuf even though op_ret says the readv fop returned over 7000 bytes of data.
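To illustrate why that state is fatal, here is a minimal standalone C sketch — not the glusterfs source, all names except the iobuf->ptr dereference (taken verbatim from the backtraces below) are stand-ins — of the defensive check the callback would need before touching the buffer:
--------------------------------------------------
#include <errno.h>
#include <stdio.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Stand-in for the glusterfs iobuf; only the 'ptr' field from the
 * crashing line is modelled. */
struct iobuf {
        void *ptr;
};

/* Sketch of the missing guard: iobuf can arrive NULL even when
 * op_ret reports a positive byte count. */
static int
readv_cbk_sketch (struct iobuf *iobuf, ssize_t op_ret)
{
        struct iovec vector = { 0 };

        if (op_ret > 0 && iobuf == NULL) {
                fprintf (stderr, "readv: op_ret=%zd but iobuf is NULL; "
                         "failing the fop with EIO instead of crashing\n",
                         op_ret);
                return -EIO;
        }

        if (op_ret > 0) {
                /* Without the guard above, this is the segfaulting
                 * dereference from the backtrace (iobuf == 0x0). */
                vector.iov_base = iobuf->ptr;
                vector.iov_len  = op_ret;
        }

        (void) vector;  /* a real callback would hand this up the stack */
        return 0;
}

int
main (void)
{
        /* Reproduce the reported state: 7893 bytes claimed, no buffer. */
        return (readv_cbk_sketch (NULL, 7893) == -EIO) ? 0 : 1;
}
--------------------------------------------------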
The volfile is:
-------------------------------------------------
volume brick5
type protocol/client
option transport-type socket
option transport.socket.remote-port 7001
option remote-host 10.3.10.15
option remote-subvolume posix1-locked-iot
end-volume
volume brick6
type protocol/client
option transport-type socket
option transport.socket.remote-port 7001
option remote-host 10.3.10.16
option remote-subvolume posix1-locked-iot
end-volume
volume brick7
type protocol/client
option transport-type socket
option transport.socket.remote-port 7001
option remote-host 10.3.10.17
option remote-subvolume posix1-locked-iot
end-volume
volume brick8
type protocol/client
option transport-type socket
option transport.socket.remote-port 7001
option remote-host 10.3.10.18
option remote-subvolume posix1-locked-iot
end-volume
volume dist
type cluster/distribute
subvolumes brick5 brick6 brick7 brick8
end-volume
volume for-wb
type performance/io-threads
subvolumes dist
end-volume
volume distribute
type performance/write-behind
option window-size 1Gb
subvolumes for-wb
end-volume
#volume distribute
# type performance/read-ahead
# subvolumes for-ra
#end-volume
volume nfsserver
type nfs/server
subvolumes distribute
option rpc-auth.addr.allow *
end-volume
--------------------------------------------------------------------------
The log file is:
--------------------------------------------------------------------------
[2010-08-22 22:11:21] W [xlator.c:651:validate_xlator_volume_options] distribute: option 'window-size' is deprecated, preferred is 'cache-size', continuing with correction
[2010-08-22 22:11:21] W [xlator.c:651:validate_xlator_volume_options] brick8: option 'transport.socket.remote-port' is deprecated, preferred is 'remote-port', continuing with correction
[2010-08-22 22:11:21] W [xlator.c:651:validate_xlator_volume_options] brick7: option 'transport.socket.remote-port' is deprecated, preferred is 'remote-port', continuing with correction
[2010-08-22 22:11:21] W [xlator.c:651:validate_xlator_volume_options] brick6: option 'transport.socket.remote-port' is deprecated, preferred is 'remote-port', continuing with correction
[2010-08-22 22:11:21] W [xlator.c:651:validate_xlator_volume_options] brick5: option 'transport.socket.remote-port' is deprecated, preferred is 'remote-port', continuing with correction
[2010-08-22 22:11:21] N [glusterfsd.c:1477:main] glusterfs: Successfully started
[2010-08-22 22:11:21] N [client-protocol.c:5857:client_setvolume_cbk] brick7: Connected to 10.3.10.17:7001, attached to remote volume 'posix1-locked-iot'.
[2010-08-22 22:11:21] N [client-protocol.c:5857:client_setvolume_cbk] brick8: Connected to 10.3.10.18:7001, attached to remote volume 'posix1-locked-iot'.
[2010-08-22 22:11:21] N [client-protocol.c:5857:client_setvolume_cbk] brick5: Connected to 10.3.10.15:7001, attached to remote volume 'posix1-locked-iot'.
[2010-08-22 22:11:21] N [client-protocol.c:5857:client_setvolume_cbk] brick6: Connected to 10.3.10.16:7001, attached to remote volume 'posix1-locked-iot'.
[2010-08-22 22:11:21] N [client-protocol.c:5857:client_setvolume_cbk] brick5: Connected to 10.3.10.15:7001, attached to remote volume 'posix1-locked-iot'.
[2010-08-22 22:11:21] N [client-protocol.c:5857:client_setvolume_cbk] brick7: Connected to 10.3.10.17:7001, attached to remote volume 'posix1-locked-iot'.
[2010-08-22 22:11:21] N [client-protocol.c:5857:client_setvolume_cbk] brick7: Connected to 10.3.10.17:7001, attached to remote volume 'posix1-locked-iot'.
[2010-08-22 22:11:21] N [client-protocol.c:5857:client_setvolume_cbk] brick8: Connected to 10.3.10.18:7001, attached to remote volume 'posix1-locked-iot'.
[2010-08-22 22:11:21] N [client-protocol.c:5857:client_setvolume_cbk] brick6: Connected to 10.3.10.16:7001, attached to remote volume 'posix1-locked-iot'.
pending frames:
patchset: v3.0.0-245-g849f5ec
signal received: 11
time of crash: 2010-08-22 22:29:02
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs nfs_beta_rc10
/lib64/libc.so.6[0x32df6302d0]
/opt/gnfs/lib/glusterfs/nfs_beta_rc10/xlator/protocol/client.so(client_readv_cbk+0x2a2)[0x2b9101b56a02]
/opt/gnfs/lib/glusterfs/nfs_beta_rc10/xlator/protocol/client.so(protocol_client_pollin+0xca)[0x2b9101b3e1ba]
/opt/gnfs/lib/glusterfs/nfs_beta_rc10/xlator/protocol/client.so(notify+0x212)[0x2b9101b4cb52]
/opt/gnfs/lib/libglusterfs.so.0(xlator_notify+0x43)[0x2b910107b433]
/opt/gnfs/lib/glusterfs/nfs_beta_rc10/transport/socket.so(socket_event_handler+0xd3)[0x2aaaaab9b073]
/opt/gnfs/lib/libglusterfs.so.0[0x2b91010964e5]
/opt/gnfs/sbin/glusterfs(main+0xb17)[0x404367]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x32df61d994]
/opt/gnfs/sbin/glusterfs[0x4027a9]
---------
-------------------------------------------------------------------------------
The core file is at gluster.163.210:/gluster/pbt/core.23900.tbz.
Shehjar, in 3.1 we will not have this protocol code, so this bug may no longer be valid. What do you want to do with this?

We ran into this again while running SFS:
Program terminated with signal 11, Segmentation fault.
#0 client_readv_cbk (frame=0x2aaab8efea00, hdr=0x2aaab611e390, hdrlen=<value optimized out>, iobuf=0x0) at client-protocol.c:4188
4188 vector.iov_base = iobuf->ptr;
(gdb) bt
#0 client_readv_cbk (frame=0x2aaab8efea00, hdr=0x2aaab611e390, hdrlen=<value optimized out>, iobuf=0x0) at client-protocol.c:4188
#1 0x00002ad5510d61ba in protocol_client_pollin (this=0x800e70, trans=0x80cc00) at client-protocol.c:6435
#2 0x00002ad5510e4b52 in notify (this=0x94, event=2, data=0x80cc00) at client-protocol.c:6554
#3 0x00002ad550613433 in xlator_notify (xl=0x800e70, event=2, data=0x80cc00) at xlator.c:919
#4 0x00002aaaaab9b073 in socket_event_handler (fd=<value optimized out>, idx=0, data=0x80cc00, poll_in=1, poll_out=0, poll_err=0)
at socket.c:831
#5 0x00002ad55062e4e5 in event_dispatch_epoll (event_pool=0x7fb350) at event.c:804
#6 0x0000000000404367 in main (argc=5, argv=0x7fffc4bd2e28) at glusterfsd.c:1494
(In reply to comment #3)
> Shehjar, in 3.1 we will not have this protocol code, so this bug may no
> longer be valid. What do you want to do with this?

Hi Amar, this bug is turning out to be a blocker for SFS tests on the nfs-beta branch. If a quick fix is possible, let's check that out; otherwise, I think we can safely keep this at low priority, unless of course AB says that SFS over the nfs-beta branch is a priority. This crash does not happen on any other tests with nfs-beta-rcX, only with SFS.

The crash did not occur when I ran GNFS with the following command:

./dsh tc4 "export GLUSTERFS_DISABLE_MEM_ACCT=1;/opt/gnfs/sbin/glusterfs -f /share/shehjart/volfiles/gnfs-1v-4d.vol -l /tmp/gnnnn3"

The SFS test ran to completion, with all 5 iterations finishing. Performance was slightly slower, but more investigation is needed before that can be confirmed.

Sorry, the last comment was meant for BUG-1499 :P

Reducing the priority, as we have removed the legacy protocol from the build and SFS work has started over NFS in mainline.

Shehjar/Prithu ji, I will close this bug, as NFS now works on mainline and the legacy xlator has been removed from the build. This particular backtrace snapshot is no longer valid. Please file a new bug for mainline failures. -Amar
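For context on the workaround in the comments above: setting GLUSTERFS_DISABLE_MEM_ACCT=1 in the daemon's environment disables glusterfs' memory-accounting wrappers, so the crash vanishing under it hints that the NULL iobuf is tied to the accounting-enabled allocation path. A minimal sketch of an env-var gate of that shape, with hypothetical function names (not the glusterfs API):
--------------------------------------------------
/* Hypothetical illustration only: these names are NOT the glusterfs
 * API, just a sketch of how such an environment gate behaves. */
#include <stdio.h>
#include <stdlib.h>

static int
mem_acct_enabled (void)
{
        /* Accounting stays on unless GLUSTERFS_DISABLE_MEM_ACCT is set
         * in the daemon's environment, as in the dsh command above. */
        return getenv ("GLUSTERFS_DISABLE_MEM_ACCT") == NULL;
}

int
main (void)
{
        printf ("memory accounting: %s\n",
                mem_acct_enabled () ? "enabled" : "disabled");
        return 0;
}
--------------------------------------------------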
Backtrace:

Program terminated with signal 11, Segmentation fault.
#0  client_readv_cbk (frame=0x2aaab0d965d0, hdr=0x2aaaac399f10, hdrlen=<value optimized out>, iobuf=0x0) at client-protocol.c:4188
4188            vector.iov_base = iobuf->ptr;
(gdb) bt
#0  client_readv_cbk (frame=0x2aaab0d965d0, hdr=0x2aaaac399f10, hdrlen=<value optimized out>, iobuf=0x0) at client-protocol.c:4188
#1  0x00002b9101b3e1ba in protocol_client_pollin (this=0x143c9010, trans=0x143d1500) at client-protocol.c:6435
#2  0x00002b9101b4cb52 in notify (this=0xd9, event=2, data=0x143d1500) at client-protocol.c:6554
#3  0x00002b910107b433 in xlator_notify (xl=0x143c9010, event=2, data=0x143d1500) at xlator.c:919
#4  0x00002aaaaab9b073 in socket_event_handler (fd=<value optimized out>, idx=4, data=0x143d1500, poll_in=1, poll_out=0, poll_err=0) at socket.c:831
#5  0x00002b91010964e5 in event_dispatch_epoll (event_pool=0x143c1350) at event.c:804
#6  0x0000000000404367 in main (argc=5, argv=0x7fff0dcf97f8) at glusterfsd.c:1494
(gdb) p *vectore
No symbol "vectore" in current context.
(gdb) p *vector
Structure has no component named operator*.
(gdb) p vector
$1 = {iov_base = 0x0, iov_len = 7893}
(gdb) fr 1
#1  0x00002b9101b3e1ba in protocol_client_pollin (this=0x143c9010, trans=0x143d1500) at client-protocol.c:6435
6435                    ret = protocol_client_interpret (this, trans, hdr, hdrlen,
(gdb) list
6430
6431            ret = transport_receive (trans, &hdr, &hdrlen, &iobuf);
6432
6433            if (ret == 0)
6434            {
6435                    ret = protocol_client_interpret (this, trans, hdr, hdrlen,
6436                                                     iobuf);
6437            }
6438
6439            /* TODO: use mem-pool */
(gdb) list transport_receive
319
320
321     int32_t
322     transport_receive (transport_t *this, char **hdr_p, size_t *hdrlen_p,
323                        struct iobuf **iobuf_p)
324     {
325             int32_t ret = -1;
326
327             GF_VALIDATE_OR_GOTO("transport", this, fail);
328
(gdb)
329             if (this->peer_trans) {
330                     *hdr_p = this->handover.msg->hdr;
331                     *hdrlen_p = this->handover.msg->hdrlen;
332                     *iobuf_p = this->handover.msg->iobuf;
333
334                     return 0;
335             }
336
337             ret = this->ops->receive (this, hdr_p, hdrlen_p, iobuf_p);
338     fail:
(gdb)
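Reading the listing above, one plausible origin for the NULL is the peer_trans handover branch of transport_receive(): it copies handover.msg->iobuf out verbatim and returns 0, so a NULL stored there reaches client_readv_cbk unchecked. The standalone sketch below (an assumed simplification, not the real code) models that propagation and marks one place a guard could go:
--------------------------------------------------
#include <stdio.h>
#include <stddef.h>

struct iobuf {
        void *ptr;
};

/* Stand-in for the handover message from the listing: hdr, hdrlen and
 * iobuf are copied to the caller verbatim. */
struct msg {
        char         *hdr;
        size_t        hdrlen;
        struct iobuf *iobuf;
};

/* Models the peer_trans branch of transport_receive(): it reports
 * success even when the stored iobuf is NULL, so the NULL flows on
 * to protocol_client_interpret() and then client_readv_cbk(). */
static int
transport_receive_sketch (struct msg *handover, char **hdr_p,
                          size_t *hdrlen_p, struct iobuf **iobuf_p)
{
        *hdr_p    = handover->hdr;
        *hdrlen_p = handover->hdrlen;
        *iobuf_p  = handover->iobuf;
        return 0;
}

int
main (void)
{
        static char   hdr[]    = "hdr";
        struct msg    handover = { hdr, sizeof (hdr), NULL };
        char         *hdr_out  = NULL;
        size_t        hdrlen   = 0;
        struct iobuf *iobuf    = NULL;

        if (transport_receive_sketch (&handover, &hdr_out, &hdrlen,
                                      &iobuf) == 0 && iobuf == NULL) {
                /* One plausible guard point: reject the message here,
                 * before iobuf->ptr is ever dereferenced. */
                fprintf (stderr, "receive returned 0 but iobuf is NULL\n");
                return 1;
        }
        return 0;
}
--------------------------------------------------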