Bug 761905 (GLUSTER-173)

Summary: [ glusterfs 2.0.6rc1 ] - "server SEG fault"
Product: [Community] GlusterFS Reporter: Vinayak Hegde <vinayak>
Component: protocolAssignee: Basavanagowda Kanur <gowda>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: high Docs Contact:
Priority: low    
Version: 2.0.5CC: aavati, gluster-bugs, gowda, guru, vijay
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Vinayak Hegde 2009-07-29 07:50:17 UTC
3 server stripe
write behind and io-cache on client side

Error may not be reproducible.

Running tools were
exnihilate.sh,annihilate.sh,dbench,iozone



[root@brick6 /]# du -hs core.32202-brick6
9.0M    core.32202-brick6
[root@brick6 /]# ls -lh core.32202-brick6
-rw------- 1 root root 31M Jul 28 23:30 core.32202-brick6

The BT:

# /share/guru/analyse_core.sh core.32202-brick6
Core was generated by `/opt/glusterfs/2.0.6rc1/sbin/glusterfsd -f server_stripe.vol -l server_log'.
Program terminated with signal 11, Segmentation fault.
#0  0x00002ad610e44f3c in gf_print_trace (signum=11) at ../../../libglusterfs/src/common-utils.c:385
385                             if ((tmp->root->type == GF_OP_TYPE_FOP_REQUEST) ||
(gdb) bt
#0  0x00002ad610e44f3c in gf_print_trace (signum=11) at ../../../libglusterfs/src/common-utils.c:385
#1  <signal handler called>
#2  0x00002ad611905906 in posix_readdir (frame=0x2aaab004acb0, this=0xb6c3de0, fd=0x2aaab8053860, size=4096, off=4096) at ../../../../../xlators/storage/posix/src/posix.c:3524
#3  0x00002ad610e43a2d in default_readdir (frame=0x2aaab001f590, this=0xb6c47a0, fd=0x2aaab8053860, size=4096, off=4096) at ../../../libglusterfs/src/defaults.c:1400
#4  0x00002ad611d2147b in iot_readdir_wrapper (frame=0x2aaab8029180, this=0xb6c4fe0, fd=0x2aaab8053860, size=4096, offset=4096) at ../../../../../xlators/performance/io-threads/src/io-threads.c:1689
#5  0x00002ad610e52e3b in call_resume_wind (stub=0x2aaab8016660) at ../../../libglusterfs/src/call-stub.c:2661
#6  0x00002ad610e56fc2 in call_resume (stub=0x2aaab8016660) at ../../../libglusterfs/src/call-stub.c:4169
#7  0x00002ad611d21f93 in iot_worker_ordered (arg=0xb6c6850) at ../../../../../xlators/performance/io-threads/src/io-threads.c:1943
#8  0x0000003129c06307 in start_thread () from /lib64/libpthread.so.0
#9  0x00000031294d1ded in clone () from /lib64/libc.so.6
(gdb) thr
[Current thread is 1 (process 32206)]
(gdb) info thr
  10 process 32202  0x00000031294c441b in write () from /lib64/libc.so.6
  9 process 32203  0x00000031294983b1 in nanosleep () from /lib64/libc.so.6
  8 process 32204  0x0000003129c0a6a7 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  7 process 32205  0x0000003129c0a6a7 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  6 process 32207  0x0000003129c0a6a7 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  5 process 32208  0x0000003129c0a6a7 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  4 process 32209  0x0000003129c0a6a7 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3 process 32210  0x0000003129c0a6a7 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2 process 32211  0x0000003129c0a6a7 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
* 1 process 32206  0x00002ad610e44f3c in gf_print_trace (signum=11) at ../../../libglusterfs/src/common-utils.c:385
(gdb) thr 10
[Switching to thread 10 (process 32202)]#0  0x00000031294c441b in write () from /lib64/libc.so.6
(gdb) bt
#0  0x00000031294c441b in write () from /lib64/libc.so.6
#1  0x000000312946b643 in _IO_new_file_write () from /lib64/libc.so.6
#2  0x000000312946b556 in _IO_new_do_write () from /lib64/libc.so.6
#3  0x000000312946bb87 in _IO_new_file_sync () from /lib64/libc.so.6
#4  0x0000003129460d30 in fflush () from /lib64/libc.so.6
#5  0x00002ad610e3e2f5 in _gf_log (domain=0xb6c4fc0 "brick", file=0x2ad611f48aa0 "../../../../../xlators/protocol/server/src/server-protocol.c", function=0x2ad611f49e4b "server_flush", line=4110, 
    level=GF_LOG_ERROR, fmt=0x2ad611f48aeb "invalid argument: state->fd") at ../../../libglusterfs/src/logging.c:201
#6  0x00002ad611f388d8 in server_flush (frame=0x2aaab401e8b0, bound_xl=0xb6c4fe0, hdr=0x2aaab4018f70, hdrlen=48, iobuf=0x0) at ../../../../../xlators/protocol/server/src/server-protocol.c:4110
#7  0x00002ad611f43489 in protocol_server_interpret (this=0xb6c58d0, trans=0x2aaab400d800, hdr_p=0x2aaab4018f70 "", hdrlen=48, iobuf=0x0) at ../../../../../xlators/protocol/server/src/server-protocol.c:7474
#8  0x00002ad611f43fd5 in protocol_server_pollin (this=0xb6c58d0, trans=0x2aaab400d800) at ../../../../../xlators/protocol/server/src/server-protocol.c:7755
#9  0x00002ad611f441e4 in notify (this=0xb6c58d0, event=2, data=0x2aaab400d800) at ../../../../../xlators/protocol/server/src/server-protocol.c:7811
#10 0x00002aaaaaaaeba4 in socket_event_poll_in (this=0x2aaab400d800) at ../../../../transport/socket/src/socket.c:714
#11 0x00002aaaaaaaeea2 in socket_event_handler (fd=7, idx=2, data=0x2aaab400d800, poll_in=1, poll_out=0, poll_err=0) at ../../../../transport/socket/src/socket.c:814
#12 0x00002ad610e5b0dd in event_dispatch_epoll_handler (event_pool=0xb6be320, events=0xb6c8570, i=0) at ../../../libglusterfs/src/event.c:804
#13 0x00002ad610e5b2b2 in event_dispatch_epoll (event_pool=0xb6be320) at ../../../libglusterfs/src/event.c:867
#14 0x00002ad610e5b5c8 in event_dispatch (event_pool=0xb6be320) at ../../../libglusterfs/src/event.c:975
#15 0x00000000004053a9 in main (argc=5, argv=0x7fff99c80ee8) at ../../../glusterfsd/src/glusterfsd.c:1226

Comment 1 Basavanagowda Kanur 2009-07-29 10:54:40 UTC
few observations after analyzing the core:
1. readdir() was sent on a file-descriptor which refers to a file, not a directory.
2. process id (pid) of the process which sent readdir() does not match the process id (pid) of the process which owns the file-descriptor.
3. connection disconnect happened on client side, for the exact server which segfaulted.

suspecting client-side bug might have triggered server segfault.

in ideal situation, we should have encountered problem (segafault or other errors) on client side for client's fault. i'm making changes to protocol/server to ensure that the file-descriptor requestor owns the file-descriptor.

Comment 2 Anand Avati 2009-08-05 03:52:00 UTC
> i'm making changes to
> protocol/server to ensure that the file-descriptor requestor owns the
> file-descriptor.

I'm not sure how you plan to do this. As a filesystem we can only know the opener of the fd. It is very much legal for the non-openers of the fd to access it, since fds can be passed on to other processes via fork()s and unix fd passing.

Avati

Comment 3 Anand Avati 2009-08-07 05:24:16 UTC
PATCH: http://patches.gluster.com/patch/943 in master (protocol/client: fixed registration of saved_fds)

Comment 4 Anand Avati 2009-08-07 05:24:33 UTC
PATCH: http://patches.gluster.com/patch/943 in release-2.0 (protocol/client: fixed registration of saved_fds)