Description of problem:
glusterfs server crashed due to memory corruption with the below backtrace.

Core was generated by `/usr/local/sbin/glusterfsd -s localhost --volfile-id vol.hyperspace.mnt-sda7-ex'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f7455b40a56 in rpcsvc_get_listener (svc=0x1a83610, port=65535, trans=0x1a90440)
    at ../../../../rpc/rpc-lib/src/rpcsvc.c:1223
1223                            if (listener->trans == trans) {
(gdb) bt
#0  0x00007f7455b40a56 in rpcsvc_get_listener (svc=0x1a83610, port=65535, trans=0x1a90440)
    at ../../../../rpc/rpc-lib/src/rpcsvc.c:1223
#1  0x00007f7455b3f573 in rpcsvc_notify (trans=0x1a986f0, mydata=0x1a83610, event=RPC_TRANSPORT_CLEANUP, data=0x0)
    at ../../../../rpc/rpc-lib/src/rpcsvc.c:636
#2  0x00007f7455b44ef8 in rpc_transport_unref (this=0x1a986f0) at ../../../../rpc/rpc-lib/src/rpc-transport.c:479
#3  0x00007f745208d9b1 in socket_event_handler (fd=8, idx=3, data=0x1a986f0, poll_in=1, poll_out=0, poll_err=16)
    at ../../../../../rpc/rpc-transport/socket/src/socket.c:1809
#4  0x00007f7455d9fac1 in event_dispatch_epoll_handler (event_pool=0x1a7e3a0, events=0x1a97830, i=0)
    at ../../../libglusterfs/src/event.c:794
#5  0x00007f7455d9fcdb in event_dispatch_epoll (event_pool=0x1a7e3a0) at ../../../libglusterfs/src/event.c:856
#6  0x00007f7455da004d in event_dispatch (event_pool=0x1a7e3a0) at ../../../libglusterfs/src/event.c:956
#7  0x000000000040857f in main (argc=19, argv=0x7fff83788c18) at ../../../glusterfsd/src/glusterfsd.c:1652
(gdb) info thr
  7 Thread 29231  0x00007f74550b3cf7 in unlink () at ../sysdeps/unix/syscall-template.S:82
  6 Thread 29235  0x00007f745570d4bd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
  5 Thread 29262  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:216
  4 Thread 29232  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  3 Thread 29263  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:216
  2 Thread 29233  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
* 1 Thread 29230  0x00007f7455b40a56 in rpcsvc_get_listener (svc=0x1a83610, port=65535, trans=0x1a90440)
    at ../../../../rpc/rpc-lib/src/rpcsvc.c:1223
(gdb) f 0
#0  0x00007f7455b40a56 in rpcsvc_get_listener (svc=0x1a83610, port=65535, trans=0x1a90440)
    at ../../../../rpc/rpc-lib/src/rpcsvc.c:1223
1223                            if (listener->trans == trans) {
(gdb) p listener->trans
Cannot access memory at address 0x676e696461657190
(gdb) p *listener
Cannot access memory at address 0x676e696461657190
(gdb) l
1218
1219            pthread_mutex_lock (&svc->rpclock);
1220            {
1221                    list_for_each_entry (listener, &svc->listeners, list) {
1222                            if (trans != NULL) {
1223                                    if (listener->trans == trans) {
1224                                            found = 1;
1225                                            break;
1226                                    }
1227
(gdb) p svc->listeners
$1 = {next = 0x676e696461657220, prev = 0x6f73206d6f726620}
(gdb) p *svc
$2 = {rpclock = {__data = {__lock = 825242203, __count = 875572530, __owner = 540488237, __nusers = 859452721,
      __kind = 808663607, __spins = 875967022, __list = {__prev = 0x5b2057205d313037, __next = 0x632e74656b636f73}},
    __size = "[2012-04-27 15:37:30.264701] W [socket.c", __align = 3760555382451221083}, memfactor = 842346810,
  authschemes = {next = 0x705f74656b636f73, prev = 0x6174735f6f746f72}, options = 0x696863616d5f6574,
  allow_insecure = 542991726, register_portmap = 1818701104, ctx = 0x3a73667265747375,
  listeners = {next = 0x676e696461657220, prev = 0x6f73206d6f726620},
  programs = {next = 0x6961662074656b63, prev = 0x727245202e64656c},
  notify = {next = 0x6e6e6f432820726f, prev = 0x72206e6f69746365}, notify_count = 1952805733,
  mydata = 0x70202c2972656570, notifyfn = 0x3732312820726565, rxpool = 0x323a312e302e302e}
(gdb)

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
+------------------------------------------------------------------------------+
[2012-04-27 14:40:33.792647] W [socket.c:419:__socket_keepalive] 0-socket: failed to set keep idle on socket 8
[2012-04-27 14:40:33.792815] W [socket.c:1885:socket_server_event_handler] 0-socket.glusterfsd: Failed to set keep-alive: Operation not supported
[2012-04-27 14:40:33.884884] I [server-handshake.c:571:server_setvolume] 0-vol-server: accepted client from hyperspace-29240-2012/04/27-14:40:29:862111-vol-client-0-0 (version: 3git)
[2012-04-27 14:47:01.563460] I [server.c:685:server_rpc_notify] 0-vol-server: disconnecting connectionfrom hyperspace-29240-2012/04/27-14:40:29:862111-vol-client-0-0
[2012-04-27 14:47:01.563526] I [server-helpers.c:741:server_connection_put] 0-vol-server: Shutting down connection hyperspace-29240-2012/04/27-14:40:29:862111-vol-client-0-0
[2012-04-27 14:47:01.563615] I [server-helpers.c:629:server_connection_destroy] 0-vol-server: destroyed connection of hyperspace-29240-2012/04/27-14:40:29:862111-vol-client-0-0
[2012-04-27 14:47:07.584147] I [server-handshake.c:571:server_setvolume] 0-vol-server: accepted client from hyperspace-11685-2012/04/27-14:47:02:558264-vol-client-0-0 (version: 3git)
[2012-04-27 15:37:30.224134] W [glusterfsd.c:794:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f74550c104d] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x6d8c) [0x7f7455704d8c] (-->/usr/local/sbin/glusterfsd(glusterfs_sigwaiter+0xea) [0x407c46]))) 0-: received signum (15), shutting down
[2012-04-27 15:37:30.224502] E [rpcsvc.c:1161:rpcsvc_program_unregister_portmap] 0-rpc-service: Could not unregister with portmap
[2012-04-27 15:37:30.264661] W [socket.c:204:__socket_rwv] 0-glusterfs: readv failed (Connection reset by peer)
[2012-04-27 15:37:30.264701] W [socket.c:1521:__socket_proto_state_machine] 0-glusterfs: reading from socket failed. Error (Connection reset by peer), peer (127.0.0.1:24007)
[2012-04-27 15:37:30.264810] E [rpc-clnt.c:382:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0x124) [0x7f7455b48e7c] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x155) [0x7f7455b483ee] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0x1f) [0x7f7455b47e73]))) 0-glusterfs: forced unwinding frame type(Gluster Portmap) op(SIGNOUT(5)) called at 2012-04-27 15:37:30.224215 (xid=0x3x)
[2012-04-27 15:37:30.275653] A [mem-pool.c:153:__gf_calloc] (-->/usr/local/lib/libgfrpc.so.0(rpc_transport_notify+0x115) [0x7f7455b45025] (-->/usr/local/lib/libgfrpc.so.0(rpcsvc_notify+0x131) [0x7f7455b3f4a5] (-->/usr/local/lib/libgfrpc.so.0(rpcsvc_handle_disconnect+0x7b) [0x7f7455b3f25b]))) : no memory available for size (62489783496)

pending frames:

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2012-04-27 15:37:30
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
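The "pointers" in the dump above are in fact ASCII text: svc has been freed and its memory reused for a formatted log message (note __size = "[2012-04-27 15:37:30.264701] W [socket.c"). A quick standalone program (not GlusterFS code, written just to inspect the dump) makes this visible:

/* Standalone helper, not GlusterFS code: print a 64-bit value byte by
 * byte (little-endian) as ASCII, to check whether freed memory was
 * overwritten by text such as a formatted log line. */
#include <stdio.h>
#include <stdint.h>

static void decode_ptr(uint64_t v)
{
        int i;

        for (i = 0; i < 8; i++) {
                unsigned char c = (unsigned char)(v >> (8 * i));
                putchar((c >= 0x20 && c < 0x7f) ? c : '.');
        }
        putchar('\n');
}

int main(void)
{
        decode_ptr(0x676e696461657220ULL); /* listeners.next: " reading" */
        decode_ptr(0x6f73206d6f726620ULL); /* listeners.prev: " from so" */
        return 0;
}

Run on the listeners pointers it prints " reading" and " from so", fragments of the "reading from socket failed" message logged at 15:37:30.264701, the same timestamp embedded in __size. This points at a use-after-free of the rpcsvc_t.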
Noticed now that this happened when the process got SIGTERM (cleanup_and_exit() with signal 15). At this stage, if there are pending frames, this crash can occur. We need to bring in a check that waits for all the saved_frames_unwind() calls to finish before doing cleanup_and_exit(); see the sketch below. Considering the process was about to be stopped anyway, lowering the priority of the bug. Feel free to grab it from my name if someone wants to fix it :-)
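A minimal sketch of what such a check could look like, with hypothetical names (pending_frames, frame_saved(), frame_unwound(), and wait_for_pending_frames() are illustrations, not existing GlusterFS symbols): count in-flight frames and have the sigwaiter thread drain the counter before tearing anything down.

/* Hypothetical sketch, not the actual fix: gate shutdown on a
 * pending-frames counter so in-flight saved_frames_unwind() work
 * finishes before cleanup begins. */
#include <pthread.h>

static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  pending_cond = PTHREAD_COND_INITIALIZER;
static int             pending_frames;

void frame_saved(void)             /* call when a frame is queued */
{
        pthread_mutex_lock(&pending_lock);
        pending_frames++;
        pthread_mutex_unlock(&pending_lock);
}

void frame_unwound(void)           /* call when a frame is unwound */
{
        pthread_mutex_lock(&pending_lock);
        if (--pending_frames == 0)
                pthread_cond_broadcast(&pending_cond);
        pthread_mutex_unlock(&pending_lock);
}

void wait_for_pending_frames(void) /* call from the sigwaiter thread,
                                      before cleanup_and_exit() */
{
        pthread_mutex_lock(&pending_lock);
        while (pending_frames > 0)
                pthread_cond_wait(&pending_cond, &pending_lock);
        pthread_mutex_unlock(&pending_lock);
}

int main(void)                     /* trivial demonstration */
{
        frame_saved();
        frame_unwound();
        wait_for_pending_frames(); /* returns immediately: count is 0 */
        return 0;
}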
I got the crash again with the below backtrace.

Core was generated by `/usr/local/sbin/glusterfsd -s localhost --volfile-id vol.hyperspace.mnt-sda10-e'.
Program terminated with signal 6, Aborted.
#0  0x00007f3ff16b8d05 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64      ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
        in ../nptl/sysdeps/unix/sysv/linux/raise.c
(gdb) bt
#0  0x00007f3ff16b8d05 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f3ff16bcab6 in abort () at abort.c:92
#2  0x00007f3ff16b17c5 in __assert_fail (assertion=0x7f3ff1db9f05 "(-(e)) != 3 || !robust", file=<value optimized out>,
    line=312, function=<value optimized out>) at assert.c:81
#3  0x00007f3ff1db12aa in __pthread_mutex_lock_full (mutex=0xcca620) at pthread_mutex_lock.c:312
#4  0x00007f3ff21e923d in rpcsvc_handle_disconnect (svc=0xcca620, trans=0xcdf5c0)
    at ../../../../rpc/rpc-lib/src/rpcsvc.c:554
#5  0x00007f3ff21e94ba in rpcsvc_notify (trans=0xcdf5c0, mydata=0xcca620, event=RPC_TRANSPORT_DISCONNECT, data=0xcdf5c0)
    at ../../../../rpc/rpc-lib/src/rpcsvc.c:616
#6  0x00007f3ff21ef039 in rpc_transport_notify (this=0xcdf5c0, event=RPC_TRANSPORT_DISCONNECT, data=0xcdf5c0)
    at ../../../../rpc/rpc-lib/src/rpc-transport.c:498
#7  0x00007f3fee73327a in socket_event_poll_err (this=0xcdf5c0)
    at ../../../../../rpc/rpc-transport/socket/src/socket.c:694
#8  0x00007f3fee7379a5 in socket_event_handler (fd=8, idx=3, data=0xcdf5c0, poll_in=1, poll_out=0, poll_err=16)
    at ../../../../../rpc/rpc-transport/socket/src/socket.c:1808
#9  0x00007f3ff2449ac1 in event_dispatch_epoll_handler (event_pool=0xcc53a0, events=0xcde840, i=0)
    at ../../../libglusterfs/src/event.c:794
#10 0x00007f3ff2449cdb in event_dispatch_epoll (event_pool=0xcc53a0) at ../../../libglusterfs/src/event.c:856
#11 0x00007f3ff244a04d in event_dispatch (event_pool=0xcc53a0) at ../../../libglusterfs/src/event.c:956
#12 0x000000000040857f in main (argc=19, argv=0x7fff36e21028) at ../../../glusterfsd/src/glusterfsd.c:1652
(gdb) f 3
#3  0x00007f3ff1db12aa in __pthread_mutex_lock_full (mutex=0xcca620) at pthread_mutex_lock.c:312
312     pthread_mutex_lock.c: No such file or directory.
        in pthread_mutex_lock.c
(gdb) f 4
#4  0x00007f3ff21e923d in rpcsvc_handle_disconnect (svc=0xcca620, trans=0xcdf5c0)
    at ../../../../rpc/rpc-lib/src/rpcsvc.c:554
554             pthread_mutex_lock (&svc->rpclock);
(gdb) p svc
$1 = (rpcsvc_t *) 0xcca620
(gdb) p *svc
$2 = {rpclock = {__data = {__lock = -1322241445, __count = 892349746, __owner = 540225581, __nusers = 825898033,
      __kind = 859060786, __spins = 875902254, __list = {__prev = 0x5b2057205d383036, __next = 0x632e74656b636f73}},
    __size = "[20\261\062-05-03 14:12:43.554608] W [socket.c", __align = 3832612978636632667}, memfactor = 842346810,
  authschemes = {next = 0x705f74656b636f73, prev = 0x6174735f6f746f72}, options = 0x696863616d5f6574,
  allow_insecure = 542991726, register_portmap = 1818701104, ctx = 0x3a73667265747375,
  listeners = {next = 0x676e696461657220, prev = 0x6f73206d6f726620},
  programs = {next = 0x6961662074656b63, prev = 0x727245202e64656c},
  notify = {next = 0x6e6e6f432820726f, prev = 0x72206e6f69746365}, notify_count = 1952805733,
  mydata = 0x70202c2972656570, notifyfn = 0x3732312820726565, rxpool = 0x323a312e302e302e}
(gdb) info thr
  9 Thread 15323  0x00007f3ff175dcf7 in unlink () at ../sysdeps/unix/syscall-template.S:82
  8 Thread 15406  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  7 Thread 15405  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  6 Thread 15408  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:216
  5 Thread 15325  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  4 Thread 15407  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:216
  3 Thread 15324  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  2 Thread 15327  0x00007f3ff1db74bd in nanosleep () at ../sysdeps/unix/syscall-template.S:82
* 1 Thread 15322  0x00007f3ff16b8d05 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
(gdb) t 9
[Switching to thread 9 (Thread 15323)]#0  0x00007f3ff175dcf7 in unlink () at ../sysdeps/unix/syscall-template.S:82
82      ../sysdeps/unix/syscall-template.S: No such file or directory.
        in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0  0x00007f3ff175dcf7 in unlink () at ../sysdeps/unix/syscall-template.S:82
#1  0x0000000000407847 in glusterfs_pidfile_cleanup (ctx=0xcad010) at ../../../glusterfsd/src/glusterfsd.c:1296
#2  0x0000000000406225 in cleanup_and_exit (signum=15) at ../../../glusterfsd/src/glusterfsd.c:814
#3  0x0000000000407c46 in glusterfs_sigwaiter (arg=0x7fff36e20e30) at ../../../glusterfsd/src/glusterfsd.c:1384
#4  0x00007f3ff1daed8c in start_thread (arg=0x7f3ff0145700) at pthread_create.c:304
#5  0x00007f3ff176b04d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()
(gdb)

Note that *svc again contains formatted log text (__size starts with a timestamp), so the rpcsvc_t has been freed and its memory reused. The glibc assert fires because pthread_mutex_lock() is operating on that overwritten rpclock, this time from the disconnect path, while thread 9 is simultaneously running cleanup_and_exit().
http://review.gluster.org/4230 root-causes the issue as a race between freeing the RPC transport object and handling the socket's POLL_ERR event. It may take a couple of review iterations before the patch gets accepted, but marking the bug as POST since it is in the review queue.
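For context, this is the general shape of that kind of fix (a hypothetical sketch under assumed names such as transport_t, handle_events(), and event_handler_safe(), not the actual patch under review): the event loop pins the transport with a reference for the duration of the handler, so a concurrent disconnect can only drop its own reference and the object is freed after the last user is done.

/* Hypothetical sketch of the fix pattern, not the 4230 patch itself:
 * reference-count the transport and pin it across the event handler
 * so the disconnect path cannot free it mid-POLL_ERR. */
#include <stdlib.h>

typedef struct transport {
        int refcount;
        int fd;
} transport_t;

static void transport_destroy(transport_t *t)
{
        free(t);
}

static transport_t *transport_ref(transport_t *t)
{
        __sync_fetch_and_add(&t->refcount, 1);
        return t;
}

static void transport_unref(transport_t *t)
{
        if (__sync_sub_and_fetch(&t->refcount, 1) == 0)
                transport_destroy(t);      /* freed only after last user */
}

static int handle_events(transport_t *t, int in, int out, int err)
{
        (void)t; (void)in; (void)out;      /* stand-in for socket logic */
        return err ? -1 : 0;
}

static int event_handler_safe(transport_t *t, int in, int out, int err)
{
        int ret;

        transport_ref(t);                  /* pin across the handler */
        ret = handle_events(t, in, out, err);
        transport_unref(t);                /* may free t here, not earlier */
        return ret;
}

int main(void)
{
        transport_t *t = calloc(1, sizeof(*t));

        if (!t)
                return 1;
        transport_ref(t);                  /* owner's reference */
        event_handler_safe(t, 1, 0, 16);   /* poll_err=16, as in the crash */
        transport_unref(t);                /* last reference: freed now */
        return 0;
}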
The typical way to reproduce this is to kill all glusterfs processes and restart glusterd a couple of times; this generates the crash. With the new release (qa5), the crash is no longer reproducible.