Description of problem:
I was running geo-replication on a 24-node cluster (6x2 distributed-replicate master and 6x2 distributed-replicate slave), and somehow one of the glusterd processes stopped. From the logs it looks like it received SIGTERM, but I did not issue SIGTERM myself. The only error messages I could find in the brick logs were:

[2013-11-12 19:38:57.885686] E [socket.c:1875:__socket_read_frag] 0-rpc: wrong MSG-TYPE (1414541105) received from 10.11.15.101:52450
[2013-11-13 09:00:23.770089] W [glusterfsd.c:1097:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3f68ee894d] (-->/lib64/libpthread.so.0() [0x3f69607851] (-->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xcd) [0x4053cd]))) 0-: received signum (15), shutting down

Apart from this there are no error messages and no core files.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.43rhs-1.el6rhs.x86_64

How reproducible:
Hit only once, and I have no idea how it happened, so I do not have consistent steps to reproduce it.

Steps to Reproduce:
1.
2.
3.
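One note on the backtrace above: it goes through glusterfs_sigwaiter, i.e. the daemon's signal-handling thread saw signum 15 (SIGTERM) and ran the normal cleanup path rather than crashing, which would explain why no core file was produced. A minimal shell sketch (my own illustration, NOT GlusterFS code) of the same pattern, where a process waits for SIGTERM, logs the signum, and exits cleanly:

```shell
#!/bin/sh
# Minimal sketch, NOT GlusterFS code: the glusterfs_sigwaiter pattern of
# catching SIGTERM, logging "received signum (15)", and exiting cleanly.
msg=$(
  sh -c '
    trap "echo \"received signum (15), shutting down\"; exit 0" TERM
    ( sleep 0.2; kill -TERM $$ ) &   # stand-in for the unknown external sender
    while :; do sleep 1; done        # idle loop standing in for worker threads
  '
)
echo "$msg"
```

An orderly SIGTERM exit like this means the interesting question is which process sent the signal, not what crashed.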
Actual results:

[root@Morgan glusterfs]# gluster v status master
Status of volume: master
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick michal:/rhs/bricks/brick0                         49152   Y       24713
Brick tim:/rhs/bricks/brick1                            49152   Y       15331
Brick garret:/rhs/bricks/brick2                         49152   Y       13191
Brick harris:/rhs/bricks/brick3                         49152   Y       18629
Brick javier:/rhs/bricks/brick4                         49152   Y       14901
Brick cruz:/rhs/bricks/brick5                           49152   Y       16159
Brick barret:/rhs/bricks/brick6                         49152   Y       24373
Brick danny:/rhs/bricks/brick7                          49152   Y       3719
Brick normand:/rhs/bricks/brick8                        49152   Y       3667
Brick victor:/rhs/bricks/brick9                         49152   Y       19638
Brick morgan:/rhs/bricks/brick10                        N/A     N       N/A
Brick willard:/rhs/bricks/brick11                       49152   Y       14039
NFS Server on localhost                                 2049    Y       16369
Self-heal Daemon on localhost                           N/A     Y       16377
NFS Server on victor                                    2049    Y       20377
Self-heal Daemon on victor                              N/A     Y       20385
NFS Server on cruz                                      2049    Y       16890
Self-heal Daemon on cruz                                N/A     Y       16902
NFS Server on harris                                    2049    Y       19366
Self-heal Daemon on harris                              N/A     Y       19373
NFS Server on normand                                   2049    Y       4398
Self-heal Daemon on normand                             N/A     Y       4406
NFS Server on danny                                     2049    Y       4451
Self-heal Daemon on danny                               N/A     Y       4459
NFS Server on tim                                       2049    Y       16059
Self-heal Daemon on tim                                 N/A     Y       16066
NFS Server on javier                                    2049    Y       15645
Self-heal Daemon on javier                              N/A     Y       15653
NFS Server on michal                                    2049    Y       25651
Self-heal Daemon on michal                              N/A     Y       25659
NFS Server on garret                                    2049    Y       13921
Self-heal Daemon on garret                              N/A     Y       13929
NFS Server on willard                                   2049    Y       14772
Self-heal Daemon on willard                             N/A     Y       14780
NFS Server on barret                                    2049    Y       25107
Self-heal Daemon on barret                              N/A     Y       25115

There are no active volume tasks

Brick10 at Morgan is down.
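For checking this on the other nodes, the down bricks can be picked out of the status output mechanically. A small sketch (the abbreviated here-document mimics the output above; on a live cluster you would pipe `gluster volume status master` into the awk instead):

```shell
#!/bin/sh
# Sketch: pick out bricks whose Online column is "N" from `gluster v status`
# output. On a live cluster, feed `gluster volume status master` into the
# awk instead of this abbreviated sample here-document.
down=$(awk '$1 == "Brick" && $(NF-1) == "N" { print $2 }' <<'EOF'
Brick michal:/rhs/bricks/brick0     49152   Y   24713
Brick morgan:/rhs/bricks/brick10    N/A     N   N/A
Brick willard:/rhs/bricks/brick11   49152   Y   14039
NFS Server on localhost             2049    Y   16369
EOF
)
echo "$down"    # -> morgan:/rhs/bricks/brick10
```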
From the glusterfsd logs before it went down:

[2013-11-12 17:05:38.697320] I [server-helpers.c:757:server_connection_put] 0-master-server: Shutting down connection Normand.blr.redhat.com-4500-2013/11/12-15:51:18:373601-master-client-10-0
[2013-11-12 17:05:38.697711] I [server-helpers.c:590:server_log_conn_destroy] 0-master-server: destroyed connection of Normand.blr.redhat.com-4500-2013/11/12-15:51:18:373601-master-client-10-0
[2013-11-12 17:05:51.463430] I [server-handshake.c:569:server_setvolume] 0-master-server: accepted client from Normand.blr.redhat.com-11594-2013/11/12-17:05:51:118681-master-client-10-0 (version: 3.4.0.43rhs)
[2013-11-12 19:38:57.885686] E [socket.c:1875:__socket_read_frag] 0-rpc: wrong MSG-TYPE (1414541105) received from 10.11.15.101:52450
[2013-11-13 09:00:23.770089] W [glusterfsd.c:1097:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3f68ee894d] (-->/lib64/libpthread.so.0() [0x3f69607851] (-->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xcd) [0x4053cd]))) 0-: received signum (15), shutting down

glusterd logs around the time it went down:

[2013-11-15 19:42:58.158932] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-15 19:42:58.158991] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-15 19:46:16.530372] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-15 19:46:16.530452] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-15 20:30:15.386778] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-15 20:30:15.386890] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-15 20:47:41.698191] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-15 20:47:41.698309] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-15 21:00:47.341366] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-15 21:00:47.341569] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-15 23:58:44.796189] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-15 23:58:44.796323] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 02:22:07.057975] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 02:22:07.058047] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 03:00:55.486415] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 03:00:55.486482] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 03:04:15.468155] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 03:04:15.468374] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 04:47:31.100274] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 04:47:31.100421] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 05:12:06.067980] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 05:12:06.068048] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 07:00:42.202433] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 07:00:42.202499] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 07:56:30.682856] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 07:56:30.682933] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 09:38:29.475531] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 09:38:29.475717] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 10:20:42.673781] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 10:20:42.673933] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 18:22:02.188426] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 18:22:02.188966] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 19:21:46.932373] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 19:21:46.932437] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 19:58:35.393362] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 19:58:35.393686] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 21:05:09.467166] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 21:05:09.467300] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 21:14:03.751273] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 21:14:03.751451] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully

I initially thought it might have been OOM-killed, but dmesg shows it was not an OOM kill. And I have not issued SIGTERM myself.

Expected results:
glusterfsd should not crash.

Additional info:
There are no core file(s) generated and not very much in the log files either. But I will have the same setup for a day or two (hopefully).
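One hedged observation on the brick-side error quoted above: the bogus MSG-TYPE value 1414541105 is 0x54502f31, which decodes to the printable ASCII "TP/1". That looks like a slice of "HTTP/1.x", i.e. consistent with some non-RPC client having connected to the brick port, rather than with memory corruption. A quick way to check the decoding:

```shell
#!/bin/sh
# Decode the bogus MSG-TYPE value from the brick log. If the four bytes are
# printable ASCII, the "RPC message" was almost certainly stray non-RPC
# traffic on the brick port ("TP/1" looks like a slice of "HTTP/1.x").
hex=$(printf '%08x' 1414541105)
ascii=$(awk 'BEGIN { n = 1414541105
    for (i = 3; i >= 0; i--) printf "%c", int(n / 256^i) % 256 }')
echo "$hex $ascii"    # -> 54502f31 TP/1
```

If that reading is right, the wrong MSG-TYPE is unrelated noise and the real question is which process delivered the SIGTERM; on the next run it may be worth enabling kill(2) syscall auditing via auditd beforehand so the sender gets recorded.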
I haven't seen this crash since then, but I haven't really tested again with a 24-node cluster. It's a very old bug, so I can't say definitively.
Cloning this to 3.1. To be fixed in a future release.