Bug 1031672 - glusterfsd crashed with "[socket.c:1875:__socket_read_frag] 0-rpc: wrong MSG-TYPE (1414541105) received from " error
Summary: glusterfsd crashed with "[socket.c:1875:__socket_read_frag] 0-rpc: wrong MSG-TYPE (1414541105) received from " error
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: 2.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Bug Updates Notification Mailing List
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1286178
 
Reported: 2013-11-18 14:12 UTC by M S Vishwanath Bhat
Modified: 2016-06-01 01:57 UTC (History)
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1286178
Environment:
Last Closed: 2015-11-27 12:19:38 UTC
Embargoed:



Description M S Vishwanath Bhat 2013-11-18 14:12:16 UTC
Description of problem:
I was running geo-replication on a 24-node cluster (6*2 dist-rep master and 6*2 dist-rep slave), and at some point one of the glusterfsd (brick) processes stopped. From the logs it looks like it received SIGTERM, but I had not issued SIGTERM myself. The only ERROR message I could find in the brick logs was:

[2013-11-12 19:38:57.885686] E [socket.c:1875:__socket_read_frag] 0-rpc: wrong MSG-TYPE (1414541105) received from 10.11.15.101:52450
[2013-11-13 09:00:23.770089] W [glusterfsd.c:1097:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3f68ee894d] (-->/lib64/libpthread.so.0() [0x3f69607851] (-->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xcd) [0x4053cd]))) 0-: received signum (15), shutting down


But apart from this there are no error messages and no core files.
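Incidentally (my observation, not a confirmed root cause): the bogus MSG-TYPE value decodes to printable ASCII, which would be consistent with the brick's RPC socket having read plain-text bytes (for example, part of a stray HTTP request) where an RPC message header was expected. A quick Python check:

# Decode the bogus MSG-TYPE value logged by __socket_read_frag.
# A valid ONC RPC msg_type is 0 (CALL) or 1 (REPLY); this value is
# printable ASCII instead.
value = 1414541105
print(hex(value))                                # 0x54502f31
print(value.to_bytes(4, "big").decode("ascii"))  # TP/1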


Version-Release number of selected component (if applicable):
glusterfs-3.4.0.43rhs-1.el6rhs.x86_64

How reproducible:
Hit only once, and I have no idea how it happened, so I don't have consistently reproducible steps for this.

Steps to Reproduce:
1. 
2.
3.

Actual results:

[root@Morgan glusterfs]# gluster v status master
Status of volume: master
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick michal:/rhs/bricks/brick0                         49152   Y       24713
Brick tim:/rhs/bricks/brick1                            49152   Y       15331
Brick garret:/rhs/bricks/brick2                         49152   Y       13191
Brick harris:/rhs/bricks/brick3                         49152   Y       18629
Brick javier:/rhs/bricks/brick4                         49152   Y       14901
Brick cruz:/rhs/bricks/brick5                           49152   Y       16159
Brick barret:/rhs/bricks/brick6                         49152   Y       24373
Brick danny:/rhs/bricks/brick7                          49152   Y       3719
Brick normand:/rhs/bricks/brick8                        49152   Y       3667
Brick victor:/rhs/bricks/brick9                         49152   Y       19638
Brick morgan:/rhs/bricks/brick10                        N/A     N       N/A
Brick willard:/rhs/bricks/brick11                       49152   Y       14039
NFS Server on localhost                                 2049    Y       16369
Self-heal Daemon on localhost                           N/A     Y       16377
NFS Server on victor                                    2049    Y       20377
Self-heal Daemon on victor                              N/A     Y       20385
NFS Server on cruz                                      2049    Y       16890
Self-heal Daemon on cruz                                N/A     Y       16902
NFS Server on harris                                    2049    Y       19366
Self-heal Daemon on harris                              N/A     Y       19373
NFS Server on normand                                   2049    Y       4398
Self-heal Daemon on normand                             N/A     Y       4406
NFS Server on danny                                     2049    Y       4451
Self-heal Daemon on danny                               N/A     Y       4459
NFS Server on tim                                       2049    Y       16059
Self-heal Daemon on tim                                 N/A     Y       16066
NFS Server on javier                                    2049    Y       15645
Self-heal Daemon on javier                              N/A     Y       15653
NFS Server on michal                                    2049    Y       25651
Self-heal Daemon on michal                              N/A     Y       25659
NFS Server on garret                                    2049    Y       13921
Self-heal Daemon on garret                              N/A     Y       13929
NFS Server on willard                                   2049    Y       14772
Self-heal Daemon on willard                             N/A     Y       14780
NFS Server on barret                                    2049    Y       25107
Self-heal Daemon on barret                              N/A     Y       25115
 
There are no active volume tasks



Brick10 at Morgan is down.

From the glusterfsd logs before it went down:

[2013-11-12 17:05:38.697320] I [server-helpers.c:757:server_connection_put] 0-master-server: Shutting down connection Normand.blr.redhat.com-4500-2013/11/12-15:51:18:373601-master-client-10-0
[2013-11-12 17:05:38.697711] I [server-helpers.c:590:server_log_conn_destroy] 0-master-server: destroyed connection of Normand.blr.redhat.com-4500-2013/11/12-15:51:18:373601-master-client-10-0  
[2013-11-12 17:05:51.463430] I [server-handshake.c:569:server_setvolume] 0-master-server: accepted client from Normand.blr.redhat.com-11594-2013/11/12-17:05:51:118681-master-client-10-0 (version: 3.4.0.43rhs)
[2013-11-12 19:38:57.885686] E [socket.c:1875:__socket_read_frag] 0-rpc: wrong MSG-TYPE (1414541105) received from 10.11.15.101:52450
[2013-11-13 09:00:23.770089] W [glusterfsd.c:1097:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3f68ee894d] (-->/lib64/libpthread.so.0() [0x3f69607851] (-->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xcd) [0x4053cd]))) 0-: received signum (15), shutting down



glusterd logs from around the time it went down:

[2013-11-15 19:42:58.158932] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-15 19:42:58.158991] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-15 19:46:16.530372] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-15 19:46:16.530452] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-15 20:30:15.386778] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-15 20:30:15.386890] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-15 20:47:41.698191] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-15 20:47:41.698309] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-15 21:00:47.341366] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-15 21:00:47.341569] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-15 23:58:44.796189] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-15 23:58:44.796323] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 02:22:07.057975] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 02:22:07.058047] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 03:00:55.486415] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 03:00:55.486482] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 03:04:15.468155] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 03:04:15.468374] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 04:47:31.100274] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 04:47:31.100421] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 05:12:06.067980] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 05:12:06.068048] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 07:00:42.202433] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 07:00:42.202499] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 07:56:30.682856] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 07:56:30.682933] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 09:38:29.475531] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 09:38:29.475717] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 10:20:42.673781] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 10:20:42.673933] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 18:22:02.188426] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 18:22:02.188966] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 19:21:46.932373] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 19:21:46.932437] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 19:58:35.393362] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 19:58:35.393686] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 21:05:09.467166] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 21:05:09.467300] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2013-11-16 21:14:03.751273] W [rpcsvc.c:173:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330)
[2013-11-16 21:14:03.751451] E [rpcsvc.c:448:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully


I initially thought it might have been OOM-killed, but dmesg shows there was no OOM kill. And I had not issued SIGTERM myself.
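For whoever triages this later, here is a throwaway Python sketch to confirm that the repeated rpcsvc warnings all reference the same unregistered (program, version) pair; the log path is the usual glusterd default and may differ on this setup:

import re
from collections import Counter

# Tally "RPC program not available" warnings per (program, version)
# pair, to confirm the requests all target the same unregistered
# program (1298437, version 330, in the log above).
pat = re.compile(r"RPC program not available \(req (\d+) (\d+)\)")
counts = Counter()
# Assumed default glusterd log path; adjust for the actual setup.
with open("/var/log/glusterfs/etc-glusterfs-glusterd.vol.log") as f:
    for line in f:
        m = pat.search(line)
        if m:
            counts[m.groups()] += 1
for (prog, vers), n in counts.items():
    print(f"program {prog} version {vers}: {n} hits")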


Expected results:
glusterfsd should not crash

Additional info:


No core files were generated, and there is not much in the log files either. But I will have the same setup for a day or two (hopefully).

Comment 3 M S Vishwanath Bhat 2015-06-18 10:22:13 UTC
I haven't seen this crash since then, but I haven't really tested again with a 24-node cluster. It's a very old bug, so I can't say definitively.

Comment 4 Susant Kumar Palai 2015-11-27 12:19:38 UTC
Cloning this to 3.1. To be fixed in a future release.

