Bug 1030410 - glusterd: core dumped "E [glusterd-utils.c:380:glusterd_submit_reply] 0-: Reply submission failed"
Status: CLOSED EOL
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterd
Version: 2.1
Hardware: x86_64 Linux
Priority: unspecified
Severity: urgent
Assigned To: Nagaprasad Sathyanarayana
QA Contact: storage-qa-internal@redhat.com
Depends On:
Blocks:
 
Reported: 2013-11-14 06:38 EST by Saurabh
Modified: 2016-02-17 19:21 EST
CC List: 5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-12-03 12:19:10 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None
Description Saurabh 2013-11-14 06:38:04 EST
Description of problem:
After enabling quota and setting the quota-deem-statfs option to on for a volume,
I tried to peer probe a new node.
The peer probe command hung for some time, so I killed it with ctrl+c.
I then ran "service iptables stop" on all nodes
and checked peer status; the new node was in an undesirable state, i.e. "Probe Sent to Peer (Connected)".
I therefore tried "gluster peer detach" to detach the new node,
which did not actually detach it.

A subsequent detach and probe led to a crash in glusterd.

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.44rhs-1

How reproducible:
Seen on this build.

Steps to Reproduce:
The steps are the same as in the description section; the execution logs are included below instead.


Volume info with quota enabled and quota-deem-statfs set to on
=====================================================

[root@nfs1 ~]# gluster volume info dist-rep12
 
Volume Name: dist-rep12
Type: Distributed-Replicate
Volume ID: 9d072702-1230-421c-ad9c-41c8ed1a1c97
Status: Started
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.37.58:/rhs/bricks/d1r1-n12
Brick2: 10.70.37.196:/rhs/bricks/d1r2-n12
Brick3: 10.70.37.138:/rhs/bricks/d2r1-n12
Brick4: 10.70.37.186:/rhs/bricks/d2r2-n12
Brick5: 10.70.37.58:/rhs/bricks/d3r1-n12
Brick6: 10.70.37.196:/rhs/bricks/d3r2-n12
Brick7: 10.70.37.138:/rhs/bricks/d4r1-n12
Brick8: 10.70.37.186:/rhs/bricks/d4r2-n12
Brick9: 10.70.37.58:/rhs/bricks/d5r1-n12
Brick10: 10.70.37.196:/rhs/bricks/d5r2-n12
Brick11: 10.70.37.138:/rhs/bricks/d6r1-n12
Brick12: 10.70.37.186:/rhs/bricks/d6r2-n12
Options Reconfigured:
features.quota: on
features.quota-deem-statfs: on
[root@nfs1 ~]# 

Peer probe followed by ctrl+c; it had to be killed because the probe hung for some time
===============================================================================
[root@nfs1 ~]# gluster peer probe rhsauto004.lab.eng.blr.redhat.com
[root@nfs1 ~]# ^C
[root@nfs1 ~]# 

Turning iptables off
========================
[root@nfs1 ~]# service iptables status
Table: filter
Chain INPUT (policy ACCEPT)
num  target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
num  target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
num  target     prot opt source               destination         

[root@nfs1 ~]# service iptables stop
iptables: Flushing firewall rules:                         [  OK  ]
iptables: Setting chains to policy ACCEPT: filter          [  OK  ]
iptables: Unloading modules:                               [  OK  ]
[root@nfs1 ~]# 
[root@nfs1 ~]# 

Peer probe again; the new node (rhsauto004.lab.eng.blr.redhat.com) is found in an undesirable state
=============================================================================
[root@nfs1 ~]# gluster peer probe rhsauto004.lab.eng.blr.redhat.com
peer probe: success. Host rhsauto004.lab.eng.blr.redhat.com port 24007 already in peer list
[root@nfs1 ~]# gluster peer status
Number of Peers: 4

Hostname: 10.70.37.186
Uuid: 32160590-2cf2-4a14-aaeb-7c8a333bf94e
State: Peer in Cluster (Connected)

Hostname: 10.70.37.138
Uuid: 4ae9cabb-ed4d-40b5-97e3-0ad895f0046d
State: Peer in Cluster (Connected)

Hostname: 10.70.37.196
Uuid: 0d320f13-f59c-469e-be9c-b13c46ce2bb7
State: Peer in Cluster (Connected)

Hostname: rhsauto004.lab.eng.blr.redhat.com
Uuid: 9240bd40-314e-49a2-a6e8-841349cd6036
State: Probe Sent to Peer (Connected)

Tried to detach the new node; the detach does not work, but peer status now shows the peer as "Peer in Cluster (Connected)"
===========================================================================
[root@nfs1 ~]# gluster peer detach rhsauto004.lab.eng.blr.redhat.com
[root@nfs1 ~]# gluster peer status
Number of Peers: 4

Hostname: 10.70.37.186
Uuid: 32160590-2cf2-4a14-aaeb-7c8a333bf94e
State: Peer in Cluster (Connected)

Hostname: 10.70.37.138
Uuid: 4ae9cabb-ed4d-40b5-97e3-0ad895f0046d
State: Peer in Cluster (Connected)

Hostname: 10.70.37.196
Uuid: 0d320f13-f59c-469e-be9c-b13c46ce2bb7
State: Peer in Cluster (Connected)

Hostname: rhsauto004.lab.eng.blr.redhat.com
Uuid: 9240bd40-314e-49a2-a6e8-841349cd6036
State: Peer in Cluster (Connected)
[root@nfs1 ~]# 

A subsequent attempt to detach the node
===================================
[root@nfs1 ~]# gluster peer detach rhsauto004.lab.eng.blr.redhat.com
peer detach: success
[root@nfs1 ~]# 
[root@nfs1 ~]# gluster peer status
Connection failed. Please check if gluster daemon is operational.
[root@nfs1 ~]# 
[root@nfs1 ~]# gluster peer status
Connection failed. Please check if gluster daemon is operational.
[root@nfs1 ~]# 


[root@nfs1 ~]# service glusterd status
glusterd dead but pid file exists
[root@nfs1 ~]# 

Core dump file found in "/"
============================
[root@nfs1 ~]# ls -l /core.11650 
-rw------- 1 root root 83873792 Nov 14 16:46 /core.11650




Actual results:


Crash backtrace from the glusterd log file:
================================
[2013-11-14 11:16:13.147571] E [rpcsvc.c:1111:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x1x, Program: GlusterD svc cli, ProgVers: 2, Proc: 2) to rpc-transport (socket.management)
[2013-11-14 11:16:13.147597] E [glusterd-utils.c:380:glusterd_submit_reply] 0-: Reply submission failed
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)

patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash: 2013-11-14 11:16:13configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.44rhs
/lib64/libc.so.6[0x3533432960]
/lib64/libc.so.6(gsignal+0x35)[0x35334328e5]
/lib64/libc.so.6(abort+0x175)[0x35334340c5]
/lib64/libc.so.6[0x35334707f7]
/lib64/libc.so.6[0x3533476126]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/mgmt/glusterd.so(glusterd_peer_destroy+0x3f)[0x7f71883ad81f]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/mgmt/glusterd.so(glusterd_friend_cleanup+0xb8)[0x7f71883b60c8]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/mgmt/glusterd.so(+0x3648f)[0x7f718839a48f]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/mgmt/glusterd.so(glusterd_friend_sm+0x19e)[0x7f718839ad8e]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/mgmt/glusterd.so(__glusterd_friend_remove_cbk+0x310)[0x7f71883c32e0]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/mgmt/glusterd.so(glusterd_big_locked_cbk+0x60)[0x7f71883c0320]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)[0x7f718bbf8b35]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x147)[0x7f718bbfa0d7]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7f718bbf5918]
/usr/lib64/glusterfs/3.4.0.44rhs/rpc-transport/socket.so(+0x8d36)[0x7f7187e29d36]
/usr/lib64/glusterfs/3.4.0.44rhs/rpc-transport/socket.so(+0xa64d)[0x7f7187e2b64d]
/usr/lib64/libglusterfs.so.0(+0x60697)[0x7f718be60697]
/usr/sbin/glusterd(main+0x6c7)[0x406aa7]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x353341ecdd]
/usr/sbin/glusterd[0x4046e9]
---------


(gdb) bt
#0  0x00000035334328e5 in raise () from /lib64/libc.so.6
#1  0x00000035334340c5 in abort () from /lib64/libc.so.6
#2  0x00000035334707f7 in __libc_message () from /lib64/libc.so.6
#3  0x0000003533476126 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007f71883ad81f in glusterd_peer_destroy (peerinfo=0x7f7170017020) at glusterd-utils.c:5805
#5  0x00007f71883b60c8 in glusterd_friend_cleanup (peerinfo=0x7f7170017020) at glusterd-utils.c:959
#6  0x00007f718839a48f in glusterd_ac_friend_remove (event=0x233db50, ctx=<value optimized out>) at glusterd-sm.c:614
#7  0x00007f718839ad8e in glusterd_friend_sm () at glusterd-sm.c:1026
#8  0x00007f71883c32e0 in __glusterd_friend_remove_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, myframe=0x7f718aa55058)
    at glusterd-rpc-ops.c:505
#9  0x00007f71883c0320 in glusterd_big_locked_cbk (req=0x7f717002f4ac, iov=0x7f717002f4ec, count=1, myframe=0x7f718aa55058, 
    fn=0x7f71883c2fd0 <__glusterd_friend_remove_cbk>) at glusterd-rpc-ops.c:205
#10 0x00007f718bbf8b35 in rpc_clnt_handle_reply (clnt=0x7f717002d630, pollin=0x23457a0) at rpc-clnt.c:771
#11 0x00007f718bbfa0d7 in rpc_clnt_notify (trans=<value optimized out>, mydata=0x7f717002d660, event=<value optimized out>, data=<value optimized out>)
    at rpc-clnt.c:903
#12 0x00007f718bbf5918 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:499
#13 0x00007f7187e29d36 in socket_event_poll_in (this=0x7f717002e5e0) at socket.c:2119
#14 0x00007f7187e2b64d in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x7f717002e5e0, poll_in=1, poll_out=0, poll_err=0)
    at socket.c:2229
#15 0x00007f718be60697 in event_dispatch_epoll_handler (event_pool=0x20c5ee0) at event-epoll.c:384
#16 event_dispatch_epoll (event_pool=0x20c5ee0) at event-epoll.c:445
#17 0x0000000000406aa7 in main (argc=2, argv=0x7fff2d392568) at glusterfsd.c:2048

Expected results:
No crash expected.

If a core is to be dumped, it should go to /var/log/core; this time it was found in "/".

Additional info:
Comment 3 Kaushal 2013-11-15 05:03:53 EST
I had a look at the core and the logs. From these, I believe the crash happened due to a double free, caused by two detach commands racing each other.

The peer being probed, rhsauto004, was not responding quickly enough. The reason for this is not known to me as of now. Interestingly, the gluster logs for the time period during which the crash happened are also missing on rhsauto004, even though it appears glusterd was running on it all that time.

Due to the slow response by rhsauto004, the initial probe command ended with the cli timing out. Another probe command was issued, which returned with a 'peer already in peer list' error. After this, a detach was attempted. Even this detach command was delayed because of the slow response by rhsauto004. (By this time the original probe command had finished and the peer had become part of the cluster.) A second detach command was then issued, for which the cli returned successfully (need to check why there was no delay this time). So at this point there were two detach commands in progress on the same peer. The two raced each other, and one of them (most probably the second command) cleaned up the peer. When the other command attempted cleanup, it tried to free already-freed data, which led to the crash.


I am still going through the logs and attempting to confirm that this is what happened. Since the crash occurred because of some out-of-the-ordinary conditions, I'd say this is not a blocker issue.
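
To make the failure mode concrete, the following is a minimal C sketch of the double-free pattern described above; the type and function names are hypothetical stand-ins, not the actual glusterd source. Two friend-remove events for the same peer each run the cleanup path, and the second free trips glibc's heap consistency checks, which abort the process with signal 6 (the malloc_printerr/abort frames in the backtrace).

/* Minimal illustration only -- peer_t, peer_destroy and handle_friend_remove
 * are hypothetical stand-ins, not the real glusterd data structures. */
#include <stdlib.h>

typedef struct peer {
    char *hostname;
    /* ... other peer state elided ... */
} peer_t;

static void
peer_destroy (peer_t *peerinfo)
{
    free (peerinfo->hostname);   /* on the second call this memory is already freed */
    free (peerinfo);             /* glibc's heap checks detect the invalid free and abort */
}

/* Both in-flight detach operations reach the cleanup with the SAME pointer: */
void
handle_friend_remove (peer_t *peerinfo)
{
    peer_destroy (peerinfo);     /* first caller succeeds; second caller crashes with SIGABRT */
}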
Comment 4 Harshavardhana 2014-03-27 23:40:23 EDT
[2014-03-28 03:30:26.763286] E [glusterd-utils.c:380:glusterd_submit_reply] 0-: Reply submission failed
pending frames:
frame : type(0) op(0)

patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash: 2014-03-28 03:30:26configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.44rhs
/lib64/libc.so.6(+0x32960)[0x7fd7a7f6e960]
/lib64/libc.so.6(gsignal+0x35)[0x7fd7a7f6e8e5]
/lib64/libc.so.6(abort+0x175)[0x7fd7a7f700c5]
/lib64/libc.so.6(+0x707f7)[0x7fd7a7fac7f7]
/lib64/libc.so.6(+0x76126)[0x7fd7a7fb2126]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/mgmt/glusterd.so(glusterd_peer_destroy+0x3f)[0x7fd7a46be81f]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/mgmt/glusterd.so(glusterd_friend_cleanup+0xb8)[0x7fd7a46c70c8]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/mgmt/glusterd.so(+0x3648f)[0x7fd7a46ab48f]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/mgmt/glusterd.so(glusterd_friend_sm+0x19e)[0x7fd7a46abd8e]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/mgmt/glusterd.so(__glusterd_handle_incoming_unfriend_req+0x29f)[0x7fd7a46aa11f]
/usr/lib64/glusterfs/3.4.0.44rhs/xlator/mgmt/glusterd.so(glusterd_big_locked_handler+0x3f)[0x7fd7a4699fff]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x245)[0x7fd7a8aa4535]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x103)[0x7fd7a8aa4683]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7fd7a8aa5918]
/usr/lib64/glusterfs/3.4.0.44rhs/rpc-transport/socket.so(+0x8d36)[0x7fd7a30c5d36]
/usr/lib64/glusterfs/3.4.0.44rhs/rpc-transport/socket.so(+0xa64d)[0x7fd7a30c764d]
/usr/lib64/libglusterfs.so.0(+0x60697)[0x7fd7a8d10697]
/usr/sbin/glusterd(main+0x6c7)[0x406aa7]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7fd7a7f5acdd]
/usr/sbin/glusterd[0x4046e9]

Was able to reproduce this issue while trying to reproduce another issue.

(gdb) bt
#0  0x00007fd7a7f6e8e5 in raise () from /lib64/libc.so.6
#1  0x00007fd7a7f700c5 in abort () from /lib64/libc.so.6
#2  0x00007fd7a7fac7f7 in __libc_message () from /lib64/libc.so.6
#3  0x00007fd7a7fb2126 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007fd7a46be81f in glusterd_peer_destroy (peerinfo=0x7fd79c000d00) at glusterd-utils.c:5805
#5  0x00007fd7a46c70c8 in glusterd_friend_cleanup (peerinfo=0x7fd79c000d00) at glusterd-utils.c:959
#6  0x00007fd7a46ab48f in glusterd_ac_friend_remove (event=0x1fc57c0, ctx=<value optimized out>) at glusterd-sm.c:614
#7  0x00007fd7a46abd8e in glusterd_friend_sm () at glusterd-sm.c:1026
#8  0x00007fd7a46aa11f in __glusterd_handle_incoming_unfriend_req (req=0x7fd7a32d53fc) at glusterd-handler.c:2089
#9  0x00007fd7a4699fff in glusterd_big_locked_handler (req=0x7fd7a32d53fc, actor_fn=0x7fd7a46a9e80 <__glusterd_handle_incoming_unfriend_req>)
    at glusterd-handler.c:77
#10 0x00007fd7a8aa4535 in rpcsvc_handle_rpc_call (svc=<value optimized out>, trans=<value optimized out>, msg=0x1fc2570) at rpcsvc.c:549
#11 0x00007fd7a8aa4683 in rpcsvc_notify (trans=0x1fc6aa0, mydata=<value optimized out>, event=<value optimized out>, data=0x1fc2570) at rpcsvc.c:643
#12 0x00007fd7a8aa5918 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>)
    at rpc-transport.c:499
#13 0x00007fd7a30c5d36 in socket_event_poll_in (this=0x1fc6aa0) at socket.c:2119
#14 0x00007fd7a30c764d in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x1fc6aa0, poll_in=1, poll_out=0, 
    poll_err=0) at socket.c:2229
#15 0x00007fd7a8d10697 in event_dispatch_epoll_handler (event_pool=0x1fac710) at event-epoll.c:384
#16 event_dispatch_epoll (event_pool=0x1fac710) at event-epoll.c:445
#17 0x0000000000406aa7 in main (argc=2, argv=0x7fff08016aa8) at glusterfsd.c:2048
(gdb) 
(gdb) p *peerinfo
$1 = {uuid = "\300\310\000\234\327\177\000\000x\000\000\234\327\177\000",
  uuid_str = '\000' <repeats 16 times>, "88-a100-9e6ebd4e4ac8", '\000' <repeats 13 times>, state = {state = GD_FRIEND_STATE_DEFAULT,
    transition_time = {tv_sec = 0, tv_usec = 0}}, hostname = 0x7fd79c000ad0 "P\f", port = 24007, uuid_list = {next = 0x7fd79c000d70,
    prev = 0x7fd79c000d70}, op_peers_list = {next = 0x7fd79c000d80, prev = 0x7fd79c000d80}, rpc = 0x0, mgmt = 0x7fd7a493e680, peer = 0x7fd7a493e700,
  connected = 1, shandle = 0x0, sm_log = {transitions = 0x7fd79c000e00, current = 34, size = 50, count = 35,
    state_name_get = 0x7fd7a46aab20 <glusterd_friend_sm_state_name_get>, event_name_get = 0x7fd7a46aab40 <glusterd_friend_sm_event_name_get>},
  quorum_action = _gf_true, quorum_contrib = QUORUM_DOWN}
(gdb) p/x peerinfo->hostname
$7 = 0x7fd79c000ad0
(gdb) p (char *)0x7fd79c000ad0
$10 = 0x7fd79c000ad0 "P\f"
(gdb)
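
The peerinfo contents printed above already look reused: uuid holds pointer-like bytes and hostname points at a two-byte string ("P\f"), which is consistent with the allocation having been freed and recycled before this second cleanup ran. As a rough illustration only (hypothetical names, not a claim about the actual fix), a cleanup made idempotent would turn the second friend-remove event into a no-op instead of a double free:

/* Illustrative guard only (hypothetical names) -- not the shipped glusterd fix. */
#include <stdbool.h>
#include <stdlib.h>

typedef struct peer {
    bool  destroyed;             /* set once cleanup has run */
    char *hostname;
    /* ... other peer state elided ... */
} peer_t;

static void
peer_destroy_once (peer_t *peerinfo)
{
    if (!peerinfo || peerinfo->destroyed)
        return;                  /* a second friend-remove event becomes a harmless no-op */
    peerinfo->destroyed = true;
    free (peerinfo->hostname);
    peerinfo->hostname = NULL;   /* avoid dangling reuse of the freed string */
    /* the struct itself would be freed by its owner only after it has been
     * unlinked from every list that still references it */
}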
Comment 6 Vivek Agarwal 2015-12-03 12:19:10 EST
Thank you for submitting this issue for consideration in Red Hat Gluster Storage. The release you requested us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/rhs/

If you can reproduce this bug against a currently maintained version of Red Hat Gluster Storage, please feel free to file a new report against the current release.
