Bug 1461537 - [Stress] : Brick logs spammed with Reply submission failure messages. [NEEDINFO]
[Stress] : Brick logs spammed with Reply submission failure messages.
Status: NEW
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rpc
Version: 3.3
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
: ---
: ---
Assigned To: Milind Changire
QA Contact: Rahul Hinduja
rpc-3.4.0?
: ZStream
Depends On:
Blocks:
Reported: 2017-06-14 13:41 EDT by Ambarish
Modified: 2018-05-11 10:33 EDT (History)
11 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rkavunga: needinfo? (asoman)
rgowdapp: needinfo? (mchangir)


Attachments:

  None
Description Ambarish 2017-06-14 13:41:29 EDT
Description of problem:
-----------------------


2 Node cluster.

3 clients mounted a 2x2 volume via NFSv4 and ran Bonnie++, each in a separate working directory.

I see a steady stream of reply submission failures:

<snip>
bricks/bricks-testvol_brick2.log:[2017-06-14 16:20:07.545679] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x1949b) [0x7fbde758e49b] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1b0f9) [0x7fbde712f0f9] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9276) [0x7fbde711d276] ) 0-: Reply submission failed
bricks/bricks-testvol_brick2.log:[2017-06-14 16:20:07.545785] E [rpcsvc.c:1333:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xa81d8, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.testvol-server)
bricks/bricks-testvol_brick2.log:[2017-06-14 16:20:07.545817] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x1949b) [0x7fbde758e49b] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1b0f9) [0x7fbde712f0f9] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9276) [0x7fbde711d276] ) 0-: Reply submission failed
bricks/bricks-testvol_brick2.log:[2017-06-14 16:20:07.545920] E [rpcsvc.c:1333:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xa81e7, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.testvol-server)
bricks/bricks-testvol_brick2.log:[2017-06-14 16:20:07.546062] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x1949b) [0x7fbde758e49b] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1b0
</snip>
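
For context, here is a minimal, self-contained C sketch of the path these two messages come from (the types and helpers below are stand-ins, not the actual GlusterFS source): server_submit_reply() logs "Reply submission failed" whenever rpcsvc_submit_generic() cannot hand the reply to the rpc-transport, which is typically the case when the peer end of the connection is already gone.

/* Simplified sketch only; the structs and helpers are stand-ins that
 * mirror the shape of server.c:server_submit_reply() and
 * rpcsvc.c:rpcsvc_submit_generic(), not the real code. */
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the rpc transport; "connected" models whether the peer
 * (a client mount or an internal client) is still reachable. */
struct transport {
    const char *name;
    bool connected;
};

/* Stand-in for rpcsvc_submit_generic(): submission fails when the
 * transport underneath is already disconnected. */
static int rpcsvc_submit_generic_sketch(struct transport *trans,
                                        unsigned int xid, int proc)
{
    if (!trans->connected) {
        fprintf(stderr,
                "E [rpcsvc] failed to submit message (XID: 0x%x, Proc: %d) "
                "to rpc-transport (%s)\n", xid, proc, trans->name);
        return -1;
    }
    /* A real implementation would serialize and write the reply here. */
    return 0;
}

/* Stand-in for server_submit_reply(): every failed submission is logged
 * as "Reply submission failed", one line per outstanding reply. */
static int server_submit_reply_sketch(struct transport *trans,
                                      unsigned int xid, int proc)
{
    if (rpcsvc_submit_generic_sketch(trans, xid, proc) < 0) {
        fprintf(stderr, "E [server] Reply submission failed\n");
        return -1;
    }
    return 0;
}

int main(void)
{
    /* A peer that disconnected while FOPs were still in flight: each
     * queued reply now produces the pair of E lines seen above. */
    struct transport trans = { .name = "tcp.testvol-server",
                               .connected = false };

    for (unsigned int xid = 0xa81d8; xid <= 0xa81da; xid++)
        server_submit_reply_sketch(&trans, xid, 34);

    return 0;
}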

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

3.8.4-25


How reproducible:
-----------------

1/1


Actual results:
---------------

Logs spammed with errors.

Expected results:
-----------------

No log flooding.
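
For illustration only, a generic rate-limited-logging sketch of what "no flooding" could look like; none of the names below are GlusterFS APIs. The idea is to log the first occurrence of a repeated error and then only every Nth one, together with the running count.

/* Hypothetical throttled-logging helper; not part of GlusterFS. */
#include <stdio.h>

struct throttled_err {
    unsigned long count;     /* occurrences seen so far           */
    unsigned long interval;  /* emit one log line per this many   */
};

static void log_throttled(struct throttled_err *t, const char *msg)
{
    if (t->count++ % t->interval == 0)
        fprintf(stderr, "E %s (occurrence %lu)\n", msg, t->count);
}

int main(void)
{
    struct throttled_err reply_err = { .count = 0, .interval = 1000 };

    /* 20,000 consecutive failures produce ~20 log lines instead of
     * 20,000. */
    for (int i = 0; i < 20000; i++)
        log_throttled(&reply_err, "Reply submission failed");

    return 0;
}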

Additional info:
----------------

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 3b04b36a-1837-48e8-b437-fbc091b2f992
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas007.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas007.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
[root@gqas009 bricks]#
Comment 5 Ambarish 2017-06-18 05:23:53 EDT
This is a bit more serious on my Geo Rep stress setup, on one of my master nodes.

The message has been logged more than 20,000 times in 2 days:

[root@gqas005 glusterfs]# grep -Ri "reply submission failed"|wc -l
20377
[root@gqas005 glusterfs]#

<snip>

bricks/bricks3-A1.log-20170618:[2017-06-16 20:27:17.346203] E [server.c:203:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x30e25) [0x7f43ed36de25] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x30dc8) [0x7f43ed36ddc8] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x93a6) [0x7f43ed3463a6] ) 0-: Reply submission failed
bricks/bricks3-A1.log-20170618:[2017-06-16 20:27:17.346232] E [server.c:210:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x30e25) [0x7f43ed36de25] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x30dc8) [0x7f43ed36ddc8] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x93fe) [0x7f43ed3463fe] ) 0-: Reply submission failed
bricks/bricks3-A1.log-20170618:[2017-06-16 20:27:17.346334] E [server.c:203:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x1bbeb) [0x7f43ed7b9beb] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1b609) [0x7f43ed358609] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x93a6) [0x7f43ed3463a6] ) 0-: Reply submission failed
bricks/bricks3-A1.log-20170618:[2017-06-16 20:27:17.346372] E [server.c:203:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x1bbeb) [0x7f43ed7b9beb] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1b609) [0x7f43ed358609] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x93a6) [0x7f43ed3463a6] ) 0-: Reply submission failed
bricks/bricks3-A1.log-20170618:[2017-06-16 20:27:17.346380] E [server.c:203:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x30e25) [0x7f43ed36de25] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x30dc8) [0x7f43ed36ddc8] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x93a6) [0x7f43ed3463a6] ) 0-: Reply submission failed

</snip>

sosreports will be uploaded soon.
Comment 8 Mohammed Rafi KC 2017-06-21 04:27:56 EDT
Just went through one of the brick logs (brick1-A1) on node 15, and it seems the disconnects happened from one of the servers, so most likely those disconnects are from internal clients.

Did you run any heal info commands during this time?
