Bug 1461537 - [Stress] : Brick logs spammed with Reply submission failure messages.
Summary: [Stress] : Brick logs spammed with Reply submission failure messages.
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rpc
Version: rhgs-3.3
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Raghavendra G
QA Contact: Rahul Hinduja
URL:
Whiteboard: rpc-3.4.0?
Depends On:
Blocks:
 
Reported: 2017-06-14 17:41 UTC by Ambarish
Modified: 2023-09-14 03:59 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-12-12 22:22:21 UTC
Embargoed:



Description Ambarish 2017-06-14 17:41:29 UTC
Description of problem:
-----------------------


2 node cluster.

3 clients mounted a 2x2 volume via NFSv4 (Ganesha) and were running Bonnie++, each in a separate working directory.

I see a steady stream of reply submission failures:

<snip>
bricks/bricks-testvol_brick2.log:[2017-06-14 16:20:07.545679] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x1949b) [0x7fbde758e49b] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1b0f9) [0x7fbde712f0f9] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9276) [0x7fbde711d276] ) 0-: Reply submission failed
bricks/bricks-testvol_brick2.log:[2017-06-14 16:20:07.545785] E [rpcsvc.c:1333:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xa81d8, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.testvol-server)
bricks/bricks-testvol_brick2.log:[2017-06-14 16:20:07.545817] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x1949b) [0x7fbde758e49b] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1b0f9) [0x7fbde712f0f9] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9276) [0x7fbde711d276] ) 0-: Reply submission failed
bricks/bricks-testvol_brick2.log:[2017-06-14 16:20:07.545920] E [rpcsvc.c:1333:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xa81e7, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.testvol-server)
bricks/bricks-testvol_brick2.log:[2017-06-14 16:20:07.546062] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x1949b) [0x7fbde758e49b] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1b0
</snip>
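
To get a feel for the rate, the failures can be bucketed per minute straight from a brick log. A rough sketch (log path relative to /var/log/glusterfs, as in the snippet above):

grep 'Reply submission failed' bricks/bricks-testvol_brick2.log | awk '{print $1, substr($2,1,5)}' | sort | uniq -c | sort -rn | head

Each output row is a count followed by the minute bucket (date plus hh:mm), so bursts around disconnect events stand out immediately.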

Version-Release number of selected component (if applicable):
-------------------------------------------------------------

3.8.4-25


How reproducible:
-----------------

1/1


Actual results:
---------------

Brick logs are spammed with "Reply submission failed" errors.

Expected results:
-----------------

No log flooding.

Additional info:
----------------

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 3b04b36a-1837-48e8-b437-fbc091b2f992
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas007.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas007.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
[root@gqas009 bricks]#

Comment 5 Ambarish 2017-06-18 09:23:53 UTC
This is a bit more serious on my Geo-Rep stress setup, on one of my master nodes.

The message has been logged > 20,000 times in 2 days:

[root@gqas005 glusterfs]# grep -Ri "reply submission failed"|wc -l
20377
[root@gqas005 glusterfs]#

<snip>

bricks/bricks3-A1.log-20170618:[2017-06-16 20:27:17.346203] E [server.c:203:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x30e25) [0x7f43ed36de25] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x30dc8) [0x7f43ed36ddc8] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x93a6) [0x7f43ed3463a6] ) 0-: Reply submission failed
bricks/bricks3-A1.log-20170618:[2017-06-16 20:27:17.346232] E [server.c:210:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x30e25) [0x7f43ed36de25] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x30dc8) [0x7f43ed36ddc8] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x93fe) [0x7f43ed3463fe] ) 0-: Reply submission failed
bricks/bricks3-A1.log-20170618:[2017-06-16 20:27:17.346334] E [server.c:203:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x1bbeb) [0x7f43ed7b9beb] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1b609) [0x7f43ed358609] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x93a6) [0x7f43ed3463a6] ) 0-: Reply submission failed
bricks/bricks3-A1.log-20170618:[2017-06-16 20:27:17.346372] E [server.c:203:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x1bbeb) [0x7f43ed7b9beb] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1b609) [0x7f43ed358609] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x93a6) [0x7f43ed3463a6] ) 0-: Reply submission failed
bricks/bricks3-A1.log-20170618:[2017-06-16 20:27:17.346380] E [server.c:203:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x30e25) [0x7f43ed36de25] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x30dc8) [0x7f43ed36ddc8] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x93a6) [0x7f43ed3463a6] ) 0-: Reply submission failed

</snip>

sosreports will be uploaded soon.

Comment 8 Mohammed Rafi KC 2017-06-21 08:27:56 UTC
Just went through one of the brick logs (brick1-A1) on node 15, and it seems that the disconnects happened from one of the servers, so most likely those disconnects are from internal clients.

Did you run any heal info commands during this time?
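
For reference, heal activity on this volume can be checked with the standard CLI (volume name taken from this report):

gluster volume heal testvol info

Entries listed there would mean the self-heal daemon, an internal client, was actively talking to the bricks at the time.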

Comment 13 Atin Mukherjee 2018-11-10 07:30:14 UTC
What's the latest on this bug? It hasn't received any updates for more than a year now. Can we please have a decision on this bug? Is this seen in the latest releases?

Comment 14 Milind Changire 2018-11-19 10:59:24 UTC
Requesting re-validation of this BZ from Nag.

Comment 15 Milind Changire 2018-11-19 11:01:47 UTC
see comment #14

Comment 16 Milind Changire 2018-11-20 07:03:57 UTC
Please hold off re-validation for this BZ until further notice.

This looks like a ping-timer expiry BZ.
However, there are no logs from the client side to ascertain why the connection was lost, which is what resulted in the reply submission failures.
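
If this does get picked up again, a sketch of the client-side checks that would confirm or rule out ping-timer expiry (commands use the volume name from this report; exact log wording varies by version):

gluster volume get testvol network.ping-timeout
grep -Ri "ping timer" /var/log/glusterfs/

network.ping-timeout defaults to 42 seconds; once it expires, the client tears down the connection, and every reply the brick still has queued for that connection then fails with exactly this "Reply submission failed" pattern.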

Comment 18 Sahina Bose 2019-11-25 07:37:49 UTC
Is this related to ping-timer expiry?
What's the next step here?

Comment 19 Yaniv Kaul 2019-12-12 22:22:21 UTC
Closing - no one worked on this for a long time.

Comment 20 Red Hat Bugzilla 2023-09-14 03:59:12 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

