Description of problem:
-----------------------
2-node cluster. 3 clients mounted a 2x2 volume via NFS v4 and were running Bonnie++, each in a separate working directory. I see a steady stream of reply submission failures:

<snip>
bricks/bricks-testvol_brick2.log:[2017-06-14 16:20:07.545679] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x1949b) [0x7fbde758e49b] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1b0f9) [0x7fbde712f0f9] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9276) [0x7fbde711d276] ) 0-: Reply submission failed
bricks/bricks-testvol_brick2.log:[2017-06-14 16:20:07.545785] E [rpcsvc.c:1333:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xa81d8, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.testvol-server)
bricks/bricks-testvol_brick2.log:[2017-06-14 16:20:07.545817] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x1949b) [0x7fbde758e49b] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1b0f9) [0x7fbde712f0f9] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9276) [0x7fbde711d276] ) 0-: Reply submission failed
bricks/bricks-testvol_brick2.log:[2017-06-14 16:20:07.545920] E [rpcsvc.c:1333:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0xa81e7, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.testvol-server)
bricks/bricks-testvol_brick2.log:[2017-06-14 16:20:07.546062] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x1949b) [0x7fbde758e49b] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1b0
</snip>

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
3.8.4-25

How reproducible:
-----------------
1/1

Actual results:
---------------
Logs spammed with errors.

Expected results:
-----------------
No log flooding.
Additional info:
----------------
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: 3b04b36a-1837-48e8-b437-fbc091b2f992
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas007.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas007.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas009.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
[root@gqas009 bricks]#
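To put numbers on the flooding, the failures can be bucketed per second straight from the brick logs. A minimal sketch follows; the log excerpt in it is a fabricated, abridged sample in the format quoted above (the real files are the bricks/*.log files), so only the pipeline itself is the point:

```shell
# Fabricated, abridged brick-log sample mimicking the format quoted above.
cat > /tmp/brick-sample.log <<'EOF'
[2017-06-14 16:20:07.545679] E [server.c:202:server_submit_reply] 0-: Reply submission failed
[2017-06-14 16:20:07.545785] E [rpcsvc.c:1333:rpcsvc_submit_generic] 0-rpc-service: failed to submit message
[2017-06-14 16:20:07.545817] E [server.c:202:server_submit_reply] 0-: Reply submission failed
[2017-06-14 16:20:08.000001] E [server.c:202:server_submit_reply] 0-: Reply submission failed
EOF

# Bucket failures per second: grep the message, strip fractional seconds
# (everything from the first "."), then count duplicate timestamps.
grep "Reply submission failed" /tmp/brick-sample.log | cut -d. -f1 | sort | uniq -c
```

On a real node, pointing the same pipeline at the brick logs shows whether the failures arrive in short bursts (suggesting discrete disconnect events) or as a continuous stream.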
This is a bit more serious on my geo-rep stress setup, on one of my master nodes. The message has been logged more than 20000 times in 2 days:

[root@gqas005 glusterfs]# grep -Ri "reply submission failed"|wc -l
20377
[root@gqas005 glusterfs]#

<snip>
bricks/bricks3-A1.log-20170618:[2017-06-16 20:27:17.346203] E [server.c:203:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x30e25) [0x7f43ed36de25] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x30dc8) [0x7f43ed36ddc8] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x93a6) [0x7f43ed3463a6] ) 0-: Reply submission failed
bricks/bricks3-A1.log-20170618:[2017-06-16 20:27:17.346232] E [server.c:210:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x30e25) [0x7f43ed36de25] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x30dc8) [0x7f43ed36ddc8] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x93fe) [0x7f43ed3463fe] ) 0-: Reply submission failed
bricks/bricks3-A1.log-20170618:[2017-06-16 20:27:17.346334] E [server.c:203:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x1bbeb) [0x7f43ed7b9beb] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1b609) [0x7f43ed358609] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x93a6) [0x7f43ed3463a6] ) 0-: Reply submission failed
bricks/bricks3-A1.log-20170618:[2017-06-16 20:27:17.346372] E [server.c:203:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x1bbeb) [0x7f43ed7b9beb] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1b609) [0x7f43ed358609] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x93a6) [0x7f43ed3463a6] ) 0-: Reply submission failed
bricks/bricks3-A1.log-20170618:[2017-06-16 20:27:17.346380] E [server.c:203:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x30e25) [0x7f43ed36de25] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x30dc8) [0x7f43ed36ddc8] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x93a6) [0x7f43ed3463a6] ) 0-: Reply submission failed
</snip>

sosreports will be uploaded soon.
Just went through one of the brick (brick1-A1) logs on node 15, and it seems the disconnects happened from one of the servers, so most likely those disconnects are from internal clients. Did you run any heal info commands during this time?
What's the latest on this bug? It hasn't received any updates for more than a year now. Can we please have a decision on this bug? Is this seen in the latest releases?
Requesting re-validation of this BZ from Nag.
see comment #14
Please hold off on re-validation of this BZ until further notice. This looks like a ping-timer-expiry BZ; however, there are no logs from the client side to ascertain why the connection was lost and resulted in the reply submission failures.
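For whoever picks this up: the client-side evidence for the ping-timer theory would be the expiry/disconnect messages in the mount logs under /var/log/glusterfs/ on the clients. A sketch of what to grep for, using a fabricated two-line excerpt (paths, timestamps, and source-line numbers in it are illustrative, not from this setup):

```shell
# Fabricated client-log excerpt showing the two message types that would
# confirm a ping-timer expiry; real logs live in /var/log/glusterfs/ on
# the mount hosts.
cat > /tmp/client-sample.log <<'EOF'
[2017-06-16 20:27:10.000000] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-testvol-client-2: server 10.70.37.1:49153 has not responded in the last 42 seconds, disconnecting.
[2017-06-16 20:27:10.000100] I [client.c:2280:client_rpc_notify] 0-testvol-client-2: disconnected from testvol-client-2
EOF

# Count the evidence lines; on a real client, point this at the mount log.
grep -cE "ping_timer_expired|disconnected" /tmp/client-sample.log
```

On the servers, `gluster volume get testvol network.ping-timeout` would show the timeout in effect (the default is 42 seconds); matching expiry timestamps on the clients against the "Reply submission failed" bursts on the bricks would confirm or rule out this theory.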
Is this related to ping-timer expiry? What's next step here?
Closing: no one has worked on this for a long time.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days