Bug 1420996 - Brick logs spammed with Reply submission failure messages during I/O. [NEEDINFO]
Summary: Brick logs spammed with Reply submission failure messages during I/O.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rpc
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Milind Changire
QA Contact: Rahul Hinduja
URL:
Whiteboard: ssl
Depends On:
Blocks:
 
Reported: 2017-02-10 05:17 UTC by Ambarish
Modified: 2018-11-10 07:17 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-10 07:17:02 UTC
Target Upstream Version:
rgowdapp: needinfo? (asoman)



Description Ambarish 2017-02-10 05:17:36 UTC
Description of problem:
------------------------

4-node Ganesha cluster, with one 2*2 volume mounted via NFSv4 on 4 clients. All servers and clients are TLS authenticated.

All clients were pumping I/O via Bonnie first and then iozone.

Almost 8 hours into the test, I could see that on one of the nodes, the brick logs were flooded with a steady stream of Reply submission failures:

[root@gqas006 glusterfs]# grep -R -i "reply submission failed" | wc -l
104598

<snip>
bricks/bricks-testvol_brick2.log:[2017-02-09 21:45:48.103750] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x18a9e) [0x7fbbe5bf4a9e] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1ac89) [0x7fbbe5796c89] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9186) [0x7fbbe5785186] ) 0-: Reply submission failed
bricks/bricks-testvol_brick2.log:[2017-02-09 21:45:48.103831] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x18a9e) [0x7fbbe5bf4a9e] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1ac89) [0x7fbbe5796c89] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9186) [0x7fbbe5785186] ) 0-: Reply submission failed
bricks/bricks-testvol_brick2.log:[2017-02-09 21:45:48.103866] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x18a9e) [0x7fbbe5bf4a9e] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x1ac89) [0x7fbbe5796c89] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9186) [0x7fbbe5785186] ) 0-: Reply submission failed
[... 19 further near-identical "Reply submission failed" entries logged within the same second, elided ...]

</snip>



Version-Release number of selected component (if applicable):
-------------------------------------------------------------

glusterfs-3.8.4-14.el7rhgs.x86_64

openssl-1.0.1e-60.el7.x86_64

nfs-ganesha-2.4.1-7.el7rhgs.x86_64


How reproducible:
-----------------

1/1


Actual results:
---------------


Logs are flooded with errors, and the log size is bloated.

Expected results:
-----------------

No log flooding.

Additional info:
----------------

Client and Server OS : RHEL 7.3

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: a8851b7e-11b8-4a39-ae0e-d58e64bec8a8
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas008.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: off
server.allow-insecure: on
performance.stat-prefetch: off
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
client.ssl: on
server.ssl: on
cluster.enable-shared-storage: enable
nfs-ganesha: enable
[root@gqas013 ~]# 


Status of volume: testvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas013.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brick0                        49153     0          Y       15440
Brick gqas005.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brick1                        49153     0          Y       16253
Brick gqas006.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brick2                        49153     0          Y       17677
Brick gqas008.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brick3                        49152     0          Y       7226 
Self-heal Daemon on localhost               N/A       N/A        Y       23898
Self-heal Daemon on gqas005.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       2074 
Self-heal Daemon on gqas008.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       8652 
Self-heal Daemon on gqas006.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       4762 
 
Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@gqas013 ~]#

Comment 2 Ambarish 2017-02-10 05:26:44 UTC
Proposing this as a blocker since it is a log flooding issue, which impacts the usability of the product as a whole.

Setup preserved, in case Dev wants to take a look.

Comment 4 Ambarish 2017-02-10 05:32:19 UTC
Adding Atin's changes that collided with mine.

Comment 5 Atin Mukherjee 2017-02-10 05:36:47 UTC
Is this flooding seen throughout, from its first occurrence till your test ended?

Comment 6 Ambarish 2017-02-10 05:40:58 UTC
Atin,

Yes.

Sometimes it is more frequent.

iozone is still running, and I am still seeing this message pop up on one of my servers, though not as frequently as before.

The snippet I added in the description shows the same message being logged many times within a single minute.

Comment 7 Mohit Agrawal 2017-02-10 06:33:55 UTC
Hi,

After checking the logs, it seems "server_submit_reply" is failing due to an error returned by rpcsvc_submit_generic at the time of submitting the message for transmission. As per the logs, rpcsvc_submit_generic is failing because the server is not connected, but I am not able to see any message explaining why the server is disconnected.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 

[2017-02-09 09:35:41.821167] I [socket.c:348:ssl_setup_connection] 0-tcp.testvol-server: peer CN = gqas005.sbu.lab.eng.bos.redhat.com
[2017-02-09 09:35:41.821261] I [socket.c:351:ssl_setup_connection] 0-tcp.testvol-server: SSL verification succeeded (client: 192.168.97.141:1005)
[2017-02-09 09:35:41.821877] I [login.c:34:gf_auth] 0-auth/login: connecting user name: gqas005.sbu.lab.eng.bos.redhat.com
[2017-02-09 09:35:41.821900] I [MSGID: 115029] [server-handshake.c:693:server_setvolume] 0-testvol-server: accepted client from gqas005.sbu.lab.eng.bos.redhat.com-18777-2017/02/09-09:35:37:710014-testvol-client-2-0-0 (version: 3.8.4)
[2017-02-09 21:45:47.376556] I [socket.c:3520:socket_submit_reply] 0-tcp.testvol-server: not connected (priv->connected = -1)
[2017-02-09 21:45:47.376679] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x958187, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.testvol-server)
[2017-02-09 21:45:47.376686] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x95818b, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.testvol-server)
[2017-02-09 21:45:47.377747] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-testvol-server: disconnecting connection from gqas006.sbu.lab.eng.bos.redhat.com-4306-2017/02/09-09:15:12:450689-testvol-client-2-0-0



[2017-02-09 21:45:47.386913] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x950977, Program: GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (tcp.testvol-server)
[2017-02-09 21:45:47.460091] W [socket.c:595:__socket_rwv] 0-tcp.testvol-server: writev on 192.168.97.141:1005 failed (No data available)
[2017-02-09 21:45:47.460275] I [socket.c:3520:socket_submit_reply] 0-tcp.testvol-server: not connected (priv->connected = -1)
[2017-02-09 21:45:47.460311] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x5a5a50, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.testvol-server)
[2017-02-09 21:45:47.460336] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x5a59b4, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.testvol-server)
[2017-02-09 21:45:47.460353] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x5b183c, Program: GlusterFS 3.3, ProgVers: 330, Proc: 16) to rpc-transport (tcp.testvol-server)
[2017-02-09 21:45:47.460484] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-testvol-server: disconnecting connection from gqas005.sbu.lab.eng.bos.redhat.com-18777-2017/02/09-09:35:37:710014-testvol-client-2-0-0

>>>>>>>>>>>>>>>>>>>>>>>

I will try to reproduce the issue to find the root cause.

Regards
Mohit Agrawal

Comment 8 Ambarish 2017-02-10 10:44:21 UTC
I've reproduced the issue on the same setup for the second time.

I see logs getting spammed on another server this time.

Shared setup with Dev.

Comment 9 Ambarish 2017-02-11 09:21:19 UTC
This time the message was logged 77k times in nearly 8 hours on another server:

[root@gqas013 glusterfs]# grep -R -i "reply submission fail"|wc -l
77549
[root@gqas013 glusterfs]#

Comment 10 Ambarish 2017-02-12 05:34:53 UTC
And on another server it was logged more than 200,000 times:

[root@gqas006 glusterfs]# grep -Ri "reply submission fail"|wc -l
272906
[root@gqas006 glusterfs]#

Comment 11 Ambarish 2017-02-13 03:31:02 UTC
Mohit/Atin,

I don't think this problem is really specific to SSL.

I could reproduce it on a non-SSL setup (Ganesha + MTSH) as well, though it was not as severe:

bricks/bricks12-brick1.log:[2017-02-12 17:59:13.519221] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x18d62) [0x7f912de05d62] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x18969) [0x7f912d9a5969] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9186) [0x7f912d996186] ) 0-: Reply submission failed
bricks/bricks12-brick1.log:[2017-02-12 17:59:13.519290] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x18d62) [0x7f912de05d62] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x18969) [0x7f912d9a5969] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9186) [0x7f912d996186] ) 0-: Reply submission failed



[root@gqas003 glusterfs]# grep -Ri "reply subm"|wc -l
155
[root@gqas003 glusterfs]# 

I am adjusting the summary accordingly.


********
Vol INFO(in case you guys want to take a look) :
********

[root@gqas003 ~]# gluster v info
 
Volume Name: butcher
Type: Distributed-Disperse
Volume ID: b3b2211e-9e30-4227-9785-ad0f325f8d21
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x (4 + 2) = 72
Transport-type: tcp
Bricks:
Brick1: gqas003.sbu.lab.eng.bos.redhat.com:/bricks1/brick1
Brick2: gqas009.sbu.lab.eng.bos.redhat.com:/bricks1/brick1
Brick3: gqas010.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick4: gqas012.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick5: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick6: gqas015.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick7: gqas003.sbu.lab.eng.bos.redhat.com:/bricks2/brick1
Brick8: gqas009.sbu.lab.eng.bos.redhat.com:/bricks2/brick1
Brick9: gqas010.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick10: gqas012.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick11: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick12: gqas015.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick13: gqas003.sbu.lab.eng.bos.redhat.com:/bricks3/brick1
Brick14: gqas009.sbu.lab.eng.bos.redhat.com:/bricks3/brick1
Brick15: gqas010.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick16: gqas012.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick17: gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick18: gqas015.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick19: gqas003.sbu.lab.eng.bos.redhat.com:/bricks4/brick1
Brick20: gqas009.sbu.lab.eng.bos.redhat.com:/bricks4/brick1
Brick21: gqas010.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick22: gqas012.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick23: gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick24: gqas015.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick25: gqas003.sbu.lab.eng.bos.redhat.com:/bricks5/brick1
Brick26: gqas009.sbu.lab.eng.bos.redhat.com:/bricks5/brick1
Brick27: gqas010.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick28: gqas012.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick29: gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick30: gqas015.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick31: gqas003.sbu.lab.eng.bos.redhat.com:/bricks6/brick1
Brick32: gqas009.sbu.lab.eng.bos.redhat.com:/bricks6/brick1
Brick33: gqas010.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick34: gqas012.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick35: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick36: gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick37: gqas003.sbu.lab.eng.bos.redhat.com:/bricks7/brick1
Brick38: gqas009.sbu.lab.eng.bos.redhat.com:/bricks7/brick1
Brick39: gqas010.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick40: gqas012.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick41: gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick42: gqas015.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick43: gqas003.sbu.lab.eng.bos.redhat.com:/bricks8/brick1
Brick44: gqas009.sbu.lab.eng.bos.redhat.com:/bricks8/brick1
Brick45: gqas010.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick46: gqas012.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick47: gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick48: gqas015.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick49: gqas003.sbu.lab.eng.bos.redhat.com:/bricks9/brick1
Brick50: gqas009.sbu.lab.eng.bos.redhat.com:/bricks9/brick1
Brick51: gqas010.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick52: gqas012.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick53: gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick54: gqas015.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick55: gqas003.sbu.lab.eng.bos.redhat.com:/bricks10/brick1
Brick56: gqas009.sbu.lab.eng.bos.redhat.com:/bricks10/brick1
Brick57: gqas010.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick58: gqas012.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick59: gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick60: gqas015.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick61: gqas003.sbu.lab.eng.bos.redhat.com:/bricks11/brick1
Brick62: gqas009.sbu.lab.eng.bos.redhat.com:/bricks11/brick1
Brick63: gqas010.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick64: gqas012.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick65: gqas014.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick66: gqas015.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick67: gqas003.sbu.lab.eng.bos.redhat.com:/bricks12/brick1
Brick68: gqas009.sbu.lab.eng.bos.redhat.com:/bricks12/brick1
Brick69: gqas010.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick70: gqas012.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick71: gqas014.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick72: gqas015.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
disperse.shd-max-threads: 64
disperse.shd-wait-qlength: 655536
cluster.enable-shared-storage: enable
nfs-ganesha: enable

Comment 12 Mohit Agrawal 2017-02-13 04:12:49 UTC
Hi Ambarish,
  
   Thanks for trying to reproduce the issue in a non-SSL environment. I also think the issue is not specific to SSL.
   As I mentioned in my previous comment, server_submit_reply is failing because rpcsvc_submit_generic failed, and rpcsvc_submit_generic is failing because the other endpoint of the socket is not connected, so the server is not able to send the message on the socket. After checking the ganesha-gfapi logs, I saw multiple frame-unwind messages, so I believe the other endpoint of the socket was closed by the Ganesha gfapi side.


  [2017-02-12 17:30:07.522548] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fee03e5b672] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fee03c2176e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fee03c2187e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7fee03c22fe4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7fee03c23874] ))))) 0-testvol-client-3: forced unwinding frame type(GF-DUMP) op(DUMP(1)) called at 2017-02-12 17:30:07.522226 (xid=0x113415f)

   
  I have checked another bugzilla (https://bugzilla.redhat.com/show_bug.cgi?id=1387767), also opened on nfs, that has the same kind of messages; I think the issue needs to be checked at the NFS component level.


Regards
Mohit Agrawal

Comment 13 Atin Mukherjee 2017-02-13 04:20:27 UTC
Soumya - can you please check the above behaviour (refer to comments 11 & 12)?

Comment 14 Soumya Koduri 2017-02-13 05:38:47 UTC
Could you please point me to the gfapi.log location? Bug 1387767 is filed upstream on the gluster-nfs component, as the access protocol there is Gluster-NFS, but the setup here is NFS-Ganesha + Gluster. Also, if the gluster client is closing the socket, the close is either coming from the protocol-client/rpc-socket layer itself or sent by the parent xlator, which in this case is gfapi. NFS-Ganesha/the application layer never closes any connection unless the volume is being unexported or the process is going down.

Comment 15 Soumya Koduri 2017-02-13 09:01:31 UTC
Looking at brick log reported in comment#1  -

/var/log/glusterfs/bricks/bricks-testvol_brick2.log-20170212 (on gqas006)
[2017-02-09 09:35:41.821900] I [MSGID: 115029] [server-handshake.c:693:server_setvolume] 0-testvol-server: accepted client from gqas005.sbu.lab.eng.bos.redhat.com-18777-2017/02/09-09:35:37:710014-testvol-client-2-0-0 (version: 3.8.4)
[2017-02-09 21:45:47.376556] I [socket.c:3520:socket_submit_reply] 0-tcp.testvol-server: not connected (priv->connected = -1)
[2017-02-09 21:45:47.376679] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x958187, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.testvol-server)
[2017-02-09 21:45:47.376686] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x95818b, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.testvol-server)


The first disconnect message was seen at 21:45:47...


Status of volume: testvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas013.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brick0                        49153     0          Y       15440
Brick gqas005.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brick1                        49153     0          Y       16253
Brick gqas006.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brick2                        49153     0          Y       17677
Brick gqas008.sbu.lab.eng.bos.redhat.com:/b
ricks/testvol_brick3                        49152     0          Y       7226 
Self-heal Daemon on localhost               N/A       N/A        Y       4762 
Self-heal Daemon on gqas008.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       8652 
Self-heal Daemon on gqas005.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       2074 
Self-heal Daemon on gqas013.sbu.lab.eng.bos
.redhat.com                                 N/A       N/A        Y       23898
 
Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks


This brick process maps to testvol-client-2 in the volume graph. Looking at the client (gfapi.log) around the same time on the same machine --

[2017-02-09 14:54:51.019183] I [MSGID: 109066] [dht-rename.c:1569:dht_rename] 0-testvol-dht: renaming /run4855/Bonnie.4877 (hash=testvol-replicate-1/cache=testvol-replicate-1) => /run4855/.nfs93111cd39bceb47200000001 (hash=testvol-replicate-1/cache=<nul>)
[2017-02-09 21:12:21.576416] E [rpc-clnt.c:200:call_bail] 0-testvol-client-2: bailing out frame type(GlusterFS 3.3) op(FXATTROP(34)) xid = 0x959b3c sent = 2017-02-09 20:42:19.036664. timeout = 1800 for 192.168.97.142:49153
[2017-02-09 21:12:21.576630] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-testvol-client-2: remote operation failed
[2017-02-09 21:12:21.588295] E [rpc-clnt.c:200:call_bail] 0-testvol-client-2: bailing out frame type(GlusterFS 3.3) op(FXATTROP(34)) xid = 0x959b3b sent = 2017-02-09 20:42:19.036577. timeout = 1800 for 192.168.97.142:49153


Starting from 21:12:21..., many calls got bailed out. Does that mean the server was very slow in responding to those requests?

Post that --> 
[2017-02-09 21:45:47.374884] C [rpc-clnt.c:465:rpc_clnt_fill_request_info] 0-testvol-client-2: cannot lookup the saved frame corresponding to xid (9798026)
[2017-02-09 21:45:47.374991] W [socket.c:1923:__socket_read_reply] 0-testvol-client-2: notify for event MAP_XID failed for 192.168.97.142:49153
[2017-02-09 21:45:47.375036] E [socket.c:2568:socket_poller] 0-testvol-client-2: socket_poller 192.168.97.142:49153 failed (Success)

There is a failure in looking up a saved frame with xid (9798026), followed by an error in socket_poller(). Maybe this led to a transport disconnect, causing the unwinding of saved frames that follows immediately in the log file.


[2017-02-09 21:45:47.407880] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f1af2e3f672] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f1af2c0576e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f1af2c0587e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f1af2c06fe4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f1af2c07874] ))))) 0-testvol-client-2: forced unwinding frame type(GlusterFS 3.3) op(FINODELK(30)) called at 2017-02-09 20:41:28.651021 (xid=0x950977)


I am not an expert in RPC, but at least it looks like there is no disconnect generated in the upper layer (gfapi). The questions which arise are why the server is slow in responding, leading to many calls bailing out on the client, and why there has been an error in socket_poller(). CCing Du as well and requesting him to take a look.

Comment 16 Atin Mukherjee 2017-02-13 09:20:42 UTC
This looks like a problem more on the rpc side. Adjusting the component.

Comment 23 Milind Changire 2017-02-27 14:37:08 UTC
Atin,
There has been an attempt to reproduce the bug after the needinfo was raised on me.
The issue is no longer reproducible.
However, Ambarish is keeping an eye open for this issue.

Comment 24 Atin Mukherjee 2017-02-27 14:50:40 UTC
(In reply to Milind Changire from comment #23)
> Atin,
> There has been an attempt to reproduce the bug after the needinfo was
> raised on me.
> The issue is no longer reproducible.
> However, Ambarish is keeping an eye open for this issue.

If this is not reproducible, why can't we close this bug and Ambarish can reopen if the issue persists?

Comment 29 Milind Changire 2017-09-05 04:27:58 UTC
Moving to 3.4.0-beyond due to comment #26

Comment 30 Raghavendra G 2017-09-06 06:13:31 UTC
There are two things that have caused disconnections, as per the logs:

* call-bail - the client did not receive a response to a fop within 30 minutes.
* A saved frame corresponding to the xid in a reply could not be found.

Both of these can happen if the xid in the response is corrupted. bz 1409135 describes corruption in rpc messages when upcalls are used. I am wondering whether this bz is a duplicate of bz 1409135 (which is fixed in a version of glusterfs higher than the version this bug is filed on). Also note that Milind says in comment #23 that the issue is not reproducible.

Note that these logs go away once the disconnections are fixed. Since we suspect this bug is fixed, can we retest it? @Ambarish, what are your thoughts on this?

Comment 33 Atin Mukherjee 2018-11-10 07:17:02 UTC
I see the bugs referenced in comment 30 are already addressed and fixed in the current releases. Since we haven't heard about this problem in the latest releases, I'm closing this bug. Please feel free to reopen if the issue persists.

