Bug 1420996
| Summary: | Brick logs spammed with Reply submission failure messages during I/O. | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Ambarish <asoman> |
| Component: | rpc | Assignee: | Milind Changire <mchangir> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Rahul Hinduja <rhinduja> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | rhgs-3.2 | CC: | amukherj, bturner, mchangir, rcyriac, rgowdapp, rhinduja, rhs-bugs, skoduri, storage-qa-internal, vbellur |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | ssl | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-11-10 07:17:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Ambarish
2017-02-10 05:17:36 UTC
Proposing this as a blocker since it is a log flooding issue, which impacts the usability of the product as a whole. Setup preserved, in case Dev wants to take a look.

Adding Atin's changes that collided with mine.

Is this flooding seen throughout, from its first occurrence till your test ended?

Atin,

Yes. Sometimes it is more frequent. iozone is still running, and I am still seeing this message getting popped up on one of my servers, though not as frequently as before. The snippet I added in the description shows the same message being logged many times within one minute.

Hi,

After checking the logs, it seems "server_submit_reply" is failing due to an error returned by rpcsvc_submit_generic at the time of submitting the message for transmission. Per the logs, rpcsvc_submit_generic is failing because the server is not connected, but I am not able to see any message explaining why the server is disconnected.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2017-02-09 09:35:41.821167] I [socket.c:348:ssl_setup_connection] 0-tcp.testvol-server: peer CN = gqas005.sbu.lab.eng.bos.redhat.com
[2017-02-09 09:35:41.821261] I [socket.c:351:ssl_setup_connection] 0-tcp.testvol-server: SSL verification succeeded (client: 192.168.97.141:1005)
[2017-02-09 09:35:41.821877] I [login.c:34:gf_auth] 0-auth/login: connecting user name: gqas005.sbu.lab.eng.bos.redhat.com
[2017-02-09 09:35:41.821900] I [MSGID: 115029] [server-handshake.c:693:server_setvolume] 0-testvol-server: accepted client from gqas005.sbu.lab.eng.bos.redhat.com-18777-2017/02/09-09:35:37:710014-testvol-client-2-0-0 (version: 3.8.4)
[2017-02-09 21:45:47.376556] I [socket.c:3520:socket_submit_reply] 0-tcp.testvol-server: not connected (priv->connected = -1)
[2017-02-09 21:45:47.376679] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x958187, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.testvol-server)
[2017-02-09 21:45:47.376686] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x95818b, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.testvol-server)
[2017-02-09 21:45:47.377747] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-testvol-server: disconnecting connection from gqas006.sbu.lab.eng.bos.redhat.com-4306-2017/02/09-09:15:12:450689-testvol-client-2-0-0
[2017-02-09 21:45:47.386913] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x950977, Program: GlusterFS 3.3, ProgVers: 330, Proc: 30) to rpc-transport (tcp.testvol-server)
[2017-02-09 21:45:47.460091] W [socket.c:595:__socket_rwv] 0-tcp.testvol-server: writev on 192.168.97.141:1005 failed (No data available)
[2017-02-09 21:45:47.460275] I [socket.c:3520:socket_submit_reply] 0-tcp.testvol-server: not connected (priv->connected = -1)
[2017-02-09 21:45:47.460311] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x5a5a50, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.testvol-server)
[2017-02-09 21:45:47.460336] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x5a59b4, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.testvol-server)
[2017-02-09 21:45:47.460353] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x5b183c, Program: GlusterFS 3.3, ProgVers: 330, Proc: 16) to rpc-transport (tcp.testvol-server)
[2017-02-09 21:45:47.460484] I [MSGID: 115036] [server.c:552:server_rpc_notify] 0-testvol-server: disconnecting connection from gqas005.sbu.lab.eng.bos.redhat.com-18777-2017/02/09-09:35:37:710014-testvol-client-2-0-0
>>>>>>>>>>>>>>>>>>>>>>>

I will try to reproduce the issue to find out the RCA.

Regards,
Mohit Agrawal
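To make the chain above concrete, the block below is a minimal, self-contained C model of the path the log lines trace (socket_submit_reply -> rpcsvc_submit_generic -> server_submit_reply). It is an illustrative sketch of the logic implied by the logs, not the actual GlusterFS source; the `struct transport` type and its fields are simplified assumptions.

```c
/* Illustrative sketch only -- NOT the actual GlusterFS source.
 * Models the error path seen in the brick log:
 *   socket_submit_reply -> rpcsvc_submit_generic -> server_submit_reply
 */
#include <stdio.h>

struct transport {            /* simplified stand-in for the rpc transport */
    const char *name;
    int         connected;    /* -1 once the peer has gone away */
};

/* Lowest layer: refuses to write once the TCP/SSL session is dead. */
static int socket_submit_reply(struct transport *t)
{
    if (t->connected != 1) {
        printf("I [socket] %s: not connected (priv->connected = %d)\n",
               t->name, t->connected);
        return -1;
    }
    return 0; /* would writev() the reply here */
}

/* RPC layer: logs one E-level line per reply it cannot hand off. */
static int rpcsvc_submit_generic(struct transport *t, unsigned xid)
{
    if (socket_submit_reply(t) != 0) {
        printf("E [rpcsvc] failed to submit message (XID: 0x%x) "
               "to rpc-transport (%s)\n", xid, t->name);
        return -1;
    }
    return 0;
}

/* protocol/server: logs the "Reply submission failed" line that floods
 * the brick log -- once for every outstanding fop reply. */
static void server_submit_reply(struct transport *t, unsigned xid)
{
    if (rpcsvc_submit_generic(t, xid) != 0)
        printf("E [server] Reply submission failed\n");
}

int main(void)
{
    struct transport t = { "tcp.testvol-server", -1 };

    /* Under heavy I/O many replies are queued for the dead connection,
     * and each one emits the same set of messages -- hence the flood. */
    for (unsigned xid = 0x958187; xid <= 0x958189; xid++)
        server_submit_reply(&t, xid);
    return 0;
}
```

Nothing in this path rate-limits or suppresses repeats, which is why a single long-lived disconnect under sustained I/O can produce tens of thousands of identical lines.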
I've reproduced the issue on the same setup for the second time. I see logs getting spammed on another server this time. Shared the setup with Dev. This time the log was populated 77k times in nearly 8 hours on one server:

[root@gqas013 glusterfs]# grep -R -i "reply submission fail"|wc -l
77549
[root@gqas013 glusterfs]#

And on another server it was populated more than 200,000 times:

[root@gqas006 glusterfs]# grep -Ri "reply submission fail"|wc -l
272906
[root@gqas006 glusterfs]#

Mohit/Atin,

I don't think this problem is really specific to SSL. I could reproduce it on a non-SSL setup (Ganesha+MTSH) as well, though it was not as severe:

bricks/bricks12-brick1.log:[2017-02-12 17:59:13.519221] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x18d62) [0x7f912de05d62] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x18969) [0x7f912d9a5969] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9186) [0x7f912d996186] ) 0-: Reply submission failed
bricks/bricks12-brick1.log:[2017-02-12 17:59:13.519290] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x18d62) [0x7f912de05d62] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x18969) [0x7f912d9a5969] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9186) [0x7f912d996186] ) 0-: Reply submission failed

[root@gqas003 glusterfs]# grep -Ri "reply subm"|wc -l
155
[root@gqas003 glusterfs]#

I am adjusting the summary accordingly.

******** Vol INFO (in case you guys want to take a look): ********

[root@gqas003 ~]# gluster v info

Volume Name: butcher
Type: Distributed-Disperse
Volume ID: b3b2211e-9e30-4227-9785-ad0f325f8d21
Status: Started
Snapshot Count: 0
Number of Bricks: 12 x (4 + 2) = 72
Transport-type: tcp
Bricks:
Brick1: gqas003.sbu.lab.eng.bos.redhat.com:/bricks1/brick1
Brick2: gqas009.sbu.lab.eng.bos.redhat.com:/bricks1/brick1
Brick3: gqas010.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick4: gqas012.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick5: gqas014.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick6: gqas015.sbu.lab.eng.bos.redhat.com:/bricks1/brick
Brick7: gqas003.sbu.lab.eng.bos.redhat.com:/bricks2/brick1
Brick8: gqas009.sbu.lab.eng.bos.redhat.com:/bricks2/brick1
Brick9: gqas010.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick10: gqas012.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick11: gqas014.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick12: gqas015.sbu.lab.eng.bos.redhat.com:/bricks2/brick
Brick13: gqas003.sbu.lab.eng.bos.redhat.com:/bricks3/brick1
Brick14: gqas009.sbu.lab.eng.bos.redhat.com:/bricks3/brick1
Brick15: gqas010.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick16: gqas012.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick17: gqas014.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick18: gqas015.sbu.lab.eng.bos.redhat.com:/bricks3/brick
Brick19: gqas003.sbu.lab.eng.bos.redhat.com:/bricks4/brick1
Brick20: gqas009.sbu.lab.eng.bos.redhat.com:/bricks4/brick1
Brick21: gqas010.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick22: gqas012.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick23: gqas014.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick24: gqas015.sbu.lab.eng.bos.redhat.com:/bricks4/brick
Brick25: gqas003.sbu.lab.eng.bos.redhat.com:/bricks5/brick1
Brick26: gqas009.sbu.lab.eng.bos.redhat.com:/bricks5/brick1
Brick27: gqas010.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick28: gqas012.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick29: gqas014.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick30: gqas015.sbu.lab.eng.bos.redhat.com:/bricks5/brick
Brick31: gqas003.sbu.lab.eng.bos.redhat.com:/bricks6/brick1
Brick32: gqas009.sbu.lab.eng.bos.redhat.com:/bricks6/brick1
Brick33: gqas010.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick34: gqas012.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick35: gqas014.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick36: gqas015.sbu.lab.eng.bos.redhat.com:/bricks6/brick
Brick37: gqas003.sbu.lab.eng.bos.redhat.com:/bricks7/brick1
Brick38: gqas009.sbu.lab.eng.bos.redhat.com:/bricks7/brick1
Brick39: gqas010.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick40: gqas012.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick41: gqas014.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick42: gqas015.sbu.lab.eng.bos.redhat.com:/bricks7/brick
Brick43: gqas003.sbu.lab.eng.bos.redhat.com:/bricks8/brick1
Brick44: gqas009.sbu.lab.eng.bos.redhat.com:/bricks8/brick1
Brick45: gqas010.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick46: gqas012.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick47: gqas014.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick48: gqas015.sbu.lab.eng.bos.redhat.com:/bricks8/brick
Brick49: gqas003.sbu.lab.eng.bos.redhat.com:/bricks9/brick1
Brick50: gqas009.sbu.lab.eng.bos.redhat.com:/bricks9/brick1
Brick51: gqas010.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick52: gqas012.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick53: gqas014.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick54: gqas015.sbu.lab.eng.bos.redhat.com:/bricks9/brick
Brick55: gqas003.sbu.lab.eng.bos.redhat.com:/bricks10/brick1
Brick56: gqas009.sbu.lab.eng.bos.redhat.com:/bricks10/brick1
Brick57: gqas010.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick58: gqas012.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick59: gqas014.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick60: gqas015.sbu.lab.eng.bos.redhat.com:/bricks10/brick
Brick61: gqas003.sbu.lab.eng.bos.redhat.com:/bricks11/brick1
Brick62: gqas009.sbu.lab.eng.bos.redhat.com:/bricks11/brick1
Brick63: gqas010.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick64: gqas012.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick65: gqas014.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick66: gqas015.sbu.lab.eng.bos.redhat.com:/bricks11/brick
Brick67: gqas003.sbu.lab.eng.bos.redhat.com:/bricks12/brick1
Brick68: gqas009.sbu.lab.eng.bos.redhat.com:/bricks12/brick1
Brick69: gqas010.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick70: gqas012.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick71: gqas014.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Brick72: gqas015.sbu.lab.eng.bos.redhat.com:/bricks12/brick
Options Reconfigured:
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
disperse.shd-max-threads: 64
disperse.shd-wait-qlength: 655536
cluster.enable-shared-storage: enable
nfs-ganesha: enable
Hi Ambarish,

Thanks for trying to reproduce the issue in a non-SSL environment. I also think the issue is not specific to SSL. As I mentioned in my previous comment, server_submit_reply is failing because rpcsvc_submit_generic failed, and rpcsvc_submit_generic is failing because the other end point of the socket is not connected, so the server is not able to send the message on the socket. After checking the ganesha-gfapi logs, I saw there are multiple frame unwind messages, so I believe the other end point of the socket was closed by the ganesha gfapi side.

[2017-02-12 17:30:07.522548] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fee03e5b672] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fee03c2176e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fee03c2187e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7fee03c22fe4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7fee03c23874] ))))) 0-testvol-client-3: forced unwinding frame type(GF-DUMP) op(DUMP(1)) called at 2017-02-12 17:30:07.522226 (xid=0x113415f)

I have checked another bugzilla (https://bugzilla.redhat.com/show_bug.cgi?id=1387767), also opened against nfs, that has the same kind of messages; I think the issue needs to be checked at the nfs component level.

Regards,
Mohit Agrawal
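For context on the "forced unwinding frame" messages: when the rpc client sees a disconnect, every in-flight request still waiting for a reply is failed back to the caller in one sweep, one error line per frame. The block below is a small, self-contained C model of that sweep; it is an illustrative sketch, not the actual saved_frames code, and the linked-list layout is an assumption.

```c
/* Illustrative sketch only -- NOT the actual GlusterFS rpc-clnt code.
 * Models why one disconnect yields a burst of "forced unwinding frame"
 * errors: every in-flight request is unwound at once. */
#include <stdio.h>
#include <stdlib.h>

struct saved_frame {
    unsigned            xid;
    const char         *op;
    struct saved_frame *next;
};

/* Connection cleanup: walk the list of requests still awaiting a reply
 * and fail each one -- one E-level log line apiece. */
static void saved_frames_unwind(struct saved_frame *head)
{
    while (head) {
        struct saved_frame *next = head->next;
        printf("E [rpc-clnt] forced unwinding frame type(GlusterFS 3.3) "
               "op(%s) (xid=0x%x)\n", head->op, head->xid);
        free(head);
        head = next;
    }
}

int main(void)
{
    /* Three fops outstanding at the moment of disconnect
     * (op names taken from the logs above). */
    const char *ops[] = { "FINODELK(30)", "FXATTROP(34)", "FXATTROP(34)" };
    struct saved_frame *head = NULL;

    for (int i = 0; i < 3; i++) {
        struct saved_frame *f = malloc(sizeof(*f));
        f->xid  = 0x950977 + (unsigned)i;
        f->op   = ops[i];
        f->next = head;
        head    = f;
    }

    saved_frames_unwind(head); /* triggered by the disconnect path */
    return 0;
}
```

This is why the gfapi log and the brick log flood at the same moment: the client unwinds its pending frames while the brick keeps trying, and failing, to submit replies for those same requests.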
Soumya - can you please check the above behaviour (refer comment 11 & 12)?

Could you please point me to the gfapi.log location.

Bug 1387767 is filed upstream on the gluster-nfs component, as the access protocol there is gluster-NFS, but the setup here is NFS-Ganesha + Gluster. Also, if the gluster client is closing the socket, that is either coming from the protocol-client/rpc-socket layer itself or sent by a parent xlator, which in this case is gfapi. NFS-Ganesha/the application layer never closes any connection unless a volume is being unexported or the process is going down.

Looking at the brick log reported in comment #1 - /var/log/glusterfs/bricks/bricks-testvol_brick2.log-20170212 (on gqas006):

[2017-02-09 09:35:41.821900] I [MSGID: 115029] [server-handshake.c:693:server_setvolume] 0-testvol-server: accepted client from gqas005.sbu.lab.eng.bos.redhat.com-18777-2017/02/09-09:35:37:710014-testvol-client-2-0-0 (version: 3.8.4)
[2017-02-09 21:45:47.376556] I [socket.c:3520:socket_submit_reply] 0-tcp.testvol-server: not connected (priv->connected = -1)
[2017-02-09 21:45:47.376679] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x958187, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.testvol-server)
[2017-02-09 21:45:47.376686] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x95818b, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.testvol-server)

The first disconnect message was seen at 21:45:47.

Status of volume: testvol
Gluster process                                         TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick gqas013.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0  49153  0  Y  15440
Brick gqas005.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1  49153  0  Y  16253
Brick gqas006.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2  49153  0  Y  17677
Brick gqas008.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3  49152  0  Y  7226
Self-heal Daemon on localhost                            N/A  N/A  Y  4762
Self-heal Daemon on gqas008.sbu.lab.eng.bos.redhat.com   N/A  N/A  Y  8652
Self-heal Daemon on gqas005.sbu.lab.eng.bos.redhat.com   N/A  N/A  Y  2074
Self-heal Daemon on gqas013.sbu.lab.eng.bos.redhat.com   N/A  N/A  Y  23898

Task Status of Volume testvol
------------------------------------------------------------------------------
There are no active volume tasks

This brick process maps to testvol-client-2 in the volume graph.

Looking at the client (gfapi.log) around the same time on the same machine:

[2017-02-09 14:54:51.019183] I [MSGID: 109066] [dht-rename.c:1569:dht_rename] 0-testvol-dht: renaming /run4855/Bonnie.4877 (hash=testvol-replicate-1/cache=testvol-replicate-1) => /run4855/.nfs93111cd39bceb47200000001 (hash=testvol-replicate-1/cache=<nul>)
[2017-02-09 21:12:21.576416] E [rpc-clnt.c:200:call_bail] 0-testvol-client-2: bailing out frame type(GlusterFS 3.3) op(FXATTROP(34)) xid = 0x959b3c sent = 2017-02-09 20:42:19.036664. timeout = 1800 for 192.168.97.142:49153
[2017-02-09 21:12:21.576630] W [MSGID: 114031] [client-rpc-fops.c:1830:client3_3_fxattrop_cbk] 0-testvol-client-2: remote operation failed
[2017-02-09 21:12:21.588295] E [rpc-clnt.c:200:call_bail] 0-testvol-client-2: bailing out frame type(GlusterFS 3.3) op(FXATTROP(34)) xid = 0x959b3b sent = 2017-02-09 20:42:19.036577. timeout = 1800 for 192.168.97.142:49153

Starting from 21:12:21, there are many calls which got bailed out. Does that mean the server was very slow in responding to those requests? Post that:

[2017-02-09 21:45:47.374884] C [rpc-clnt.c:465:rpc_clnt_fill_request_info] 0-testvol-client-2: cannot lookup the saved frame corresponding to xid (9798026)
[2017-02-09 21:45:47.374991] W [socket.c:1923:__socket_read_reply] 0-testvol-client-2: notify for event MAP_XID failed for 192.168.97.142:49153
[2017-02-09 21:45:47.375036] E [socket.c:2568:socket_poller] 0-testvol-client-2: socket_poller 192.168.97.142:49153 failed (Success)

There is a failure in looking up a saved frame with xid (9798026), followed by an error in socket_poller(). Maybe this led to the transport disconnect, leading to the unwinding of saved frames that follows immediately in the log file:

[2017-02-09 21:45:47.407880] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f1af2e3f672] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f1af2c0576e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f1af2c0587e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f1af2c06fe4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f1af2c07874] ))))) 0-testvol-client-2: forced unwinding frame type(GlusterFS 3.3) op(FINODELK(30)) called at 2017-02-09 20:41:28.651021 (xid=0x950977)

I am not an expert in RPC, but at least it looks like there is no disconnect generated in the upper layer (gfapi). The questions which arise are why the server is slow in responding, leading to many calls bailing out on the client, and why there was an error in socket_poller(). CCing Du as well, with a request to take a look.

This looks like a problem more from the rpc side. Adjusting the component.

Atin,
There has been an attempt to reproduce the bug after the needinfo was raised on me. The issue is no longer reproducible. However, Ambarish is keeping an eye open for this issue.

(In reply to Milind Changire from comment #23)
> Atin,
> There has been an attempt to reproduce the bug after the needinfo was raised on me.
> The issue is no longer reproducible.
> However, Ambarish is keeping an eye open for this issue.

If this is not reproducible, why can't we close this bug? Ambarish can reopen it if the issue persists.

Moving to 3.4.0-beyond due to comment #26.

There are two things that have caused disconnections as per the logs:

* call-bail - the client didn't receive a response to a fop within 30 minutes (the "timeout = 1800" in the bail-out messages above).
* A saved frame corresponding to the xid in a reply couldn't be found (see the sketch after this comment).

Both of these can happen if the xid in the response is corrupted. bz 1409135 speaks about corruption in rpc messages when upcalls are used. I am wondering whether this bz is a duplicate of bz 1409135 (which is fixed in a version of glusterfs higher than the version this bug is filed against). Also, note that Milind says in comment #23 that the issue is not reproducible. Note that these logs go away once the disconnections are fixed. Since we suspect this bug is fixed, can we retest it? @Ambarish, what are your thoughts on this?
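As a companion to the two bullets above, the block below is a minimal C model of the xid matching that the "cannot lookup the saved frame corresponding to xid" message points at: each reply must map back to a saved request frame, and an unmatched (possibly corrupted) xid fails the MAP_XID notification, after which the transport is torn down. This is a sketch under simplified assumptions (fixed-size table, no locking), not the actual rpc-clnt source.

```c
/* Illustrative sketch only -- NOT the actual GlusterFS rpc-clnt source.
 * Models the xid -> saved-frame matching whose failure precedes the
 * disconnect seen in the gfapi log. */
#include <stdio.h>

#define NFRAMES 3

struct saved_frame {
    unsigned xid;
    int      in_use;
};

/* Requests sent and still awaiting replies (xids taken from the logs). */
static struct saved_frame frames[NFRAMES] = {
    { 0x959b3b, 1 }, { 0x959b3c, 1 }, { 0x950977, 1 },
};

/* Find the saved request that a reply with this xid belongs to. */
static struct saved_frame *lookup_frame(unsigned xid)
{
    for (int i = 0; i < NFRAMES; i++)
        if (frames[i].in_use && frames[i].xid == xid)
            return &frames[i];
    return NULL;
}

static int handle_reply(unsigned xid)
{
    struct saved_frame *f = lookup_frame(xid);
    if (!f) {
        /* Mirrors "cannot lookup the saved frame corresponding to
         * xid (...)"; the MAP_XID notify then fails and the client
         * has little choice but to drop the connection. */
        printf("C [rpc-clnt] cannot lookup the saved frame "
               "corresponding to xid (%u)\n", xid);
        return -1;
    }
    f->in_use = 0; /* reply matched and consumed */
    return 0;
}

int main(void)
{
    handle_reply(0x959b3b); /* healthy reply: frame found and consumed */
    handle_reply(9798026);  /* unknown/corrupted xid: lookup fails */
    return 0;
}
```

A corrupted xid on the wire (as suspected via bz 1409135) would make the lookup miss even though the request is still saved, which is consistent with both symptoms: frames that bail out because their replies are never matched, and the eventual forced disconnect.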
I see the bugs referenced in comment 30 are already addressed and fixed in the current releases. Since we haven't heard about this problem in the latest releases, I'm closing this bug. Please feel free to reopen if the issue persists.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.