Description of problem:
RPC failed to decode the msg on an NFS mount, leading to an IO hang. The volume was mounted over NFS and a sequential write was started with the o-direct flag.

Version-Release number of selected component (if applicable):
3.8.4-8

Logs are placed at rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/sosreports/<bug>

How reproducible:
Tried once

Steps to Reproduce:
1. 4 servers and 4 clients, mounted 1:1 with gNFS.
2. Daemonize FIO on the 4 clients and start a sequential write with the o-direct flag.
3. After the inception of the IOs, the tool hung.
4. Multiple errors and warnings appeared in nfs.log.

<SNIP>
[2016-12-29 10:27:29.424871] W [xdr-rpc.c:55:xdr_to_rpc_call] 0-rpc: failed to decode call msg
[2016-12-29 10:27:29.425032] W [rpc-clnt.c:717:rpc_clnt_handle_cbk] 0-testvol-client-0: RPC call decoding failed
[2016-12-29 10:27:29.443275] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f9ccdb40682] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f9ccd90675e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f9ccd90686e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f9ccd907fd4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f9ccd908864] ))))) 0-testvol-client-0: forced unwinding frame type(GlusterFS 3.3) op(FINODELK(30)) called at 2016-12-29 10:26:55.289465 (xid=0xa8ddd)
[2016-12-29 10:27:29.443308] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-testvol-client-0: remote operation failed [Transport endpoint is not connected]
[2016-12-29 10:27:29.443571] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f9ccdb40682] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f9ccd90675e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f9ccd90686e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f9ccd907fd4] (-->
</SNIP>

Actual results:
The IO tool hung, and multiple errors and warnings appeared in the logs.

Expected results:
No IO hang should be observed.

Additional info:
[root@gqas004 ~]# gluster v info

Volume Name: testvol
Type: Distributed-Replicate
Volume ID: e2212c28-f04a-4f08-9f17-b0fb74434bbf
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas003.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas004.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas007.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas012.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
cluster.use-compound-fops: off
network.remote-dio: off
performance.strict-o-direct: on
network.inode-lru-limit: 90000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: off
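For reference, a rough sketch of the workload in step 2, assuming FIO is driven from the command line on each client against its gNFS mount; the mount path, file size, block size and pidfile below are illustrative assumptions, not the exact job parameters used in the run:

# one server per client (1:1), Gluster NFS is NFSv3 only
mount -t nfs -o vers=3 gqas003.sbu.lab.eng.bos.redhat.com:/testvol /mnt/testvol
# daemonized sequential write with O_DIRECT (parameters are assumptions)
fio --name=seqwrite --directory=/mnt/testvol --rw=write --direct=1 \
    --bs=1M --size=10g --numjobs=4 --daemonize=/var/run/fio-seqwrite.pid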
I am seeing a lot of these errors in my systemic setup for regression testing of 3.2.

On my systemic setup, I am creating the same directory path simultaneously from 3 different clients. Each client used a different server IP to mount the volume using the FUSE protocol. Each client was also dumping sosreports into the volume mount every 5 minutes in a screen session, along with top output being appended to a file every minute.
The dir-creations were happening from different users, e.g.:
client1 (el 7.2) was running the dir-creation as pavan@rhs-client23
client2 (el 6.7) as root@rhs-client24
client3 (el 7.3) as cli21@rhs-client21

Note: these logs are wrt client1, i.e. rhs-client21.
Note, however, that I am still able to access the mount.

sosreports available at
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/3.2_logs/systemic_testing_logs/regression_cycle/same_dir_create_clients/rhs-client21.lab.eng.blr.redhat.com/

test execution details available at
https://docs.google.com/spreadsheets/d/1iP5Mi1TewBFVh8HTmlcBm9072Bgsbgkr3CLcGmawDys/edit#gid=632186609

Version-Release number of selected component (if applicable):
============
3.8.4-10

Other BZs for reference (raised wrt issues on the same setup):
1409472 - brick crashed on systemic setup
1397907 - seeing frequent kernel hangs when doing operations both on fuse client and gluster nodes on replica volumes (edit) [NEEDINFO]
1409568 - seeing socket disconnects and transport endpoint not connected frequently on systemic setup
1409572 - In fuse mount logs: seeing input/output error with split-brain observed logs and failing GETXATTR and STAT
1409580 - seeing stale file handle errors in fuse mount logs in systemic testing
1409583 - seeing RPC status error messages and timeouts due to RPC (rpc-clnt.c:200:call_bail)
1409584 - seeing remote operation failed [Invalid argument] wrt inode locking in my systemic setup
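For illustration only, a hedged sketch of the same-path directory-creation workload described above, run concurrently from each of the 3 FUSE clients; the mount path, directory names and iteration count are assumptions, not the exact test parameters:

# mount the volume over FUSE, using a different server IP on each client
mount -t glusterfs <server-ip>:/sysvol /mnt/sysvol
# create the same directory tree from all clients at the same time
for i in $(seq 1 1000); do
    mkdir -p /mnt/sysvol/same_dir/level1/level2/dir_$i
done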
Following were the errors seen:

[2016-12-31 22:30:27.780829] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1f2)[0x7f69dfd68882] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7f69dfb30717] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69dfb3082e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x99)[0x7f69dfb308e9] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1fa)[0x7f69dfb30b3a] ))))) 0-sysvol-client-6: forced unwinding frame type(GlusterFS 3.3) op(OPEN(11)) called at 2016-12-31 22:29:01.224843 (xid=0x42ad609)
[2016-12-31 22:30:27.780958] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1f2)[0x7f69dfd68882] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7f69dfb30717] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69dfb3082e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x99)[0x7f69dfb308e9] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1fa)[0x7f69dfb30b3a] ))))) 0-sysvol-client-6: forced unwinding frame type(GlusterFS 3.3) op(OPEN(11)) called at 2016-12-31 22:29:01.224850 (xid=0x42ad60a)
[2016-12-31 22:30:27.781091] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1f2)[0x7f69dfd68882] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7f69dfb30717] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69dfb3082e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x99)[0x7f69dfb308e9] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1fa)[0x7f69dfb30b3a] ))))) 0-sysvol-client-6: forced unwinding frame type(GlusterFS 3.3) op(OPEN(11)) called at 2016-12-31 22:29:01.224856 (xid=0x42ad60b)
[2016-12-31 22:30:27.781222] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1f2)[0x7f69dfd68882] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7f69dfb30717] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69dfb3082e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x99)[0x7f69dfb308e9] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1fa)[0x7f69dfb30b3a] ))))) 0-sysvol-client-6: forced unwinding frame type(GlusterFS 3.3) op(OPEN(11)) called at 2016-12-31 22:29:01.224862 (xid=0x42ad60c)
(In reply to nchilaka from comment #3)
> sosreports available at
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/3.2_logs/systemic_testing_logs/regression_cycle/same_dir_create_clients/rhs-client21.lab.eng.blr.redhat.com/

One small correction: I am seeing the issue on rhs-client24, so refer to
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/3.2_logs/systemic_testing_logs/regression_cycle/same_dir_create_clients/rhs-client24.lab.eng.blr.redhat.com/
instead. The rest of the details are as in comment #3.
The error is not generated by the upcall request but from a fop cbk. We (Rtalur, Rajesh) are yet to identify which fop it is and which field of the received XDR was corrupt.
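Not part of the analysis above, but one hedged way to narrow down which fop carries the corrupt XDR payload is to capture the RPC traffic on the affected client while the error reproduces and decode the capture offline; recent Wireshark releases include Gluster protocol dissectors. The ports below are assumptions for a typical deployment (glusterd on 24007, brick ports starting at 49152):

# capture gluster RPC traffic for offline decoding (illustrative port range)
tcpdump -i any -s 0 -w /tmp/gluster-rpc.pcap 'port 24007 or portrange 49152-49200'
# then open /tmp/gluster-rpc.pcap in Wireshark and inspect the Gluster frames
# around the timestamp of the "failed to decode call msg" log entry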
I am unable to reproduce the original "failed to decode call msg" error; the only difference is that this is the latest downstream build running in debug mode. I shall try once more with the latest Gluster RPMs. However, I do see disconnect errors consistently:

[2017-01-25 11:46:46.592556] W [socket.c:590:__socket_rwv] 0-socket.nfs-server: writev on 192.168.97.120:1011 failed (Connection reset by peer)
[2017-01-25 11:46:48.588515] I [socket.c:2403:socket_event_handler] 0-transport: disconnecting now

These disconnects are between the NFS clients and the Gluster NFS server, as indicated by the IP address.
With the latest downstream RPMs, I am not able to reproduce the "failed to decode call msg" error, but I still get the following errors:

[2017-01-27 05:37:17.284133] I [socket.c:2403:socket_event_handler] 0-transport: disconnecting now
[2017-01-27 05:37:17.289722] I [socket.c:3439:socket_submit_request] 0-testvol-client-0: not connected (priv->connected = 0)
[2017-01-25 11:46:46.592556] W [socket.c:590:__socket_rwv] 0-socket.nfs-server: writev on 192.168.97.120:1011 failed (Connection reset by peer)
[2017-01-25 11:46:48.588515] I [socket.c:2403:socket_event_handler] 0-transport: disconnecting now

As per my testing, this does not seem to be an upcall issue, as the disconnects appear even when no upcall is in progress. It could be an RPC problem, or, under heavy load, throttling between the NFS client and server could cause ECONNRESET; not sure. Need more time for RCA.
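As a hedged aid for the RCA (not something done in the run above), the TCP resets behind the "Connection reset by peer" entries could be observed on the gNFS server node while the workload runs and correlated with nfs.log timestamps. The NFS port and log path below are the usual defaults and are assumptions for this setup:

# watch for RST segments on the (assumed) gNFS port 2049
tcpdump -i any -nn 'tcp port 2049 and tcp[tcpflags] & tcp-rst != 0'
# correlate with disconnect messages in the gNFS log (assumed default path)
grep -E "Connection reset by peer|disconnecting now" /var/log/glusterfs/nfs.log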
Similar issues/logs to those mentioned in this bug description are observed in bug 1417062. The setup is still in the same state. Please take a look if needed.
Fix posted: https://review.gluster.org/#/c/16613/1
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/97927/
Karan, as discussed, can you please provide the pstack output of the mount process as well as of the brick processes?
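For convenience, a hedged sketch of how the requested output could be gathered; the output file paths are assumptions, and pstack requires the gdb package:

# on the client node: stacks of the glusterfs mount process
# (pidof may return several PIDs; pick the one for the mount in question)
pstack $(pidof glusterfs) > /tmp/client-pstack.txt

# on each server node: locate the brick processes and dump their stacks
gluster volume status testvol
for pid in $(pgrep -f glusterfsd); do
    pstack $pid > /tmp/brick-pstack.$pid.txt
done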
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html