Bug 1409135
| Field | Value |
|---|---|
| Summary | [Replicate] "RPC call decoding failed" leading to IO hang & mount inaccessible |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Reporter | Karan Sandha <ksandha> |
| Component | rpc |
| Assignee | Poornima G <pgurusid> |
| Status | CLOSED ERRATA |
| QA Contact | Karan Sandha <ksandha> |
| Severity | high |
| Priority | unspecified |
| Version | rhgs-3.2 |
| CC | amukherj, ksandha, nchilaka, rcyriac, rhs-bugs, rjoseph, skoduri |
| Target Milestone | --- |
| Target Release | RHGS 3.2.0 |
| Hardware | All |
| OS | Linux |
| Fixed In Version | glusterfs-3.8.4-15 |
| Doc Type | If docs needed, set a value |
| Story Points | --- |
|  | 1421937 (view as bug list) |
| Last Closed | 2017-03-23 06:00:59 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
| Bug Depends On | 1421937, 1422363, 1422787, 1422788, 1425740, 1427012, 1428670 |
| Bug Blocks | 1351528 |
Description (Karan Sandha, 2016-12-29 18:42:46 UTC)
I am seeing a lot of these errors in my systemic setup during regression testing of 3.2.

On my systemic setup, I am creating the same directory path simultaneously from 3 different clients. Each client used a different server IP to mount the volume over the FUSE protocol. In addition, each client was dumping sosreports into the volume mount every 5 minutes in a screen session, with top output being appended to a file every minute. The directory creations were done by different users, e.g.:

- client1 (EL 7.2) ran the dir-creation as pavan@rhs-client23
- client2 (EL 6.7) as root@rhs-client24
- client3 (EL 7.3) as cli21@rhs-client21

Note: these logs are with respect to client1, i.e. rhs-client21. Also note that I am still able to access the mount.

sosreports available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/3.2_logs/systemic_testing_logs/regression_cycle/same_dir_create_clients/rhs-client21.lab.eng.blr.redhat.com/

Test execution details available at https://docs.google.com/spreadsheets/d/1iP5Mi1TewBFVh8HTmlcBm9072Bgsbgkr3CLcGmawDys/edit#gid=632186609

Version-Release number of selected component (if applicable): 3.8.4-10

Other BZs for reference (raised for issues on the same setup):

- 1409472 - brick crashed on systemic setup
- 1397907 - seeing frequent kernel hangs when doing operations both on fuse client and gluster nodes on replica volumes [NEEDINFO]
- 1409568 - seeing socket disconnects and transport endpoint not connected frequently on systemic setup
- 1409572 - in fuse mount logs: seeing input/output error with split-brain observed logs and failing GETXATTR and STAT
- 1409580 - seeing stale file handle errors in fuse mount logs in systemic testing
- 1409583 - seeing RPC status error messages and timeouts due to RPC (rpc-clnt.c:200:call_bail)
- 1409584 - seeing remote operation failed [Invalid argument] wrt inode locking in my systemic setup

The following errors were seen:

```
[2016-12-31 22:30:27.780829] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1f2)[0x7f69dfd68882] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7f69dfb30717] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69dfb3082e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x99)[0x7f69dfb308e9] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1fa)[0x7f69dfb30b3a] ))))) 0-sysvol-client-6: forced unwinding frame type(GlusterFS 3.3) op(OPEN(11)) called at 2016-12-31 22:29:01.224843 (xid=0x42ad609)
[2016-12-31 22:30:27.780958] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1f2)[0x7f69dfd68882] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7f69dfb30717] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69dfb3082e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x99)[0x7f69dfb308e9] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1fa)[0x7f69dfb30b3a] ))))) 0-sysvol-client-6: forced unwinding frame type(GlusterFS 3.3) op(OPEN(11)) called at 2016-12-31 22:29:01.224850 (xid=0x42ad60a)
[2016-12-31 22:30:27.781091] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1f2)[0x7f69dfd68882] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7f69dfb30717] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69dfb3082e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x99)[0x7f69dfb308e9] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1fa)[0x7f69dfb30b3a] ))))) 0-sysvol-client-6: forced unwinding frame type(GlusterFS 3.3) op(OPEN(11)) called at 2016-12-31 22:29:01.224856 (xid=0x42ad60b)
[2016-12-31 22:30:27.781222] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1f2)[0x7f69dfd68882] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7f69dfb30717] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69dfb3082e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x99)[0x7f69dfb308e9] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1fa)[0x7f69dfb30b3a] ))))) 0-sysvol-client-6: forced unwinding frame type(GlusterFS 3.3) op(OPEN(11)) called at 2016-12-31 22:29:01.224862 (xid=0x42ad60c)
```

(In reply to nchilaka from comment #3)

One small correction: I am seeing the issue on rhs-client24, so refer to http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/3.2_logs/systemic_testing_logs/regression_cycle/same_dir_create_clients/rhs-client24.lab.eng.blr.redhat.com/

The error is not generated by the upcall request but by a fop cbk; we (Rtalur, Rajesh) are yet to identify which fop it is and which field of the received XDR was corrupt.

I am unable to reproduce the original "failed to decode call msg" error; the only difference is that this is the latest downstream build in debug mode. I shall try once more with the latest Gluster RPMs. But I do see disconnect errors consistently:

```
[2017-01-25 11:46:46.592556] W [socket.c:590:__socket_rwv] 0-socket.nfs-server: writev on 192.168.97.120:1011 failed (Connection reset by peer)
[2017-01-25 11:46:48.588515] I [socket.c:2403:socket_event_handler] 0-transport: disconnecting now
```

These disconnects are between the NFS clients and the Gluster NFS server, as indicated by the IP address.

With the latest downstream RPMs, I am not able to reproduce the "failed to decode call msg" error, but I still get the following errors:

```
[2017-01-27 05:37:17.284133] I [socket.c:2403:socket_event_handler] 0-transport: disconnecting now
[2017-01-27 05:37:17.289722] I [socket.c:3439:socket_submit_request] 0-testvol-client-0: not connected (priv->connected = 0)
[2017-01-25 11:46:46.592556] W [socket.c:590:__socket_rwv] 0-socket.nfs-server: writev on 192.168.97.120:1011 failed (Connection reset by peer)
[2017-01-25 11:46:48.588515] I [socket.c:2403:socket_event_handler] 0-transport: disconnecting now
```

As per my testing, this does not seem to be an upcall issue, as the disconnects appear even when no upcall is in progress. It could be an RPC problem, or, under heavy load, throttling in NFS (client-server) could send ECONNRESET; not sure. Need more time for RCA.

Similar issues/logs to those mentioned in this bug description are observed in bug 1417062. The setup is in the same state now. Please take a look if needed.

Fix posted: https://review.gluster.org/#/c/16613/1

Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/97927/

Karan, as discussed, can you please provide the pstack output of the mount process as well as of the brick processes?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html
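When triaging logs like the ones above, it helps to know how long each force-unwound frame had been pending (here roughly 86 seconds between the OPEN calls and the unwind). A small helper, not part of this bug report, that extracts the xid, fop, and pending time from `saved_frames_unwind` lines in the format shown above (the regex is an assumption based only on these sample lines):

```python
import re
from datetime import datetime

# Matches the saved_frames_unwind error lines quoted in this bug, e.g.:
#   [...] E [rpc-clnt.c:365:saved_frames_unwind] (--> ...) 0-sysvol-client-6:
#   forced unwinding frame type(GlusterFS 3.3) op(OPEN(11))
#   called at 2016-12-31 22:29:01.224843 (xid=0x42ad609)
UNWIND_RE = re.compile(
    r"^\[(?P<logged>[\d-]+ [\d:.]+)\] E \[rpc-clnt\.c:\d+:saved_frames_unwind\].*"
    r"(?P<client>[\w-]+): forced unwinding frame type\((?P<prog>[^)]*)\) "
    r"op ?\((?P<op>\w+)\(\d+\)\) called at (?P<called>[\d-]+ [\d:.]+) "
    r"\(xid=(?P<xid>0x[0-9a-f]+)\)"
)

TS = "%Y-%m-%d %H:%M:%S.%f"  # timestamp format used in the log lines

def pending_frames(lines):
    """Return (xid, op, seconds pending) for each force-unwound frame."""
    out = []
    for line in lines:
        m = UNWIND_RE.search(line)
        if not m:
            continue
        waited = (datetime.strptime(m["logged"], TS)
                  - datetime.strptime(m["called"], TS)).total_seconds()
        out.append((m["xid"], m["op"], waited))
    return out
```

Run against a mount log, this quickly shows whether the unwound fops cluster around one xid range or one fop type, which narrows down where the hang started.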
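For readers unfamiliar with the "forced unwinding" messages: an RPC client keeps every outstanding request in a saved-frames table keyed by xid so replies can be matched back to callers; when the transport disconnects, connection cleanup destroys that table and invokes each pending frame's callback with an error instead of a reply, which is exactly what the `rpc_clnt_connection_cleanup` → `saved_frames_unwind` backtraces above show. A minimal sketch of that bookkeeping (illustrative names only, not the GlusterFS implementation):

```python
import errno
from typing import Callable, Dict, Tuple

class SavedFrames:
    """Toy model of an RPC client's table of in-flight requests."""

    def __init__(self) -> None:
        self._next_xid = 1
        self._pending: Dict[int, Tuple[str, Callable]] = {}

    def submit(self, op: str, cbk: Callable) -> int:
        """Record an outgoing call; its reply is matched back by xid."""
        xid = self._next_xid
        self._next_xid += 1
        self._pending[xid] = (op, cbk)
        return xid

    def reply(self, xid: int, result) -> None:
        """Normal path: a reply arrives and completes one saved frame."""
        op, cbk = self._pending.pop(xid)
        cbk(op, 0, result)

    def disconnect(self) -> None:
        """Connection cleanup: force-unwind every still-pending frame with
        ENOTCONN, mirroring the saved_frames_unwind log lines above."""
        pending, self._pending = self._pending, {}
        for xid, (op, cbk) in sorted(pending.items()):
            cbk(op, errno.ENOTCONN, None)
```

This is why a burst of consecutive xids (0x42ad609 through 0x42ad60c above) appears in a single disconnect: every request still waiting on the dropped connection is errored out at once, and the application sees them fail together.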