Bug 1409135 - [Replicate] "RPC call decoding failed" leading to IO hang & mount inaccessible
Summary: [Replicate] "RPC call decoding failed" leading to IO hang & mount inaccessible
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rpc
Version: rhgs-3.2
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Poornima G
QA Contact: Karan Sandha
URL:
Whiteboard:
Depends On: 1421937 1422363 1422787 1422788 1425740 1427012 1428670
Blocks: 1351528
 
Reported: 2016-12-29 18:42 UTC by Karan Sandha
Modified: 2017-03-23 06:00 UTC
CC: 7 users

Fixed In Version: glusterfs-3.8.4-15
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1421937 (view as bug list)
Environment:
Last Closed: 2017-03-23 06:00:59 UTC


Attachments


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC
Red Hat Bugzilla 1417062 None CLOSED [Stress] : Input/Output Error on EC over gNFS while creating LOTS of smallfiles 2019-02-08 12:42:55 UTC

Internal Links: 1417062

Description Karan Sandha 2016-12-29 18:42:46 UTC
Description of problem:
RPC failed to decode the message on the NFS mount, leading to an IO hang and the mount becoming inaccessible. The volume was mounted over NFS and a sequential write was started with the O_DIRECT flag.

Version-Release number of selected component (if applicable):
3.8.4-8
Logs are placed at 
rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/sosreports/<bug>

How reproducible:
Tried Once

Steps to Reproduce:
1. 4 servers and 4 clients; mount 1:1 with gNFS.
2. Daemonize FIO on the 4 clients and start a sequential write with the O_DIRECT flag.
3. Shortly after the IOs started, the tool hung.
4. Multiple errors and warnings appeared in nfs.log.
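The workload in steps 1-2 can be sketched as an FIO job file. This is an illustrative reconstruction only, not the job actually used in the test: the mount path, block size, and file size below are assumptions.

```ini
; Hypothetical FIO job approximating step 2: sequential writes with
; O_DIRECT against the gNFS mount (path and sizes are assumptions).
[seq-write-odirect]
directory=/mnt/gnfs   ; gNFS mount point (assumed)
rw=write              ; sequential write
direct=1              ; O_DIRECT, the "0 direct flag" in the steps
bs=1m
size=10g
numjobs=4
```

Run daemonized on each client with something like `fio --daemonize=/var/run/fio.pid seq-write-odirect.fio` (flag per fio's documented options).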

<SNIP>
[2016-12-29 10:27:29.424871] W [xdr-rpc.c:55:xdr_to_rpc_call] 0-rpc: failed to decode call msg
[2016-12-29 10:27:29.425032] W [rpc-clnt.c:717:rpc_clnt_handle_cbk] 0-testvol-client-0: RPC call decoding failed
[2016-12-29 10:27:29.443275] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f9ccdb40682] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f9ccd90675e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f9ccd90686e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f9ccd907fd4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f9ccd908864] ))))) 0-testvol-client-0: forced unwinding frame type(GlusterFS 3.3) op(FINODELK(30)) called at 2016-12-29 10:26:55.289465 (xid=0xa8ddd)
[2016-12-29 10:27:29.443308] E [MSGID: 114031] [client-rpc-fops.c:1601:client3_3_finodelk_cbk] 0-testvol-client-0: remote operation failed [Transport endpoint is not connected]
[2016-12-29 10:27:29.443571] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f9ccdb40682] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f9ccd90675e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f9ccd90686e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f9ccd907fd4] (-->

</SNIP>
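For context, the "failed to decode call msg" warning above is logged by xdr_to_rpc_call() when the ONC RPC call header (RFC 5531) cannot be unmarshalled. A minimal Python sketch of that header decode follows; this is illustrative only, not Gluster's actual C decoder. The xid (0xa8ddd) and op number 30 (FINODELK) are taken from the log snippet above, while the program and version numbers are made-up placeholders.

```python
import struct

RPC_CALL = 0      # msg_type CALL, per RFC 5531
RPC_VERSION = 2   # the only ONC RPC version in use

def decode_rpc_call_header(data: bytes) -> dict:
    """Decode the fixed 24-byte prefix of an ONC RPC call message.

    Raises ValueError when the header cannot be decoded -- the rough
    analogue of the "failed to decode call msg" warning that Gluster's
    xdr_to_rpc_call() logs above.
    """
    if len(data) < 24:
        raise ValueError("failed to decode call msg: truncated header")
    xid, mtype, rpcvers, prog, vers, proc = struct.unpack(">6I", data[:24])
    if mtype != RPC_CALL:
        raise ValueError("failed to decode call msg: not a CALL message")
    if rpcvers != RPC_VERSION:
        raise ValueError("failed to decode call msg: bad RPC version")
    return {"xid": xid, "prog": prog, "vers": vers, "proc": proc}

# Illustrative header: xid 0xa8ddd and op 30 (FINODELK) come from the log
# above; the program/version numbers here are placeholders.
raw = struct.pack(">6I", 0xA8DDD, RPC_CALL, RPC_VERSION, 1298437, 330, 30)
hdr = decode_rpc_call_header(raw)
```

Any short, garbage, or non-CALL byte stream raises the decode error, which is why on-the-wire corruption or stream desynchronization shows up as this warning followed by a forced disconnect.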

Actual results:
The IO tool hung and multiple errors and warnings appeared in the logs.

Expected results:
No IO hang should be observed.

Additional info:

[root@gqas004 ~]# gluster v info
 
Volume Name: testvol
Type: Distributed-Replicate
Volume ID: e2212c28-f04a-4f08-9f17-b0fb74434bbf
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gqas003.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick0
Brick2: gqas004.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick1
Brick3: gqas007.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick2
Brick4: gqas012.sbu.lab.eng.bos.redhat.com:/bricks/testvol_brick3
Options Reconfigured:
cluster.use-compound-fops: off
network.remote-dio: off
performance.strict-o-direct: on
network.inode-lru-limit: 90000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
server.allow-insecure: on
performance.stat-prefetch: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: off

Comment 3 nchilaka 2017-01-03 06:55:40 UTC
I am seeing a lot of these errors in my systemic setup for regression testing of 3.2.

On my systemic setup, I am doing same-path directory creation simultaneously from 3 different clients.
Each client used a different server IP to mount the volume using the FUSE protocol.
Also, each client was dumping sosreports every 5 minutes into the volume mount in a screen session, along with top output being appended to a file every minute.
The dir-creations were happening from different users, e.g.:
client1 (el 7.2) was running the dir-creation as pavan@rhs-client23
client2 (el 6.7) as root@rhs-client24
client3 (el 7.3) as cli21@rhs-client21


Note: these logs are wrt client1, i.e. rhs-client21.
Also note, however, that I am able to access the mount.

sosreports available at 
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/3.2_logs/systemic_testing_logs/regression_cycle/same_dir_create_clients/rhs-client21.lab.eng.blr.redhat.com/

test execution details available at https://docs.google.com/spreadsheets/d/1iP5Mi1TewBFVh8HTmlcBm9072Bgsbgkr3CLcGmawDys/edit#gid=632186609

Version-Release number of selected component (if applicable):
============
3.8.4-10

Other BZs for reference (raised wrt issues on the same setup):
1409472 - brick crashed on systemic setup
1397907 - seeing frequent kernel hangs when doing operations both on fuse client and gluster nodes on replica volumes (edit) [NEEDINFO]
1409568 - seeing socket disconnects and transport endpoint not connected frequently on systemic setup 
1409572 - In fuse mount logs:seeing input/output error with split-brain observed logs and failing GETXATTR and STAT
1409580 - seeing stale file handle errors in fuse mount logs in systemic testing
1409583 - seeing RPC status error messages and timeouts due to RPC (rpc-clnt.c:200:call_bail)
1409584 - seeing remote operation failed [Invalid argument] wrt inode locking in my systemic setup

Comment 4 nchilaka 2017-01-03 06:56:57 UTC
The following errors were seen:
[2016-12-31 22:30:27.780829] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1f2)[0x7f69dfd68882] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7f69dfb30717] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69dfb3082e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x99)[0x7f69dfb308e9] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1fa)[0x7f69dfb30b3a] ))))) 0-sysvol-client-6: forced unwinding frame type(GlusterFS 3.3) op(OPEN(11)) called at 2016-12-31 22:29:01.224843 (xid=0x42ad609)
[2016-12-31 22:30:27.780958] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1f2)[0x7f69dfd68882] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7f69dfb30717] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69dfb3082e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x99)[0x7f69dfb308e9] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1fa)[0x7f69dfb30b3a] ))))) 0-sysvol-client-6: forced unwinding frame type(GlusterFS 3.3) op(OPEN(11)) called at 2016-12-31 22:29:01.224850 (xid=0x42ad60a)
[2016-12-31 22:30:27.781091] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1f2)[0x7f69dfd68882] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7f69dfb30717] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69dfb3082e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x99)[0x7f69dfb308e9] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1fa)[0x7f69dfb30b3a] ))))) 0-sysvol-client-6: forced unwinding frame type(GlusterFS 3.3) op(OPEN(11)) called at 2016-12-31 22:29:01.224856 (xid=0x42ad60b)
[2016-12-31 22:30:27.781222] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1f2)[0x7f69dfd68882] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7f69dfb30717] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f69dfb3082e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x99)[0x7f69dfb308e9] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x1fa)[0x7f69dfb30b3a] ))))) 0-sysvol-client-6: forced unwinding frame type(GlusterFS 3.3) op(OPEN(11)) called at 2016-12-31 22:29:01.224862 (xid=0x42ad60c)

Comment 5 nchilaka 2017-01-03 06:59:08 UTC
(In reply to nchilaka from comment #3)
> sosreports available at 
> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/3.2_logs/systemic_testing_logs/regression_cycle/same_dir_create_clients/rhs-client21.lab.eng.blr.redhat.com/

One small correction: I am seeing the issue on rhs-client24, so refer to 
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/3.2_logs/systemic_testing_logs/regression_cycle/same_dir_create_clients/rhs-client24.lab.eng.blr.redhat.com/

Comment 11 Poornima G 2017-01-12 05:18:53 UTC
The error is not generated by the upcall request but from a fop cbk. We (Rtalur, Rajesh) are yet to identify which fop it is and which field of the received XDR was corrupt.

Comment 12 Poornima G 2017-01-25 12:28:28 UTC
I am unable to reproduce the original "failed to decode call msg" error; the only difference is that this is the latest downstream build in debug mode. I shall try once more with the latest Gluster RPMs.

But I do see disconnect errors consistently:
[2017-01-25 11:46:46.592556] W [socket.c:590:__socket_rwv] 0-socket.nfs-server: writev on 192.168.97.120:1011 failed (Connection reset by peer)
[2017-01-25 11:46:48.588515] I [socket.c:2403:socket_event_handler] 0-transport: disconnecting now

These disconnects are between the NFS clients and the Gluster NFS server, as indicated by the IP address.

Comment 13 Poornima G 2017-01-27 06:43:45 UTC
With the latest downstream RPMs, I am not able to reproduce the "failed to decode call msg" error.

But I still get the following errors:
[2017-01-27 05:37:17.284133] I [socket.c:2403:socket_event_handler] 0-transport: disconnecting now
[2017-01-27 05:37:17.289722] I [socket.c:3439:socket_submit_request] 0-testvol-client-0: not connected (priv->connected = 0)

[2017-01-25 11:46:46.592556] W [socket.c:590:__socket_rwv] 0-socket.nfs-server: writev on 192.168.97.120:1011 failed (Connection reset by peer)
[2017-01-25 11:46:48.588515] I [socket.c:2403:socket_event_handler] 0-transport: disconnecting now

And as per my testing this doesn't seem to be an upcall issue, as the disconnects appear even when no upcall is in progress. It could be an RPC problem, or, under heavy load, throttling in NFS (client-server) could send ECONNRESET; not sure. Need more time for RCA.
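The "Connection reset by peer" writev failures in the logs mean the peer aborted the connection with a TCP RST, which the writer sees as ECONNRESET. A minimal, self-contained sketch of that mechanism (plain Python sockets on loopback, not Gluster code): an abortive close on one end surfaces as ECONNRESET on the other.

```python
import errno
import socket
import struct

def abortive_close(sock: socket.socket) -> None:
    # SO_LINGER with a zero timeout makes close() send a TCP RST
    # instead of the normal FIN, aborting the connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))
    sock.close()

# Loopback TCP connection standing in for nfs-client <-> gNFS server.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.create_connection(srv.getsockname())
conn, _ = srv.accept()

abortive_close(conn)     # the "peer" resets the connection

cli.settimeout(5)
err = None
try:
    cli.send(b"x")       # may already fail if the RST has arrived
    cli.recv(1)          # otherwise the read surfaces the reset
except OSError as e:
    err = e.errno
finally:
    cli.close()
    srv.close()

print(errno.errorcode.get(err))
```

Whether the RST comes from throttling, a crash, or a middlebox cannot be distinguished from this error alone, which is why the RCA needs the packet-level view.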

Comment 14 Soumya Koduri 2017-01-27 10:37:05 UTC
Similar issues/logs to those mentioned in this bug's description are observed in bug 1417062. The setup is in the same state now. Please take a look if needed.

Comment 15 Poornima G 2017-02-14 07:22:27 UTC
Fix posted:  https://review.gluster.org/#/c/16613/1

Comment 16 Poornima G 2017-02-16 09:14:59 UTC
Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/97927/

Comment 19 rjoseph 2017-02-21 15:29:35 UTC
Karan, as discussed, can you please provide the pstack output of the mount process as well as of the brick processes?

Comment 24 errata-xmlrpc 2017-03-23 06:00:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

