Bug 1404219 - seeing fuse mount errors like [socket.c:1989:__socket_read_frag] 0-rpc: wrong MSG-TYPE (22) after XDR decode of cache_invalidation failed
Summary: seeing fuse mount errors like [socket.c:1989:__socket_read_frag] 0-rpc: wrong MSG-TYPE (22) after XDR decode of cache_invalidation failed
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: core
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Vijay Bellur
QA Contact: Rahul Hinduja
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-12-13 11:57 UTC by Nag Pavan Chilakam
Modified: 2018-02-07 04:26 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-07 04:26:38 UTC
Target Upstream Version:



Description Nag Pavan Chilakam 2016-12-13 11:57:19 UTC
Description of problem:
=======================
On a 2x(4+2) EC volume with all the md-cache settings enabled, I brought down some bricks by forcefully unmounting their backing filesystems, and later brought all the bricks back up. After some iterations, while running ls -l on the fuse mount, I consistently hit the error messages below during that period (about 5 times).

One thing to note: the root of the volume "/" was shown as requiring heal (in heal info output) even though there had been no actual changes or writes to the data.
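
A minimal sketch of the heal-state check referred to above (the volume name matches the gluster v info output in comment 3; the exact invocation is my assumption, not a command quoted in the report):

# List entries pending heal on each brick of the volume;
# the root "/" appeared here even though no data had been written.
gluster volume heal disperse info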

Created about 50,000 files of 1 KB each at the root of the volume, and about 40,000 1 KB files in a directory named test under the root.


I was running "time ls -lt test/file.*" against the mount; a sketch of the overall workload is below.
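
A hedged sketch of the workload for reproducibility; the mount point, the umount flags, and the brick chosen are illustrative assumptions, not details given in this report:

# On the client: populate the volume over the fuse mount
# (mount point /mnt/disperse is assumed).
cd /mnt/disperse
for i in $(seq 1 50000); do
    dd if=/dev/zero of=file.$i bs=1K count=1 status=none
done
mkdir -p test
for i in $(seq 1 40000); do
    dd if=/dev/zero of=test/file.$i bs=1K count=1 status=none
done

# On a server: forcefully detach a brick's backing filesystem
# (the report only says "forceful umount"; lazy detach shown here),
# then restart the downed bricks.
umount -l /rhs/brick2
gluster volume start disperse force

# On the client: the listing that hit the errors.
time ls -lt test/file.*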


[2016-12-13 06:52:44.402952] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-disperse-disperse-0: Heal failed [Transport endpoint is not connected]
[2016-12-13 06:52:44.403859] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-disperse-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
The message "W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-disperse-client-5: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [No such file or directory]" repeated 4 times between [2016-12-13 06:52:44.371681] and [2016-12-13 06:52:44.403964]
[2016-12-13 06:52:44.406021] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-disperse-disperse-0: Heal failed [Transport endpoint is not connected]
[2016-12-13 06:52:44.421467] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-disperse-client-5: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [No such file or directory]
[2016-12-13 06:52:44.442641] W [MSGID: 114063] [client-callback.c:110:client_cbk_cache_invalidation] 0-disperse-client-3: XDR decode of cache_invalidation failed. [Operation not permitted]
[2016-12-13 06:52:44.442720] E [socket.c:1989:__socket_read_frag] 0-rpc: wrong MSG-TYPE (22) received from 10.70.35.135:49154
[2016-12-13 06:52:44.442948] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f4812dcc682] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f4812b9275e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f4812b9286e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f4812b93fd4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f4812b94864] ))))) 0-disperse-client-3: forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at 2016-12-13 06:52:44.422872 (xid=0x1b92db)
[2016-12-13 06:52:44.442974] E [MSGID: 114031] [client-rpc-fops.c:1550:client3_3_inodelk_cbk] 0-disperse-client-3: remote operation failed [Transport endpoint is not connected]




However, I have not been able to reproduce the issue after about 3 hours of trying.


The brick error logs on one of the nodes were as below when the fuse errors were hit:


[2016-12-13 07:34:48.825529] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x20807a, Program: GlusterFS 3.3, ProgVers: 330, Proc: 40) to rpc-transport (tcp.disperse-server)
[2016-12-13 07:34:48.826196] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x12881) [0x7f6ed95af881] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x21907) [0x7f6ed915e907] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9186) [0x7f6ed9146186] ) 0-: Reply submission failed



Version-Release number of selected component (if applicable):
=============================================================
3.8.4-8
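
For completeness, a standard way to confirm the installed build on each node (a generic rpm query, not a command from the report):

# Expect the glusterfs packages at 3.8.4-8 on both clients and servers.
rpm -qa | grep glusterfs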

Comment 2 Nag Pavan Chilakam 2016-12-13 11:58:11 UTC
I did not see any visible functional impact.

Comment 3 Nag Pavan Chilakam 2016-12-13 11:58:31 UTC
[root@dhcp35-37 ~]# gluster v status
Status of volume: disperse
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.37:/rhs/brick2/disperse      49152     0          Y       2803 
Brick 10.70.35.116:/rhs/brick2/disperse     49152     0          Y       2705 
Brick 10.70.35.239:/rhs/brick2/disperse     49154     0          Y       3388 
Brick 10.70.35.135:/rhs/brick2/disperse     49154     0          Y       3395 
Brick 10.70.35.8:/rhs/brick2/disperse       49154     0          Y       3389 
Brick 10.70.35.196:/rhs/brick2/disperse     49152     0          Y       1734 
Brick 10.70.35.37:/rhs/brick3/disperse      49153     0          Y       2808 
Brick 10.70.35.116:/rhs/brick3/disperse     49153     0          Y       2712 
Brick 10.70.35.239:/rhs/brick3/disperse     49155     0          Y       3408 
Brick 10.70.35.135:/rhs/brick3/disperse     49155     0          Y       3415 
Brick 10.70.35.8:/rhs/brick3/disperse       49155     0          Y       3409 
Brick 10.70.35.196:/rhs/brick3/disperse     49153     0          Y       1752 
Self-heal Daemon on localhost               N/A       N/A        Y       3012 
Self-heal Daemon on 10.70.35.116            N/A       N/A        Y       5403 
Self-heal Daemon on 10.70.35.196            N/A       N/A        Y       5404 
Self-heal Daemon on 10.70.35.135            N/A       N/A        Y       7562 
Self-heal Daemon on 10.70.35.8              N/A       N/A        Y       7525 
Self-heal Daemon on 10.70.35.239            N/A       N/A        Y       7557 
 
Task Status of Volume disperse
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : a3cdc6f5-2208-4bed-bd6c-f0e947d678cc
Status               : completed           
 
[root@dhcp35-37 ~]# gluster v info
 
Volume Name: disperse
Type: Distributed-Disperse
Volume ID: 66222b81-0ccc-42d6-a655-b0a2359f8513
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.35.37:/rhs/brick2/disperse
Brick2: 10.70.35.116:/rhs/brick2/disperse
Brick3: 10.70.35.239:/rhs/brick2/disperse
Brick4: 10.70.35.135:/rhs/brick2/disperse
Brick5: 10.70.35.8:/rhs/brick2/disperse
Brick6: 10.70.35.196:/rhs/brick2/disperse
Brick7: 10.70.35.37:/rhs/brick3/disperse
Brick8: 10.70.35.116:/rhs/brick3/disperse
Brick9: 10.70.35.239:/rhs/brick3/disperse
Brick10: 10.70.35.135:/rhs/brick3/disperse
Brick11: 10.70.35.8:/rhs/brick3/disperse
Brick12: 10.70.35.196:/rhs/brick3/disperse
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
performance.stat-prefetch: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 60
performance.cache-invalidation: on
performance.md-cache-timeout: 60
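
For reference, a hedged reconstruction of how this option set would be applied; the option names and values come from the output above, but the set commands themselves are my reconstruction, not commands quoted in the report:

# Server-side upcall/invalidation plus client-side md-cache settings.
gluster volume set disperse features.cache-invalidation on
gluster volume set disperse features.cache-invalidation-timeout 60
gluster volume set disperse performance.stat-prefetch on
gluster volume set disperse performance.cache-invalidation on
gluster volume set disperse performance.md-cache-timeout 60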

Comment 4 Nag Pavan Chilakam 2016-12-13 12:02:45 UTC
sosreports http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1404219/

Comment 8 Amar Tumballi 2018-02-07 04:26:38 UTC
We have noticed that the bug does not reproduce on the latest version of the product (RHGS-3.3.1+).

If the bug is still relevant and still reproducible, feel free to reopen it.

