Description of problem:
When a volume has the new cache-invalidation features enabled, the following log message is logged many times:

[MSGID: 114063] [client-callback.c:110:client_cbk_cache_invalidation] 88-ecvol-client-5: XDR decode of cache_invalidation failed. [Operation not permitted]

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-5

How reproducible:
Not seen always. Out of 12 protocol clients, 8 had this log.

Additional info:

gluster vol info

Volume Name: ecvol
Type: Distributed-Disperse
Volume ID: f90202e8-a36e-4d3d-a0e2-8fa93152c028
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.37.190:/bricks/brick0/b0
Brick2: 10.70.37.215:/bricks/brick0/b0
Brick3: 10.70.37.44:/bricks/brick0/b0
Brick4: 10.70.37.190:/bricks/brick1/b1
Brick5: 10.70.37.215:/bricks/brick1/b1
Brick6: 10.70.37.44:/bricks/brick1/b1
Brick7: 10.70.37.190:/bricks/brick2/b2
Brick8: 10.70.37.215:/bricks/brick2/b2
Brick9: 10.70.37.44:/bricks/brick2/b2
Brick10: 10.70.37.190:/bricks/brick3/b3
Brick11: 10.70.37.215:/bricks/brick3/b3
Brick12: 10.70.37.44:/bricks/brick3/b3
Options Reconfigured:
disperse.shd-wait-qlength: 1024
disperse.shd-max-threads: 8
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
features.uss: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
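For reference, the cache-invalidation stack in this configuration can be set up on a volume with the options below; a minimal sketch, with the option names and values taken from the "Options Reconfigured" list above (run against whichever volume is under test):

  # server-side upcall notifications to clients
  gluster volume set ecvol features.cache-invalidation on
  gluster volume set ecvol features.cache-invalidation-timeout 600
  # client-side md-cache consuming the invalidations
  gluster volume set ecvol performance.stat-prefetch on
  gluster volume set ecvol performance.cache-invalidation on
  gluster volume set ecvol performance.md-cache-timeout 600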
A similar issue has been observed while debugging bug 1398930:

[2016-11-27 07:23:11.110028] W [xdr-rpc.c:55:xdr_to_rpc_call] 0-rpc: failed to decode call msg
[2016-11-27 07:23:11.110197] W [rpc-clnt.c:717:rpc_clnt_handle_cbk] 0-testvol-client-2: RPC call decoding failed
[2016-11-27 07:23:11.122054] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fbfa1ada642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fbfa18a075e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fbfa18a086e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7fbfa18a1fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7fbfa18a28a0] ))))) 0-testvol-client-2: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2016-11-27 07:23:11.017856 (xid=0x183c17)
[2016-11-27 07:23:12.253789] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-testvol-client-2: Server lk version = 1
[2016-11-27 07:23:18.625835] W [MSGID: 114063] [client-callback.c:110:client_cbk_cache_invalidation] 0-testvol-client-0: XDR decode of cache_invalidation failed. [Operation not permitted]
[2016-11-27 07:24:01.786006] W [xdr-rpc.c:55:xdr_to_rpc_call] 0-rpc: failed to decode call msg
[2016-11-27 07:24:01.786082] W [rpc-clnt.c:717:rpc_clnt_handle_cbk] 0-testvol-client-0: RPC call decoding failed
[2016-11-27 07:24:01.786881] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fbfa1ada642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fbfa18a075e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fbfa18a086e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7fbfa18a1fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7fbfa18a28a0] ))))) 0-testvol-client-0: forced unwinding frame type(GlusterFS 3.3) op(ENTRYLK(31)) called at 2016-11-27 07:24:01.785965 (xid=0x1fbe70)
The message "W [MSGID: 114063] [client-callback.c:110:client_cbk_cache_invalidation] 0-testvol-client-0: XDR decode of cache_invalidation failed. [Operation not permitted]" repeated 6 times between [2016-11-27 07:23:18.625835] and [2016-11-27 07:23:55.956580]
[2016-11-27 07:24:01.786910] E [MSGID: 114031] [client-rpc-fops.c:1654:client3_3_entrylk_cbk] 0-testvol-client-0: remote operation failed [Transport endpoint is not connected]
[2016-11-27 07:24:01.786958] E [MSGID: 108007] [afr-lk-common.c:825:afr_unlock_entrylk_cbk] 0-testvol-replicate-0: /linux-4.8.9/Documentation/devicetree/bindings/c6x/clocks.txt: unlock failed on testvol-client-0 [Transport endpoint is not connected]
[2016-11-27 07:24:01.787219] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fbfa1ada642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fbfa18a075e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fbfa18a086e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7fbfa18a1fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7fbfa18a28a0] ))))) 0-testvol-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2016-11-27 07:24:01.652084 (xid=0x1fbba5)

Since these messages are sent from the server to the clients, and the failure happened on the I/O transport, it is most likely caused by upcall RPC messages.
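To gauge how widespread the decode failures are, the client logs can be scanned for the message ID; a shell sketch, assuming the default /var/log/glusterfs log directory (the mount-log file name depends on the mount point and is illustrative here):

  # count XDR-decode failures per protocol client
  grep 'MSGID: 114063' /var/log/glusterfs/mnt-testvol.log |
      grep -o 'testvol-client-[0-9]*' | sort | uniq -c

  # look for the RPC-level decode failures that precede the forced unwinds
  grep -E 'failed to decode call msg|RPC call decoding failed' \
      /var/log/glusterfs/mnt-testvol.log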
I am seeing these errors on my systemic setup of 3.2 in the regression cycle too.

On my systemic setup, I am doing same-path directory creation simultaneously from 3 different clients (a sketch of this workload follows at the end of this comment). Each client used a different server IP to mount the volume over FUSE. Each client was also dumping sosreports into the volume mount every 5 minutes in a screen session, along with top output being appended to a file every minute. The directory creations were happening as different users, e.g.:
client1 (el 7.2) was running the dir-creation as pavan@rhs-client23
client2 (el 6.7) as root@rhs-client24
client3 (el 7.3) as cli21@rhs-client21

Note: these logs are wrt client1, i.e. rhs-client24. Note, however, that I am still able to access the mount.

sosreports available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/3.2_logs/systemic_testing_logs/regression_cycle/same_dir_create_clients/rhs-client23.lab.eng.blr.redhat.com/
test execution details available at https://docs.google.com/spreadsheets/d/1iP5Mi1TewBFVh8HTmlcBm9072Bgsbgkr3CLcGmawDys/edit#gid=632186609

Version-Release number of selected component (if applicable):
============
3.8.4-10

Other BZs for reference (raised wrt issues on the same setup):
1409472 - brick crashed on systemic setup
1397907 - seeing frequent kernel hangs when doing operations both on fuse client and gluster nodes on replica volumes [NEEDINFO]
1409568 - seeing socket disconnects and transport endpoint not connected frequently on systemic setup
1409572 - In fuse mount logs: seeing input/output error with split-brain observed logs and failing GETXATTR and STAT
1409580 - seeing stale file handle errors in fuse mount logs in systemic testing
1409583 - seeing RPC status error messages and timeouts due to RPC (rpc-clnt.c:200:call_bail)
1409135 - [Replicate] "RPC call decoding failed" leading to IO hang & mount inaccessible
1409729 - On systemic setup seeing AFR errors w.r.t fileops like "afr-open.c:187:afr_openfd_fix_open_cbk" and "afr-lk-common.c:825:afr_unlock_entrylk_cbk"
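For reference, the same-path directory-creation workload described above amounts to something like the following loop, run concurrently on all three clients; a sketch only, where the mount point and directory names are assumptions:

  MNT=/mnt/testvol          # FUSE mount point (assumption)
  while :; do
      # all three clients race to create the same directory path
      mkdir -p "$MNT/same_dir/level1/level2" 2>> /tmp/mkdir-errors.log
  done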
This has been fixed. Hence closing the bug.