Description of problem:
When a volume has the new cache-invalidation features enabled, the following log message is logged many times:

[MSGID: 114063] [client-callback.c:110:client_cbk_cache_invalidation] 88-ecvol-client-5: XDR decode of cache_invalidation failed. [Operation not permitted]

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-5

How reproducible:
Not seen always. Out of 12 protocol clients, 8 had this log.

Additional info:

gluster vol info

Volume Name: ecvol
Type: Distributed-Disperse
Volume ID: f90202e8-a36e-4d3d-a0e2-8fa93152c028
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.37.190:/bricks/brick0/b0
Brick2: 10.70.37.215:/bricks/brick0/b0
Brick3: 10.70.37.44:/bricks/brick0/b0
Brick4: 10.70.37.190:/bricks/brick1/b1
Brick5: 10.70.37.215:/bricks/brick1/b1
Brick6: 10.70.37.44:/bricks/brick1/b1
Brick7: 10.70.37.190:/bricks/brick2/b2
Brick8: 10.70.37.215:/bricks/brick2/b2
Brick9: 10.70.37.44:/bricks/brick2/b2
Brick10: 10.70.37.190:/bricks/brick3/b3
Brick11: 10.70.37.215:/bricks/brick3/b3
Brick12: 10.70.37.44:/bricks/brick3/b3
Options Reconfigured:
disperse.shd-wait-qlength: 1024
disperse.shd-max-threads: 8
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
features.uss: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
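For reference, the cache-invalidation stack in this configuration can be set up on a volume with the options below; a minimal sketch, with the option names and values taken from the "Options Reconfigured" list above (run against whichever volume is under test):

  # server-side upcall notifications to clients
  gluster volume set ecvol features.cache-invalidation on
  gluster volume set ecvol features.cache-invalidation-timeout 600
  # client-side md-cache consuming the invalidations
  gluster volume set ecvol performance.stat-prefetch on
  gluster volume set ecvol performance.cache-invalidation on
  gluster volume set ecvol performance.md-cache-timeout 600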
A similar issue has been observed while debugging bug 1398930:

[2016-11-27 07:23:11.110028] W [xdr-rpc.c:55:xdr_to_rpc_call] 0-rpc: failed to decode call msg
[2016-11-27 07:23:11.110197] W [rpc-clnt.c:717:rpc_clnt_handle_cbk] 0-testvol-client-2: RPC call decoding failed
[2016-11-27 07:23:11.122054] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fbfa1ada642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fbfa18a075e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fbfa18a086e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7fbfa18a1fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7fbfa18a28a0] ))))) 0-testvol-client-2: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2016-11-27 07:23:11.017856 (xid=0x183c17)
[2016-11-27 07:23:12.253789] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-testvol-client-2: Server lk version = 1
[2016-11-27 07:23:18.625835] W [MSGID: 114063] [client-callback.c:110:client_cbk_cache_invalidation] 0-testvol-client-0: XDR decode of cache_invalidation failed. [Operation not permitted]
[2016-11-27 07:24:01.786006] W [xdr-rpc.c:55:xdr_to_rpc_call] 0-rpc: failed to decode call msg
[2016-11-27 07:24:01.786082] W [rpc-clnt.c:717:rpc_clnt_handle_cbk] 0-testvol-client-0: RPC call decoding failed
[2016-11-27 07:24:01.786881] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fbfa1ada642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fbfa18a075e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fbfa18a086e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7fbfa18a1fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7fbfa18a28a0] ))))) 0-testvol-client-0: forced unwinding frame type(GlusterFS 3.3) op(ENTRYLK(31)) called at 2016-11-27 07:24:01.785965 (xid=0x1fbe70)
The message "W [MSGID: 114063] [client-callback.c:110:client_cbk_cache_invalidation] 0-testvol-client-0: XDR decode of cache_invalidation failed. [Operation not permitted]" repeated 6 times between [2016-11-27 07:23:18.625835] and [2016-11-27 07:23:55.956580]
[2016-11-27 07:24:01.786910] E [MSGID: 114031] [client-rpc-fops.c:1654:client3_3_entrylk_cbk] 0-testvol-client-0: remote operation failed [Transport endpoint is not connected]
[2016-11-27 07:24:01.786958] E [MSGID: 108007] [afr-lk-common.c:825:afr_unlock_entrylk_cbk] 0-testvol-replicate-0: /linux-4.8.9/Documentation/devicetree/bindings/c6x/clocks.txt: unlock failed on testvol-client-0 [Transport endpoint is not connected]
[2016-11-27 07:24:01.787219] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fbfa1ada642] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7fbfa18a075e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fbfa18a086e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7fbfa18a1fc4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x120)[0x7fbfa18a28a0] ))))) 0-testvol-client-0: forced unwinding frame type(GlusterFS 3.3) op(READ(12)) called at 2016-11-27 07:24:01.652084 (xid=0x1fbba5)

Since these messages are sent from the server to the clients, and the failure happened on the I/O transport, it is most likely caused by upcall RPC messages.
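To gauge how widespread the decode failures are, the client logs can be scanned for the message ID; a shell sketch, assuming the default /var/log/glusterfs log directory (the mount-log file name depends on the mount point and is illustrative here):

  # count XDR-decode failures per protocol client
  grep 'MSGID: 114063' /var/log/glusterfs/mnt-testvol.log |
      grep -o 'testvol-client-[0-9]*' | sort | uniq -c

  # look for the RPC-level decode failures that precede the forced unwinds
  grep -E 'failed to decode call msg|RPC call decoding failed' \
      /var/log/glusterfs/mnt-testvol.log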
I am seeing these errors on my systemic setup of 3.2 in the regression cycle too.

On my systemic setup, I am doing same-path directory creation simultaneously from 3 different clients (a sketch of this workload follows at the end of this comment). Each client used a different server IP to mount the volume over FUSE. Each client was also dumping sosreports into the volume mount every 5 minutes in a screen session, along with top output being appended to a file every minute. The directory creations were happening as different users, e.g.:
client1 (el 7.2) was running the dir-creation as pavan@rhs-client23
client2 (el 6.7) as root@rhs-client24
client3 (el 7.3) as cli21@rhs-client21

Note: these logs are wrt client1, i.e. rhs-client24. Note, however, that I am still able to access the mount.

sosreports available at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/3.2_logs/systemic_testing_logs/regression_cycle/same_dir_create_clients/rhs-client23.lab.eng.blr.redhat.com/
test execution details available at https://docs.google.com/spreadsheets/d/1iP5Mi1TewBFVh8HTmlcBm9072Bgsbgkr3CLcGmawDys/edit#gid=632186609

Version-Release number of selected component (if applicable):
============
3.8.4-10

Other BZs for reference (raised wrt issues on the same setup):
1409472 - brick crashed on systemic setup
1397907 - seeing frequent kernel hangs when doing operations both on fuse client and gluster nodes on replica volumes [NEEDINFO]
1409568 - seeing socket disconnects and transport endpoint not connected frequently on systemic setup
1409572 - In fuse mount logs: seeing input/output error with split-brain observed logs and failing GETXATTR and STAT
1409580 - seeing stale file handle errors in fuse mount logs in systemic testing
1409583 - seeing RPC status error messages and timeouts due to RPC (rpc-clnt.c:200:call_bail)
1409135 - [Replicate] "RPC call decoding failed" leading to IO hang & mount inaccessible
1409729 - On systemic setup seeing AFR errors w.r.t fileops like "afr-open.c:187:afr_openfd_fix_open_cbk" and "afr-lk-common.c:825:afr_unlock_entrylk_cbk"
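For reference, the same-path directory-creation workload described above amounts to something like the following loop, run concurrently on all three clients; a sketch only, where the mount point and directory names are assumptions:

  MNT=/mnt/testvol          # FUSE mount point (assumption)
  while :; do
      # all three clients race to create the same directory path
      mkdir -p "$MNT/same_dir/level1/level2" 2>> /tmp/mkdir-errors.log
  done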
This has been fixed. Hence closing the bug.