Description of problem:
=======================
On a 2x(4+2) EC volume with all the md-cache settings enabled, I brought some bricks down by forcefully unmounting their backing filesystems and later brought all the bricks back up. After some iterations, while running "ls -l" on the FUSE mount, I consistently hit the error messages below during that period (about 5 times). One thing to note: the root of the volume "/" was shown as requiring heal (in heal info) even though there were no actual changes or writes to the data.

Data set: about 50000 files of size 1KB on the root, and about 40000 files of 1KB in a directory "test" under the root. I was running "time ls -lt test/file.*".

[2016-12-13 06:52:44.402952] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-disperse-disperse-0: Heal failed [Transport endpoint is not connected]
[2016-12-13 06:52:44.403859] W [MSGID: 122053] [ec-common.c:116:ec_check_status] 0-disperse-disperse-0: Operation failed on some subvolumes (up=3F, mask=3F, remaining=0, good=1F, bad=20)
The message "W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-disperse-client-5: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [No such file or directory]" repeated 4 times between [2016-12-13 06:52:44.371681] and [2016-12-13 06:52:44.403964]
[2016-12-13 06:52:44.406021] W [MSGID: 122002] [ec-common.c:71:ec_heal_report] 0-disperse-disperse-0: Heal failed [Transport endpoint is not connected]
[2016-12-13 06:52:44.421467] W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-disperse-client-5: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [No such file or directory]
[2016-12-13 06:52:44.442641] W [MSGID: 114063] [client-callback.c:110:client_cbk_cache_invalidation] 0-disperse-client-3: XDR decode of cache_invalidation failed. [Operation not permitted]
[2016-12-13 06:52:44.442720] E [socket.c:1989:__socket_read_frag] 0-rpc: wrong MSG-TYPE (22) received from 10.70.35.135:49154
[2016-12-13 06:52:44.442948] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7f4812dcc682] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f4812b9275e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f4812b9286e] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x84)[0x7f4812b93fd4] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x94)[0x7f4812b94864] ))))) 0-disperse-client-3: forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at 2016-12-13 06:52:44.422872 (xid=0x1b92db)
[2016-12-13 06:52:44.442974] E [MSGID: 114031] [client-rpc-fops.c:1550:client3_3_inodelk_cbk] 0-disperse-client-3: remote operation failed [Transport endpoint is not connected]

However, I have not been able to reproduce the issue after about 3 hours of trying.

The brick error log on one of the nodes showed the following when the FUSE errors were hit:

[2016-12-13 07:34:48.825529] E [rpcsvc.c:1304:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x20807a, Program: GlusterFS 3.3, ProgVers: 330, Proc: 40) to rpc-transport (tcp.disperse-server)
[2016-12-13 07:34:48.826196] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x12881) [0x7f6ed95af881] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x21907) [0x7f6ed915e907] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9186) [0x7f6ed9146186] ) 0-: Reply submission failed

Version-Release number of selected component (if applicable):
=========
3.8.4-8
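For reference, a minimal sketch of the reproduction flow as a shell script. The mount point, the brick backing-filesystem path, and the brick-restart step are assumptions based on the volume layout further below, not commands captured from the actual run:

MNT=/mnt/disperse          # assumed FUSE mount point
BRICKFS=/rhs/brick2        # assumed backing filesystem of one brick

# Create the data set: ~50000 1KB files on the root and ~40000 under test/
mkdir -p "$MNT/test"
for i in $(seq 1 50000); do
    dd if=/dev/urandom of="$MNT/file.$i" bs=1K count=1 2>/dev/null
done
for i in $(seq 1 40000); do
    dd if=/dev/urandom of="$MNT/test/file.$i" bs=1K count=1 2>/dev/null
done

# Bring a brick down by forcefully unmounting its backing filesystem
# (run on the brick node), then remount it (assuming an fstab entry)
# and restart the dead brick process.
umount -f -l "$BRICKFS"
mount "$BRICKFS"
gluster volume start disperse force

# The listing that was running while the errors appeared:
time ls -lt "$MNT"/test/file.*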
I did not see any visible functional impact.
[root@dhcp35-37 ~]# gluster v status
Status of volume: disperse
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.37:/rhs/brick2/disperse      49152     0          Y       2803
Brick 10.70.35.116:/rhs/brick2/disperse     49152     0          Y       2705
Brick 10.70.35.239:/rhs/brick2/disperse     49154     0          Y       3388
Brick 10.70.35.135:/rhs/brick2/disperse     49154     0          Y       3395
Brick 10.70.35.8:/rhs/brick2/disperse       49154     0          Y       3389
Brick 10.70.35.196:/rhs/brick2/disperse     49152     0          Y       1734
Brick 10.70.35.37:/rhs/brick3/disperse      49153     0          Y       2808
Brick 10.70.35.116:/rhs/brick3/disperse     49153     0          Y       2712
Brick 10.70.35.239:/rhs/brick3/disperse     49155     0          Y       3408
Brick 10.70.35.135:/rhs/brick3/disperse     49155     0          Y       3415
Brick 10.70.35.8:/rhs/brick3/disperse       49155     0          Y       3409
Brick 10.70.35.196:/rhs/brick3/disperse     49153     0          Y       1752
Self-heal Daemon on localhost               N/A       N/A        Y       3012
Self-heal Daemon on 10.70.35.116            N/A       N/A        Y       5403
Self-heal Daemon on 10.70.35.196            N/A       N/A        Y       5404
Self-heal Daemon on 10.70.35.135            N/A       N/A        Y       7562
Self-heal Daemon on 10.70.35.8              N/A       N/A        Y       7525
Self-heal Daemon on 10.70.35.239            N/A       N/A        Y       7557

Task Status of Volume disperse
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : a3cdc6f5-2208-4bed-bd6c-f0e947d678cc
Status               : completed

[root@dhcp35-37 ~]# gluster v info

Volume Name: disperse
Type: Distributed-Disperse
Volume ID: 66222b81-0ccc-42d6-a655-b0a2359f8513
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (4 + 2) = 12
Transport-type: tcp
Bricks:
Brick1: 10.70.35.37:/rhs/brick2/disperse
Brick2: 10.70.35.116:/rhs/brick2/disperse
Brick3: 10.70.35.239:/rhs/brick2/disperse
Brick4: 10.70.35.135:/rhs/brick2/disperse
Brick5: 10.70.35.8:/rhs/brick2/disperse
Brick6: 10.70.35.196:/rhs/brick2/disperse
Brick7: 10.70.35.37:/rhs/brick3/disperse
Brick8: 10.70.35.116:/rhs/brick3/disperse
Brick9: 10.70.35.239:/rhs/brick3/disperse
Brick10: 10.70.35.135:/rhs/brick3/disperse
Brick11: 10.70.35.8:/rhs/brick3/disperse
Brick12: 10.70.35.196:/rhs/brick3/disperse
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
performance.stat-prefetch: on
features.cache-invalidation: on
features.cache-invalidation-timeout: 60
performance.cache-invalidation: on
performance.md-cache-timeout: 60
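For completeness, the md-cache related entries under "Options Reconfigured" above correspond to the following volume-set commands. This is a sketch of how the settings were presumably applied, using the values shown in the volume info:

[root@dhcp35-37 ~]# gluster volume set disperse features.cache-invalidation on
[root@dhcp35-37 ~]# gluster volume set disperse features.cache-invalidation-timeout 60
[root@dhcp35-37 ~]# gluster volume set disperse performance.stat-prefetch on
[root@dhcp35-37 ~]# gluster volume set disperse performance.cache-invalidation on
[root@dhcp35-37 ~]# gluster volume set disperse performance.md-cache-timeout 60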
sosreports http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1404219/
We have noticed that this bug is not reproducible in the latest version of the product (RHGS-3.3.1+). If the bug is still relevant and can be reproduced, feel free to reopen it.