Red Hat Bugzilla – Bug 1300301
IOs failed with transport end point error while attach tier(shows authentication problem) and mount gets unmounted
Last modified: 2017-06-28 05:07:03 EDT
1) I created a regular volume as below:
[root@rhs-client21 ~]# gluster v info newvol
Volume Name: newvol
Volume ID: d38264e9-6ce8-4c46-b052-ffd5e55554e1
Number of Bricks: 16
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Cold Tier Type : Distribute
Number of Bricks: 12
2) I then started IOs from two clients as below:
a) linux untar on rhsauto070
b) rhsauto026: dd command in a loop, creating 50 files of 300MB each
3) While the above was going on, with some dd files already created, I attached the tier (as shown in the vol info). On rhsauto026 the dd failed immediately for the current file and all pending files.
Also, the client logs show the following:
[2016-01-19 19:03:40.677980] W [MSGID: 114043] [client-handshake.c:1114:client_setvolume_cbk] 2-newvol-client-0: failed to set the volume [Permission denied]
[2016-01-19 19:03:40.678114] W [MSGID: 114007] [client-handshake.c:1143:client_setvolume_cbk] 2-newvol-client-0: failed to get 'process-uuid' from reply dict [Invalid argument]
[2016-01-19 19:03:40.678133] E [MSGID: 114044] [client-handshake.c:1149:client_setvolume_cbk] 2-newvol-client-0: SETVOLUME on remote-host failed [Permission denied]
[2016-01-19 19:03:40.678145] I [MSGID: 114049] [client-handshake.c:1240:client_setvolume_cbk] 2-newvol-client-0: sending AUTH_FAILED event
[2016-01-19 19:03:40.678159] E [fuse-bridge.c:5200:notify] 0-fuse: Server authenication failed. Shutting down.
[2016-01-19 19:03:40.678171] I [fuse-bridge.c:5669:fini] 0-fuse: Unmounting '/mnt/newvol'.
[2016-01-19 19:03:40.678271] I [fuse-bridge.c:4965:fuse_thread_proc] 0-fuse: unmounting /mnt/newvol
[2016-01-19 19:03:40.678872] W [glusterfsd.c:1236:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f49315b7dc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7f4932c22905] -->/usr/sbin/glusterfs(cleanup_and_exit+0x69) [0x7f4932c22789] ) 0-: received signum (15), shutting down
Also, I am now unable to cd to the mount location; it reports a transport endpoint error.
Note: the kernel untar on rhsauto070 was still going on without any interference.
Created attachment 1116638
It is very inconsistently reproducible.
It is a race between a graph change in the client graph and an option change in the server graph.
During server_reconfigure we authenticate each connected client against the current options. To do this authentication, we store the previous values in a dictionary during the connection establishment phase (server_setvolume). If the authentication fails during reconfigure, we disconnect the transport. This introduces a race between server_setvolume and reconfigure: if reconfigure is called before setvolume has run, the transport will be disconnected.
After a three-second timeout, the transport will be reconnected.
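The ordering problem described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not GlusterFS code: `ToyServer`, `Transport`, `accept`, and `run` are hypothetical names, and the single `username` check stands in for whatever options the real server validates. It shows only that re-checking a connection whose handshake values have not been stored yet ends in a disconnect, even though the options themselves did not change for that client.

```python
from dataclasses import dataclass, field


@dataclass
class Transport:
    """Hypothetical stand-in for one client connection on the server."""
    name: str
    # Filled in by setvolume(); stays empty until the handshake completes.
    auth_params: dict = field(default_factory=dict)
    connected: bool = True


class ToyServer:
    """Toy model of the server side; not the real protocol/server xlator."""

    def __init__(self, allowed_user):
        self.allowed_user = allowed_user
        self.transports = []

    def accept(self, name):
        # The connection exists, but SETVOLUME has not been processed yet.
        transport = Transport(name)
        self.transports.append(transport)
        return transport

    def _authenticate(self, params):
        # Assumed rule for illustration: the stored username must match.
        return params.get("username") == self.allowed_user

    def setvolume(self, transport, username):
        # Handshake: store the values so reconfigure can re-check them later.
        params = {"username": username}
        if not self._authenticate(params):
            transport.connected = False
            return False
        transport.auth_params = params
        return True

    def reconfigure(self, allowed_user):
        # Option change: re-check every connected transport.
        self.allowed_user = allowed_user
        for transport in self.transports:
            if transport.connected and not self._authenticate(transport.auth_params):
                # Stored values are missing or no longer valid.
                transport.connected = False


def run(reconfigure_first):
    """Return True if the connection survives the given ordering."""
    server = ToyServer(allowed_user="client-a")
    transport = server.accept("conn-0")
    if reconfigure_first:
        # Racy ordering: reconfigure sees a transport with no stored values.
        server.reconfigure(allowed_user="client-a")
        if transport.connected:
            server.setvolume(transport, "client-a")
    else:
        # Expected ordering: handshake first, then the option change.
        server.setvolume(transport, "client-a")
        server.reconfigure(allowed_user="client-a")
    return transport.connected


if __name__ == "__main__":
    print("setvolume then reconfigure, still connected:", run(False))
    print("reconfigure then setvolume, still connected:", run(True))
```

In this model the connection survives when setvolume runs first and is dropped when reconfigure runs first, which matches the sequence described in the analysis. How the merged patches close the window is not modelled here.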
Changing the component, since this can be reproduced on any volume and the bug falls into the protocol layer.
NOTE: With the RCA given in comment 3, the failure should not result in an unmount.
Upstream master patch merged: http://review.gluster.org/#/c/13271/
Release 3.7: http://review.gluster.org/#/c/13280/
Created attachment 1117135
The patches mentioned in comment 5 are merged upstream, so the fix will be available in 3.2 as part of the rebase.
Moving to MODIFIED.
Patch available downstream as commit 30e4d0d.