Bug 1300301

Summary: IOs failed with transport endpoint error during attach-tier (shows authentication problem) and mount gets unmounted
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Nag Pavan Chilakam <nchilaka>
Component: tier
Assignee: Mohammed Rafi KC <rkavunga>
Status: CLOSED WONTFIX
QA Contact: Nag Pavan Chilakam <nchilaka>
Severity: urgent
Priority: unspecified
Version: rhgs-3.1
CC: hgowtham, mchangir, nbalacha, rhs-bugs, rkavunga, smohan
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: tier-attach-detach
Doc Type: Bug Fix
Last Closed: 2018-11-08 18:36:15 UTC
Type: Bug
Bug Depends On: 1300564, 1300978
Attachments:
  client error (flags: none)
  mount log (flags: none)

Description Nag Pavan Chilakam 2016-01-20 12:46:32 UTC
1) I created a regular volume as below:
[root@rhs-client21 ~]# gluster v info newvol
 
Volume Name: newvol
Type: Tier
Volume ID: d38264e9-6ce8-4c46-b052-ffd5e55554e1
Status: Started
Number of Bricks: 16
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: rhs-client20:/rhs/brick5/newvol_hot
Brick2: rhs-client21:/rhs/brick5/newvol_hot
Brick3: rhs-client20:/rhs/brick4/newvol_hot
Brick4: rhs-client21:/rhs/brick4/newvol_hot
Cold Tier:
Cold Tier Type : Distribute
Number of Bricks: 12
Brick5: rhs-client4:/rhs/brick1/newvol
Brick6: rhs-client20:/rhs/brick1/newvol
Brick7: rhs-client21:/rhs/brick1/newvol
Brick8: rhs-client30:/rhs/brick1/newvol
Brick9: 10.70.37.59:/rhs/brick1/newvol
Brick10: 10.70.37.150:/rhs/brick1/newvol
Brick11: rhs-client4:/rhs/brick2/newvol
Brick12: rhs-client20:/rhs/brick2/newvol
Brick13: rhs-client21:/rhs/brick2/newvol
Brick14: rhs-client30:/rhs/brick2/newvol
Brick15: 10.70.37.59:/rhs/brick2/newvol
Brick16: 10.70.37.150:/rhs/brick2/newvol
Options Reconfigured:
performance.readdir-ahead: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
features.ctr-enabled: on
cluster.tier-mode: cache

2) I then started IOs from two clients as below:
 a) linux untar on rhsauto070
 b) rhsauto026: dd command creating 50 files of 300MB each in a loop
3) While the above was going on, with some dd files already created, I attached the tier (mentioned in the vol info above). I saw that on rhsauto026 the dd failed immediately for the current file and all pending files.
Also, the client log shows the following:

[2016-01-19 19:03:40.677980] W [MSGID: 114043] [client-handshake.c:1114:client_setvolume_cbk] 2-newvol-client-0: failed to set the volume [Permission denied]
[2016-01-19 19:03:40.678114] W [MSGID: 114007] [client-handshake.c:1143:client_setvolume_cbk] 2-newvol-client-0: failed to get 'process-uuid' from reply dict [Invalid argument]
[2016-01-19 19:03:40.678133] E [MSGID: 114044] [client-handshake.c:1149:client_setvolume_cbk] 2-newvol-client-0: SETVOLUME on remote-host failed [Permission denied]
[2016-01-19 19:03:40.678145] I [MSGID: 114049] [client-handshake.c:1240:client_setvolume_cbk] 2-newvol-client-0: sending AUTH_FAILED event
[2016-01-19 19:03:40.678159] E [fuse-bridge.c:5200:notify] 0-fuse: Server authenication failed. Shutting down.
[2016-01-19 19:03:40.678171] I [fuse-bridge.c:5669:fini] 0-fuse: Unmounting '/mnt/newvol'.
[2016-01-19 19:03:40.678271] I [fuse-bridge.c:4965:fuse_thread_proc] 0-fuse: unmounting /mnt/newvol
[2016-01-19 19:03:40.678872] W [glusterfsd.c:1236:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f49315b7dc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7f4932c22905] -->/usr/sbin/glusterfs(cleanup_and_exit+0x69) [0x7f4932c22789] ) 0-: received signum (15), shutting down

Also, I am now unable to cd to the mount location; it reports a transport endpoint error.


Note: the kernel untar on rhsauto070 was still going on without any interference.

Comment 2 Nag Pavan Chilakam 2016-01-20 12:48:13 UTC
Created attachment 1116638 [details]
client error

Comment 3 Mohammed Rafi KC 2016-01-22 07:19:43 UTC
It is very inconsistently reproducible.

RCA:

It is a race between a graph change in the client graph and an option change in the server graph.

During server_reconfigure we authenticate each connected client against the current options. For this authentication we store the values from the connection establishment phase (server_setvolume) in a dictionary. If the authentication fails during reconfigure, we disconnect the transport. This introduces a race between server_setvolume and reconfigure: if reconfigure runs before setvolume has completed, the transport will be disconnected.

After a three-second timeout the transport will be reconnected.
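The race above can be sketched as a toy simulation (all names here are hypothetical and for illustration only; the real logic lives in glusterfs's server protocol translator, e.g. server_setvolume and server's reconfigure in C). Before the fix, reconfigure re-authenticates every connected transport, including ones whose setvolume has not happened yet, so their missing stored auth parameters look like an auth failure and cause a spurious disconnect; a fix along the lines described would skip such clients.

```python
# Toy model of the setvolume/reconfigure race (hypothetical names,
# not the actual glusterfs API).

class Client:
    def __init__(self, name):
        self.name = name
        self.auth_params = None   # filled in by server_setvolume
        self.connected = True

def server_setvolume(client, params):
    # Connection establishment: remember the params used to authenticate,
    # so reconfigure can re-check them against new options later.
    client.auth_params = params

def server_reconfigure(clients, new_options, skip_unauthenticated):
    # Re-authenticate every connected client against the new options.
    for c in clients:
        if c.auth_params is None:
            if skip_unauthenticated:
                continue              # fixed behaviour: setvolume not done yet
            c.connected = False       # buggy behaviour: spurious disconnect
        elif c.auth_params.get("password") != new_options.get("password"):
            c.connected = False       # genuine authentication failure

# One client has completed setvolume; another has only connected so far,
# and reconfigure races ahead of its setvolume.
done = Client("done")
server_setvolume(done, {"password": "s3cret"})
pending = Client("pending")

# Buggy path: the pending client is disconnected for no real reason.
server_reconfigure([done, pending], {"password": "s3cret"},
                   skip_unauthenticated=False)
print(done.connected, pending.connected)   # True False

# Fixed path: clients that have not completed setvolume are skipped.
pending2 = Client("pending2")
server_reconfigure([pending2], {"password": "s3cret"},
                   skip_unauthenticated=True)
print(pending2.connected)                  # True
```

This also matches the observed behaviour: the disconnect is transient (the transport reconnects after the timeout), which is why the bug reproduces so inconsistently.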

Comment 4 Mohammed Rafi KC 2016-01-22 09:22:14 UTC
Changing the component, since this can be reproduced on any volume; also, this bug falls into the protocol layer.

NOTE: Per the RCA given in comment 3, the failure should not cause an unmount.

Comment 5 Mohammed Rafi KC 2016-01-22 09:26:33 UTC
Upstream master patch merged: http://review.gluster.org/#/c/13271/
Release 3.7: http://review.gluster.org/#/c/13280/

Comment 6 Nag Pavan Chilakam 2016-01-22 09:37:40 UTC
Created attachment 1117135 [details]
mount log

Comment 7 Mohammed Rafi KC 2016-07-01 04:25:46 UTC
Patches mentioned in comment 5 are merged upstream, so the fix will be available in 3.2 as part of the rebase.

Comment 8 Milind Changire 2017-01-18 10:00:42 UTC
Moving to MODIFIED.
Patch available downstream as commit 30e4d0d.

Comment 11 hari gowtham 2018-11-08 18:36:15 UTC
As tier is not being actively developed, I'm closing this bug. Feel free to reopen it if necessary.