Bug 1300301 - IOs failed with transport endpoint error while attaching tier (shows authentication problem) and mount gets unmounted
Status: MODIFIED
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: tier
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Assigned To: Mohammed Rafi KC
QA Contact: nchilaka
Whiteboard: tier-attach-detach
Keywords: ZStream
Depends On: 1300564 1300978
Blocks:
Reported: 2016-01-20 07:46 EST by nchilaka
Modified: 2017-06-28 05:07 EDT
CC List: 5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
client error (8.49 KB, text/plain)
2016-01-20 07:48 EST, nchilaka
mount log (45.53 KB, text/plain)
2016-01-22 04:37 EST, nchilaka

Description nchilaka 2016-01-20 07:46:32 EST
1) I created a regular volume as below (the volume info shown was captured after the tier in step 3 was attached):
[root@rhs-client21 ~]# gluster v info newvol
 
Volume Name: newvol
Type: Tier
Volume ID: d38264e9-6ce8-4c46-b052-ffd5e55554e1
Status: Started
Number of Bricks: 16
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: rhs-client20:/rhs/brick5/newvol_hot
Brick2: rhs-client21:/rhs/brick5/newvol_hot
Brick3: rhs-client20:/rhs/brick4/newvol_hot
Brick4: rhs-client21:/rhs/brick4/newvol_hot
Cold Tier:
Cold Tier Type : Distribute
Number of Bricks: 12
Brick5: rhs-client4:/rhs/brick1/newvol
Brick6: rhs-client20:/rhs/brick1/newvol
Brick7: rhs-client21:/rhs/brick1/newvol
Brick8: rhs-client30:/rhs/brick1/newvol
Brick9: 10.70.37.59:/rhs/brick1/newvol
Brick10: 10.70.37.150:/rhs/brick1/newvol
Brick11: rhs-client4:/rhs/brick2/newvol
Brick12: rhs-client20:/rhs/brick2/newvol
Brick13: rhs-client21:/rhs/brick2/newvol
Brick14: rhs-client30:/rhs/brick2/newvol
Brick15: 10.70.37.59:/rhs/brick2/newvol
Brick16: 10.70.37.150:/rhs/brick2/newvol
Options Reconfigured:
performance.readdir-ahead: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
features.ctr-enabled: on
cluster.tier-mode: cache




2) I then started IOs from two clients as below:
 a) rhsauto070: Linux kernel untar
 b) rhsauto026: dd in a loop creating 50 files of 300 MB each
3) While the above was going on, with some dd files already created, I attached the tier (mentioned in the vol info). On rhsauto026 the dd failed immediately for the current file and for all pending files.
Also, the client log shows the following:



[2016-01-19 19:03:40.677980] W [MSGID: 114043] [client-handshake.c:1114:client_setvolume_cbk] 2-newvol-client-0: failed to set the volume [Permission denied]
[2016-01-19 19:03:40.678114] W [MSGID: 114007] [client-handshake.c:1143:client_setvolume_cbk] 2-newvol-client-0: failed to get 'process-uuid' from reply dict [Invalid argument]
[2016-01-19 19:03:40.678133] E [MSGID: 114044] [client-handshake.c:1149:client_setvolume_cbk] 2-newvol-client-0: SETVOLUME on remote-host failed [Permission denied]
[2016-01-19 19:03:40.678145] I [MSGID: 114049] [client-handshake.c:1240:client_setvolume_cbk] 2-newvol-client-0: sending AUTH_FAILED event
[2016-01-19 19:03:40.678159] E [fuse-bridge.c:5200:notify] 0-fuse: Server authenication failed. Shutting down.
[2016-01-19 19:03:40.678171] I [fuse-bridge.c:5669:fini] 0-fuse: Unmounting '/mnt/newvol'.
[2016-01-19 19:03:40.678271] I [fuse-bridge.c:4965:fuse_thread_proc] 0-fuse: unmounting /mnt/newvol
[2016-01-19 19:03:40.678872] W [glusterfsd.c:1236:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f49315b7dc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7f4932c22905] -->/usr/sbin/glusterfs(cleanup_and_exit+0x69) [0x7f4932c22789] ) 0-: received signum (15), shutting down





Also, I am not able to cd to the mount location; it says transport endpoint error.
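
For context, the "transport endpoint" error is what any process sees once the glusterfs FUSE client has shut down while the mount point is still in place: system calls on the old mount fail with ENOTCONN. A minimal illustration (the file name below is hypothetical; the mount point is the one from the log above):

/* Illustration only, not from this bug's environment: after the FUSE
 * client process exits, accesses to the stale mount point fail with
 * ENOTCONN, which the shell reports as
 * "Transport endpoint is not connected". */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;

    if (stat("/mnt/newvol/somefile", &st) == -1 && errno == ENOTCONN)
        fprintf(stderr, "mount is dead: %s\n", strerror(errno));
    return 0;
}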


Note: the kernel untar on rhsauto070 was still going on without any interference.
Comment 2 nchilaka 2016-01-20 07:48 EST
Created attachment 1116638 [details]
client error
Comment 3 Mohammed Rafi KC 2016-01-22 02:19:43 EST
It is very inconsistently reproducible.

RCA:

It is a race between a graph change in the client graph and an option change in the server graph.
During server_reconfigure we re-authenticate each connected client against the current options. To do this authentication we store the values received during the connection establishment phase (server_setvolume) in a dictionary. If the authentication fails during reconfigure, we disconnect the transport. This introduces a race between server_setvolume and reconfigure: if a reconfigure runs before setvolume has been processed, the transport will be disconnected.

After a three-second timeout the transport will be reconnected.
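
A minimal, self-contained sketch of the race described above (illustrative only, not the actual protocol/server code; all names are hypothetical):

/* Illustrative sketch of the race -- NOT the actual GlusterFS
 * protocol/server code.  The server keeps, per connected client, the
 * auth parameters it saw at setvolume time.  reconfigure() re-checks
 * every connected client against the new options.  A client whose
 * transport is connected but whose setvolume has not run yet has no
 * saved parameters, so the re-check fails and the transport is
 * disconnected spuriously. */
#include <stdio.h>
#include <string.h>

struct client_ctx {
    int   transport_connected;    /* TCP connection is up            */
    int   setvolume_done;         /* SETVOLUME handshake completed   */
    char  saved_auth[64];         /* params cached during setvolume  */
};

/* Runs when the SETVOLUME request is processed. */
static void server_setvolume(struct client_ctx *c, const char *auth)
{
    snprintf(c->saved_auth, sizeof(c->saved_auth), "%s", auth);
    c->setvolume_done = 1;
}

/* Runs whenever a volume option changes (e.g. attach-tier regenerates
 * the volfile).  Re-authenticates every connected client. */
static void server_reconfigure(struct client_ctx *c, const char *allowed)
{
    if (!c->transport_connected)
        return;
    if (strcmp(c->saved_auth, allowed) != 0) {
        printf("auth check failed -> disconnecting transport\n");
        c->transport_connected = 0;   /* client sees AUTH_FAILED      */
    }
}

int main(void)
{
    struct client_ctx c = { .transport_connected = 1 };

    /* Racy ordering: reconfigure fires after the transport connects
     * but before setvolume has cached the client's parameters.       */
    server_reconfigure(&c, "password");   /* saved_auth is still ""   */
    server_setvolume(&c, "password");     /* too late, already kicked */

    printf("connected=%d setvolume_done=%d\n",
           c.transport_connected, c.setvolume_done);
    return 0;
}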
Comment 4 Mohammed Rafi KC 2016-01-22 04:22:14 EST
Changing the component since this can be reproduced on any volume; the bug also falls into the protocol layer.

NOTE: Given the RCA in comment 3, the failure should not cause the mount to be unmounted.
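
To illustrate that note, here is a hypothetical sketch of two client-side policies (not the actual fuse-bridge code): treating the first auth failure as fatal tears the mount down, whereas tolerating a few failures would let the three-second reconnect from comment 3 recover.

/* Hypothetical sketch of the policies discussed above; the real
 * fuse-bridge/notify code differs.  Treating the very first
 * AUTH_FAILED as fatal unmounts a mount that a later reconnect (the
 * three-second retry from comment 3) could have recovered. */
#include <stdbool.h>
#include <stdio.h>

enum client_event { EVENT_AUTH_FAILED, EVENT_CONNECTED };

struct mount_state {
    bool mounted;
    int  auth_failures;
};

/* Strict policy: any auth failure shuts the mount down (what the log
 * in this bug shows happening). */
static void on_event_strict(struct mount_state *m, enum client_event ev)
{
    if (ev == EVENT_AUTH_FAILED) {
        printf("auth failed -> unmounting\n");
        m->mounted = false;
    }
}

/* Tolerant policy: allow a few reconnect attempts before giving up,
 * so a transient race does not kill the mount. */
static void on_event_tolerant(struct mount_state *m, enum client_event ev)
{
    if (ev == EVENT_AUTH_FAILED && ++m->auth_failures > 3) {
        printf("auth failed repeatedly -> unmounting\n");
        m->mounted = false;
    } else if (ev == EVENT_CONNECTED) {
        m->auth_failures = 0;   /* reconnect succeeded, reset counter */
    }
}

int main(void)
{
    struct mount_state a = { .mounted = true }, b = { .mounted = true };

    on_event_strict(&a, EVENT_AUTH_FAILED);    /* mount torn down      */
    on_event_tolerant(&b, EVENT_AUTH_FAILED);  /* still mounted        */
    on_event_tolerant(&b, EVENT_CONNECTED);    /* recovered            */

    printf("strict mounted=%d, tolerant mounted=%d\n", a.mounted, b.mounted);
    return 0;
}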
Comment 5 Mohammed Rafi KC 2016-01-22 04:26:33 EST
Upstream master patch merged: http://review.gluster.org/#/c/13271/
Release 3.7: http://review.gluster.org/#/c/13280/
Comment 6 nchilaka 2016-01-22 04:37 EST
Created attachment 1117135 [details]
mount log
Comment 7 Mohammed Rafi KC 2016-07-01 00:25:46 EDT
The patches mentioned in comment 5 are merged upstream, so the fix will be available in 3.2 as part of the rebase.
Comment 8 Milind Changire 2017-01-18 05:00:42 EST
Moving to MODIFIED.
Patch available downstream as commit 30e4d0d.
