Bug 1300301 - IOs failed with transport endpoint error while attaching tier (shows authentication problem) and mount gets unmounted
Status: MODIFIED
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: tier
Version: 3.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Assigned To: Mohammed Rafi KC
QA Contact: nchilaka
Whiteboard: tier-attach-detach
Keywords: ZStream
Depends On: 1300564 1300978
Blocks:
Reported: 2016-01-20 07:46 EST by nchilaka
Modified: 2017-06-28 05:07 EDT
CC List: 5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
client error (8.49 KB, text/plain)
2016-01-20 07:48 EST, nchilaka
mount log (45.53 KB, text/plain)
2016-01-22 04:37 EST, nchilaka

Description nchilaka 2016-01-20 07:46:32 EST
1) I created a regular volume as below (the volume info shown was captured after the tier in step 3 was attached):
[root@rhs-client21 ~]# gluster v info newvol
 
Volume Name: newvol
Type: Tier
Volume ID: d38264e9-6ce8-4c46-b052-ffd5e55554e1
Status: Started
Number of Bricks: 16
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick1: rhs-client20:/rhs/brick5/newvol_hot
Brick2: rhs-client21:/rhs/brick5/newvol_hot
Brick3: rhs-client20:/rhs/brick4/newvol_hot
Brick4: rhs-client21:/rhs/brick4/newvol_hot
Cold Tier:
Cold Tier Type : Distribute
Number of Bricks: 12
Brick5: rhs-client4:/rhs/brick1/newvol
Brick6: rhs-client20:/rhs/brick1/newvol
Brick7: rhs-client21:/rhs/brick1/newvol
Brick8: rhs-client30:/rhs/brick1/newvol
Brick9: 10.70.37.59:/rhs/brick1/newvol
Brick10: 10.70.37.150:/rhs/brick1/newvol
Brick11: rhs-client4:/rhs/brick2/newvol
Brick12: rhs-client20:/rhs/brick2/newvol
Brick13: rhs-client21:/rhs/brick2/newvol
Brick14: rhs-client30:/rhs/brick2/newvol
Brick15: 10.70.37.59:/rhs/brick2/newvol
Brick16: 10.70.37.150:/rhs/brick2/newvol
Options Reconfigured:
performance.readdir-ahead: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
features.ctr-enabled: on
cluster.tier-mode: cache




2) I then started IOs from two clients as below:
 a) rhsauto070: Linux kernel untar
 b) rhsauto026: dd in a loop creating 50 files of 300 MB each
3) While the above was going on, with some dd files already created, I attached the tier (mentioned in the vol info). On rhsauto026 the dd failed immediately for the current file and for all pending files.
Also, the client log shows the following:



[2016-01-19 19:03:40.677980] W [MSGID: 114043] [client-handshake.c:1114:client_setvolume_cbk] 2-newvol-client-0: failed to set the volume [Permission denied]
[2016-01-19 19:03:40.678114] W [MSGID: 114007] [client-handshake.c:1143:client_setvolume_cbk] 2-newvol-client-0: failed to get 'process-uuid' from reply dict [Invalid argument]
[2016-01-19 19:03:40.678133] E [MSGID: 114044] [client-handshake.c:1149:client_setvolume_cbk] 2-newvol-client-0: SETVOLUME on remote-host failed [Permission denied]
[2016-01-19 19:03:40.678145] I [MSGID: 114049] [client-handshake.c:1240:client_setvolume_cbk] 2-newvol-client-0: sending AUTH_FAILED event
[2016-01-19 19:03:40.678159] E [fuse-bridge.c:5200:notify] 0-fuse: Server authenication failed. Shutting down.
[2016-01-19 19:03:40.678171] I [fuse-bridge.c:5669:fini] 0-fuse: Unmounting '/mnt/newvol'.
[2016-01-19 19:03:40.678271] I [fuse-bridge.c:4965:fuse_thread_proc] 0-fuse: unmounting /mnt/newvol
[2016-01-19 19:03:40.678872] W [glusterfsd.c:1236:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f49315b7dc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7f4932c22905] -->/usr/sbin/glusterfs(cleanup_and_exit+0x69) [0x7f4932c22789] ) 0-: received signum (15), shutting down





Also, I am not able to cd to the mount location; it says transport endpoint error.
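
For context, the "transport endpoint" error is what any process sees once the glusterfs FUSE client has shut down while the mount point is still in place: system calls on the old mount fail with ENOTCONN. A minimal illustration (the file name below is hypothetical; the mount point is the one from the log above):

/* Illustration only, not from this bug's environment: after the FUSE
 * client process exits, accesses to the stale mount point fail with
 * ENOTCONN, which the shell reports as
 * "Transport endpoint is not connected". */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;

    if (stat("/mnt/newvol/somefile", &st) == -1 && errno == ENOTCONN)
        fprintf(stderr, "mount is dead: %s\n", strerror(errno));
    return 0;
}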


Note: the kernel untar on rhsauto070 was still going on without any interference.
Comment 2 nchilaka 2016-01-20 07:48 EST
Created attachment 1116638 [details]
client error
Comment 3 Mohammed Rafi KC 2016-01-22 02:19:43 EST
It is very inconsistently reproducible.

RCA:

It is a race between a graph change in the client graph and an option change in the server graph.
During server_reconfigure we re-authenticate each connected client against the current options. To do this authentication we store the values received during the connection establishment phase (server_setvolume) in a dictionary. If the authentication fails during reconfigure, we disconnect the transport. This introduces a race between server_setvolume and reconfigure: if a reconfigure runs before setvolume has been processed, the transport will be disconnected.

After a three-second timeout the transport will be reconnected.
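
A minimal, self-contained sketch of the race described above (illustrative only, not the actual protocol/server code; all names are hypothetical):

/* Illustrative sketch of the race -- NOT the actual GlusterFS
 * protocol/server code.  The server keeps, per connected client, the
 * auth parameters it saw at setvolume time.  reconfigure() re-checks
 * every connected client against the new options.  A client whose
 * transport is connected but whose setvolume has not run yet has no
 * saved parameters, so the re-check fails and the transport is
 * disconnected spuriously. */
#include <stdio.h>
#include <string.h>

struct client_ctx {
    int   transport_connected;    /* TCP connection is up            */
    int   setvolume_done;         /* SETVOLUME handshake completed   */
    char  saved_auth[64];         /* params cached during setvolume  */
};

/* Runs when the SETVOLUME request is processed. */
static void server_setvolume(struct client_ctx *c, const char *auth)
{
    snprintf(c->saved_auth, sizeof(c->saved_auth), "%s", auth);
    c->setvolume_done = 1;
}

/* Runs whenever a volume option changes (e.g. attach-tier regenerates
 * the volfile).  Re-authenticates every connected client. */
static void server_reconfigure(struct client_ctx *c, const char *allowed)
{
    if (!c->transport_connected)
        return;
    if (strcmp(c->saved_auth, allowed) != 0) {
        printf("auth check failed -> disconnecting transport\n");
        c->transport_connected = 0;   /* client sees AUTH_FAILED      */
    }
}

int main(void)
{
    struct client_ctx c = { .transport_connected = 1 };

    /* Racy ordering: reconfigure fires after the transport connects
     * but before setvolume has cached the client's parameters.       */
    server_reconfigure(&c, "password");   /* saved_auth is still ""   */
    server_setvolume(&c, "password");     /* too late, already kicked */

    printf("connected=%d setvolume_done=%d\n",
           c.transport_connected, c.setvolume_done);
    return 0;
}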
Comment 4 Mohammed Rafi KC 2016-01-22 04:22:14 EST
Changing the component since this can be reproduced on any volume; the bug also falls into the protocol layer.

NOTE: Given the RCA in comment 3, the failure should not cause the mount to be unmounted.
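
To illustrate that note, here is a hypothetical sketch of two client-side policies (not the actual fuse-bridge code): treating the first auth failure as fatal tears the mount down, whereas tolerating a few failures would let the three-second reconnect from comment 3 recover.

/* Hypothetical sketch of the policies discussed above; the real
 * fuse-bridge/notify code differs.  Treating the very first
 * AUTH_FAILED as fatal unmounts a mount that a later reconnect (the
 * three-second retry from comment 3) could have recovered. */
#include <stdbool.h>
#include <stdio.h>

enum client_event { EVENT_AUTH_FAILED, EVENT_CONNECTED };

struct mount_state {
    bool mounted;
    int  auth_failures;
};

/* Strict policy: any auth failure shuts the mount down (what the log
 * in this bug shows happening). */
static void on_event_strict(struct mount_state *m, enum client_event ev)
{
    if (ev == EVENT_AUTH_FAILED) {
        printf("auth failed -> unmounting\n");
        m->mounted = false;
    }
}

/* Tolerant policy: allow a few reconnect attempts before giving up,
 * so a transient race does not kill the mount. */
static void on_event_tolerant(struct mount_state *m, enum client_event ev)
{
    if (ev == EVENT_AUTH_FAILED && ++m->auth_failures > 3) {
        printf("auth failed repeatedly -> unmounting\n");
        m->mounted = false;
    } else if (ev == EVENT_CONNECTED) {
        m->auth_failures = 0;   /* reconnect succeeded, reset counter */
    }
}

int main(void)
{
    struct mount_state a = { .mounted = true }, b = { .mounted = true };

    on_event_strict(&a, EVENT_AUTH_FAILED);    /* mount torn down      */
    on_event_tolerant(&b, EVENT_AUTH_FAILED);  /* still mounted        */
    on_event_tolerant(&b, EVENT_CONNECTED);    /* recovered            */

    printf("strict mounted=%d, tolerant mounted=%d\n", a.mounted, b.mounted);
    return 0;
}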
Comment 5 Mohammed Rafi KC 2016-01-22 04:26:33 EST
Upstream master patch merged: http://review.gluster.org/#/c/13271/
Release 3.7: http://review.gluster.org/#/c/13280/
Comment 6 nchilaka 2016-01-22 04:37 EST
Created attachment 1117135 [details]
mount log
Comment 7 Mohammed Rafi KC 2016-07-01 00:25:46 EDT
The patches mentioned in comment 5 are merged upstream, so the fix will be available in 3.2 as part of the rebase.
Comment 8 Milind Changire 2017-01-18 05:00:42 EST
Moving to MODIFIED.
Patch available downstream as commit 30e4d0d.
