Bug 1415939 - [Tiering]: IO error on fuse mount after killing hot-tier bricks on a tiered volume
Summary: [Tiering]: IO error on fuse mount after killing hot-tier bricks on a tiered volume
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: tier
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Milind Changire
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-24 07:02 UTC by Bala Konda Reddy M
Modified: 2018-11-08 18:38 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-08 18:38:11 UTC
Embargoed:



Description Bala Konda Reddy M 2017-01-24 07:02:43 UTC
Description of problem:

On a tiered volume, I/O errors are seen on the fuse mount after the hot-tier brick processes are killed.

Version-Release number of selected component (if applicable):
3.8.4-12

How reproducible:
Always

Steps to Reproduce:
1. Created a distributed-disperse volume, 2 x (4+2)
2. Attached a 2 x 2 distributed-replicate hot tier (a rough command sketch of steps 1-4 is given after step 5)
[root@dhcp37-179 ~]# gluster vol status testvol
Status of volume: testvol
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick 10.70.37.109:/bricks/brick2/testvol_tier3    49154   0   Y   9413
Brick 10.70.37.126:/bricks/brick2/testvol_tier2    49154   0   Y   9535
Brick 10.70.37.170:/bricks/brick2/testvol_tier1    49154   0   Y   27447
Brick 10.70.37.179:/bricks/brick2/testvol_tier0    49154   0   Y   11844
Cold Bricks:
Brick 10.70.37.179:/bricks/brick0/testvol_brick0   49152   0   Y   11459
Brick 10.70.37.170:/bricks/brick0/testvol_brick1   49152   0   Y   27233
Brick 10.70.37.126:/bricks/brick0/testvol_brick2   49152   0   Y   9321
Brick 10.70.37.109:/bricks/brick0/testvol_brick3   49152   0   Y   9199
Brick 10.70.37.108:/bricks/brick0/testvol_brick4   49152   0   Y   6694
Brick 10.70.37.151:/bricks/brick0/testvol_brick5   49152   0   Y   14287
Brick 10.70.37.179:/bricks/brick1/testvol_brick6   49153   0   Y   11478
Brick 10.70.37.170:/bricks/brick1/testvol_brick7   49153   0   Y   27252
Brick 10.70.37.126:/bricks/brick1/testvol_brick8   49153   0   Y   9340
Brick 10.70.37.109:/bricks/brick1/testvol_brick9   49153   0   Y   9218
Brick 10.70.37.108:/bricks/brick1/testvol_brick10  49153   0   Y   6713
Brick 10.70.37.151:/bricks/brick1/testvol_brick11  49153   0   Y   14306
Self-heal Daemon on localhost N/A N/A Y 11871
Self-heal Daemon on 10.70.37.151 N/A N/A Y 14502
Self-heal Daemon on 10.70.37.108 N/A N/A Y 6910 
Self-heal Daemon on 10.70.37.109 N/A N/A Y 9434 
Self-heal Daemon on 10.70.37.170 N/A N/A Y 27468
Self-heal Daemon on 10.70.37.126 N/A N/A Y 9556 
 
Task Status of Volume testvol
------------------------------------------------------------------------------
Task : Tier migration 
ID : 15464601-de6d-43c1-88d4-0e731d4219df
Status : in progress 

3. Fuse-mounted the volume on a client

4. Killed the hot-tier brick processes.

[root@dhcp37-179 ~]# gluster vol status
Status of volume: testvol
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick 10.70.37.109:/bricks/brick2/testvol_tier3    N/A     N/A   N   N/A
Brick 10.70.37.126:/bricks/brick2/testvol_tier2    N/A     N/A   N   N/A
Brick 10.70.37.170:/bricks/brick2/testvol_tier1    N/A     N/A   N   N/A
Brick 10.70.37.179:/bricks/brick2/testvol_tier0    N/A     N/A   N   N/A
Cold Bricks:
Brick 10.70.37.179:/bricks/brick0/testvol_brick0   49152   0   Y   11459
Brick 10.70.37.170:/bricks/brick0/testvol_brick1   49152   0   Y   27233
Brick 10.70.37.126:/bricks/brick0/testvol_brick2   49152   0   Y   9321
Brick 10.70.37.109:/bricks/brick0/testvol_brick3   49152   0   Y   9199
Brick 10.70.37.108:/bricks/brick0/testvol_brick4   49152   0   Y   6694
Brick 10.70.37.151:/bricks/brick0/testvol_brick5   49152   0   Y   14287
Brick 10.70.37.179:/bricks/brick1/testvol_brick6   49153   0   Y   11478
Brick 10.70.37.170:/bricks/brick1/testvol_brick7   49153   0   Y   27252
Brick 10.70.37.126:/bricks/brick1/testvol_brick8   49153   0   Y   9340
Brick 10.70.37.109:/bricks/brick1/testvol_brick9   49153   0   Y   9218
Brick 10.70.37.108:/bricks/brick1/testvol_brick10  49153   0   Y   6713
Brick 10.70.37.151:/bricks/brick1/testvol_brick11  49153   0   Y   14306
Self-heal Daemon on localhost N/A N/A Y 11871
Self-heal Daemon on 10.70.37.108 N/A N/A Y 6910 
Self-heal Daemon on 10.70.37.170 N/A N/A Y 27468
Self-heal Daemon on 10.70.37.151 N/A N/A Y 14502
Self-heal Daemon on 10.70.37.109 N/A N/A Y 9434 
Self-heal Daemon on 10.70.37.126 N/A N/A Y 9556 
 
Task Status of Volume testvol
------------------------------------------------------------------------------
Task : Tier migration 
ID : 15464601-de6d-43c1-88d4-0e731d4219df
Status : in progress 

5. Started I/O on the mount point
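
For reference, the steps above correspond roughly to the following commands. The brick paths, hostnames, and subvolume counts are taken from the volume status output above; the exact commands used were not captured in this report, so treat this as an approximate sketch rather than the exact reproduction script.

    # Step 1: create and start a 2 x (4+2) distributed-disperse volume
    gluster volume create testvol disperse 6 redundancy 2 \
        10.70.37.179:/bricks/brick0/testvol_brick0 10.70.37.170:/bricks/brick0/testvol_brick1 \
        10.70.37.126:/bricks/brick0/testvol_brick2 10.70.37.109:/bricks/brick0/testvol_brick3 \
        10.70.37.108:/bricks/brick0/testvol_brick4 10.70.37.151:/bricks/brick0/testvol_brick5 \
        10.70.37.179:/bricks/brick1/testvol_brick6 10.70.37.170:/bricks/brick1/testvol_brick7 \
        10.70.37.126:/bricks/brick1/testvol_brick8 10.70.37.109:/bricks/brick1/testvol_brick9 \
        10.70.37.108:/bricks/brick1/testvol_brick10 10.70.37.151:/bricks/brick1/testvol_brick11
    gluster volume start testvol

    # Step 2: attach a 2 x 2 distributed-replicate hot tier
    gluster volume tier testvol attach replica 2 \
        10.70.37.179:/bricks/brick2/testvol_tier0 10.70.37.170:/bricks/brick2/testvol_tier1 \
        10.70.37.126:/bricks/brick2/testvol_tier2 10.70.37.109:/bricks/brick2/testvol_tier3

    # Step 3: fuse-mount the volume on the client (mount point is illustrative)
    mount -t glusterfs 10.70.37.179:/testvol /mnt/fuse

    # Step 4: on each hot-tier node, kill the hot-tier brick process using the
    # PID reported for that brick by 'gluster volume status testvol'
    kill -9 <hot-brick-pid>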

Actual results:
touch: cannot touch ‘a.txt’: Input/output error

Expected results:
The file should be created without any error. With the hot-tier bricks down, I/O should be served by the cold tier.

Additional info:
After unmounting and remounting the volume, I/O passes without any errors and all I/O goes to the cold tier.
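
In other words, the workaround is simply to remount the fuse client, roughly as follows (mount point and server address here are illustrative):

    umount /mnt/fuse
    mount -t glusterfs 10.70.37.179:/testvol /mnt/fuse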

Comment 2 Milind Changire 2017-01-24 13:18:44 UTC
Bala,
Could you add the following three options to the protocol/client sections of the trusted-*.vol and testvol.vol files on all the bricks and run the test again:

    option transport.tcp-user-timeout 2
    option transport.socket.keepalive-time 2
    option transport.socket.keepalive-interval 1
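
With those options added, one protocol/client section in testvol.vol would look roughly like the following. The subvolume name and the remote-host/remote-subvolume values are only placeholders taken from the status output above; only the three option lines listed earlier are the actual suggestion.

    volume testvol-client-0
        type protocol/client
        option remote-host 10.70.37.179
        option remote-subvolume /bricks/brick0/testvol_brick0
        option transport-type tcp
        option transport.tcp-user-timeout 2
        option transport.socket.keepalive-time 2
        option transport.socket.keepalive-interval 1
    end-volume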

Test whether you still need to remount for the "touch a.txt" operation to succeed, or whether it succeeds without a remount.

As an aside, how long does it take for the "touch a.txt" command to return "Input/output error" after killing the hot bricks? Running "time touch a.txt" after killing the hot bricks might help answer that.

Comment 3 Bala Konda Reddy M 2017-01-25 10:29:17 UTC
Milind,

I tried setting the options as you mentioned. When I then tried to create a file, it immediately failed with "Transport endpoint is not connected".

[root@dhcp37-74 ~]# touch /mnt/fuse/b.txt
touch: cannot touch ‘/mnt/fuse/b.txt’: Transport endpoint is not connected

Comment 7 hari gowtham 2018-11-08 18:38:11 UTC
As tier is no longer being actively developed, I'm closing this bug. Feel free to reopen it if necessary.

