Description of problem:
On a tiered volume, IO errors are seen on fuse mount after killing hot tier bricks.

Version-Release number of selected component (if applicable):
3.8.4-12

How reproducible:
Always

Steps to Reproduce:
1. Created a Distribute-Disperse volume 2*(4+2)
2. Attached distributed-replicate as hot-tier (2*2)

[root@dhcp37-179 ~]# gluster vol status testvol
Status of volume: testvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick 10.70.37.109:/bricks/brick2/testvol_t
ier3                                        49154     0          Y       9413
Brick 10.70.37.126:/bricks/brick2/testvol_t
ier2                                        49154     0          Y       9535
Brick 10.70.37.170:/bricks/brick2/testvol_t
ier1                                        49154     0          Y       27447
Brick 10.70.37.179:/bricks/brick2/testvol_t
ier0                                        49154     0          Y       11844
Cold Bricks:
Brick 10.70.37.179:/bricks/brick0/testvol_b
rick0                                       49152     0          Y       11459
Brick 10.70.37.170:/bricks/brick0/testvol_b
rick1                                       49152     0          Y       27233
Brick 10.70.37.126:/bricks/brick0/testvol_b
rick2                                       49152     0          Y       9321
Brick 10.70.37.109:/bricks/brick0/testvol_b
rick3                                       49152     0          Y       9199
Brick 10.70.37.108:/bricks/brick0/testvol_b
rick4                                       49152     0          Y       6694
Brick 10.70.37.151:/bricks/brick0/testvol_b
rick5                                       49152     0          Y       14287
Brick 10.70.37.179:/bricks/brick1/testvol_b
rick6                                       49153     0          Y       11478
Brick 10.70.37.170:/bricks/brick1/testvol_b
rick7                                       49153     0          Y       27252
Brick 10.70.37.126:/bricks/brick1/testvol_b
rick8                                       49153     0          Y       9340
Brick 10.70.37.109:/bricks/brick1/testvol_b
rick9                                       49153     0          Y       9218
Brick 10.70.37.108:/bricks/brick1/testvol_b
rick10                                      49153     0          Y       6713
Brick 10.70.37.151:/bricks/brick1/testvol_b
rick11                                      49153     0          Y       14306
Self-heal Daemon on localhost               N/A       N/A        Y       11871
Self-heal Daemon on 10.70.37.151            N/A       N/A        Y       14502
Self-heal Daemon on 10.70.37.108            N/A       N/A        Y       6910
Self-heal Daemon on 10.70.37.109            N/A       N/A        Y       9434
Self-heal Daemon on 10.70.37.170            N/A       N/A        Y       27468
Self-heal Daemon on 10.70.37.126            N/A       N/A        Y       9556

Task Status of Volume testvol
------------------------------------------------------------------------------
Task                 : Tier migration
ID                   : 15464601-de6d-43c1-88d4-0e731d4219df
Status               : in progress

3. fuse mounted the volume on a client
4. Killed the Hot Tier brick processes.

[root@dhcp37-179 ~]# gluster vol status
Status of volume: testvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick 10.70.37.109:/bricks/brick2/testvol_t
ier3                                        N/A       N/A        N       N/A
Brick 10.70.37.126:/bricks/brick2/testvol_t
ier2                                        N/A       N/A        N       N/A
Brick 10.70.37.170:/bricks/brick2/testvol_t
ier1                                        N/A       N/A        N       N/A
Brick 10.70.37.179:/bricks/brick2/testvol_t
ier0                                        N/A       N/A        N       N/A
Cold Bricks:
Brick 10.70.37.179:/bricks/brick0/testvol_b
rick0                                       49152     0          Y       11459
Brick 10.70.37.170:/bricks/brick0/testvol_b
rick1                                       49152     0          Y       27233
Brick 10.70.37.126:/bricks/brick0/testvol_b
rick2                                       49152     0          Y       9321
Brick 10.70.37.109:/bricks/brick0/testvol_b
rick3                                       49152     0          Y       9199
Brick 10.70.37.108:/bricks/brick0/testvol_b
rick4                                       49152     0          Y       6694
Brick 10.70.37.151:/bricks/brick0/testvol_b
rick5                                       49152     0          Y       14287
Brick 10.70.37.179:/bricks/brick1/testvol_b
rick6                                       49153     0          Y       11478
Brick 10.70.37.170:/bricks/brick1/testvol_b
rick7                                       49153     0          Y       27252
Brick 10.70.37.126:/bricks/brick1/testvol_b
rick8                                       49153     0          Y       9340
Brick 10.70.37.109:/bricks/brick1/testvol_b
rick9                                       49153     0          Y       9218
Brick 10.70.37.108:/bricks/brick1/testvol_b
rick10                                      49153     0          Y       6713
Brick 10.70.37.151:/bricks/brick1/testvol_b
rick11                                      49153     0          Y       14306
Self-heal Daemon on localhost               N/A       N/A        Y       11871
Self-heal Daemon on 10.70.37.108            N/A       N/A        Y       6910
Self-heal Daemon on 10.70.37.170            N/A       N/A        Y       27468
Self-heal Daemon on 10.70.37.151            N/A       N/A        Y       14502
Self-heal Daemon on 10.70.37.109            N/A       N/A        Y       9434
Self-heal Daemon on 10.70.37.126            N/A       N/A        Y       9556

Task Status of Volume testvol
------------------------------------------------------------------------------
Task                 : Tier migration
ID                   : 15464601-de6d-43c1-88d4-0e731d4219df
Status               : in progress

5. Started IO's on mountpoint

Actual results:
touch: cannot touch ‘a.txt’: Input/output error

Expected results:
File should be created without any error. After the Hot Tier bricks are down, IO's should pass to cold tier.

Additional info:
Unmounted the volume and remounted it again, then IO's are passing without any errors and all the IO's are moving to cold tier.

Bala,

Could you add the following three options to the protocol/client sections of the trusted-*.vol and testvol.vol files on all the bricks and run the test again:

    option transport.tcp-user-timeout 2
    option transport.socket.keepalive-time 2
    option transport.socket.keepalive-interval 1

Check whether you really need to remount for the "touch a.txt" operation to succeed, or whether it succeeds without a remount.

As an aside, how long does it take for the "touch a.txt" command to return with "Input/output error" after killing the hot bricks? Running "time touch a.txt" after killing the hot bricks could help.
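
If it helps to capture both the delay and the exact error in one go, the "time touch a.txt" check could be approximated with a small script. This is only a sketch: the function name `timed_create` is made up for illustration, and it only creates the file (it does not update timestamps the way touch does).

```python
import errno
import os
import time


def timed_create(path):
    """Try to create `path` and time the attempt.

    Returns (elapsed_seconds, errno_name); errno_name is None on success,
    otherwise a symbolic name such as "EIO" or "ENOTCONN".
    Hypothetical helper, not part of any gluster tooling.
    """
    start = time.monotonic()
    try:
        # O_CREAT without O_EXCL: succeeds whether or not the file already exists.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        os.close(fd)
        err = None
    except OSError as exc:
        # Map the numeric errno to its symbolic name where one is known.
        err = errno.errorcode.get(exc.errno, str(exc.errno))
    return time.monotonic() - start, err
```

Calling `timed_create("/mnt/fuse/a.txt")` right after killing the hot bricks would give the elapsed time together with the error name. "Input/output error" corresponds to EIO and "Transport endpoint is not connected" to ENOTCONN, so the two failure modes are easy to tell apart.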

Milind,

I set the options as you suggested and then tried to create a file. It now fails immediately, without any delay, with "Transport endpoint is not connected":

[root@dhcp37-74 ~]# touch /mnt/fuse/b.txt
touch: cannot touch ‘/mnt/fuse/b.txt’: Transport endpoint is not connected

As tier is not being actively developed, I'm closing this bug. Feel free to reopen it if necessary.