+++ This bug was initially created as a clone of Bug #1214289 +++

Description of problem:
I/O failure on attaching a tier.

Version-Release number of selected component (if applicable):
glusterfs-server-3.7dev-0.994.git0d36d4f.el6.x86_64

How reproducible:

Steps to Reproduce:
1. Create a replica volume
2. Start 100% writes I/O on the volume
3. Attach a tier while the I/O is in progress
4. Attach tier is successful, but I/O fails

Actual results:
The I/Os fail. Here is the console o/p:

linux-2.6.31.1/arch/ia64/include/asm/sn/mspec.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/mspec.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/nodepda.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/nodepda.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/pcibr_provider.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/pcibr_provider.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/pcibus_provider_defs.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/pcibus_provider_defs.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/pcidev.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/pcidev.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/pda.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/pda.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/pic.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/pic.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/rw_mmr.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/rw_mmr.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/shub_mmr.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/shub_mmr.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/shubio.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/shubio.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/simulator.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/simulator.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/sn2/

Expected results:
I/O should continue normally while the tier is being added. Additionally, all new writes after the tier addition should go to the hot tier.

Additional info:

--- Additional comment from Anoop on 2015-04-22 07:05:58 EDT ---

Volume info before attach:

Volume Name: vol1
Type: Replicate
Volume ID: b77d4050-7fdc-45ff-a084-f85eec2470fc
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.35.56:/rhs/brick1
Brick2: 10.70.35.67:/rhs/brick1

Volume info post attach:

Volume Name: vol1
Type: Tier
Volume ID: b77d4050-7fdc-45ff-a084-f85eec2470fc
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.67:/rhs/brick2
Brick2: 10.70.35.56:/rhs/brick2
Brick3: 10.70.35.56:/rhs/brick1
Brick4: 10.70.35.67:/rhs/brick1

--- Additional comment from Dan Lambright on 2015-04-22 15:46:08 EDT ---

When we attach a tier, the newly added translator has no cached subvolume for I/Os in flight, so I/Os to open files fail. The solution, I believe, is to recompute the cached subvolume for all open fds with a lookup in tier_init. Working on a fix.
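For reference, the reproduction above condenses to the sketch below. The volume name, server addresses, and brick paths are taken from this report; the mount point and the exact tarball used for the write load are assumptions.

# Minimal sketch of the reproduction, assuming a FUSE mount on a client.
gluster volume create vol1 replica 2 \
    10.70.35.56:/rhs/brick1 10.70.35.67:/rhs/brick1
gluster volume start vol1
mkdir -p /mnt/vol1
mount -t glusterfs 10.70.35.56:/vol1 /mnt/vol1

# 100% write workload, e.g. untarring a kernel tree (tarball path assumed).
( cd /mnt/vol1 && tar xjf /root/linux-2.6.31.1.tar.bz2 ) &

# Attach the tier while the writes are in flight. With the bug present,
# the attach succeeds but tar starts failing with "Stale file handle".
gluster volume attach-tier vol1 \
    10.70.35.67:/rhs/brick2 10.70.35.56:/rhs/brick2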
--- Additional comment from Anand Avati on 2015-04-28 16:28:27 EDT ---

REVIEW: http://review.gluster.org/10435 (cluster/tier: don't use hot tier until subvolumes ready (WIP)) posted (#1) for review on master by Dan Lambright (dlambrig)

--- Additional comment from Anand Avati on 2015-04-29 16:22:55 EDT ---

REVIEW: http://review.gluster.org/10435 (cluster/tier: don't use hot tier until subvolumes ready (WIP)) posted (#2) for review on master by Dan Lambright (dlambrig)

--- Additional comment from Anand Avati on 2015-04-29 18:05:44 EDT ---

REVIEW: http://review.gluster.org/10435 (cluster/tier: don't use hot tier until subvolumes ready (WIP)) posted (#3) for review on master by Dan Lambright (dlambrig)

--- Additional comment from Anand Avati on 2015-05-04 14:55:52 EDT ---

REVIEW: http://review.gluster.org/10435 (cluster/tier: don't use hot tier until subvolumes ready) posted (#4) for review on master by Dan Lambright (dlambrig)

--- Additional comment from Dan Lambright on 2015-05-04 14:57:34 EDT ---

There may still be a window in which an I/O error can happen, but this fix should close most of it. The window can be completely closed once BZ 1156637 is resolved.

--- Additional comment from Anand Avati on 2015-05-05 11:36:32 EDT ---

COMMIT: http://review.gluster.org/10435 committed in master by Kaleb KEITHLEY (kkeithle)
------
commit 377505a101eede8943f5a345e11a6901c4f8f420
Author: Dan Lambright <dlambrig>
Date:   Tue Apr 28 16:26:33 2015 -0400

    cluster/tier: don't use hot tier until subvolumes ready

    When we attach a tier, the hot tier becomes the hashed subvolume, but
    directories may not yet have been replicated by the fix-layout
    process. Hence lookups to those directories will fail on the hot
    subvolume. We should only go to the hashed subvolume once the layout
    has been fixed. This is known if the layout for the parent directory
    does not have an error. If there is an error, the cold tier is
    considered the hashed subvolume. The exception to this rule is
    ENOTCONN, in which case we do not know where the file is and must
    abort.

    Note we may revalidate a lookup for a directory even if the inode has
    not yet been populated by FUSE. This case can happen in tiering
    (where one tier has completed a lookup but the other has not, in
    which case we revalidate one tier when we call lookup the second
    time). Such inodes are still invalid and should not be consulted for
    validation.

    Change-Id: Ia2bc62e1d807bd70590bd2a8300496264d73c523
    BUG: 1214289
    Signed-off-by: Dan Lambright <dlambrig>
    Reviewed-on: http://review.gluster.org/10435
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Raghavendra G <rgowdapp>
    Reviewed-by: N Balachandran <nbalacha>
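A practical consequence of the commit above: the hot tier only becomes safe to hash to once the fix-layout has run. A minimal way to watch for that from the CLI, assuming the vol1 volume from this report (the grep on "completed" is an assumption about the status output text):

# attach-tier kicks off a background rebalance/fix-layout; poll its
# status before driving heavy I/O at the freshly attached hot tier.
gluster volume rebalance vol1 status

# e.g. loop until the status column reports completion:
until gluster volume rebalance vol1 status | grep -q completed; do
    sleep 5
done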
--- Additional comment from Anoop on 2015-05-13 08:31:00 EDT ---

Reproduced this on the BETA2 build too, hence moving it to ASSIGNED.

--- Additional comment from nchilaka on 2015-06-02 11:56:53 EDT ---

Seeing the following issue on the latest downstream build.

Steps to reproduce:
1) Create a dist-rep volume:
   gluster v create tiervol2 replica 2 10.70.46.233:/rhs/brick1/tiervol2 10.70.46.236:/rhs/brick1/tiervol2 10.70.46.240:/rhs/brick1/tiervol2 10.70.46.243:/rhs/brick1/tiervol2
2) Start it, and issue commands like info and status.
3) Now mount it using NFS.
4) Trigger some I/O on this volume.
5) While the I/O is happening, attach a tier.

The tier gets attached successfully, but the I/O fails from then on.

Some observations worth noting:
1) This happens only when we mount using NFS; with a glusterfs (FUSE) mount it works well. (Anoop, comment if you see the issue even on a glusterfs mount.)
2) It seems to be some problem with the tiering/NFS interaction, as I see that the NFS ports are all down when I run the above scenario.
3) This issue is hit only when I/O was in progress while attaching the tier (although this will be the most common case at a customer site).

[root@rhsqa14-vm1 ~]# gluster v status tiervol2
Status of volume: tiervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.233:/rhs/brick1/tiervol2     49153     0          Y       1973
Brick 10.70.46.236:/rhs/brick1/tiervol2     49154     0          Y       24453
Brick 10.70.46.240:/rhs/brick1/tiervol2     49154     0          Y       32272
Brick 10.70.46.243:/rhs/brick1/tiervol2     49153     0          Y       31759
NFS Server on localhost                     2049      0          Y       1992
Self-heal Daemon on localhost               N/A       N/A        Y       2017
NFS Server on 10.70.46.243                  2049      0          Y       31778
Self-heal Daemon on 10.70.46.243            N/A       N/A        Y       31790
NFS Server on 10.70.46.236                  2049      0          Y       24472
Self-heal Daemon on 10.70.46.236            N/A       N/A        Y       24482
NFS Server on 10.70.46.240                  2049      0          Y       32292
Self-heal Daemon on 10.70.46.240            N/A       N/A        Y       32312

Task Status of Volume tiervol2
------------------------------------------------------------------------------
There are no active volume tasks

[root@rhsqa14-vm1 ~]# gluster v info tiervol2

Volume Name: tiervol2
Type: Distributed-Replicate
Volume ID: a98f39c2-03ed-4ec7-909f-573b89a2a3e8
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.46.233:/rhs/brick1/tiervol2
Brick2: 10.70.46.236:/rhs/brick1/tiervol2
Brick3: 10.70.46.240:/rhs/brick1/tiervol2
Brick4: 10.70.46.243:/rhs/brick1/tiervol2
Options Reconfigured:
performance.readdir-ahead: on

[root@rhsqa14-vm1 ~]# #################Now i have mounted the regular dist-rep vol
https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.0.4.tar.xz
You have new mail in /var/spool/mail/root
[root@rhsqa14-vm1 ~]# #################Now i have mounted the regular dist-rep vol tiervol2##########
[root@rhsqa14-vm1 ~]# ls /rhs/brick1/tiervol2
linux-4.0.4.tar.xz
[root@rhsqa14-vm1 ~]# #################Next I will attach a tier while untaring the image, and will check status of vol, it will show nfs down###########
[root@rhsqa14-vm1 ~]# ls /rhs/brick1/tiervol2; gluster v attach-tier tiervol2 10.70.46.236:/rhs/brick2/tiervol2 10.70.46.240:/rhs/brick2/tiervol2
linux-4.0.4  linux-4.0.4.tar.xz
Attach tier is recommended only for testing purposes in this release. Do you want to continue? (y/n) y
volume attach-tier: success
volume rebalance: tiervol2: success: Rebalance on tiervol2 has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 1e59a5cc-2ff0-48ce-a34e-0521cbe65d73
You have mail in /var/spool/mail/root
[root@rhsqa14-vm1 ~]# ls /rhs/brick1/tiervol2
linux-4.0.4  linux-4.0.4.tar.xz
[root@rhsqa14-vm1 ~]# gluster v info tiervol2

Volume Name: tiervol2
Type: Tier
Volume ID: a98f39c2-03ed-4ec7-909f-573b89a2a3e8
Status: Started
Number of Bricks: 6
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distribute
Number of Bricks: 2
Brick1: 10.70.46.240:/rhs/brick2/tiervol2
Brick2: 10.70.46.236:/rhs/brick2/tiervol2
Cold Bricks:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick3: 10.70.46.233:/rhs/brick1/tiervol2
Brick4: 10.70.46.236:/rhs/brick1/tiervol2
Brick5: 10.70.46.240:/rhs/brick1/tiervol2
Brick6: 10.70.46.243:/rhs/brick1/tiervol2
Options Reconfigured:
performance.readdir-ahead: on

[root@rhsqa14-vm1 ~]# gluster v status tiervol2
Status of volume: tiervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick 10.70.46.240:/rhs/brick2/tiervol2     49155     0          Y       32411
Brick 10.70.46.236:/rhs/brick2/tiervol2     49155     0          Y       24590
Brick 10.70.46.233:/rhs/brick1/tiervol2     49153     0          Y       1973
Brick 10.70.46.236:/rhs/brick1/tiervol2     49154     0          Y       24453
Brick 10.70.46.240:/rhs/brick1/tiervol2     49154     0          Y       32272
Brick 10.70.46.243:/rhs/brick1/tiervol2     49153     0          Y       31759
NFS Server on localhost                     N/A       N/A        N       N/A
NFS Server on 10.70.46.236                  N/A       N/A        N       N/A
NFS Server on 10.70.46.243                  N/A       N/A        N       N/A
NFS Server on 10.70.46.240                  N/A       N/A        N       N/A

Task Status of Volume tiervol2
------------------------------------------------------------------------------
Task   : Rebalance
ID     : 1e59a5cc-2ff0-48ce-a34e-0521cbe65d73
Status : in progress

sosreport logs attached.

--- Additional comment from nchilaka on 2015-06-02 11:58:21 EDT ---
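Condensed, the NFS scenario above looks like the sketch below. The mount point and the use of the kernel tarball as the write load are assumptions; gluster's built-in NFS server speaks NFSv3 only, hence vers=3.

# Sketch of the NFS reproduction, assuming the tarball was already
# copied onto the volume as shown in the transcript above.
mkdir -p /mnt/tiervol2
mount -t nfs -o vers=3 10.70.46.233:/tiervol2 /mnt/tiervol2
( cd /mnt/tiervol2 && tar xf linux-4.0.4.tar.xz ) &

# Attach the tier while the untar is running; with the bug present the
# attach succeeds but the NFS servers go down and the I/O fails.
gluster v attach-tier tiervol2 \
    10.70.46.236:/rhs/brick2/tiervol2 10.70.46.240:/rhs/brick2/tiervol2

# The NFS rows in volume status then show offline; the NFS server log
# (typically /var/log/glusterfs/nfs.log) has the details.
gluster v status tiervol2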
--- Additional comment from Anand Avati on 2015-06-04 14:01:07 EDT ---

REVIEW: http://review.gluster.org/11092 (cluster/tier: account for reordered layouts) posted (#1) for review on master by Dan Lambright (dlambrig)

--- Additional comment from Dan Lambright on 2015-06-04 14:04:52 EDT ---

I will give Nag a special build with fix 11092 and we will try to confirm the problem is in a reasonable state.

--- Additional comment from Anand Avati on 2015-06-05 11:08:09 EDT ---

REVIEW: http://review.gluster.org/11092 (cluster/tier: account for reordered layouts) posted (#2) for review on master by Dan Lambright (dlambrig)

--- Additional comment from Anand Avati on 2015-06-06 12:58:26 EDT ---

REVIEW: http://review.gluster.org/11092 (cluster/tier: account for reordered layouts) posted (#3) for review on master by Vijay Bellur (vbellur)

--- Additional comment from Anand Avati on 2015-06-09 16:52:57 EDT ---

REVIEW: http://review.gluster.org/11092 (cluster/tier: account for reordered layouts) posted (#4) for review on master by Dan Lambright (dlambrig)

--- Additional comment from Anand Avati on 2015-06-10 07:54:21 EDT ---

REVIEW: http://review.gluster.org/11092 (cluster/tier: account for reordered layouts) posted (#5) for review on master by Dan Lambright (dlambrig)

This can be reproduced on FUSE as follows (a consolidated sketch follows this list):
1. Create a dist-rep volume and start it.
2. Mount the volume to /mnt.
3. mkdir -p /mnt/d1/d2/d3
4. cd /mnt/d1/d2
5. Stop the volume. # necessary so fix-layout does not start
6. Attach a tier.
7. Start the volume.
8. cat d3/bob.txt

At step 8 you will see "Stale file handle". It appears to be a hole in DHT self-heal.
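A minimal consolidated version of those steps; the host names and brick paths are hypothetical placeholders, and only the d1/d2/d3 directories, the /mnt mount point, and bob.txt come from the comment above.

# Hypothetical 2x2 dist-rep volume; HOST1/HOST2 and all brick paths
# are placeholders, not from this report.
gluster volume create distrep replica 2 \
    HOST1:/bricks/b1 HOST2:/bricks/b1 HOST1:/bricks/b2 HOST2:/bricks/b2
gluster volume start distrep
mount -t glusterfs HOST1:/distrep /mnt

mkdir -p /mnt/d1/d2/d3
echo hello > /mnt/d1/d2/d3/bob.txt   # create the test file (assumed)
cd /mnt/d1/d2

# Stop the volume first so the fix-layout does not run after attach,
# then attach the tier and start the volume again.
gluster volume stop distrep
gluster volume attach-tier distrep HOST1:/bricks/hot1 HOST2:/bricks/hot2
gluster volume start distrep

# With the bug present, this fails with "Stale file handle" (ESTALE).
cat d3/bob.txt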
*** This bug has been marked as a duplicate of bug 1228643 ***