Bug 1259081

Summary: I/O failure on attaching tier on fuse client
Product: [Community] GlusterFS Reporter: Dan Lambright <dlambrig>
Component: tiering Assignee: Nithya Balachandran <nbalacha>
Status: CLOSED CURRENTRELEASE QA Contact: bugs <bugs>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.7.5 CC: annair, bugs, dlambrig, josferna, nchilaka, rkavunga, rtalur, vagarwal
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.7.5 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1214289
: 1263549 (view as bug list) Environment:
Last Closed: 2015-10-14 10:30:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1214289    
Bug Blocks: 1186580, 1219547, 1228643, 1230692, 1260923, 1263549    

Description Dan Lambright 2015-09-01 23:19:02 UTC
+++ This bug was initially created as a clone of Bug #1214289 +++

Description of problem:
I/O failure on attaching tier

Version-Release number of selected component (if applicable):
glusterfs-server-3.7dev-0.994.git0d36d4f.el6.x86_64

How reproducible:


Steps to Reproduce:
1. Create a replica volume
2. Start a 100% write I/O workload on the volume
3. Attach a tier while the I/O is in progress
4. Attach tier is successful, but I/O fails

Actual results:
The I/Os fail. Here is the console output:

linux-2.6.31.1/arch/ia64/include/asm/sn/mspec.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/mspec.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/nodepda.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/nodepda.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/pcibr_provider.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/pcibr_provider.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/pcibus_provider_defs.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/pcibus_provider_defs.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/pcidev.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/pcidev.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/pda.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/pda.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/pic.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/pic.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/rw_mmr.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/rw_mmr.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/shub_mmr.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/shub_mmr.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/shubio.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/shubio.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/simulator.h
tar: linux-2.6.31.1/arch/ia64/include/asm/sn/simulator.h: Cannot open: Stale file handle
linux-2.6.31.1/arch/ia64/include/asm/sn/sn2/


Expected results:
I/O should continue normally while the tier is being added. Additionally, all new writes issued after the tier is attached should go to the hot tier.

Additional info:

--- Additional comment from Anoop on 2015-04-22 07:05:58 EDT ---

Volume info before attach:

Volume Name: vol1
Type: Replicate
Volume ID: b77d4050-7fdc-45ff-a084-f85eec2470fc
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.70.35.56:/rhs/brick1
Brick2: 10.70.35.67:/rhs/brick1

Volume Info post attach
Volume Name: vol1
Type: Tier
Volume ID: b77d4050-7fdc-45ff-a084-f85eec2470fc
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.67:/rhs/brick2
Brick2: 10.70.35.56:/rhs/brick2
Brick3: 10.70.35.56:/rhs/brick1
Brick4: 10.70.35.67:/rhs/brick1

--- Additional comment from Dan Lambright on 2015-04-22 15:46:08 EDT ---

When we attach a tier, the newly added translator has no cached subvolume for I/Os in flight, so I/Os to open files fail. I believe the solution is to recompute the cached subvolume for all open FDs with a lookup in tier_init. Working on a fix.

--- Additional comment from Anand Avati on 2015-04-28 16:28:27 EDT ---

REVIEW: http://review.gluster.org/10435 (cluster/tier: don't use hot tier until subvolumes ready (WIP)) posted (#1) for review on master by Dan Lambright (dlambrig)

--- Additional comment from Anand Avati on 2015-04-29 16:22:55 EDT ---

REVIEW: http://review.gluster.org/10435 (cluster/tier: don't use hot tier until subvolumes ready (WIP)) posted (#2) for review on master by Dan Lambright (dlambrig)

--- Additional comment from Anand Avati on 2015-04-29 18:05:44 EDT ---

REVIEW: http://review.gluster.org/10435 (cluster/tier: don't use hot tier until subvolumes ready (WIP)) posted (#3) for review on master by Dan Lambright (dlambrig)

--- Additional comment from Anand Avati on 2015-05-04 14:55:52 EDT ---

REVIEW: http://review.gluster.org/10435 (cluster/tier: don't use hot tier until subvolumes ready) posted (#4) for review on master by Dan Lambright (dlambrig)

--- Additional comment from Dan Lambright on 2015-05-04 14:57:34 EDT ---

There may still be a window where an I/O error can happen, but this fix should close most of them. The window can be closed completely once BZ 1156637 is resolved.

--- Additional comment from Anand Avati on 2015-05-05 11:36:32 EDT ---

COMMIT: http://review.gluster.org/10435 committed in master by Kaleb KEITHLEY (kkeithle) 
------
commit 377505a101eede8943f5a345e11a6901c4f8f420
Author: Dan Lambright <dlambrig>
Date:   Tue Apr 28 16:26:33 2015 -0400

    cluster/tier: don't use hot tier until subvolumes ready
    
    When we attach a tier, the hot tier becomes the hashed
    subvolume. But directories may not yet have been replicated by
    the fix layout process. Hence lookups to those directories
    will fail on the hot subvolume. We should only go to the hashed
    subvolume once the layout has been fixed. This is known if the
    layout for the parent directory does not have an error. If
    there is an error, the cold tier is considered the hashed
    subvolume. The exception to this rule is ENOCON, in which
    case we do not know where the file is and must abort.
    
    Note we may revalidate a lookup for a directory even if the
    inode has not yet been populated by FUSE. This case can
    happen in tiering (where one tier has completed a lookup
    but the other has not, in which case we revalidate one tier
    when we call lookup the second time). Such inodes are
    still invalid and should not be consulted for validation.
    
    Change-Id: Ia2bc62e1d807bd70590bd2a8300496264d73c523
    BUG: 1214289
    Signed-off-by: Dan Lambright <dlambrig>
    Reviewed-on: http://review.gluster.org/10435
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Raghavendra G <rgowdapp>
    Reviewed-by: N Balachandran <nbalacha>
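
As a rough illustration of the selection rule described in the commit message above, here is a minimal Python model. It is not the C code from the patch. The function name, the tier labels, and the reading of "ENOCON" as the errno value ENOTCONN are all assumptions made for this sketch.

```python
import errno
from typing import Optional

# Hypothetical labels for the two tiers; the real translator works with
# subvolume pointers, not strings.
HOT = "hot"
COLD = "cold"


def pick_hashed_subvolume(parent_layout_err: int) -> Optional[str]:
    """Return the tier to treat as the hashed subvolume, or None to abort.

    parent_layout_err is a stand-in for the error recorded in the parent
    directory's layout; 0 means the layout has been fixed.
    """
    if parent_layout_err == 0:
        # Layout has been fixed, so the directory exists on the hot tier.
        return HOT
    if parent_layout_err == errno.ENOTCONN:
        # Assumed meaning of "ENOCON": the location of the file is
        # unknown, so the operation must abort.
        return None
    # Any other layout error: fix-layout has not reached this directory
    # yet, so fall back to the cold tier.
    return COLD
```

With this model, a parent layout without an error selects the hot tier, a stale or otherwise failed layout falls back to the cold tier, and a connection error yields no subvolume at all.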

--- Additional comment from Anoop on 2015-05-13 08:31:00 EDT ---

Reproduced this on the BETA2 build too, hence moving it to ASSIGNED.

--- Additional comment from nchilaka on 2015-06-02 11:56:53 EDT ---

Seeing the following issue on the latest downstream build.
Steps to reproduce:
1) Create a dist-rep volume:
  gluster v create tiervol2 replica 2 10.70.46.233:/rhs/brick1/tiervol2 10.70.46.236:/rhs/brick1/tiervol2 10.70.46.240:/rhs/brick1/tiervol2 10.70.46.243:/rhs/brick1/tiervol2
2) Start the volume and issue commands like info and status
3) Mount the volume using NFS
4) Trigger some I/O on the volume
5) While the I/O is in progress, attach a tier

The tier gets attached successfully, but the I/O can no longer write.

Some observations worth noting:
1) This happens only when we mount using NFS. A glusterfs mount works well (Anoop, comment if you see the issue on a glusterfs mount too).
2) There seems to be a problem with the interaction between tiering and NFS, as the NFS ports are all down when I run the above scenario.
3) The issue is hit only when I/O is in progress while attaching the tier (although this will be the most relevant case at a customer site).


[root@rhsqa14-vm1 ~]# gluster v status tiervol2
Status of volume: tiervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.46.233:/rhs/brick1/tiervol2     49153     0          Y       1973 
Brick 10.70.46.236:/rhs/brick1/tiervol2     49154     0          Y       24453
Brick 10.70.46.240:/rhs/brick1/tiervol2     49154     0          Y       32272
Brick 10.70.46.243:/rhs/brick1/tiervol2     49153     0          Y       31759
NFS Server on localhost                     2049      0          Y       1992 
Self-heal Daemon on localhost               N/A       N/A        Y       2017 
NFS Server on 10.70.46.243                  2049      0          Y       31778
Self-heal Daemon on 10.70.46.243            N/A       N/A        Y       31790
NFS Server on 10.70.46.236                  2049      0          Y       24472
Self-heal Daemon on 10.70.46.236            N/A       N/A        Y       24482
NFS Server on 10.70.46.240                  2049      0          Y       32292
Self-heal Daemon on 10.70.46.240            N/A       N/A        Y       32312
 
Task Status of Volume tiervol2
------------------------------------------------------------------------------
There are no active volume tasks
 
[root@rhsqa14-vm1 ~]# gluster v info tiervol2
 
Volume Name: tiervol2
Type: Distributed-Replicate
Volume ID: a98f39c2-03ed-4ec7-909f-573b89a2a3e8
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.46.233:/rhs/brick1/tiervol2
Brick2: 10.70.46.236:/rhs/brick1/tiervol2
Brick3: 10.70.46.240:/rhs/brick1/tiervol2
Brick4: 10.70.46.243:/rhs/brick1/tiervol2
Options Reconfigured:
performance.readdir-ahead: on
[root@rhsqa14-vm1 ~]# #################Now i have mounted the regular dist-rep vol  https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.0.4.tar.xz
You have new mail in /var/spool/mail/root
[root@rhsqa14-vm1 ~]# #################Now i have mounted the regular dist-rep vol  tiervol2##########
[root@rhsqa14-vm1 ~]# ls /rhs/brick1/tiervol2
linux-4.0.4.tar.xz
[root@rhsqa14-vm1 ~]#  #################Next I will attach a tier while untaring the image, and will check status of vol, it will show nfs down###########
[root@rhsqa14-vm1 ~]# ls /rhs/brick1/tiervol2 ;gluster v attach-tier tiervol2 10.70.46.236:/rhs/brick2/tiervol2 10.70.46.240:/rhs/brick2/tiervol2
linux-4.0.4  linux-4.0.4.tar.xz
Attach tier is recommended only for testing purposes in this release. Do you want to continue? (y/n) y
volume attach-tier: success
volume rebalance: tiervol2: success: Rebalance on tiervol2 has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 1e59a5cc-2ff0-48ce-a34e-0521cbe65d73

You have mail in /var/spool/mail/root
[root@rhsqa14-vm1 ~]# ls /rhs/brick1/tiervol2
linux-4.0.4  linux-4.0.4.tar.xz
[root@rhsqa14-vm1 ~]# gluster v info tiervol2
 
Volume Name: tiervol2
Type: Tier
Volume ID: a98f39c2-03ed-4ec7-909f-573b89a2a3e8
Status: Started
Number of Bricks: 6
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distribute
Number of Bricks: 2
Brick1: 10.70.46.240:/rhs/brick2/tiervol2
Brick2: 10.70.46.236:/rhs/brick2/tiervol2
Cold Bricks:
Cold Tier Type : Distributed-Replicate
Number of Bricks: 2 x 2 = 4
Brick3: 10.70.46.233:/rhs/brick1/tiervol2
Brick4: 10.70.46.236:/rhs/brick1/tiervol2
Brick5: 10.70.46.240:/rhs/brick1/tiervol2
Brick6: 10.70.46.243:/rhs/brick1/tiervol2
Options Reconfigured:
performance.readdir-ahead: on
[root@rhsqa14-vm1 ~]# gluster v status tiervol2
Status of volume: tiervol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Hot Bricks:
Brick 10.70.46.240:/rhs/brick2/tiervol2     49155     0          Y       32411
Brick 10.70.46.236:/rhs/brick2/tiervol2     49155     0          Y       24590
Brick 10.70.46.233:/rhs/brick1/tiervol2     49153     0          Y       1973 
Brick 10.70.46.236:/rhs/brick1/tiervol2     49154     0          Y       24453
Brick 10.70.46.240:/rhs/brick1/tiervol2     49154     0          Y       32272
Brick 10.70.46.243:/rhs/brick1/tiervol2     49153     0          Y       31759
NFS Server on localhost                     N/A       N/A        N       N/A  
NFS Server on 10.70.46.236                  N/A       N/A        N       N/A  
NFS Server on 10.70.46.243                  N/A       N/A        N       N/A  
NFS Server on 10.70.46.240                  N/A       N/A        N       N/A  
 
Task Status of Volume tiervol2
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 1e59a5cc-2ff0-48ce-a34e-0521cbe65d73
Status               : in progress         
 



sosreport Logs attached

--- Additional comment from nchilaka on 2015-06-02 11:58:21 EDT ---



--- Additional comment from Anand Avati on 2015-06-04 14:01:07 EDT ---

REVIEW: http://review.gluster.org/11092 (cluster/tier: account for reordered layouts) posted (#1) for review on master by Dan Lambright (dlambrig)

--- Additional comment from Dan Lambright on 2015-06-04 14:04:52 EDT ---

Will give Nag a special build with fix 11092 and we will try to confirm the problem is in a reasonable state.

--- Additional comment from Anand Avati on 2015-06-05 11:08:09 EDT ---

REVIEW: http://review.gluster.org/11092 (cluster/tier: account for reordered layouts) posted (#2) for review on master by Dan Lambright (dlambrig)

--- Additional comment from Anand Avati on 2015-06-06 12:58:26 EDT ---

REVIEW: http://review.gluster.org/11092 (cluster/tier: account for reordered layouts) posted (#3) for review on master by Vijay Bellur (vbellur)

--- Additional comment from Anand Avati on 2015-06-09 16:52:57 EDT ---

REVIEW: http://review.gluster.org/11092 (cluster/tier: account for reordered layouts) posted (#4) for review on master by Dan Lambright (dlambrig)

--- Additional comment from Anand Avati on 2015-06-10 07:54:21 EDT ---

REVIEW: http://review.gluster.org/11092 (cluster/tier: account for reordered layouts) posted (#5) for review on master by Dan Lambright (dlambrig)

--- Additional comment from Dan Lambright on 2015-06-11 10:18:36 EDT ---



--- Additional comment from Anand Avati on 2015-06-11 10:37:59 EDT ---

REVIEW: http://review.gluster.org/11092 (cluster/tier: account for reordered layouts) posted (#6) for review on master by Dan Lambright (dlambrig)

--- Additional comment from Anand Avati on 2015-06-11 10:41:14 EDT ---

COMMIT: http://review.gluster.org/11092 committed in master by Niels de Vos (ndevos) 
------
commit b1ff2294d2aaf7dd36918837c09a68152adc0637
Author: Dan Lambright <dlambrig>
Date:   Thu Jun 4 14:00:34 2015 -0400

    cluster/tier: account for reordered layouts
    
    For a tiered volume the cold subvolume is always at a fixed
    position in the graph. DHT's layout array, on the other hand,
    may have the cold subvolume in either the first or second
    index, therefore code cannot make any assumptions. The fix
    searches the layout for the correct position dynamically
    rather than statically.
    
    The bug manifested itself in NFS, in which a newly attached
    subvolume had not received an existing directory. This case
    is a "stale entry" and marked as such in the layout for
    that directory.  The code did not see this, because it
    looked at the wrong index in the layout array.
    
    The fix also adds the check for decommissioned bricks, and
    fixes a problem in detach tier related to starting the
    rebalance process: we never received the right defrag
    command and it did not get directed to the tier translator.
    
    Change-Id: I77cdf9fbb0a777640c98003188565a79be9d0b56
    BUG: 1214289
    Signed-off-by: Dan Lambright <dlambrig>
    Tested-by: Gluster Build System <jenkins.com>
    Tested-by: NetBSD Build System <jenkins.org>
    Reviewed-by: Shyamsundar Ranganathan <srangana>
    Reviewed-by: Joseph Fernandes <josferna>
    Reviewed-by: Mohammed Rafi KC <rkavunga>
    Reviewed-on: http://review.gluster.org/11092
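
The idea of searching the layout rather than assuming a fixed index can be sketched as follows. This is a simplified Python model and not the C code from the patch. The LayoutEntry type, the function names, and the use of ESTALE to represent a "stale entry" are assumptions for illustration only.

```python
import errno
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LayoutEntry:
    subvol: str  # name of the subvolume this entry describes
    err: int     # 0 if healthy, otherwise an errno-style code


def find_entry(layout: List[LayoutEntry], subvol: str) -> Optional[LayoutEntry]:
    """Search the layout for a subvolume instead of assuming its index."""
    for entry in layout:
        if entry.subvol == subvol:
            return entry
    return None


def is_stale_on(layout: List[LayoutEntry], subvol: str) -> bool:
    """True if the layout marks the directory as stale on the subvolume."""
    entry = find_entry(layout, subvol)
    return entry is not None and entry.err == errno.ESTALE
```

Because the lookup is by name, the result is the same whether the cold subvolume appears first or second in the layout array, which is the property the commit message describes.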

Comment 1 Nag Pavan Chilakam 2015-09-04 07:32:41 UTC
Moving to failed_qa due to regressions seen in bz#1260003 and bz#1260012.

Comment 2 Mohammed Rafi KC 2015-09-16 07:13:06 UTC
upstream master patch : http://review.gluster.org/12184

Comment 3 Pranith Kumar K 2015-10-14 10:30:01 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.5, please open a new bug report.

glusterfs-3.7.5 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://www.gluster.org/pipermail/gluster-users/2015-October/023968.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

Comment 4 Pranith Kumar K 2015-10-14 10:38:41 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.5, please open a new bug report.

glusterfs-3.7.5 has been announced on the Gluster mailinglists [1], packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailinglist [2] and the update infrastructure for your distribution.

[1] http://www.gluster.org/pipermail/gluster-users/2015-October/023968.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user