1347489 – IO ERROR when multiple graph switches

Summary: IO ERROR when multiple graph switches

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	libgfapi
Sub Component:
Version:	3.8.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Poornima G
QA Contact:	Sudhir D
Docs Contact:
URL:
Whiteboard:
Depends On:	1343038 1365821 1367310
Blocks:	1351436

Reported:	2016-06-17 04:34 UTC by Poornima G
Modified:	2016-08-16 07:49 UTC
CC List:	2 users
Fixed In Version:	glusterfs-3.8.1
Clone Of:	1343038
Clones:	1351436
Environment:
Last Closed:	2016-07-08 14:43:58 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Description Poornima G 2016-06-17 04:34:48 UTC

+++ This bug was initially created as a clone of Bug #1343038 +++

Description of problem:
While IO is in progress on a client, 2 or more graph switches in quick succession can lead to an IO error on the client.

Version-Release number of selected component (if applicable):


How reproducible:
Attached is the reproducer.
It is also seen when qemu (libgfapi based) has a VM on gluster storage and a replace brick (add brick followed by remove brick) is executed.

Steps to Reproduce:
1. gcc tests/bugs/libgfapi/glfs_vol_set_IO_ERR.c -o tests/bugs/libgfapi/glfs_vol_set_IO_ERR -lgfapi
2. ./tests/bugs/libgfapi/glfs_vol_set_IO_ERR <volname> <log file>

Actual results:
It exits with an IO error.

Expected results:
It should pass

Additional info:

--- Additional comment from Vijay Bellur on 2016-06-06 07:48:36 EDT ---

REVIEW: http://review.gluster.org/14656 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#1) for review on master by Poornima G (pgurusid)

--- Additional comment from Vijay Bellur on 2016-06-14 02:01:54 EDT ---

REVIEW: http://review.gluster.org/14722 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#1) for review on master by Poornima G (pgurusid)

--- Additional comment from Vijay Bellur on 2016-06-16 02:57:18 EDT ---

REVIEW: http://review.gluster.org/14656 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#2) for review on master by Poornima G (pgurusid)

--- Additional comment from Vijay Bellur on 2016-06-16 07:57:47 EDT ---

COMMIT: http://review.gluster.org/14656 committed in master by Jeff Darcy (jdarcy) 
------
commit b8ac20e888fbacad9d90cd8f1c6ff8579a5cefe9
Author: Poornima G <pgurusid>
Date:   Mon Jun 6 06:29:40 2016 -0400

    gfapi: Fix IO error caused when there is consecutive graph switches
    
    Issue:
    Consider a simple situation where glfs_init() is done, i.e. the initial
    graph is up. Now perform 2 volume sets that result in 2 client-side
    graph changes. After this, perform some IO; the IO fails with ENOTCONN.
    The only way to recover this client is, I guess, another graph switch
    or a restart.
    
    What actually is happening from code perspective:
    Let the initial graph be A, followed by 2 consecutive graph switches
    to B and C without any IO between those two switches.
    
    - graph_setup (A) as a result of GF_EVENT_CHILD_UP, and
    fs->next_subvol = A
    
    - glfs_init() results in fs->active_subvol = A, fs->next_subvol = NULL
    
    - graph_setup (B) as a result of GF_EVENT_CHILD_UP, and
    fs->next_subvol = B
    
    - graph_setup (C) as a result of GF_EVENT_CHILD_UP, and
    fs->next_subvol = C. It also sees that the previous graph B was never
    set as fs->active_subvol, i.e. no IO or anything happened on B, so it
    can safely send GF_EVENT_PARENT_DOWN (by calling glfs_subvol_done(B)).
    This parent down on B results in child_down(B), which is fine.
    But child_down also triggers graph_setup(B).
    
    - graph_setup(B) as a result of GF_EVENT_CHILD_DOWN, and
    fs->next_subvol = B, and GF_EVENT_PARENT_DOWN on C as explained
    above. This again leads to GF_EVENT_CHILD_DOWN on C.
    
    - graph_setup(C) as a result of GF_EVENT_CHILD_DOWN, and
    fs->next_subvol = C, and GF_EVENT_PARENT_DOWN on B as explained
    above.
    
    Thus both graphs B and C are disconnected, hence the ENOTCONN.
    
    Solution:
    Remove the call to graph_setup() when the event is GF_EVENT_CHILD_DOWN.
    I don't see any reason why graph_setup should be called when there is
    a child_down. Not sure what the original reason was for having graph_setup
    in child_down; git history shows the very first patch already had this call.
    
    Change-Id: I9de86555f66cc94a05649ac863b40ed3426ffd4b
    BUG: 1343038
    Signed-off-by: Poornima G <pgurusid>
    Reviewed-on: http://review.gluster.org/14656
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Jeff Darcy <jdarcy>

Comment 1 Vijay Bellur 2016-06-17 04:37:17 UTC

REVIEW: http://review.gluster.org/14747 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#1) for review on release-3.8 by Poornima G (pgurusid)

Comment 2 Vijay Bellur 2016-06-17 11:35:29 UTC

COMMIT: http://review.gluster.org/14747 committed in release-3.8 by Jeff Darcy (jdarcy) 
------
commit cf1e98ff5bf8233803b4f74debee1b1f474765af
Author: Poornima G <pgurusid>
Date:   Mon Jun 6 06:29:40 2016 -0400

    gfapi: Fix IO error caused when there is consecutive graph switches
    
    Backport of http://review.gluster.org/#/c/14656/
    
    Issue:
    Consider a simple situation where glfs_init() is done, i.e. the initial
    graph is up. Now perform 2 volume sets that result in 2 client-side
    graph changes. After this, perform some IO; the IO fails with ENOTCONN.
    The only way to recover this client is, I guess, another graph switch
    or a restart.
    
    What actually is happening from code perspective:
    Let the initial graph be A, followed by 2 consecutive graph switches
    to B and C without any IO between those two switches.
    
    - graph_setup (A) as a result of GF_EVENT_CHILD_UP, and
    fs->next_subvol = A
    
    - glfs_init() results in fs->active_subvol = A, fs->next_subvol = NULL
    
    - graph_setup (B) as a result of GF_EVENT_CHILD_UP, and
    fs->next_subvol = B
    
    - graph_setup (C) as a result of GF_EVENT_CHILD_UP, and
    fs->next_subvol = C. It also sees that the previous graph B was never
    set as fs->active_subvol, i.e. no IO or anything happened on B, so it
    can safely send GF_EVENT_PARENT_DOWN (by calling glfs_subvol_done(B)).
    This parent down on B results in child_down(B), which is fine.
    But child_down also triggers graph_setup(B).
    
    - graph_setup(B) as a result of GF_EVENT_CHILD_DOWN, and
    fs->next_subvol = B, and GF_EVENT_PARENT_DOWN on C as explained
    above. This again leads to GF_EVENT_CHILD_DOWN on C.
    
    - graph_setup(C) as a result of GF_EVENT_CHILD_DOWN, and
    fs->next_subvol = C, and GF_EVENT_PARENT_DOWN on B as explained
    above.
    
    Thus both graphs B and C are disconnected, hence the ENOTCONN.
    
    Solution:
    Remove the call to graph_setup() when the event is GF_EVENT_CHILD_DOWN.
    I don't see any reason why graph_setup should be called when there is
    a child_down. Not sure what the original reason was for having graph_setup
    in child_down; git history shows the very first patch already had this call.
    
    Change-Id: I9de86555f66cc94a05649ac863b40ed3426ffd4b
    BUG: 1347489
    Signed-off-by: Poornima G <pgurusid>
    Reviewed-on: http://review.gluster.org/14656
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Jeff Darcy <jdarcy>
    (cherry picked from commit b8ac20e888fbacad9d90cd8f1c6ff8579a5cefe9)
    Reviewed-on: http://review.gluster.org/14747

Comment 3 Vijay Bellur 2016-06-30 05:31:06 UTC

REVIEW: http://review.gluster.org/14835 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#1) for review on release-3.7 by Poornima G (pgurusid)

Comment 4 Niels de Vos 2016-07-08 14:43:58 UTC

This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-3.8.1, please open a new bug report.

glusterfs-3.8.1 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.packaging/156
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user
