1351436 – IO ERROR when multiple graph switches

Bug 1351436 - IO ERROR when multiple graph switches

Summary: IO ERROR when multiple graph switches

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	libgfapi
Sub Component:
Version:	3.7.12
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	bugs@gluster.org
QA Contact:	Sudhir D
Docs Contact:
URL:
Whiteboard:
Depends On:	1343038 1347489 1365821 1367310
Blocks:
TreeView+	depends on / blocked

Reported:	2016-06-30 05:32 UTC by Poornima G
Modified:	2017-03-08 10:49 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Clone Of:	1347489
Environment:
Last Closed:	2017-03-08 10:49:06 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Poornima G 2016-06-30 05:32:06 UTC

+++ This bug was initially created as a clone of Bug #1347489 +++

+++ This bug was initially created as a clone of Bug #1343038 +++

Description of problem:
When the IO is going on a client, 2 or more graph switches one after the other can lead to IO Error from the client.

Version-Release number of selected component (if applicable):


How reproducible:
Attached is the reproducer.
It is also seen when, qemu(libgfapi based) has a VM on gluster storage and replace brick(add brick followed by remove brick) was executed.

Steps to Reproduce:
1. gcc -lgfapi tests/bugs/libgfapi/glfs_vol_set_IO_ERR.c -o tests/bugs/libgfapi/glfs_vol_set_IO_ERR -lgfapi
2. ./tests/bugs/libgfapi/glfs_vol_set_IO_ERR <volname> <log file>
3.

Actual results:
It exists with IO error

Expected results:
It should pass

Additional info:

--- Additional comment from Vijay Bellur on 2016-06-06 07:48:36 EDT ---

REVIEW: http://review.gluster.org/14656 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#1) for review on master by Poornima G (pgurusid)

--- Additional comment from Vijay Bellur on 2016-06-14 02:01:54 EDT ---

REVIEW: http://review.gluster.org/14722 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#1) for review on master by Poornima G (pgurusid)

--- Additional comment from Vijay Bellur on 2016-06-16 02:57:18 EDT ---

REVIEW: http://review.gluster.org/14656 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#2) for review on master by Poornima G (pgurusid)

--- Additional comment from Vijay Bellur on 2016-06-16 07:57:47 EDT ---

COMMIT: http://review.gluster.org/14656 committed in master by Jeff Darcy (jdarcy) 
------
commit b8ac20e888fbacad9d90cd8f1c6ff8579a5cefe9
Author: Poornima G <pgurusid>
Date:   Mon Jun 6 06:29:40 2016 -0400

    gfapi: Fix IO error caused when there is consecutive graph switches
    
    Issue:
    Consider a simple situation, where glfs_init() is done, i.e. initial
    graph is up. Now perform 2 volume sets that results in 2 client side
    graph changes. After this perform some IO, the IO fails with ENOTCON.
    The only way to recover this client is i guess another graph switch
    or restart.
    
    What actually is happening from code perspective:
    Initial graph lets say A, followed by 2 consecutive graph switches
    to B and C without any IO those two switches.
    
    - graph_setup (A) as a result of GF_EVENT_CHILD_UP, and
    fs->next_subvol = A
    
    - glfs_init() results in fs->active_subvol = A, fs->next_subvol = NULL
    
    - graph_setup (B) as a result of GF_EVENT_CHILD_UP, and
    fs->next_subvol = B
    
    - graph_setup (C) as a result of GF_EVENT_CHILD_UP, and
    fs->next_subvol = C. It also sees that the previous graph B was never
    set as fs->active_subvol, i.e. no IO or anything happened on B, so
    can safely send GF_EVENT_PARENT_DOWN (by calling glfs_subvol_done(B)).
    This parent down on B, results in child_down(B), which is fine.
    But child_down also triggers graph_setup(B).
    
    - graph_setup(B) as a result of GF_EVENT_CHILD_DOWN, and
    fs->next_subvol = B, and GF_EVENT_PARENT_DOWN on C as explained
    above. This again leads to GF_EVENT_CHILD_DOWN on C.
    
    - graph_setup(C) as a result of GF_EVENT_CHILD_DOWN, and
    fs->next_subvol = C, and GF_EVENT_PARENT_DOWN on B as explained
    above.
    
    Thus both the graphs B and C are disconnected, and hence the ENOTCON
    
    Solution:
    Remove the call to graph_setup() when the event is GF_EVENT_CHILD_DOWN.
    It don't see any reason why graph_setup should be called when there is
    child_down. Not sure what the original reason was, to have graph_setup
    in child_down. git hostory shows the first patch itself had this call.
    
    Change-Id: I9de86555f66cc94a05649ac863b40ed3426ffd4b
    BUG: 1343038
    Signed-off-by: Poornima G <pgurusid>
    Reviewed-on: http://review.gluster.org/14656
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Jeff Darcy <jdarcy>

--- Additional comment from Vijay Bellur on 2016-06-17 00:37:17 EDT ---

REVIEW: http://review.gluster.org/14747 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#1) for review on release-3.8 by Poornima G (pgurusid)

--- Additional comment from Vijay Bellur on 2016-06-17 07:35:29 EDT ---

COMMIT: http://review.gluster.org/14747 committed in release-3.8 by Jeff Darcy (jdarcy) 
------
commit cf1e98ff5bf8233803b4f74debee1b1f474765af
Author: Poornima G <pgurusid>
Date:   Mon Jun 6 06:29:40 2016 -0400

    gfapi: Fix IO error caused when there is consecutive graph switches
    
    Backport of http://review.gluster.org/#/c/14656/
    
    Issue:
    Consider a simple situation, where glfs_init() is done, i.e. initial
    graph is up. Now perform 2 volume sets that results in 2 client side
    graph changes. After this perform some IO, the IO fails with ENOTCON.
    The only way to recover this client is i guess another graph switch
    or restart.
    
    What actually is happening from code perspective:
    Initial graph lets say A, followed by 2 consecutive graph switches
    to B and C without any IO those two switches.
    
    - graph_setup (A) as a result of GF_EVENT_CHILD_UP, and
    fs->next_subvol = A
    
    - glfs_init() results in fs->active_subvol = A, fs->next_subvol = NULL
    
    - graph_setup (B) as a result of GF_EVENT_CHILD_UP, and
    fs->next_subvol = B
    
    - graph_setup (C) as a result of GF_EVENT_CHILD_UP, and
    fs->next_subvol = C. It also sees that the previous graph B was never
    set as fs->active_subvol, i.e. no IO or anything happened on B, so
    can safely send GF_EVENT_PARENT_DOWN (by calling glfs_subvol_done(B)).
    This parent down on B, results in child_down(B), which is fine.
    But child_down also triggers graph_setup(B).
    
    - graph_setup(B) as a result of GF_EVENT_CHILD_DOWN, and
    fs->next_subvol = B, and GF_EVENT_PARENT_DOWN on C as explained
    above. This again leads to GF_EVENT_CHILD_DOWN on C.
    
    - graph_setup(C) as a result of GF_EVENT_CHILD_DOWN, and
    fs->next_subvol = C, and GF_EVENT_PARENT_DOWN on B as explained
    above.
    
    Thus both the graphs B and C are disconnected, and hence the ENOTCON
    
    Solution:
    Remove the call to graph_setup() when the event is GF_EVENT_CHILD_DOWN.
    It don't see any reason why graph_setup should be called when there is
    child_down. Not sure what the original reason was, to have graph_setup
    in child_down. git hostory shows the first patch itself had this call.
    
    Change-Id: I9de86555f66cc94a05649ac863b40ed3426ffd4b
    BUG: 1347489
    Signed-off-by: Poornima G <pgurusid>
    Reviewed-on: http://review.gluster.org/14656
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Jeff Darcy <jdarcy>
    (cherry picked from commit b8ac20e888fbacad9d90cd8f1c6ff8579a5cefe9)
    Reviewed-on: http://review.gluster.org/14747

--- Additional comment from Vijay Bellur on 2016-06-30 01:31:06 EDT ---

REVIEW: http://review.gluster.org/14835 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#1) for review on release-3.7 by Poornima G (pgurusid)

Comment 1 Kaushal 2017-03-08 10:49:06 UTC

This bug is getting closed because GlusteFS-3.7 has reached its end-of-life.

Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS.
If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.

Note You need to log in before you can comment on or make changes to this bug.