+++ This bug was initially created as a clone of Bug #1347489 +++ +++ This bug was initially created as a clone of Bug #1343038 +++ Description of problem: When the IO is going on a client, 2 or more graph switches one after the other can lead to IO Error from the client. Version-Release number of selected component (if applicable): How reproducible: Attached is the reproducer. It is also seen when, qemu(libgfapi based) has a VM on gluster storage and replace brick(add brick followed by remove brick) was executed. Steps to Reproduce: 1. gcc -lgfapi tests/bugs/libgfapi/glfs_vol_set_IO_ERR.c -o tests/bugs/libgfapi/glfs_vol_set_IO_ERR -lgfapi 2. ./tests/bugs/libgfapi/glfs_vol_set_IO_ERR <volname> <log file> 3. Actual results: It exists with IO error Expected results: It should pass Additional info: --- Additional comment from Vijay Bellur on 2016-06-06 07:48:36 EDT --- REVIEW: http://review.gluster.org/14656 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#1) for review on master by Poornima G (pgurusid) --- Additional comment from Vijay Bellur on 2016-06-14 02:01:54 EDT --- REVIEW: http://review.gluster.org/14722 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#1) for review on master by Poornima G (pgurusid) --- Additional comment from Vijay Bellur on 2016-06-16 02:57:18 EDT --- REVIEW: http://review.gluster.org/14656 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#2) for review on master by Poornima G (pgurusid) --- Additional comment from Vijay Bellur on 2016-06-16 07:57:47 EDT --- COMMIT: http://review.gluster.org/14656 committed in master by Jeff Darcy (jdarcy) ------ commit b8ac20e888fbacad9d90cd8f1c6ff8579a5cefe9 Author: Poornima G <pgurusid> Date: Mon Jun 6 06:29:40 2016 -0400 gfapi: Fix IO error caused when there is consecutive graph switches Issue: Consider a simple situation, where glfs_init() is done, i.e. initial graph is up. Now perform 2 volume sets that results in 2 client side graph changes. After this perform some IO, the IO fails with ENOTCON. The only way to recover this client is i guess another graph switch or restart. What actually is happening from code perspective: Initial graph lets say A, followed by 2 consecutive graph switches to B and C without any IO those two switches. - graph_setup (A) as a result of GF_EVENT_CHILD_UP, and fs->next_subvol = A - glfs_init() results in fs->active_subvol = A, fs->next_subvol = NULL - graph_setup (B) as a result of GF_EVENT_CHILD_UP, and fs->next_subvol = B - graph_setup (C) as a result of GF_EVENT_CHILD_UP, and fs->next_subvol = C. It also sees that the previous graph B was never set as fs->active_subvol, i.e. no IO or anything happened on B, so can safely send GF_EVENT_PARENT_DOWN (by calling glfs_subvol_done(B)). This parent down on B, results in child_down(B), which is fine. But child_down also triggers graph_setup(B). - graph_setup(B) as a result of GF_EVENT_CHILD_DOWN, and fs->next_subvol = B, and GF_EVENT_PARENT_DOWN on C as explained above. This again leads to GF_EVENT_CHILD_DOWN on C. - graph_setup(C) as a result of GF_EVENT_CHILD_DOWN, and fs->next_subvol = C, and GF_EVENT_PARENT_DOWN on B as explained above. Thus both the graphs B and C are disconnected, and hence the ENOTCON Solution: Remove the call to graph_setup() when the event is GF_EVENT_CHILD_DOWN. It don't see any reason why graph_setup should be called when there is child_down. Not sure what the original reason was, to have graph_setup in child_down. git hostory shows the first patch itself had this call. Change-Id: I9de86555f66cc94a05649ac863b40ed3426ffd4b BUG: 1343038 Signed-off-by: Poornima G <pgurusid> Reviewed-on: http://review.gluster.org/14656 Smoke: Gluster Build System <jenkins.org> NetBSD-regression: NetBSD Build System <jenkins.org> CentOS-regression: Gluster Build System <jenkins.org> Reviewed-by: Jeff Darcy <jdarcy> --- Additional comment from Vijay Bellur on 2016-06-17 00:37:17 EDT --- REVIEW: http://review.gluster.org/14747 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#1) for review on release-3.8 by Poornima G (pgurusid) --- Additional comment from Vijay Bellur on 2016-06-17 07:35:29 EDT --- COMMIT: http://review.gluster.org/14747 committed in release-3.8 by Jeff Darcy (jdarcy) ------ commit cf1e98ff5bf8233803b4f74debee1b1f474765af Author: Poornima G <pgurusid> Date: Mon Jun 6 06:29:40 2016 -0400 gfapi: Fix IO error caused when there is consecutive graph switches Backport of http://review.gluster.org/#/c/14656/ Issue: Consider a simple situation, where glfs_init() is done, i.e. initial graph is up. Now perform 2 volume sets that results in 2 client side graph changes. After this perform some IO, the IO fails with ENOTCON. The only way to recover this client is i guess another graph switch or restart. What actually is happening from code perspective: Initial graph lets say A, followed by 2 consecutive graph switches to B and C without any IO those two switches. - graph_setup (A) as a result of GF_EVENT_CHILD_UP, and fs->next_subvol = A - glfs_init() results in fs->active_subvol = A, fs->next_subvol = NULL - graph_setup (B) as a result of GF_EVENT_CHILD_UP, and fs->next_subvol = B - graph_setup (C) as a result of GF_EVENT_CHILD_UP, and fs->next_subvol = C. It also sees that the previous graph B was never set as fs->active_subvol, i.e. no IO or anything happened on B, so can safely send GF_EVENT_PARENT_DOWN (by calling glfs_subvol_done(B)). This parent down on B, results in child_down(B), which is fine. But child_down also triggers graph_setup(B). - graph_setup(B) as a result of GF_EVENT_CHILD_DOWN, and fs->next_subvol = B, and GF_EVENT_PARENT_DOWN on C as explained above. This again leads to GF_EVENT_CHILD_DOWN on C. - graph_setup(C) as a result of GF_EVENT_CHILD_DOWN, and fs->next_subvol = C, and GF_EVENT_PARENT_DOWN on B as explained above. Thus both the graphs B and C are disconnected, and hence the ENOTCON Solution: Remove the call to graph_setup() when the event is GF_EVENT_CHILD_DOWN. It don't see any reason why graph_setup should be called when there is child_down. Not sure what the original reason was, to have graph_setup in child_down. git hostory shows the first patch itself had this call. Change-Id: I9de86555f66cc94a05649ac863b40ed3426ffd4b BUG: 1347489 Signed-off-by: Poornima G <pgurusid> Reviewed-on: http://review.gluster.org/14656 Smoke: Gluster Build System <jenkins.org> NetBSD-regression: NetBSD Build System <jenkins.org> CentOS-regression: Gluster Build System <jenkins.org> Reviewed-by: Jeff Darcy <jdarcy> (cherry picked from commit b8ac20e888fbacad9d90cd8f1c6ff8579a5cefe9) Reviewed-on: http://review.gluster.org/14747 --- Additional comment from Vijay Bellur on 2016-06-30 01:31:06 EDT --- REVIEW: http://review.gluster.org/14835 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#1) for review on release-3.7 by Poornima G (pgurusid)
This bug is getting closed because GlusteFS-3.7 has reached its end-of-life. Note: This bug is being closed using a script. No verification has been performed to check if it still exists on newer releases of GlusterFS. If this bug still exists in newer GlusterFS releases, please reopen this bug against the newer release.