+++ This bug was initially created as a clone of Bug #1343038 +++

Description of problem:
When IO is in progress on a client, two or more graph switches one after the other can lead to an IO error on the client.

Version-Release number of selected component (if applicable):

How reproducible:
Attached is the reproducer. It is also seen when qemu (libgfapi based) has a VM on gluster storage and a replace brick (add brick followed by remove brick) is executed.

Steps to Reproduce:
1. gcc tests/bugs/libgfapi/glfs_vol_set_IO_ERR.c -o tests/bugs/libgfapi/glfs_vol_set_IO_ERR -lgfapi
2. ./tests/bugs/libgfapi/glfs_vol_set_IO_ERR <volname> <log file>

Actual results:
It exits with an IO error.

Expected results:
It should pass.

Additional info:

--- Additional comment from Vijay Bellur on 2016-06-06 07:48:36 EDT ---

REVIEW: http://review.gluster.org/14656 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#1) for review on master by Poornima G (pgurusid)

--- Additional comment from Vijay Bellur on 2016-06-14 02:01:54 EDT ---

REVIEW: http://review.gluster.org/14722 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#1) for review on master by Poornima G (pgurusid)

--- Additional comment from Vijay Bellur on 2016-06-16 02:57:18 EDT ---

REVIEW: http://review.gluster.org/14656 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#2) for review on master by Poornima G (pgurusid)

--- Additional comment from Vijay Bellur on 2016-06-16 07:57:47 EDT ---

COMMIT: http://review.gluster.org/14656 committed in master by Jeff Darcy (jdarcy)
------
commit b8ac20e888fbacad9d90cd8f1c6ff8579a5cefe9
Author: Poornima G <pgurusid>
Date: Mon Jun 6 06:29:40 2016 -0400

gfapi: Fix IO error caused when there is consecutive graph switches

Issue:
Consider a simple situation where glfs_init() is done, i.e. the initial graph is up. Now perform 2 volume sets that result in 2 client-side graph changes. After this, perform some IO; the IO fails with ENOTCONN. The only way to recover the client is, I guess, another graph switch or a restart.

What is actually happening from the code perspective:
Let the initial graph be A, followed by 2 consecutive graph switches to B and C without any IO between those two switches.

- graph_setup(A) as a result of GF_EVENT_CHILD_UP, and fs->next_subvol = A
- glfs_init() results in fs->active_subvol = A, fs->next_subvol = NULL
- graph_setup(B) as a result of GF_EVENT_CHILD_UP, and fs->next_subvol = B
- graph_setup(C) as a result of GF_EVENT_CHILD_UP, and fs->next_subvol = C. It also sees that the previous graph B was never set as fs->active_subvol, i.e. no IO or anything else happened on B, so it can safely send GF_EVENT_PARENT_DOWN (by calling glfs_subvol_done(B)). This parent down on B results in child_down(B), which is fine. But child_down also triggers graph_setup(B).
- graph_setup(B) as a result of GF_EVENT_CHILD_DOWN, and fs->next_subvol = B, and GF_EVENT_PARENT_DOWN on C as explained above. This again leads to GF_EVENT_CHILD_DOWN on C.
- graph_setup(C) as a result of GF_EVENT_CHILD_DOWN, and fs->next_subvol = C, and GF_EVENT_PARENT_DOWN on B as explained above.

Thus both graphs B and C are disconnected, and hence the ENOTCONN.

Solution:
Remove the call to graph_setup() when the event is GF_EVENT_CHILD_DOWN. I don't see any reason why graph_setup should be called when there is a child_down. Not sure what the original reason was for having graph_setup in child_down; git history shows the first patch itself had this call.

Change-Id: I9de86555f66cc94a05649ac863b40ed3426ffd4b
BUG: 1343038
Signed-off-by: Poornima G <pgurusid>
Reviewed-on: http://review.gluster.org/14656
Smoke: Gluster Build System <jenkins.org>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>
Reviewed-by: Jeff Darcy <jdarcy>
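The event sequence described in the commit message can be traced with a small toy model. This is only a sketch in Python, not the gfapi C code: graph_setup, glfs_init, active_subvol and next_subvol are named after the commit message, while Fs, notify, parent_down, do_io, run and the setup_on_child_down flag are hypothetical names introduced for illustration. The model also assumes that a PARENT_DOWN sent to an already disconnected graph produces no further CHILD_DOWN, which is what lets the sequence terminate.

```python
import errno


class Fs:
    """Toy stand-in for the client handle (hypothetical, not struct glfs)."""

    def __init__(self, setup_on_child_down):
        self.active_subvol = None
        self.next_subvol = None
        self.connected = {}  # graph name -> connection state
        # True models the old behaviour, False models the fix.
        self.setup_on_child_down = setup_on_child_down
        self.trace = []


def graph_setup(fs, graph):
    """Make `graph` the pending graph; retire a pending graph never used."""
    old = fs.next_subvol
    fs.next_subvol = graph
    fs.trace.append(f"graph_setup({graph})")
    if old is not None:
        parent_down(fs, old)


def parent_down(fs, graph):
    """Model GF_EVENT_PARENT_DOWN: disconnect and report CHILD_DOWN once."""
    fs.trace.append(f"PARENT_DOWN({graph})")
    if fs.connected.get(graph):  # assumption: no event if already down
        fs.connected[graph] = False
        notify(fs, "CHILD_DOWN", graph)


def notify(fs, event, graph):
    if event == "CHILD_UP":
        fs.connected[graph] = True
        graph_setup(fs, graph)
    elif event == "CHILD_DOWN" and fs.setup_on_child_down:
        graph_setup(fs, graph)  # the call the fix removes


def glfs_init(fs):
    fs.active_subvol, fs.next_subvol = fs.next_subvol, None


def do_io(fs):
    """Switch to the pending graph, then fail if it is disconnected."""
    if fs.next_subvol is not None:
        fs.active_subvol, fs.next_subvol = fs.next_subvol, None
    return 0 if fs.connected[fs.active_subvol] else errno.ENOTCONN


def run(setup_on_child_down):
    fs = Fs(setup_on_child_down)
    notify(fs, "CHILD_UP", "A")
    glfs_init(fs)
    notify(fs, "CHILD_UP", "B")  # first volume set
    notify(fs, "CHILD_UP", "C")  # second volume set, no IO in between
    return do_io(fs), fs


if __name__ == "__main__":
    for flag in (True, False):
        rc, fs = run(flag)
        print(flag, rc, fs.connected, fs.trace)
```

In this model, run(True) leaves both B and C disconnected and the IO returns ENOTCONN, while run(False) leaves C connected and the IO succeeds.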
REVIEW: http://review.gluster.org/14747 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#1) for review on release-3.8 by Poornima G (pgurusid)
COMMIT: http://review.gluster.org/14747 committed in release-3.8 by Jeff Darcy (jdarcy)
------
commit cf1e98ff5bf8233803b4f74debee1b1f474765af
Author: Poornima G <pgurusid>
Date: Mon Jun 6 06:29:40 2016 -0400

gfapi: Fix IO error caused when there is consecutive graph switches

Backport of http://review.gluster.org/#/c/14656/

Issue:
Consider a simple situation where glfs_init() is done, i.e. the initial graph is up. Now perform 2 volume sets that result in 2 client-side graph changes. After this, perform some IO; the IO fails with ENOTCONN. The only way to recover the client is, I guess, another graph switch or a restart.

What is actually happening from the code perspective:
Let the initial graph be A, followed by 2 consecutive graph switches to B and C without any IO between those two switches.

- graph_setup(A) as a result of GF_EVENT_CHILD_UP, and fs->next_subvol = A
- glfs_init() results in fs->active_subvol = A, fs->next_subvol = NULL
- graph_setup(B) as a result of GF_EVENT_CHILD_UP, and fs->next_subvol = B
- graph_setup(C) as a result of GF_EVENT_CHILD_UP, and fs->next_subvol = C. It also sees that the previous graph B was never set as fs->active_subvol, i.e. no IO or anything else happened on B, so it can safely send GF_EVENT_PARENT_DOWN (by calling glfs_subvol_done(B)). This parent down on B results in child_down(B), which is fine. But child_down also triggers graph_setup(B).
- graph_setup(B) as a result of GF_EVENT_CHILD_DOWN, and fs->next_subvol = B, and GF_EVENT_PARENT_DOWN on C as explained above. This again leads to GF_EVENT_CHILD_DOWN on C.
- graph_setup(C) as a result of GF_EVENT_CHILD_DOWN, and fs->next_subvol = C, and GF_EVENT_PARENT_DOWN on B as explained above.

Thus both graphs B and C are disconnected, and hence the ENOTCONN.

Solution:
Remove the call to graph_setup() when the event is GF_EVENT_CHILD_DOWN. I don't see any reason why graph_setup should be called when there is a child_down. Not sure what the original reason was for having graph_setup in child_down; git history shows the first patch itself had this call.

Change-Id: I9de86555f66cc94a05649ac863b40ed3426ffd4b
BUG: 1347489
Signed-off-by: Poornima G <pgurusid>
Reviewed-on: http://review.gluster.org/14656
Smoke: Gluster Build System <jenkins.org>
NetBSD-regression: NetBSD Build System <jenkins.org>
CentOS-regression: Gluster Build System <jenkins.org>
Reviewed-by: Jeff Darcy <jdarcy>
(cherry picked from commit b8ac20e888fbacad9d90cd8f1c6ff8579a5cefe9)
Reviewed-on: http://review.gluster.org/14747
REVIEW: http://review.gluster.org/14835 (gfapi: Fix IO error caused when there is consecutive graph switches) posted (#1) for review on release-3.7 by Poornima G (pgurusid)
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.8.1, please open a new bug report.

glusterfs-3.8.1 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.packaging/156
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user