Pursuing further the RCA of fuse-bridge not waiting until the new graph is up before directing fops to it: I do see the code where fuse_graph_sync waits on priv->sync_cond after setting a new graph as active_subvol. However, the "notify" function, which broadcasts a signal on priv->sync_cond whenever it receives a CHILD_DOWN/CHILD_UP, doesn't check which graph the event was received on before setting priv->event_recvd to true. For example, consider this scenario:

* fuse_graph_sync is waiting for a CHILD_UP/CHILD_DOWN on the new graph by doing pthread_cond_wait on priv->sync_cond
* notify receives a CHILD_DOWN on the old graph and signals priv->sync_cond

In the above scenario, fuse_graph_sync wakes up even though no CHILD_UP/CHILD_DOWN was received on the new graph, and starts directing fops to the new graph. Those fops keep failing until the new graph is up.

There is some evidence in the logs too. Note that we started seeing failed fops immediately after the CHILD_DOWN event from afr to its parents (fuse-bridge), and there were no errors before that.

[2017-02-08 10:25:21.404411] E [MSGID: 108006] [afr-common.c:4681:afr_notify] 0-andromeda-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-02-08 10:25:21.438033] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38318: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output error)
[2017-02-08 10:25:21.438541] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38320: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output error)
[2017-02-08 10:25:21.455715] W [fuse-bridge.c:767:fuse_attr_cbk] 0-glusterfs-fuse: 38290: STAT() <gfid:8dad6ee2-a57f-47b8-ac06-648931200375> => -1 (Input/output error)
[2017-02-08 10:25:21.455821] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38312: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456344] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38324: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456692] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38326: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)

I'll wait for debug logs from sas to confirm the above RCA. If the RCA turns out to be correct, the fix would be to make fuse_graph_sync wait for a CHILD_UP/CHILD_DOWN event on the new graph it just set as active_subvol, instead of waking up on a CHILD_UP/CHILD_DOWN received on _any_ graph.
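To make the proposed fix concrete, here is a minimal sketch of a condition-variable wait that only wakes on events for the awaited graph. This is not the actual fuse-bridge code: struct sync_state, sync_state_init, on_child_event, wait_for_new_graph and the integer graph ids are all hypothetical names made up for illustration. Only the idea comes from the RCA above: record which graph the waiter cares about, and have the notifier set the "event received" flag only when the event arrived on that graph.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the fuse-bridge private state. */
struct sync_state {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int  awaited_graph_id; /* id of the graph just made active (assumed field) */
    bool event_recvd;      /* set only by an event on the awaited graph */
};

/* Prepare to wait for the first CHILD_UP/CHILD_DOWN on new_graph_id. */
void sync_state_init(struct sync_state *s, int new_graph_id)
{
    assert(s != NULL);
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->cond, NULL);
    s->awaited_graph_id = new_graph_id;
    s->event_recvd = false;
}

/*
 * Called when a CHILD_UP/CHILD_DOWN arrives on graph_id.
 * Events on any other graph (e.g. the old one) are ignored for the
 * purpose of waking the waiter. Returns true if the event matched.
 */
bool on_child_event(struct sync_state *s, int graph_id)
{
    bool matched = false;

    pthread_mutex_lock(&s->lock);
    if (graph_id == s->awaited_graph_id) {
        s->event_recvd = true;
        matched = true;
        pthread_cond_broadcast(&s->cond);
    }
    pthread_mutex_unlock(&s->lock);
    return matched;
}

/* Block until an event has been seen on the awaited graph. */
void wait_for_new_graph(struct sync_state *s)
{
    pthread_mutex_lock(&s->lock);
    while (!s->event_recvd) /* loop also guards against spurious wakeups */
        pthread_cond_wait(&s->cond, &s->lock);
    pthread_mutex_unlock(&s->lock);
}
```

With this shape, a CHILD_DOWN on the old graph leaves event_recvd untouched, so the waiter keeps sleeping until the new graph reports its first event.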
REVIEW: https://review.gluster.org/16709 (features/shard: Put the onus of choosing the inode to resolve on individual fops) posted (#1) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/14419 (features/shard: Fix write/read failure due to EINVAL) posted (#4) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/16709 (features/shard: Put onus of choosing the inode to resolve on individual fops) posted (#2) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/14419 (features/shard: Fix write/read failure due to EINVAL) posted (#6) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/14419 (features/shard: Fix write/read failure due to EINVAL) posted (#7) for review on master by Pranith Kumar Karampuri (pkarampu)
REVIEW: https://review.gluster.org/16709 (features/shard: Put onus of choosing the inode to resolve on individual fops) posted (#3) for review on master by Pranith Kumar Karampuri (pkarampu)
REVIEW: https://review.gluster.org/14419 (features/shard: Fix EIO error on add-brick) posted (#8) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/16709 (features/shard: Put the onus of choosing the inode to resolve on individual fops) posted (#4) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/14419 (features/shard: Fix EIO error on add-brick) posted (#9) for review on master by Krutika Dhananjay (kdhananj)
REVIEW: https://review.gluster.org/16709 (features/shard: Put onus of choosing the inode to resolve on individual fops) posted (#5) for review on master by Krutika Dhananjay (kdhananj)
COMMIT: https://review.gluster.org/16709 committed in master by Pranith Kumar Karampuri (pkarampu)
------
commit 583e6cfc5bc73c2a79be9d42e89b90a71454596f
Author: Krutika Dhananjay <kdhananj>
Date: Wed Feb 22 14:43:46 2017 +0530

    features/shard: Put onus of choosing the inode to resolve on individual fops

    ... as opposed to adding checks in "common" functions to choose the inode
    to resolve based local->fop, which is rather ugly and prone to errors.

    Change-Id: Ia46cc59992baa2979516369cb72d8991452c0274
    BUG: 1420623
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/16709
    Tested-by: Pranith Kumar Karampuri <pkarampu>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
REVIEW: https://review.gluster.org/14419 (features/shard: Fix EIO error on add-brick) posted (#10) for review on master by Krutika Dhananjay (kdhananj)
COMMIT: https://review.gluster.org/14419 committed in master by Atin Mukherjee (amukherj)
------
commit 1e2773cf1586b78c71e5b8adc24c6b65f065357c
Author: Krutika Dhananjay <kdhananj>
Date: Tue May 17 15:37:18 2016 +0530

    features/shard: Fix EIO error on add-brick

    DHT seems to link inode during lookup even before initializing
    inode ctx with layout information, which comes after
    directory healing.

    Consider two parallel writes. As part of the first write,
    shard sends lookup on .shard which in its return path would
    cause DHT to link .shard inode. Now at this point, when a
    second write is wound, inode_find() of .shard succeeds and
    as a result of this, shard goes to create the participant
    shards by issuing MKNODs under .shard. Since the layout is
    yet to be initialized, mknod fails in dht call path with EIO,
    leading to VM pauses.

    The fix involves shard maintaining a flag to denote whether
    a fresh lookup on .shard completed one network trip. If it
    didn't, all inode_find()s in fop path will be followed by a
    lookup before proceeding with the next stage of the fop.

    Big thanks to Raghavendra G and Pranith Kumar K for the RCA
    and subsequent inputs and feedback on the patch.

    Change-Id: I9383ec7e3f24b34cd097a1b01ad34e4eeecc621f
    BUG: 1420623
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/14419
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>
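For readers following the commit message above, the flag-based approach can be sketched roughly as below. This is a toy model under stated assumptions, not the shard translator's code: struct dir_inode, struct shard_state, do_lookup and resolve_dot_shard are hypothetical names, and the "lookup" is a plain function call standing in for a network round trip. It only illustrates the rule that a successful inode_find() on .shard is not trusted until one fresh lookup has completed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the .shard directory inode (hypothetical fields). */
struct dir_inode {
    bool linked;       /* an inode_find()-style search would succeed */
    bool layout_ready; /* layout info initialized, so mknod can succeed */
};

/* Toy model of shard's private state (hypothetical fields). */
struct shard_state {
    bool first_lookup_done; /* a fresh lookup on .shard completed a round trip */
    int  lookups_sent;      /* counter kept only to observe behaviour */
};

/* Stand-in for a lookup that completes its full network trip. */
static void do_lookup(struct shard_state *st, struct dir_inode *dir)
{
    st->lookups_sent++;
    dir->linked = true;
    dir->layout_ready = true;     /* by now the layout is assumed to be set */
    st->first_lookup_done = true;
}

/*
 * Resolve .shard before creating participant shards.
 * Even if the inode is already linked, force a lookup unless one
 * fresh lookup is known to have completed. Returns true when it is
 * safe to go on and issue mknod under .shard.
 */
bool resolve_dot_shard(struct shard_state *st, struct dir_inode *dir)
{
    assert(st != NULL && dir != NULL);

    bool found = dir->linked; /* stand-in for inode_find() succeeding */
    if (!found || !st->first_lookup_done)
        do_lookup(st, dir);
    return dir->layout_ready;
}
```

In the racy case from the commit message (inode linked, layout not yet ready, flag unset), the first call sends a lookup instead of proceeding directly to mknod; later calls skip the extra lookup because the flag is set.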
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/