Bug 1420623

Summary: [RHV-RHGS]: Application VM paused after add-brick operation and VM didn't come up after power cycle.
Product: [Community] GlusterFS
Reporter: Raghavendra G <rgowdapp>
Component: fuse
Assignee: Raghavendra G <rgowdapp>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: high
Docs Contact:
Priority: unspecified
Version: mainline
CC: amukherj, bsrirama, bugs, csaba, kdhananj, ravishankar, rcyriac, rgowdapp, rhs-bugs, sabose, sasundar, storage-qa-internal
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.11.0
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1412554
Clones: 1426508 1426512 (view as bug list)
Environment:
Last Closed: 2017-05-30 18:41:34 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1412554
Bug Blocks: 1387878

Comment 1 Raghavendra G 2017-02-09 06:14:56 UTC
Pursuing the RCA further (fuse-bridge not waiting until the new graph is up before directing fops to it), I do see the code where fuse_graph_sync waits on priv->sync_cond after initializing a new graph as active_subvol. However, the "notify" function, which broadcasts a signal on priv->sync_cond whenever it receives a CHILD_DOWN/CHILD_UP, doesn't check which graph the event was received on before setting priv->event_recvd to true. For example, consider this scenario:

* fuse_graph_sync is waiting for a CHILD_UP/CHILD_DOWN on the new graph by doing pthread_cond_wait on priv->sync_cond
* notify receives a CHILD_DOWN on the old graph and signals priv->sync_cond

In this scenario, fuse_graph_sync wakes up even though no CHILD_UP/CHILD_DOWN was received on the new graph, and starts directing fops to the new graph; these fops keep failing until the new graph is actually up.

There is some evidence in the logs too. Note that we started seeing failed fops immediately after the CHILD_DOWN event from afr to its parent (fuse-bridge), and that there were no errors before that.

[2017-02-08 10:25:21.404411] E [MSGID: 108006] [afr-common.c:4681:afr_notify] 0-andromeda-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2017-02-08 10:25:21.438033] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38318: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output error)
[2017-02-08 10:25:21.438541] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38320: WRITE => -1 gfid=d04d7083-bdfe-4424-be50-a8ce01caa8a1 fd=0x7f83c804b0f8 (Input/output error)
[2017-02-08 10:25:21.455715] W [fuse-bridge.c:767:fuse_attr_cbk] 0-glusterfs-fuse: 38290: STAT() <gfid:8dad6ee2-a57f-47b8-ac06-648931200375> => -1 (Input/output error)
[2017-02-08 10:25:21.455821] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38312: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456344] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38324: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)
[2017-02-08 10:25:21.456692] W [fuse-bridge.c:2312:fuse_writev_cbk] 0-glusterfs-fuse: 38326: WRITE => -1 gfid=8dad6ee2-a57f-47b8-ac06-648931200375 fd=0x7f83c804b06c (Input/output error)


I'll wait for debug logs from sas to confirm the above RCA. If the RCA turns out to be correct, the fix would be to make fuse_graph_sync wait for a CHILD_UP/CHILD_DOWN event on the new graph it just set as active_subvol, instead of waking up on a CHILD_UP/CHILD_DOWN received on _any_ graph.
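
To make the proposed fix concrete, here is a minimal standalone C sketch of the condition-variable pattern involved (illustrative names only, not the actual glusterfs code): the waiter must re-check, under the mutex, which graph the event belonged to, rather than waking on any signal delivered on priv->sync_cond.

#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int last_event_graph = -1; /* id of the graph the last CHILD_UP/DOWN arrived on */

/* notify() analogue: a CHILD_UP/CHILD_DOWN arrived on graph 'graph_id'. */
void notify_child_event(int graph_id)
{
    pthread_mutex_lock(&lock);
    last_event_graph = graph_id;   /* record WHICH graph the event was on */
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
}

/* fuse_graph_sync() analogue: block until an event arrives on the graph
 * just made active. The buggy version waited only for "some event", so a
 * CHILD_DOWN on the old graph could wake it prematurely. */
void wait_for_new_graph(int new_graph_id)
{
    pthread_mutex_lock(&lock);
    while (last_event_graph != new_graph_id) /* predicate on the graph id */
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}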

Comment 2 Worker Ant 2017-02-22 09:36:21 UTC
REVIEW: https://review.gluster.org/16709 (features/shard: Put the onus of choosing the inode to resolve on individual fops) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

Comment 3 Worker Ant 2017-02-22 09:37:39 UTC
REVIEW: https://review.gluster.org/14419 (features/shard: Fix write/read failure due to EINVAL) posted (#4) for review on master by Krutika Dhananjay (kdhananj)

Comment 4 Worker Ant 2017-02-22 10:17:28 UTC
REVIEW: https://review.gluster.org/16709 (features/shard: Put onus of choosing the inode to resolve on individual fops) posted (#2) for review on master by Krutika Dhananjay (kdhananj)

Comment 5 Worker Ant 2017-02-22 10:31:18 UTC
REVIEW: https://review.gluster.org/14419 (features/shard: Fix write/read failure due to EINVAL) posted (#6) for review on master by Krutika Dhananjay (kdhananj)

Comment 6 Worker Ant 2017-02-23 06:48:21 UTC
REVIEW: https://review.gluster.org/14419 (features/shard: Fix write/read failure due to EINVAL) posted (#7) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 7 Worker Ant 2017-02-23 06:48:27 UTC
REVIEW: https://review.gluster.org/16709 (features/shard: Put onus of choosing the inode to resolve on individual fops) posted (#3) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 8 Worker Ant 2017-02-23 06:53:11 UTC
REVIEW: https://review.gluster.org/14419 (features/shard: Fix EIO error on add-brick) posted (#8) for review on master by Krutika Dhananjay (kdhananj)

Comment 9 Worker Ant 2017-02-23 06:53:17 UTC
REVIEW: https://review.gluster.org/16709 (features/shard: Put the onus of choosing the inode to resolve on individual fops) posted (#4) for review on master by Krutika Dhananjay (kdhananj)

Comment 10 Worker Ant 2017-02-23 07:43:32 UTC
REVIEW: https://review.gluster.org/14419 (features/shard: Fix EIO error on add-brick) posted (#9) for review on master by Krutika Dhananjay (kdhananj)

Comment 11 Worker Ant 2017-02-23 07:43:37 UTC
REVIEW: https://review.gluster.org/16709 (features/shard: Put onus of choosing the inode to resolve on individual fops) posted (#5) for review on master by Krutika Dhananjay (kdhananj)

Comment 12 Worker Ant 2017-02-23 07:56:42 UTC
COMMIT: https://review.gluster.org/16709 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit 583e6cfc5bc73c2a79be9d42e89b90a71454596f
Author: Krutika Dhananjay <kdhananj>
Date:   Wed Feb 22 14:43:46 2017 +0530

    features/shard: Put onus of choosing the inode to resolve on individual fops
    
    ... as opposed to adding checks in "common" functions to choose the inode
    to resolve based on local->fop, which is rather ugly and prone to errors.
    
    Change-Id: Ia46cc59992baa2979516369cb72d8991452c0274
    BUG: 1420623
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/16709
    Tested-by: Pranith Kumar Karampuri <pkarampu>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
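
As a rough illustration of the refactor this commit describes (assumed names, not shard's actual code): instead of one common resolver that switches on the fop type, each fop passes the inode it wants resolved to common machinery that stays fop-agnostic.

typedef enum { FOP_WRITEV, FOP_TRUNCATE } fop_t;
typedef struct inode { int id; } inode_t;

/* Before: a "common" function inspects the fop to decide which inode to
 * resolve; every new fop needs another case here, which is error-prone. */
inode_t *resolve_by_fop(fop_t fop, inode_t *base, inode_t *shard)
{
    switch (fop) {
    case FOP_TRUNCATE: return base;
    default:           return shard;
    }
}

/* After: the common machinery takes whatever inode the caller chose and
 * knows nothing about individual fops. */
void resolve(inode_t *chosen) { (void)chosen; /* resolution steps here */ }

void shard_truncate(inode_t *base) { resolve(base); }
void shard_writev(inode_t *shard)  { resolve(shard); }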

Comment 13 Worker Ant 2017-02-23 10:50:55 UTC
REVIEW: https://review.gluster.org/14419 (features/shard: Fix EIO error on add-brick) posted (#10) for review on master by Krutika Dhananjay (kdhananj)

Comment 14 Worker Ant 2017-02-24 04:57:47 UTC
COMMIT: https://review.gluster.org/14419 committed in master by Atin Mukherjee (amukherj) 
------
commit 1e2773cf1586b78c71e5b8adc24c6b65f065357c
Author: Krutika Dhananjay <kdhananj>
Date:   Tue May 17 15:37:18 2016 +0530

    features/shard: Fix EIO error on add-brick
    
    DHT seems to link inode during lookup even before initializing
    inode ctx with layout information, which comes after
    directory healing.
    
    Consider two parallel writes. As part of the first write,
    shard sends lookup on .shard which in its return path would
    cause DHT to link .shard inode. Now at this point, when a
    second write is wound, inode_find() of .shard succeeds and
    as a result of this, shard goes to create the participant
    shards by issuing MKNODs under .shard. Since the layout is
    yet to be initialized, mknod fails in dht call path with EIO,
    leading to VM pauses.
    
    The fix involves shard maintaining a flag to denote whether
    a fresh lookup on .shard completed one network trip. If it
    didn't, all inode_find()s in fop path will be followed by a
    lookup before proceeding with the next stage of the fop.
    
    Big thanks to Raghavendra G and Pranith Kumar K for the RCA
    and subsequent inputs and feedback on the patch.
    
    Change-Id: I9383ec7e3f24b34cd097a1b01ad34e4eeecc621f
    BUG: 1420623
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: https://review.gluster.org/14419
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
    Reviewed-by: Atin Mukherjee <amukherj>
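
A minimal standalone sketch of the "fresh lookup" guard described in the commit message above (illustrative names only, not shard's actual code): an inode found in the cache is trusted only after a named lookup has completed one full network trip; until then, each fop performs its own lookup first.

#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the real structures. */
typedef struct { bool layout_ready; } inode_t;

typedef struct {
    inode_t *dot_shard;         /* cached .shard inode, possibly linked
                                   by DHT before its layout was set up */
    bool     fresh_lookup_done; /* one full LOOKUP network trip finished */
} shard_state_t;

/* Stub for a named lookup that heals the directory and initializes the
 * layout (the "one network trip"). */
static inode_t *lookup_dot_shard(shard_state_t *s)
{
    static inode_t resolved = { .layout_ready = true };
    s->fresh_lookup_done = true;
    return &resolved;
}

/* Per-fop resolution: only trust a cached inode once a fresh lookup has
 * completed; otherwise an immediate MKNOD under .shard could hit an
 * uninitialized DHT layout and fail with EIO. */
inode_t *resolve_dot_shard(shard_state_t *s)
{
    if (s->dot_shard != NULL && s->fresh_lookup_done)
        return s->dot_shard;            /* safe: layout initialized */
    s->dot_shard = lookup_dot_shard(s); /* lookup first, then proceed */
    return s->dot_shard;
}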

Comment 15 Shyamsundar 2017-05-30 18:41:34 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.11.0, please open a new bug report.

glusterfs-3.11.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-May/000073.html
[2] https://www.gluster.org/pipermail/gluster-users/