Bug 1337405

Summary:	Some of VMs go to paused state when there is concurrent I/O on vms
Product:	[Community] GlusterFS
Component:	sharding
Reporter:	Krutika Dhananjay <kdhananj>
Assignee:	Krutika Dhananjay <kdhananj>
QA Contact:	bugs <bugs>
Status:	CLOSED CURRENTRELEASE
Severity:	high
Priority:	high
Version:	mainline
CC:	amukherj, atumball, bugs, kdhananj, knarra, pcuzner, pkarampu, rcyriac, rhinduja, sabose, sasundar
Keywords:	Triaged
Hardware:	Unspecified
OS:	Unspecified
Fixed In Version:	glusterfs-3.12.13
Doc Type:	If docs needed, set a value
Type:	Bug
oVirt Team:	Gluster
Clone Of:	1331280
Clones:	1337870
Bug Depends On:	1331280
Bug Blocks:	1337870, 1337872
Last Closed:	2018-10-08 10:43:28 UTC

Description Krutika Dhananjay 2016-05-19 06:55:53 UTC

+++ This bug was initially created as a clone of Bug #1331280 +++

Description of problem:
I stopped all the VMs running on my gluster volumes and started them again. When the VMs were started back up, all of them came up except one, which moved to the paused state.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-1.3.4.0-1.el7ev.noarch
libgovirt-0.3.3-1.el7_2.1.x86_64
ovirt-host-deploy-1.4.1-1.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5.1-1.el7ev.noarch


How reproducible:
Twice

Steps to Reproduce:
1. Install an HC setup.
2. Bootstorm Windows and Linux VMs.
3. Stop all the VMs running on gluster volumes.
4. Start them again.

Actual results:
One of the VMs went to the paused state.

Expected results:
VMs should not go to the paused state.

Additional info:

[2016-04-28 07:15:33.626451] W [fuse-bridge.c:2221:fuse_readv_cbk] 0-glusterfs-fuse: 129914: READ => -1 (Invalid argument)

Moving to gluster team


--- Additional comment from Krutika Dhananjay on 2016-05-17 01:20:27 EDT ---

Issue root-caused.

This is a race that can result in EINVAL under the following circumstances:

When two threads send fresh lookups on a shard in parallel, each with its own new inode (I1 and I2 respectively, each created by a call to inode_new()), the following interleaving of their return paths is possible:

thread 1                                    thread 2
========                                    ========
afr gets the lookup rsp,
calls inode_link(I1) in
afr_lookup_sh_metadata_wrap(),
gets I1.

                                           afr gets the lookup rsp, calls
                                           inode_link(I2) in
                                           afr_lookup_sh_metadata_wrap(),
                                           gets I1.

                                           Yet, afr unwinds the stack with I2.

                                           DHT initialises inode ctx for I2
                                           and unwinds the lookup to shard xl.

                                           shard calls inode_link(I2), and
                                           gets I1 in return.

                                           shard creates anon fd against I1 and sends
                                           writev/readv call on this fd.

                                           DHT fails to get the inode ctx for I1 since
                                           I1 was never the inode that was part of the unwind
                                           path of the lookup, and so it fails the fop with
                                           EINVAL.

                                           Shard as a result declares the fop a failure and
                                           propagates EINVAL up to FUSE.

                                           FUSE returns this failure to the app (qemu in this
                                           case). On encountering the failure, qemu pauses the VM.
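
The interleaving above can be replayed deterministically in a small toy model. This is only a sketch and not GlusterFS code: the classes Inode and InodeTable, the helpers afr_lookup_cbk, dht_lookup_cbk, dht_readv and run, and the link_in_afr flag are all hypothetical names made up for illustration. The only behaviour taken from the description is that linking a second inode for the same shard returns the one linked first, and that DHT sets its ctx only on the inode that travels up the unwind path. The link_in_afr=False case illustrates, under the same assumptions, why linking only at the top of the stack would avoid the failure in this model.

```python
import errno


class Inode:
    """Toy stand-in for an in-memory inode; `name` only labels it (I1, I2)."""

    def __init__(self, name, gfid):
        self.name = name
        self.gfid = gfid
        self.ctx = {}  # per-translator context, keyed by translator name


class InodeTable:
    """Toy inode table: the first inode linked for a gfid wins."""

    def __init__(self):
        self._by_gfid = {}

    def link(self, inode):
        # Assumed behaviour, as in the description: linking I2 returns I1
        # when I1 is already linked for the same gfid.
        return self._by_gfid.setdefault(inode.gfid, inode)


def afr_lookup_cbk(table, inode, link_in_afr):
    if link_in_afr:
        table.link(inode)  # the inode returned by the link is ignored ...
    return inode           # ... and the stack is unwound with the fresh inode


def dht_lookup_cbk(inode):
    inode.ctx["dht"] = "layout"  # ctx is set only on the unwound inode
    return inode


def dht_readv(inode):
    # DHT fails the fop when it has no ctx for the inode.
    return 0 if "dht" in inode.ctx else -errno.EINVAL


def run(link_in_afr):
    table = InodeTable()
    i1 = Inode("I1", "shard-gfid")
    i2 = Inode("I2", "shard-gfid")

    # Thread 1: afr handles the lookup response, then the thread is
    # preempted before the lookup unwinds to DHT.
    afr_lookup_cbk(table, i1, link_in_afr)

    # Thread 2: runs its whole return path, then shard issues the read.
    unwound = dht_lookup_cbk(afr_lookup_cbk(table, i2, link_in_afr))
    linked = table.link(unwound)  # shard's inode_link
    return linked.name, dht_readv(linked)


if __name__ == "__main__":
    print(run(link_in_afr=True))   # shard ends up on I1, read fails with -EINVAL
    print(run(link_in_afr=False))  # shard ends up on I2, read succeeds
```

In this model the failure depends only on ordering: thread 1 links I1 before DHT has seen it, so the inode that shard later receives from the table has no DHT ctx.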

Comment 1 Vijay Bellur 2016-05-19 07:03:29 UTC

REVIEW: http://review.gluster.org/14419 (features/shard: Fix write/read failure due to EINVAL) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

Comment 2 Vijay Bellur 2016-05-19 12:39:18 UTC

REVIEW: http://review.gluster.org/14422 (cluster/afr: Do not inode_link in afr) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 3 Vijay Bellur 2016-05-20 08:10:22 UTC

REVIEW: http://review.gluster.org/14419 (features/shard: Fix write/read failure due to EINVAL) posted (#2) for review on master by Krutika Dhananjay (kdhananj)

Comment 4 Vijay Bellur 2016-05-20 09:55:58 UTC

COMMIT: http://review.gluster.org/14422 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit 6a51464cf4704e7d7fcbce8919a5ef386a9cfd53
Author: Pranith Kumar K <pkarampu>
Date:   Thu May 19 16:24:09 2016 +0530

    cluster/afr: Do not inode_link in afr
    
    Race is explained at
    https://bugzilla.redhat.com/show_bug.cgi?id=1337405#c0
    
    This patch also handles performing of self-heal with shd-pid.
    Also performs the healing with this->itable's inode rather than
    main itable.
    
    BUG: 1337405
    Change-Id: Id657a6623b71998b027b1dff6af5bbdf8cab09c9
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/14422
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Krutika Dhananjay <kdhananj>

Comment 5 Vijay Bellur 2016-05-26 13:19:32 UTC

REVIEW: http://review.gluster.org/14545 (cluster/afr: Fix warning about unused variable) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 6 Krutika Dhananjay 2016-11-17 05:16:18 UTC

The patch was merged a long time back (20 May 2016). Moving the bug to MODIFIED state.

Comment 7 Worker Ant 2017-02-22 09:36:15 UTC

REVIEW: https://review.gluster.org/14419 (features/shard: Fix write/read failure due to EINVAL) posted (#3) for review on master by Krutika Dhananjay (kdhananj)

Comment 8 Worker Ant 2017-02-22 10:17:22 UTC

REVIEW: https://review.gluster.org/14419 (features/shard: Fix write/read failure due to EINVAL) posted (#5) for review on master by Krutika Dhananjay (kdhananj)