Bug 1331280 - Some VMs go to a paused state when there is concurrent I/O on the VMs
Summary: Some VMs go to a paused state when there is concurrent I/O on the VMs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: replicate
Version: rhgs-3.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.1.3
Assignee: Krutika Dhananjay
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks: Gluster-HC-1 1311817 1337405 1337870 1337872
 
Reported: 2016-04-28 08:01 UTC by RamaKasturi
Modified: 2016-09-17 12:09 UTC
CC List: 12 users

Fixed In Version: glusterfs-3.7.9-6
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1337405 (view as bug list)
Environment:
Last Closed: 2016-06-23 05:20:11 UTC
Embargoed:


Attachments
log file showing errors during the vm pause events (1.11 MB, text/plain)
2016-05-10 05:40 UTC, Paul Cuzner


Links
System ID: Red Hat Product Errata RHBA-2016:1240 | Private: 0 | Priority: normal | Status: SHIPPED_LIVE | Summary: Red Hat Gluster Storage 3.1 Update 3 | Last Updated: 2016-06-23 08:51:28 UTC

Description RamaKasturi 2016-04-28 08:01:20 UTC
Description of problem:
I stopped all the VMs running on my gluster volumes and started them again. When the VMs were started back, all of them came up except one, which moved to a paused state.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-1.3.4.0-1.el7ev.noarch
libgovirt-0.3.3-1.el7_2.1.x86_64
ovirt-host-deploy-1.4.1-1.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5.1-1.el7ev.noarch


How reproducible:
Twice

Steps to Reproduce:
1. Install HC setup
2. Boot-storm the Windows and Linux VMs (start them all at once)
3. Stop all the VMs running on the gluster volumes.
4. Start them again.

Actual results:
One of the VMs went to a paused state.

Expected results:
VMs should not go to a paused state.

Additional info:

Comment 2 Sahina Bose 2016-04-28 11:19:04 UTC
sosreport-sulphur.lab.eng.blr.redhat.com.1331280-20160428131011/var/log/vdsm/vdsm.log.1.xz:libvirtEventLoop::DEBUG::2016-04-28 12:45:33,636::__init__::206::jsonrpc.Notification::(emit) Sending event {"params": {"notify_time": 4539584670, "f9340ab8-5339-4f1e-958d-547760ef81b9": {"status": "Paused", "ioerror": {"alias": "virtio-disk0", "name": "vda", "path": "/rhev/data-center/00000001-0001-0001-0001-000000000128/297a9b9c-4396-4b30-8bfe-976a67d49a74/images/aa3b33bd-e327-4edd-967e-19afc504a401/0dcebf1c-8cdd-4b01-a2c2-beaf89515afa"}, "pauseCode": "EOTHER"}}, "jsonrpc": "2.0", "method": "|virt|VM_status|f9340ab8-5339-4f1e-958d-547760ef81b9"}



[2016-04-28 07:15:33.626451] W [fuse-bridge.c:2221:fuse_readv_cbk] 0-glusterfs-fuse: 129914: READ => -1 (Invalid argument)

Moving to gluster team

Comment 4 Sahina Bose 2016-05-04 07:51:04 UTC
Is this a duplicate of Bug 1311881?

Comment 5 Krutika Dhananjay 2016-05-04 08:31:03 UTC
(In reply to Sahina Bose from comment #4)
> Is this a duplicate of Bug 1311881?

Not entirely sure. I don't think Kasturi saw the 'permission denied' errors on her setup when she reported this bug. In her case, it was the readvs that failed with EINVAL.

-Krutika

Comment 6 Krutika Dhananjay 2016-05-05 03:40:36 UTC
Posting the progress so far on this bug:
We (Kasturi and I) attached gdb to the client process and set a breakpoint in fuse_readv_cbk with the condition (op_ret < 0), so that when it is hit we can inspect the backtrace and figure out which translator is returning the error. With this in place, she tried the VM stop + start cycle again, but the issue did not reproduce; everything worked fine.
For now, we need to actually hit that breakpoint before the issue can be debugged further.
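
For reference, the setup described above boils down to a gdb session roughly like the following (the PID is a placeholder; fuse_readv_cbk and op_ret are the symbol and callback argument named above):

# attach to the glusterfs FUSE client process
gdb -p <pid-of-glusterfs-client>
# break only when a readv callback reports a failure
(gdb) break fuse_readv_cbk if op_ret < 0
(gdb) continue
# once the breakpoint fires, the backtrace shows which translator produced the error
(gdb) bt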

Comment 7 Paul Cuzner 2016-05-10 05:39:06 UTC
I see the pause issue in the test environment I'm using in BAGL - gprfc085/6/7.

The cluster is currently running 3.7.9-4.

Whenever I run 9 concurrent fio jobs without rate limiting, I see VMs enter a paused state.

Looking at the logs on the hosts that were running the affected VMs, you can see:

[2016-05-10 04:10:42.476785] W [dict.c:1452:dict_get_with_ref] (-->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(build_shd_graph+0x69) [0x7ff1a7172539] -->/lib64/libglusterfs.so.0(dict_get_str_boolean+0x32) [0x7ff1b25b1192] -->/lib64/libglusterfs.so.0(+0x1ed56) [0x7ff1b25aed56] ) 0-dict: dict OR key (graph-check) is NULL [Invalid argument]
[2016-05-10 04:10:42.480382] W [dict.c:1452:dict_get_with_ref] (-->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(build_shd_graph+0x69) [0x7ff1a7172539] -->/lib64/libglusterfs.so.0(dict_get_str_boolean+0x32) [0x7ff1b25b1192] -->/lib64/libglusterfs.so.0(+0x1ed56) [0x7ff1b25aed56] ) 0-dict: dict OR key (graph-check) is NULL [Invalid argument]
[2016-05-10 04:20:18.444527] W [dict.c:1452:dict_get_with_ref] (-->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(build_shd_graph+0x69) [0x7ff1a7172539] -->/lib64/libglusterfs.so.0(dict_get_str_boolean+0x32) [0x7ff1b25b1192] -->/lib64/libglusterfs.so.0(+0x1ed56) [0x7ff1b25aed56] ) 0-dict: dict OR key (graph-check) is NULL [Invalid argument]
[2016-05-10 04:20:18.447965] W [dict.c:1452:dict_get_with_ref] (-->/usr/lib64/glusterfs/3.7.9/xlator/mgmt/glusterd.so(build_shd_graph+0x69) [0x7ff1a7172539] -->/lib64/libglusterfs.so.0(dict_get_str_boolean+0x32) [0x7ff1b25b1192] -->/lib64/libglusterfs.so.0(+0x1ed56) [0x7ff1b25aed56] ) 0-dict: dict OR key (graph-check) is NULL [Invalid argument]

These servers are accessible to eng for further investigation.

Will attach a log file from gprfc085

Comment 8 Paul Cuzner 2016-05-10 05:40:17 UTC
Created attachment 1155521 [details]
log file showing errors during the vm pause events

log file attached

Comment 11 Krutika Dhananjay 2016-05-17 05:20:27 UTC
Issue root-caused.

This is a race which can result in EINVAL under the following circumstance:

When two threads send fresh lookups on a shard in parallel, and each returns a new inode (I1 and I2 respectively, created by a call to inode_new()) in its return path, consider the following scenario:

thread 1                                    thread 2
========                                    ========
afr gets the lookup rsp,
calls inode_link(I1) in
afr_lookup_sh_metadata_wrap(),
gets I1.

                                           afr gets the lookup rsp, calls
                                           inode_link(I2) in
                                           afr_lookup_sh_metadata_wrap,
                                           gets I1.

                                           Yet, afr unwinds the stack with I2.

                                           DHT initialises inode ctx for I2
                                           and unwinds the lookup to shard xl.

                                           shard calls inode_link(I2), and
                                           gets I1 in return.

                                           shard creates anon fd against I1 and sends
                                           writev/readv call on this fd.

                                           DHT fails to get the inode ctx for I1 since
                                           I1 was never the inode that was part of the unwind
                                           path of the lookup, and so it fails the fop with
                                           EINVAL.

                                           Shard as a result declares the fop a failure and
                                           propagates EINVAL up to FUSE.

                                           FUSE returns this failure to the app (qemu in this
                                           case). On encountering failure, it pauses the VM.
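
To make the pattern concrete, below is a small self-contained C sketch of the "first inode_link() wins" behaviour the race hinges on. The fake_inode_t type and link_inode() helper are simplified stand-ins invented for illustration (not the GlusterFS inode-table API); the sketch only shows the general rule that a caller has to continue with whatever the link call returns, because a parallel lookup may already have linked an equivalent inode first.

/* Simplified stand-ins for an inode table; NOT the GlusterFS API. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { int id; } fake_inode_t;

static fake_inode_t   *table_entry = NULL;  /* one-slot "inode table" */
static pthread_mutex_t table_lock  = PTHREAD_MUTEX_INITIALIZER;

/* Link 'fresh' into the table; if another thread already linked an
   equivalent inode, return that one instead (mirroring how inode_link()
   can return a different inode than the one passed in). */
static fake_inode_t *link_inode(fake_inode_t *fresh)
{
    pthread_mutex_lock(&table_lock);
    if (table_entry == NULL)
        table_entry = fresh;                /* this thread's inode wins */
    fake_inode_t *linked = table_entry;     /* may differ from 'fresh'  */
    pthread_mutex_unlock(&table_lock);
    return linked;
}

static void *lookup_path(void *arg)
{
    fake_inode_t *fresh = malloc(sizeof(*fresh));
    fresh->id = (int)(long)arg;

    fake_inode_t *linked = link_inode(fresh);

    /* Correct pattern: continue (and unwind) with 'linked'. Continuing
       with 'fresh' when linked != fresh is the analogue of the bug above:
       lower layers only set up context for the linked inode, so operating
       on the discarded one surfaces as errors like the EINVAL seen here. */
    if (linked != fresh) {
        printf("thread %d: lost the race, switching to inode %d\n",
               fresh->id, linked->id);
        free(fresh);
    } else {
        printf("thread %d: my inode was linked\n", fresh->id);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, lookup_path, (void *)1L);
    pthread_create(&t2, NULL, lookup_path, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;   /* build with: cc -pthread race_sketch.c */
}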

Comment 14 Krutika Dhananjay 2016-05-19 07:06:02 UTC
upstream patch posted at http://review.gluster.org/#/c/14419/

Still need confirmation from Paul and others that it fixes the pause under heavy load.
Moving the BZ to POST state.
-Krutika

Comment 19 Krutika Dhananjay 2016-05-24 08:27:12 UTC
(In reply to Paul Cuzner from comment #18)
> I just tested 3.7.9-6 and am still seeing vm's pause and system wide impact
> from random write workloads.

Hi Paul,

Sahina, Pranith, and I checked your setup in gdb while the I/Os were still failing.

SETATTRs were failing with EROFS (read-only filesystem).
All bricks are up as per volume-status.
But when I checked the connection status between node 85's client and the bricks from the mount, in .meta/graphs/active/vm-store-client-*/private, it seems that brick-0 (local) and brick-2 (remote) have 'connected=0' status. So effectively the mount is able to talk only to brick-1, while the RPC-internal state in the process still says 'connected'. The bug is in the RPC layer of GlusterFS. We will need Raghavendra G's help in this regard.
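
For anyone repeating this check, the per-client connection state can be read from the mount's meta directory with something along these lines (<mountpoint> is a placeholder for the FUSE mount of the vm-store volume):

grep -i connected <mountpoint>/.meta/graphs/active/vm-store-client-*/private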

In short, this issue is not related to the EINVAL fix that went in.
I am going to raise a new bug to track this issue.

Comment 20 SATHEESARAN 2016-05-27 08:08:34 UTC
(In reply to Krutika Dhananjay from comment #19)

> In short, this issue is not related to the EINVAL fix that went in.
> I am going to raise a new bug to track this issue.

To keep track of this, Krutika has raised a new bug [1] for the issue:
[1] - https://bugzilla.redhat.com/show_bug.cgi?id=1339136

Comment 21 SATHEESARAN 2016-06-07 13:20:32 UTC
Tested with the RHGS 3.1.3 nightly build (glusterfs-3.7.9-6.el7rhgs) with the following steps:

1. Created a RHEV data domain backed by a gluster replica 3 sharded volume optimized for virt-store

2. Created around 10 VMs and ran a 100% random write workload using the fio tool (an illustrative job file is sketched after this list)

3. Observed no VM pause issues related to 'permission denied' or EINVAL
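
For reference, a 100% random write workload of the kind used in step 2 can be expressed as a fio job file along these lines (the job name, file name, size, block size and runtime are illustrative placeholders, not the exact job that was run):

[vm-randwrite]
; async engine, direct I/O, pure random writes
ioengine=libaio
direct=1
rw=randwrite
bs=4k
iodepth=16
; placeholder target file and duration
filename=/path/inside/vm/testfile
size=4g
runtime=300
time_based=1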

Comment 23 errata-xmlrpc 2016-06-23 05:20:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240

