Description of problem: A Windows VM goes into the paused state when one of the NICs on the hypervisor is brought down and then brought back up.

Version-Release number of selected component (if applicable):
rhevm-appliance-20160323.1-1.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5.1-1.el7ev.noarch

How reproducible: Hit it once

Steps to Reproduce:
1. Installed an HC (hyperconverged) setup.
2. Bootstrapped 30 VMs, a mix of Linux and Windows VMs.
3. Brought down a NIC on one of the hypervisors.
4. Brought the NIC back up.

Actual results:
1) While the NIC is down, all the VMs show a question mark, i.e. they are in an unknown state.
2) The engine VM is not accessible for a few minutes.
3) After the NIC is brought back up, one of the Windows VMs goes into the paused state.

Expected results: The VM should not go into the paused state.

Additional info:
sos reports are available here: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1330044/
Can you provide info on which node had its NIC brought down, and which node HE was running on at that time?
There were three hosts: zod, sulphur and tettnang.
zod - NIC was brought down
sulphur - primary server for gluster
tettnang - HE was running
Regarding the HE engine restarts - I think we need a separate bug to track this. I see a periodic umount and mount in the gluster logs, repeated every minute, and errors in the HE agent logs on accessing the storage domain. No related errors in the gluster logs, however.

Regarding the VM pause error on the zod server: there are a lot of EIO-related messages like this:

var/log/vdsm/vdsm.log.1.xz:Thread-3503763::INFO::2016-04-25 14:56:36,103::clientIF::182::vds::(contEIOVms) Cont vm 74f7c2f0-150a-4076-9ade-9c467bcc922b in EIO

However, these VMs are then resumed, and only one continues to hit an IO error:

libvirtEventLoop::INFO::2016-04-25 14:56:37,532::vm::5084::virt.vm::(_logGuestCpuStatus) vmId=`1a6ad336-5317-40eb-a476-fbc70998e948`::CPU stopped: onIOError
libvirtEventLoop::DEBUG::2016-04-25 14:56:37,533::__init__::206::jsonrpc.Notification::(emit) Sending event {"params": {"1a6ad336-5317-40eb-a476-fbc70998e948": {"status": "Paused", "ioerror": {"alias": "virtio-disk0", "name": "vda", "path": "/rhev/data-center/00000001-0001-0001-0001-000000000128/297a9b9c-4396-4b30-8bfe-976a67d49a74/images/928c922c-65ab-453e-bf75-472cc41a1b31/c71e0e1e-7cfb-40a6-b85c-170705aa36e7"}, "pauseCode": "EIO"}, "notify_time": 5501919980}, "jsonrpc": "2.0", "method": "|virt|VM_status|1a6ad336-5317-40eb-a476-fbc70998e948"}

and in the gluster mount log for vmstore:

[2016-04-25 09:26:37.470109] E [MSGID: 108008] [afr-transaction.c:1981:afr_transaction] 0-vmstore-replicate-0: Failing WRITE on gfid a289ee5c-6ade-4c7d-954e-c7472cbcb284: split-brain observed. [Input/output error]
[2016-04-25 09:26:37.470240] W [fuse-bridge.c:1287:fuse_err_cbk] 0-glusterfs-fuse: 105212361: FSYNC() ERR => -1 (Input/output error)
[2016-04-25 09:26:37.470222] E [MSGID: 108008] [afr-transaction.c:1981:afr_transaction] 0-vmstore-replicate-0: Failing WRITE on gfid a289ee5c-6ade-4c7d-954e-c7472cbcb284: split-brain observed. [Input/output error]

Moving to the gluster team.
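For anyone triaging a similar report, the split-brain observed above can be confirmed and inspected from the CLI. A minimal sketch, using the volume name (vmstore) and GFID taken from the logs above; the brick path is hypothetical and the heal policy shown is just one of the available options:

```shell
# List entries AFR currently considers split-brained on the vmstore volume
# (run on any node in the trusted storage pool):
gluster volume heal vmstore info split-brain

# Inspect the AFR changelog xattrs for the affected GFID on each brick to
# see which replicas accuse each other (brick path is hypothetical):
getfattr -d -m . -e hex \
    /bricks/brick1/vmstore/.glusterfs/a2/89/a289ee5c-6ade-4c7d-954e-c7472cbcb284

# Resolve by choosing a source copy, e.g. the one with the latest mtime
# (other policies: bigger-file, source-brick):
gluster volume heal vmstore split-brain latest-mtime \
    gfid:a289ee5c-6ade-4c7d-954e-c7472cbcb284
```

Note this only resolves an existing split-brain; the patches referenced below address the root cause that let the split-brain occur in the first place.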
http://review.gluster.org/14368 and http://review.gluster.org/14369 have been posted for review upstream. Moving this bug to the POST state.
Tested with an RHGS 3.1.3 nightly build (glusterfs-3.7.9-6.el7rhgs) with the following tests:
1. Brought down the network on one of the nodes; the VMs running on that node got paused, but the other VMs kept running healthily.
2. Created new shards while one of the replica-3 bricks was down, then brought that brick back up after some time. The heal completed successfully and there were no VM pauses.
Based on the above observations, marking this bug as VERIFIED.
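The verification steps above can be sketched roughly as follows. This is illustrative only; the interface name (ens3) is hypothetical, and the volume name (vmstore) is taken from the earlier comments:

```shell
# Test 2 sketch: take one replica-3 brick offline by downing its NIC
# on the storage node (interface name is hypothetical):
ip link set dev ens3 down

# ... keep VM I/O running so new shards are created while the brick
# is down, then bring the brick's network back up:
ip link set dev ens3 up

# Watch self-heal drain; pending entries should reach zero, and there
# should be no split-brain entries and no VM pauses:
gluster volume heal vmstore info
gluster volume heal vmstore info split-brain
```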
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240