Bug 1330044

Summary: one of the VMs goes into a paused state when the network goes down and comes back up
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: RamaKasturi <knarra>
Component: replicate
Assignee: Ravishankar N <ravishankar>
Status: CLOSED ERRATA
QA Contact: SATHEESARAN <sasundar>
Severity: urgent
Docs Contact:
Priority: high
Version: rhgs-3.1
CC: amureini, asrivast, bugs, knarra, mchangir, pkarampu, rcyriac, rhinduja, rhs-bugs, sabose, sankarshan, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.1.3
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.7.9-6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1336612 (view as bug list)
Environment:
Last Closed: 2016-06-23 05:19:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Gluster
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1258386, 1311817, 1336612, 1337822, 1337831

Description RamaKasturi 2016-04-25 10:39:06 UTC
Description of problem:
A Windows VM goes into a paused state when one of the NICs on the hypervisor is brought down and then brought back up.

Version-Release number of selected component (if applicable):
rhevm-appliance-20160323.1-1.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5.1-1.el7ev.noarch


How reproducible:
Hit it once

Steps to Reproduce:
1. Installed an HC (hyperconverged) setup.
2. Boot-stormed 30 VMs, including both Linux and Windows VMs.
3. Brought down the NIC on one of the hypervisors.
4. Brought it back up again (see the sketch below).
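
A minimal sketch of how steps 3-4 could be scripted on the hypervisor; the interface name and outage duration are assumptions, not values from this run:

#!/usr/bin/env python
# Minimal sketch of steps 3-4: bring a NIC down, then back up.
# NIC and DOWNTIME_SECS are hypothetical; substitute the actual
# interface and outage window used on the hypervisor.
import subprocess
import time

NIC = "em1"          # hypothetical interface name
DOWNTIME_SECS = 120  # hypothetical outage duration

subprocess.check_call(["ip", "link", "set", NIC, "down"])
time.sleep(DOWNTIME_SECS)
subprocess.check_call(["ip", "link", "set", NIC, "up"])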

Actual results:
1) When the NIC is down, all the VMs show a question mark, i.e. an unknown state.
2) The engine VM is not accessible for a few minutes.
3) When the NIC is brought back up, one of the Windows VMs goes into a paused state.

Expected results:
The VM should not go into a paused state.

Additional info:

Comment 1 RamaKasturi 2016-04-25 11:01:49 UTC
sosreports are available here:

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1330044/

Comment 2 Sahina Bose 2016-04-26 11:21:25 UTC
Can you provide info on which node's NIC was brought down, and which node HE was running on at the time?

Comment 3 RamaKasturi 2016-04-26 11:36:39 UTC
There were three hosts: zod, sulphur, and tettnang.

zod - the NIC was brought down here

sulphur - primary server for gluster

tettnang - HE was running here

Comment 4 Sahina Bose 2016-04-27 11:29:15 UTC
Regarding the HE engine restarts - I think we need a separate bug to track this.
I see a periodic unmount and mount in the gluster logs, repeated every minute, and errors in the HE agent logs when accessing the storage domain. However, there are no related errors in the gluster logs.

Regarding the VM pause error on the zod server:
There are a lot of messages related to EIO like this:
var/log/vdsm/vdsm.log.1.xz:Thread-3503763::INFO::2016-04-25 14:56:36,103::clientIF::182::vds::(contEIOVms) Cont vm 74f7c2f0-150a-4076-9ade-9c467bcc922b in EIO

However, these VMs are then resumed, and only one continues to have an I/O error:

libvirtEventLoop::INFO::2016-04-25 14:56:37,532::vm::5084::virt.vm::(_logGuestCpuStatus) vmId=`1a6ad336-5317-40eb-a476-fbc70998e948`::CPU stopped: onIOError
libvirtEventLoop::DEBUG::2016-04-25 14:56:37,533::__init__::206::jsonrpc.Notification::(emit) Sending event {"params": {"1a6ad336-5317-40eb-a476-fbc70998e948": {"status": "Paused", "ioerror": {"alias": "virtio-disk0", "name": "vda", "path": "/rhev/data-center/00000001-0001-0001-0001-000000000128/297a9b9c-4396-4b30-8bfe-976a67d49a74/images/928c922c-65ab-453e-bf75-472cc41a1b31/c71e0e1e-7cfb-40a6-b85c-170705aa36e7"}, "pauseCode": "EIO"}, "notify_time": 5501919980}, "jsonrpc": "2.0", "method": "|virt|VM_status|1a6ad336-5317-40eb-a476-fbc70998e948"}



and in the gluster mount log for vmstore:

[2016-04-25 09:26:37.470109] E [MSGID: 108008] [afr-transaction.c:1981:afr_transaction] 0-vmstore-replicate-0: Failing WRITE on gfid a289ee5c-6ade-4c7d-954e-c7472cbcb284: split-brain observed. [Input/output error]
[2016-04-25 09:26:37.470240] W [fuse-bridge.c:1287:fuse_err_cbk] 0-glusterfs-fuse: 105212361: FSYNC() ERR => -1 (Input/output error)
[2016-04-25 09:26:37.470222] E [MSGID: 108008] [afr-transaction.c:1981:afr_transaction] 0-vmstore-replicate-0: Failing WRITE on gfid a289ee5c-6ade-4c7d-954e-c7472cbcb284: split-brain observed. [Input/output error]
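
For triage, a minimal sketch of pulling the affected gfids out of the mount log; the log path is an assumption and should be replaced with the actual fuse mount log for the vmstore volume:

#!/usr/bin/env python
# Minimal sketch: count split-brain write failures per gfid in a
# fuse mount log. LOG is a hypothetical path.
import re
from collections import Counter

LOG = "/var/log/glusterfs/rhev-data-center-mnt-vmstore.log"  # hypothetical
pattern = re.compile(r"MSGID: 108008.*gfid ([0-9a-f-]+): split-brain observed")

counts = Counter()
with open(LOG) as f:
    for line in f:
        m = pattern.search(line)
        if m:
            counts[m.group(1)] += 1

for gfid, n in counts.most_common():
    print("%s: %d failed writes" % (gfid, n))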

Moving to the gluster team.

Comment 9 Krutika Dhananjay 2016-05-17 09:20:14 UTC
http://review.gluster.org/14368 and http://review.gluster.org/14369 have been posted for review upstream. Moving this bug to POST state.

Comment 13 SATHEESARAN 2016-06-07 15:48:13 UTC
Tested with the RHGS 3.1.3 nightly build (glusterfs-3.7.9-6.el7rhgs) with the following tests:

1. Brought down the network on one of the nodes; the VMs running on that node got paused, but the other VMs kept running healthily.

2. The other test involved creating new shards while one brick of the replica 3 volume was down; that brick was brought back up after some time. The heal completed successfully and there were no VM pauses (a sketch of the heal check follows this list).
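
A minimal sketch of the heal check for test 2, assuming the volume name vmstore from this setup and a 30-second poll interval:

#!/usr/bin/env python
# Minimal sketch: wait until no entries are pending heal, then list
# any split-brain entries (there should be none). VOLUME and the
# poll interval are assumptions.
import re
import subprocess
import time

VOLUME = "vmstore"

def pending_entries():
    out = subprocess.check_output(
        ["gluster", "volume", "heal", VOLUME, "info"]).decode()
    # the CLI prints one "Number of entries: N" line per brick
    return sum(int(n) for n in re.findall(r"Number of entries: (\d+)", out))

while pending_entries() > 0:
    time.sleep(30)

print("heal complete; any split-brain entries:")
subprocess.call(["gluster", "volume", "heal", VOLUME, "info", "split-brain"])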

Based on the above observations, marking this bug as VERIFIED.

Comment 15 errata-xmlrpc 2016-06-23 05:19:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240