Description of problem: A Windows VM goes into the paused state when one of the NICs on the hypervisor is brought down and then brought back up.

Version-Release number of selected component (if applicable):
rhevm-appliance-20160323.1-1.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5.1-1.el7ev.noarch

How reproducible: Hit it once

Steps to Reproduce:
1. Installed an HC (hyperconverged) setup.
2. Bootstrapped 30 VMs, a mix of Linux and Windows VMs.
3. Brought down a NIC on one of the hypervisors.
4. Brought the NIC back up.

Actual results:
1) While the NIC is down, all the VMs show a question mark, i.e. they are in an unknown state.
2) The engine VM is not accessible for a few minutes.
3) After the NIC is brought back up, one of the Windows VMs goes into the paused state.

Expected results: The VM should not go into the paused state.

Additional info:
sos reports are available here: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1330044/
Can you provide info on which node had its NIC brought down, and which node HE was running on at that time?
There were three hosts: zod, sulphur and tettnang.
zod - NIC was brought down
sulphur - primary server for gluster
tettnang - HE was running
Regarding the HE engine restarts - I think we need a separate bug to track this. I see a periodic umount and mount in the gluster logs, repeated every minute, and errors in the HE agent logs on accessing the storage domain. No related errors in the gluster logs, however.

Regarding the VM pause error on the zod server: there are a lot of EIO-related messages like this:

var/log/vdsm/vdsm.log.1.xz:Thread-3503763::INFO::2016-04-25 14:56:36,103::clientIF::182::vds::(contEIOVms) Cont vm 74f7c2f0-150a-4076-9ade-9c467bcc922b in EIO

However, these VMs are then resumed, and only one continues to hit an IO error:

libvirtEventLoop::INFO::2016-04-25 14:56:37,532::vm::5084::virt.vm::(_logGuestCpuStatus) vmId=`1a6ad336-5317-40eb-a476-fbc70998e948`::CPU stopped: onIOError
libvirtEventLoop::DEBUG::2016-04-25 14:56:37,533::__init__::206::jsonrpc.Notification::(emit) Sending event {"params": {"1a6ad336-5317-40eb-a476-fbc70998e948": {"status": "Paused", "ioerror": {"alias": "virtio-disk0", "name": "vda", "path": "/rhev/data-center/00000001-0001-0001-0001-000000000128/297a9b9c-4396-4b30-8bfe-976a67d49a74/images/928c922c-65ab-453e-bf75-472cc41a1b31/c71e0e1e-7cfb-40a6-b85c-170705aa36e7"}, "pauseCode": "EIO"}, "notify_time": 5501919980}, "jsonrpc": "2.0", "method": "|virt|VM_status|1a6ad336-5317-40eb-a476-fbc70998e948"}

and in the gluster mount log for vmstore:

[2016-04-25 09:26:37.470109] E [MSGID: 108008] [afr-transaction.c:1981:afr_transaction] 0-vmstore-replicate-0: Failing WRITE on gfid a289ee5c-6ade-4c7d-954e-c7472cbcb284: split-brain observed. [Input/output error]
[2016-04-25 09:26:37.470240] W [fuse-bridge.c:1287:fuse_err_cbk] 0-glusterfs-fuse: 105212361: FSYNC() ERR => -1 (Input/output error)
[2016-04-25 09:26:37.470222] E [MSGID: 108008] [afr-transaction.c:1981:afr_transaction] 0-vmstore-replicate-0: Failing WRITE on gfid a289ee5c-6ade-4c7d-954e-c7472cbcb284: split-brain observed. [Input/output error]

Moving to the gluster team.
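For anyone triaging a similar report, the split-brain observed above can be confirmed and inspected from the CLI. A minimal sketch, using the volume name (vmstore) and GFID taken from the logs above; the brick path is hypothetical and the heal policy shown is just one of the available options:

```shell
# List entries AFR currently considers split-brained on the vmstore volume
# (run on any node in the trusted storage pool):
gluster volume heal vmstore info split-brain

# Inspect the AFR changelog xattrs for the affected GFID on each brick to
# see which replicas accuse each other (brick path is hypothetical):
getfattr -d -m . -e hex \
    /bricks/brick1/vmstore/.glusterfs/a2/89/a289ee5c-6ade-4c7d-954e-c7472cbcb284

# Resolve by choosing a source copy, e.g. the one with the latest mtime
# (other policies: bigger-file, source-brick):
gluster volume heal vmstore split-brain latest-mtime \
    gfid:a289ee5c-6ade-4c7d-954e-c7472cbcb284
```

Note this only resolves an existing split-brain; the patches referenced below address the root cause that let the split-brain occur in the first place.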
http://review.gluster.org/14368 and http://review.gluster.org/14369 have been posted for review upstream. Moving this bug to the POST state.
Tested with an RHGS 3.1.3 nightly build (glusterfs-3.7.9-6.el7rhgs) with the following tests:
1. Brought down the network on one of the nodes; the VMs running on that node got paused, but the other VMs kept running healthily.
2. Created new shards while one of the replica-3 bricks was down, then brought that brick back up after some time. The heal completed successfully and there were no VM pauses.
Based on the above observations, marking this bug as VERIFIED.
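The verification steps above can be sketched roughly as follows. This is illustrative only; the interface name (ens3) is hypothetical, and the volume name (vmstore) is taken from the earlier comments:

```shell
# Test 2 sketch: take one replica-3 brick offline by downing its NIC
# on the storage node (interface name is hypothetical):
ip link set dev ens3 down

# ... keep VM I/O running so new shards are created while the brick
# is down, then bring the brick's network back up:
ip link set dev ens3 up

# Watch self-heal drain; pending entries should reach zero, and there
# should be no split-brain entries and no VM pauses:
gluster volume heal vmstore info
gluster volume heal vmstore info split-brain
```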
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240