Bug 1273078 - VM migration fails after moving host to maintenance.
Summary: VM migration fails after moving host to maintenance.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Nobody
QA Contact: Israel Pinto
URL:
Whiteboard: virt
Depends On:
Blocks:
 
Reported: 2015-10-19 14:11 UTC by Ilanit Stein
Modified: 2015-10-26 14:51 UTC
CC List: 18 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-10-26 13:08:08 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:


Attachments
source host vdsm log (1.63 MB, application/zip) - 2015-10-19 14:19 UTC, Ilanit Stein
destination host vdsm log (663.84 KB, application/zip) - 2015-10-19 14:20 UTC, Ilanit Stein
engine log (1.36 MB, application/zip) - 2015-10-19 14:25 UTC, Ilanit Stein
src host messages log (1.07 MB, text/plain) - 2015-10-24 21:02 UTC, Ilanit Stein
src host messages log (1.07 MB, text/plain) - 2015-10-24 21:04 UTC, Ilanit Stein
dst host messages log (645.94 KB, text/plain) - 2015-10-24 21:04 UTC, Ilanit Stein

Description Ilanit Stein 2015-10-19 14:11:15 UTC
Description of problem:
During an automated migration test, on a setup with two up RHEV-H 7.1 hosts, moving a host to maintenance was followed by a VM migration failure in libvirt, seen in vdsm.log on both the source and destination hosts:

Thread-1337::DEBUG::2015-10-18 16:15:46,728::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 42 edom: 10 level: 2 message: Domain not found: no domain with matching uuid '6090ff3a-66e5-4c45-b4d6-e554e6f4e1cf' (golden_env_mixed_virtio_0)

Version-Release number of selected component (if applicable):
engine: rhevm vt-17.3
hosts: RHEV Hypervisor - 7.1 - 20151015.0.el7ev

How reproducible:
It did not reproduce manually on the same setup, and it also did not reproduce when the same migration test was rerun (it passed twice in a row).

Steps to Reproduce:
Occurred in an automated test with these steps:
1. Have 2 hosts up in the same cluster, with 1 VM running on one of the hosts.
2. Try to move the host that runs the VM (the source host) to maintenance.
VM migration starts on both source and destination hosts, but fails with the above libvirt error (see the sketch after these steps).
3. Eventually the source host moves to maintenance.
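
For illustration, a minimal sketch of these steps using the oVirt Python SDK (ovirtsdk4) is shown below; the engine URL, credentials, and host name are placeholders, and the actual automation runs through the internal test framework rather than this script:

import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Hypothetical reproduction sketch: deactivate the host running the VM and wait
# for it to reach Maintenance, which should live-migrate the VM to the other Up
# host in the cluster. All connection details below are placeholders.
connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",
    username="admin@internal",
    password="password",
    insecure=True,
)
hosts_service = connection.system_service().hosts_service()
source_host = hosts_service.list(search="name=source_host")[0]  # placeholder host name
host_service = hosts_service.host_service(source_host.id)

# Step 2: move the source host to maintenance; the engine migrates its VMs away.
host_service.deactivate()

# Step 3: wait (up to ~10 minutes) for the host to actually reach Maintenance.
for _ in range(60):
    if host_service.get().status == types.HostStatus.MAINTENANCE:
        break
    time.sleep(10)

connection.close()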

Test link:
https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/3.5/job/3.5-GE-Tier1/123/


Actual results:
VM migration failed.

Expected results:
VM migration should have succeeded.

Additional info:
There was an external VM running on the host, seen in the engine and vdsm logs, but it does not appear to affect this bug.

Comment 1 Ilanit Stein 2015-10-19 14:19:07 UTC
Created attachment 1084424 [details]
source host vdsm log

Time on host is 3 hours behind engine time.

Comment 2 Ilanit Stein 2015-10-19 14:20:06 UTC
Created attachment 1084425 [details]
destination host vdsm log

Time on host is 3 hours behind engine time.
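
Since both host logs are 3 hours behind the engine log, a tiny helper like the following (an illustrative sketch, not part of any existing tool) can translate vdsm.log times into engine log times when correlating events:

from datetime import datetime, timedelta

# Assumption from the comments above: the host clock is 3 hours behind the engine clock.
HOST_TO_ENGINE_OFFSET = timedelta(hours=3)

def host_to_engine_time(host_ts):
    # vdsm.log timestamps look like '2015-10-18 16:15:46,728'
    ts = datetime.strptime(host_ts, "%Y-%m-%d %H:%M:%S,%f")
    return ts + HOST_TO_ENGINE_OFFSET

# 16:15:46 on the host corresponds to 19:15:46 in the engine log.
print(host_to_engine_time("2015-10-18 16:15:46,728"))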

Comment 3 Ilanit Stein 2015-10-19 14:25:55 UTC
Created attachment 1084426 [details]
engine log

See the 19:15-19:18 time range for moving the host to maintenance.

Comment 4 Fabian Deutsch 2015-10-20 08:14:01 UTC
Ilanit, I see a lot of storage errors in the logs, which could indicate that there is something up with the storage (or the connection to it):

Thread-13::WARNING::2015-10-18 16:01:25,852::lvm::376::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] ['  /dev/mapper/360060160f4a0300084a443e9eca6e411: read failed after 0 of 4096 at 0: Input/output error', '  /dev/mapper/360060160f4a0300084a443e9eca6e411: read failed after 0 of 4096 at 53687025664: Input/output error', '  /dev/mapper/360060160f4a0300084a443e9eca6e411: read failed after 0 of 4096 at 53687083008: Input/output error', '  WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a0300084a443e9eca6e411 was disabled', '  /dev/mapper/360060160f4a03000c351a0fb87a0e411: read failed after 0 of 4096 at 0: Input/output error', '  /dev/mapper/360060160f4a03000c351a0fb87a0e411: read failed after 0 of 4096 at 161061208064: Input/output error', '  /dev/mapper/360060160f4a03000c351a0fb87a0e411: read failed after 0 of 4096 at 161061265408: Input/output error', '  WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a03000c351a0fb87a0e411 was disabled', '  /dev/mapper/360060160f4a03000db96cb1888a0e411: read failed after 0 of 4096 at 0: Input/output error', '  /dev/mapper/360060160f4a03000db96cb1888a0e411: read failed after 0 of 4096 at 161061208064: Input/output error', '  /dev/mapper/360060160f4a03000db96cb1888a0e411: read failed after 0 of 4096 at 161061265408: Input/output error', '  WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a03000db96cb1888a0e411 was disabled', '  /dev/mapper/360060160f4a0300037a6d62988a0e411: read failed after 0 of 4096 at 0: Input/output error', '  /dev/mapper/360060160f4a0300037a6d62988a0e411: read failed after 0 of 4096 at 161061208064: Input/output error', '  /dev/mapper/360060160f4a0300037a6d62988a0e411: read failed after 0 of 4096 at 161061265408: Input/output error', '  WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a0300037a6d62988a0e411 was disabled', '  /dev/mapper/b1c83183--642a--49d6--8002--e36d36c59cdf-metadata: read failed after 0 of 4096 at 0: Input/output error', '  /dev/mapper/b1c83183--642a--49d6--8002--e36d36c59cdf-metadata: read failed after 0 of 4096 at 536805376: Input/output error', '  /dev/mapper/b1c83183--642a--49d6--8002--e36d36c59cdf-metadata: read failed after 0 of 4096 at 536862720: Input/output error', '  WARNING: Error counts reached a limit of 3. Device /dev/mapper/b1c83183--642a--49d6--8002--e36d36c59cdf-metadata was disabled',

<snip/>

Thread-13::ERROR::2015-10-18 16:01:25,860::task::866::Storage.TaskManager.Task::(_setError) Task=`62dcccc8-e187-4684-8fad-08330be5bf4a`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
  File "/usr/share/vdsm/storage/hsm.py", line 3114, in getVolumeSize
  File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
StorageDomainDoesNotExist: Storage domain does not exist: (u'406f9245-cb9c-4978-bbec-f783f96d761c',)

Could you run the failed test on a different set of hosts?
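
As a side note, a quick way to check whether those multipath devices are readable at all from the host would be something like the sketch below (illustrative only, to be run as root; the device paths are taken from the log excerpt above):

import subprocess

# Two of the /dev/mapper devices that LVM reported "read failed ... Input/output error" on.
DEVICES = [
    "/dev/mapper/360060160f4a0300084a443e9eca6e411",
    "/dev/mapper/360060160f4a03000c351a0fb87a0e411",
]

for dev in DEVICES:
    # Try to read the first 4 KiB directly, mirroring the 4096-byte reads that failed in the log.
    rc = subprocess.call(
        ["dd", "if=" + dev, "of=/dev/null", "bs=4096", "count=1", "iflag=direct"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    print("%s: %s" % (dev, "readable" if rc == 0 else "I/O error"))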

Comment 5 Fabian Deutsch 2015-10-20 08:28:35 UTC
Nir, do you see any special reason for the storage failures?
To me it looks like there was some random problem with the storage, which in turn caused the symptoms (failed migration).

Comment 6 Ilanit Stein 2015-10-20 13:15:52 UTC
We ran the failing migration test twice more after this failure happened, on the same RHEV-H 7.1 hosts, and it passed.

Also, I tried to reproduce it manually on the same hosts, and it did not reproduce.

I see no need to run the same test on other hosts, as it already failed to reproduce even on the same hosts.

Comment 7 Moti Asayag 2015-10-21 05:08:11 UTC
Returning needinfo to Nir to reply to comment 5.

Comment 8 Moti Asayag 2015-10-21 10:02:33 UTC
(In reply to Ilanit Stein from comment #6)
> We ran the failing migration test twice more after this failure happened,
> on the same RHEV-H 7.1 hosts, and it passed.
>
> Also, I tried to reproduce it manually on the same hosts, and it did not
> reproduce.
>
> I see no need to run the same test on other hosts, as it already failed to
> reproduce even on the same hosts.

Reducing severity per comment #6

Comment 9 Piotr Kliczewski 2015-10-21 15:02:59 UTC
Here is how migration occurred on both hosts:

(source)
Migration started at 16:13:28,996 and finished at 16:15:46,726 with message: Migration succeeded (code=4). 

(dest)
Migration was started at 16:15:29,748 and 16:15:29,941, and finished with: (migrationCreate) Destination VM creation succeeded.

I can see plenty of similar libvirt issues occurring after the storage issue described in comment #4. The first one is:

Thread-13::DEBUG::2015-10-18 16:01:25,890::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 42 edom: 10 level: 2 message: Domain not found: no domain with matching uuid '7f92da6c-f9d7-47f9-a183-ec2eff429781'

Every occurrence has a different uuid.
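
For reference, a rough log-scraping sketch like the one below (the vdsm.log path is an assumption) lists those "Domain not found" errors and shows that each one carries a different uuid:

import re
from collections import defaultdict

# Matches lines such as:
# Thread-13::DEBUG::2015-10-18 16:01:25,890::libvirtconnection::143::root::(wrapper)
#     Unknown libvirterror: ecode: 42 ... no domain with matching uuid '...'
PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*"
    r"Unknown libvirterror: ecode: 42 .*no domain with matching uuid '(?P<uuid>[0-9a-f-]+)'"
)

occurrences = defaultdict(list)
with open("vdsm.log") as log:  # assumed path to the host's vdsm log
    for line in log:
        match = PATTERN.search(line)
        if match:
            occurrences[match.group("uuid")].append(match.group("ts"))

for uuid, timestamps in sorted(occurrences.items()):
    print("%s: %d occurrence(s), first at %s" % (uuid, len(timestamps), timestamps[0]))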

Comment 10 Gilad Lazarovich 2015-10-22 08:19:34 UTC
Hi, I was looking into the logs and, in response to comment #4, the devices that are not accessible are iSCSI LUNs, while this test specifically deals with VM migrations that involve only NFS disks. The RAM migration is not related to any storage domain.

Comment 11 Michal Skrivanek 2015-10-22 14:25:35 UTC
Ilanit, can you please get the messages log from that time? I really can't find anything on the Jenkins page.

Comment 12 Ilanit Stein 2015-10-24 21:02:30 UTC
Created attachment 1086133 [details]
src host messages log

Comment 13 Ilanit Stein 2015-10-24 21:04:13 UTC
Created attachment 1086134 [details]
src host messages log

Comment 14 Ilanit Stein 2015-10-24 21:04:59 UTC
Created attachment 1086135 [details]
dst host messages log

Comment 15 Oved Ourfali 2015-10-25 09:00:03 UTC
From all the above it sounds like a virt issue.

Comment 16 Ilanit Stein 2015-10-25 12:47:37 UTC
Further QE testing, run by ipinto & istein:

1. Manual test: Run a VM on the SPM host and move the host to maintenance - PASS.

2. Automated host-to-maintenance in a loop of 100 iterations, where the host moved to maintenance is the SPM.

- setup with RHEV-Hs 7.1 - PASS (100 cycles).
KVM Version:2.1.2 - 23.el7_1.10
LIBVIRT Version: libvirt-1.2.8-16.el7_1.4
VDSM Version: vdsm-4.16.27-1.el7ev

- setup with RHEL 7.1 - PASS (100 cycles).
KVM Version:2.1.2 - 23.el7_1.10
LIBVIRT Version: libvirt-1.2.8-16.el7_1.5
VDSM Version: vdsm-4.16.27-1.el7ev

3. Reproduction trial - 3 cycles of all QE teams' Tier 1, on NFS - REPRODUCED
[on another test case]:
1 out of these 3 runs failed on moving a host to maintenance, on a "System QE" test case, which is a different test case than the one that failed when the bug was reported.

https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/3.5/job/3.5-GE-Tier1/126/

Here too, storage errors were seen in the vdsm log, as in comment #4.
The problem might be related to an "external" VM existing on the host from a previous test run within that cycle, called 'vm_ram_snapshot_nfs', seen in the original source host vdsm.log and in the attached 'Further testing vdsm log'.
It is not yet clear why this external VM exists.

Comment 17 Yaniv Kaul 2015-10-25 13:32:35 UTC
Ilanit - I assume you've looked at the logs and did not find anything suspicious (for example, that the NIC was flipping: enp5s0f0 specifically, on the target).

Comment 18 Yaniv Kaul 2015-10-25 14:14:51 UTC
Ilanit - any idea what this command is?

2015-10-18 19:18:47,107 INFO  [org.ovirt.engine.core.bll.StopVmCommand] (ajp-/127.0.0.1:8702-12) [vms_syncAction_17e5462b-79f5-4a10] Running command: StopVmCommand internal: false. Entities affected :  ID: 6090ff3a-66e5-4c45-b4d6-e554e6f4e1cf Type: VMAction group VM_BASIC_OPERATIONS with role type USER
2015-10-18 19:18:47,113 INFO  [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (ajp-/127.0.0.1:8702-12) [vms_syncAction_17e5462b-79f5-4a10] START, DestroyVmVDSCommand(HostName = host_mixed_2, HostId = e1e99113-ce07-4ed8-b356-079031229cc9, vmId=6090ff3a-66e5-4c45-b4d6-e554e6f4e1cf, force=false, secondsToWait=0, gracefully=false, reason=), log id: 40b5226c
2015-10-18 19:18:47,120 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ajp-/127.0.0.1:8702-12) [vms_syncAction_17e5462b-79f5-4a10] START, DestroyVDSCommand(HostName = host_mixed_2, HostId = e1e99113-ce07-4ed8-b356-079031229cc9, vmId=6090ff3a-66e5-4c45-b4d6-e554e6f4e1cf, force=false, secondsToWait=0, gracefully=false, reason=), log id: 37980911
2015-10-18 19:18:47,809 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ajp-/127.0.0.1:8702-12) [vms_syncAction_17e5462b-79f5-4a10] FINISH, DestroyVDSCommand, log id: 37980911

Comment 19 Yaniv Kaul 2015-10-25 14:20:57 UTC
And VDSM, on destination:
libvirtEventLoop::DEBUG::2015-10-18 16:18:47,804::vm::2799::vm.Vm::(setDownStatus) vmId=`6090ff3a-66e5-4c45-b4d6-e554e6f4e1cf`::Changed state to Down: Admin shut down from the engine

Migration has succeeded earlier:
Thread-5575::DEBUG::2015-10-18 16:15:46,421::vm::4137::vm.Vm::(_completeIncomingMigration) vmId=`6090ff3a-66e5-4c45-b4d6-e554e6f4e1cf`::End of migration


So it looks like after successful migration, the VM was shut down?

Comment 20 Ilanit Stein 2015-10-26 13:08:08 UTC
This VM shutdown is part of the test teardown. It was run after the test had failed to move the source host to maintenance.
  
From the Jenkins run of the same failing test mentioned in the description,
https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/3.5/job/3.5-GE-Tier1/123/consoleFull:

 19:18:40 2015-10-18 19:18:44,028 - MainThread - plmanagement.unittest_loader - INFO - TEST GROUP tearDown: rhevmtests.virt.migration
19:18:40 2015-10-18 19:18:44,029 - MainThread - Virt_Network_Migration_Init - INFO - Teardown...
. . . 
19:18:41 2015-10-18 19:18:45,153 - MainThread - Virt_Network_Migration_Init - INFO - Running on golden env, stopping VM golden_env_mixed_virtio_0
19:18:42 2015-10-18 19:18:45,548 - MainThread - art.rhevm_api.tests_lib.low_level.vms - INFO - Got ip 10.35.163.202
19:18:43 2015-10-18 19:18:46,961 - MainThread - vms - INFO - Using Correlation-Id: vms_syncAction_17e5462b-79f5-4a10
19:18:44 2015-10-18 19:18:47,974 - MainThread - vms - INFO - vm status is 'down'


Closing the bug, as moving the host to maintenance failed as a result of an unclean environment, which included an external VM.

The external VM, external-vm_ram_snapshot_nfs, was originally created by a storage test run earlier.
ratamir (storage QE) ran the relevant test several times, and it PASSED without leaving an external VM behind.

Therefore we concluded that the external VM was probably created by a past failure of this storage test.

A QE cleanup of external VMs on the hosts will be added before running Tier 1.
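
For illustration, such a cleanup could look roughly like the sketch below (ovirtsdk4; the engine URL and credentials are placeholders, and this is not the actual cleanup being added by QE):

import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",  # placeholder engine URL
    username="admin@internal",
    password="password",
    insecure=True,
)
vms_service = connection.system_service().vms_service()

# The engine names VMs it did not start itself with an "external-" prefix.
for vm in vms_service.list():
    if vm.name.startswith("external-"):
        vm_service = vms_service.vm_service(vm.id)
        if vm.status != types.VmStatus.DOWN:
            vm_service.stop()
            # Wait for the leftover VM to reach Down before removing it.
            while vm_service.get().status != types.VmStatus.DOWN:
                time.sleep(5)
        vm_service.remove()

connection.close()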

Michal,
What is the expected behavior when moving a host to maintenance in case there are external VMs on the host and there are other available Up hosts in the cluster?

Thanks,
Ilanit.

Comment 21 Michal Skrivanek 2015-10-26 14:51:25 UTC
We don't really touch external VMs; they are considered independent and separate from the setup. And for maintenance we assume the operator wants to do something intrusive... so I guess requesting external VMs to be gone is reasonable... even though... well, maybe debatable.

