Description of problem:
On an automatic migration test, on a setup with 2 up RHEV-H 7.1 hosts, moving a host to maintenance was followed by a VM migration libvirt failure, seen in both the source and destination hosts' vdsm.log:

Thread-1337::DEBUG::2015-10-18 16:15:46,728::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 42 edom: 10 level: 2 message: Domain not found: no domain with matching uuid '6090ff3a-66e5-4c45-b4d6-e554e6f4e1cf' (golden_env_mixed_virtio_0)

Version-Release number of selected component (if applicable):
engine: rhevm vt-17.3
hosts: RHEV Hypervisor - 7.1 - 20151015.0.el7ev

How reproducible:
It did not reproduce manually on the same setup, and also did not reproduce when the same migration test was re-run (it passed twice in a row).

Steps to Reproduce (occurred on an automatic test):
1. Have 2 hosts up on the same cluster, with 1 VM running on one of the hosts.
2. Try to move the host that runs the VM (the source host) to maintenance. VM migration starts on both source and destination, but fails with the above libvirt error.
3. Eventually the source host moved to maintenance.

Test link:
https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/3.5/job/3.5-GE-Tier1/123/

Actual results:
VM migration failed.

Expected results:
VM migration should have succeeded.

Additional info:
There was an external VM running on the host, seen in engine/vdsm.log, but it does not seem to affect this bug.
Created attachment 1084424 [details] source host vdsm log Time on host is 3 hours behind engine time.
Created attachment 1084425 [details] destination host vdsm log Time on host is 3 hours behind engine time.
Created attachment 1084426 [details] engine log See time 19:15-19:18 for moving host to maintenance.
Ilanit, I see a lot of storage errors in the logs, which could indicate that there is something up with the storage (or the connection to it):

Thread-13::WARNING::2015-10-18 16:01:25,852::lvm::376::Storage.LVM::(_reloadvgs) lvm vgs failed: 5 [] [' /dev/mapper/360060160f4a0300084a443e9eca6e411: read failed after 0 of 4096 at 0: Input/output error', ' /dev/mapper/360060160f4a0300084a443e9eca6e411: read failed after 0 of 4096 at 53687025664: Input/output error', ' /dev/mapper/360060160f4a0300084a443e9eca6e411: read failed after 0 of 4096 at 53687083008: Input/output error', ' WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a0300084a443e9eca6e411 was disabled', ' /dev/mapper/360060160f4a03000c351a0fb87a0e411: read failed after 0 of 4096 at 0: Input/output error', ' /dev/mapper/360060160f4a03000c351a0fb87a0e411: read failed after 0 of 4096 at 161061208064: Input/output error', ' /dev/mapper/360060160f4a03000c351a0fb87a0e411: read failed after 0 of 4096 at 161061265408: Input/output error', ' WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a03000c351a0fb87a0e411 was disabled', ' /dev/mapper/360060160f4a03000db96cb1888a0e411: read failed after 0 of 4096 at 0: Input/output error', ' /dev/mapper/360060160f4a03000db96cb1888a0e411: read failed after 0 of 4096 at 161061208064: Input/output error', ' /dev/mapper/360060160f4a03000db96cb1888a0e411: read failed after 0 of 4096 at 161061265408: Input/output error', ' WARNING: Error counts reached a limit of 3. Device /dev/mapper/360060160f4a03000db96cb1888a0e411 was disabled', ' /dev/mapper/360060160f4a0300037a6d62988a0e411: read failed after 0 of 4096 at 0: Input/output error', ' /dev/mapper/360060160f4a0300037a6d62988a0e411: read failed after 0 of 4096 at 161061208064: Input/output error', ' /dev/mapper/360060160f4a0300037a6d62988a0e411: read failed after 0 of 4096 at 161061265408: Input/output error', ' WARNING: Error counts reached a limit of 3. 
Device /dev/mapper/360060160f4a0300037a6d62988a0e411 was disabled', ' /dev/mapper/b1c83183--642a--49d6--8002--e36d36c59cdf-metadata: read failed after 0 of 4096 at 0: Input/output error', ' /dev/mapper/b1c83183--642a--49d6--8002--e36d36c59cdf-metadata: read failed after 0 of 4096 at 536805376: Input/output error', ' /dev/mapper/b1c83183--642a--49d6--8002--e36d36c59cdf-metadata: read failed after 0 of 4096 at 536862720: Input/output error', ' WARNING: Error counts reached a limit of 3. Device /dev/mapper/b1c83183--642a--49d6--8002--e36d36c59cdf-metadata was disabled', <snip/>

Thread-13::ERROR::2015-10-18 16:01:25,860::task::866::Storage.TaskManager.Task::(_setError) Task=`62dcccc8-e187-4684-8fad-08330be5bf4a`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
  File "/usr/share/vdsm/storage/hsm.py", line 3114, in getVolumeSize
  File "/usr/share/vdsm/storage/sdc.py", line 98, in produce
  File "/usr/share/vdsm/storage/sdc.py", line 52, in getRealDomain
  File "/usr/share/vdsm/storage/sdc.py", line 122, in _realProduce
  File "/usr/share/vdsm/storage/sdc.py", line 141, in _findDomain
  File "/usr/share/vdsm/storage/sdc.py", line 171, in _findUnfetchedDomain
StorageDomainDoesNotExist: Storage domain does not exist: (u'406f9245-cb9c-4978-bbec-f783f96d761c',)

Could you run the failed test on a different set of hosts?
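As a side note, the repeated multipath read failures above can be tallied programmatically when triaging logs like these. A minimal sketch (the helper name and the embedded sample lines are illustrative, not taken from the full log) that counts I/O errors per multipath device:

```python
import re
from collections import Counter

# Illustrative sample in the same format as the lvm errors quoted above
LOG = """\
 /dev/mapper/360060160f4a0300084a443e9eca6e411: read failed after 0 of 4096 at 0: Input/output error
 /dev/mapper/360060160f4a0300084a443e9eca6e411: read failed after 0 of 4096 at 53687025664: Input/output error
 /dev/mapper/360060160f4a03000c351a0fb87a0e411: read failed after 0 of 4096 at 0: Input/output error
"""

IO_ERR = re.compile(r"(/dev/mapper/\S+): read failed .*Input/output error")

def io_error_counts(text):
    """Count 'read failed ... Input/output error' lines per multipath device."""
    return Counter(m.group(1)
                   for line in text.splitlines()
                   for m in [IO_ERR.search(line)] if m)

counts = io_error_counts(LOG)
# Devices reaching LVM's error limit of 3 get disabled, as the log above shows
```

Running something like this over both hosts' vdsm.log would quickly show which LUNs were flaky and when the errors started, relative to the migration.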
Nir, do you see any special reason for the storage failures? To me it looks like there was some random problem with the storage, which in turn caused the symptoms (the failed migration).
We ran the failing migration test again twice after this failure happened, on the same RHEV-H 7.1 hosts, and it passed. Also, I tried to reproduce it manually on the same hosts, and it didn't reproduce. I see no need to run the same test on other host(s), as it already failed to reproduce even on the same ones.
Returning needinfo on Nir to reply on Comment 5.
(In reply to Ilanit Stein from comment #6) > We ran the failing migration test again twice after this failure happened, > on the same rhev-h 7.1 hosts - and it passed. > > Also, I tried on the same hosts to reproduce it manually, and it didn't > reproduce. > > I see no need in running same test on another host/s, as it already didn't > reproduce even on the same host/s. Reducing severity per comment #6
Here is how migration occurred on both hosts:

(source) Migration started at 16:13:28,996 and finished at 16:15:46,726 with message: Migration succeeded (code=4).
(dest) Migration creation ran at 16:15:29,748 and 16:15:29,941 and finished with (migrationCreate) Destination VM creation succeeded.

I can see plenty of similar libvirt issues occurring after the storage issue described in comment #4. The first one is:

Thread-13::DEBUG::2015-10-18 16:01:25,890::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 42 edom: 10 level: 2 message: Domain not found: no domain with matching uuid '7f92da6c-f9d7-47f9-a183-ec2eff429781'

Every occurrence has a different uuid.
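To confirm the "every occurrence has a different uuid" observation at scale, the distinct uuids can be extracted from the "Domain not found" (libvirt ecode 42) lines. A minimal sketch (helper name and embedded sample lines are illustrative):

```python
import re

# Illustrative sample lines of the kind quoted above, not the full log
LOG = """\
Thread-13::DEBUG::2015-10-18 16:01:25,890::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 42 edom: 10 level: 2 message: Domain not found: no domain with matching uuid '7f92da6c-f9d7-47f9-a183-ec2eff429781'
Thread-1337::DEBUG::2015-10-18 16:15:46,728::libvirtconnection::143::root::(wrapper) Unknown libvirterror: ecode: 42 edom: 10 level: 2 message: Domain not found: no domain with matching uuid '6090ff3a-66e5-4c45-b4d6-e554e6f4e1cf'
"""

NOT_FOUND = re.compile(r"ecode: 42 .*no domain with matching uuid '([0-9a-f-]+)'")

def missing_domain_uuids(text):
    """Collect the distinct VM uuids reported as 'Domain not found' (ecode 42)."""
    return sorted({m.group(1) for m in NOT_FOUND.finditer(text)})

uuids = missing_domain_uuids(LOG)
```

If the resulting set on the real log contains many uuids rather than just the migrated VM's, that supports the reading that these errors are a broad symptom rather than specific to this migration.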
Hi, I was looking into the logs. In response to comment #4: the inaccessible devices are iSCSI LUNs, while this test specifically deals with VM migrations in which there are only NFS disks. The RAM migration isn't related to any storage domain.
Ilanit, can you please get the messages log from that time? I really can't find anything on the Jenkins page.
Created attachment 1086133 [details] src host messages log
Created attachment 1086134 [details] src host messages log
Created attachment 1086135 [details] dst host messages log
From all the above it sounds like a virt issue.
Further QE testing, run by ipinto & istein:

1. Manual test: run a VM on the SPM host, and move the host to maintenance - PASS.

2. Automation: host to maintenance in a loop of 100, where the host moved to maintenance is the SPM.
- Setup with RHEV-H 7.1 - PASS (100 cycles).
  KVM Version: 2.1.2 - 23.el7_1.10
  LIBVIRT Version: libvirt-1.2.8-16.el7_1.4
  VDSM Version: vdsm-4.16.27-1.el7ev
- Setup with RHEL 7.1 - PASS (100 cycles).
  KVM Version: 2.1.2 - 23.el7_1.10
  LIBVIRT Version: libvirt-1.2.8-16.el7_1.5
  VDSM Version: vdsm-4.16.27-1.el7ev

3. Reproduce trial - 3 cycles of all QE teams' Tier 1, on NFS - REPRODUCED [on another test case]:
1 out of these 3 runs failed on move host to maintenance, on a "System QE" test case, which is a different test case than the one that failed when the bug was reported.
https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/3.5/job/3.5-GE-Tier1/126/
Also here, storage errors were seen in the vdsm log, as in comment #4.

The problem might be related to an "external" VM existing on the host from a previous test run within that cycle, called 'vm_ram_snapshot_nfs', seen in the original source host vdsm.log and in the attached 'Further testing vdsm log'. It is not yet clear why this external VM exists.
Ilanit - I assume you've looked at the logs and did not find anything suspicious (for example, that the NIC was flapping: enp5s0f0 specifically, on the target).
Ilanit - any idea what this command is?

2015-10-18 19:18:47,107 INFO [org.ovirt.engine.core.bll.StopVmCommand] (ajp-/127.0.0.1:8702-12) [vms_syncAction_17e5462b-79f5-4a10] Running command: StopVmCommand internal: false. Entities affected : ID: 6090ff3a-66e5-4c45-b4d6-e554e6f4e1cf Type: VMAction group VM_BASIC_OPERATIONS with role type USER
2015-10-18 19:18:47,113 INFO [org.ovirt.engine.core.vdsbroker.DestroyVmVDSCommand] (ajp-/127.0.0.1:8702-12) [vms_syncAction_17e5462b-79f5-4a10] START, DestroyVmVDSCommand(HostName = host_mixed_2, HostId = e1e99113-ce07-4ed8-b356-079031229cc9, vmId=6090ff3a-66e5-4c45-b4d6-e554e6f4e1cf, force=false, secondsToWait=0, gracefully=false, reason=), log id: 40b5226c
2015-10-18 19:18:47,120 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ajp-/127.0.0.1:8702-12) [vms_syncAction_17e5462b-79f5-4a10] START, DestroyVDSCommand(HostName = host_mixed_2, HostId = e1e99113-ce07-4ed8-b356-079031229cc9, vmId=6090ff3a-66e5-4c45-b4d6-e554e6f4e1cf, force=false, secondsToWait=0, gracefully=false, reason=), log id: 37980911
2015-10-18 19:18:47,809 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ajp-/127.0.0.1:8702-12) [vms_syncAction_17e5462b-79f5-4a10] FINISH, DestroyVDSCommand, log id: 37980911
And VDSM, on destination:

libvirtEventLoop::DEBUG::2015-10-18 16:18:47,804::vm::2799::vm.Vm::(setDownStatus) vmId=`6090ff3a-66e5-4c45-b4d6-e554e6f4e1cf`::Changed state to Down: Admin shut down from the engine

Migration had succeeded earlier:

Thread-5575::DEBUG::2015-10-18 16:15:46,421::vm::4137::vm.Vm::(_completeIncomingMigration) vmId=`6090ff3a-66e5-4c45-b4d6-e554e6f4e1cf`::End of migration

So it looks like after a successful migration, the VM was shut down?
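The engine's DestroyVDSCommand at 19:18:47,809 and the vdsm setDownStatus at 16:18:47,804 line up once the 3-hour offset noted in the attachment comments is applied. When correlating the two logs, a small conversion helper avoids doing this by hand; a minimal sketch (the function name is illustrative, the offset comes from the attachment notes above):

```python
from datetime import datetime, timedelta

# Per the attachment notes, the hosts' clocks are 3 hours behind the engine
HOST_OFFSET = timedelta(hours=-3)

def engine_to_host(ts, fmt="%Y-%m-%d %H:%M:%S,%f"):
    """Convert an engine-log timestamp string to the matching host-log time."""
    shifted = datetime.strptime(ts, fmt) + HOST_OFFSET
    # strftime's %f emits 6 digits; trim to the logs' 3-digit milliseconds
    return shifted.strftime(fmt)[:-3]

# Engine's DestroyVDSCommand FINISH at 19:18:47,809 maps into vdsm.log time
host_ts = engine_to_host("2015-10-18 19:18:47,809")
```

With this, the engine-side destroy can be matched directly against the vdsm-side "Changed state to Down" event.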
This VM shutdown is part of the test teardown. It is run after the test had failed to move the source host to maintenance.

From the Jenkins run of the same failing test, mentioned in the description, https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/3.5/job/3.5-GE-Tier1/123/consoleFull:

19:18:40 2015-10-18 19:18:44,028 - MainThread - plmanagement.unittest_loader - INFO - TEST GROUP tearDown: rhevmtests.virt.migration
19:18:40 2015-10-18 19:18:44,029 - MainThread - Virt_Network_Migration_Init - INFO - Teardown...
.
.
.
19:18:41 2015-10-18 19:18:45,153 - MainThread - Virt_Network_Migration_Init - INFO - Running on golden env, stopping VM golden_env_mixed_virtio_0
19:18:42 2015-10-18 19:18:45,548 - MainThread - art.rhevm_api.tests_lib.low_level.vms - INFO - Got ip 10.35.163.202
19:18:43 2015-10-18 19:18:46,961 - MainThread - vms - INFO - Using Correlation-Id: vms_syncAction_17e5462b-79f5-4a10
19:18:44 2015-10-18 19:18:47,974 - MainThread - vms - INFO - vm status is 'down'

Closing this bug, as moving the host to maintenance failed as a result of an uncleaned environment, which included an external VM. The external VM, external-vm_ram_snapshot_nfs, was originally created by a storage test run earlier. ratamir (storage QE) had run the relevant test several times; it PASSED and didn't leave an external VM. We therefore concluded that the external VM was probably created by a past failure of this storage test.

A QE cleanup of host external VMs will be added before running Tier 1.

Michal, what is the expected behavior when moving a host to maintenance, in case there are external VMs on the host and there are available up hosts in the cluster?

Thanks,
Ilanit.
We don't really touch external VMs. They are considered independent and separate from the setup. And for maintenance we assume the operator wants to do something intrusive... so I guess requiring external VMs to be gone first is reasonable... even though, well, that may be debatable.