Description of problem:
After a live migration fails because it exceeds the timeout, RHEV-M logs and reports a "Domain not found: no domain with matching uuid" error to the user.

Version-Release number of selected component (if applicable):
rhevm-3.3.3-0.52.el6ev

How reproducible:
Frequently (always?)

Steps to Reproduce:
1. Live migrate a busy VM.
2. Wait for the live migration timeout to abort the migration on the source host.

Actual results:
RHEV-M reports a "Domain not found: no domain with matching uuid" error to the user.

Expected results:
No such error is reported to the user.

Additional info:
There are 3 separate questions to be answered here:
1. Why did the migration fail? In the log it appears to be a timeout, but why did it time out: overloaded CPU? Network? What was the memory size of that VM? Should we adjust the timeout calculation?
2. Why did the domain no longer exist on the destination host when the migration was cancelled?
3. Should we report such a flow to the engine's event log?
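Regarding question 1, a minimal sketch of a memory-scaled migration timeout. The 64 s/GiB factor and the function name are assumptions for illustration (inferred from VDSM's migration_max_time_per_gib_mem option and the 128-second limit logged for a 2 GiB VM later in this bug); this is not the authoritative VDSM code.

```python
# Hedged sketch: a per-GiB migration timeout. The 64 s/GiB default is an
# assumption matching VDSM's migration_max_time_per_gib_mem option; the
# rounding behavior here is illustrative, not VDSM's actual calculation.
def migration_timeout(mem_size_mib, max_time_per_gib=64):
    """Maximum allowed migration time in seconds for a VM of the given size."""
    mem_gib = max(mem_size_mib / 1024.0, 1.0)  # treat small VMs as 1 GiB
    return int(mem_gib * max_time_per_gib)
```

Under these assumptions, a 2 GiB VM gets a 128-second timeout, which matches the abort seen in the verification comment below.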
The monitoring is not supposed to create such an audit log for the destination host since http://gerrit.ovirt.org/#/c/9199/ was merged. In order to get this audit log, the VM had to have running_on_vds pointing to the destination host, but I don't see any log that indicates there was a hand-over to the destination. Maybe that's another side effect caused by issues that were already solved in that area (the race between maintenance reruns, transactional migrations, etc.). I suggest checking whether it reproduces in the latest version that includes those fixes.
(In reply to Arik from comment #5)

After further checking, here are the findings:
1. At 20:52:25, the source host was switched to maintenance. Because of the problem solved by bz 1110146, the migration of the VM we're interested in only started at 21:02:37.
2. Because of the problem solved by bz 1131856, the migrating_to_vds field of the migrated VM pointed to the source host.
3. At 21:07:46,359 (5 minutes after the previous MaintenanceNumberOfVdss attempt finished), we tried to switch the source host to maintenance again. As part of this attempt, we cancelled all the incoming migrations to this host, including the migration we're interested in (the cancel-migration operation succeeded).
4. At 21:07:46,738, a rerun attempt to migrate the VM was triggered.
5. At 21:14:42,722, the source host detected that the ongoing migration was taking longer than the maximum timeout, so it stopped the migration.
6. At 21:14:43,840, the qemu process on the destination host died.
7. At 21:14:43,857, the destination host detected that the domain had crashed.
8. At 21:14:43,914, there was a call to destroy on the destination host whose origin I don't know (an internal operation within vdsm?).
9. At 21:14:44,098, the destination host set the status of the VM to Down with reason: "Domain not found: no domain with matching uuid ...".
10. At 21:14:45,471, the monitoring in the engine sent a Destroy request (the VM was reported with Down status). The operation failed because there was no such domain (it was already destroyed), which explains the failed destroy operation.
11. The monitoring continued and produced this audit log, since it received the VM in Down state with an error.

Applying a fix similar to the one implemented in http://gerrit.ovirt.org/#/c/9199/ is not a good option for this case, since in some cases we do want to produce such an audit log, for example if the VM is in migration-to status, was already destroyed on the source, and then an error was encountered.
I think the problem resides in VDSM, which sets the status of the VM to Down. In this case the VM is already going to be destroyed (again, I think it is an internal operation in vdsm, because I don't see it coming from the engine), so VDSM should report it as migration-to until it is destroyed (or stop reporting it).
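A minimal sketch of the suggested VDSM-side behavior. All names here (VmStatus, reported_status, the boolean flags) are hypothetical and do not correspond to actual VDSM APIs; the point is only the masking logic: while an internally triggered destroy of an incoming migration is still in flight, keep reporting the migration-destination status instead of flipping to Down, so the engine's monitoring does not raise a spurious audit log.

```python
# Hedged sketch, not VDSM's real code: class and function names are
# hypothetical stand-ins for whatever VDSM uses internally.
class VmStatus:
    MIGRATION_DESTINATION = 'Migration Destination'
    DOWN = 'Down'

def reported_status(actual_status, is_incoming_migration, destroy_in_progress):
    """Status to expose to the engine for a VM on the destination host."""
    if (actual_status == VmStatus.DOWN
            and is_incoming_migration
            and destroy_in_progress):
        # Mask the transient Down state until the internal destroy completes.
        return VmStatus.MIGRATION_DESTINATION
    return actual_status
```

With this masking, the engine would only ever see the VM leave the migration-destination state once the destroy has finished, avoiding the Down-with-error report that triggers the audit log.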
http://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=commit;h=68aba2b12b90a997cee0f1e0221eb6f48eb8fd35
fixed in vt3, moving to on_qa. if you believe this bug isn't released in vt3, please report to rhev-integ
Created attachment 948246 [details] Host logs
Created attachment 948247 [details] engine logs
Checked with version 3.5.0-0.12.beta.el6ev. The message no longer appears.

Tested scenario:
1. VM with defined memory: 2G.
2. 2 hosts.
3. On the migrated VM, ran Linux stress with the command: stress --vm 1 --vm-bytes 512M --vm-hang 2 --timeout 3600s &
4. Started the migration: 2014-Oct-19, 14:38 Migration started (VM: test-02, Source: 10.35.4.161, Destination: 10.35.4.137, User: admin).
5. The migration failed after ~2 min; in the Events tab the message was: 2014-Oct-19, 14:40 Migration failed due to Error: Migration not in progress (VM: test-02, Source: 10.35.4.161, Destination: 10.35.4.137).

There is no record of the message "Domain not found: no domain with matching uuid".

*** From the vdsm log:
Thread-75::WARNING::2014-10-19 14:40:35,602::migration::435::vm.Vm::(monitor_migration) vmId=`6b3cd572-a7ce-4775-b405-4eb53e7a0968`::The migration took 130 seconds which is exceeding the configured maximum time for migrations of 128 seconds. The migration will be aborted.

The migration failed due to the timeout. See attached logs.
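The quoted monitor_migration warning corresponds to a simple elapsed-time check. A hedged sketch of that check follows; abort_cb is a hypothetical callback standing in for VDSM's actual abort of the libvirt migration job, and this is not VDSM's real implementation.

```python
import time

# Hedged sketch of the timeout check behind the monitor_migration warning
# quoted above. abort_cb is a hypothetical stand-in for aborting the
# underlying libvirt migration job.
def check_migration_timeout(started_at, max_seconds, abort_cb, now=time.time):
    """Abort the migration if it exceeded its allowed time; return True if aborted."""
    elapsed = int(now() - started_at)
    if elapsed > max_seconds:
        print("The migration took %d seconds which is exceeding the configured "
              "maximum time for migrations of %d seconds. The migration will "
              "be aborted." % (elapsed, max_seconds))
        abort_cb()
        return True
    return False
```

With started_at 130 seconds in the past and a 128-second limit, the check fires and invokes the abort callback, matching the 130 s > 128 s situation in the log line above.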
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0158.html