Bug 1104195 - "Domain not found: no domain with matching uuid" error logged to audit_log after live migration fails due to timeout exceeded
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.3.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.5.0
Assignee: Arik
QA Contact: Israel Pinto
URL:
Whiteboard: virt
Depends On:
Blocks: 1134974 rhev3.5beta 1156165
Reported: 2014-06-03 13:25 UTC by Julio Entrena Perez
Modified: 2019-04-28 10:47 UTC (History)

Fixed In Version: vt3
Doc Type: Bug Fix
Doc Text:
Previously, virtual machines that went down on a destination host as part of a migration operation were considered as having crashed. This would result in an incorrect audit log entry stating "Domain not found: no domain with matching uuid". With this update, the Manager no longer treats virtual machines that went down on a destination host as having crashed, preventing incorrect audit log entries from being recorded when a virtual machine goes down during a migration operation.
Clone Of:
: 1134974 (view as bug list)
Environment:
Last Closed: 2015-02-11 18:03:20 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:


Attachments
Host Losts (1.18 MB, application/zip)
2014-10-19 12:02 UTC, Israel Pinto
engine logs (1.12 MB, application/zip)
2014-10-19 12:06 UTC, Israel Pinto


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2015:0158 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Virtualization Manager 3.5.0 2015-02-11 22:38:50 UTC
oVirt gerrit 32137 0 'None' 'MERGED' 'core: prevent incorrect down events on migration' 2019-11-21 13:08:17 UTC
oVirt gerrit 32261 0 'None' 'MERGED' 'core: prevent incorrect down events on migration' 2019-11-21 13:08:17 UTC

Description Julio Entrena Perez 2014-06-03 13:25:01 UTC
Description of problem:
After a live migration fails because it exceeds the timeout, RHEV-M logs a "Domain not found: no domain with matching uuid" error and reports it to the user.

Version-Release number of selected component (if applicable):
rhevm-3.3.3-0.52.el6ev

How reproducible:
Frequently (always?)

Steps to Reproduce:
1. Live migrate a busy VM.
2. Wait for live migration timeout to abort live migration on source host.
3.

Actual results:
RHEV-M reports "Domain not found: no domain with matching uuid" error to the user.

Expected results:
No such error is reported to the user.

Additional info:

Comment 4 Barak 2014-06-08 15:05:28 UTC
There are 3 separate questions to be answered here:
1 - The reason the migration failed (in the log it appears to be a timeout, but
    why did it time out? Overloaded CPU? Network? What was the memory size of
    that VM?). Should we adjust the timeout calculation?
2 - The reason the domain didn't exist on the destination host when the
    migration was cancelled.
3 - Should we report such a flow to the engine's event log?

Comment 5 Arik 2014-08-26 15:57:08 UTC
The monitoring is not supposed to create such an audit log for the destination host since http://gerrit.ovirt.org/#/c/9199/ was merged.

In order to get this audit log, the VM had to have running_on_vds pointing to the destination host, but I don't see any log that indicates there was a hand-over to the destination.

Maybe this is another side effect of issues that were already solved in that area (race between maintenance reruns, transactional migrations etc). I suggest checking whether it reproduces in the latest version that includes those fixes.

Comment 6 Arik 2014-08-27 11:48:12 UTC
(In reply to Arik from comment #5)
After further checking, here are the findings:

1. At 20:52:25, the source host was switched to maintenance. Because of the problem which was solved by bz 1110146, the migration of the VM we're interested in was started at 21:02:37.

2. Because of the problem which was solved by bz 1131856, the migrating_to_vds field of the migrated VM pointed to the source host.

3. At 21:07:46,359 (5 minutes after the previous MaintenanceNumberOfVdss attempt finished), we tried to switch the source host to maintenance again. As part of this attempt, we canceled all the incoming migrations to this host, including the migration we're interested in (the cancel migration operation succeeded).

4. At 21:07:46,738, a rerun attempt to migrate the VM was triggered.

5. At 21:14:42,722, the source host detected that the ongoing migration was taking longer than the maximum timeout, so it stopped the migration.

6. At 21:14:43,840, the qemu process on the destination host died.

7. At 21:14:43,857, the destination host detected that the domain had crashed.

8. At 21:14:43,914, there was a destroy call on the destination host; I don't know where it came from (an internal operation within vdsm?).

9. At 21:14:44,098, the destination host set the status of the VM to Down with reason: Domain not found: no domain with matching uuid ...

10. At 21:14:45,471, the monitoring in the engine sent a Destroy request (the VM was reported with Down status). The operation failed because there was no such domain (it had already been destroyed), which explains the failed destroy operation.

11. The monitoring continued and produced this audit log because it received the VM in Down state with an error.
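
To make the spacing of the events in the timeline easier to see, here is a small Python sketch that computes the gaps between some of the timestamps listed in this comment. The helper name `elapsed_seconds` and the variable names are illustrative only and do not come from the engine or vdsm; the sketch also assumes all timestamps fall on the same day.

```python
from datetime import datetime

# Time-of-day format used in the timeline ("HH:MM:SS,mmm").
FMT = "%H:%M:%S,%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the seconds between two same-day timestamps (assumption: no midnight rollover)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Timestamps copied from the timeline; the variable names are hypothetical.
rerun_started = "21:07:46,738"      # rerun attempt triggered
migration_aborted = "21:14:42,722"  # source host stops the migration
down_reported = "21:14:44,098"      # destination reports the VM as Down
engine_destroy = "21:14:45,471"     # engine monitoring sends Destroy

if __name__ == "__main__":
    print("rerun -> abort:", elapsed_seconds(rerun_started, migration_aborted))
    print("abort -> Down reported:", elapsed_seconds(migration_aborted, down_reported))
    print("Down reported -> engine Destroy:", elapsed_seconds(down_reported, engine_destroy))
```

The rerun ran for roughly 416 seconds before the source host aborted it, and the destination reported the VM as Down less than two seconds after the abort.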

Doing a fix similar to the one implemented in http://gerrit.ovirt.org/#/c/9199/ is not a good option for this case, since in some cases we do want to produce such an audit log: for example, if the VM is in migration-to status, was already destroyed on the source, and then an error was encountered.

I think the problem resides in VDSM, which sets the status of the VM to Down. In this case the VM is already going to be destroyed (again, I think it is an internal operation in vdsm, because I don't see it coming from the engine), so VDSM should report it as migration-to until it is destroyed (or stop reporting it).
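
The distinction discussed above (a VM going Down on the destination before hand-over versus after hand-over) can be illustrated with a short Python sketch. This is not the merged patch and not ovirt-engine code (the engine is written in Java); the classes `VmRecord` and `HostReport` and the function `should_audit_down_with_error` are hypothetical, and only the field names running_on_vds and migrating_to_vds are taken from this thread.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class VmStatus(Enum):
    UP = "Up"
    MIGRATING_TO = "MigratingTo"
    DOWN = "Down"


@dataclass
class VmRecord:
    """Simplified, hypothetical engine-side view of a VM."""
    vm_id: str
    running_on_vds: Optional[str]    # host the engine believes runs the VM
    migrating_to_vds: Optional[str]  # destination of an ongoing migration, if any


@dataclass
class HostReport:
    """Simplified, hypothetical status a host reports for a VM during monitoring."""
    host_id: str
    vm_id: str
    status: VmStatus
    exit_message: Optional[str] = None


def should_audit_down_with_error(vm: VmRecord, report: HostReport) -> bool:
    """Decide whether a 'VM is down with error' audit log entry makes sense."""
    if report.status is not VmStatus.DOWN or not report.exit_message:
        return False
    # Down on the destination while the source still owns the VM:
    # a failed incoming migration, not a crash of the running VM.
    if report.host_id == vm.migrating_to_vds and report.host_id != vm.running_on_vds:
        return False
    # Otherwise (including after hand-over to the destination), report it.
    return True
```

For example, with `running_on_vds="src"` and `migrating_to_vds="dst"`, a Down report with an error from `dst` is suppressed, while the same report after hand-over (`running_on_vds="dst"`) still produces the audit log, which matches the case mentioned above where the entry is wanted.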

Comment 9 Eyal Edri 2014-09-10 20:21:50 UTC
Fixed in vt3, moving to on_qa.
If you believe this bug isn't released in vt3, please report to rhev-integ.

Comment 10 Israel Pinto 2014-10-19 12:02:35 UTC
Created attachment 948246 [details]
Host Losts

Comment 11 Israel Pinto 2014-10-19 12:06:15 UTC
Created attachment 948247 [details]
engine logs

Comment 12 Israel Pinto 2014-10-19 12:16:55 UTC
Checked with version: 3.5.0-0.12.beta.el6ev
The message no longer appears.

Check scenario:
1. VM with Defined Memory: 2G
2. 2 Hosts
3. On the migrated VM, ran Linux stress with the command:
    stress --vm 1 --vm-bytes 512M --vm-hang 2 --timeout 3600s &
4. Start migration:
   2014-Oct-19, 14:38
   Migration started (VM: test-02, Source: 10.35.4.161, Destination: 10.35.4.137, User: admin).
5. The migration failed after ~2 min; in the event tab the message was:
   2014-Oct-19, 14:40
   Migration failed due to Error: Migration not in progress (VM: test-02, Source: 10.35.4.161, Destination: 10.35.4.137).
   No record of the message:
   "Domain not found: no domain with matching uuid"

*** From the vdsm log:
Thread-75::WARNING::2014-10-19 14:40:35,602::migration::435::vm.Vm::(monitor_migration) vmId=`6b3cd572-a7ce-4775-b405-4eb53e7a0968`::The migration took 130 seconds which is exceeding the configured maximum time for migrations of 128 seconds. The migration will be aborted.
The migration failed due to the timeout.
See attached logs.
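
As a rough illustration of the timeout check seen in the vdsm log above, here is a Python sketch of a memory-scaled migration time limit. The per-GiB figure of 64 seconds is an assumption chosen because it reproduces the 128-second limit logged for this 2G VM; it is not stated in this report. The function names are hypothetical and this is not vdsm's actual code.

```python
import math

# Assumption: time budget per GiB of guest memory. Chosen so that a 2 GiB VM
# gets the 128-second limit seen in the log; not taken from this bug report.
ASSUMED_MAX_SECONDS_PER_GIB = 64


def max_migration_seconds(mem_mib: int,
                          seconds_per_gib: int = ASSUMED_MAX_SECONDS_PER_GIB) -> int:
    """Return the maximum allowed migration time, rounding memory up to whole GiB."""
    gib = math.ceil(mem_mib / 1024)
    return gib * seconds_per_gib


def should_abort(elapsed_seconds: float, mem_mib: int) -> bool:
    """Return True when the migration has run longer than its time budget."""
    return elapsed_seconds > max_migration_seconds(mem_mib)


if __name__ == "__main__":
    limit = max_migration_seconds(2048)  # VM with 2G of defined memory
    print("limit:", limit, "abort after 130s:", should_abort(130, 2048))
```

Under this assumption, the 130 seconds reported in the log exceed the 128-second budget for a 2G VM, so the migration is aborted.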

Comment 15 errata-xmlrpc 2015-02-11 18:03:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0158.html

