Description of problem:
After a live migration fails because it exceeds the timeout, RHEV-M logs and reports a "Domain not found: no domain with matching uuid" error to the user.

Version-Release number of selected component (if applicable):
rhevm-3.3.3-0.52.el6ev

How reproducible:
Frequently (always?)

Steps to Reproduce:
1. Live migrate a busy VM.
2. Wait for the live migration timeout to abort the migration on the source host.

Actual results:
RHEV-M reports a "Domain not found: no domain with matching uuid" error to the user.

Expected results:
No such error is reported to the user.

Additional info:
There are 3 separate questions to be answered here:
1. Why did the migration fail? In the log it appears to be a timeout, but why did it time out: overloaded CPU? Network? What was the memory size of that VM? Should we adjust the timeout calculation?
2. Why did the domain no longer exist on the destination host when the migration was cancelled?
3. Should we report such a flow to the engine's event log?
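Regarding question 1, a minimal sketch of a memory-scaled migration timeout. The 64 s/GiB factor and the function name are assumptions for illustration (inferred from VDSM's migration_max_time_per_gib_mem option and the 128-second limit logged for a 2 GiB VM later in this bug); this is not the authoritative VDSM code.

```python
# Hedged sketch: a per-GiB migration timeout. The 64 s/GiB default is an
# assumption matching VDSM's migration_max_time_per_gib_mem option; the
# rounding behavior here is illustrative, not VDSM's actual calculation.
def migration_timeout(mem_size_mib, max_time_per_gib=64):
    """Maximum allowed migration time in seconds for a VM of the given size."""
    mem_gib = max(mem_size_mib / 1024.0, 1.0)  # treat small VMs as 1 GiB
    return int(mem_gib * max_time_per_gib)
```

Under these assumptions, a 2 GiB VM gets a 128-second timeout, which matches the abort seen in the verification comment below.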
The monitoring is not supposed to create such an audit log for the destination host since http://gerrit.ovirt.org/#/c/9199/ was merged. In order to get this audit log, the VM had to have running_on_vds pointing to the destination host, but I don't see any log that indicates there was a hand-over to the destination. Maybe that's another side effect caused by issues that were already solved in that area (the race between maintenance reruns, transactional migrations, etc.). I suggest checking whether it reproduces in the latest version that includes those fixes.
(In reply to Arik from comment #5)

After further checking, here are the findings:
1. At 20:52:25, the source host was switched to maintenance. Because of the problem solved by bz 1110146, the migration of the VM we're interested in only started at 21:02:37.
2. Because of the problem solved by bz 1131856, the migrating_to_vds field of the migrated VM pointed to the source host.
3. At 21:07:46,359 (5 minutes after the previous MaintenanceNumberOfVdss attempt finished), we tried to switch the source host to maintenance again. As part of this attempt, we cancelled all the incoming migrations to this host, including the migration we're interested in (the cancel-migration operation succeeded).
4. At 21:07:46,738, a rerun attempt to migrate the VM was triggered.
5. At 21:14:42,722, the source host detected that the ongoing migration was taking longer than the maximum timeout, so it stopped the migration.
6. At 21:14:43,840, the qemu process on the destination host died.
7. At 21:14:43,857, the destination host detected that the domain had crashed.
8. At 21:14:43,914, there was a call to destroy on the destination host whose origin I don't know (an internal operation within vdsm?).
9. At 21:14:44,098, the destination host set the status of the VM to Down with reason: "Domain not found: no domain with matching uuid ...".
10. At 21:14:45,471, the monitoring in the engine sent a Destroy request (the VM was reported with Down status). The operation failed because there was no such domain (it was already destroyed), which explains the failed destroy operation.
11. The monitoring continued and produced this audit log, since it received the VM in Down state with an error.

Applying a fix similar to the one implemented in http://gerrit.ovirt.org/#/c/9199/ is not a good option for this case, since in some cases we do want to produce such an audit log, for example if the VM is in migration-to status, was already destroyed on the source, and then an error was encountered.
I think the problem resides in VDSM, which sets the status of the VM to Down. In this case the VM is already going to be destroyed (again, I think it is an internal operation in vdsm, because I don't see it coming from the engine), so VDSM should report it as migration-to until it is destroyed (or stop reporting it).
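A minimal sketch of the suggested VDSM-side behavior. All names here (VmStatus, reported_status, the boolean flags) are hypothetical and do not correspond to actual VDSM APIs; the point is only the masking logic: while an internally triggered destroy of an incoming migration is still in flight, keep reporting the migration-destination status instead of flipping to Down, so the engine's monitoring does not raise a spurious audit log.

```python
# Hedged sketch, not VDSM's real code: class and function names are
# hypothetical stand-ins for whatever VDSM uses internally.
class VmStatus:
    MIGRATION_DESTINATION = 'Migration Destination'
    DOWN = 'Down'

def reported_status(actual_status, is_incoming_migration, destroy_in_progress):
    """Status to expose to the engine for a VM on the destination host."""
    if (actual_status == VmStatus.DOWN
            and is_incoming_migration
            and destroy_in_progress):
        # Mask the transient Down state until the internal destroy completes.
        return VmStatus.MIGRATION_DESTINATION
    return actual_status
```

With this masking, the engine would only ever see the VM leave the migration-destination state once the destroy has finished, avoiding the Down-with-error report that triggers the audit log.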
http://gerrit.ovirt.org/gitweb?p=ovirt-engine.git;a=commit;h=68aba2b12b90a997cee0f1e0221eb6f48eb8fd35
fixed in vt3, moving to on_qa. if you believe this bug isn't released in vt3, please report to rhev-integ
Created attachment 948246 [details] Host logs
Created attachment 948247 [details] engine logs
Checked with version 3.5.0-0.12.beta.el6ev. The message no longer appears.

Tested scenario:
1. VM with defined memory: 2G.
2. 2 hosts.
3. On the migrated VM, ran Linux stress with the command: stress --vm 1 --vm-bytes 512M --vm-hang 2 --timeout 3600s &
4. Started the migration: 2014-Oct-19, 14:38 Migration started (VM: test-02, Source: 10.35.4.161, Destination: 10.35.4.137, User: admin).
5. The migration failed after ~2 min; in the Events tab the message was: 2014-Oct-19, 14:40 Migration failed due to Error: Migration not in progress (VM: test-02, Source: 10.35.4.161, Destination: 10.35.4.137).

There is no record of the message "Domain not found: no domain with matching uuid".

*** From the vdsm log:
Thread-75::WARNING::2014-10-19 14:40:35,602::migration::435::vm.Vm::(monitor_migration) vmId=`6b3cd572-a7ce-4775-b405-4eb53e7a0968`::The migration took 130 seconds which is exceeding the configured maximum time for migrations of 128 seconds. The migration will be aborted.

The migration failed due to the timeout. See attached logs.
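The quoted monitor_migration warning corresponds to a simple elapsed-time check. A hedged sketch of that check follows; abort_cb is a hypothetical callback standing in for VDSM's actual abort of the libvirt migration job, and this is not VDSM's real implementation.

```python
import time

# Hedged sketch of the timeout check behind the monitor_migration warning
# quoted above. abort_cb is a hypothetical stand-in for aborting the
# underlying libvirt migration job.
def check_migration_timeout(started_at, max_seconds, abort_cb, now=time.time):
    """Abort the migration if it exceeded its allowed time; return True if aborted."""
    elapsed = int(now() - started_at)
    if elapsed > max_seconds:
        print("The migration took %d seconds which is exceeding the configured "
              "maximum time for migrations of %d seconds. The migration will "
              "be aborted." % (elapsed, max_seconds))
        abort_cb()
        return True
    return False
```

With started_at 130 seconds in the past and a 128-second limit, the check fires and invokes the abort callback, matching the 130 s > 128 s situation in the log line above.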
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0158.html