Bug 1590068 - HA VM died on both source and destination during failed migration, but was not restarted.
Summary: HA VM died on both source and destination during failed migration, but was not restarted.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.9
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ovirt-4.4.0
Target Release: 4.3.0
Assignee: Arik
QA Contact: Liran Rotenberg
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-12 01:43 UTC by Germano Veit Michel
Modified: 2021-09-09 14:34 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-13 12:45:49 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1590063 0 high CLOSED VM was destroyed on destination after successful migration due to missing the 'device' key on the lease device 2021-09-09 14:36:12 UTC
Red Hat Issue Tracker RHV-43520 0 None None None 2021-09-09 14:34:53 UTC

Internal Links: 1590063

Description Germano Veit Michel 2018-06-12 01:43:41 UTC
Description of problem:

A host was switched to Non-Operational due to a missing network (link down). This triggered a migration, which failed (see BZ1590063).
During the failed migration, the HA VM died on both the source and destination hosts. It was added to rerun treatment, but it was never started again.

1. Down on both src and dst during failed migration:
2018-06-10 15:27:50,834+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-3) [] VM '50cab403-bf5d-422f-9fac-d3ca3cc57a66' was reported as Down on VDS '412521ed-f0ad-4867-99e0-384bbca8105c'
2018-06-10 15:27:52,550+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-3) [] VM '50cab403-bf5d-422f-9fac-d3ca3cc57a66' was reported as Down on VDS 'b6732abc-9733-411b-bc6c-723315370d79'

2018-06-10 15:27:56,676+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler5) [2eb52fe3] add VM '50cab403-bf5d-422f-9fac-d3ca3cc57a66'(lxkasf102p0007) to rerun treatment
2018-06-10 15:27:56,682+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (DefaultQuartzScheduler5) [2eb52fe3] Rerun VM '50cab403-bf5d-422f-9fac-d3ca3cc57a66'. Called from VDS 'REMOVED'

2018-06-10 15:27:56,734+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-49) [2eb52fe3] EVENT_ID: VM_MIGRATION_FAILED(65), Correlation ID: 3feaf139, Job ID: 4f65901f-9e6f-47f7-b719-ae42467d906e, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Migration failed  (VM: REMOVED, Source: REMOVED).

The rerun never ran and the HA VM was down indefinitely until the user manually powered it on.

Could the intricate sequence of status changes during the migration have triggered a bug that prevented the VM from being started again?

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.9.2-0.1.el7.noarch

How reproducible:
Unknown

Comment 2 Michal Skrivanek 2018-06-12 04:40:35 UTC
Please also add vdsm and qemu logs from both hosts; the VM shouldn't have died on the source.

Comment 3 Germano Veit Michel 2018-06-12 04:44:24 UTC
(In reply to Michal Skrivanek from comment #2)
> Please also add vdsm and qemu logs from both hosts; the VM shouldn't have
> died on the source.

Sure, I'll attach them in a moment.

The migration completed and the source shut down. The libvirt/qemu part worked.

2018-06-10 15:27:48,562+0200 INFO  (migmon/50cab403) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') Migration Progress: 50 seconds elapsed, 99% of data processed, total data: 12385MB, processed data: 5415MB, remaining data: 149MB, transfer speed 111MBps, zero pages: 2002848MB, compressed: 0MB, dirty rate: 16322, memory iteration: 3 (migration:815)
2018-06-10 15:27:50,236+0200 INFO  (libvirt/events) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') CPU stopped: onSuspend (vm:5119)
2018-06-10 15:27:52,540+0200 INFO  (migsrc/50cab403) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') migration took 54 seconds to complete (migration:492)
2018-06-10 15:27:52,545+0200 INFO  (migsrc/50cab403) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') Changed state to Down: Migration succeeded (code=4) (vm:1259)
2018-06-10 15:27:52,546+0200 INFO  (migsrc/50cab403) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') Stopping connection (guestagent:430)
2018-06-10 15:27:52,555+0200 INFO  (jsonrpc/1) [vdsm.api] START destroy(gracefulAttempts=1) from=10.172.180.108,60298 (api:46)
2018-06-10 15:27:52,556+0200 INFO  (jsonrpc/1) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') Release VM resources (vm:4298)
2018-06-10 15:27:52,556+0200 WARN  (jsonrpc/1) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') trying to set state to Powering down when already Down (vm:361)
2018-06-10 15:27:52,556+0200 INFO  (jsonrpc/1) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') Stopping connection (guestagent:430)
2018-06-10 15:27:52,557+0200 INFO  (jsonrpc/1) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') _destroyVmGraceful attempt #0 (vm:4334)
2018-06-10 15:27:52,558+0200 INFO  (jsonrpc/1) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') VM '50cab403-bf5d-422f-9fac-d3ca3cc57a66' already down and destroyed (vm:4352)

Comment 4 Germano Veit Michel 2018-06-12 04:50:51 UTC
(In reply to Michal Skrivanek from comment #2)
> Please also add vdsm and qemu logs from both hosts; the VM shouldn't have
> died on the source.

Ahh, I confused this bug with the other one I just opened, sorry. 

Here is why the VM died on the source: https://bugzilla.redhat.com/show_bug.cgi?id=1590063

Comment 8 Sandro Bonazzola 2018-06-15 13:47:39 UTC
Moving to Virt / Michal, who is already investigating the issue.
This is a migration failure of a Highly Available VM, which is out of scope for the integration team.

Comment 9 Michal Skrivanek 2018-06-19 10:36:38 UTC
Thoughts on why the HA VM was not restarted?
The problem with the migration failure has been solved in bug 1590063, but that doesn't explain the failure to restart the HA VM.

Comment 10 Arik 2018-06-19 11:57:28 UTC
(In reply to Michal Skrivanek from comment #9)
> Thoughts on why the HA VM was not restarted?
> The problem with the migration failure has been solved in bug 1590063, but
> that doesn't explain the failure to restart the HA VM.

It's quite clear; that's a gap we have had for a long time.
The monitoring always compares what it knows to what is reported.
In this case, the migration failed in a way that switched the VM to Down on both source and destination, so the migrate command was terminated and set the VM to Down with run_on_vds = NULL.

We should be careful not to introduce a solution that would also retry restarting an HA VM when run-VM fails, since it should only be relevant to migrate-VM. We need to think a bit about how to implement it nicely (my intuition is that migrate-VM would register the VM with AutoStartVmRunner in that case), but that shouldn't be too complex.
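
Below is a minimal, illustrative sketch (in Java, since ovirt-engine is Java) of the direction described above: when a migrate command for an HA VM ends with the VM Down on both hosts, hand the VM over to the auto-start machinery instead of relying on the rerun path, which never fires in this scenario. Apart from the AutoStartVmRunner name taken from this comment, every class, method, and field below is a hypothetical stand-in, not the actual ovirt-engine API.

// Illustrative sketch only -- not the real ovirt-engine implementation.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class HaMigrationFailureSketch {

    /** Stand-in for the engine's AutoStartVmRunner (name taken from the comment above). */
    static class AutoStartVmRunnerStub {
        private final Set<String> pendingHaVms = ConcurrentHashMap.newKeySet();

        void addVmToRun(String vmId) {
            // The real engine would schedule periodic restart attempts for the VM;
            // here we only record the request.
            pendingHaVms.add(vmId);
            System.out.println("Queued HA VM " + vmId + " for automatic restart");
        }
    }

    /** Hypothetical hook called when a migrate command ends with the VM Down on both hosts. */
    static void onMigrationFailedVmDown(String vmId, boolean highlyAvailable,
                                        AutoStartVmRunnerStub autoStartVmRunner) {
        if (highlyAvailable) {
            // The rerun path cannot restart the VM here because it is already Down
            // and no longer assigned to a host (run_on_vds = NULL), so fall back
            // to the auto-start machinery.
            autoStartVmRunner.addVmToRun(vmId);
        }
    }

    public static void main(String[] args) {
        AutoStartVmRunnerStub runner = new AutoStartVmRunnerStub();
        onMigrationFailedVmDown("50cab403-bf5d-422f-9fac-d3ca3cc57a66", true, runner);
    }
}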

Comment 11 Arik 2018-06-19 17:58:49 UTC
(In reply to Arik from comment #10)
> (In reply to Michal Skrivanek from comment #9)
> > Thoughts on why the HA VM was not restarted?
> > The problem with the migration failure has been solved in bug 1590063, but
> > that doesn't explain the failure to restart the HA VM.
> 
> It's quite clear; that's a gap we have had for a long time.
> The monitoring always compares what it knows to what is reported.
> In this case, the migration failed in a way that switched the VM to Down on
> both source and destination, so the migrate command was terminated and set
> the VM to Down with run_on_vds = NULL.
> 
> We should be careful not to introduce a solution that would also retry
> restarting an HA VM when run-VM fails, since it should only be relevant to
> migrate-VM. We need to think a bit about how to implement it nicely (my
> intuition is that migrate-VM would register the VM with AutoStartVmRunner in
> that case), but that shouldn't be too complex.

That being said, the probability of this happening is very low (as evidence, this may be the first time we have encountered it). Setting the priority accordingly and moving to 4.3.

Comment 13 Ryan Barry 2019-01-21 13:34:23 UTC
Re-targeting, because these bugs either do not have blocker+, or do not have a patch posted

Comment 15 Daniel Gur 2019-08-28 13:11:44 UTC
sync2jira

Comment 16 Daniel Gur 2019-08-28 13:15:57 UTC
sync2jira

Comment 17 Ryan Barry 2019-09-13 12:45:49 UTC
Closing, since https://bugzilla.redhat.com/show_bug.cgi?id=1590063 likely resolved the underlying cause and there have not been any reproducers. Please re-open if this is observed again.

