Bug 1590068 - HA VM died on both source and destination during failed migration, but was not restarted.
Summary: HA VM died on both source and destination during failed migration, but was not restarted.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.1.9
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ovirt-4.4.0
Target Release: 4.3.0
Assignee: Arik
QA Contact: Liran Rotenberg
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-06-12 01:43 UTC by Germano Veit Michel
Modified: 2021-09-09 14:34 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-13 12:45:49 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1590063 0 high CLOSED VM was destroyed on destination after successful migration due to missing the 'device' key on the lease device 2021-09-09 14:36:12 UTC
Red Hat Issue Tracker RHV-43520 0 None None None 2021-09-09 14:34:53 UTC

Internal Links: 1590063

Description Germano Veit Michel 2018-06-12 01:43:41 UTC
Description of problem:

A host was switched to Non-Operational due to a missing network (link down). This triggered a migration, which failed (see BZ1590063).
During the failed migration, the HA VM died on both the source and destination hosts. It was added to rerun treatment, but it was never started again.

1. Down on both src and dst during failed migration:
2018-06-10 15:27:50,834+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-3) [] VM '50cab403-bf5d-422f-9fac-d3ca3cc57a66' was reported as Down on VDS '412521ed-f0ad-4867-99e0-384bbca8105c'
2018-06-10 15:27:52,550+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-3) [] VM '50cab403-bf5d-422f-9fac-d3ca3cc57a66' was reported as Down on VDS 'b6732abc-9733-411b-bc6c-723315370d79'

2018-06-10 15:27:56,676+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler5) [2eb52fe3] add VM '50cab403-bf5d-422f-9fac-d3ca3cc57a66'(lxkasf102p0007) to rerun treatment
2018-06-10 15:27:56,682+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (DefaultQuartzScheduler5) [2eb52fe3] Rerun VM '50cab403-bf5d-422f-9fac-d3ca3cc57a66'. Called from VDS 'REMOVED'

2018-06-10 15:27:56,734+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-49) [2eb52fe3] EVENT_ID: VM_MIGRATION_FAILED(65), Correlation ID: 3feaf139, Job ID: 4f65901f-9e6f-47f7-b719-ae42467d906e, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Migration failed  (VM: REMOVED, Source: REMOVED).

The rerun never ran and the HA VM was down indefinitely until the user manually powered it on.

Could the intricate sequence of status changes during the migration have triggered a bug that prevented the VM from being started again?

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.9.2-0.1.el7.noarch

How reproducible:
Unknown

Comment 2 Michal Skrivanek 2018-06-12 04:40:35 UTC
Please also add vdsm and qemu logs from both hosts; the VM shouldn't have died on the source.

Comment 3 Germano Veit Michel 2018-06-12 04:44:24 UTC
(In reply to Michal Skrivanek from comment #2)
> Please also add vdsm and qemu logs from both hosts; the VM shouldn't have
> died on the source.

Sure, I'll attach them in a moment.

The migration completed and the source shut down. The libvirt/qemu part worked.

2018-06-10 15:27:48,562+0200 INFO  (migmon/50cab403) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') Migration Progress: 50 seconds elapsed, 99% of data processed, total data: 12385MB, processed data: 5415MB, remaining data: 149MB, transfer speed 111MBps, zero pages: 2002848MB, compressed: 0MB, dirty rate: 16322, memory iteration: 3 (migration:815)
2018-06-10 15:27:50,236+0200 INFO  (libvirt/events) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') CPU stopped: onSuspend (vm:5119)
2018-06-10 15:27:52,540+0200 INFO  (migsrc/50cab403) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') migration took 54 seconds to complete (migration:492)
2018-06-10 15:27:52,545+0200 INFO  (migsrc/50cab403) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') Changed state to Down: Migration succeeded (code=4) (vm:1259)
2018-06-10 15:27:52,546+0200 INFO  (migsrc/50cab403) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') Stopping connection (guestagent:430)
2018-06-10 15:27:52,555+0200 INFO  (jsonrpc/1) [vdsm.api] START destroy(gracefulAttempts=1) from=10.172.180.108,60298 (api:46)
2018-06-10 15:27:52,556+0200 INFO  (jsonrpc/1) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') Release VM resources (vm:4298)
2018-06-10 15:27:52,556+0200 WARN  (jsonrpc/1) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') trying to set state to Powering down when already Down (vm:361)
2018-06-10 15:27:52,556+0200 INFO  (jsonrpc/1) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') Stopping connection (guestagent:430)
2018-06-10 15:27:52,557+0200 INFO  (jsonrpc/1) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') _destroyVmGraceful attempt #0 (vm:4334)
2018-06-10 15:27:52,558+0200 INFO  (jsonrpc/1) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') VM '50cab403-bf5d-422f-9fac-d3ca3cc57a66' already down and destroyed (vm:4352)

Comment 4 Germano Veit Michel 2018-06-12 04:50:51 UTC
(In reply to Michal Skrivanek from comment #2)
> Please also add vdsm and qemu logs from both hosts; the VM shouldn't have
> died on the source.

Ahh, I confused this bug with the other one I just opened, sorry. 

Here is why the VM died on the source: https://bugzilla.redhat.com/show_bug.cgi?id=1590063

Comment 8 Sandro Bonazzola 2018-06-15 13:47:39 UTC
Moving to Virt / Michal, who is already investigating the issue.
This is a migration failure of a Highly Available VM, which is out of scope for the integration team.

Comment 9 Michal Skrivanek 2018-06-19 10:36:38 UTC
Thoughts on why the HA VM was not restarted?
The problem with the migration failure has been solved in bug 1590063, but that doesn't explain the failure to restart the HA VM.

Comment 10 Arik 2018-06-19 11:57:28 UTC
(In reply to Michal Skrivanek from comment #9)
> Thoughts on why the HA VM was not restarted?
> The problem with the migration failure has been solved in bug 1590063, but
> that doesn't explain the failure to restart the HA VM.

It's quite clear; that's a gap we have had for a long time.
The monitoring always compares what it knows to what is reported.
In this case, the migration failed in a way that switched the VM to Down on both source and destination, so the migrate command was terminated and set the VM to Down with run_on_vds = NULL.

We should be careful not to introduce a solution that would also retry restarting an HA VM when run-VM fails, since it should only be relevant to migrate-VM. We need to think a bit about how to implement it nicely (my intuition is that migrate-VM would register the VM with AutoStartVmRunner in that case), but that shouldn't be too complex.
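
Below is a minimal, illustrative sketch (in Java, since ovirt-engine is Java) of the direction described above: when a migrate command for an HA VM ends with the VM Down on both hosts, hand the VM over to the auto-start machinery instead of relying on the rerun path, which never fires in this scenario. Apart from the AutoStartVmRunner name taken from this comment, every class, method, and field below is a hypothetical stand-in, not the actual ovirt-engine API.

// Illustrative sketch only -- not the real ovirt-engine implementation.
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class HaMigrationFailureSketch {

    /** Stand-in for the engine's AutoStartVmRunner (name taken from the comment above). */
    static class AutoStartVmRunnerStub {
        private final Set<String> pendingHaVms = ConcurrentHashMap.newKeySet();

        void addVmToRun(String vmId) {
            // The real engine would schedule periodic restart attempts for the VM;
            // here we only record the request.
            pendingHaVms.add(vmId);
            System.out.println("Queued HA VM " + vmId + " for automatic restart");
        }
    }

    /** Hypothetical hook called when a migrate command ends with the VM Down on both hosts. */
    static void onMigrationFailedVmDown(String vmId, boolean highlyAvailable,
                                        AutoStartVmRunnerStub autoStartVmRunner) {
        if (highlyAvailable) {
            // The rerun path cannot restart the VM here because it is already Down
            // and no longer assigned to a host (run_on_vds = NULL), so fall back
            // to the auto-start machinery.
            autoStartVmRunner.addVmToRun(vmId);
        }
    }

    public static void main(String[] args) {
        AutoStartVmRunnerStub runner = new AutoStartVmRunnerStub();
        onMigrationFailedVmDown("50cab403-bf5d-422f-9fac-d3ca3cc57a66", true, runner);
    }
}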

Comment 11 Arik 2018-06-19 17:58:49 UTC
(In reply to Arik from comment #10)
> (In reply to Michal Skrivanek from comment #9)
> > Thoughts on why the HA VM was not restarted?
> > The problem with the migration failure has been solved in bug 1590063, but
> > that doesn't explain the failure to restart the HA VM.
> 
> It's quite clear; that's a gap we have had for a long time.
> The monitoring always compares what it knows to what is reported.
> In this case, the migration failed in a way that switched the VM to Down on
> both source and destination, so the migrate command was terminated and set
> the VM to Down with run_on_vds = NULL.
> 
> We should be careful not to introduce a solution that would also retry
> restarting an HA VM when run-VM fails, since it should only be relevant to
> migrate-VM. We need to think a bit about how to implement it nicely (my
> intuition is that migrate-VM would register the VM with AutoStartVmRunner in
> that case), but that shouldn't be too complex.

That being said, the probability of this happening is very low (as evidence, this may be the first time we have encountered it). Setting the priority accordingly and moving to 4.3.

Comment 13 Ryan Barry 2019-01-21 13:34:23 UTC
Re-targeting, because these bugs either do not have blocker+, or do not have a patch posted

Comment 15 Daniel Gur 2019-08-28 13:11:44 UTC
sync2jira

Comment 16 Daniel Gur 2019-08-28 13:15:57 UTC
sync2jira

Comment 17 Ryan Barry 2019-09-13 12:45:49 UTC
Closing, since https://bugzilla.redhat.com/show_bug.cgi?id=1590063 likely resolved the underlying cause and there have not been any reproducers. Please re-open if this is observed again.

