Description of problem:

A host was switched to Non-Operational due to a missing network (link down). This triggered a migration, which failed (see BZ1590063). During the failed migration, the HA VM died on both the source and destination hosts. It was added to the rerun treatment but was never started again.

1. Down on both src and dst during the failed migration:

2018-06-10 15:27:50,834+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-3) [] VM '50cab403-bf5d-422f-9fac-d3ca3cc57a66' was reported as Down on VDS '412521ed-f0ad-4867-99e0-384bbca8105c'
2018-06-10 15:27:52,550+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-3) [] VM '50cab403-bf5d-422f-9fac-d3ca3cc57a66' was reported as Down on VDS 'b6732abc-9733-411b-bc6c-723315370d79'
2018-06-10 15:27:56,676+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler5) [2eb52fe3] add VM '50cab403-bf5d-422f-9fac-d3ca3cc57a66'(lxkasf102p0007) to rerun treatment
2018-06-10 15:27:56,682+02 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmsMonitoring] (DefaultQuartzScheduler5) [2eb52fe3] Rerun VM '50cab403-bf5d-422f-9fac-d3ca3cc57a66'. Called from VDS 'REMOVED'
2018-06-10 15:27:56,734+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-49) [2eb52fe3] EVENT_ID: VM_MIGRATION_FAILED(65), Correlation ID: 3feaf139, Job ID: 4f65901f-9e6f-47f7-b719-ae42467d906e, Call Stack: null, Custom ID: null, Custom Event ID: -1, Message: Migration failed (VM: REMOVED, Source: REMOVED).

The rerun never ran and the HA VM stayed Down until the user manually powered it on. Could the intricate sequence of status changes during the failed migration have triggered a bug that prevented the VM from being started again?

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.9.2-0.1.el7.noarch

How reproducible:
Unknown
Please also add vdsm logs and qemu logs from both, the VM shouldn’t have died on the source
(In reply to Michal Skrivanek from comment #2)
> Please also add vdsm logs and qemu logs from both, the VM shouldn’t have
> died on the source

Sure, I'll attach them in a moment.

The migration completed and the source shut down. The libvirt/qemu part worked:

2018-06-10 15:27:48,562+0200 INFO (migmon/50cab403) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') Migration Progress: 50 seconds elapsed, 99% of data processed, total data: 12385MB, processed data: 5415MB, remaining data: 149MB, transfer speed 111MBps, zero pages: 2002848MB, compressed: 0MB, dirty rate: 16322, memory iteration: 3 (migration:815)
2018-06-10 15:27:50,236+0200 INFO (libvirt/events) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') CPU stopped: onSuspend (vm:5119)
2018-06-10 15:27:52,540+0200 INFO (migsrc/50cab403) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') migration took 54 seconds to complete (migration:492)
2018-06-10 15:27:52,545+0200 INFO (migsrc/50cab403) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') Changed state to Down: Migration succeeded (code=4) (vm:1259)
2018-06-10 15:27:52,546+0200 INFO (migsrc/50cab403) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') Stopping connection (guestagent:430)
2018-06-10 15:27:52,555+0200 INFO (jsonrpc/1) [vdsm.api] START destroy(gracefulAttempts=1) from=10.172.180.108,60298 (api:46)
2018-06-10 15:27:52,556+0200 INFO (jsonrpc/1) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') Release VM resources (vm:4298)
2018-06-10 15:27:52,556+0200 WARN (jsonrpc/1) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') trying to set state to Powering down when already Down (vm:361)
2018-06-10 15:27:52,556+0200 INFO (jsonrpc/1) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') Stopping connection (guestagent:430)
2018-06-10 15:27:52,557+0200 INFO (jsonrpc/1) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') _destroyVmGraceful attempt #0 (vm:4334)
2018-06-10 15:27:52,558+0200 INFO (jsonrpc/1) [virt.vm] (vmId='50cab403-bf5d-422f-9fac-d3ca3cc57a66') VM '50cab403-bf5d-422f-9fac-d3ca3cc57a66' already down and destroyed (vm:4352)
(In reply to Michal Skrivanek from comment #2)
> Please also add vdsm logs and qemu logs from both, the VM shouldn’t have
> died on the source

Ahh, I confused this bug with the other one I just opened, sorry. Here is why the VM died on the source: https://bugzilla.redhat.com/show_bug.cgi?id=1590063
Moving to Virt / Michal, who is already investigating the issue. This is a migration failure of a Highly Available VM, which is out of scope for the integration team.
Thoughts on why the HA VM was not restarted?
The problem with the migration failure has been solved in bug 1590063, but that doesn't explain the failure of the HA restart.
(In reply to Michal Skrivanek from comment #9)
> Thoughts on why the HA VM was not restarted?
> The problem with the migration failure has been solved in bug 1590063, but
> that doesn't explain the failure of the HA restart.

It's quite clear; that's a gap we have had for a long time.
The monitoring always compares what it knows to what is reported. In this case, the migration failed in a way that the VM switched to Down on both the source and the destination, and thus the migrate command was terminated, setting the VM to Down and run_on_vds = NULL.

We should be careful not to introduce a solution that would also retry starting an HA VM when run-VM fails; it should be relevant only to migrate-VM. I need to think a bit about how to implement it nicely (my intuition is that migrate-VM would register the VM with AutoStartVmRunner in that case), but that shouldn't be too complex.
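For illustration only, here is a minimal, self-contained Java sketch of the approach suggested in the previous comment. The names AutoStartVmRunner and the migrate-VM behaviour (VM set to Down, run_on_vds = NULL) come from this bug; everything else (class, field, and method signatures) is an assumption and not the real ovirt-engine API.

import java.util.UUID;

/**
 * Illustrative sketch only -- not the actual ovirt-engine code.
 * Idea: when a migration of an HA VM ends with the VM Down on both hosts,
 * hand the VM over to the auto-start machinery instead of only reporting
 * the migration failure, so it does not stay Down until a manual power-on.
 */
public class MigrationFailureSketch {

    enum VmStatus { UP, DOWN }

    /** Minimal stand-in for the engine's HA auto-start machinery. */
    interface AutoStartVmRunner {
        void addVmToRun(UUID vmId);
    }

    /** Minimal stand-in for the VM state as the engine tracks it. */
    static class VmState {
        UUID id;
        UUID runOnVds;        // null once the VM runs nowhere
        VmStatus status;
        boolean autoStartup;  // the "Highly Available" flag
    }

    private final AutoStartVmRunner autoStartVmRunner;

    MigrationFailureSketch(AutoStartVmRunner autoStartVmRunner) {
        this.autoStartVmRunner = autoStartVmRunner;
    }

    /** Called when the migration terminates with the VM running nowhere. */
    void onMigrationEndedWithVmDownEverywhere(VmState vm) {
        // What comment 10 describes happening today:
        vm.status = VmStatus.DOWN;
        vm.runOnVds = null;

        // The proposed addition: an HA VM that died during the failed
        // migration is queued for automatic restart, like an HA VM whose
        // host crashed would be.
        if (vm.autoStartup) {
            autoStartVmRunner.addVmToRun(vm.id);
        }
    }
}

The only new behaviour in the sketch is the final branch; without it the command stops after setting Down and clearing run_on_vds, and nothing ever starts the HA VM again.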
(In reply to Arik from comment #10)
> (In reply to Michal Skrivanek from comment #9)
> > Thoughts on why the HA VM was not restarted?
> > The problem with the migration failure has been solved in bug 1590063, but
> > that doesn't explain the failure of the HA restart.
>
> It's quite clear; that's a gap we have had for a long time.
> The monitoring always compares what it knows to what is reported. In this
> case, the migration failed in a way that the VM switched to Down on both the
> source and the destination, and thus the migrate command was terminated,
> setting the VM to Down and run_on_vds = NULL.
>
> We should be careful not to introduce a solution that would also retry
> starting an HA VM when run-VM fails; it should be relevant only to
> migrate-VM. I need to think a bit about how to implement it nicely (my
> intuition is that migrate-VM would register the VM with AutoStartVmRunner in
> that case), but that shouldn't be too complex.

That being said, the probability of this happening is very low (as evidence, this may be the first time we have encountered it). Setting the priority accordingly and moving to 4.3.
Re-targeting, because these bugs either do not have blocker+ or do not have a patch posted.
sync2jira
Closing, since https://bugzilla.redhat.com/show_bug.cgi?id=1590063 likely resolved the underlying cause, and there have not been any reproducers. Please re-open if this is observed again.