Bug 1414626 - Crash VM during migrating with error "Failed in MigrateBrokerVDS"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.18.22
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.1.1
Target Release: 4.19.6
Assignee: Francesco Romani
QA Contact: Israel Pinto
URL:
Whiteboard:
Duplicates: 1413847
Depends On:
Blocks:
Reported: 2017-01-19 03:46 UTC by lifeman
Modified: 2017-04-21 09:35 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-21 09:35:41 UTC
oVirt Team: Virt
fromani: needinfo-
rule-engine: ovirt-4.1+
rule-engine: planning_ack+
tjelinek: devel_ack+
mavital: testing_ack+


Attachments
engine.log (5.09 MB, text/plain), 2017-01-19 03:46 UTC, lifeman
vdsm-node01.log (4.06 MB, text/plain), 2017-01-19 03:49 UTC, lifeman
vdsm-node02.log (9.04 MB, text/plain), 2017-01-19 03:51 UTC, lifeman
messages (4.70 MB, text/plain), 2017-01-24 08:38 UTC, lifeman
vm.log (8.47 KB, text/plain), 2017-01-24 08:39 UTC, lifeman


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 71158 0 master MERGED migration: don't require optional jobStats fields 2020-06-17 03:33:39 UTC
oVirt gerrit 71159 0 master MERGED migration: make progress reporting reliable 2020-06-17 03:33:39 UTC
oVirt gerrit 71160 0 master MERGED migration: add boolean to control retries 2020-06-17 03:33:39 UTC
oVirt gerrit 71300 0 master MERGED tests: migration: add test to exercise retry 2020-06-17 03:33:39 UTC
oVirt gerrit 71716 0 ovirt-4.1 MERGED migration: don't require optional jobStats fields 2020-06-17 03:33:39 UTC
oVirt gerrit 71717 0 ovirt-4.1 MERGED migration: make progress reporting reliable 2020-06-17 03:33:38 UTC
oVirt gerrit 71718 0 ovirt-4.1 MERGED migration: add boolean to control retries 2020-06-17 03:33:38 UTC
oVirt gerrit 71867 0 ovirt-4.0 MERGED migration: don't require optional jobStats fields 2020-06-17 03:33:38 UTC
oVirt gerrit 71868 0 ovirt-4.0 MERGED migration: make progress reporting reliable 2020-06-17 03:33:38 UTC
oVirt gerrit 71869 0 ovirt-4.0 MERGED migration: add boolean to control retries 2020-06-17 03:33:38 UTC

Description lifeman 2017-01-19 03:46:02 UTC
Created attachment 1242331 [details]
engine.log

Description of problem:

The VM crashed during migration with the error "Failed in MigrateBrokerVDS" at 10:24:54 PM.

Comment 1 lifeman 2017-01-19 03:49:15 UTC
Created attachment 1242332 [details]
vdsm-node01.log

Comment 2 lifeman 2017-01-19 03:51:23 UTC
Created attachment 1242333 [details]
vdsm-node02.log

Comment 3 Tomas Jelinek 2017-01-19 07:38:16 UTC
Could you please also attach the libvirt and qemu logs?

Comment 4 lifeman 2017-01-19 07:50:56 UTC
Where are the logs located? (/var/log/libvirt/qemu/vm.log?)

Comment 5 Tomas Jelinek 2017-01-24 08:00:18 UTC
Yes, that one, and also /var/log/messages.
If there is nothing interesting in them, we can enable libvirt debug logging and look at that.

Comment 6 lifeman 2017-01-24 08:38:39 UTC
Created attachment 1243854 [details]
messages

Comment 7 lifeman 2017-01-24 08:39:05 UTC
Created attachment 1243855 [details]
vm.log

Comment 8 Tomas Jelinek 2017-01-24 15:27:58 UTC
hmm:
2017-01-19 03:23:22.986+0000: initiating migration
2017-01-19 03:24:38.939+0000: shutting down
2017-01-19T03:24:39.464773Z qemu-kvm: terminating on signal 15 from pid 2669

@Francesco: any idea?

Comment 9 Tomas Jelinek 2017-01-25 10:21:43 UTC
*** Bug 1413847 has been marked as a duplicate of this bug. ***

Comment 10 Francesco Romani 2017-01-25 10:46:10 UTC
Looks like there are a couple of bugs in Vdsm.
1. Vdsm fails to retrieve the progress from the libvirt job stats. This is an issue in itself, as we fail to update the downtime, which could prevent the migration from converging, or make it converge more slowly.
2. There is a race in migration progress reporting. This could cause the progress meter to go backward, and it is much easier to trigger if we also hit bug #1. In this case, the race confused the migration source Vdsm, leading it to believe the migration had NOT completed, while it had. What happened:
2.a. migration attempt #1 completed, despite the lack of downtime adjustment
2.b. due to bug #1 and the race, the progress report was not correctly set to 100% after the migration completed
2.c. the migration source handler failed to detect that the migration had completed (because the progress was not 100% once it ended) and started a new one, which failed
2.d. the Engine only saw the last, failed migration (whose error was bogus) and acted accordingly

We will fix both issues.
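
To illustrate the two issues above, here is a minimal sketch, not the actual Vdsm patch. All names (progress_from_job_stats, MigrationAttempt, should_retry) are hypothetical, and the 'data_total' / 'data_remaining' keys are assumed for illustration. It shows one way to tolerate missing optional job stats fields, to keep the reported progress from going backward, and to decide on a retry from an explicit completion flag rather than from the progress value.

```python
# Hypothetical sketch; the names below are illustrative and not taken from Vdsm.


def progress_from_job_stats(stats, previous=0):
    """Return a progress percentage (0-99) computed from a job-stats dict.

    Assumption: 'data_total' and 'data_remaining' are optional keys.
    If either is missing, the previous value is kept instead of failing.
    """
    total = stats.get('data_total')
    remaining = stats.get('data_remaining')
    if not total or remaining is None:
        return previous
    percent = 100 * (total - remaining) // total
    # Stay below 100 while the job is still running, and never go backward.
    return max(previous, min(percent, 99))


class MigrationAttempt:
    """Tracks one migration attempt on the source side."""

    def __init__(self):
        self.progress = 0
        # Explicit flag, set only when the migration job reports success.
        self.completed = False

    def on_stats(self, stats):
        # Late stats updates must not overwrite the final 100%.
        if not self.completed:
            self.progress = progress_from_job_stats(stats, self.progress)

    def on_finished(self):
        self.completed = True
        self.progress = 100

    def should_retry(self):
        # Retry depends on the flag, not on whether progress reached 100.
        return not self.completed
```

With this shape, a stats update that arrives after completion (the race in point 2) cannot move the progress away from 100%, and a missing field (point 1) no longer aborts progress reporting.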

Comment 11 Francesco Romani 2017-02-06 15:59:22 UTC
The bug was actually in Vdsm, and has been fixed there. The Engine reacted according to the (false) information reported to it, so it is innocent.

Comment 12 Red Hat Bugzilla Rules Engine 2017-02-06 15:59:30 UTC
Target release should be set once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 13 Francesco Romani 2017-02-06 16:01:46 UTC
No doc_text; this is just a plain bug caused by an unusual, but possible, sequence of events.

Comment 14 Francesco Romani 2017-02-10 12:00:43 UTC
patches merged in the stable branch -> MODIFIED

Comment 15 Israel Pinto 2017-02-22 11:20:59 UTC
Verified with:
Red Hat Virtualization Manager Version: 4.1.1.2-0.1.el7

Ran the migration sanity tests; all passed.

