Bug 1795726 - after_migration is not sent to the guest after migration [RHV clone - 4.3.9]
Summary: after_migration is not sent to the guest after migration [RHV clone - 4.3.9]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.3.7
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.3.9
Target Release: 4.3.9
Assignee: Milan Zamazal
QA Contact: Beni Pelled
URL:
Whiteboard:
Depends On: 1788783
Blocks:
 
Reported: 2020-01-28 17:22 UTC by RHV bug bot
Modified: 2023-09-07 21:37 UTC
CC List: 7 users

Fixed In Version: vdsm-4.30.42
Doc Type: Bug Fix
Doc Text:
Before this update, when migrating a virtual machine, information about the running guest agent wasn't always passed to the destination host, resulting in an after_migration life cycle event not being sent to the virtual machine on the destination host. This update fixes the issue, and the after_migration notification should work as expected.
Clone Of: 1788783
Environment:
Last Closed: 2020-04-02 16:32:08 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:
lsvaty: testing_plan_complete-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4781491 0 None None None 2020-02-03 05:33:03 UTC
Red Hat Product Errata RHBA-2020:1307 0 None None None 2020-04-02 16:32:11 UTC
oVirt gerrit 106556 0 master MERGED virt: Update guest agent API version before making migration params 2020-12-10 17:32:34 UTC
oVirt gerrit 106677 0 ovirt-4.3 MERGED virt: Update guest agent API version before making migration params 2020-12-10 17:32:34 UTC

Description RHV bug bot 2020-01-28 17:22:18 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1788783 +++
======================================================================

Description of problem:

Sometimes the after_migration lifecycle event does not reach the Guest.

It looks like this is because GuestAPIVersion sometimes goes 3 -> 0 -> 3 instead of staying at 3, and the event is triggered by VDSM (in _vmDependentInit) while it is 0, so it is not sent to the Guest.
It hits this check instead of sending the event to the Guest: https://github.com/oVirt/vdsm/blob/ovirt-4.3/lib/vdsm/virt/guestagent.py#L332
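
For illustration, here is a minimal, self-contained sketch of the kind of version gate involved (not vdsm's actual code; the class name and the required-version mapping are made up):

import logging

# Messages that need a minimum guest agent API version; illustrative values.
_MIN_API_VERSION = {'lifecycle-event': 3}


class GuestAgentChannelSketch:
    def __init__(self):
        # 0 means "guest agent API version not known yet".
        self.guest_api_version = 0

    def send_lifecycle_event(self, event_type):
        required = _MIN_API_VERSION['lifecycle-event']
        if self.guest_api_version < required:
            # Corresponds to the "GuestAgentUnsupportedMessage: 3 > 0" log line
            # below: the event is dropped instead of being written to the channel.
            logging.warning('GuestAgentUnsupportedMessage: %d > %d',
                            required, self.guest_api_version)
            return False
        # Otherwise the message would be serialized and written to the
        # ovirt-guest-agent virtio channel, e.g.
        # '{"__name__": "lifecycle-event", "type": "after_migration"}'.
        return True

If the version is still 0 when _vmDependentInit fires, the event is lost; nothing re-sends it once the version is later refreshed to 3.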

Added some debug to confirm, and the Guest API Version is 0 in VDSM at that point, so the event is not sent:

2020-01-08 14:50:47,285+1000 INFO  (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Starting connection (guestagent:256)
2020-01-08 14:50:47,286+1000 DEBUG (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') send_lifecycle_event after_migration called (guestagent:566)
2020-01-08 14:50:47,286+1000 DEBUG (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') GuestAgentUnsupportedMessage: 3 > 0 (guestagent:333)
2020-01-08 14:50:47,287+1000 INFO  (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') CPU running: domain initialization (vm:6062)

Shortly after, GuestAPIVersion goes to 3.

2020-01-08 14:50:47,303+1000 INFO  (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') End of migration (vm:4088)
2020-01-08 14:50:47,649+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Attempting connection to /var/lib/libvirt/qemu/channels/ac7757ca-4fc8-4625-a91c-38f518064140.ovirt-guest-agent.0 (guestagent:310)
2020-01-08 14:50:47,649+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Connected to /var/lib/libvirt/qemu/channels/ac7757ca-4fc8-4625-a91c-38f518064140.ovirt-guest-agent.0 (guestagent:314)
2020-01-08 14:50:47,649+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') sent '{"__name__": "refresh", "apiVersion": 3}\n' (guestagent:343)
2020-01-08 14:50:47,957+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Guest's message heartbeat: {u'memory-stat': {u'swap_out': 0, u'majflt': 2, u'swap_usage': 0, u'mem_cached': 43572, u'mem_free': 2045264, u'mem_buffers': 0, u'swap_in': 0, u'swap_total': 2088956, u'pageflt': 83, u'mem_total': 2236884, u'mem_unused': 2001692}, u'free-ram': u'1997', u'apiVersion': 3} (guestagent:346)
2020-01-08 14:50:47,957+1000 INFO  (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Guest API version changed from 0 to 3 (guestagent:292)  <======= If API Version is sent from SRC to DST, why do we see this?

But when it works during the migration, it looks like this:

2020-01-08 15:03:33,803+1000 INFO  (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Starting connection (guestagent:256)
2020-01-08 15:03:33,804+1000 DEBUG (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') send_lifecycle_event after_migration called (guestagent:566)
2020-01-08 15:03:34,524+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Attempting connection to /var/lib/libvirt/qemu/channels/ac7757ca-4fc8-4625-a91c-38f518064140.ovirt-guest-agent.0 (guestagent:310)
2020-01-08 15:03:34,524+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Connected to /var/lib/libvirt/qemu/channels/ac7757ca-4fc8-4625-a91c-38f518064140.ovirt-guest-agent.0 (guestagent:314)
2020-01-08 15:03:34,524+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') sent '{"__name__": "refresh", "apiVersion": 3}\n' (guestagent:343)
2020-01-08 15:03:34,525+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Guest's message heartbeat: {u'memory-stat': {u'swap_out': 0, u'majflt': 1, u'swap_usage': 0, u'mem_cached': 41792, u'mem_free': 2046252, u'mem_buffers': 0, u'swap_in': 0, u'swap_total': 2088956, u'pageflt': 169, u'mem_total': 2236884, u'mem_unused': 2004460}, u'free-ram': u'1998', u'apiVersion': 3} (guestagent:346)
2020-01-08 15:03:34,526+1000 DEBUG (vmchannels) [jsonrpc.Notification] Sending event {"params": {"ac7757ca-4fc8-4625-a91c-38f518064140": {"status": "Migration Destination", "username": "Unknown", "memUsage": "13", "guestFQDN": "", "memoryStats": {"swap_out": "0", "majflt": "1", "mem_cached": "41792", "mem_free": "2004460", "mem_buffers": "0", "swap_in": "0", "pageflt": "169", "mem_total": "2236884", "mem_unused": "2004460"}, "session": "Unknown", "netIfaces": [], "guestCPUCount": -1, "appsList": [], "guestIPs": "", "disksUsage": []}, "notify_time": 4296663730}, "jsonrpc": "2.0", "method": "|virt|VM_status|ac7757ca-4fc8-4625-a91c-38f518064140"} (__init__:181)
2020-01-08 15:03:34,527+1000 DEBUG (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') sent '{"__name__": "lifecycle-event", "type": "after_migration"}\n' (guestagent:343)

Without "Guest API version changed" message, which indicates its 3 -> 3 instead of 3 -> 0 -> 3.

I've noticed that guestAgentAPIVersion is always sent from the SRC to the DST in the VDSM metadata, but sometimes it is not present on the DST when it dumps the metadata after migrationCreate.
When it is not present, it seems to produce the 3 -> 0 -> 3 problem. For both examples above:

2020-01-08 14:50:47,261+1000 DEBUG (vm/ac7757ca) [virt.metadata.Descriptor] dumped metadata for ac7757ca-4fc8-4625-a91c-38f518064140: <?xml version='1.0' encoding='utf-8'?>
<vm>
    <clusterVersion>4.3</clusterVersion>
    <destroy_on_reboot type="bool">False</destroy_on_reboot>
    <launchPaused>false</launchPaused>
    <memGuaranteedSize type="int">1536</memGuaranteedSize>
    <minGuaranteedMemoryMb type="int">1536</minGuaranteedMemoryMb>
    <resumeBehavior>auto_resume</resumeBehavior>
    <startTime type="float">1578458970.56</startTime>
    <device mac_address="00:1a:4a:61:18:47">
        <network>rhevm</network>
        <custom>
            <queues>2</queues>
        </custom>
    </device>
    <device mac_address="00:1a:4a:61:18:51">
        <network>rhevm</network>
        <custom>
            <queues>2</queues>
        </custom>
    </device>
    <device mac_address="00:1a:4a:61:18:50">
        <network>rhevm</network>
        <custom>
            <queues>2</queues>
        </custom>
    </device>
    <device devtype="disk" name="sda">
        <domainID>f7738745-8c1e-497a-b78a-977119c4354c</domainID>
        <guestName>/dev/sda</guestName>
        <imageID>37eed169-5581-4252-a774-4a68fb22bc1e</imageID>
        <poolID>8922eadb-09a6-4a42-88ca-e6298e95b605</poolID>
        <volumeID>619e76f0-7688-45e1-9620-bedcbddbec4a</volumeID>
        <volumeChain>
            <volumeChainNode>
                <domainID>f7738745-8c1e-497a-b78a-977119c4354c</domainID>
                <imageID>37eed169-5581-4252-a774-4a68fb22bc1e</imageID>
                <leaseOffset type="int">0</leaseOffset>
                <leasePath>/rhev/data-center/mnt/10.64.24.26:_storage/f7738745-8c1e-497a-b78a-977119c4354c/images/37eed169-5581-4252-a774-4a68fb22bc1e/619e76f0-7688-45e1-9620-bedcbddbec4a.lease</leasePath>
                <path>/rhev/data-center/mnt/10.64.24.26:_storage/f7738745-8c1e-497a-b78a-977119c4354c/images/37eed169-5581-4252-a774-4a68fb22bc1e/619e76f0-7688-45e1-9620-bedcbddbec4a</path>
                <volumeID>619e76f0-7688-45e1-9620-bedcbddbec4a</volumeID>
            </volumeChainNode>
        </volumeChain>
    </device>
</vm>


2020-01-08 14:50:41,817+1000 DEBUG (jsonrpc/3) [api] FINISH migrationCreate response={'status': {'message': 'Done', 'code': 0}, 'migrationPort': 0, 'params': {..... u'guestAgentAPIVersion': 3,...

2020-01-08 15:02:36,204+1000 DEBUG (jsonrpc/5) [virt.metadata.Descriptor] parsing metadata for ac7757ca-4fc8-4625-a91c-38f518064140: <?xml version='1.0' encoding='utf-8'?>
<ovirt-vm:vm xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
    <ovirt-vm:clusterVersion>4.3</ovirt-vm:clusterVersion>
    <ovirt-vm:destroy_on_reboot type="bool">False</ovirt-vm:destroy_on_reboot>
    <ovirt-vm:guestAgentAPIVersion type="int">3</ovirt-vm:guestAgentAPIVersion>                    <-------- does not exist on the above
    <ovirt-vm:launchPaused>false</ovirt-vm:launchPaused>
    <ovirt-vm:memGuaranteedSize type="int">1536</ovirt-vm:memGuaranteedSize>
    <ovirt-vm:minGuaranteedMemoryMb type="int">1536</ovirt-vm:minGuaranteedMemoryMb>
    <ovirt-vm:resumeBehavior>auto_resume</ovirt-vm:resumeBehavior>
    <ovirt-vm:startTime type="float">1578459610.04</ovirt-vm:startTime>
    <ovirt-vm:device mac_address="00:1a:4a:61:18:47">
        <ovirt-vm:network>rhevm</ovirt-vm:network>
        <ovirt-vm:custom>
            <ovirt-vm:queues>2</ovirt-vm:queues>
        </ovirt-vm:custom>
    </ovirt-vm:device>
    <ovirt-vm:device mac_address="00:1a:4a:61:18:51">
        <ovirt-vm:network>rhevm</ovirt-vm:network>
        <ovirt-vm:custom>
            <ovirt-vm:queues>2</ovirt-vm:queues>
        </ovirt-vm:custom>
    </ovirt-vm:device>
    <ovirt-vm:device mac_address="00:1a:4a:61:18:50">
        <ovirt-vm:network>rhevm</ovirt-vm:network>
        <ovirt-vm:custom>
            <ovirt-vm:queues>2</ovirt-vm:queues>
        </ovirt-vm:custom>
    </ovirt-vm:device>
    <ovirt-vm:device devtype="disk" name="sda">
        <ovirt-vm:domainID>f7738745-8c1e-497a-b78a-977119c4354c</ovirt-vm:domainID>
        <ovirt-vm:guestName>/dev/sda</ovirt-vm:guestName>
        <ovirt-vm:imageID>37eed169-5581-4252-a774-4a68fb22bc1e</ovirt-vm:imageID>
        <ovirt-vm:poolID>8922eadb-09a6-4a42-88ca-e6298e95b605</ovirt-vm:poolID>
        <ovirt-vm:volumeID>619e76f0-7688-45e1-9620-bedcbddbec4a</ovirt-vm:volumeID>
        <ovirt-vm:volumeChain>
            <ovirt-vm:volumeChainNode>
                <ovirt-vm:domainID>f7738745-8c1e-497a-b78a-977119c4354c</ovirt-vm:domainID>
                <ovirt-vm:imageID>37eed169-5581-4252-a774-4a68fb22bc1e</ovirt-vm:imageID>
                <ovirt-vm:leaseOffset type="int">0</ovirt-vm:leaseOffset>
                <ovirt-vm:leasePath>/rhev/data-center/mnt/10.64.24.26:_storage/f7738745-8c1e-497a-b78a-977119c4354c/images/37eed169-5581-4252-a774-4a68fb22bc1e/619e76f0-7688-45e1-9620-bedcbddbec4a.lease</ovirt-vm:leasePath>
                <ovirt-vm:path>/rhev/data-center/mnt/10.64.24.26:_storage/f7738745-8c1e-497a-b78a-977119c4354c/images/37eed169-5581-4252-a774-4a68fb22bc1e/619e76f0-7688-45e1-9620-bedcbddbec4a</ovirt-vm:path>
                <ovirt-vm:volumeID>619e76f0-7688-45e1-9620-bedcbddbec4a</ovirt-vm:volumeID>
            </ovirt-vm:volumeChainNode>
        </ovirt-vm:volumeChain>
    </ovirt-vm:device>
    <ovirt-vm:device devtype="disk" name="hdc" />
</ovirt-vm:vm>
 (metadata:689)

But as shown above, guestAgentAPIVersion is present in the migrationCreate() parameters.
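
As a quick sanity check, here is a standalone snippet (standard library only; the namespace URI is taken from the second dump) that reports whether a metadata dump contains guestAgentAPIVersion:

import xml.etree.ElementTree as ET

OVIRT_VM_NS = 'http://ovirt.org/vm/1.0'


def guest_agent_api_version(metadata_xml):
    """Return the stored version as an int, or None if the element is missing."""
    root = ET.fromstring(metadata_xml)
    # The "parsing metadata" dump uses the ovirt-vm namespace, while the
    # "dumped metadata" dump uses plain tags; check both forms.
    elem = root.find('{%s}guestAgentAPIVersion' % OVIRT_VM_NS)
    if elem is None:
        elem = root.find('guestAgentAPIVersion')
    return int(elem.text) if elem is not None else None

Running this over the XML portions of the two dumps above returns None for the first (failing) dump and 3 for the second.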

Version-Release number of selected component (if applicable):
vdsm-4.30.38-1.el7ev.x86_64

How reproducible:
Sometimes

Steps to Reproduce:
1. Migrate VM

Actual results:
* after_migration lifecycle event not sent
* VM briefly shows warning in the Admin Portal about Guest Agent not running

(Originally by Germano Veit Michel)

Comment 2 RHV bug bot 2020-01-28 17:22:22 UTC
Hi Germano, it's weird. If I understand it right, guestAgentAPIVersion is always present in params, but not always in metadata. But the value is set to be the same in both params and metadata in Vm.migration_parameters(). We have a bug in 4.4 where the value is not set in metadata, but then it's also not set in params.
In 4.3 we always get the API version from metadata. If it is missing, there is a race: if after_migration is called before the API version is retrieved, it doesn't work.
Now the question is why the value is missing in metadata after migration. Can you see it in the source log, in the domain XML sent to the destination? Can you see a "dumped metadata" debug log message in vdsm.log on the source just before a migration demonstrating the failure? We have libvirt hooks transforming the domain XML on migrations, but they are completely unrelated and should be harmless. A libvirt bug is also unlikely; why would it eat just guestAgentAPIVersion?
I can't see any obvious point of failure at the moment; we should look into the source vdsm.log to see whether we can get any hint from it.
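
To make that race concrete, a toy sequence (purely illustrative, not vdsm code; the function and variable names are made up):

def destination_init(version_from_metadata):
    """Return the log-like events seen on the destination for one migration."""
    events = []
    # _vmDependentInit: the effective version comes from metadata, 0 if absent.
    effective = version_from_metadata or 0
    # after_migration is attempted immediately; it is dropped while the
    # version is still 0 (see the GuestAgentUnsupportedMessage line above).
    events.append('after_migration sent' if effective >= 1
                  else 'after_migration dropped')
    # Only later does the channel heartbeat report apiVersion 3, which is
    # too late for the already-dropped event.
    if effective == 0:
        events.append('Guest API version changed from 0 to 3')
    return events


print(destination_init(3))      # working case: event sent
print(destination_init(None))   # failing case: event dropped, version fixed later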

(Originally by Milan Zamazal)

Comment 3 RHV bug bot 2020-01-28 17:22:24 UTC
Hi Milan,

It is indeed very weird. I could not find any clue on where this is coming from. It seems like the guestAgentAPIVersion is simply wiped on the destination host at some point.
I don't have the logs from that host anymore; I've wiped it. But I can reproduce this again, with debug also enabled on the source.

Besides virt debug on both src and dst, do you have any other ideas for logs to enable, or even add to the code, to try to clarify this?

Thanks

(Originally by Germano Veit Michel)

Comment 4 RHV bug bot 2020-01-28 17:22:26 UTC
(In reply to Germano Veit Michel from comment #3)

> Besides virt debug on both src and dst, do you have any other ideas for logs
> to enable, or even add to the code, to try to clarify this?

Enabling libvirt debug log may be useful, in case Vdsm logs don't provide enough info.

(Originally by Milan Zamazal)

Comment 5 RHV bug bot 2020-01-28 17:22:28 UTC
I can reproduce it now. The libvirt log shows a correct and completed metadata update, but then the inserted item is apparently missing from the domain XML sent to the destination. I suspect a timing issue with change propagation in libvirt; I will look into it further.

(Originally by Milan Zamazal)

Comment 6 RHV bug bot 2020-01-28 17:22:30 UTC
Thanks. I've reproduced it again as well; I will attach logs anyway, just in case.

2020-01-21 13:29:31,362+1000 DEBUG (vm/59244f31) [virt.vm] (vmId='59244f31-f6a2-474d-b52c-afe2a1e5e978') send_lifecycle_event after_migration called (guestagent:565)

Slightly after:

2020-01-21 13:29:34,782+1000 INFO  (vmchannels) [virt.vm] (vmId='59244f31-f6a2-474d-b52c-afe2a1e5e978') Guest API version changed from 0 to 3 (guestagent:292)

(Originally by Germano Veit Michel)

Comment 8 RHV bug bot 2020-01-28 17:22:34 UTC
I think the problem is that the initial destination metadata is not read from the libvirt domain XML but from the _srcDomXML parameter provided in the Vdsm migration creation call. But the guest agent API version is set on the source only after the _srcDomXML value is retrieved, so it's missing from _srcDomXML. I may still be missing something, because this by itself doesn't explain why the problem doesn't appear on every migration. But up-to-date metadata should be present in _srcDomXML anyway, and that should be fixed.
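
Schematically, the fix amounts to the following ordering change (a self-contained sketch with made-up names; the real change is the small reordering in the gerrit patches referenced in this bug):

class FakeVm:
    """Stand-in for the migrating VM: a metadata dict plus the known agent version."""

    def __init__(self, guest_agent_api_version):
        self.guest_agent_api_version = guest_agent_api_version
        self.metadata = {}  # what ends up serialized in the domain XML

    def update_guest_agent_metadata(self):
        self.metadata['guestAgentAPIVersion'] = self.guest_agent_api_version

    def snapshot_domain_xml_metadata(self):
        # Stands in for taking _srcDomXML, which seeds the destination's metadata.
        return dict(self.metadata)


def migration_parameters(vm):
    # The point of the fix: write the current guest agent API version into the
    # metadata *before* snapshotting the domain XML, so _srcDomXML already
    # carries it when the destination reads it.
    vm.update_guest_agent_metadata()
    return {
        '_srcDomXML': vm.snapshot_domain_xml_metadata(),
        'guestAgentAPIVersion': vm.guest_agent_api_version,
    }


params = migration_parameters(FakeVm(guest_agent_api_version=3))
assert params['_srcDomXML']['guestAgentAPIVersion'] == 3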

Germano, would you have a chance to check whether https://gerrit.ovirt.org/106483 fixes the problem with after_migration events?

(Originally by Milan Zamazal)

Comment 9 RHV bug bot 2020-01-28 17:22:36 UTC
Hi Milan, sure I can. Just did not have any chance to work on this today. Please hold a bit, hopefully tomorrow.

(Originally by Germano Veit Michel)

Comment 10 RHV bug bot 2020-01-28 17:22:38 UTC
(In reply to Milan Zamazal from comment #8)
> Germano, would you have a chance to check whether
> https://gerrit.ovirt.org/106483 fixes the problem with after_migration
> events?

It seems to work for me. Did 4 migrations in a row, and it worked 4 times:

# grep after /var/log/ovirt-guest-agent/ovirt-guest-agent.log 
Dummy-1::INFO::2020-01-24 13:14:07,029::hooks::64::Hooks::Hook(after_migration) "/etc/ovirt-guest-agent/hooks.d/after_migration/55_flush-caches" executed
Dummy-1::INFO::2020-01-24 13:19:58,731::hooks::64::Hooks::Hook(after_migration) "/etc/ovirt-guest-agent/hooks.d/after_migration/55_flush-caches" executed
Dummy-1::INFO::2020-01-24 13:21:23,520::hooks::64::Hooks::Hook(after_migration) "/etc/ovirt-guest-agent/hooks.d/after_migration/55_flush-caches" executed
Dummy-1::INFO::2020-01-24 13:22:41,195::hooks::64::Hooks::Hook(after_migration) "/etc/ovirt-guest-agent/hooks.d/after_migration/55_flush-caches" executed

And the only Guest API version change was when the Guest started; nothing on migrations:
2020-01-24 13:13:40,653+1000 INFO  (vmchannels) [virt.vm] (vmId='59244f31-f6a2-474d-b52c-afe2a1e5e978') Guest API version changed from 0 to 3 (guestagent:292)

(Originally by Germano Veit Michel)

Comment 11 RHV bug bot 2020-01-28 17:22:39 UTC
Hi Germano, thank you for testing. I posted a patch fixing the bug in master.

(Originally by Milan Zamazal)

Comment 12 RHV bug bot 2020-01-28 17:22:41 UTC
You are welcome, and thanks for the fix!

Is this on the radar for 4.3.z after the fix is accepted? There is a customer ticket attached...

(Originally by Germano Veit Michel)

Comment 13 RHV bug bot 2020-01-28 17:22:43 UTC
The patch is basically one line, so it's definitely a suitable target.

(Originally by Ryan Barry)

Comment 14 RHV bug bot 2020-02-21 11:02:06 UTC
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason:

[Found non-acked flags: '{'rhevm-4.3.z': '?'}', ]

For more info please contact: rhv-devops

Comment 17 Milan Zamazal 2020-03-09 09:01:09 UTC
To clarify the scope of this issue: the after_migration hook functionality is available only when using ovirt-guest-agent. That means it can work only with el7 guests; it doesn't work with el8 guests, since there is no ovirt-guest-agent on el8 and, AFAIK, there is no substitute for this kind of hook functionality.

Comment 18 Beni Pelled 2020-03-09 11:57:54 UTC
Verified on:
- RHV 4.3.9.1-0.1.el7
- vdsm-4.30.42-1.el7ev.x86_64 (host)
- libvirt-4.5.0-33.el7.x86_64 (host)

Verification steps:
1. Migrate VM.

Result:
after_migration is sent to the guest:

# tail -f /var/log/vdsm/vdsm.log | grep 'after_migration'
2020-03-09 13:52:38,583+0200 DEBUG (vm/be5d4f39) [virt.vm] (vmId='be5d4f39-cb05-4a4d-9c6e-9856ff9c3e61') send_lifecycle_event after_migration called (guestagent:565)
2020-03-09 13:52:38,932+0200 DEBUG (vm/be5d4f39) [virt.vm] (vmId='be5d4f39-cb05-4a4d-9c6e-9856ff9c3e61') sent '{"__name__": "lifecycle-event", "type": "after_migration"}\n' (guestagent:342)

Comment 20 errata-xmlrpc 2020-04-02 16:32:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:1307

