Bug 1788783 - after_migration is not sent to the guest after migration
Summary: after_migration is not sent to the guest after migration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.3.7
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ovirt-4.4.0
Assignee: Milan Zamazal
QA Contact: Beni Pelled
URL:
Whiteboard:
Depends On:
Blocks: 1795726
 
Reported: 2020-01-08 05:32 UTC by Germano Veit Michel
Modified: 2020-08-04 13:28 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, when migrating a virtual machine, information about the running guest agent was not always passed to the destination host. In these cases, the migrated virtual machine on the destination host did not receive an after_migration life cycle event notification. This update fixes this issue. The after_migration notification works as expected now.
Clone Of:
: 1795726
Environment:
Last Closed: 2020-08-04 13:27:53 UTC
oVirt Team: Virt
Target Upstream Version:
lsvaty: testing_plan_complete-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 4781491 0 None None None 2020-02-03 05:31:24 UTC
Red Hat Product Errata RHEA-2020:3246 0 None None None 2020-08-04 13:28:24 UTC
oVirt gerrit 106556 0 master MERGED virt: Update guest agent API version before making migration params 2020-08-02 05:50:21 UTC

Description Germano Veit Michel 2020-01-08 05:32:14 UTC
Description of problem:

Sometimes the after_migration lifecycle event does not reach the Guest.

It looks like this is due to GuestAPIVersion sometimes going 3 -> 0 -> 3 instead of staying at 3. The event is triggered by VDSM (in _vmDependentInit) while the version is 0, so it is not sent to the Guest.
It hits this instead of sending to the Guest: https://github.com/oVirt/vdsm/blob/ovirt-4.3/lib/vdsm/virt/guestagent.py#L332
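A minimal sketch of that gate (class and function names are approximations for illustration; see the linked guestagent.py for the real code): a lifecycle event requires guest agent API version 3, and during the race the effective version on the destination is still 0, so the check raises instead of sending.

```python
# Hedged sketch of the guest agent message version gate. The required-version
# table and names are assumptions, not the actual VDSM code.

class GuestAgentUnsupportedMessage(Exception):
    """Raised when a message needs a newer guest agent API than is known."""
    def __init__(self, message, required, current):
        # Produces the "GuestAgentUnsupportedMessage: 3 > 0" form in the log.
        super().__init__('%s: %d > %d' % (message, required, current))
        self.required = required
        self.current = current

# Assumed mapping: lifecycle events need API version 3.
_MESSAGE_MIN_VERSION = {'lifecycle-event': 3}

def check_message_version(message, effective_api_version):
    required = _MESSAGE_MIN_VERSION.get(message, 0)
    if required > effective_api_version:
        # During the race the effective version is still 0, so the
        # after_migration lifecycle event is dropped instead of being sent.
        raise GuestAgentUnsupportedMessage(message, required,
                                           effective_api_version)
```

With effective version 0 the event is swallowed; once the guest heartbeat bumps the version back to 3, the same call would succeed.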

Added some debug to confirm, and the Guest API Version is 0 in VDSM at that point, so the event is not sent:

2020-01-08 14:50:47,285+1000 INFO  (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Starting connection (guestagent:256)
2020-01-08 14:50:47,286+1000 DEBUG (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') send_lifecycle_event after_migration called (guestagent:566)
2020-01-08 14:50:47,286+1000 DEBUG (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') GuestAgentUnsupportedMessage: 3 > 0 (guestagent:333)
2020-01-08 14:50:47,287+1000 INFO  (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') CPU running: domain initialization (vm:6062)

Shortly after GuestAPIVersion goes to 3.

2020-01-08 14:50:47,303+1000 INFO  (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') End of migration (vm:4088)
2020-01-08 14:50:47,649+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Attempting connection to /var/lib/libvirt/qemu/channels/ac7757ca-4fc8-4625-a91c-38f518064140.ovirt-guest-agent.0 (guestagent:310)
2020-01-08 14:50:47,649+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Connected to /var/lib/libvirt/qemu/channels/ac7757ca-4fc8-4625-a91c-38f518064140.ovirt-guest-agent.0 (guestagent:314)
2020-01-08 14:50:47,649+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') sent '{"__name__": "refresh", "apiVersion": 3}\n' (guestagent:343)
2020-01-08 14:50:47,957+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Guest's message heartbeat: {u'memory-stat': {u'swap_out': 0, u'majflt': 2, u'swap_usage': 0, u'mem_cached': 43572, u'mem_free': 2045264, u'mem_buffers': 0, u'swap_in': 0, u'swap_total': 2088956, u'pageflt': 83, u'mem_total': 2236884, u'mem_unused': 2001692}, u'free-ram': u'1997', u'apiVersion': 3} (guestagent:346)
2020-01-08 14:50:47,957+1000 INFO  (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Guest API version changed from 0 to 3 (guestagent:292)  <======= If API Version is sent from SRC to DST, why do we see this?

But when it works during the migration, it looks like this:

2020-01-08 15:03:33,803+1000 INFO  (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Starting connection (guestagent:256)
2020-01-08 15:03:33,804+1000 DEBUG (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') send_lifecycle_event after_migration called (guestagent:566)
2020-01-08 15:03:34,524+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Attempting connection to /var/lib/libvirt/qemu/channels/ac7757ca-4fc8-4625-a91c-38f518064140.ovirt-guest-agent.0 (guestagent:310)
2020-01-08 15:03:34,524+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Connected to /var/lib/libvirt/qemu/channels/ac7757ca-4fc8-4625-a91c-38f518064140.ovirt-guest-agent.0 (guestagent:314)
2020-01-08 15:03:34,524+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') sent '{"__name__": "refresh", "apiVersion": 3}\n' (guestagent:343)
2020-01-08 15:03:34,525+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Guest's message heartbeat: {u'memory-stat': {u'swap_out': 0, u'majflt': 1, u'swap_usage': 0, u'mem_cached': 41792, u'mem_free': 2046252, u'mem_buffers': 0, u'swap_in': 0, u'swap_total': 2088956, u'pageflt': 169, u'mem_total': 2236884, u'mem_unused': 2004460}, u'free-ram': u'1998', u'apiVersion': 3} (guestagent:346)
2020-01-08 15:03:34,526+1000 DEBUG (vmchannels) [jsonrpc.Notification] Sending event {"params": {"ac7757ca-4fc8-4625-a91c-38f518064140": {"status": "Migration Destination", "username": "Unknown", "memUsage": "13", "guestFQDN": "", "memoryStats": {"swap_out": "0", "majflt": "1", "mem_cached": "41792", "mem_free": "2004460", "mem_buffers": "0", "swap_in": "0", "pageflt": "169", "mem_total": "2236884", "mem_unused": "2004460"}, "session": "Unknown", "netIfaces": [], "guestCPUCount": -1, "appsList": [], "guestIPs": "", "disksUsage": []}, "notify_time": 4296663730}, "jsonrpc": "2.0", "method": "|virt|VM_status|ac7757ca-4fc8-4625-a91c-38f518064140"} (__init__:181)
2020-01-08 15:03:34,527+1000 DEBUG (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') sent '{"__name__": "lifecycle-event", "type": "after_migration"}\n' (guestagent:343)

There is no "Guest API version changed" message, which indicates the version goes 3 -> 3 instead of 3 -> 0 -> 3.

I've noticed that guestAgentAPIVersion is always sent in the SRC-to-DST VDSM metadata, but sometimes it is not present on the DST when it dumps the metadata after migrationCreate.
When it is not present, it seems to produce the 3 -> 0 -> 3 problem. For both examples above:

2020-01-08 14:50:47,261+1000 DEBUG (vm/ac7757ca) [virt.metadata.Descriptor] dumped metadata for ac7757ca-4fc8-4625-a91c-38f518064140: <?xml version='1.0' encoding='utf-8'?>
<vm>
    <clusterVersion>4.3</clusterVersion>
    <destroy_on_reboot type="bool">False</destroy_on_reboot>
    <launchPaused>false</launchPaused>
    <memGuaranteedSize type="int">1536</memGuaranteedSize>
    <minGuaranteedMemoryMb type="int">1536</minGuaranteedMemoryMb>
    <resumeBehavior>auto_resume</resumeBehavior>
    <startTime type="float">1578458970.56</startTime>
    <device mac_address="00:1a:4a:61:18:47">
        <network>rhevm</network>
        <custom>
            <queues>2</queues>
        </custom>
    </device>
    <device mac_address="00:1a:4a:61:18:51">
        <network>rhevm</network>
        <custom>
            <queues>2</queues>
        </custom>
    </device>
    <device mac_address="00:1a:4a:61:18:50">
        <network>rhevm</network>
        <custom>
            <queues>2</queues>
        </custom>
    </device>
    <device devtype="disk" name="sda">
        <domainID>f7738745-8c1e-497a-b78a-977119c4354c</domainID>
        <guestName>/dev/sda</guestName>
        <imageID>37eed169-5581-4252-a774-4a68fb22bc1e</imageID>
        <poolID>8922eadb-09a6-4a42-88ca-e6298e95b605</poolID>
        <volumeID>619e76f0-7688-45e1-9620-bedcbddbec4a</volumeID>
        <volumeChain>
            <volumeChainNode>
                <domainID>f7738745-8c1e-497a-b78a-977119c4354c</domainID>
                <imageID>37eed169-5581-4252-a774-4a68fb22bc1e</imageID>
                <leaseOffset type="int">0</leaseOffset>
                <leasePath>/rhev/data-center/mnt/10.64.24.26:_storage/f7738745-8c1e-497a-b78a-977119c4354c/images/37eed169-5581-4252-a774-4a68fb22bc1e/619e76f0-7688-45e1-9620-bedcbddbec4a.lease</leasePath>
                <path>/rhev/data-center/mnt/10.64.24.26:_storage/f7738745-8c1e-497a-b78a-977119c4354c/images/37eed169-5581-4252-a774-4a68fb22bc1e/619e76f0-7688-45e1-9620-bedcbddbec4a</path>
                <volumeID>619e76f0-7688-45e1-9620-bedcbddbec4a</volumeID>
            </volumeChainNode>
        </volumeChain>
    </device>
</vm>


2020-01-08 14:50:41,817+1000 DEBUG (jsonrpc/3) [api] FINISH migrationCreate response={'status': {'message': 'Done', 'code': 0}, 'migrationPort': 0, 'params': {..... u'guestAgentAPIVersion': 3,...

2020-01-08 15:02:36,204+1000 DEBUG (jsonrpc/5) [virt.metadata.Descriptor] parsing metadata for ac7757ca-4fc8-4625-a91c-38f518064140: <?xml version='1.0' encoding='utf-8'?>
<ovirt-vm:vm xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
    <ovirt-vm:clusterVersion>4.3</ovirt-vm:clusterVersion>
    <ovirt-vm:destroy_on_reboot type="bool">False</ovirt-vm:destroy_on_reboot>
    <ovirt-vm:guestAgentAPIVersion type="int">3</ovirt-vm:guestAgentAPIVersion>                    <-------- not present in the first dump above
    <ovirt-vm:launchPaused>false</ovirt-vm:launchPaused>
    <ovirt-vm:memGuaranteedSize type="int">1536</ovirt-vm:memGuaranteedSize>
    <ovirt-vm:minGuaranteedMemoryMb type="int">1536</ovirt-vm:minGuaranteedMemoryMb>
    <ovirt-vm:resumeBehavior>auto_resume</ovirt-vm:resumeBehavior>
    <ovirt-vm:startTime type="float">1578459610.04</ovirt-vm:startTime>
    <ovirt-vm:device mac_address="00:1a:4a:61:18:47">
        <ovirt-vm:network>rhevm</ovirt-vm:network>
        <ovirt-vm:custom>
            <ovirt-vm:queues>2</ovirt-vm:queues>
        </ovirt-vm:custom>
    </ovirt-vm:device>
    <ovirt-vm:device mac_address="00:1a:4a:61:18:51">
        <ovirt-vm:network>rhevm</ovirt-vm:network>
        <ovirt-vm:custom>
            <ovirt-vm:queues>2</ovirt-vm:queues>
        </ovirt-vm:custom>
    </ovirt-vm:device>
    <ovirt-vm:device mac_address="00:1a:4a:61:18:50">
        <ovirt-vm:network>rhevm</ovirt-vm:network>
        <ovirt-vm:custom>
            <ovirt-vm:queues>2</ovirt-vm:queues>
        </ovirt-vm:custom>
    </ovirt-vm:device>
    <ovirt-vm:device devtype="disk" name="sda">
        <ovirt-vm:domainID>f7738745-8c1e-497a-b78a-977119c4354c</ovirt-vm:domainID>
        <ovirt-vm:guestName>/dev/sda</ovirt-vm:guestName>
        <ovirt-vm:imageID>37eed169-5581-4252-a774-4a68fb22bc1e</ovirt-vm:imageID>
        <ovirt-vm:poolID>8922eadb-09a6-4a42-88ca-e6298e95b605</ovirt-vm:poolID>
        <ovirt-vm:volumeID>619e76f0-7688-45e1-9620-bedcbddbec4a</ovirt-vm:volumeID>
        <ovirt-vm:volumeChain>
            <ovirt-vm:volumeChainNode>
                <ovirt-vm:domainID>f7738745-8c1e-497a-b78a-977119c4354c</ovirt-vm:domainID>
                <ovirt-vm:imageID>37eed169-5581-4252-a774-4a68fb22bc1e</ovirt-vm:imageID>
                <ovirt-vm:leaseOffset type="int">0</ovirt-vm:leaseOffset>
                <ovirt-vm:leasePath>/rhev/data-center/mnt/10.64.24.26:_storage/f7738745-8c1e-497a-b78a-977119c4354c/images/37eed169-5581-4252-a774-4a68fb22bc1e/619e76f0-7688-45e1-9620-bedcbddbec4a.lease</ovirt-vm:leasePath>
                <ovirt-vm:path>/rhev/data-center/mnt/10.64.24.26:_storage/f7738745-8c1e-497a-b78a-977119c4354c/images/37eed169-5581-4252-a774-4a68fb22bc1e/619e76f0-7688-45e1-9620-bedcbddbec4a</ovirt-vm:path>
                <ovirt-vm:volumeID>619e76f0-7688-45e1-9620-bedcbddbec4a</ovirt-vm:volumeID>
            </ovirt-vm:volumeChainNode>
        </ovirt-vm:volumeChain>
    </ovirt-vm:device>
    <ovirt-vm:device devtype="disk" name="hdc" />
</ovirt-vm:vm>
 (metadata:689)

But as shown above, guestAgentAPIVersion is present in the migrationCreate() parameters.
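The consequence of the missing element can be sketched like this (a minimal illustration using the namespace and element names from the dumps above, not the actual VDSM parser): when <ovirt-vm:guestAgentAPIVersion> is absent from the parsed metadata, the version silently degrades to 0.

```python
# Sketch: why a missing <guestAgentAPIVersion> element yields version 0.
import xml.etree.ElementTree as ET

NS = {'ovirt-vm': 'http://ovirt.org/vm/1.0'}

def guest_api_version(metadata_xml):
    root = ET.fromstring(metadata_xml)
    elem = root.find('ovirt-vm:guestAgentAPIVersion', NS)
    # When the element is absent (as in the first dump), fall back to 0 --
    # the value the destination then uses until the next guest heartbeat
    # triggers the "Guest API version changed from 0 to 3" message.
    return int(elem.text) if elem is not None else 0
```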

Version-Release number of selected component (if applicable):
vdsm-4.30.38-1.el7ev.x86_64

How reproducible:
Sometimes

Steps to Reproduce:
1. Migrate VM

Actual results:
* after_migration lifecycle event not sent
* VM briefly shows warning in the Admin Portal about Guest Agent not running

Comment 2 Milan Zamazal 2020-01-17 16:32:12 UTC
Hi Germano, it's weird. If I understand it right, guestAgentAPIVersion is always present in params, but not always in metadata. But the value is set to be the same in both params and metadata in Vm.migration_parameters(). We have a bug in 4.4 where the value is not set in metadata, but then it's also not set in params.
In 4.3 we always get the API version from metadata. If it is missing, there is a race: if after_migration is called before the API version is retrieved, it doesn't work.
Now the question is why the value is missing in metadata after migration. Can you see it in the source log, in the domain XML sent to the destination? Can you see a "dumped metadata" debug log message in vdsm.log on the source just before a migration demonstrating the failure? We have libvirt hooks transforming the domain XML on migrations, but they are completely unrelated and should be harmless. A libvirt bug is also unlikely; why would it eat just guestAgentAPIVersion?
I can't see any obvious point of failure at the moment; we should look into the source vdsm.log to see whether we can get any hint from it.

Comment 3 Germano Veit Michel 2020-01-20 06:21:19 UTC
Hi Milan,

It is indeed very weird. I could not find any clue as to where this is coming from. It seems like the guestAgentAPIVersion is simply wiped on the destination host at some point.
I don't have the logs from that host anymore; I've wiped them. But I can reproduce this again, with debug also enabled on the source.

Besides virt debug on both src and dst, do you have any other idea of log to enable or even add to the code to try to clarify this?

Thanks

Comment 4 Milan Zamazal 2020-01-20 08:54:54 UTC
(In reply to Germano Veit Michel from comment #3)

> Besides virt debug on both src and dst, do you have any other idea of log to
> enable or even add to the code to try to clarify this?

Enabling libvirt debug log may be useful, in case Vdsm logs don't provide enough info.

Comment 5 Milan Zamazal 2020-01-20 16:13:31 UTC
I can reproduce it now. The libvirt log shows a correct and completed metadata update, but the inserted item is then apparently missing from the domain XML sent to the destination. I suspect a timing issue with change propagation in libvirt; I will look into it further.

Comment 6 Germano Veit Michel 2020-01-21 03:37:39 UTC
Thanks. I've reproduced it again as well, will attach logs anyway just in case.

2020-01-21 13:29:31,362+1000 DEBUG (vm/59244f31) [virt.vm] (vmId='59244f31-f6a2-474d-b52c-afe2a1e5e978') send_lifecycle_event after_migration called (guestagent:565)

Slightly after:

2020-01-21 13:29:34,782+1000 INFO  (vmchannels) [virt.vm] (vmId='59244f31-f6a2-474d-b52c-afe2a1e5e978') Guest API version changed from 0 to 3 (guestagent:292)

Comment 8 Milan Zamazal 2020-01-22 17:07:06 UTC
I think the problem is that the initial destination metadata is not read from the libvirt domain XML but from the _srcDomXML parameter provided in the Vdsm migration creation call. But the guest agent API version is set on the source only after the _srcDomXML value is retrieved, so it's missing from _srcDomXML. I may still be missing something, because this doesn't explain why the problem doesn't appear on every migration. But up-to-date metadata should be present in _srcDomXML anyway, and that should be fixed.
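The ordering fix can be sketched as follows (the class and method names here are made up for illustration; the real change is the gerrit patch "virt: Update guest agent API version before making migration params" linked above): write the guest agent API version into the metadata *before* snapshotting the domain XML that becomes _srcDomXML.

```python
# Illustrative sketch of the ordering fix, using a fake domain object that
# records the order of operations. Not the actual VDSM implementation.

class FakeDomain:
    """Stand-in for a libvirt domain, recording call order."""
    def __init__(self):
        self.calls = []
        self.metadata = {}

    def set_metadata(self, key, value):
        self.calls.append('set_metadata')
        self.metadata[key] = value

    def xml_desc(self):
        self.calls.append('xml_desc')
        # The snapshot only contains metadata written before this call.
        return dict(self.metadata)

def migration_parameters(dom, guest_api_version):
    # Fixed order: update metadata first, then capture the XML the
    # destination will parse. With the buggy order (xml_desc first), the
    # destination sees no guestAgentAPIVersion and defaults to 0.
    dom.set_metadata('guestAgentAPIVersion', guest_api_version)
    src_dom_xml = dom.xml_desc()
    return {'_srcDomXML': src_dom_xml}
```

With the buggy order reversed, the destination's initial metadata would lack the version, reproducing the 3 -> 0 -> 3 sequence described above.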

Germano, would you have a chance to check whether https://gerrit.ovirt.org/106483 fixes the problem with after_migration events?

Comment 9 Germano Veit Michel 2020-01-23 05:03:04 UTC
Hi Milan, sure I can. Just did not have any chance to work on this today. Please hold a bit, hopefully tomorrow.

Comment 10 Germano Veit Michel 2020-01-24 03:25:27 UTC
(In reply to Milan Zamazal from comment #8)
> Germano, would you have a chance to check whether
> https://gerrit.ovirt.org/106483 fixes the problem with after_migration
> events?

It seems to work for me. Did 4 migrations in a row, and it worked 4 times:

# grep after /var/log/ovirt-guest-agent/ovirt-guest-agent.log 
Dummy-1::INFO::2020-01-24 13:14:07,029::hooks::64::Hooks::Hook(after_migration) "/etc/ovirt-guest-agent/hooks.d/after_migration/55_flush-caches" executed
Dummy-1::INFO::2020-01-24 13:19:58,731::hooks::64::Hooks::Hook(after_migration) "/etc/ovirt-guest-agent/hooks.d/after_migration/55_flush-caches" executed
Dummy-1::INFO::2020-01-24 13:21:23,520::hooks::64::Hooks::Hook(after_migration) "/etc/ovirt-guest-agent/hooks.d/after_migration/55_flush-caches" executed
Dummy-1::INFO::2020-01-24 13:22:41,195::hooks::64::Hooks::Hook(after_migration) "/etc/ovirt-guest-agent/hooks.d/after_migration/55_flush-caches" executed

And the only Guest API change was when the Guest started, nothing on migrations:
2020-01-24 13:13:40,653+1000 INFO  (vmchannels) [virt.vm] (vmId='59244f31-f6a2-474d-b52c-afe2a1e5e978') Guest API version changed from 0 to 3 (guestagent:292)

Comment 11 Milan Zamazal 2020-01-27 11:13:33 UTC
Hi Germano, thank you for testing. I posted a patch fixing the bug in master.

Comment 12 Germano Veit Michel 2020-01-28 00:17:12 UTC
You are welcome, and thanks for the fix!

Is this on the radar for 4.3.z after the fix is accepted? There is a customer ticket attached...

Comment 13 Ryan Barry 2020-01-28 05:47:33 UTC
The patch is basically one line, so it's definitely a suitable target.

Comment 15 Sandro Bonazzola 2020-03-13 10:38:15 UTC
this bug is targeting 4.4.1 and is in modified state. Can we retarget to 4.4.0 and move to QE?

Comment 16 Milan Zamazal 2020-03-13 10:48:45 UTC
(In reply to Sandro Bonazzola from comment #15)
> this bug is targeting 4.4.1 and is in modified state. Can we retarget to
> 4.4.0 and move to QE?

Yes.

Comment 17 Beni Pelled 2020-03-22 14:59:32 UTC
Verified on:
- RHV 4.4.0-0.26.master.el8ev
- vdsm-4.40.7-1.el8ev.x86_64 (host)
- libvirt-6.0.0-13.module+el8.2.0+6048+0fa476b4.x86_64 (host)

Verification steps:
1. Start a VM
2. Migrate the VM.

Result:
after_migration is sent to the guest:

# tail -f /var/log/vdsm/vdsm.log | grep 'after_migration'
2020-03-09 13:52:38,583+0200 DEBUG (vm/be5d4f39) [virt.vm] (vmId='be5d4f39-cb05-4a4d-9c6e-9856ff9c3e61') send_lifecycle_event after_migration called (guestagent:565)
2020-03-09 13:52:38,932+0200 DEBUG (vm/be5d4f39) [virt.vm] (vmId='be5d4f39-cb05-4a4d-9c6e-9856ff9c3e61') sent '{"__name__": "lifecycle-event", "type": "after_migration"}\n' (guestagent:342)

(DEBUG logging must be enabled to see these lines.)

Comment 26 errata-xmlrpc 2020-08-04 13:27:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) 4.4), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:3246

