+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1788783 +++

======================================================================

Description of problem:

Sometimes the after_migration lifecycle event does not reach the Guest.

It looks like this is due to GuestAPIVersion sometimes going 3 -> 0 -> 3 instead of staying at 3, and the event being triggered by VDSM (on _vmDependentInit) while it's 0, so it's not sent to the Guest. It hits this instead of sending to the Guest:
https://github.com/oVirt/vdsm/blob/ovirt-4.3/lib/vdsm/virt/guestagent.py#L332

Added some debug to confirm; the Guest API Version is 0 in VDSM at that point, so the event is not sent:

2020-01-08 14:50:47,285+1000 INFO (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Starting connection (guestagent:256)
2020-01-08 14:50:47,286+1000 DEBUG (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') send_lifecycle_event after_migration called (guestagent:566)
2020-01-08 14:50:47,286+1000 DEBUG (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') GuestAgentUnsupportedMessage: 3 > 0 (guestagent:333)
2020-01-08 14:50:47,287+1000 INFO (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') CPU running: domain initialization (vm:6062)

Shortly after, GuestAPIVersion goes to 3.
2020-01-08 14:50:47,303+1000 INFO (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') End of migration (vm:4088)
2020-01-08 14:50:47,649+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Attempting connection to /var/lib/libvirt/qemu/channels/ac7757ca-4fc8-4625-a91c-38f518064140.ovirt-guest-agent.0 (guestagent:310)
2020-01-08 14:50:47,649+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Connected to /var/lib/libvirt/qemu/channels/ac7757ca-4fc8-4625-a91c-38f518064140.ovirt-guest-agent.0 (guestagent:314)
2020-01-08 14:50:47,649+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') sent '{"__name__": "refresh", "apiVersion": 3}\n' (guestagent:343)
2020-01-08 14:50:47,957+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Guest's message heartbeat: {u'memory-stat': {u'swap_out': 0, u'majflt': 2, u'swap_usage': 0, u'mem_cached': 43572, u'mem_free': 2045264, u'mem_buffers': 0, u'swap_in': 0, u'swap_total': 2088956, u'pageflt': 83, u'mem_total': 2236884, u'mem_unused': 2001692}, u'free-ram': u'1997', u'apiVersion': 3} (guestagent:346)
2020-01-08 14:50:47,957+1000 INFO (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Guest API version changed from 0 to 3 (guestagent:292)  <======= If the API Version is sent from SRC to DST, why do we see this?
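The drop above follows the usual pattern of gating messages on the negotiated guest agent API version. A minimal sketch of why the event is lost while the version is still 0 (the names here are hypothetical; the real check is in vdsm/virt/guestagent.py around the linked line):

```python
# Hypothetical reduction of the VDSM guest agent version gate: a message
# is only forwarded to the guest channel if the negotiated API version is
# high enough; otherwise it is silently dropped, which mirrors the
# "GuestAgentUnsupportedMessage: 3 > 0" debug line above.

_MESSAGE_MIN_VERSION = {'lifecycle-event': 3}

class GuestAgentStub:
    def __init__(self):
        self.guest_api_version = 0  # starts at 0 on the destination host
        self.sent = []

    def send_message(self, name, args):
        required = _MESSAGE_MIN_VERSION.get(name, 0)
        if required > self.guest_api_version:
            return False  # message dropped, not queued for later
        self.sent.append((name, args))
        return True

agent = GuestAgentStub()
# after_migration fires while the version is still 0 -> dropped
assert agent.send_message('lifecycle-event', {'type': 'after_migration'}) is False
# the first heartbeat later bumps the version to 3, but the event is already lost
agent.guest_api_version = 3
```

The key point the sketch illustrates is that the dropped event is never retried, so losing the race once loses the hook invocation entirely.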
But when it works during the migration, it's like this:

2020-01-08 15:03:33,803+1000 INFO (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Starting connection (guestagent:256)
2020-01-08 15:03:33,804+1000 DEBUG (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') send_lifecycle_event after_migration called (guestagent:566)
2020-01-08 15:03:34,524+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Attempting connection to /var/lib/libvirt/qemu/channels/ac7757ca-4fc8-4625-a91c-38f518064140.ovirt-guest-agent.0 (guestagent:310)
2020-01-08 15:03:34,524+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Connected to /var/lib/libvirt/qemu/channels/ac7757ca-4fc8-4625-a91c-38f518064140.ovirt-guest-agent.0 (guestagent:314)
2020-01-08 15:03:34,524+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') sent '{"__name__": "refresh", "apiVersion": 3}\n' (guestagent:343)
2020-01-08 15:03:34,525+1000 DEBUG (vmchannels) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') Guest's message heartbeat: {u'memory-stat': {u'swap_out': 0, u'majflt': 1, u'swap_usage': 0, u'mem_cached': 41792, u'mem_free': 2046252, u'mem_buffers': 0, u'swap_in': 0, u'swap_total': 2088956, u'pageflt': 169, u'mem_total': 2236884, u'mem_unused': 2004460}, u'free-ram': u'1998', u'apiVersion': 3} (guestagent:346)
2020-01-08 15:03:34,526+1000 DEBUG (vmchannels) [jsonrpc.Notification] Sending event {"params": {"ac7757ca-4fc8-4625-a91c-38f518064140": {"status": "Migration Destination", "username": "Unknown", "memUsage": "13", "guestFQDN": "", "memoryStats": {"swap_out": "0", "majflt": "1", "mem_cached": "41792", "mem_free": "2004460", "mem_buffers": "0", "swap_in": "0", "pageflt": "169", "mem_total": "2236884", "mem_unused": "2004460"}, "session": "Unknown", "netIfaces": [], "guestCPUCount": -1, "appsList": [], "guestIPs": "", "disksUsage": []}, "notify_time": 4296663730}, "jsonrpc": "2.0",
"method": "|virt|VM_status|ac7757ca-4fc8-4625-a91c-38f518064140"} (__init__:181)
2020-01-08 15:03:34,527+1000 DEBUG (vm/ac7757ca) [virt.vm] (vmId='ac7757ca-4fc8-4625-a91c-38f518064140') sent '{"__name__": "lifecycle-event", "type": "after_migration"}\n' (guestagent:343)

This time there is no "Guest API version changed" message, which indicates the version went 3 -> 3 instead of 3 -> 0 -> 3.

I've noticed that guestAgentAPIVersion is always sent from SRC to DST in the VDSM metadata, but sometimes it's not present on the DST when it dumps the metadata after migrationCreate. When it's not present, it seems to produce the 3 -> 0 -> 3 problem. For both examples above:

2020-01-08 14:50:47,261+1000 DEBUG (vm/ac7757ca) [virt.metadata.Descriptor] dumped metadata for ac7757ca-4fc8-4625-a91c-38f518064140:
<?xml version='1.0' encoding='utf-8'?>
<vm>
    <clusterVersion>4.3</clusterVersion>
    <destroy_on_reboot type="bool">False</destroy_on_reboot>
    <launchPaused>false</launchPaused>
    <memGuaranteedSize type="int">1536</memGuaranteedSize>
    <minGuaranteedMemoryMb type="int">1536</minGuaranteedMemoryMb>
    <resumeBehavior>auto_resume</resumeBehavior>
    <startTime type="float">1578458970.56</startTime>
    <device mac_address="00:1a:4a:61:18:47">
        <network>rhevm</network>
        <custom>
            <queues>2</queues>
        </custom>
    </device>
    <device mac_address="00:1a:4a:61:18:51">
        <network>rhevm</network>
        <custom>
            <queues>2</queues>
        </custom>
    </device>
    <device mac_address="00:1a:4a:61:18:50">
        <network>rhevm</network>
        <custom>
            <queues>2</queues>
        </custom>
    </device>
    <device devtype="disk" name="sda">
        <domainID>f7738745-8c1e-497a-b78a-977119c4354c</domainID>
        <guestName>/dev/sda</guestName>
        <imageID>37eed169-5581-4252-a774-4a68fb22bc1e</imageID>
        <poolID>8922eadb-09a6-4a42-88ca-e6298e95b605</poolID>
        <volumeID>619e76f0-7688-45e1-9620-bedcbddbec4a</volumeID>
        <volumeChain>
            <volumeChainNode>
                <domainID>f7738745-8c1e-497a-b78a-977119c4354c</domainID>
                <imageID>37eed169-5581-4252-a774-4a68fb22bc1e</imageID>
                <leaseOffset type="int">0</leaseOffset>
                <leasePath>/rhev/data-center/mnt/10.64.24.26:_storage/f7738745-8c1e-497a-b78a-977119c4354c/images/37eed169-5581-4252-a774-4a68fb22bc1e/619e76f0-7688-45e1-9620-bedcbddbec4a.lease</leasePath>
                <path>/rhev/data-center/mnt/10.64.24.26:_storage/f7738745-8c1e-497a-b78a-977119c4354c/images/37eed169-5581-4252-a774-4a68fb22bc1e/619e76f0-7688-45e1-9620-bedcbddbec4a</path>
                <volumeID>619e76f0-7688-45e1-9620-bedcbddbec4a</volumeID>
            </volumeChainNode>
        </volumeChain>
    </device>
</vm>

2020-01-08 14:50:41,817+1000 DEBUG (jsonrpc/3) [api] FINISH migrationCreate response={'status': {'message': 'Done', 'code': 0}, 'migrationPort': 0, 'params': {..... u'guestAgentAPIVersion': 3,...

2020-01-08 15:02:36,204+1000 DEBUG (jsonrpc/5) [virt.metadata.Descriptor] parsing metadata for ac7757ca-4fc8-4625-a91c-38f518064140:
<?xml version='1.0' encoding='utf-8'?>
<ovirt-vm:vm xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
    <ovirt-vm:clusterVersion>4.3</ovirt-vm:clusterVersion>
    <ovirt-vm:destroy_on_reboot type="bool">False</ovirt-vm:destroy_on_reboot>
    <ovirt-vm:guestAgentAPIVersion type="int">3</ovirt-vm:guestAgentAPIVersion>  <-------- does not exist on the above
    <ovirt-vm:launchPaused>false</ovirt-vm:launchPaused>
    <ovirt-vm:memGuaranteedSize type="int">1536</ovirt-vm:memGuaranteedSize>
    <ovirt-vm:minGuaranteedMemoryMb type="int">1536</ovirt-vm:minGuaranteedMemoryMb>
    <ovirt-vm:resumeBehavior>auto_resume</ovirt-vm:resumeBehavior>
    <ovirt-vm:startTime type="float">1578459610.04</ovirt-vm:startTime>
    <ovirt-vm:device mac_address="00:1a:4a:61:18:47">
        <ovirt-vm:network>rhevm</ovirt-vm:network>
        <ovirt-vm:custom>
            <ovirt-vm:queues>2</ovirt-vm:queues>
        </ovirt-vm:custom>
    </ovirt-vm:device>
    <ovirt-vm:device mac_address="00:1a:4a:61:18:51">
        <ovirt-vm:network>rhevm</ovirt-vm:network>
        <ovirt-vm:custom>
            <ovirt-vm:queues>2</ovirt-vm:queues>
        </ovirt-vm:custom>
    </ovirt-vm:device>
    <ovirt-vm:device mac_address="00:1a:4a:61:18:50">
        <ovirt-vm:network>rhevm</ovirt-vm:network>
        <ovirt-vm:custom>
            <ovirt-vm:queues>2</ovirt-vm:queues>
        </ovirt-vm:custom>
    </ovirt-vm:device>
    <ovirt-vm:device devtype="disk" name="sda">
        <ovirt-vm:domainID>f7738745-8c1e-497a-b78a-977119c4354c</ovirt-vm:domainID>
        <ovirt-vm:guestName>/dev/sda</ovirt-vm:guestName>
        <ovirt-vm:imageID>37eed169-5581-4252-a774-4a68fb22bc1e</ovirt-vm:imageID>
        <ovirt-vm:poolID>8922eadb-09a6-4a42-88ca-e6298e95b605</ovirt-vm:poolID>
        <ovirt-vm:volumeID>619e76f0-7688-45e1-9620-bedcbddbec4a</ovirt-vm:volumeID>
        <ovirt-vm:volumeChain>
            <ovirt-vm:volumeChainNode>
                <ovirt-vm:domainID>f7738745-8c1e-497a-b78a-977119c4354c</ovirt-vm:domainID>
                <ovirt-vm:imageID>37eed169-5581-4252-a774-4a68fb22bc1e</ovirt-vm:imageID>
                <ovirt-vm:leaseOffset type="int">0</ovirt-vm:leaseOffset>
                <ovirt-vm:leasePath>/rhev/data-center/mnt/10.64.24.26:_storage/f7738745-8c1e-497a-b78a-977119c4354c/images/37eed169-5581-4252-a774-4a68fb22bc1e/619e76f0-7688-45e1-9620-bedcbddbec4a.lease</ovirt-vm:leasePath>
                <ovirt-vm:path>/rhev/data-center/mnt/10.64.24.26:_storage/f7738745-8c1e-497a-b78a-977119c4354c/images/37eed169-5581-4252-a774-4a68fb22bc1e/619e76f0-7688-45e1-9620-bedcbddbec4a</ovirt-vm:path>
                <ovirt-vm:volumeID>619e76f0-7688-45e1-9620-bedcbddbec4a</ovirt-vm:volumeID>
            </ovirt-vm:volumeChainNode>
        </ovirt-vm:volumeChain>
    </ovirt-vm:device>
    <ovirt-vm:device devtype="disk" name="hdc" />
</ovirt-vm:vm> (metadata:689)

But as shown above, guestAgentAPIVersion is present in the migrationCreate() response.

Version-Release number of selected component (if applicable):
vdsm-4.30.38-1.el7ev.x86_64

How reproducible:
Sometimes

Steps to Reproduce:
1. Migrate VM

Actual results:
* after_migration lifecycle event not sent
* VM briefly shows a warning in the Admin Portal about the Guest Agent not running

(Originally by Germano Veit Michel)
Hi Germano, it's weird. If I understand it right, guestAgentAPIVersion is always present in params, but not always in metadata. But the value is set to be the same in both params and metadata in Vm.migration_parameters(). We have a bug in 4.4 where the value is not set in metadata, but then it's also not set in params.

We always get the API version from metadata in 4.3. If it is missing, then there is a race: if after_migration is called before the API version is retrieved, it doesn't work.

Now the question is why the value is missing in metadata after migration. Can you see it in the source log, in the domain XML sent to the destination? Can you see the "dumped metadata" debug log message in vdsm.log on the source just before a migration demonstrating the failure?

We have libvirt hooks transforming the domain XML on migrations, but they are completely unrelated and should be harmless. A libvirt bug is also unlikely; why would it eat just guestAgentAPIVersion? I can't see any obvious point of failure at the moment; we should look into the source vdsm.log to see whether we can get any hint from it.

(Originally by Milan Zamazal)
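The race described above hinges on the destination reading guestAgentAPIVersion from the domain metadata and falling back to 0 when the element is absent. A rough sketch of that parse, using the element and namespace names visible in the metadata dumps in the description (the helper name is hypothetical, not VDSM's actual API):

```python
# Sketch: extract guestAgentAPIVersion from ovirt-vm domain metadata,
# defaulting to 0 when the element is missing. The missing-element case
# is what opens the window for the 3 -> 0 -> 3 transition: the version
# stays 0 until the first guest heartbeat arrives.
import xml.etree.ElementTree as ET

_OVIRT_VM_NS = 'http://ovirt.org/vm/1.0'

def guest_api_version(metadata_xml):
    root = ET.fromstring(metadata_xml)
    elem = root.find('{%s}guestAgentAPIVersion' % _OVIRT_VM_NS)
    return int(elem.text) if elem is not None else 0

with_version = (
    '<ovirt-vm:vm xmlns:ovirt-vm="http://ovirt.org/vm/1.0">'
    '<ovirt-vm:guestAgentAPIVersion type="int">3</ovirt-vm:guestAgentAPIVersion>'
    '</ovirt-vm:vm>'
)
without_version = '<ovirt-vm:vm xmlns:ovirt-vm="http://ovirt.org/vm/1.0"/>'

assert guest_api_version(with_version) == 3
assert guest_api_version(without_version) == 0  # race window opens here
```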
Hi Milan,

It is indeed very weird. I could not find any clue as to where this is coming from. It seems like guestAgentAPIVersion is simply wiped on the destination host at some point.

I don't have the logs from that host anymore; I've wiped it. But I can reproduce this again, with debug also on the source. Besides virt debug on both src and dst, do you have any other idea of logging to enable, or even add to the code, to try to clarify this?

Thanks

(Originally by Germano Veit Michel)
(In reply to Germano Veit Michel from comment #3) > Besides virt debug on both src and dst, do you have any other idea of log to > enable or even add to the code to try to clarify this? Enabling libvirt debug log may be useful, in case Vdsm logs don't provide enough info. (Originally by Milan Zamazal)
I can reproduce it now. libvirt log shows a correct and completed metadata update but then the inserted item is apparently missing in the domain XML sent to destination. I suspect some timing issues with change propagation in libvirt, will look into it further. (Originally by Milan Zamazal)
Thanks. I've reproduced it again as well; will attach logs anyway just in case.

2020-01-21 13:29:31,362+1000 DEBUG (vm/59244f31) [virt.vm] (vmId='59244f31-f6a2-474d-b52c-afe2a1e5e978') send_lifecycle_event after_migration called (guestagent:565)

Slightly after:

2020-01-21 13:29:34,782+1000 INFO (vmchannels) [virt.vm] (vmId='59244f31-f6a2-474d-b52c-afe2a1e5e978') Guest API version changed from 0 to 3 (guestagent:292)

(Originally by Germano Veit Michel)
I think the problem is that the initial destination metadata is not read from the libvirt domain XML but from the _srcDomXML parameter provided in the Vdsm migration creation call. But the guest agent API version is set on the source only after the _srcDomXML value is retrieved, so it's missing in _srcDomXML.

I may still be missing something, because this doesn't by itself explain why the problem doesn't appear on every migration. But up-to-date metadata should be present in _srcDomXML anyway, and that should be fixed.

Germano, would you have a chance to check whether https://gerrit.ovirt.org/106483 fixes the problem with after_migration events?

(Originally by Milan Zamazal)
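The ordering problem described in this comment can be illustrated with a toy model (all names here are hypothetical, not VDSM's actual code): the source snapshots the domain XML for the migration creation call before writing guestAgentAPIVersion into the metadata, so the snapshot, and hence the destination's initial metadata, lacks the key. The fix amounts to updating the metadata before taking the snapshot:

```python
# Toy model of the suspected ordering bug: _srcDomXML is captured before
# the source sets guestAgentAPIVersion in the metadata, so the destination
# never sees the key in its initial metadata.

class SourceVm:
    def __init__(self):
        self.metadata = {}

    def snapshot_dom_xml(self):
        # stands in for what gets passed as _srcDomXML to the destination
        return dict(self.metadata)

def buggy_migration(vm):
    src_dom_xml = vm.snapshot_dom_xml()       # snapshot taken first...
    vm.metadata['guestAgentAPIVersion'] = 3   # ...version written afterwards
    return src_dom_xml

def fixed_migration(vm):
    vm.metadata['guestAgentAPIVersion'] = 3   # write the version first
    return vm.snapshot_dom_xml()              # then take the snapshot

assert 'guestAgentAPIVersion' not in buggy_migration(SourceVm())
assert fixed_migration(SourceVm())['guestAgentAPIVersion'] == 3
```

This toy model makes the bug deterministic; in the real system the visible effect also depended on timing, which is consistent with the problem appearing only on some migrations.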
Hi Milan, sure I can. Just did not have any chance to work on this today. Please hold a bit, hopefully tomorrow. (Originally by Germano Veit Michel)
(In reply to Milan Zamazal from comment #8)
> Germano, would you have a chance to check whether
> https://gerrit.ovirt.org/106483 fixes the problem with after_migration
> events?

It seems to work for me. Did 4 migrations in a row, and it worked 4 times:

# grep after /var/log/ovirt-guest-agent/ovirt-guest-agent.log
Dummy-1::INFO::2020-01-24 13:14:07,029::hooks::64::Hooks::Hook(after_migration) "/etc/ovirt-guest-agent/hooks.d/after_migration/55_flush-caches" executed
Dummy-1::INFO::2020-01-24 13:19:58,731::hooks::64::Hooks::Hook(after_migration) "/etc/ovirt-guest-agent/hooks.d/after_migration/55_flush-caches" executed
Dummy-1::INFO::2020-01-24 13:21:23,520::hooks::64::Hooks::Hook(after_migration) "/etc/ovirt-guest-agent/hooks.d/after_migration/55_flush-caches" executed
Dummy-1::INFO::2020-01-24 13:22:41,195::hooks::64::Hooks::Hook(after_migration) "/etc/ovirt-guest-agent/hooks.d/after_migration/55_flush-caches" executed

And the only Guest API version change was when the Guest started, nothing on migrations:

2020-01-24 13:13:40,653+1000 INFO (vmchannels) [virt.vm] (vmId='59244f31-f6a2-474d-b52c-afe2a1e5e978') Guest API version changed from 0 to 3 (guestagent:292)

(Originally by Germano Veit Michel)
Hi Germano, thank you for testing. I posted a patch fixing the bug in master. (Originally by Milan Zamazal)
You are welcome, and thanks for the fix! Is this on the radar for 4.3.z after the fix is accepted? There is a customer ticket attached... (Originally by Germano Veit Michel)
The patch is basically one line, so it's definitely a suitable target. (Originally by Ryan Barry)
WARN: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [Found non-acked flags: '{'rhevm-4.3.z': '?'}', ] For more info please contact: rhv-devops
To clarify the scope of this issue: the after_migration hook functionality is available only when using ovirt-guest-agent. That means it can work only with el7 guests; it doesn't work with el8 guests, since there is no ovirt-guest-agent on el8 and AFAIK there is no substitute for this kind of hook functionality anywhere.
Verified on:
- RHV 4.3.9.1-0.1.el7
- vdsm-4.30.42-1.el7ev.x86_64 (host)
- libvirt-4.5.0-33.el7.x86_64 (host)

Verification steps:
1. Migrate VM.

Result: after_migration is sent to the guest:

# tail -f /var/log/vdsm/vdsm.log | grep 'after_migration'
2020-03-09 13:52:38,583+0200 DEBUG (vm/be5d4f39) [virt.vm] (vmId='be5d4f39-cb05-4a4d-9c6e-9856ff9c3e61') send_lifecycle_event after_migration called (guestagent:565)
2020-03-09 13:52:38,932+0200 DEBUG (vm/be5d4f39) [virt.vm] (vmId='be5d4f39-cb05-4a4d-9c6e-9856ff9c3e61') sent '{"__name__": "lifecycle-event", "type": "after_migration"}\n' (guestagent:342)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:1307