Created attachment 1717471 [details]
logs

Description of problem:
VM migration fails when upgrading from RHV 4.3 to 4.4

Version-Release number of selected component (if applicable):
v4.4: 4.4.3.3-0.19.el8ev -> host with rhel-8.3
v4.3: 4.3.11 -> host with rhel-7.9

How reproducible:
100%

Steps to Reproduce:
1. Use v4.3 with a VM running on the hosts
2. Back up Engine 4.3
3. Provision Engine with OS 'rhel-8.3'
4. Restore Engine from backup
5. Put the first rhel-7.9 host into maintenance, reprovision it with 'rhel-8.3' and add it to Engine
6. Put the second rhel-7.9 host, which includes a running VM, into maintenance

Actual results:
Host is stuck in "Preparing for maintenance"

Expected results:
Host should enter maintenance after migrating the VM to the new rhel-8.3 host

Additional info:
See attached logs...
It seems that VM migration fails with the following error.
The host is still holding the VM and that's why it is stuck in the "Preparing for maintenance" state.

ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-19) [] Migration of VM 'Win2016_Cvm_64b' to host 'host_mixed_1' failed: Image /var/run/vdsm/payload/4809e0c7-f4fd-4ea9-8d86-d452b942929c.5560c76d66146eb12ef8b4165f475c5c.img is not inside /run/vdsm/payload directory.
I see that the payload was saved into a file named <vm_id>.<hash>.img:
/var/run/vdsm/payload/4809e0c7-f4fd-4ea9-8d86-d452b942929c.5560c76d66146eb12ef8b4165f475c5c.img

But then the VM is migrated from a 4.3.11-8 host to a 4.4.3-6 host. In 4.4 the <hash> part was removed from the payload filename (https://gerrit.ovirt.org/#/c/102698/), so I suspect that on the target host the payload is saved to
/var/run/vdsm/payload/4809e0c7-f4fd-4ea9-8d86-d452b942929c.img
In that case the domain cannot access the payload source that is specified in the cd-rom device, and the migration fails.

This combination of migrating VMs with payload from 4.3 to 4.4 is not that common and this shouldn't block automation.
That said, we should preserve backward compatibility for migrating VMs - Milan, what do you think?
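To make the suspected mismatch concrete, here is a minimal sketch of the two naming schemes. The helper names (payload_path_43, payload_path_44) are made up for illustration and are not Vdsm functions; the directory and the file name patterns are taken from the paths quoted above, and the hash is passed in as-is since I am not asserting how it is computed.

```python
import os.path

# Directory as it appears in the path reported in the engine log.
PAYLOAD_DIR = "/var/run/vdsm/payload"


def payload_path_43(vm_id, payload_hash):
    """Hypothetical helper: 4.3-style name, <vm_id>.<hash>.img."""
    return os.path.join(PAYLOAD_DIR, "%s.%s.img" % (vm_id, payload_hash))


def payload_path_44(vm_id):
    """Hypothetical helper: 4.4-style name, <vm_id>.img (hash removed)."""
    return os.path.join(PAYLOAD_DIR, "%s.img" % vm_id)


vm_id = "4809e0c7-f4fd-4ea9-8d86-d452b942929c"
payload_hash = "5560c76d66146eb12ef8b4165f475c5c"

source_path = payload_path_43(vm_id, payload_hash)  # what the domain XML refers to
target_path = payload_path_44(vm_id)                # what the target would create
paths_match = source_path == target_path            # False: the names differ
```

If this is what happens, the cd-rom source in the migrated domain still points at the 4.3-style name, which would not exist on the target.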
(In reply to Arik from comment #2)
> This combination of migrating VMs with payload from 4.3 to 4.4 is not that
> common and this shouldn't block automation.

Roni, can you please repeat the tests on a homogeneous environment?
I suspect the problem is caused by the fact that we recently switched from /var/run to /run in Vdsm. The error message apparently comes from the injectFilesToFs function, which would fail on exactly that: the payload path sent by the older host starts with /var/run, while the check expects it to be under /run/vdsm/payload. We must find a way to handle migrations with payloads from older Vdsm versions.
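A rough sketch of the failure mode and one possible way around it, assuming the check is a plain path-containment test. I have not copied this from Vdsm; is_inside and normalize_legacy are hypothetical names, and the only facts taken from this bug are the two directory prefixes and the image path from the error message.

```python
import os.path

PAYLOAD_DIR = "/run/vdsm/payload"             # directory named in the error message
LEGACY_PAYLOAD_DIR = "/var/run/vdsm/payload"  # prefix used by the 4.3 source host


def is_inside(path, base):
    """Lexical containment check on absolute paths; symlinks are not resolved."""
    path = os.path.normpath(path)
    base = os.path.normpath(base)
    return os.path.commonpath([path, base]) == base


def normalize_legacy(path):
    """Hypothetical helper: map the legacy /var/run prefix onto /run."""
    path = os.path.normpath(path)
    if is_inside(path, LEGACY_PAYLOAD_DIR):
        relative = os.path.relpath(path, LEGACY_PAYLOAD_DIR)
        return os.path.normpath(os.path.join(PAYLOAD_DIR, relative))
    return path


old_path = ("/var/run/vdsm/payload/"
            "4809e0c7-f4fd-4ea9-8d86-d452b942929c."
            "5560c76d66146eb12ef8b4165f475c5c.img")

rejected_as_is = not is_inside(old_path, PAYLOAD_DIR)
accepted_after_normalizing = is_inside(normalize_legacy(old_path), PAYLOAD_DIR)
```

Resolving symlinks with os.path.realpath before the check would be another option on hosts where /var/run is a symlink to /run, but a lexical mapping like the above does not depend on the filesystem layout.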
(In reply to Arik from comment #3)
> (In reply to Arik from comment #2)
> > This combination of migrating VMs with payload from 4.3 to 4.4 is not that
> > common and this shouldn't block automation.
>
> Roni, can you please repeat the tests on a homogeneous environment?

When I shut down the VM that was running on the v4.3 host and then start it on the v4.4 host, I can successfully migrate it to the v4.4 host and to the v4.3 host, and vice versa.
It is still an upgrade issue if we want to keep the VMs running during the upgrade.
(In reply to Roni from comment #5)
> When I shut down the VM that was running on the v4.3 host and then start it
> on the v4.4 host
> then I can successfully migrate it to v4.4 host and to v4.3 host and vice
> versa.
> It is still an upgrade issue if we want to keep the VMs running during the
> upgrade

Yes, it seems the VM initially ran with run-once + payload. When the VM is shut down and started again, it starts without the payload, and then migration from a 4.3 host to a 4.4 host would work.
(In reply to Milan Zamazal from comment #4)
> I suspect the problem is caused by the fact that we switched from /var/run
> to /run recently in Vdsm and the error message apparently comes from
> injectFilesToFs function that would fail exactly on that. We must find a way
> how to handle migrations with payloads from older Vdsm versions.

Good, that explains why this issue was reported as a recently introduced regression (the change I mentioned in comment 2 got in a long time ago).
Petr, why does it depend on bz 1883817?
Because that bug blocks our upgrade flow where this bug happened.
(In reply to Petr Matyáš from comment #9)
> Because that bug blocks our upgrade flow where this bug happened.

So wouldn't it make more sense to set it the other way around, so that this bug blocks bz 1883817, and to verify this one outside the context of an upgrade?
Verified on vdsm-4.40.35-1.el8ev
This bugzilla is included in oVirt 4.4.3 release, published on November 10th 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.3 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.