Bug 2055905
Summary: | Virtual Machine with large number of Direct LUNs failed to migrate without adjusting vars/migration_listener_timeout | |
---|---|---|---
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic>
Component: | vdsm | Assignee: | Milan Zamazal <mzamazal>
Status: | CLOSED ERRATA | QA Contact: | Qin Yuan <qiyuan>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | 4.4.9 | CC: | aefrat, ahadas, apinnick, emarcus, lsurette, mavital, michal.skrivanek, mzamazal, sfishbai, srevivo, ycui
Target Milestone: | ovirt-4.5.1 | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | vdsm-4.50.1.1 | Doc Type: | Bug Fix
Doc Text: | The default migration timeout period was increased to enable VMs with many direct LUN disks, which require more preparation time on the destination host, to be migrated. The migration_listener_prepare_disk_timeout and max_migration_listener_timeout VDSM options were added to enable the default migration timeout period to be extended when needed. | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-07-14 12:41:13 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
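The two options named in the Doc Text live under VDSM's vars configuration section. As a minimal sketch of how they might be tuned on the destination host, assuming a standard /etc/vdsm/vdsm.conf.d/ drop-in (the file name and values below are illustrative only, and vdsmd needs a restart to pick them up):

```ini
# Hypothetical drop-in: /etc/vdsm/vdsm.conf.d/99-migration-timeouts.conf
# (illustrative values; restart vdsmd on the destination host afterwards)
[vars]
# extra preparation time granted per direct LUN disk, in seconds
migration_listener_prepare_disk_timeout = 30
# upper bound on the total preparation timeout, in seconds
max_migration_listener_timeout = 800
```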
Description Germano Veit Michel 2022-02-17 21:58:45 UTC
moving to virt for considering better timeouts. e.g. do we have the number of drives at hand?

Germano, could you please attach the relevant parts from the logs here?

(In reply to Arik from comment #2)
> Germano, could you please attach the relevant parts from the logs here?

Of course. Migration of this VM 8bd921f9-21d1-4945-8088-1adb37abf02e starting at 2022-02-09 08:37:36,721-0500

Also, would be nice to have this out of debug level and into error, so the problem is clearer from vdsm logs without debug.

6018 self.log.debug('Timeout while waiting for path preparation')

As discussed with Nir, the issue is indeed likely to be caused by many udev reloads, which are quite slow with so many devices on the host. If this is the case then the proper solution would be not to reload udev configuration after each rule change but only once all the rules are changed. Then timeout adjustments wouldn't be needed. It would probably require a non-trivial amount of work though.

(In reply to Milan Zamazal from comment #7)
> As discussed with Nir, the issue is indeed likely to be caused by many udev
> reloads, which are quite slow with so many devices on the host. If this is
> the case then the proper solution would be not to reload udev configuration
> after each rule change but only once all the rules are changed. Then timeout
> adjustments wouldn't be needed. It would probably require a non-trivial
> amount of work though.

Yes, if each LUN took a lot less time (around 80% faster) it would complete in the 30s. The _loadCorrectedTimeout() is still a bit off though, the number of udev devices and LUNs on the VM have greater influence than the "load" (number of VMs) on the host.

(In reply to Milan Zamazal from comment #7)
> As discussed with Nir, the issue is indeed likely to be caused by many udev
> reloads, which are quite slow with so many devices on the host. If this is
> the case then the proper solution would be not to reload udev configuration
> after each rule change but only once all the rules are changed. Then timeout
> adjustments wouldn't be needed. It would probably require a non-trivial
> amount of work though.

OTOH it's not difficult to improve the _loadCorrectedTimeout to account for number of drives. Just add 2s/disk in waitForMigrationDestinationPrepare and we have a great improvement for large VMs, while for most common cases this won't make any difference. Perhaps with some upper bound.

It's not difficult to extend the timeout by some value per disk. The difficult part is what the value should be. 1, 2, 3 seconds, all is equally good or bad. And since it depends on the destination host, we cannot measure easily how long it takes to reload udev rules. But if you think 2 seconds is a good value (and obviously better than 0) then yes, we can lengthen the timeout this way.

(In reply to Milan Zamazal from comment #11)
> It's not difficult to extend the timeout by some value per disk. The
> difficult part is what the value should be. 1, 2, 3 seconds, all is equally
> good or bad. And since it depends on the destination host, we cannot measure
> easily how long it takes to reload udev rules. But if you think 2 seconds is
> a good value (and obviously better than 0) then yes, we can lengthen the
> timeout this way.

when they're equally good or bad then just pick one;) With some upper bound, say 5 minutes.

(In reply to Michal Skrivanek from comment #12)
> when they're equally good or bad then just pick one;) With some upper bound,
> say 5 minutes.

OK, will do.
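The approach the thread converges on, extending the destination-side preparation timeout by a fixed amount per disk with an upper bound, can be summarized with a small sketch. The function name and defaults below are illustrative only (the 2 seconds per disk and the upper bound come from the discussion above), not vdsm's actual implementation:

```python
def prepare_timeout(base_timeout, num_disks, per_disk=2, max_timeout=600):
    """Illustrative sketch: grow the path-preparation timeout with the
    number of disks, but never beyond an upper bound."""
    return min(base_timeout + num_disks * per_disk, max_timeout)

# e.g. a VM with 100 direct LUNs and a 36 s base timeout would get
# min(36 + 100 * 2, 600) = 236 seconds instead of the plain 36.
```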
Verified with: vdsm-4.50.1.1-1.el8ev.x86_64

Steps:
1. Migrate a VM without direct LUNs from host1 to host2, and check the preparation timeout in vdsm.log on host2.
2. Attach 20 direct LUNs to the VM, migrate it from host1 to host2, and check vdsm.log on host2 to see whether the preparation timeout = (timeout in step 1 + 20*2) s.
3. Set vars/migration_listener_prepare_disk_timeout to 30 on host2, migrate the VM from host1 to host2, and check vdsm.log on host2 to see whether the preparation timeout = max_migration_listener_timeout = 600 s.
4. Set vars/max_migration_listener_timeout to 800 on host2, migrate the VM from host1 to host2, and check vdsm.log on host2 to see whether the preparation timeout = (timeout in step 1 + 20*30) s.

Results:
1. Migrating the VM without direct LUNs succeeded; the preparation timeout is 36.0 s:
2022-06-14 14:16:09,015+0300 DEBUG (jsonrpc/5) [virt.vm] (vmId='a3c4c8c2-cb63-4451-8a13-c1f0277815eb') migration destination: waiting 36.0s for path preparation (vm:5805)
2. Migrating the VM with 20 direct LUNs attached succeeded; the preparation timeout is 76.0 s, equal to (36.0 + 20*2) s:
2022-06-14 14:56:56,398+0300 DEBUG (jsonrpc/2) [virt.vm] (vmId='a3c4c8c2-cb63-4451-8a13-c1f0277815eb') migration destination: waiting 76.0s for path preparation (vm:5805)
3. Migrating the VM with 20 direct LUNs attached and vars/migration_listener_prepare_disk_timeout=30 succeeded; the preparation timeout is 600 s, which is the default max_migration_listener_timeout:
2022-06-14 15:08:41,154+0300 DEBUG (jsonrpc/5) [virt.vm] (vmId='a3c4c8c2-cb63-4451-8a13-c1f0277815eb') migration destination: waiting 600s for path preparation (vm:5805)
4. Migrating the VM with 20 direct LUNs attached, vars/migration_listener_prepare_disk_timeout=30, and vars/max_migration_listener_timeout=800 succeeded; the preparation timeout is 636.0 s, equal to (36.0 + 20*30) s:
2022-06-14 15:18:06,290+0300 DEBUG (jsonrpc/0) [virt.vm] (vmId='a3c4c8c2-cb63-4451-8a13-c1f0277815eb') migration destination: waiting 636.0s for path preparation (vm:5805)

According to these results, the bug fix works as expected: when migrating a VM with direct LUNs attached, the preparation timeout is extended by (number of direct LUNs * migration_listener_prepare_disk_timeout) seconds and does not exceed max_migration_listener_timeout.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) [ovirt-4.5.1] update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:5583

Due to QE capacity, we are not going to cover this issue in our automation.
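As a cross-check of the arithmetic in the results above, a short sketch (the helper below is hypothetical and not part of vdsm) reproduces the four observed preparation timeouts:

```python
def expected_timeout(base, num_luns, per_disk, max_timeout):
    # Same capped formula as verified above: base timeout plus a per-disk
    # extension, never exceeding the configured maximum (illustrative only).
    return min(base + num_luns * per_disk, max_timeout)

print(expected_timeout(36.0, 0, 2, 600))    # step 1 -> 36.0
print(expected_timeout(36.0, 20, 2, 600))   # step 2 -> 76.0
print(expected_timeout(36.0, 20, 30, 600))  # step 3 -> 600 (capped)
print(expected_timeout(36.0, 20, 30, 800))  # step 4 -> 636.0
```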