Due to a problem in qemu, starting a VM with a maximum memory size of more than circa 256GB takes noticeable time. The time grows progressively, with worse than O(n^2) complexity; at 1TB it is already minutes. RHEV defaults to a 4TB maximum when memory hot plug is enabled, so VM startup takes ages, reflected in the RHEV GUI as a VM stuck in the "WaitForLaunch" state forever.
David, please feel free to open your own qemu-kvm bug for more details. I'd like to track it as part of RHEV just in case we have to create a different code path/config for x86 vs. ppc (in case it turns out to be a limitation on the qemu side during 3.6).
I've created bug 1262143 to track the qemu side of this. The Regression flag doesn't seem quite right for this bug, since memory hotplug is a new feature.
This is an AutomationBlocker at most, as for manual tests the workaround is a simple decrease of the maximum allowed memory size. Either way, due to https://bugzilla.redhat.com/show_bug.cgi?id=1262143#c1 I propose to limit the maximum size on POWER to 1TB so that not all VMs are affected; only if someone wants/needs a >1TB VM should a configuration option be used to increase the limit (and then accept the startup delay on all VMs).
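For illustration, a minimal sketch of that workaround on the engine host (assuming the limit stays exposed through the existing VM64BitMaxMemorySizeInMB engine-config key and the 3.6 config version applies; 1048576 MB = 1TB):

  # cap maxmem at 1TB for the 3.6 config version, then restart the engine to apply it
  engine-config -s VM64BitMaxMemorySizeInMB=1048576 --cver=3.6
  service ovirt-engine restart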
I think the definition for AutomationBlocker is an issue that prevents automation from running, not any automation failure. Thus, removing this flag.
Created attachment 1072834 [details] Test_with_512GB
Testing update: I tested memory hot plug on PPC with VM64BitMaxMemorySizeInMB set to 512GB, 510GB and 256GB.
Setup:
RHEVM 3.6.0.12: Red Hat Enterprise Virtualization Manager Version: 3.6.0-0.15.master.el6
VDSM: vdsm-4.17.6-1.el7ev
Libvirt: libvirt-1.2.17-8.el7
Results:
1. With 512GB and 510GB the VM failed to run: "VM golden_env_mixed_virtio_0 is down with error. Exit message: Lost connection with qemu process."
2. With 256GB the test PASSED.
Attached engine, vdsm and qemu logs.
David, is there any other limitation regarding RAM size? It seems that in comment #5 it fails to start with 512GB.
Well, there aren't supposed to be other limitations but there's always the possibility of further bugs. It looks like the problem you're hitting is the same one reported in bug 1262143 comment 2. I'm not immediately sure why you and Qunfang both hit this, but I didn't - I'm investigating. As a temporary workaround for testing you may be able to configure a larger maxmem if you minimise the number of other devices (of any sort) in the guest - the problem appears to be that we're running out of space in the limited buffer for the guest device tree.
Michal, regarding the doc text. We have a fix in the queue that should fix the startup times - not completely, but now minutes of startup time should only start happening around 2T of maxmem. However there's another problem that means a 1T limit is a good idea: bug 1263039 covers a crash during guest boot with certain guests and maxmem above around 256G (exactly where depends on how many cpus and other devices are in the system). We have a fix for that, but it just increases a small limited buffer by a certain factor. 1T of maxmem and plenty of devices should be safe with the fix, but 2T of maxmem isn't. We plan to fix this better, but that will require more upstream work and won't be ready for RHEL 7.2.
Hi Michal, I have updated the doc text. Please let me know if anything needs to be changed. Kind regards, Julie
how about this?
(In reply to Michal Skrivanek from comment #11) > how about this? Thanks! Looks good.
Verified with RHEVM on a PPC env:
RHEVM Version: 3.6.0-0.18.el6
vdsm version: vdsm-4.17.8-1.el7ev
libvirt version: libvirt-1.2.17-12.el7
Scenario:
1. Create VM with 1G
2. Hot plug memory 1G/2G/256M
3. Check memory status in the VM with free
4. Migrate VM
All cases pass.
(In reply to Israel Pinto from comment #13)
> Verified with RHEVM on a PPC env:
> RHEVM Version: 3.6.0-0.18.el6
> vdsm version: vdsm-4.17.8-1.el7ev
> libvirt version: libvirt-1.2.17-12.el7
> Scenario:
> 1. Create VM with 1G
> 2. Hot plug memory 1G/2G/256M
> 3. Check memory status in the VM with free
> 4. Migrate VM
> All cases pass.
Either your wording is unclear or this test scenario does not cover this bugfix at all. You should measure the startup time of VMs with huge amounts of (hot-pluggable) RAM, not check the memory status inside the VM (which also has nothing to do with VM migration). But maybe I misread your test case?
The problem is not with the memory size itself but with the limit set on the engine: VM64BitMaxMemorySizeInMB. With the 4T default it was impossible to start the VM; we found that if you lower VM64BitMaxMemorySizeInMB, the VM comes up with no problem. I also added migration to check that the memory stays the same on a different host.
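For reference, a quick way to confirm which cap actually reaches qemu (a sketch assuming shell access on the host; the VM name is just the one from comment #5, and the qemu process name may differ on other builds):

  # show the maxMemory element libvirt generates for the VM
  virsh -r dumpxml golden_env_mixed_virtio_0 | grep -i maxmemory
  # or pull the maxmem= value straight from the qemu command line
  ps -C qemu-kvm -o cmd= | grep -o 'maxmem=[^ ,]*'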