Due to a problem in qemu, starting a VM with a maximum memory size of more than circa 256GB takes noticeable time. The time grows progressively, with worse than O(n^2) complexity; at 1TB it is already minutes. RHEV defaults to a 4TB maximum when memory hot plug is enabled, so VM startup takes ages, reflected in the RHEV GUI as a VM stuck in the "WaitForLaunch" state forever.
David, please feel free to open your own qemu-kvm bug for more details. I'd like to track it as part of RHEV just in case we have to create a different code path/config for x86 vs. ppc (in case it turns out to be a limitation on the qemu side during 3.6).
I've created bug 1262143 to track the qemu side of this. The Regression flag doesn't seem quite right for this bug, since memory hotplug is a new feature.
This is an AutomationBlocker at most, as for manual tests the workaround is a simple decrease of the maximum allowed memory size. Either way, due to https://bugzilla.redhat.com/show_bug.cgi?id=1262143#c1 I propose to limit the maximum size on POWER to 1TB so that not all VMs are affected; only if someone wants/needs a >1TB VM should a configuration option be used to increase the limit (and then accept the startup delay on all VMs).
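For illustration, a minimal sketch of that workaround on the engine host (assuming the limit stays exposed through the existing VM64BitMaxMemorySizeInMB engine-config key and the 3.6 config version applies; 1048576 MB = 1TB):

  # cap maxmem at 1TB for the 3.6 config version, then restart the engine to apply it
  engine-config -s VM64BitMaxMemorySizeInMB=1048576 --cver=3.6
  service ovirt-engine restart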
I think the definition for AutomationBlocker is an issue that prevents automation from running, not any automation failure. Thus, removing this flag.
Created attachment 1072834 [details] Test_with_512GB
Testing update: I tested memory hot plug on PPC with VM64BitMaxMemorySizeInMB set to 512GB, 510GB and 256GB.
Setup:
RHEVM 3.6.0.12: Red Hat Enterprise Virtualization Manager Version: 3.6.0-0.15.master.el6
VDSM: vdsm-4.17.6-1.el7ev
Libvirt: libvirt-1.2.17-8.el7
Results:
1. With 512GB and 510GB the VM failed to run: "VM golden_env_mixed_virtio_0 is down with error. Exit message: Lost connection with qemu process."
2. With 256GB the test PASSED.
Attached engine, vdsm and qemu logs.
David, is there any other limitation regarding RAM size? It seems that in comment #5 it fails to start with 512GB.
Well, there aren't supposed to be other limitations but there's always the possibility of further bugs. It looks like the problem you're hitting is the same one reported in bug 1262143 comment 2. I'm not immediately sure why you and Qunfang both hit this, but I didn't - I'm investigating. As a temporary workaround for testing you may be able to configure a larger maxmem if you minimise the number of other devices (of any sort) in the guest - the problem appears to be that we're running out of space in the limited buffer for the guest device tree.
Michal, regarding the doc text. We have a fix in the queue that should fix the startup times - not completely, but now minutes of startup time should only start happening around 2T of maxmem. However there's another problem that means a 1T limit is a good idea: bug 1263039 covers a crash during guest boot with certain guests and maxmem above around 256G (exactly where depends on how many cpus and other devices are in the system). We have a fix for that, but it just increases a small limited buffer by a certain factor. 1T of maxmem and plenty of devices should be safe with the fix, but 2T of maxmem isn't. We plan to fix this better, but that will require more upstream work and won't be ready for RHEL 7.2.
Hi Michal, I have updated the doc text. Please let me know if anything needs to be changed. Kind regards, Julie
how about this?
(In reply to Michal Skrivanek from comment #11) > how about this? Thanks! Looks good.
Verified with RHEVM on a PPC env:
RHEVM Version: 3.6.0-0.18.el6
vdsm version: vdsm-4.17.8-1.el7ev
libvirt version: libvirt-1.2.17-12.el7
Scenario:
1. Create VM with 1G
2. Hot plug memory 1G/2G/256M
3. Check memory status in the VM with free
4. Migrate VM
All cases pass.
(In reply to Israel Pinto from comment #13)
> Verified with RHEVM on a PPC env:
> RHEVM Version: 3.6.0-0.18.el6
> vdsm version: vdsm-4.17.8-1.el7ev
> libvirt version: libvirt-1.2.17-12.el7
> Scenario:
> 1. Create VM with 1G
> 2. Hot plug memory 1G/2G/256M
> 3. Check memory status in the VM with free
> 4. Migrate VM
> All cases pass.
Either your wording is unclear or this test scenario does not cover this bugfix at all. You should measure the startup time of VMs with huge amounts of (hot-pluggable) RAM, not check the memory status inside the VM (which also has nothing to do with VM migration). But maybe I misread your test case?
The problem is not with the memory size itself but with the limit set on the engine: VM64BitMaxMemorySizeInMB. With the 4T default it was impossible to start the VM; we found that if you lower VM64BitMaxMemorySizeInMB, the VM comes up with no problem. I also added migration to check that the memory stays the same on a different host.
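For reference, a quick way to confirm which cap actually reaches qemu (a sketch assuming shell access on the host; the VM name is just the one from comment #5, and the qemu process name may differ on other builds):

  # show the maxMemory element libvirt generates for the VM
  virsh -r dumpxml golden_env_mixed_virtio_0 | grep -i maxmemory
  # or pull the maxmem= value straight from the qemu command line
  ps -C qemu-kvm -o cmd= | grep -o 'maxmem=[^ ,]*'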