Bug 1269180

Summary: Device creation should not exit(1) on OOM - qxl and virtio-vga
Product: Red Hat Enterprise Linux Advanced Virtualization
Reporter: Markus Armbruster <armbru>
Component: qemu-kvm
Assignee: Gerd Hoffmann <kraxel>
Status: CLOSED WONTFIX
QA Contact: FuXiangChun <xfu>
Severity: medium
Docs Contact:
Priority: medium
Version: ---
CC: chayang, dgilbert, jinzhao, juzhang, kanderso, knoel, kraxel, michen, rbalakri, virt-maint, xfu
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-12-12 07:06:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Markus Armbruster 2015-10-06 14:21:35 UTC
Description of problem:
A number of device models call exit(1) on out-of-memory errors in their realize methods.  They should propagate the error instead.  Known offenders: "stm32f205-soc", "xlnx,zynqmp", "cgthree", "qxl-vga", "qxl", "SUNW,tcx", "isa-cirrus-vga", "cirrus-vga", "isa-vga", "VGA", "secondary-vga", "virtio-vga", "vmware-svga".

Some device models do it in their instance_init() methods.  The code that can fail that way should be moved to the realize method instead, where the error can be propagated properly.  Known offenders: "cgthree", "SUNW,tcx".
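
To illustrate the split described above, here is a minimal self-contained sketch in plain C. It is not QEMU code: SketchError, SketchDevice, sketch_instance_init(), sketch_realize() and the allocator hook are hypothetical names invented for this example. The init step cannot fail, and the fallible allocation lives in the realize step, which reports failure to its caller instead of calling exit(1).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for an error object; not QEMU's Error API. */
typedef struct SketchError {
    char msg[128];
} SketchError;

/* Hypothetical device with one large buffer, e.g. video memory. */
typedef struct SketchDevice {
    void *vram;
    size_t vram_size;
} SketchDevice;

/* Allocator hook so a caller can rig an out-of-memory condition. */
typedef void *(*sketch_alloc_fn)(size_t size);

static void *sketch_failing_alloc(size_t size)
{
    (void)size;
    return NULL; /* always pretend memory is exhausted */
}

/* Phase 1: only sets defaults, so it has nothing that can fail. */
static void sketch_instance_init(SketchDevice *dev)
{
    assert(dev != NULL);
    dev->vram = NULL;
    dev->vram_size = 0;
}

/*
 * Phase 2: does the allocation. Returns 0 on success and -1 on
 * failure. On failure it fills *err and leaves the policy decision
 * (retry, reject the hotplug, exit) to the caller.
 */
static int sketch_realize(SketchDevice *dev, size_t vram_size,
                          sketch_alloc_fn alloc, SketchError *err)
{
    assert(dev != NULL && alloc != NULL && err != NULL);

    dev->vram = alloc(vram_size);
    if (dev->vram == NULL) {
        snprintf(err->msg, sizeof(err->msg),
                 "cannot allocate %zu bytes of device memory", vram_size);
        return -1;
    }
    dev->vram_size = vram_size;
    return 0;
}
```

A caller would run sketch_instance_init() first and then check the return value of sketch_realize(); passing sketch_failing_alloc is one way of "rigging the out-of-memory condition" for a test.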

Several of these devices are irrelevant for RHEL-7, but not all.

How reproducible:
Found by code inspection.  Actually reproducing the incorrect exit(1) would involve rigging the out-of-memory condition somehow.

Additional info:
See http://lists.gnu.org/archive/html/qemu-devel/2015-09/msg03493.html

Comment 4 Ademar Reis 2018-12-10 21:32:56 UTC
(In reply to Markus Armbruster from comment #0)
> Description of problem:
> A number of device models call exit(1) on out-of-memory errors in their
> realize methods.  They should propagate the error instead.  Known offenders:
> "stm32f205-soc", "xlnx,zynqmp", "cgthree", "qxl-vga", "qxl", "SUNW,tcx",
> "isa-cirrus-vga", "cirrus-vga", "isa-vga", "VGA", "secondary-vga",
> "virtio-vga", "vmware-svga".

From the list above, it looks like qxl and virtio-vga are the only ones we care about in RHEL. Gerd, can you please review the status upstream? I'm tempted to close this BZ.

Comment 5 Markus Armbruster 2018-12-11 07:14:31 UTC
Both qxl and virtio-vga are still affected upstream.

VGA devices call vga_common_init(), which passes &error_fatal to memory_region_init_ram_nomigrate().  Some memory region functions including this one allocate guest RAM with qemu_ram_alloc(), which boils down to g_malloc0().  g_malloc0() treats OOM as fatal.  So, even if vga_common_init() handled errors properly instead of passing &error_fatal, it still wouldn't get to handle OOM.

Allocating memory regions is the common pattern for all the devices listed in this bug.
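
The layering described in this comment can be sketched in plain C. This is only loosely modeled on the pattern and is not QEMU code: sketch_error_fatal, sketch_error_set(), sketch_region_alloc() and the two init variants are hypothetical names, and the simulate_oom flag exists only so the failure path can be exercised. A helper that hard-codes the "fatal" destination takes the decision away from its caller, while a helper that forwards errp lets the caller decide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sentinel: passing its address means "treat errors as fatal". */
static const char *sketch_error_fatal;

/* Store the message for the caller, or exit if the sentinel was passed. */
static void sketch_error_set(const char **errp, const char *msg)
{
    assert(errp != NULL);
    if (errp == &sketch_error_fatal) {
        fprintf(stderr, "fatal: %s\n", msg);
        exit(1);
    }
    *errp = msg;
}

/* Lower layer: reports allocation failure through errp. */
static void *sketch_region_alloc(size_t size, bool simulate_oom,
                                 const char **errp)
{
    void *p = simulate_oom ? NULL : calloc(1, size);

    if (p == NULL) {
        sketch_error_set(errp, "cannot allocate region");
    }
    return p;
}

/* Upper layer that hard-codes the fatal policy; callers never see errors. */
static void *sketch_init_fatal(size_t size, bool simulate_oom)
{
    return sketch_region_alloc(size, simulate_oom, &sketch_error_fatal);
}

/* Upper layer that forwards errp so its caller chooses the policy. */
static void *sketch_init_propagating(size_t size, bool simulate_oom,
                                     const char **errp)
{
    return sketch_region_alloc(size, simulate_oom, errp);
}
```

Note the limit pointed out above: in this sketch the lower layer returns NULL on failure, which is an assumption made for illustration. If the lowest-level allocator itself aborts on OOM, as g_malloc0() does, then even the propagating variant never gets an error to forward.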

I discussed qxl on the upstream mailing list:

    https://lists.nongnu.org/archive/html/qemu-devel/2018-10/msg03853.html
    Subject: Re: When it's okay to treat OOM as fatal?
    Message-ID: <87o9brl7zc.fsf.sub.org>

I've since come to the conclusion that fixing this is not worthwhile.  I made my case in that thread.  I can summarize here if necessary.  For the record, David Gilbert (cc'ed) disagrees with me.

Comment 6 Ademar Reis 2018-12-11 16:37:49 UTC
Reassigning to Gerd, who maintains both virtio-vga and qxl. Gerd: it's up to you to decide if this is worth fixing.

Comment 7 Gerd Hoffmann 2018-12-12 07:06:52 UTC
(In reply to Ademar Reis from comment #6)
> Reassigning to Gerd, who maintains both virtio-vga and qxl. Gerd: it's up to
> you to decide if this is worth fixing.

I'd say no.

For one, hotplugging display devices is a rather uncommon use case, and it is not supported very well in qemu.

Second, it'll only actually work if you turn off memory overcommit.  Otherwise you can allocate more memory than you actually have ram+swap for (see comment 5 qemu-devel link), and the Linux kernel's OOM killer may get you when the guest starts using the vga device memory and the pages get faulted in.
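
The overcommit point can be made concrete with a small C sketch. The function names are hypothetical and this is not QEMU code. On Linux, /proc/sys/vm/overcommit_memory reports the policy (0 = heuristic, 1 = always overcommit, 2 = strict accounting). With overcommit enabled, a successful allocation only reserves address space, and physical pages are committed when they are first written, which is what the page-touching loop does.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Linux-specific: read the overcommit policy.
 * Returns -1 if the file cannot be read (e.g. not on Linux).
 */
static int sketch_read_overcommit_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    int mode = -1;

    if (f == NULL) {
        return -1;
    }
    if (fscanf(f, "%d", &mode) != 1) {
        mode = -1;
    }
    fclose(f);
    return mode;
}

/*
 * Write one byte per page so each page has to be backed by real
 * memory. Returns the number of pages touched. With overcommit on,
 * this is the step where the OOM killer can strike, not the earlier
 * allocation call.
 */
static size_t sketch_touch_pages(unsigned char *buf, size_t size,
                                 size_t page_size)
{
    size_t touched = 0;

    assert(buf != NULL && page_size > 0);
    for (size_t off = 0; off < size; off += page_size) {
        buf[off] = 1;
        touched++;
    }
    return touched;
}
```

A NULL check right after the allocation therefore only catches every failure when the mode is 2. In the other modes the allocation can succeed and the process can still be killed later, while sketch_touch_pages() or the guest is writing to the memory.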