Bug 1911581
| Summary: | Core dump when hitting file descriptor limit | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux Advanced Virtualization | Reporter: | Xujun Ma <xuma> |
| Component: | qemu-kvm | Assignee: | Greg Kurz <gkurz> |
| qemu-kvm sub component: | General | QA Contact: | Xujun Ma <xuma> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | medium | CC: | ailan, chayang, ddepaula, gkurz, jinzhao, juzhang, nilal, pbonzini, qzhang, smitterl, virt-maint, yama, yuhuang |
| Version: | 8.4 | Keywords: | TestOnly, Triaged |
| Target Milestone: | rc | | |
| Target Release: | 8.5 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | qemu-kvm-6.0.0-17.module+el8.5.0+11173+c9fce0bb | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-11-16 07:51:11 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1788991 | | |
Description (Xujun Ma, 2020-12-30 06:31:41 UTC)
Xujun, can you confirm the hardware field?

I'm pretty sure this is the same as the problem Greg looked at a while back. Some changes have meant that qemu consumes more file descriptors per vcpu, which means we can now run into the RHEL default fd limits.

(In reply to Qunfang Zhang from comment #1)
> Xujun, can you confirm the hardware field?

I have tested it; x86 has this problem too.

(In reply to David Gibson from comment #2)
> I'm pretty sure this is the same as the problem Greg looked at a while back.
> Some changes have meant that qemu consumes more file descriptors per vcpu,
> which means we can now run into the RHEL default fd limits.

Hmm... indeed we do see a "Too many open files" error that is likely the same as in bug #1902548, but here we also have a QEMU crash. I'd prefer to have a look before marking this bug as a duplicate of the other bug.

(In reply to Greg Kurz from comment #4)
> Hmm... indeed we do see a "Too many open files" error that is likely the
> same as in bug #1902548, but here we also have a QEMU crash. I'd prefer
> to have a look before marking this bug as a duplicate of the other bug.

The QEMU crash happens because the rollback path does:

    fail_vrings:
        aio_wait_bh_oneshot(s->ctx, virtio_scsi_dataplane_stop_bh, s);

virtio_scsi_dataplane_stop_bh() clears the host notifiers and causes the vq handlers to be invoked. This triggers the assertion in virtio_scsi_data_plane_handle_ctrl() because s->dataplane_started hasn't been set to true yet.

So even if the root cause is the same (we ran into fd limits), this isn't a duplicate of bug #1902548: virtio-scsi should have a working fallback like virtio-blk for this case, or at least print an error plus a hint to raise the fd limit and exit gracefully instead of aborting.
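To make the failure sequence above easier to follow, here is a minimal, self-contained toy model of the path Greg describes (plain C, not QEMU source; the function and field names are stand-ins for the ones mentioned above):

```c
/* Toy model of the qemu-5.2 crash path described in the comment above.
 * Not QEMU source: types and functions are stand-ins. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct scsi_dev {
    bool dataplane_started;
};

/* Stand-in for virtio_scsi_data_plane_handle_cmd()/_ctrl(): the real
 * handlers assert that dataplane startup has already completed. */
static void handle_cmd(struct scsi_dev *s)
{
    assert(s->dataplane_started);   /* this is the assertion that fires */
    puts("request handled in dataplane");
}

/* Stand-in for virtio_scsi_dataplane_stop_bh(): clearing the host
 * notifiers kicks the still-registered queue handlers one last time. */
static void stop_bh(struct scsi_dev *s)
{
    handle_cmd(s);
}

/* Stand-in for virtio_scsi_dataplane_start(): if setting up an event
 * notifier fails ("Too many open files"), the fail_vrings rollback runs
 * stop_bh() before dataplane_started was ever set, so handle_cmd() aborts. */
static int dataplane_start(struct scsi_dev *s, bool notifier_setup_fails)
{
    if (notifier_setup_fails) {
        stop_bh(s);     /* rollback -> handler runs -> assertion failure */
        return -1;
    }
    s->dataplane_started = true;
    return 0;
}

int main(void)
{
    struct scsi_dev s = { .dataplane_started = false };
    dataplane_start(&s, true);   /* simulate hitting the fd limit */
    return 0;
}
```

Running this aborts on the assert, mirroring the `virtio_scsi_data_plane_handle_cmd: Assertion ... failed` line in the x86 log that follows.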
Good analysis, thanks Greg. Can you work with Igor to figure out how to fix that?

virtio-scsi is Paolo's domain, adding him to CC.

(In reply to Xujun Ma from comment #3)
> (In reply to Qunfang Zhang from comment #1)
> > Xujun, can you confirm the hardware field?
>
> I have tested it; x86 has this problem too.

The results on x86 are as follows:

# ./cmd.bak
QEMU 5.2.0 monitor - type 'help' for more information
(qemu) qemu-kvm: virtio_bus_set_host_notifier: unable to init event notifier: Too many open files (-24)
virtio-scsi: Failed to set host notifier (-24)
qemu-kvm: ../hw/scsi/virtio-scsi-dataplane.c:59: virtio_scsi_data_plane_handle_cmd: Assertion `s->ctx && s->dataplane_started' failed.
./cmd.bak: line 35: 109355 Aborted (core dumped) /usr/libexec/qemu-kvm -name 'avocado-vt-vm1' -sandbox on -machine q35,kernel-irqchip=split -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 -nodefaults -device VGA,bus=pcie.0,addr=0x2 -m 30720 -smp 384 -device intel-iommu,intremap=on,eim=on -cpu 'IvyBridge',+kvm_pv_unhalt -chardev socket,nowait,path=/var/tmp/monitor-qmpmonitor1-20201130-083617-UdoMuUZg,server,id=qmp_id_qmpmonitor1 -mon chardev=qmp_id_qmpmonitor1,mode=control -chardev socket,nowait,path=/var/tmp/monitor-catch_monitor-20201130-083617-UdoMuUZg,server,id=qmp_id_catch_monitor -mon chardev=qmp_id_catch_monitor,mode=control -device pvpanic,ioport=0x505,id=idUR0xIV -chardev socket,nowait,path=/var/tmp/serial-serial0-20201130-083617-UdoMuUZg,server,id=chardev_serial0 -device isa-serial,id=serial0,chardev=chardev_serial0 -chardev socket,id=seabioslog_id_20201130-083617-UdoMuUZg,path=/var/tmp/seabios-20201130-083617-UdoMuUZg,server,nowait -device isa-debugcon,chardev=seabioslog_id_20201130-083617-UdoMuUZg,iobase=0x402 -device pcie-root-port,id=pcie-root-port-1,port=0x1,addr=0x1.0x1,bus=pcie.0,chassis=2 -device qemu-xhci,id=usb1,bus=pcie-root-port-1,addr=0x0 -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-2,addr=0x0 -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=/home/rhel840-64-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off -blockdev node-name=drive_image1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 -device scsi-hd,id=image1,drive=drive_image1,write-cache=on -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 -vnc :0 -rtc base=utc -boot menu=off,order=cdn,once=c,strict=off -enable-kvm -monitor stdio -device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x3,chassis=5

(In reply to Xujun Ma from comment #0)
> Can solve this issue by increasing file limit,but there will be another
> problem:guest stop at "Trying to load: from:
> /pci@800000020000000/scsi@4/disk@100000000000000 ... Successfully loaded"
> about 3:18 minutes,I think it's not acceptable.and please help add friendly
> message if need more file limit.

Having to increase the file descriptor limit for large VMs is not necessarily a bug.

If you are seeing additional problems after you increase the limit, this might be a bug that's more serious than this one and it needs a separate BZ. Please open a BZ with more details.

(In reply to Eduardo Habkost from comment #9)
> Having to increase the file descriptor limit for large VMs is not
> necessarily a bug.
>
> If you are seeing additional problems after you increase the limit, this
> might be a bug that's more serious than this one and it needs a separate BZ.
> Please open a BZ with more details.

I have filed a new bug, https://bugzilla.redhat.com/show_bug.cgi?id=1927108, for that problem.

Do we need to add a friendly warning instead of a core dump for this situation?

(In reply to Xujun Ma from comment #10)
> I have filed a new bug, https://bugzilla.redhat.com/show_bug.cgi?id=1927108,
> for that problem.

Thanks!

> Do we need to add a friendly warning instead of a core dump for this
> situation?

Yes. Crashing instead of printing a more friendly error message after hitting the limit is a bug, but not a major one. It is certainly not a regression.

Note that a newer machine requiring more resources than older machines is not a regression. "pseries-rhel8.4.0" and "pc-q35-rhel8.4.0" are expected to increase the number of virtio queues depending on the number of VCPUs, and will require higher open file limits.
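As a rough illustration of the numbers involved (assumed figures, not measurements from this report): on these machine types the virtio-scsi controller's num_queues follows the vCPU count, each queue wants its own ioeventfd, and each vCPU has its own KVM fd, so a -smp 384 guest gets close to the default RHEL soft limit of 1024 open files before block and character devices are even counted. A sketch of checking and raising the limit for a manual qemu-kvm run such as the reproducer above (libvirt or systemd deployments would configure this through their own settings instead):

```sh
# Sketch only; values are illustrative, not taken from the bug report.
ulimit -n              # current soft limit (1024 by default on RHEL)
ulimit -Hn             # hard limit

# Raise the soft limit in this shell, then launch the guest as before
ulimit -n 65536
./cmd.bak

# Inspect the limit of an already-running QEMU process
prlimit --nofile --pid "$(pidof qemu-kvm)"
```

Alternatively, the descriptor count can be reduced by explicitly capping the controller's queues (for example num_queues=4 on the virtio-scsi-pci device), at the cost of multiqueue scalability.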
(In reply to Xujun Ma from comment #0)
> Expected results:
> Boot up guest successfully.
> Additional info:
> Can solve this issue by increasing file limit,but there will be another
> problem:guest stop at "Trying to load: from:
> /pci@800000020000000/scsi@4/disk@100000000000000 ... Successfully loaded"
> about 3:18 minutes,I think it's not acceptable.and please help add friendly
> message if need more file limit.

There are three different problems described in the paragraph above:
1) The default configuration needs to be manually changed to run larger VMs.
2) A request to print a more friendly error message if the limit is too low.
3) A report that boot is stuck (or slow) after manually increasing the file descriptor limit.

The scope of this BZ needs to be clearly defined. Please clarify which of the problems above is being tracked by this BZ.

If I understand correctly, item #3 is being tracked at bug 1927108 and is out of the scope of this BZ.

(In reply to Eduardo Habkost from comment #12)
> The scope of this BZ needs to be clearly defined. Please clarify which of
> the problems above is being tracked by this BZ.
>
> If I understand correctly, item #3 is being tracked at bug 1927108 and is
> out of the scope of this BZ.

Yes, you are right. I think we should at least add a friendly error message so that users know how to handle this kind of situation.
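Purely to illustrate the kind of friendlier failure being requested here, a standalone sketch (not an actual or proposed QEMU patch; the message text and helper name are made up for the example):

```c
/* Illustration only: turn EMFILE (-24, "Too many open files") into an
 * actionable message instead of an abort. Not QEMU source. */
#include <errno.h>
#include <stdio.h>

static void report_notifier_failure(int err)
{
    fprintf(stderr, "virtio-scsi: failed to set host notifier (%d)\n", err);
    if (err == -EMFILE) {
        fprintf(stderr, "hint: the process hit its open file limit; raise it "
                        "(e.g. ulimit -n) or reduce the number of virtqueues\n");
    }
}

int main(void)
{
    report_notifier_failure(-EMFILE);
    return 0;
}
```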
(In reply to Xujun Ma from comment #13)
> Yes, you are right. I think we should at least add a friendly error message
> so that users know how to handle this kind of situation.

If this BZ is just about making the error message friendlier (#2), it is not a regression and priority/severity shouldn't be high.

Item #1 above could be tracked in a separate BZ, but I don't believe it is a bug (a new machine type requiring more resources to run is not a regression).

The following upstream change fixes the crash:

commit 6f1a5c37db5a6fc7c5c44b3e45cee6e33df31e9d
Author: Maxim Levitsky <mlevitsk>
Date:   Thu Dec 17 17:00:38 2020 +0200

    virtio-scsi: don't process IO on fenced dataplane

    If virtio_scsi_dataplane_start fails, there is a small window when it drops the
    aio lock (in aio_wait_bh_oneshot) and the dataplane's AIO handler can
    still run during that window.

    This is done after the dataplane was marked as fenced, thus we use this flag
    to avoid it doing any IO.

    Signed-off-by: Maxim Levitsky <mlevitsk>
    Message-Id: <20201217150040.906961-2-mlevitsk>
    Signed-off-by: Paolo Bonzini <pbonzini>

QEMU now falls back to running in a degraded (slower) mode instead.

This is indicated by the following warning:

virtio-scsi: Failed to set host notifier (-24)
qemu-system-ppc64: virtio_bus_start_ioeventfd: failed. Fallback to userspace (slower).

Unfortunately, the same warning is printed for each queue and floods the monitor. I'll post a patch for that.

(In reply to Greg Kurz from comment #15)
> Unfortunately, the same warning is printed for each queue and floods the
> monitor. I'll post a patch for that.

Reducing the flood isn't that trivial and it is just a nice-to-have. The above commit is enough to fix the current bug. Let's move forward.
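To make the effect of that commit concrete, here is a self-contained model of the idea in the same style as the earlier sketch (the fenced flag mirrors the commit message; this is a simplification, not the upstream diff, and the real handlers also deal with AioContext locking, omitted here):

```c
/* Simplified model of "virtio-scsi: don't process IO on fenced dataplane".
 * Not QEMU source: types and functions are stand-ins. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct scsi_dev {
    bool dataplane_started;
    bool dataplane_fenced;   /* set when dataplane startup failed */
};

static void handle_cmd(struct scsi_dev *s)
{
    /* The fix: a fenced dataplane ignores the spurious notification that
     * the rollback path generates instead of asserting dataplane_started. */
    if (s->dataplane_fenced) {
        return;
    }
    assert(s->dataplane_started);
    puts("request handled in dataplane");
}

static int dataplane_start(struct scsi_dev *s, bool notifier_setup_fails)
{
    if (notifier_setup_fails) {
        s->dataplane_fenced = true;
        handle_cmd(s);    /* rollback still kicks the handler; now a no-op */
        return -1;        /* caller falls back to userspace emulation */
    }
    s->dataplane_started = true;
    return 0;
}

int main(void)
{
    struct scsi_dev s = { 0 };
    dataplane_start(&s, true);   /* hitting the fd limit no longer aborts */
    return 0;
}
```

With the handler fenced, a failed dataplane start degrades into the "Fallback to userspace (slower)" path seen in the verification log below instead of an abort.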
Upstream fix already present in qemu-6.0. Marked as TestOnly and moved directly to ON_QA.

Boot up guest successfully with 384 vCPUs. The bug has been fixed in this build. Booting log:

Trying to load:  from: /pci@800000020000000/scsi@4/disk@100000000000000 ...
qemu-kvm: virtio_bus_set_host_notifier: unable to init event notifier: Too many open files (-24)
virtio-scsi: Failed to set host notifier (-24)
qemu-kvm: virtio_bus_start_ioeventfd: failed. Fallback to userspace (slower).
Successfully loaded
qemu-kvm: virtio_bus_set_host_notifier: unable to init event notifier: Too many open files (-24)
virtio-scsi: Failed to set host notifier (-24)
qemu-kvm: virtio_bus_start_ioeventfd: failed. Fallback to userspace (slower).
qemu-kvm: virtio_bus_set_host_notifier: unable to init event notifier: Too many open files (-24)
virtio-scsi: Failed to set host notifier (-24)
qemu-kvm: virtio_bus_start_ioeventfd: failed. Fallback to userspace (slower).

Based on the test result above, setting the bug to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4684