+++ This bug was initially created as a clone of Bug #1661940 +++

Description:

When libvirtd starts a qemu guest, it does it as follows:

(1) Create the qemu process paused (-S)
(2) Unpause the guest using the qemu monitor
(3) Query the qemu monitor for list of devices etc

(Bug 1661940 is about changing the ordering of events to make
qemu more robust, so don't complain that libvirt is doing it
wrong :-)

Unfortunately when creating a qemu aarch64 guest using the PL011
serial driver, between (2) and (3) a lot of boot messages can get
written to the serial console.  However nothing is listening on the
other end of the serial console socket at this point (because libvirt
hasn't returned to the caller process).

Because of the implementation of the PL011 driver, it doesn't just
throw away the output like a normal serial port would do.  Instead
there is a small race which can cause the qemu monitor to lock up,
thus causing a deadlock.  In particular commit 6ab3fc32ea64:

> diff --git a/hw/char/pl011.c b/hw/char/pl011.c
> index c0fbf8a87428..786e605fdd61 100644
> --- a/hw/char/pl011.c
> +++ b/hw/char/pl011.c
> @@ -146,7 +146,9 @@ static void pl011_write(void *opaque, hwaddr offset,
>          /* ??? Check if transmitter is enabled. */
>          ch = value;
>          if (s->chr)
> -            qemu_chr_fe_write(s->chr, &ch, 1);
> +            /* XXX this blocks entire thread. Rewrite to use
> +             * qemu_chr_fe_write and background I/O callbacks */
> +            qemu_chr_fe_write_all(s->chr, &ch, 1);
>          s->int_level |= PL011_INT_TX;
>          pl011_update(s);
>          break;

This is only a brief summary of the bug.  The longer details can be
found in the following comments:

https://bugzilla.redhat.com/show_bug.cgi?id=1661940#c42
https://bugzilla.redhat.com/show_bug.cgi?id=1661940#c43

An awkward reproducer which requires patching libvirt to slow it down
between steps (2) and (3) is:

https://bugzilla.redhat.com/show_bug.cgi?id=1661940#c44

Version:

qemu-kvm-3.1.0-3.module+el8+2638+e43dad09.aarch64
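To make the failure mode concrete, here is a minimal standalone sketch (not
QEMU source) of what a "retry until everything is written" helper such as
qemu_chr_fe_write_all() boils down to; write_all_blocking() is a hypothetical
name used only for illustration.  Once the peer stops draining the socket,
the kernel send buffer fills up and the loop parks the calling thread in
poll() indefinitely; since the vCPU thread handling the MMIO write holds the
global lock at that point, the monitor can no longer be serviced either.

/* Minimal sketch (not QEMU code) of a blocking "write it all" helper.
 * If nobody ever reads the other end of the socket, the kernel send
 * buffer fills up and this loop never returns. */
#include <errno.h>
#include <poll.h>
#include <unistd.h>

static ssize_t write_all_blocking(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n > 0) {
            done += n;
            continue;
        }
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            /* Send buffer full and nobody is reading: this poll() blocks
             * forever, and so does the thread that issued the UART write. */
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            poll(&pfd, 1, -1);
            continue;
        }
        return -1;  /* real error */
    }
    return done;
}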
(In reply to Richard W.M. Jones from comment #0)
> (Bug 1661940 is about changing the ordering of events to make
> qemu more robust, so don't complain that libvirt is doing it
> wrong :-)

I mean to make *libvirtd* more robust :-)
Setting this to 8.1, unless we consider it a blocker.
BTW, I'm wondering: the pl011 driver seems very old. Is there another UART driver we could use for arm?
Created attachment 1664266 [details] TCP server/client program
Created attachment 1664267 [details] qemu patch - enforce flooding on PL011 on startup
I think Richard already provided all the details. The issue seems to be
caused by flooding a TCP connection which isn't accepted on the server
side. I don't know how to build a private libvirtd, so the attached
"tcp.c" and patch are used to reproduce the issue.

First of all, run the binary "tcp", built from "tcp.c", as below. The
transmission fails and returns -EAGAIN on the client side after 431
consecutive transmissions. It means the client can send data even
though the connection isn't accepted on the server side. Besides, the
send buffer can be overrun quickly.

session1# ./tcp -s 50900
server: connected with client localhost (127.0.0.1)

session2# ./tcp -c 50900
line[0]: succeed, 1024
line[1]: succeed, 1024
line[2]: succeed, 1024
line[3]: succeed, 1024
   :
line[431]: succeed, 1024
line[432]: succeed, 673
line[433]: ret=-1, errno=11

Another experiment is to start a VM with a customized qemu, built from
upstream qemu plus the attached patch. It also proves that flooding the
PL011 can prevent the system from booting:

session1# ./tcp -s 50900

session2# ~/sandbox/src/qemu/qemu.main.aarch64/aarch64-softmmu/qemu-system-aarch64 \
          -machine virt,gic-version=3 -cpu max -m 4096 \
          -smp 2,sockets=2,cores=1,threads=1 \
          -monitor none -serial tcp:127.0.0.1:50900 -nographic -s \
          :
flood_uart: 0
flood_uart: 1
flood_uart: 2
flood_uart: 3
flood_uart: 4
flood_uart: 5
   :
flood_uart: 2573
flood_uart: 2574
<system stops here>

For this, a patch has been posted for review. I will do the downstream
backporting once it's finalized.

https://patchwork.kernel.org/patch/11393393/
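The attached tcp.c isn't reproduced here, but a minimal hypothetical sketch of
its client mode would look roughly like this: connect to the port, switch the
socket to non-blocking mode, and write 1024-byte chunks until the kernel
reports EAGAIN (errno 11) because nothing is draining the other end.

/* Hypothetical flooding client, sketching the behaviour described above:
 * writes succeed until the send buffer (plus the peer's receive buffer)
 * is full, then write() fails with EAGAIN. */
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int port = argc > 1 ? atoi(argv[1]) : 50900;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(port),
        .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
    };
    char buf[1024];

    memset(buf, 'x', sizeof(buf));
    if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }
    fcntl(fd, F_SETFL, O_NONBLOCK);

    for (int i = 0; ; i++) {
        ssize_t n = write(fd, buf, sizeof(buf));
        if (n < 0) {
            /* EAGAIN: both socket buffers are full, nobody is reading */
            printf("line[%d]: ret=-1, errno=%d\n", i, errno);
            break;
        }
        printf("line[%d]: succeed, %zd\n", i, n);
    }
    close(fd);
    return 0;
}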
The community chose not to fix this corner case, per the discussion at the
link below. Instead, the user should have the TCP connection accepted and
keep it polled for data on the server side. So I'm closing this as WONTFIX.

https://patchwork.kernel.org/patch/11393393/
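For reference, a minimal sketch of what that means on the consumer side is
below (hypothetical code, not libvirt; the port number just matches the
reproducer above): accept the console connection right away and keep
polling/reading it so the guest's serial output is always drained.

/* Hypothetical sketch of the "server" end of the serial chardev socket,
 * per the WONTFIX resolution: accept the connection promptly and keep
 * draining it so the guest's PL011 writes never back up inside
 * qemu_chr_fe_write_all(). */
#include <netinet/in.h>
#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int port = 50900;                       /* matches the reproducer above */
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_port = htons(port),
        .sin_addr.s_addr = htonl(INADDR_ANY),
    };
    int one = 1;

    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    if (lfd < 0 || bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, 1) < 0) {
        perror("bind/listen");
        return 1;
    }

    int fd = accept(lfd, NULL, NULL);       /* accept the console connection */
    if (fd < 0) {
        perror("accept");
        return 1;
    }

    for (;;) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        char buf[4096];

        if (poll(&pfd, 1, -1) < 0) {
            break;
        }
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n <= 0) {                       /* EOF or error: guest went away */
            break;
        }
        /* Consume (here: simply discard) the serial output so the guest's
         * transmit path never stalls. */
    }
    close(fd);
    close(lfd);
    return 0;
}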