Bug 1665063 - qemu aarch64 pl011 driver shouldn't use synchronous writes
Summary: qemu aarch64 pl011 driver shouldn't use synchronous writes
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.0
Hardware: aarch64
OS: Unspecified
urgent
high
Target Milestone: rc
: ---
Assignee: Guowen Shan
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: TRACKER-bugs-affecting-libguestfs 1677408
TreeView+ depends on / blocked
 
Reported: 2019-01-10 12:40 UTC by Richard W.M. Jones
Modified: 2020-03-26 20:14 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1661940
Environment:
Last Closed: 2020-03-01 22:32:40 UTC
Type: Bug
Target Upstream Version:


Attachments (Terms of Use)
TCP server/client program (4.19 KB, text/plain)
2020-02-20 06:09 UTC, Guowen Shan
no flags Details
qemu patch - enforce flooding on PL011 on startup (2.93 KB, patch)
2020-02-20 06:11 UTC, Guowen Shan
no flags Details | Diff

Description Richard W.M. Jones 2019-01-10 12:40:20 UTC
+++ This bug was initially created as a clone of Bug #1661940 +++

Description:

When libvirtd starts a qemu guest, it does it as follows:

(1) Create the qemu process paused (-S)
(2) Unpause the guest using the qemu monitor
(3) Query the qemu monitor for list of devices etc

(Bug 1661940 is about changing the ordering of events to make
qemu more robust, so don't complain that libvirt is doing it
wrong :-)

Unfortunately when creating a qemu aarch64 guest using the PL011
serial driver, between (2) and (3) a lot of boot messages can
get written to the serial console.  However nothing is listening
on the other end of the serial console socket at this point (because
libvirt hasn't returned to the caller process).  Because of the
implementation of the PL011 driver, it doesn't just throw away
the output like a normal serial port would do.  Instead there is
a small race which can causes the qemu monitor to lock up, thus
causing a deadlock.

In particular commit 6ab3fc32ea64:

> diff --git a/hw/char/pl011.c b/hw/char/pl011.c
> index c0fbf8a87428..786e605fdd61 100644
> --- a/hw/char/pl011.c
> +++ b/hw/char/pl011.c
> @@ -146,7 +146,9 @@ static void pl011_write(void *opaque, hwaddr offset,
>          /* ??? Check if transmitter is enabled.  */
>          ch = value;
>          if (s->chr)
> -            qemu_chr_fe_write(s->chr, &ch, 1);
> +            /* XXX this blocks entire thread. Rewrite to use
> +             * qemu_chr_fe_write and background I/O callbacks */
> +            qemu_chr_fe_write_all(s->chr, &ch, 1);
>          s->int_level |= PL011_INT_TX;
>          pl011_update(s);
>          break;

This is only a brief summary of the bug.  The longer details can be
found in the following comments:

https://bugzilla.redhat.com/show_bug.cgi?id=1661940#c42
https://bugzilla.redhat.com/show_bug.cgi?id=1661940#c43

An awkward reproducer which requires patching libvirt to
slow it down between steps (2) and (3) is:

https://bugzilla.redhat.com/show_bug.cgi?id=1661940#c44

Version:

qemu-kvm-3.1.0-3.module+el8+2638+e43dad09.aarch64

Comment 1 Richard W.M. Jones 2019-01-10 12:41:21 UTC
(In reply to Richard W.M. Jones from comment #0)
> (Bug 1661940 is about changing the ordering of events to make
> qemu more robust, so don't complain that libvirt is doing it
> wrong :-)

I mean to make *libvirtd* more robust :-)

Comment 2 Luiz Capitulino 2019-01-10 19:00:49 UTC
Setting this to 8.1, unless we consider it a blocker.

Comment 3 Luiz Capitulino 2019-01-10 19:18:23 UTC
btw, I'm wondering, the pl011 driver seems very old. Is there another uart driver we could use for arm?

Comment 6 Ademar Reis 2020-02-05 22:53:50 UTC
QEMU has been recently split into sub-components and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks

Comment 7 Guowen Shan 2020-02-20 06:09:38 UTC
Created attachment 1664266 [details]
TCP server/client program

Comment 8 Guowen Shan 2020-02-20 06:11:06 UTC
Created attachment 1664267 [details]
qemu patch - enforce flooding on PL011 on startup

Comment 9 Guowen Shan 2020-02-20 06:25:43 UTC
I think Richard already provided all the details. The issue seems caused by flooding on TCP connection, which isn't accepted on server side. I don't know how to build a private libvirtd, so the attached "tcp.c" and patch are used to reproduce the issue.

First of all, run binary "tcp", which is build from "tcp.c" as below. The transmission fails and returns -EAGAIN on client side after continuous 431 transmissons. It means the client can send data even the connection isn't accepted on server side. Besides, the sending buffer can be overruned quickly.

   session1# ./tcp -s 50900
   server: connected with client localhost (127.0.0.1)
   session2# ./tcp -c 50900
   line[0]: succeed, 1024
   line[1]: succeed, 1024
   line[2]: succeed, 1024
   line[3]: succeed, 1024
      :
   line[431]: succeed, 1024
   line[432]: succeed, 673
   line[433]: ret=-1, errno=11

Another experiment is start a VM with customized qemu, which is built from upstream qemu and the the attached patch. It also proves the flooding on PL011 can prevent the system from booting:

   session1# ./tcp -s 50900
   session2# ~/sandbox/src/qemu/qemu.main.aarch64/aarch64-softmmu/qemu-system-aarch64  \
             -machine virt,gic-version=3 -cpu max -m 4096                              \
             -smp 2,sockets=2,cores=1,threads=1                                        \
             -monitor none -serial tcp:127.0.0.1:50900 -nographic -s                   \
               :
            flood_uart: 0
            flood_uart: 1
            flood_uart: 2
            flood_uart: 3
            flood_uart: 4
            flood_uart: 5
               :
            flood_uart: 2573
            flood_uart: 2574
            <system stops here>

For this, a patch has been posted for review. I will do downstream backporting once it's finalized.

   https://patchwork.kernel.org/patch/11393393/

Comment 10 Guowen Shan 2020-03-01 22:32:17 UTC
The community chooses not to fix the corner case according to the discussion as below link shows.
Instead, the user should have the TCP connection accepted and kept being polled for data on the
server side. So I'm closing this with "WONTFIX".

   https://patchwork.kernel.org/patch/11393393/


Note You need to log in before you can comment on or make changes to this bug.