Bug 1661940 - libvirtd should be made robust against qemu serial console causing qemu monitor to lock up after guest is unpaused
Summary: libvirtd should be made robust against qemu serial console causing qemu monit...
Keywords:
Status: ON_QA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.0
Hardware: aarch64
OS: Unspecified
medium
high
Target Milestone: rc
: ---
Assignee: Andrea Bolognani
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: TRACKER-bugs-affecting-libguestfs
TreeView+ depends on / blocked
 
Reported: 2018-12-24 16:43 UTC by Danilo Cesar de Paula
Modified: 2019-12-10 15:58 UTC (History)
17 users (show)

Fixed In Version: libvirt-5.3.0-1.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1665063 (view as bug list)
Environment:
Last Closed:
Type: Bug
Target Upstream Version:


Attachments (Terms of Use)
root.log for package versions etc (207.56 KB, text/plain)
2018-12-25 07:49 UTC, Richard W.M. Jones
no flags Details
build.log (3.41 MB, text/plain)
2018-12-25 07:50 UTC, Richard W.M. Jones
no flags Details
build.log with libvirtd debugging and libvirtd stack trace (114.36 KB, application/x-xz)
2019-01-09 20:09 UTC, Richard W.M. Jones
no flags Details

Description Danilo Cesar de Paula 2018-12-24 16:43:08 UTC
When scratch-building libguestfs it failed to build for aarch64.
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=19612808

I talked with Richard Jones and he mentioned that it's an actual bug in kernel/qemu/AAVMF: this kernel doesn't boot on qemu on this architecture.
Unfortunately there's nothing useful in the logs so probably someone will need to fire up a RHEL 8 instance in Beaker to try to diagnose the issue.

Comment 1 Richard W.M. Jones 2018-12-25 07:49:03 UTC
kernel 4.18.0-57.el8
qemu-kvm 15:2.12.0-50.module+el8+2596+0a642e54

Comment 2 Richard W.M. Jones 2018-12-25 07:49:43 UTC
Created attachment 1516625 [details]
root.log for package versions etc

Comment 3 Richard W.M. Jones 2018-12-25 07:50:40 UTC
Created attachment 1516626 [details]
build.log

Not much in the build log because it silently fails, so it's
most likely actually an AAVMF thing.  Attaching it anyway.

Comment 4 Andrew Jones 2018-12-27 09:03:57 UTC
Based on the hw_info.log it appears this is on an HP Moonshot. I just tried to provision a couple of them with rhel8, but the machines can't get IPMI connections. I wish we would stop using these unsupported machines. Lately they seem to be more trouble than they're worth.

Comment 5 Luiz Capitulino 2019-01-04 15:55:53 UTC
As per internal discussion on 8Dev, my understanding is that we're trying to determine whether
this only happens on HP Moonshot? And if so, the solution will be to stop using those machines
for build?

I assume brew has other machines than HP Moonshots, right? In this case, what happens if we
keep trying a scratch build until brew picks a different machine?

Comment 8 Danilo Cesar de Paula 2019-01-04 17:14:40 UTC
(In reply to Luiz Capitulino from comment #5)
> As per internal discussion on 8Dev, my understanding is that we're trying to
> determine whether
> this only happens on HP Moonshot? And if so, the solution will be to stop
> using those machines
> for build?
> 
> I assume brew has other machines than HP Moonshots, right? In this case,
> what happens if we
> keep trying a scratch build until brew picks a different machine?

This is a no go.
This affects regular builds, not only scratch builds. This also happen when building modules.
Building a module can take a couple of hours, and all packages needs to build successfully.
If one package fails to build, the whole module build is rejected. It prevents us to release other packages.

Comment 9 Luiz Capitulino 2019-01-04 21:10:34 UTC
(In reply to Danilo Cesar de Paula from comment #8)
> (In reply to Luiz Capitulino from comment #5)
> > As per internal discussion on 8Dev, my understanding is that we're trying to
> > determine whether
> > this only happens on HP Moonshot? And if so, the solution will be to stop
> > using those machines
> > for build?
> > 
> > I assume brew has other machines than HP Moonshots, right? In this case,
> > what happens if we
> > keep trying a scratch build until brew picks a different machine?
> 
> This is a no go.

Oh, I didn't suggested this as a solution. It was just one way to confirm that
problem only happens on HP Moonshot.

Comment 11 Wei Huang 2019-01-07 22:43:51 UTC
I was able to provision hp-moonshot-03-c36.lab.eng.rdu.redhat.com with the latest RHEL8. I didn't see any problems in booting the bare-metal and installing a RHEL8 guest VM on it. So maybe the problem is brew server specific?

PS: serial console access indeed has troubles. But it only happens on RDU servers, not on Boston Moonshot servers).

-Wei

Comment 12 Richard W.M. Jones 2019-01-08 10:42:08 UTC
I am also _unable_ to reproduce this on HP Moonshot.

Comment 13 Luiz Capitulino 2019-01-08 13:52:11 UTC
Guys,

Is it possible that there's something that's specific to certain HP Moonshots that
could be causing this? Say, a chipset or something? Do you think it is worth
spending time debugging this?

Comment 14 Luiz Capitulino 2019-01-08 13:54:49 UTC
Danilo,

Have you tried doing the scratch build again, maybe until brew picks a different
machine and see if that works?

Note that I'm suggesting this as a solution, but just to help showing the problem
is someway specific that system.

Comment 16 Richard W.M. Jones 2019-01-08 17:29:12 UTC
Latest build failed again, in the same way, and only on aarch64:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=19703675

I'm not sure how to tell from the hwinfo what hardware that is.  Although I'm not
sure this is really hardware related at all anyway.

Comment 17 Luiz Capitulino 2019-01-08 17:45:55 UTC
(In reply to Richard W.M. Jones from comment #16)
> Latest build failed again, in the same way, and only on aarch64:
> 
> https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=19703675
> 
> I'm not sure how to tell from the hwinfo what hardware that is.  Although
> I'm not
> sure this is really hardware related at all anyway.

Another important detail, libguestfs test-suite is using tcg. Is it at all
possible to force it to use kvm to see what happens?

But in any case, we have to unblock the build. I think we're going to that
hack we talked about on the last meeting: skip make check for arm until
we figure what's wrong.

Do you think we can do it?

Comment 18 Richard W.M. Jones 2019-01-08 18:50:37 UTC
The problem is we'll build something that we know doesn't work.  Might as well
ExcludeArch aarch64 in that case.  The test is there for a reason and the fact
that the kernel doesn't boot on qemu is some problem with the kernel or qemu
(and unlikely to be anything to do with libguestfs).

I'm currenty attempting to diagnose this further, but my efforts are
stymied by being unable to do scratch builds at all because of a serious of
interlinking brokenness in our tooling.

Comment 19 Luiz Capitulino 2019-01-08 19:33:05 UTC
(In reply to Richard W.M. Jones from comment #18)
> The problem is we'll build something that we know doesn't work.  Might as
> well
> ExcludeArch aarch64 in that case.  The test is there for a reason and the
> fact
> that the kernel doesn't boot on qemu is some problem with the kernel or qemu
> (and unlikely to be anything to do with libguestfs).

OK, makes sense. But what about trying with kvm? It seems to be using tcg,
so is it possible it's a tcg-related issue?

Btw, when you tried to reproduce on comment 12 did you use tcg as well?
 
> I'm currenty attempting to diagnose this further, but my efforts are
> stymied by being unable to do scratch builds at all because of a serious of
> interlinking brokenness in our tooling.

Thanks a lot, that's going to be very helpful.

Comment 20 Richard W.M. Jones 2019-01-08 19:33:22 UTC
First of all I can very reliably reproduce this in Koji (not elsewhere).

I temporarily changed the backend from "libvirt" to "direct"[1].  The test
now succeeds, which suggests that it's something to do with the qemu command line
which libvirt constructs.

The command lines.  Of course exactly what the important difference is, that's
very hard indeed to say.

libvirt (fails):

LC_ALL=C LD_LIBRARY_PATH=/builddir/build/BUILD/libguestfs-1.38.4/ruby/ext/guestfs:/builddir/build/BUILD/libguestfs-1.38.4/lib/.libs:/builddir/build/BUILD/libguestfs-1.38.4/java/.libs:/builddir/build/BUILD/libguestfs-1.38.4/gobject/.libs PATH=/builddir/build/BUILD/libguestfs-1.38.4/v2v:/builddir/build/BUILD/libguestfs-1.38.4/tools:/builddir/build/BUILD/libguestfs-1.38.4/test-tool:/builddir/build/BUILD/libguestfs-1.38.4/sysprep:/builddir/build/BUILD/libguestfs-1.38.4/sparsify:/builddir/build/BUILD/libguestfs-1.38.4/resize:/builddir/build/BUILD/libguestfs-1.38.4/rescue:/builddir/build/BUILD/libguestfs-1.38.4/p2v:/builddir/build/BUILD/libguestfs-1.38.4/make-fs:/builddir/build/BUILD/libguestfs-1.38.4/inspector:/builddir/build/BUILD/libguestfs-1.38.4/get-kernel:/builddir/build/BUILD/libguestfs-1.38.4/fuse:/builddir/build/BUILD/libguestfs-1.38.4/format:/builddir/build/BUILD/libguestfs-1.38.4/fish:/builddir/build/BUILD/libguestfs-1.38.4/erlang:/builddir/build/BUILD/libguestfs-1.38.4/edit:/builddir/build/BUILD/libguestfs-1.38.4/diff:/builddir/build/BUILD/libguestfs-1.38.4/dib:/builddir/build/BUILD/libguestfs-1.38.4/df:/builddir/build/BUILD/libguestfs-1.38.4/customize:/builddir/build/BUILD/libguestfs-1.38.4/cat:/builddir/build/BUILD/libguestfs-1.38.4/builder:/builddir/build/BUILD/libguestfs-1.38.4/align:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/sbin HOME=/builddir USER=mockbuild LOGNAME=mockbuild QEMU_AUDIO_DRV=none TMPDIR=/builddir/build/BUILD/libguestfs-1.38.4/tmp /usr/libexec/qemu-kvm -name guest=guestfs-6e9ikzf2eardkjpv,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/builddir/.config/libvirt/qemu/lib/domain-1-guestfs-6e9ikzf2eard/master-key.aes -machine virt-rhel7.6.0,accel=tcg,usb=off,dump-guest-core=off,gic-version=2 -cpu cortex-a57 -drive file=/usr/share/AAVMF/AAVMF_CODE.verbose.fd,if=pflash,format=raw,unit=0,readonly=on -drive file=/builddir/build/BUILD/libguestfs-1.38.4/tmp/libguestfsYNOkt3/vars2.fd,if=pflash,format=raw,unit=1 -m 768 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 78f54c12-4f91-4b7d-aedb-5b7d8e3edd47 -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=20,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -no-reboot -no-acpi -boot strict=on -kernel /builddir/build/BUILD/libguestfs-1.38.4/tmp/.guestfs-996/appliance.d/kernel -initrd /builddir/build/BUILD/libguestfs-1.38.4/tmp/.guestfs-996/appliance.d/initrd -append 'panic=1 console=ttyAMA0 earlyprintk=pl011,0x9000000 ignore_loglevel efi-rtc=noprobe edd=off udevtimeout=6000 udev.event-timeout=6000 no_timer_check printk.time=1 cgroup_disable=memory usbcore.nousb cryptomgr.notests tsc=reliable 8250.nr_uarts=1 root=/dev/sdb selinux=0 guestfs_verbose=1 TERM=vt100' -device pcie-root-port,port=0x8,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x1 -device pcie-root-port,port=0x9,chassis=2,id=pci.2,bus=pcie.0,addr=0x1.0x1 -device pcie-root-port,port=0xa,chassis=3,id=pci.3,bus=pcie.0,addr=0x1.0x2 -device pcie-root-port,port=0xb,chassis=4,id=pci.4,bus=pcie.0,addr=0x1.0x3 -device virtio-scsi-pci,id=scsi0,bus=pci.1,addr=0x0 -device virtio-serial-pci,id=virtio-serial0,bus=pci.2,addr=0x0 -drive file=/builddir/build/BUILD/libguestfs-1.38.4/tmp/libguestfsYNOkt3/scratch1.img,format=raw,if=none,id=drive-scsi0-0-0-0,cache=unsafe -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1,write-cache=on -drive file=/builddir/build/BUILD/libguestfs-1.38.4/tmp/libguestfsYNOkt3/overlay3.qcow2,format=qcow2,if=none,id=drive-scsi0-0-1-0,cache=unsafe -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=1,lun=0,drive=drive-scsi0-0-1-0,id=scsi0-0-1-0,write-cache=on -chardev socket,id=charserial0,path=/tmp/libguestfsgP02RS/console.sock -serial chardev:charserial0 -chardev socket,id=charchannel0,path=/tmp/libguestfsgP02RS/guestfsd.sock -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.libguestfs.channel.0 -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.3,addr=0x0 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on

direct (succeeds):

/usr/libexec/qemu-kvm \
    -global virtio-blk-pci.scsi=off \
    -enable-fips \
    -nodefaults \
    -display none \
    -machine virt,accel=kvm:tcg \
    -cpu cortex-a57 \
    -m 768 \
    -no-reboot \
    -rtc driftfix=slew \
    -drive if=pflash,format=raw,file=/usr/share/AAVMF/AAVMF_CODE.verbose.fd,readonly \
    -drive if=pflash,format=raw,file=/builddir/build/BUILD/libguestfs-1.38.4/tmp/libguestfs6AbJjz/vars3.fd \
    -kernel /builddir/build/BUILD/libguestfs-1.38.4/tmp/.guestfs-996/appliance.d/kernel \
    -initrd /builddir/build/BUILD/libguestfs-1.38.4/tmp/.guestfs-996/appliance.d/initrd \
    -object rng-random,filename=/dev/urandom,id=rng0 \
    -device virtio-rng-pci,rng=rng0 \
    -device virtio-scsi-pci,id=scsi \
    -drive file=/builddir/build/BUILD/libguestfs-1.38.4/tmp/libguestfs6AbJjz/scratch1.img,cache=unsafe,format=raw,id=hd0,if=none \
    -device scsi-hd,drive=hd0 \
    -drive file=/builddir/build/BUILD/libguestfs-1.38.4/tmp/.guestfs-996/appliance.d/root,snapshot=on,id=appliance,cache=unsafe,if=none,format=raw \
    -device scsi-hd,drive=appliance \
    -device virtio-serial-pci \
    -serial stdio \
    -chardev socket,path=/tmp/libguestfsUhWnPW/guestfsd.sock,id=channel0 \
    -device virtserialport,chardev=channel0,name=org.libguestfs.channel.0 \
    -append "panic=1 console=ttyAMA0 earlyprintk=pl011,,0x9000000 ignore_loglevel efi-rtc=noprobe edd=off udevtimeout=6000 udev.event-timeout=6000 no_timer_check printk.time=1 cgroup_disable=memory usbcore.nousb cryptomgr.notests tsc=reliable 8250.nr_uarts=1 root=/dev/sdb selinux=0 guestfs_verbose=1 TERM=vt100"

[1] http://libguestfs.org/guestfs.3.html#backend

Comment 21 Richard W.M. Jones 2019-01-08 19:56:50 UTC
(In reply to Luiz Capitulino from comment #19)
> OK, makes sense. But what about trying with kvm? It seems to be using tcg,
> so is it possible it's a tcg-related issue?

The problem is KVM isn't available it'll use TCG, but we can't try KVM to
see if that would make a difference.

In comment 20 above it's using TCG in both cases, and in one case it *works*,
which IMHO excludes TCG as a cause.

> Btw, when you tried to reproduce on comment 12 did you use tcg as well?

Yes I've tried all sorts:

Running in L0 with L1 KVM appliance
Running in L0 with L1 TCG appliance
Running in L1 with L2 TCG appliance

and I've not been able to reproduce it with Moonshot at all under any scenario.

Since this is not using KVM I doubt hardware makes any difference.

Comment 22 Luiz Capitulino 2019-01-08 20:04:04 UTC
(In reply to Richard W.M. Jones from comment #21)
> (In reply to Luiz Capitulino from comment #19)
> > OK, makes sense. But what about trying with kvm? It seems to be using tcg,
> > so is it possible it's a tcg-related issue?
> 
> The problem is KVM isn't available it'll use TCG, but we can't try KVM to
> see if that would make a difference.
> 
> In comment 20 above it's using TCG in both cases, and in one case it *works*,
> which IMHO excludes TCG as a cause.

Yeah, I agree. Your debugging work on comment 20 supersedes my question :)

So, what are we down to now? I can think of the following:

1. Some "environmental" issue relating to libvirt, ie. some lib that libvirt
needs is unavailable

2. libvirt is passing some bad command-line to qemu

3. libvirt is using some command-line option that in turn is triggering
a bug in qemu

Comment 23 Richard W.M. Jones 2019-01-08 20:14:14 UTC
Here is the successful build using LIBGUESTFS_BACKEND=direct
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=19712035

(In reply to Luiz Capitulino from comment #22)
> 1. Some "environmental" issue relating to libvirt, ie. some lib that libvirt
> needs is unavailable

We know that libvirt does run qemu, so that would seem to exclude a
missing component.

We know that no messages are seen from qemu in the failure case.  Normally
we'd expect to see:

add-symbol-file /builddir/build/BUILD/edk2-ee3198e672e2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/ArmPlatformPkg/PrePeiCore/PrePeiCoreUniCore/DEBUG/ArmPlatformPrePeiCore.dll 0x1800
add-symbol-file /builddir/build/BUILD/edk2-ee3198e672e2/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Core/Pei/PeiMain/DEBUG/PeiCore.dll 0x7980
UEFI firmware starting.

which are messages from AAVMF, but we do not see those (or else the console
is broken somehow, but that seems unlikely).

> 2. libvirt is passing some bad command-line to qemu

This seems the strongest possibility.

> 3. libvirt is using some command-line option that in turn is triggering
> a bug in qemu

Indeed, something like -sandbox would be a strong contender here.

Comment 24 Luiz Capitulino 2019-01-08 21:07:33 UTC
Based on the evidence that we have so far (comment 20 and comment 23), this looks
like a libvirt issue. Re-assigning. Please, re-assign back if you find evidence
of the contrary.

Comment 25 Andrew Jones 2019-01-09 10:34:36 UTC
The libvirt command line on comment 20 has -no-acpi. RHEL kernels require ACPI.

Everything else in the command line looks sane, except for some of the kernel command line parameters in -append (they're obviously for x86, not aarch64), but those are the same in the direct command line, so cannot be the culprit.

Comment 26 Richard W.M. Jones 2019-01-09 10:38:25 UTC
Yes the -no-acpi thing worried me too.  However when I ran the same test on our HP Moonshot,
I noticed that libvirt is also using -no-acpi there, and it's the same AAVMF and kernel, and
it works there.  (This bug would be a lot easier to solve if I could reproduce it!)

Comment 27 Andrea Bolognani 2019-01-09 11:33:19 UTC
(In reply to Andrew Jones from comment #25)
> The libvirt command line on comment 20 has -no-acpi. RHEL kernels require
> ACPI.

I just tried with a RHEL 8 guest I had lying around, so it might not
be the latest and greatest, but it booted just fine after I removed
the <acpi/> element from the corresponding guest XML... Looks like
we're not really requiring ACPI after all?

Either way, AAVMF and GRUB should come up regardless of what the
kernel requirements are, and that's not the case here.

I'm trying to make sense of this but I don't have any specific
insights yet. I'll update the bug as I find out more.

Comment 30 Laszlo Ersek 2019-01-09 12:46:27 UTC
I've compared the command lines from comment 20.

(01) There's a difference in the kernel command lines ("-append") option.
     While I think this difference is irrelevant, it's worth pointing out in
     general.

     libvirt: "earlyprintk=pl011,0x9000000"
     direct:  "earlyprintk=pl011,,0x9000000"

(02) The libvirt backend starts the domain with "-S" (i.e., paused). If
     libguestfs failed to unpause the domain for whatever reason, that would
     be consistent with the symptom, I believe.

(03) The libvirt backend specifies "-boot strict=on". This should make no
     difference, as it only generates a HALT entry at the end of the
     "bootorder" fw_cfg file, and AAVMF simply ignores that (AAVMF behaves
     as if "-boot strict=on" were present at all times, anyway). Also, this
     difference in fw_cfg would be encountered much later in the boot
     process.

(04) Libvirt specifies "-uuid". Most likely irrelevant.

(05) Libvirt specifies "-sandbox", as highlighted above. Could be relevant.

(06) The -machine options are quite different:

     (a) libvirt: "virt-rhel7.6.0", direct: "virt". These are identical in
         practice.

     (b) libvirt adds "usb=off,dump-guest-core=off". Should be irrelevant.

     (c) libvirt specifies "gic-version=2". This is the default, so it's
         irrelevant.

     (d) The libvirt backend forces "accel=tcg", while the QEMU backend
         prefers KVM ("accel=kvm:tcg"). In comment 21, Rich wrote, "In
         comment 20 above it's using TCG in both cases" -- how do we know
         that?

(07) The qemu backend specifies "-enable-fips". While I think it should not
     matter in this case, I do find it pretty interesting -- I'm sure it was
     added manually for a particular reason, and I wonder why it's OK to not
     do the same through the libvirt backend.

(08) The qemu backend specifies "-global virtio-blk-pci.scsi=off".
     Irrelevant, neither backend uses any virtio-blk-pci devices.

(09) The libvirt backend specifies "-msg timestamp=on". Irrelevant (and "on"
     seems to be the default anyway, according to the QEMU manual).

(10) The libvirt backend sets the guest name, the host-side process name,
     and the host-side thread names, with "-name ...". Irrelevant.

(11) Libvirt sets "-no-user-config". I don't think it makes a difference in
     the Brew build environment.

(12) The libvirt backend specifie "-object secret,id=masterKey0,...". This
     should not matter, as "masterKey0" is not referenced anywhere on the
     command line.

(13) The libvirt backend generates a PCIe hierarchy with the virtio devices
     plugged into PCIe root ports. The qemu backend places the virtio
     devices directly on the root complex. As a result, the libvirt
     backend's domain will have modern-only virtio devices, while the qemu
     backend's guest will see transitional virtio devices. This is certainly
     guest-visible, but should not matter until much later into the boot
     process.

(14) The SCSI hard disk devices are specified with more details by libvirt.
     "write-cache" is enabled (which I *think* should be the default
     anyway), plus "bootindex=1" is set explicitly. Even if it mattered
     eventually, it shouldn't matter early in the boot process.

(15) There's a difference in how the "appliance" (boot?) disk's backend is
     configured. QEMU uses a raw disk image (but with "snapshot=on"???),
     whereas libvirt uses a qcow2 image which seems to be based on another
     backing file. However... none of this should matter for the early boot
     hang.

(16) The libvirt backend spells out "base=utc" for "-rtc", but that's the
     default anyway. Not relevant.

(17) The libvirt backend says "-no-acpi". I think that's a grave divergence
     from the qemu backend (it is extremely guest-visible), so I think it
     would be worth reconciling with the QEMU backend. However, it doesn't
     explain the *early* boot hang.

(18) The libvirt backend says "-realtime mlock=off". The qemu documentation
     is extremely confusing on this, but the source code helps. Namely,
     "mlock=on" is the default *IF* a bare-bones "-realtime" option is
     passed. However, if "-realtime" itself is not specified, then qemu &
     the guest RAM are *not* mlocked by default. Therefore, "-realtime
     mlock=off" only re-states the default, and there is no difference with
     the qemu backend in this regard.

(19) While it's not immediately obvious, the "guestfsd.sock"-backed virtio
     serial *port* is set up identically between the QEMU and libvirt
     backends.

(20) The libvirt backend uses the "console.sock" unix domain socket to
     expose the guest's PL011 serial port to the host. The QEMU backend uses
     QEMU's stin/stdout for the same. Given that the symptom we're seeing is
     tied to the (emulated PL011, not virtio) serial console, I think this
     difference could be relevant.

(21) The QEMU backend doesn't set up a monitor connection at all. The
     libvirt backend does. It uses "fd=20", so I think this is simply
     libvirtd's internal monitor connection. I think this monitor connection
     is minimally used to unpause the guest (see point (02)), although I'm
     unsure how exactly libguestfs asks libvirtd to do that. Still, on the
     same note as (02), it remains a difference that the QEMU backend lets
     the guest run at once, while with libvirtd, a separate resume action is
     needed.

I guess we could start adapting the direct backend's command line,
gradually, to see where the issue is triggered first. With reference to:
- (02)/(21) -- start paused (although this could require an entirely new
               communication pattern between libguestfs and QEMU),
- (05)      -- sandbox the emulator,
- (06)(d)   -- force TCG,
- (20)      -- expose PL011 via a unix domain socket.

Comment 31 Laszlo Ersek 2019-01-09 12:48:46 UTC
(Sorry, I started writing what ended up as comment 30 before seeing Drew's comment 25 and the subsequent comments. That's the only reason I covered those same observations myself -- I didn't mean to ignore the earlier findings.)

Comment 32 Richard W.M. Jones 2019-01-09 14:05:09 UTC
(In reply to Laszlo Ersek from comment #30)
> I've compared the command lines from comment 20.
> 
> (01) There's a difference in the kernel command lines ("-append") option.
>      While I think this difference is irrelevant, it's worth pointing out in
>      general.
> 
>      libvirt: "earlyprintk=pl011,0x9000000"
>      direct:  "earlyprintk=pl011,,0x9000000"

What's happening here is libguestfs is attempting to quote commas in the -append
argument by doubling them, which looks like it's a bug.  Filed it:

https://bugzilla.redhat.com/show_bug.cgi?id=1664711

> (02) The libvirt backend starts the domain with "-S" (i.e., paused). If
>      libguestfs failed to unpause the domain for whatever reason, that would
>      be consistent with the symptom, I believe.

libvirt should be unpausing the domain.  There's some reason why libvirt starts
it paused which I forget now, but it's all hidden inside libvirt.

>      (d) The libvirt backend forces "accel=tcg", while the QEMU backend
>          prefers KVM ("accel=kvm:tcg"). In comment 21, Rich wrote, "In
>          comment 20 above it's using TCG in both cases" -- how do we know
>          that?

We know that because the first output from qemu in direct mode is:

Could not access KVM kernel module: No such file or directory
qemu-kvm: failed to initialize KVM: No such file or directory
qemu-kvm: Back to tcg accelerator

(followed by AAVMF messages).

> (07) The qemu backend specifies "-enable-fips". While I think it should not
>      matter in this case, I do find it pretty interesting -- I'm sure it was
>      added manually for a particular reason, and I wonder why it's OK to not
>      do the same through the libvirt backend.

Yes this is interesting.  Libvirt used to add it, apparently it doesn't
add it any more.  The reason we added it is so that qemu "does the right thing"
when FIPS mode is enabled on the host, but note that the option doesn't
actually "enable FIPS" so it is misnamed.  See fips_enabled in util/osdep.c
for the logic.  Oh and by the way, FIPS is _only_ used in qemu upstream to
prevent you from using password auth in VNC, and nothing else.

> (11) Libvirt sets "-no-user-config". I don't think it makes a difference in
>      the Brew build environment.

Yes I noticed this too - direct mode should set it as well so it's a bug, but
not one that should affect us here.

> (20) The libvirt backend uses the "console.sock" unix domain socket to
>      expose the guest's PL011 serial port to the host. The QEMU backend uses
>      QEMU's stin/stdout for the same. Given that the symptom we're seeing is
>      tied to the (emulated PL011, not virtio) serial console, I think this
>      difference could be relevant.

So my belief is that it's not a serial console failure.  Although not seeing
any messages would be an obvious symptom of this, the serial console is only
used for log messages and nothing important.  There is a second connection
which is used to pass messages to the guestfsd daemon (inside the appliance)
and the library outside, and the test could still succeed as long as this
second connection is established and working.  However we don't see an expected
debug message from libguestfs when it successfully handshakes with the daemon.

(However this doesn't discount the possibility that the serial console fails
and then later something else fails and the error is not seen because the
serial console is broken)

> I guess we could start adapting the direct backend's command line,
> gradually, to see where the issue is triggered first. With reference to:
> - (02)/(21) -- start paused (although this could require an entirely new
>                communication pattern between libguestfs and QEMU),
> - (05)      -- sandbox the emulator,
> - (06)(d)   -- force TCG,
> - (20)      -- expose PL011 via a unix domain socket.

I'll have a look at some of these today, thanks for your detailed analysis.

Comment 34 Richard W.M. Jones 2019-01-09 16:19:49 UTC
As a test, I added the following extra flags to the direct command line, but it still
did _not_ fail (in brew), so I guess we can conclude that the problem is NOT related
to any of these.

    -machine virt,accel=kvm:tcg,usb=off,dump-guest-core=off,gic-version=2 \
    -no-user-config \
    -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
    -no-acpi \

Comment 35 Andrea Bolognani 2019-01-09 16:24:38 UTC
(In reply to Richard W.M. Jones from comment #32)
> (In reply to Laszlo Ersek from comment #30)
> >      (d) The libvirt backend forces "accel=tcg", while the QEMU backend
> >          prefers KVM ("accel=kvm:tcg"). In comment 21, Rich wrote, "In
> >          comment 20 above it's using TCG in both cases" -- how do we know
> >          that?
> 
> We know that because the first output from qemu in direct mode is:
> 
> Could not access KVM kernel module: No such file or directory
> qemu-kvm: failed to initialize KVM: No such file or directory
> qemu-kvm: Back to tcg accelerator
> 
> (followed by AAVMF messages).

Note that libvirt correctly detects and reports that KVM support
is not available on the host:

  <capabilities>
    ...
    <guest>
      <os_type>hvm</os_type>
      <arch name='aarch64'>
        <wordsize>64</wordsize>
        <emulator>/usr/libexec/qemu-kvm</emulator>
        <machine maxCpus='255'>virt-rhel7.6.0</machine>
        <machine canonical='virt-rhel7.6.0' maxCpus='255'>virt</machine>
        <domain type='qemu'/>
      </arch>
      ...
    </guest>
  </capabilities>

Perhaps that's because brew builds happen inside virtual machines
already? That wouldn't surprise me.

In any case, making sure the test suite requests the same
accelerator both when libvirt is used and when it's not sounds
like a good idea.

Comment 36 Richard W.M. Jones 2019-01-09 16:43:06 UTC
We recently fixed this properly upstream (bug 1605071), but the fix hasn't
been included in RHEL 8 yet.  In any case I don't believe it's related to
this bug.

Comment 37 Andrea Bolognani 2019-01-09 17:02:07 UTC
(In reply to Richard W.M. Jones from comment #36)
> We recently fixed this properly upstream (bug 1605071), but the fix hasn't
> been included in RHEL 8 yet.  In any case I don't believe it's related to
> this bug.

Yeah, I just thought I'd point that out - and I had forgotten all
about Bug 1605071 :)

Comment 38 Richard W.M. Jones 2019-01-09 17:47:46 UTC
Dan came up with the excellent suggestion of running gstack periodically to get a
stack trace.  The stack trace is as follows, repeatedly:

Thread 5 (Thread 0xffffae8bea10 (LWP 19204)):
#0  0x0000ffffbdbd1a00 in nanosleep () from /lib64/libpthread.so.0
#1  0x0000ffffbdf974c0 in g_usleep () from /lib64/libglib-2.0.so.0
#2  0x0000aaaaaae878a0 in qemu_chr_write_buffer ()
#3  0x0000aaaaaae87bd4 in qemu_chr_write ()
#4  0x0000aaaaaad8ae3c in pl011_write ()
#5  0x0000aaaaaac61114 in memory_region_write_accessor ()
#6  0x0000aaaaaac5f4ec in access_with_adjusted_size ()
#7  0x0000aaaaaac63248 in memory_region_dispatch_write ()
#8  0x0000aaaaaac74160 in io_writex.isra ()
#9  0x0000ffffafb55388 in code_gen_buffer ()
#10 0x0000aaaaaac83a68 in cpu_exec ()
#11 0x0000aaaaaac4f170 in qemu_tcg_cpu_thread_fn ()
#12 0x0000ffffbdbc77f8 in start_thread () from /lib64/libpthread.so.0
#13 0x0000ffffbdb1157c in thread_start () from /lib64/libc.so.6
Thread 4 (Thread 0xffffaf0fea10 (LWP 19203)):
#0  0x0000ffffbdb07c5c in ppoll () from /lib64/libc.so.6
#1  0x0000aaaaaaeebb54 in qemu_poll_ns ()
#2  0x0000aaaaaaeedb04 in aio_poll ()
#3  0x0000aaaaaad3e8c4 in iothread_run ()
#4  0x0000ffffbdbc77f8 in start_thread () from /lib64/libpthread.so.0
#5  0x0000ffffbdb1157c in thread_start () from /lib64/libc.so.6
Thread 3 (Thread 0xffffbc3bea10 (LWP 19201)):
#0  0x0000ffffbda73aa0 in sigtimedwait () from /lib64/libc.so.6
#1  0x0000ffffbdbd2224 in sigwait () from /lib64/libpthread.so.0
#2  0x0000aaaaaaeedf6c in sigwait_compat ()
#3  0x0000ffffbdbc77f8 in start_thread () from /lib64/libpthread.so.0
#4  0x0000ffffbdb1157c in thread_start () from /lib64/libc.so.6
Thread 2 (Thread 0xffffbcbcea10 (LWP 19200)):
#0  0x0000ffffbdb0d810 in syscall () from /lib64/libc.so.6
#1  0x0000aaaaaaeeff18 in qemu_event_wait ()
#2  0x0000aaaaaaf00d14 in call_rcu_thread ()
#3  0x0000ffffbdbc77f8 in start_thread () from /lib64/libpthread.so.0
#4  0x0000ffffbdb1157c in thread_start () from /lib64/libc.so.6
Thread 1 (Thread 0xffffbd1b77b0 (LWP 19199)):
#0  0x0000ffffbdbd0df4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000ffffbdbc9e98 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x0000aaaaaaeef684 in qemu_mutex_lock_impl ()
#3  0x0000aaaaaac4e89c in qemu_mutex_lock_iothread ()
#4  0x0000aaaaaaeecaa8 in main_loop_wait ()
#5  0x0000aaaaaac10eb8 in main ()

Therefore it appears to get stuck in this loop:

https://git.qemu.org/?p=qemu.git;a=blob;f=chardev/char.c;h=ccba36bafb5f0d6fe8306886ad4af30523118e37;hb=HEAD#l111

Comment 39 Richard W.M. Jones 2019-01-09 18:42:19 UTC
I got the strace results.  The final lines before the hang are:

write(2, "libguestfs: ", 12libguestfs: )            = 12
write(2, "launch libvirt guest", 20launch libvirt guest)    = 20
write(2, "\n", 1
)                       = 1
gettid()                                = 22154
write(2, "2019-01-09 18:30:51.790+0000: 22"..., 26152019-01-09 18:30:51.790+0000: 22154: debug : virDomainCreateXML:166 : conn=0xaaaaaaaf7eb0, xmlDesc=<?xml version="1.0"?>
<domain type="qemu" xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0">
  <name>guestfs-kwd20ic8pldsqgx3</name>
  <memory unit="MiB">768</memory>
  <currentMemory unit="MiB">768</currentMemory>
  <cpu>
    <model>cortex-a57</model>
  </cpu>
  <vcpu>1</vcpu>
  <clock offset="utc">
    <timer name="rtc" tickpolicy="catchup"/>
    <timer name="pit" tickpolicy="delay"/>
  </clock>
  <os>
    <type machine="virt">hvm</type>
    <loader readonly="yes" type="pflash">/usr/share/AAVMF/AAVMF_CODE.verbose.fd</loader>
    <nvram>/builddir/build/BUILD/libguestfs-1.38.4/tmp/libguestfs1ULedD/vars2.fd</nvram>
    <kernel>/builddir/build/BUILD/libguestfs-1.38.4/tmp/.guestfs-996/appliance.d/kernel</kernel>
    <initrd>/builddir/build/BUILD/libguestfs-1.38.4/tmp/.guestfs-996/appliance.d/initrd</initrd>
    <cmdline>panic=1 console=ttyAMA0 earlyprintk=pl011,0x9000000 ignore_loglevel efi-rtc=noprobe edd=off udevtimeout=6000 udev.event-timeout=6000 no_timer_check printk.time=1 cgroup_disable=memory usbcore.nousb cryptomgr.notests tsc=reliable 8250.nr_uarts=1 root=/dev/sdb selinux=0 guestfs_verbose=1 TERM=vt100</cmdline>
  </os>
  <on_reboot>destroy</on_reboot>
  <devices>
    <rng model="virtio">
      <backend model="random">/dev/urandom</backend>
    </rng>
    <controller type="scsi" index="0" model="virtio-scsi"/>
    <disk device="disk" type="file">
      <source file="/builddir/build/BUILD/libguestfs-1.38.4/tmp/libguestfs1ULedD/scratch1.img"/>
      <target dev="sda" bus="scsi"/>
      <driver name="qemu" type="raw" cache="unsafe"/>
      <address type="drive" controller="0" bus="0" target="0" unit="0"/>
    </disk>
    <disk type="file" device="disk">
      <source file="/builddir/build/BUILD/libguestfs-1.38.4/tmp/libguestfs1ULedD/overlay3.qcow2"/>
      <target dev="sdb" bus="scsi"/>
      <driver name="qemu" type="qcow2" cache="unsafe"/>
      <address type="drive" controller="0" bus="0" target="1" unit="0"/>
    </disk>
    <serial type="unix">
      <source mode="connect" path="/tmp/libguestfsUw59Lt/console.sock"/>
      <target port="0"/>
    </serial>
    <channel type="unix">
      <source mode="connect" path="/tmp/libguestfsUw59Lt/guestfsd.sock"/>
      <target type="virtio" name="org.libguestfs.channel.0"/>
    </channel>
    <controller type="usb" model="none"/>
    <memballoon model="none"/>
  </devices>
  <qemu:commandline>
    <qemu:env name="TMPDIR" value="/builddir/build/BUILD/libguestfs-1.38.4/tmp"/>
  </qemu:commandline>
</domain>
, flags=0x2
) = 2615
gettid()                                = 22154
write(2, "2019-01-09 18:30:51.790+0000: 22"..., 962019-01-09 18:30:51.790+0000: 22154: debug : virNetMessageNew:46 : msg=0xaaaaaaafff90 tracked=0
) = 96
gettid()                                = 22154
write(2, "2019-01-09 18:30:51.790+0000: 22"..., 1002019-01-09 18:30:51.790+0000: 22154: debug : virNetMessageEncodePayload:389 : Encode length as 2540
) = 100
gettid()                                = 22154
write(2, "2019-01-09 18:30:51.790+0000: 22"..., 1872019-01-09 18:30:51.790+0000: 22154: info : virNetClientSendInternal:2118 : RPC_CLIENT_MSG_TX_QUEUE: client=0xaaaaaaafe7f0 len=2540 prog=536903814 vers=1 proc=10 type=0 status=0 serial=6
) = 187
gettid()                                = 22154
write(2, "2019-01-09 18:30:51.790+0000: 22"..., 1432019-01-09 18:30:51.790+0000: 22154: debug : virNetClientCallNew:2071 : New call 0xaaaaaaafef90: msg=0xaaaaaaafff90, expectReply=1, nonBlock=0
) = 143
gettid()                                = 22154
write(2, "2019-01-09 18:30:51.790+0000: 22"..., 1602019-01-09 18:30:51.790+0000: 22154: debug : virNetClientIO:1877 : Outgoing message prog=536903814 version=1 serial=6 proc=10 type=0 length=2540 dispatch=(nil)
) = 160
gettid()                                = 22154
write(2, "2019-01-09 18:30:51.790+0000: 22"..., 1242019-01-09 18:30:51.790+0000: 22154: debug : virNetClientIO:1936 : We have the buck head=0xaaaaaaafef90 call=0xaaaaaaafef90
) = 124
gettid()                                = 22154
write(2, "2019-01-09 18:30:51.790+0000: 22"..., 1272019-01-09 18:30:51.790+0000: 22154: debug : virNetSocketUpdateIOCallback:2206 : Watch not registered on socket 0xaaaaaaafe200
) = 127
rt_sigprocmask(SIG_BLOCK, [PIPE CHLD WINCH], [], 8) = 0
ppoll([{fd=4, events=POLLOUT}, {fd=3, events=POLLIN}], 2, NULL, NULL, 0) = 1 ([{fd=4, revents=POLLOUT}])
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
write(4, "\0\0\t\354 \0\200\206\0\0\0\1\0\0\0\n\0\0\0\0\0\0\0\6\0\0\0\0\0\0\t\310"..., 2540) = 2540
rt_sigprocmask(SIG_BLOCK, [PIPE CHLD WINCH], [], 8) = 0
<--- hang happens here

These indicate that we're still inside virDomainCreateXML at the point where the
hang occurs.  This firmly points to a problem in libvirt.

The detailed reasoning is as follows:

(1) libguestfs prints the message "launch libvirt guest"
(https://github.com/libguestfs/libguestfs/blob/fe1ee20a59ba62a89ef7e233b5b909c1942684bf/lib/launch-libvirt.c#L606)

(2) Libguestfs calls virDomainCreateXML.

(3) The next system call in libguestfs is a call to accept4(), but no such call is
seen in the strace output.

Comment 40 Richard W.M. Jones 2019-01-09 20:01:33 UTC
We suspected that MT-TCG might be causing a problem, however adding this
to the libvirt XML did NOT make any difference, it still hangs.  (Given
we are not using -smp I suppose this makes sense?)

  <qemu:commandline>
    <qemu:arg value="-accel"/>
    <qemu:arg value="tcg,thread=single"/>
  </qemu:commandline>

Comment 41 Richard W.M. Jones 2019-01-09 20:09:46 UTC
Created attachment 1519573 [details]
build.log with libvirtd debugging and libvirtd stack trace

The information in this attachment was requested by Dan for further
debugging of libvirtd.

It contains near to the end the libvirtd debugging information using:
log_filters="1:qemu"

Above that, it contains periodic stack traces collected every 10 seconds
from libvirtd while it (or qemu) was hanging.  Note that the stack
trace from one thread includes a call to virDomainCreateXML.

It seems to be hanging waiting on the qemu monitor?

Comment 42 Richard W.M. Jones 2019-01-09 21:46:31 UTC
My current theory on this bug:

(1) libvirtd creates and unpauses the qemu process.

(2) The qemu process quickly starts writing to the serial console (because
AAVMF produces copious early output).

(3) The serial console is configured as:

  -chardev socket,id=charserial0,path=/tmp/libguestfsgP02RS/console.sock \
  -serial chardev:charserial0

but initially nothing is listening on that socket.

(4) Libvirt starts probing qemu features using the qemu monitor.

(5) For some reason - why? - the qemu monitor does not respond to monitor
queries, and libvirtd gets stuck waiting.

(6) Because libvirtd never progresses, it never returns to libguestfs, and
so nothing ever accepts the serial console connection.

Comment 43 Laszlo Ersek 2019-01-10 12:12:27 UTC
Rich,

if libvirtd unpauses the guest in step (1), then libvirtd uses the QEMU
monitor for that. It seems strange that libvirt would then probe qemu
features only in step (4), after having used the monitor for unpausing the
guest.

Anyway, if QEMU waits on libguestfs (to accept()), and libguestfs waits on
libvirtd (to return from an API call), and libvirtd waits on QEMU (to
respond on the monitor), then we have a circular dependency, and if *all* of
these waits are blocking, then we get a deadlock. What I don't get is why
QEMU effectively implements a blocking write in qemu_chr_write_buffer() (see
near the top of your comment 38):

>  111     retry:
>  112         res = cc->chr_write(s, buf + *offset, len - *offset);
>  113         if (res < 0 && errno == EAGAIN && write_all) {
>  114             g_usleep(100);
>  115             goto retry;
>  116         }

Hm... the logic was introduced in QEMU commit cd18720a294b ("char: introduce
a blocking version of qemu_chr_fe_write", 2013-03-26), but it's not
immediately clear what for. The child commit, 0fbf01fe911e ("qtest: use
synchronous I/O for char device", 2013-03-26) provides the first use case;
however that commit only added a qtest-related call site, not production
code.

... OK, so I think the currently relevant use of the blocking logic goes
back to commit 6ab3fc32ea64 ("hw: replace most use of qemu_chr_fe_write with
qemu_chr_fe_write_all", 2016-09-13). The function pl011_write() is seen in
the stack trace from comment 38, and said commit modified it as follows:

> diff --git a/hw/char/pl011.c b/hw/char/pl011.c
> index c0fbf8a87428..786e605fdd61 100644
> --- a/hw/char/pl011.c
> +++ b/hw/char/pl011.c
> @@ -146,7 +146,9 @@ static void pl011_write(void *opaque, hwaddr offset,
>          /* ??? Check if transmitter is enabled.  */
>          ch = value;
>          if (s->chr)
> -            qemu_chr_fe_write(s->chr, &ch, 1);
> +            /* XXX this blocks entire thread. Rewrite to use
> +             * qemu_chr_fe_write and background I/O callbacks */
> +            qemu_chr_fe_write_all(s->chr, &ch, 1);
>          s->int_level |= PL011_INT_TX;
>          pl011_update(s);
>          break;

Given that commit 6ab3fc32ea64 fixed another bug (see the commit message for
explanation / references), this is not a regression, but "technical debt".
It is a problem in the PL011 device model. If Dan agrees, I'd suggest
reassigning this BZ to QEMU. (The blocking write in pl011_write(), in
"hw/char/pl011.c", is present in QEMU as of commit 8ae951fbc106 ("Merge
remote-tracking branch 'remotes/dgibson/tags/ppc-for-4.0-20190109' into
staging", 2019-01-09).)

Comment 44 Richard W.M. Jones 2019-01-10 12:26:07 UTC
I reproduced this bug locally using libvirt 4.10 compiled from source with this patch.
The qemu and libvirtd stack traces look the same as observed in Koji.  I believe this
proves my theory outlined in comment 42.

diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c
index 5a806f6c0e..8d4cd6079b 100644
--- a/src/qemu/qemu_monitor_json.c
+++ b/src/qemu/qemu_monitor_json.c
@@ -7102,6 +7102,9 @@ qemuMonitorJSONGetDeviceAliases(qemuMonitorPtr mon,
 
     *aliases = NULL;
 
+    VIR_DEBUG("the-big-sleep");
+    sleep (5);
+
     n = qemuMonitorJSONGetObjectListPaths(mon, "/machine/peripheral", &paths);
     if (n < 0)
         return -1;

Comment 45 Danilo Cesar de Paula 2019-01-10 12:34:45 UTC
Folks, I made a test yesteday night and got the result today morning:
 - I pushed a branch into qemu-kvm dist-git (private-no-edk2-aarch64) removing "Requires: edk2-aarch64" from qemu-kvm spec file.
 - pushed a branch into the virt module (private-danilo-test), based on 8.0.0 (fast-train) pointing to that qemu-kvm dist-git. All other packages are the same.
 - asked for a new module build based on that private one [1]
 - Took a few hours, but seems that libguestfs built fine.

A few considerations:
- Module built still failed because perl-sys-virt is broken [2].
- This might be a sign that the problem was introduced by the edk2-aarch64 requirement, or indicates that the issue is intermitent.
- It might or might not help, I just thought that you would like to know that.

[1] https://mbsweb.engineering.redhat.com/module/2653
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1664422

Comment 46 Andrea Bolognani 2019-01-10 12:56:23 UTC
Dan, would you mind taking this one? It's not really arch-specific
after all, and you know a whole lot more than me about this part of
libvirt...

We might also want to revisit the severity, as fixing either this
one or Bug 1665063 will avoid the hang and thus make the remaining
bug not as urgent to fix.

Comment 47 Daniel Berrangé 2019-01-10 13:33:35 UTC
I'm thinking the problem in lbivirt is introduced in this commit:

commit 93db7eea1b86408e02852a8c886d473f4f5007e3
Author: Peter Krempa <pkrempa@redhat.com>
Date:   Thu Feb 1 15:02:17 2018 +0100

    qemu: migration: Refresh device information after transferring state
    
    In my first approach in 4b480d10768c I overlooked the comment in
    qemuMigrationRunIncoming stating that during actual migration the
    qemuMigrationRunIncoming does not wait until the migration is complete
    but rather offloads that to the Finish phase of migration.
    
    This means that during actual migration qemuProcessRefreshState was
    called prior to qemu actually transferring the full state and thus the
    queries did not get the correct information. The approach worked only
    for restore, where we wait for the migration to finish during qemu
    startup.
    
    Fix the issue by calling qemuProcessRefreshState both from
    qemuProcessStart if there's no incomming migration and from
    qemuMigrationFinish so that the code actually works as expected.


This causes us to use the QMP monitor /after/ unpausing CPUs.

I think we need to refactor this again so that it is run /after/ migration completes, but  still before we unpause CPUs.

Comment 51 Andrea Bolognani 2019-01-11 14:26:49 UTC
Disclaimer: socket programming is not my strong suit, quite the
opposite in fact; that said, I've discussed this with colleagues
so I'm somewhat confident I'm not just gonna write nonsense and
publicly make a fool of myself :)

It seems to me that, while both libvirt and QEMU should be
fixed, not to block on pl011 write and not to resume guest CPUs
up until the very last moment possible respectively, libguestfs
is also not being robust enough when starting libvirt guests.

Relevant snippet:

  if (bind (console_sock, (struct sockaddr *) &addr, sizeof addr) == -1) {
    perrorf (g, "bind");
    goto cleanup;
  }

  if (listen (console_sock, 1) == -1) {
    perrorf (g, "listen");
    goto cleanup;
  }

  ...

  dom = virDomainCreateXML (conn, (char *) xml, VIR_DOMAIN_START_AUTODESTROY);
  if (!dom) {
    libvirt_error (g, _(
			"could not create appliance through libvirt.\n"
			"\n"
			"Try running qemu directly without libvirt using this environment variable:\n"
			"export LIBGUESTFS_BACKEND=direct\n"
			"\n"
			"Original error from libvirt"));
    goto cleanup;
  }

  g->state = LAUNCHING;

  /* Wait for console socket to be opened (by qemu). */
  r = accept4 (console_sock, NULL, NULL, SOCK_NONBLOCK|SOCK_CLOEXEC);

So libguestfs will prepare a socket for the serial console, start
listening on it and then start a guest that's configured to connect
to it, but will only actually accept connections to the socket once
virDomainCreateXML() has returned instead of right away, thus
making it possible for QEMU to block on the connection attempt.

A more resilient solution would be to have two separate threads,
one setting up the socket and immediately trying to accept
connections in a loop, and the other one starting the guest in
parallel once the above has been set up. The direct backend is
actually already doing something similar to the above, due to the
fact that it has to fork() in order to spawn QEMU.

I would suggest cloning this bug to libguestfs, and using that
clone to track progress on a fix to the above; said bug would also
be the perfect one to note eg. that tests have temporarily been
disabled on aarch64 to unblock module builds or, if we decide to
start using silent AAVMF in the libguestfs test suite to work
around the issue, that we have done so.

Thoughts?

Comment 52 Daniel Berrangé 2019-01-11 14:44:21 UTC
(In reply to Andrea Bolognani from comment #51)
> Disclaimer: socket programming is not my strong suit, quite the
> opposite in fact; that said, I've discussed this with colleagues
> so I'm somewhat confident I'm not just gonna write nonsense and
> publicly make a fool of myself :)
> 
> It seems to me that, while both libvirt and QEMU should be
> fixed, not to block on pl011 write and not to resume guest CPUs
> up until the very last moment possible respectively, libguestfs
> is also not being robust enough when starting libvirt guests.
> 
> Relevant snippet:
> 
>   if (bind (console_sock, (struct sockaddr *) &addr, sizeof addr) == -1) {
>     perrorf (g, "bind");
>     goto cleanup;
>   }
> 
>   if (listen (console_sock, 1) == -1) {
>     perrorf (g, "listen");
>     goto cleanup;
>   }
> 
>   ...
> 
>   dom = virDomainCreateXML (conn, (char *) xml,
> VIR_DOMAIN_START_AUTODESTROY);
>   if (!dom) {
>     libvirt_error (g, _(
> 			"could not create appliance through libvirt.\n"
> 			"\n"
> 			"Try running qemu directly without libvirt using this environment
> variable:\n"
> 			"export LIBGUESTFS_BACKEND=direct\n"
> 			"\n"
> 			"Original error from libvirt"));
>     goto cleanup;
>   }
> 
>   g->state = LAUNCHING;
> 
>   /* Wait for console socket to be opened (by qemu). */
>   r = accept4 (console_sock, NULL, NULL, SOCK_NONBLOCK|SOCK_CLOEXEC);
> 
> So libguestfs will prepare a socket for the serial console, start
> listening on it and then start a guest that's configured to connect
> to it, but will only actually accept connections to the socket once
> virDomainCreateXML() has returned instead of right away, thus
> making it possible for QEMU to block on the connection attempt.

QEMU shouldn't block on the connect() call.  The kernel will do the connection handshake in the background and queue the incoming connection for a later accept() call.  QEMU can even start sending data before we accept() I believe, and will only block once the socket buffers are all full.

Comment 53 Andrea Bolognani 2019-01-11 14:53:31 UTC
(In reply to Daniel Berrange from comment #52)
> (In reply to Andrea Bolognani from comment #51)
> > So libguestfs will prepare a socket for the serial console, start
> > listening on it and then start a guest that's configured to connect
> > to it, but will only actually accept connections to the socket once
> > virDomainCreateXML() has returned instead of right away, thus
> > making it possible for QEMU to block on the connection attempt.
> 
> QEMU shouldn't block on the connect() call.  The kernel will do the
> connection handshake in the background and queue the incoming connection for
> a later accept() call.  QEMU can even start sending data before we accept()
> I believe, and will only block once the socket buffers are all full.

I assume the above is why we're only hitting the issue in a very
narrow scenario as opposed to the whole thing not working at all,
ever?

If you're using verbose AAVMF then the buffers are going to be
filled up fairly quickly, whereas normally the guest won't produce
nearly as much output and libguestfs will be able to accept the
connection after all.

This doesn't invalidate the fact that setting up the accept() loop
earlier will reduce or even remove the window where the deadlock
can occur, right?

Comment 54 Daniel Berrangé 2019-01-11 14:56:02 UTC
(In reply to Andrea Bolognani from comment #53)
> This doesn't invalidate the fact that setting up the accept() loop
> earlier will reduce or even remove the window where the deadlock
> can occur, right?

Doing the accept earlier won't help. It would need to actually be reading data from the socket to avoid the deadlock.

Comment 55 Andrea Bolognani 2019-01-11 15:05:48 UTC
(In reply to Daniel Berrange from comment #54)
> (In reply to Andrea Bolognani from comment #53)
> > This doesn't invalidate the fact that setting up the accept() loop
> > earlier will reduce or even remove the window where the deadlock
> > can occur, right?
> 
> Doing the accept earlier won't help. It would need to actually be reading
> data from the socket to avoid the deadlock.

Yeah, sorry, of course I didn't mean just calling accept() but also
reading the data and logging it or whatnot.

The point is, only starting the data processing loop after the guest
is up is not as robust as calling virDomainCreateXML() in a separate
thread once the loop is in place.

Comment 56 Richard W.M. Jones 2019-01-11 17:12:06 UTC
In addition, libguestfs can't start threads or process the data in threads.
It's a library and might be running in a non-threaded program.

What we might do is to start the guest in the paused state, accept the
connection (assuming qemu connects in this scenario), and then unpause
the guest.  However this will cause boot times to increase, quite likely
considerably - I measured at least 300ms extra delay based on libvirt
debug messages - and that is not really acceptable either given the amount
of time I spent two years ago knocking tens of milliseconds off the boot
time.

Comment 57 Daniel Berrangé 2019-01-11 17:16:46 UTC
(In reply to Richard W.M. Jones from comment #56)
> In addition, libguestfs can't start threads or process the data in threads.
> It's a library and might be running in a non-threaded program.
> 
> What we might do is to start the guest in the paused state, accept the
> connection (assuming qemu connects in this scenario), and then unpause
> the guest.  However this will cause boot times to increase, quite likely
> considerably - I measured at least 300ms extra delay based on libvirt
> debug messages - and that is not really acceptable either given the amount
> of time I spent two years ago knocking tens of milliseconds off the boot
> time.

I'm pretty surprised to see such a noticeable time delay from starting paused. It is only adding one extra libvirt RPC roundtrip which should be minimal for a local UNIX socket connection. The number of QMP commands libvirt does wrt QEMU should be the same still - we just move the "cont" command to the second RPC call.

Comment 58 Richard W.M. Jones 2019-01-11 17:17:56 UTC
Let me actually measure this before I make any claims ...

Comment 63 Peter Krempa 2019-02-04 15:00:02 UTC
As of libvirt upstream commit:

commit 41d37d31b3ea7105bb3a4cc064032a7fe6142f04
Author: Marc Hartmayer <mhartmay@linux.ibm.com>
Date:   Mon Feb 4 13:36:24 2019 +0100

    qemu: Refresh state before starting the VCPUs
    
    For normal starts (no incoming migration) the refresh of the QEMU
    state must be done before the VCPUs getting started since otherwise
    there might be a race condition between a possible shutdown of the
    guest OS and the QEMU monitor queries.
    
    This fixes "qemu: migration: Refresh device information after
    transferring state" (93db7eea1b864).
    
    Signed-off-by: Marc Hartmayer <mhartmay@linux.ibm.com>
    Reviewed-by: Peter Krempa <pkrempa@redhat.com>

we no longer try to refresh the state prior to vCPUs being started.

Comment 64 Andrea Bolognani 2019-02-13 11:23:43 UTC
(In reply to Peter Krempa from comment #63)
> As of libvirt upstream commit:
> 
> commit 41d37d31b3ea7105bb3a4cc064032a7fe6142f04
> Author: Marc Hartmayer <mhartmay@linux.ibm.com>
> Date:   Mon Feb 4 13:36:24 2019 +0100
> 
>     qemu: Refresh state before starting the VCPUs
>     
>     For normal starts (no incoming migration) the refresh of the QEMU
>     state must be done before the VCPUs getting started since otherwise
>     there might be a race condition between a possible shutdown of the
>     guest OS and the QEMU monitor queries.
>     
>     This fixes "qemu: migration: Refresh device information after
>     transferring state" (93db7eea1b864).
>     
>     Signed-off-by: Marc Hartmayer <mhartmay@linux.ibm.com>
>     Reviewed-by: Peter Krempa <pkrempa@redhat.com>
> 
> we no longer try to refresh the state prior to vCPUs being started.

That's great, thanks for the heads-up!

Moving to POST since we're going to get the fix through the next
rebase.

(In reply to Richard W.M. Jones from comment #49)
> At the moment we have disabled the test on aarch64 to allow module builds
> to go through

Rich, is there a bug associated with that change? We should make
sure the libguestfs test is re-enabled once libvirt has picked up
the fix.

Comment 67 Richard W.M. Jones 2019-03-04 12:05:36 UTC
> (In reply to Richard W.M. Jones from comment #49)
> > At the moment we have disabled the test on aarch64 to allow module builds
> > to go through
> 
> Rich, is there a bug associated with that change? We should make
> sure the libguestfs test is re-enabled once libvirt has picked up
> the fix.

I'm afraid there isn't any bug associated with that.  Pino simply
disabled the tests in dist-git commit 8d36cc88fd09bd3e2b4bebc1dc0ead00b6cf0171.


Note You need to log in before you can comment on or make changes to this bug.