Bug 1874441

Summary: Guest enters 'grub rescue>' mode if disk images are actively created/added during the early guest boot stage
Product: Red Hat Enterprise Linux Advanced Virtualization Reporter: Gu Nini <ngu>
Component: qemu-kvm Assignee: Greg Kurz <gkurz>
qemu-kvm sub component: Block Jobs QA Contact: Gu Nini <ngu>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: aliang, bugproxy, ddepaula, dgibson, fnovak, gkurz, hannsj_uhl, jinzhao, kwolf, mrezanin, qzhang, thuth, virt-maint
Version: 8.3 Keywords: Triaged
Target Milestone: rc   
Target Release: 8.4   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: qemu-kvm-5.2.0-0.module+el8.4.0+8855+a9e237a9 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-05-25 06:43:34 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1796871    

Description Gu Nini 2020-09-01 11:19:36 UTC
Description of problem:
This is the remaining issue for the same test scenario as bz1824042.
During the early guest boot stage, if disk images are actively created/added with the QMP commands 'blockdev-create'/'blockdev-add', the guest OS might fail to boot up and eventually drops into the GRUB rescue shell:

Trying to load:  from: /pci@800000020000000/scsi@4/disk@100000000000000 ...   Successfully loaded
SCSI-DISK: Failed to get disk capacity!
error: ../../grub-core/kern/disk.c:237:disk `ieee1275/disk,msdos2' not found.
Entering rescue mode...
grub rescue> 


Version-Release number of selected component (if applicable):
Host kernel: 4.18.0-234.el8.ppc64le
Guest kernel: 4.18.0-234.el8.ppc64le
Qemu: qemu-kvm-5.1.0-4.module+el8.3.0+7846+ae9b566f.ppc64le
SLOF: SLOF-20200717-1.gite18ddad8.module+el8.3.0+7638+07cf13d2.noarch

How reproducible:
50%

Steps to Reproduce:
1. Boot up a pre-installed guest with following qemu command line:

/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox on  \
    -machine pseries  \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2 \
    -m 1024  \
    -smp 8,maxcpus=8,cores=4,threads=1,sockets=2  \
    -cpu 'host' \
    -chardev socket,id=qmp_id_qmpmonitor1,server,nowait,path=/var/tmp/avocado_1  \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,id=chardev_serial0,server,nowait,path=/var/tmp/avocado_2 \
    -device spapr-vty,id=serial0,reg=0x30000000,chardev=chardev_serial0 \
    -device qemu-xhci,id=usb1,bus=pci.0,addr=0x3 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4 \
    -blockdev node-name=file_image1,driver=file,aio=threads,filename=/home/ngu/rhel830-ppc64le-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
    -device virtio-net-pci,mac=9a:fa:14:5e:3d:13,id=idLviSMj,netdev=idlrY22o,bus=pci.0,addr=0x5  \
    -netdev tap,id=idlrY22o,vhost=on  \
    -vnc :10  \
    -rtc base=utc,clock=host  \
    -boot menu=off,order=cdn,once=c,strict=off \
    -enable-kvm \
    -monitor stdio

2. During the early guest boot stage, create/add disk images with the following script:

#!/bin/bash

for i in {1..6}
do

echo "####create image sn$i####"

echo -e "{'execute':'qmp_capabilities'} {'execute': 'blockdev-create', 'arguments': {'options': {'driver': 'file', 'filename': '/home/ngu/sn$i.qcow2', 'size': 21474836480}, 'job-id': 'file_sn$i'}}" | nc -U /var/tmp/avocado_1

echo -e "{'execute':'qmp_capabilities'} {'execute': 'job-dismiss', 'arguments': {'id': 'file_sn$i'}}" | nc -U /var/tmp/avocado_1

echo -e "{'execute':'qmp_capabilities'} {'execute': 'blockdev-add', 'arguments': {'node-name': 'file_sn$i', 'driver': 'file', 'filename': '/home/ngu/sn$i.qcow2', 'aio': 'threads'}}" | nc -U /var/tmp/avocado_1

echo -e "{'execute':'qmp_capabilities'} {'execute': 'blockdev-create', 'arguments': {'options': {'driver': 'qcow2', 'file': 'file_sn$i', 'size': 21474836480}, 'job-id': 'drive_sn$i'}}" | nc -U /var/tmp/avocado_1

sleep 1

echo -e "{'execute':'qmp_capabilities'} {'execute': 'job-dismiss', 'arguments': {'id': 'drive_sn$i'}}" | nc -U /var/tmp/avocado_1

echo -e "{'execute':'qmp_capabilities'} {'execute': 'blockdev-add', 'arguments': {'node-name': 'drive_sn$i', 'driver': 'qcow2', 'file': 'file_sn$i'}}" | nc -U /var/tmp/avocado_1

done

3. Check the guest boot process on the guest serial console:
# nc -U /var/tmp/avocado_2


Actual results:
The guest failed to boot up and entered the GRUB rescue shell as shown in the description.

Expected results:
The guest boots up successfully.

Additional info:

Comment 6 Greg Kurz 2020-09-18 09:15:23 UTC
Since the message "SCSI-DISK: Failed to get disk capacity!" belongs to SLOF
and I don't really know how SLOF handles SCSI disks, I'm now investigating
using a custom SLOF with debug messages enabled.

This is a bit tricky. When the flow of debug messages is too high, which
causes a lot of round-trips between SLOF and QEMU, I don't hit the issue.
So I've tried to only enable selected messages or add my own, with no
success so far.

Maybe someone from IBM, e.g. Alexey Kardashevskiy, could help here?

Comment 7 David Gibson 2020-09-24 23:38:27 UTC
I've requested a mirror of this to IBM.  As discussed on our call, it might also be worth enquiring with Thomas Huth, who works on s390 now but has a background with SLOF.

Comment 9 Greg Kurz 2020-10-09 13:11:03 UTC
I could make some progress:

- each time the failure is observed, a previous INQUIRY command seems to
  get stuck for some time.

- with a debug message in QEMU, it turns out that the INQUIRY command
  doesn't reach scsi_disk_emulate_inquiry() in this case.

- the issue doesn't seem to be hit when using a spapr-vscsi device
  instead of virtio-scsi.

- the issue doesn't seem to be hit when *not* running virtio-scsi in an
  iothread.

So I'm beginning to suspect that there might be a race somewhere in
the virtio-scsi dataplane implementation.

Since this isn't POWER specific code, I'd like to know if this issue
can be reproduced on x86.

Comment 10 Gu Nini 2020-10-10 01:21:54 UTC
(In reply to Greg Kurz from comment #9)
> I could make some progress:
> 
> - each time the failure is observed, a previous INQUIRY command seems to
>   get stuck for some time.
> 
> - with a debug message in QEMU, it turns out that the INQUIRY command
>   doesn't reach scsi_disk_emulate_inquiry() in this case.
> 
> - the issue doesn't seem to be hit when using a spapr-vscsi device
>   instead of virtio-scsi.
> 
> - the issue doesn't seem to be hit when *not* running virtio-scsi in an
>   iothread.
> 
> So I'm beginning to suspect that there might be a race somewhere in
> the virtio-scsi dataplane implementation.

Please note in my test, I have not used data plane, i.e. '-object iothread,id=iothread1 \'.

> 
> Since this isn't POWER specific code, I'd like to know if this issue
> can be reproduced on x86.

As proved in comment 1, it could not be reproduced on x86 formerly. Aihua, could you help to reconfirm? Thanks in advance.

Comment 11 Gu Nini 2020-10-10 08:21:10 UTC
(In reply to Gu Nini from comment #10)
> (In reply to Greg Kurz from comment #9)

> > 
> > So I'm beginning to suspect that there might be a race somewhere in
> > the virtio-scsi dataplane implementation.
> 
> Please note in my test, I have not used data plane, i.e. '-object
> iothread,id=iothread1 \'.

It's confirmed that if data plane is used for the system disk, the issue won't occur.

    -object iothread,id=iothread1 \
......
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4,iothread=iothread1 \
    -blockdev node-name=file_image1,driver=file,aio=threads,filename=/home/ngu/rhel830-ppc64le-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \


> 
> > 
> > Since this isn't POWER specific code, I'd like to know if this issue
> > can be reproduced on x86.
> 
> As proved in comment 1, it could not be reproduced on x86 formerly. Aihua,
> could you help to reconfirm? Thanks in advance.

I have tested on x86 with the latest qemu-kvm-5.1.0-13.module+el8.3.0+8382+afc3bbea.x86_64, and the issue is not reproduced whether data plane is enabled or disabled.

Comment 12 Greg Kurz 2020-10-11 19:30:05 UTC
(In reply to Gu Nini from comment #10)
> 
> Please note in my test, I have not used data plane, i.e. '-object
> iothread,id=iothread1 \'.
> 

Sure, but you've certainly used the dataplane code, since it is used
unless you turn off ioeventfd. The only difference is that it
is called from the main QEMU thread instead of a separate iothread.

(gdb) bt
#0  0x000000012938ada0 in scsi_disk_emulate_inquiry (outbuf=<optimized out>, req=0x14daa7000) at ../../hw/scsi/scsi-disk.c:1922
#1  0x000000012938ada0 in scsi_disk_emulate_command (req=0x14daa7000, buf=0x14daa7038 "\022") at ../../hw/scsi/scsi-disk.c:1922
#2  0x000000012929f2fc in scsi_req_enqueue (req=0x14daa7000) at ../../hw/scsi/scsi-bus.c:820
#3  0x000000012963df74 in virtio_scsi_handle_cmd_req_submit (s=<optimized out>, req=<optimized out>) at ../../hw/scsi/virtio-scsi.c:634
#4  0x000000012963df74 in virtio_scsi_handle_cmd_vq (s=0x14e548fa0, vq=0x14e550880) at ../../hw/scsi/virtio-scsi.c:634
#5  0x00000001295d2cb8 in virtio_scsi_data_plane_handle_cmd (vdev=<optimized out>, vq=0x14e550880) at ../../hw/scsi/virtio-scsi-dataplane.c:60 <=====
#6  0x00000001295d61e0 in virtio_queue_notify_aio_vq (vq=0x14e550880) at ../../hw/virtio/virtio.c:2325
#7  0x000000012993afc0 in aio_dispatch_handler (ctx=ctx@entry=0x14dc088a0, node=0x7fff24180340) at ../../util/aio-posix.c:328
#8  0x000000012993bc8c in aio_dispatch_handlers (ctx=0x14dc088a0) at ../../util/aio-posix.c:371
#9  0x000000012993bc8c in aio_dispatch (ctx=0x14dc088a0) at ../../util/aio-posix.c:381
#10 0x0000000129914c60 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../../util/async.c:306
#11 0x00007fff89d68a7c in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#12 0x000000012990d808 in glib_pollfds_poll () at ../../util/main-loop.c:221
#13 0x000000012990d808 in os_host_main_loop_wait (timeout=<optimized out>) at ../../util/main-loop.c:244
#14 0x000000012990d808 in main_loop_wait (nonblocking=<optimized out>) at ../../util/main-loop.c:520
#15 0x00000001295eb7bc in qemu_main_loop () at ../../softmmu/vl.c:1676
#16 0x00000001291acc40 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../../softmmu/main.c:50

> > 
> > Since this isn't POWER specific code, I'd like to know if this issue
> > can be reproduced on x86.
> 
> As proved in comment 1, it could not be reproduced on x86 formerly. Aihua,
> could you help to reconfirm? Thanks in advance.

Oops, my bad. I overlooked this comment.

Comment 13 Greg Kurz 2020-10-11 19:40:06 UTC
(In reply to Gu Nini from comment #11)
> (In reply to Gu Nini from comment #10)
> > (In reply to Greg Kurz from comment #9)
> 
> > > 
> > > So I'm beginning to suspect that there might be a race somewhere in
> > > the virtio-scsi dataplane implementation.
> > 
> > Please note in my test, I have not used data plane, i.e. '-object
> > iothread,id=iothread1 \'.
> 
> It's confirmed if using data plane for the system disk, the issue won't
> occur.
> 
>     -object iothread,id=iothread1 \
> ......
>     -device
> virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4,iothread=iothread1 \
>     -blockdev
> node-name=file_image1,driver=file,aio=threads,filename=/home/ngu/rhel830-
> ppc64le-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
>     -blockdev
> node-name=drive_image1,driver=qcow2,cache.direct=on,cache.no-flush=off,
> file=file_image1 \
>     -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
> 

Ah, interesting. Using dataplane for real, i.e. with a separate iothread, is
the next thing I wanted to try... thanks!

> 
> > 
> > > 
> > > Since this isn't POWER specific code, I'd like to know if this issue
> > > can be reproduced on x86.
> > 
> > As proved in comment 1, it could not be reproduced on x86 formerly. Aihua,
> > could you help to reconfirm? Thanks in advance.
> 
> I have done test on x86 with the latest
> qemu-kvm-5.1.0-13.module+el8.3.0+8382+afc3bbea.x86_64, the issue is not
> reproduced no matter data plane enabled or disabled.

Thanks for confirming that.

Comment 14 Greg Kurz 2020-10-15 13:21:19 UTC
Ok so I've made some progress in the investigation.

I had added some custom traces in virtioscsi_send() in SLOF and I had
already observed that a SCSI command doesn't get a response from QEMU
when the issue occurs.

I've decided to give it a try on x86 anyway, but seabios runs so much faster
than SLOF that the blockdev-add QMP commands of the reproducer script
happen much too late.

So I have adapted the test by starting QEMU with -S and by adding QMP
'cont' and 'stop' commands around all the QMP commands of the reproducer
script in order to slow things down:

#!/bin/bash

for i in {1..6}
do

echo "####create image sn$i####"

# cont
echo -e "{'execute':'qmp_capabilities'} {'execute': 'cont'}" | nc -U /var/tmp/avocado_1

echo -e "{'execute':'qmp_capabilities'} {'execute': 'blockdev-create', 'arguments': {'options': {'driver': 'file', 'filename': '/home/ngu/sn$i.qcow2', 'size': 21474836480}, 'job-id': 'file_sn$i'}}" | nc -U /var/tmp/avocado_1

# stop
echo -e "{'execute':'qmp_capabilities'} {'execute': 'stop'}" | nc -U /var/tmp/avocado_1

# cont
echo -e "{'execute':'qmp_capabilities'} {'execute': 'cont'}" | nc -U /var/tmp/avocado_1

echo -e "{'execute':'qmp_capabilities'} {'execute': 'job-dismiss', 'arguments': {'id': 'file_sn$i'}}" | nc -U /var/tmp/avocado_1

# stop
echo -e "{'execute':'qmp_capabilities'} {'execute': 'stop'}" | nc -U /var/tmp/avocado_1

# cont
echo -e "{'execute':'qmp_capabilities'} {'execute': 'cont'}" | nc -U /var/tmp/avocado_1

echo -e "{'execute':'qmp_capabilities'} {'execute': 'blockdev-add', 'arguments': {'node-name': 'file_sn$i', 'driver': 'file', 'filename': '/home/ngu/sn$i.qcow2', 'aio': 'threads'}}" | nc -U /var/tmp/avocado_1

# stop
echo -e "{'execute':'qmp_capabilities'} {'execute': 'stop'}" | nc -U /var/tmp/avocado_1

# cont
echo -e "{'execute':'qmp_capabilities'} {'execute': 'cont'}" | nc -U /var/tmp/avocado_1

echo -e "{'execute':'qmp_capabilities'} {'execute': 'blockdev-create', 'arguments': {'options': {'driver': 'qcow2', 'file': 'file_sn$i', 'size': 21474836480}, 'job-id': 'drive_sn$i'}}" | nc -U /var/tmp/avocado_1

# stop
echo -e "{'execute':'qmp_capabilities'} {'execute': 'stop'}" | nc -U /var/tmp/avocado_1

sleep 1

#cont
echo -e "{'execute':'qmp_capabilities'} {'execute': 'cont'}" | nc -U /var/tmp/avocado_1

echo -e "{'execute':'qmp_capabilities'} {'execute': 'job-dismiss', 'arguments': {'id': 'drive_sn$i'}}" | nc -U /var/tmp/avocado_1

# stop
echo -e "{'execute':'qmp_capabilities'} {'execute': 'stop'}" | nc -U /var/tmp/avocado_1

#cont
echo -e "{'execute':'qmp_capabilities'} {'execute': 'cont'}" | nc -U /var/tmp/avocado_1

echo -e "{'execute':'qmp_capabilities'} {'execute': 'blockdev-add', 'arguments': {'node-name': 'drive_sn$i', 'driver': 'qcow2', 'file': 'file_sn$i'}}" | nc -U /var/tmp/avocado_1

# stop
echo -e "{'execute':'qmp_capabilities'} {'execute': 'stop'}" | nc -U /var/tmp/avocado_1

done

#cont
echo -e "{'execute':'qmp_capabilities'} {'execute': 'cont'}" | nc -U /var/tmp/avocado_1


Using a seabios with custom traces in virtio_scsi_process_op(), I observe on x86 the very
same behavior as with SLOF on POWER: after a few tries, seabios sends a SCSI command and
gets no response from QEMU, i.e. seabios loops forever.

So it really seems that the issue isn't specific to POWER. It just happens to bite easily
on POWER because SLOF is terribly slow.

Comment 15 Gu Nini 2020-10-19 07:32:56 UTC
Hi Kevin, could you help check comment 14? Greg has already reproduced the issue on x86, so could you take over the bug?

Comment 16 Kevin Wolf 2020-10-19 08:43:08 UTC
(In reply to Greg Kurz from comment #9)
> - each time the failure is observed, a previous INQUIRY command seems to
>   get stuck for some time.

Is it stuck and then continues later (maybe after the guest has hit a timeout) or does the request just get lost as comment 14 suggests?
 
> - with a debug message in QEMU, it turns out that the INQUIRY command
>   doesn't reach scsi_disk_emulate_inquiry() in this case.

I think this is the most important information here. It means that we haven't even come near any code that touches the block layer before the request gets lost. So we're likely talking about some side effect that the QMP commands have. For example, stopping the VM causes the virtio-scsi dataplane to be stopped, maybe we lose requests there? Though this wouldn't explain the original case without stop/cont.

Maybe we just need to work backwards with tracing. Do we reach virtio_scsi_handle_cmd_vq()? The -EINVAL case there looks like it could drop requests. (Though I'm not sure why we should hit it.)
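
For reference, one quick way to answer that question is a throwaway debug print at the top of the handler, along the lines of the sketch below (hypothetical instrumentation, not upstream code; the function lives in hw/scsi/virtio-scsi.c, and the signature and context shown are taken from the 5.1-era tree, so treat them as assumptions):

--- a/hw/scsi/virtio-scsi.c
+++ b/hw/scsi/virtio-scsi.c
@@ (top of virtio_scsi_handle_cmd_vq(), context approximate)
 bool virtio_scsi_handle_cmd_vq(VirtIOSCSI *s, VirtQueue *vq)
 {
+    /* Throwaway debug aid: confirm whether the command virtqueue kick
+     * reaches the handler at all. */
+    fprintf(stderr, "%s: kick on vq %p\n", __func__, (void *)vq);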

Comment 17 Greg Kurz 2020-10-19 14:15:00 UTC
(In reply to Kevin Wolf from comment #16)
> (In reply to Greg Kurz from comment #9)
> > - each time the failure is observed, a previous INQUIRY command seems to
> >   get stuck for some time.
> 
> Is it stuck and then continues later (maybe after the guest has hit a
> timeout) or does the request just get lost as comment 14 suggests?
>  

On POWER, SLOF waits for some time, hits a timeout, tries again and then
bails out.

On x86, seabios loops until it gets an answer.

> > - with a debug message in QEMU, it turns out that the INQUIRY command
> >   doesn't reach scsi_disk_emulate_inquiry() in this case.
> 
> I think this is the most important information here. It means that we
> haven't even come near any code that touches the block layer before the
> request gets lost. So we're likely talking about some side effect that the
> QMP commands have. For example, stopping the VM causes the virtio-scsi
> dataplane to be stopped, maybe we lose requests there? Though this wouldn't
> explain the original case without stop/cont.
> 

I'm not sure exactly why, but I see that SLOF naturally triggers a lot of dataplane
start/stop, so this could be a lead.

> Maybe we just need to work backwards with tracing. Do we reach
> virtio_scsi_handle_cmd_vq()? The -EINVAL case there looks like it could drop
> requests. (Though I'm not sure why we should hit it.)

I'll give it a try. Thanks.

Comment 18 Greg Kurz 2020-10-21 17:29:33 UTC
(In reply to Kevin Wolf from comment #16)
> Maybe we just need to work backwards with tracing. Do we reach
> virtio_scsi_handle_cmd_vq()? The -EINVAL case there looks like it could drop
> requests. (Though I'm not sure why we should hit it.)

virtio_scsi_handle_cmd_vq() doesn't get called, and with extra traces I finally
found that aio_dispatch_handler() doesn't call virtio_queue_host_notifier_aio_read()
because ctx->external_disable_cnt is not zero.

As I was saying in comment #17, SLOF generates a fair amount of dataplane stop/start.
This is basically because SLOF disables/re-enables PCI devices multiple times via
IO/MEM/MASTER bits of PCI_COMMAND register after the initial probe/feature negotiation,
as it tends to work with a single device at a time at various stages like probing and
running block/network bootloaders without doing a full reset in-between.

If a blockdev-create job completes while the dataplane is being stopped, especially
during the drain section, we might miss a call to bdrv_do_drained_end() and, thus,
miss a call to aio_enable_external(), so the AIO context stays disabled forever. Endgame.

If this makes sense to you, I was thinking of the following fix:

--- a/block.c
+++ b/block.c
@@ -4667,6 +4667,10 @@ static void bdrv_delete(BlockDriverState *bs)
 
     bdrv_close(bs);
 
+    if (bs->quiesce_counter) {
+        aio_enable_external(bs->aio_context);
+    }
+
     g_free(bs);
 }

Comment 19 Kevin Wolf 2020-10-22 12:56:34 UTC
(In reply to Greg Kurz from comment #18)
> If a blockdev-create job completes while the dataplane is being stopped,
> especially during the drain section, we might miss a call to bdrv_do_drained_end() and,
> thus, miss a call to aio_enable_external() and the AIO context stay disabled
> forever. Endgame.

So we're still inside a bdrv_drain_all_begin()/end() section, so bdrv_close() is called with bs->quiesce_counter > 0?

> If this makes sense to you, I was thinking to the following fix:
> 
> --- a/block.c
> +++ b/block.c
> @@ -4667,6 +4667,10 @@ static void bdrv_delete(BlockDriverState *bs)
>  
>      bdrv_close(bs);
>  
> +    if (bs->quiesce_counter) {
> +        aio_enable_external(bs->aio_context);
> +    }

I think I would call bdrv_do_drained_end() (possibly through a new helper function) rather than just duplicating one of its effects. Also, I'd probably move this inside bdrv_close().

We'll probably also want to improve test_graph_change_drain_all() in tests/test-bdrv-drain.c to also check ctx->external_disable_cnt and catch the bug this way.

Comment 20 Greg Kurz 2020-10-22 17:26:39 UTC
(In reply to Kevin Wolf from comment #19)
> (In reply to Greg Kurz from comment #18)
> > If a blockdev-create job completes while the dataplane is being stopped,
> > especially during the drain section, we might miss a call to bdrv_do_drained_end() and,
> > thus, miss a call to aio_enable_external() and the AIO context stay disabled
> > forever. Endgame.
> 
> So we're still inside a bdrv_drain_all_begin()/end() section, so
> bdrv_close() is called with bs->quiesce_counter > 0?
> 

Yes.

> > If this makes sense to you, I was thinking to the following fix:
> > 
> > --- a/block.c
> > +++ b/block.c
> > @@ -4667,6 +4667,10 @@ static void bdrv_delete(BlockDriverState *bs)
> >  
> >      bdrv_close(bs);
> >  
> > +    if (bs->quiesce_counter) {
> > +        aio_enable_external(bs->aio_context);
> > +    }
> 
> I think I would call bdrv_do_drained_end() (possibly through a new helper
> function) rather than just duplicating one of its effects. Also, I'd
> probably move this inside bdrv_close().
> 

Ok.

> We'll probably also want to improve test_graph_change_drain_all() in
> tests/test-bdrv-drain.c to also check ctx->external_disable_cnt and catch
> the bug this way.

Indeed, g_assert_cmpint(qemu_get_aio_context()->external_disable_cnt, ==, 0)
after bdrv_drain_all_end() catches the issue.
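
In diff form, that check would look something like this (a sketch only; the exact context at the end of test_graph_change_drain_all() is an assumption):

--- a/tests/test-bdrv-drain.c
+++ b/tests/test-bdrv-drain.c
@@ (end of test_graph_change_drain_all(), context approximate)
     bdrv_drain_all_end();
+
+    /* The drain-all section is over, so no AioContext should still have
+     * external events disabled. */
+    g_assert_cmpint(qemu_get_aio_context()->external_disable_cnt, ==, 0);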

I'll try to come up with a patch and post it to qemu-devel.

Thanks Kevin.

Comment 21 Kevin Wolf 2020-10-23 07:24:19 UTC
Actually, one more thing: it's not a boolean but a counter, and it can probably be > 1 in corner cases. So the if should probably be a while (bs->quiesce_counter--).

Comment 22 Greg Kurz 2020-10-23 08:32:51 UTC
(In reply to Kevin Wolf from comment #21)
> Actually, one more thing: It's not a boolean, but a counter and it can
> probably be > 1 in corner cases. So the if should probably be a while
> (bs->quiesce_counter--).

Yeah, this was bothering me, but I couldn't think of a situation where
it could actually be > 1... I was even considering adding an assert()
for that, but I'll take your word for it. Also, we shouldn't decrement
with '--', since that is handled by bdrv_do_drained_end().

Last question, does this really need a helper, i.e. is there a chance
it would be useful anywhere but from bdrv_close()? What about
making bdrv_do_drained_end() extern and open-coding the logic within
bdrv_close() instead?

Comment 23 Greg Kurz 2020-10-23 09:55:51 UTC
(In reply to Greg Kurz from comment #22)
> 
> Last question, does this really need a helper, ie. is there a chance
> it would be useful anywhere but from bdrv_close() ? What about
> making bdrv_do_drained_end() extern and open-coding the logic within
> bdrv_close() instead ?

Nah, it looks like this would expose internals of block/io.c to
block.c... I'll go with a helper and make it clear it should
only be called from bdrv_close(). Please ignore and sorry for the
noise :)

Comment 24 Kevin Wolf 2020-10-23 10:53:52 UTC
(In reply to Greg Kurz from comment #22)
> Yeah this was bothering me but I couldn't think of a situation where
> it could be actually > 1... I was even envisioning to add an assert()
> for that, but I'll take your word.

I'm not sure if it happens in practice, but in theory there is no reason why you couldn't nest bdrv_drain_all_begin/end sections. I wouldn't be surprised if there is a corner case where we actually end up doing this.

Test cases probably don't cover this yet, though.
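
Just to illustrate the point, nothing stops callers from nesting the sections (sketch only):

/* Nested drain-all sections are legal, so a BDS deleted inside the inner
 * section would see quiesce_counter == 2 rather than 1. */
bdrv_drain_all_begin();     /* quiesce_counter: 0 -> 1 on every BDS */
bdrv_drain_all_begin();     /* nested section:  1 -> 2 */
/* ... deletion of an unreferenced BDS happens here ... */
bdrv_drain_all_end();       /* 2 -> 1 */
bdrv_drain_all_end();       /* 1 -> 0, external events re-enabled */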

> Also, we shouldn't -- since this is handled by bdrv_do_drained_end().

You're right, of course.

> Last question, does this really need a helper, ie. is there a chance
> it would be useful anywhere but from bdrv_close() ? What about
> making bdrv_do_drained_end() extern and open-coding the logic within
> bdrv_close() instead ?

I wasn't sure if it was appropriate to just make the internal helper public or whether that would expose internals that we don't want to make part of the public interface. That's why I said "possibly". Seems you already found your own answer to the question, so I don't have to think up an answer myself. :-)

Comment 26 Greg Kurz 2020-10-30 21:41:46 UTC
Fix now merged upstream as:

commit 1a6d3bd229d429879a85a9105fb84cae049d083c
Author: Greg Kurz <groug>
Date:   Fri Oct 23 17:01:10 2020 +0200

    block: End quiescent sections when a BDS is deleted
    
    If a BDS gets deleted during blk_drain_all(), it might miss a
    call to bdrv_do_drained_end(). This means missing a call to
    aio_enable_external() and the AIO context remains disabled for
    ever. This can cause a device to become unresponsive and to
    disrupt the guest execution, i.e. hang, loop forever or worse.
    
    This scenario is quite easy to encounter with virtio-scsi
    on POWER when punching multiple blockdev-create QMP commands
    while the guest is booting and it is still running the SLOF
    firmware. This happens because SLOF disables/re-enables PCI
    devices multiple times via IO/MEM/MASTER bits of PCI_COMMAND
    register after the initial probe/feature negotiation, as it
    tends to work with a single device at a time at various stages
    like probing and running block/network bootloaders without
    doing a full reset in-between. This naturally generates many
    dataplane stops and starts, and thus many drain sections that
    can race with blockdev_create_run(). In the end, SLOF bails
    out.
    
    It is somehow reproducible on x86 but it requires generating
    artificial dataplane start/stop activity with stop/cont QMP
    commands. In this case, seabios ends up looping for ever,
    waiting for the virtio-scsi device to send a response to
    a command it never received.
    
    Add a helper that pairs all previously called bdrv_do_drained_begin()
    with a bdrv_do_drained_end() and call it from bdrv_close().
    While at it, update the "/bdrv-drain/graph-change/drain_all"
    test in test-bdrv-drain so that it can catch the issue.
    
    BugId: https://bugzilla.redhat.com/show_bug.cgi?id=1874441
    Signed-off-by: Greg Kurz <groug>
    Message-Id: <160346526998.272601.9045392804399803158.stgit>
    Signed-off-by: Kevin Wolf <kwolf>
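
For reference, the shape of the merged fix is roughly the following (a sketch reconstructed from the discussion above; the helper name and the internal bdrv_do_drained_end() signature are assumptions, so refer to the commit itself for the authoritative code):

/* block/io.c (sketch): pair every outstanding bdrv_do_drained_begin() so the
 * AioContext's external events get re-enabled before the BDS goes away. */
void bdrv_drain_all_end_quiesce(BlockDriverState *bs)
{
    int drained_end_counter = 0;

    g_assert(bs->quiesce_counter > 0);

    while (bs->quiesce_counter) {
        bdrv_do_drained_end(bs, false, NULL, true, &drained_end_counter);
    }
    BDRV_POLL_WHILE(bs, qatomic_read(&drained_end_counter) > 0);
}

/* block.c, at the end of bdrv_close() (sketch): end any quiescent section
 * the deleted BDS is still part of. */
if (bs->quiesce_counter) {
    bdrv_drain_all_end_quiesce(bs);
}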

Comment 29 Danilo de Paula 2020-12-15 19:12:36 UTC
QA_ACK, please?

Comment 30 Gu Nini 2020-12-21 06:37:31 UTC
The bug is fixed. I tested 100 times on the following software versions and did not hit the bug:

Host kernel: 4.18.0-262.el8.dt3.ppc64le
Guest kernel: 4.18.0-259.el8.ppc64le
Qemu: qemu-kvm-5.2.0-2.module+el8.4.0+9186+ec44380f.ppc64le
SLOF: SLOF-20200717-1.gite18ddad8.module+el8.4.0+8855+a9e237a9.noarch

Comment 33 Gu Nini 2020-12-25 00:57:27 UTC
(In reply to Gu Nini from comment #30)
> The bug is fixed well, test 100 times on following software versions, not
> meet the bug:
> 
> Host kernel: 4.18.0-262.el8.dt3.ppc64le
> Guest kernel: 4.18.0-259.el8.ppc64le
> Qemu: qemu-kvm-5.2.0-2.module+el8.4.0+9186+ec44380f.ppc64le
> SLOF: SLOF-20200717-1.gite18ddad8.module+el8.4.0+8855+a9e237a9.noarch

Setting the bug status to VERIFIED.

Comment 35 errata-xmlrpc 2021-05-25 06:43:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2098