Bug 1874441
Summary: Guest enters 'grub rescue>' mode if a disk image is actively created/added during the early guest boot stage

Product: Red Hat Enterprise Linux Advanced Virtualization
Component: qemu-kvm
Sub component: Block Jobs
Version: 8.3
Target Release: 8.4
Target Milestone: rc
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Keywords: Triaged
Reporter: Gu Nini <ngu>
Assignee: Greg Kurz <gkurz>
QA Contact: Gu Nini <ngu>
CC: aliang, bugproxy, ddepaula, dgibson, fnovak, gkurz, hannsj_uhl, jinzhao, kwolf, mrezanin, qzhang, thuth, virt-maint
Fixed In Version: qemu-kvm-5.2.0-0.module+el8.4.0+8855+a9e237a9
Last Closed: 2021-05-25 06:43:34 UTC
Bug Blocks: 1796871
Description (Gu Nini, 2020-09-01 11:19:36 UTC)
Greg Kurz:
Since the message "SCSI-DISK: Failed to get disk capacity!" belongs to SLOF and I don't really know how SLOF handles SCSI disks, I'm now investigating using a custom SLOF with debug messages enabled. This is a bit tricky: when the flow of debug messages is too high, which causes a lot of round-trips between SLOF and QEMU, I don't hit the issue. So I've tried to only enable selected messages or add my own, with no success so far. Maybe someone from IBM, e.g. Alexey Kardashevskiy, could help here? I've requested a mirror of this to IBM.

As discussed on our call, it might also be worth enquiring with Thomas Huth, who works on s390 now but has a background with SLOF.

Greg Kurz (comment 9):
I could make some progress:

- Each time the failure is observed, a previous INQUIRY command seems to get stuck for some time.
- With a debug message in QEMU, it turns out that the INQUIRY command doesn't reach scsi_disk_emulate_inquiry() in this case.
- The issue doesn't seem to be hit when using a spapr-vscsi device instead of virtio-scsi.
- The issue doesn't seem to be hit when *not* running virtio-scsi in an iothread.

So I'm beginning to suspect that there might be a race somewhere in the virtio-scsi dataplane implementation. Since this isn't POWER-specific code, I'd like to know if this issue can be reproduced on x86.

Gu Nini (comment 10):
(In reply to Greg Kurz from comment #9)
> So I'm beginning to suspect that there might be a race somewhere in
> the virtio-scsi dataplane implementation.

Please note that in my test, I have not used data plane, i.e. '-object iothread,id=iothread1 \'.

> Since this isn't POWER specific code, I'd like to know if this issue
> can be reproduced on x86.

As proved in comment 1, it could not be reproduced on x86 formerly. Aihua, could you help to reconfirm? Thanks in advance.

Gu Nini (comment 11):
(In reply to Gu Nini from comment #10)
> Please note in my test, I have not used data plane, i.e. '-object
> iothread,id=iothread1 \'.

It's confirmed that if using data plane for the system disk, the issue won't occur:

    -object iothread,id=iothread1 \
    ......
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x4,iothread=iothread1 \
    -blockdev node-name=file_image1,driver=file,aio=threads,filename=/home/ngu/rhel830-ppc64le-virtio-scsi.qcow2,cache.direct=on,cache.no-flush=off \
    -blockdev node-name=drive_image1,driver=qcow2,cache.direct=on,cache.no-flush=off,file=file_image1 \
    -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \

Aihua Liang:
I have done the test on x86 with the latest qemu-kvm-5.1.0-13.module+el8.3.0+8382+afc3bbea.x86_64; the issue is not reproduced no matter whether data plane is enabled or disabled.

Greg Kurz:
(In reply to Gu Nini from comment #10)
> Please note in my test, I have not used data plane, i.e. '-object
> iothread,id=iothread1 \'.

Sure, but you've certainly used the dataplane code, since it is used unless you turn off ioeventfd. The only difference is that it is called from the main QEMU thread instead of a separate iothread.
Here is a backtrace showing the dataplane path being taken even from the main loop:

    (gdb) bt
    #0  0x000000012938ada0 in scsi_disk_emulate_inquiry (outbuf=<optimized out>, req=0x14daa7000) at ../../hw/scsi/scsi-disk.c:1922
    #1  0x000000012938ada0 in scsi_disk_emulate_command (req=0x14daa7000, buf=0x14daa7038 "\022") at ../../hw/scsi/scsi-disk.c:1922
    #2  0x000000012929f2fc in scsi_req_enqueue (req=0x14daa7000) at ../../hw/scsi/scsi-bus.c:820
    #3  0x000000012963df74 in virtio_scsi_handle_cmd_req_submit (s=<optimized out>, req=<optimized out>) at ../../hw/scsi/virtio-scsi.c:634
    #4  0x000000012963df74 in virtio_scsi_handle_cmd_vq (s=0x14e548fa0, vq=0x14e550880) at ../../hw/scsi/virtio-scsi.c:634
    #5  0x00000001295d2cb8 in virtio_scsi_data_plane_handle_cmd (vdev=<optimized out>, vq=0x14e550880) at ../../hw/scsi/virtio-scsi-dataplane.c:60   <=====
    #6  0x00000001295d61e0 in virtio_queue_notify_aio_vq (vq=0x14e550880) at ../../hw/virtio/virtio.c:2325
    #7  0x000000012993afc0 in aio_dispatch_handler (ctx=ctx@entry=0x14dc088a0, node=0x7fff24180340) at ../../util/aio-posix.c:328
    #8  0x000000012993bc8c in aio_dispatch_handlers (ctx=0x14dc088a0) at ../../util/aio-posix.c:371
    #9  0x000000012993bc8c in aio_dispatch (ctx=0x14dc088a0) at ../../util/aio-posix.c:381
    #10 0x0000000129914c60 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../../util/async.c:306
    #11 0x00007fff89d68a7c in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
    #12 0x000000012990d808 in glib_pollfds_poll () at ../../util/main-loop.c:221
    #13 0x000000012990d808 in os_host_main_loop_wait (timeout=<optimized out>) at ../../util/main-loop.c:244
    #14 0x000000012990d808 in main_loop_wait (nonblocking=<optimized out>) at ../../util/main-loop.c:520
    #15 0x00000001295eb7bc in qemu_main_loop () at ../../softmmu/vl.c:1676
    #16 0x00000001291acc40 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at ../../softmmu/main.c:50

Greg Kurz:
> > Since this isn't POWER specific code, I'd like to know if this issue
> > can be reproduced on x86.
>
> As proved in comment 1, it could not be reproduced on x86 formerly. Aihua,
> could you help to reconfirm? Thanks in advance.

Oops, my bad. I overlooked this comment.

Greg Kurz:
(In reply to Gu Nini from comment #11)
> It's confirmed if using data plane for the system disk, the issue won't
> occur.

Ah, interesting. Using dataplane for real, i.e. with a separate iothread, is the next thing I wanted to try... thanks!

> I have done test on x86 with the latest
> qemu-kvm-5.1.0-13.module+el8.3.0+8382+afc3bbea.x86_64, the issue is not
> reproduced no matter data plane enabled or disabled.

Thanks for confirming that.

Greg Kurz (comment 14):
Ok, so I've made some progress in the investigation. I had added some custom traces in virtioscsi_send() in SLOF and I had already observed that a SCSI command doesn't get a response from QEMU when the issue occurs.
I've decided to give it a try on x86 anyway, but seabios runs so much faster than SLOF that the blockdev-add QMP commands of the reproducer script happen much too late. So I have adapted the test by starting QEMU with -S and by adding QMP 'cont' and 'stop' commands around all QMP commands of the reproducer script, in order to slow things down:

#!/bin/bash
for i in {1..6}
do
echo "####create image sn$i####"
# cont
echo -e "{'execute':'qmp_capabilities'}
{'execute': 'cont'}" | nc -U /var/tmp/avocado_1
echo -e "{'execute':'qmp_capabilities'}
{'execute': 'blockdev-create', 'arguments': {'options': {'driver': 'file', 'filename': '/home/ngu/sn$i.qcow2', 'size': 21474836480}, 'job-id': 'file_sn$i'}}" | nc -U /var/tmp/avocado_1
# stop
echo -e "{'execute':'qmp_capabilities'}
{'execute': 'stop'}" | nc -U /var/tmp/avocado_1
# cont
echo -e "{'execute':'qmp_capabilities'}
{'execute': 'cont'}" | nc -U /var/tmp/avocado_1
echo -e "{'execute':'qmp_capabilities'}
{'execute': 'job-dismiss', 'arguments': {'id': 'file_sn$i'}}" | nc -U /var/tmp/avocado_1
# stop
echo -e "{'execute':'qmp_capabilities'}
{'execute': 'stop'}" | nc -U /var/tmp/avocado_1
# cont
echo -e "{'execute':'qmp_capabilities'}
{'execute': 'cont'}" | nc -U /var/tmp/avocado_1
echo -e "{'execute':'qmp_capabilities'}
{'execute': 'blockdev-add', 'arguments': {'node-name': 'file_sn$i', 'driver': 'file', 'filename': '/home/ngu/sn$i.qcow2', 'aio': 'threads'}}" | nc -U /var/tmp/avocado_1
# stop
echo -e "{'execute':'qmp_capabilities'}
{'execute': 'stop'}" | nc -U /var/tmp/avocado_1
# cont
echo -e "{'execute':'qmp_capabilities'}
{'execute': 'cont'}" | nc -U /var/tmp/avocado_1
echo -e "{'execute':'qmp_capabilities'}
{'execute': 'blockdev-create', 'arguments': {'options': {'driver': 'qcow2', 'file': 'file_sn$i', 'size': 21474836480}, 'job-id': 'drive_sn$i'}}" | nc -U /var/tmp/avocado_1
# stop
echo -e "{'execute':'qmp_capabilities'}
{'execute': 'stop'}" | nc -U /var/tmp/avocado_1
sleep 1
# cont
echo -e "{'execute':'qmp_capabilities'}
{'execute': 'cont'}" | nc -U /var/tmp/avocado_1
echo -e "{'execute':'qmp_capabilities'}
{'execute': 'job-dismiss', 'arguments': {'id': 'drive_sn$i'}}" | nc -U /var/tmp/avocado_1
# stop
echo -e "{'execute':'qmp_capabilities'}
{'execute': 'stop'}" | nc -U /var/tmp/avocado_1
# cont
echo -e "{'execute':'qmp_capabilities'}
{'execute': 'cont'}" | nc -U /var/tmp/avocado_1
echo -e "{'execute':'qmp_capabilities'}
{'execute': 'blockdev-add', 'arguments': {'node-name': 'drive_sn$i', 'driver': 'qcow2', 'file': 'file_sn$i'}}" | nc -U /var/tmp/avocado_1
# stop
echo -e "{'execute':'qmp_capabilities'}
{'execute': 'stop'}" | nc -U /var/tmp/avocado_1
done
# cont
echo -e "{'execute':'qmp_capabilities'}
{'execute': 'cont'}" | nc -U /var/tmp/avocado_1

Using a seabios with custom traces in virtio_scsi_process_op(), I observe on x86 the very same behavior as with SLOF on POWER: after a few tries, seabios sends a SCSI command and gets no response from QEMU, i.e. seabios loops forever. So it really seems that the issue isn't specific to POWER. It just happens to bite easily on POWER because SLOF is terribly slow.

Gu Nini:
Hi Kevin, could you help to check comment 14? Greg has already reproduced it on x86; could you take over the bug, then?

Kevin Wolf (comment 16):
(In reply to Greg Kurz from comment #9)
> - each time the failure is observed, a previous INQUIRY command seems to
>   get stuck for some time.

Is it stuck and then continues later (maybe after the guest has hit a timeout), or does the request just get lost, as comment 14 suggests?

> - with a debug message in QEMU, it turns out that the INQUIRY command
>   doesn't reach scsi_disk_emulate_inquiry() in this case.

I think this is the most important information here. It means that we haven't even come near any code that touches the block layer before the request gets lost. So we're likely talking about some side effect of the QMP commands. For example, stopping the VM causes the virtio-scsi dataplane to be stopped; maybe we lose requests there?
Though this wouldn't explain the original case without stop/cont.

Maybe we just need to work backwards with tracing. Do we reach virtio_scsi_handle_cmd_vq()? The -EINVAL case there looks like it could drop requests. (Though I'm not sure why we should hit it.)

Greg Kurz (comment 17):
(In reply to Kevin Wolf from comment #16)
> Is it stuck and then continues later (maybe after the guest has hit a
> timeout) or does the request just get lost as comment 14 suggests?

On POWER, SLOF waits for some time, hits a timeout, tries again and then bails out. On x86, seabios loops until it gets an answer.

> For example, stopping the VM causes the virtio-scsi
> dataplane to be stopped, maybe we lose requests there? Though this wouldn't
> explain the original case without stop/cont.

Not sure exactly why, but I see that SLOF naturally triggers a lot of dataplane start/stop, so this could be a lead.

> Maybe we just need to work backwards with tracing. Do we reach
> virtio_scsi_handle_cmd_vq()? The -EINVAL case there looks like it could drop
> requests. (Though I'm not sure why we should hit it.)

I'll give it a try. Thanks.

Greg Kurz (comment 18):
(In reply to Kevin Wolf from comment #16)
> Maybe we just need to work backwards with tracing. Do we reach
> virtio_scsi_handle_cmd_vq()? The -EINVAL case there looks like it could drop
> requests. (Though I'm not sure why we should hit it.)
virtio_scsi_handle_cmd_vq() doesn't get called, and with extra traces I finally found that aio_dispatch_handler() doesn't call virtio_queue_host_notifier_aio_read() because ctx->external_disable_cnt is not zero.

As I was saying in comment #17, SLOF generates a fair amount of dataplane stop/start. This is basically because SLOF disables/re-enables PCI devices multiple times via the IO/MEM/MASTER bits of the PCI_COMMAND register after the initial probe/feature negotiation, as it tends to work with a single device at a time at various stages like probing and running block/network bootloaders, without doing a full reset in-between.

If a blockdev-create job completes while the dataplane is being stopped, especially during the drain section, we might miss a call to bdrv_do_drained_end() and, thus, miss a call to aio_enable_external(), and the AIO context stays disabled forever. Endgame.

If this makes sense to you, I was thinking of the following fix:

--- a/block.c
+++ b/block.c
@@ -4667,6 +4667,10 @@ static void bdrv_delete(BlockDriverState *bs)
 
     bdrv_close(bs);
 
+    if (bs->quiesce_counter) {
+        aio_enable_external(bs->aio_context);
+    }
+
     g_free(bs);
 }

Kevin Wolf (comment 19):
(In reply to Greg Kurz from comment #18)
> If a blockdev-create job completes while the dataplane is being stopped,
> especially during the drain section, we might miss a call to
> bdrv_do_drained_end() and, thus, miss a call to aio_enable_external(),
> and the AIO context stays disabled forever. Endgame.

So we're still inside a bdrv_drain_all_begin()/end() section, so bdrv_close() is called with bs->quiesce_counter > 0?

> If this makes sense to you, I was thinking of the following fix:
> [...]

I think I would call bdrv_do_drained_end() (possibly through a new helper function) rather than just duplicating one of its effects.
Also, I'd probably move this inside bdrv_close().

We'll probably also want to improve test_graph_change_drain_all() in tests/test-bdrv-drain.c to also check ctx->external_disable_cnt and catch the bug this way.

Greg Kurz:
(In reply to Kevin Wolf from comment #19)
> So we're still inside a bdrv_drain_all_begin()/end() section, so
> bdrv_close() is called with bs->quiesce_counter > 0?

Yes.

> I think I would call bdrv_do_drained_end() (possibly through a new helper
> function) rather than just duplicating one of its effects. Also, I'd
> probably move this inside bdrv_close().

Ok.

> We'll probably also want to improve test_graph_change_drain_all() in
> tests/test-bdrv-drain.c to also check ctx->external_disable_cnt and catch
> the bug this way.

Indeed, g_assert_cmpint(qemu_get_aio_context()->external_disable_cnt, ==, 0) after bdrv_drain_all_end() catches the issue. I'll try to come up with a patch and post it to qemu-devel. Thanks Kevin.

Kevin Wolf (comment 21):
Actually, one more thing: it's not a boolean but a counter, and it can probably be > 1 in corner cases. So the if should probably be a while (bs->quiesce_counter--).

Greg Kurz (comment 22):
(In reply to Kevin Wolf from comment #21)
> Actually, one more thing: It's not a boolean, but a counter and it can
> probably be > 1 in corner cases. So the if should probably be a while
> (bs->quiesce_counter--).
Yeah, this was bothering me, but I couldn't think of a situation where it could actually be > 1... I was even envisioning adding an assert() for that, but I'll take your word. Also, we shouldn't `--` it, since the decrement is handled by bdrv_do_drained_end().

Last question: does this really need a helper, i.e. is there a chance it would be useful anywhere but from bdrv_close()? What about making bdrv_do_drained_end() extern and open-coding the logic within bdrv_close() instead?

Greg Kurz:
(In reply to Greg Kurz from comment #22)
> Last question, does this really need a helper, ie. is there a chance
> it would be useful anywhere but from bdrv_close() ? What about
> making bdrv_do_drained_end() extern and open-coding the logic within
> bdrv_close() instead ?

Nah, it looks like this would expose internals of block/io.c to block.c... I'll go with a helper and make it clear it should only be called from bdrv_close(). Please ignore and sorry for the noise :)

Kevin Wolf:
(In reply to Greg Kurz from comment #22)
> Yeah this was bothering me but I couldn't think of a situation where
> it could be actually > 1... I was even envisioning to add an assert()
> for that, but I'll take your word.

I'm not sure if it happens in practice, but in theory there is no reason why you couldn't nest bdrv_drain_all_begin/end sections. I wouldn't be surprised if there is a corner case where we actually end up doing this. Test cases probably don't cover this yet, though.

> Also, we shouldn't -- since this is handled by bdrv_do_drained_end().

You're right, of course.

> Last question, does this really need a helper, ie. is there a chance
> it would be useful anywhere but from bdrv_close() ? What about
> making bdrv_do_drained_end() extern and open-coding the logic within
> bdrv_close() instead ?

I wasn't sure if it was appropriate to just make the internal helper public or whether that would expose internals that we don't want to make part of the public interface. That's why I said "possibly".
Seems you already found your own answer to the question, so I don't have to think up an answer myself. :-)

Greg Kurz:
Posted fix for upstream:
http://patchwork.ozlabs.org/project/qemu-devel/patch/160346526998.272601.9045392804399803158.stgit@bahia.lan/

Greg Kurz:
Fix now merged upstream as:

    commit 1a6d3bd229d429879a85a9105fb84cae049d083c
    Author: Greg Kurz <groug>
    Date:   Fri Oct 23 17:01:10 2020 +0200

        block: End quiescent sections when a BDS is deleted

        If a BDS gets deleted during blk_drain_all(), it might miss a call
        to bdrv_do_drained_end(). This means missing a call to
        aio_enable_external() and the AIO context remains disabled for
        ever. This can cause a device to become irresponsive and to
        disrupt the guest execution, ie. hang, loop forever or worse.

        This scenario is quite easy to encounter with virtio-scsi on POWER
        when punching multiple blockdev-create QMP commands while the guest
        is booting and it is still running the SLOF firmware. This happens
        because SLOF disables/re-enables PCI devices multiple times via
        IO/MEM/MASTER bits of PCI_COMMAND register after the initial probe/
        feature negotiation, as it tends to work with a single device at a
        time at various stages like probing and running block/network
        bootloaders without doing a full reset in-between. This naturally
        generates many dataplane stops and starts, and thus many drain
        sections that can race with blockdev_create_run(). In the end, SLOF
        bails out.

        It is somehow reproducible on x86 but it requires to generate
        articial dataplane start/stop activity with stop/cont QMP commands.
        In this case, seabios ends up looping for ever, waiting for the
        virtio-scsi device to send a response to a command it never
        received.

        Add a helper that pairs all previously called
        bdrv_do_drained_begin() with a bdrv_do_drained_end() and call it
        from bdrv_close(). While at it, update the
        "/bdrv-drain/graph-change/drain_all" test in test-bdrv-drain so
        that it can catch the issue.
        BugId: https://bugzilla.redhat.com/show_bug.cgi?id=1874441
        Signed-off-by: Greg Kurz <groug>
        Message-Id: <160346526998.272601.9045392804399803158.stgit>
        Signed-off-by: Kevin Wolf <kwolf>

QA_ACK, please?

Gu Nini (comment 30):
The bug is fixed well. Tested 100 times on the following software versions without hitting the bug:

    Host kernel:  4.18.0-262.el8.dt3.ppc64le
    Guest kernel: 4.18.0-259.el8.ppc64le
    Qemu:         qemu-kvm-5.2.0-2.module+el8.4.0+9186+ec44380f.ppc64le
    SLOF:         SLOF-20200717-1.gite18ddad8.module+el8.4.0+8855+a9e237a9.noarch

Gu Nini:
(In reply to Gu Nini from comment #30)
> The bug is fixed well, test 100 times on following software versions, not
> meet the bug:
> [...]

Set the bug status as verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2098
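The counter pairing at the heart of this bug can be sketched in a few lines. This is an illustrative model only, not QEMU code; the names merely echo the real structures (ToyBDS, leak_after_buggy_delete and leak_after_fixed_delete are made up for the sketch). Every bdrv_do_drained_begin() disables external events on the AioContext and must be paired with a bdrv_do_drained_end() that re-enables them; deleting a node while it is still quiesced loses the pairing:

```c
#include <assert.h>

/* Toy model: both sides are counters and must stay balanced. */
static int external_disable_cnt;      /* models the AioContext external_disable_cnt */

typedef struct {
    int quiesce_counter;              /* models the BlockDriverState quiesce_counter */
} ToyBDS;

static void drained_begin(ToyBDS *bs)
{
    bs->quiesce_counter++;
    external_disable_cnt++;           /* aio_disable_external() */
}

static void drained_end(ToyBDS *bs)
{
    bs->quiesce_counter--;
    external_disable_cnt--;           /* aio_enable_external() */
}

/* Buggy deletion path: the node goes away while still quiesced, so the
 * matching drained_end() calls never run and the context stays disabled.
 * Returns the number of leaked disables. */
int leak_after_buggy_delete(void)
{
    ToyBDS bs = { 0 };
    external_disable_cnt = 0;
    drained_begin(&bs);               /* e.g. inside bdrv_drain_all_begin() */
    /* node freed here without draining out: nothing re-enables */
    return external_disable_cnt;
}

/* Fixed path, following the approach discussed above: end every pending
 * quiesce section before the node disappears. */
int leak_after_fixed_delete(void)
{
    ToyBDS bs = { 0 };
    external_disable_cnt = 0;
    drained_begin(&bs);
    drained_begin(&bs);               /* nested drain sections can stack */
    while (bs.quiesce_counter) {
        drained_end(&bs);
    }
    return external_disable_cnt;
}
```

The while loop rather than an if is exactly the refinement Kevin suggested in comment 21: the quiesce counter can in theory exceed 1 when drain sections nest.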