Bug 1461827
| Summary: | QEMU hangs in aio wait when trying to access NBD volume over TLS | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Daniel Berrangé <berrange> |
| Component: | qemu-kvm-rhev | Assignee: | Paolo Bonzini <pbonzini> |
| Status: | CLOSED ERRATA | QA Contact: | Suqin Huang <shuang> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 7.4 | CC: | aliang, berrange, chayang, coli, famz, juzhang, knoel, lmiksik, michen, mrezanin, pbonzini, qzhang, stefanha, virt-maint |
| Target Milestone: | rc | Keywords: | Regression |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | qemu-kvm-rhev-2.9.0-12.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2017-08-02 04:43:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1300770 | ||
Testing with lastest upstream GIT master, I managed to capture a trace of the NBD coroutine:
#0 0x0000555555b60710 in qemu_coroutine_switch (from_=from_@entry=0x55555897df50, to_=to_@entry=0x7ffff7edae58, action=action@entry=COROUTINE_YIELD) at util/coroutine-ucontext.c:176
#1 0x0000555555b5f821 in qemu_coroutine_yield () at util/qemu-coroutine.c:172
#2 0x0000555555ada920 in nbd_co_receive_reply (request=0x7fffc23decc0, request=0x7fffc23decc0, qiov=0x7fffffffd090, reply=<synthetic pointer>, s=0x555556795ed0) at block/nbd-client.c:169
#3 0x0000555555ada920 in nbd_client_co_preadv (bs=0x555556790c40, offset=<optimized out>, bytes=<optimized out>, qiov=0x7fffffffd090, flags=0)
at block/nbd-client.c:226
#4 0x0000555555ad3f53 in bdrv_driver_preadv (bs=bs@entry=0x555556790c40, offset=offset@entry=0, bytes=bytes@entry=512, qiov=qiov@entry=0x7fffffffd090, flags=0) at block/io.c:834
#5 0x0000555555ad7495 in bdrv_aligned_preadv (child=child@entry=0x55555678cb10, req=req@entry=0x7fffc23dee90, offset=offset@entry=0, bytes=bytes@entry=512, align=align@entry=1, qiov=qiov@entry=0x7fffffffd090, flags=0) at block/io.c:1083
#6 0x0000555555ad7743 in bdrv_co_preadv (child=0x55555678cb10, offset=offset@entry=0, bytes=bytes@entry=512, qiov=qiov@entry=0x7fffffffd090, flags=flags@entry=0) at block/io.c:1177
#7 0x0000555555ac810a in blk_co_preadv (blk=0x555556789be0, offset=0, bytes=512, qiov=0x7fffffffd090, flags=0) at block/block-backend.c:990
#8 0x0000555555ac81ec in blk_read_entry (opaque=0x7fffffffd0b0) at block/block-backend.c:1037
#9 0x0000555555b6077a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:79
#10 0x00007fffdabdebf0 in __start_context () at /lib64/libc.so.6
#11 0x00007fffffffc8f0 in ()
#12 0x0000000000000000 in ()
(NB, the stack trace in bug description is qemu-kvm-rhev RPM source, this stack trace is today's git master upstream, so beware of filenames/line numbers)
IIUC, the NBD coroutine has yielded, waiting for the response from the NBD server - presumably it is expecting that response to trigger the main loop event callback to yield back into the coroutine. The main thread hasn't got as far as running the main loop yet though, so AFAICT, the yield will never come.
(In reply to Daniel Berrange from comment #0) > Current upstream GIT master suffers the same flaw, and git bisect shows this > is a regression introduced in the 2.9.0 release with this commit: > > > commit ff82911cd3f69f028f2537825c9720ff78bc3f19 > Author: Paolo Bonzini <pbonzini> > Date: Mon Feb 13 14:52:24 2017 +0100 > > nbd: convert to use qio_channel_yield > > [...] > > Reviewed-by: Stefan Hajnoczi <stefanha> > Reviewed-by: Fam Zheng <famz> Reassigning to Paolo (Eric would be my second choice, but he's on PTO). (In reply to Ademar Reis from comment #4) > (In reply to Daniel Berrange from comment #0) > > Current upstream GIT master suffers the same flaw, and git bisect shows this > > is a regression introduced in the 2.9.0 release with this commit: > > > > > > commit ff82911cd3f69f028f2537825c9720ff78bc3f19 > > Author: Paolo Bonzini <pbonzini> > > Date: Mon Feb 13 14:52:24 2017 +0100 > > > > nbd: convert to use qio_channel_yield > > > > [...] > > > > Reviewed-by: Stefan Hajnoczi <stefanha> > > Reviewed-by: Fam Zheng <famz> > > Reassigning to Paolo (Eric would be my second choice, but he's on PTO). For real this time. :-/ Patch is ready. Fix included in qemu-kvm-rhev-2.9.0-12.el7 Result:
boot up and login the guest successfully
Server:
qemu-nbd -f raw --object tls-creds-x509,id=tls0,endpoint=server,dir=/root/spice_x509-sJF --tls-creds tls0 rhel74-64-virtio-scsi.qcow2 -p 9000 -t
Client:
-object tls-creds-x509,id=tls0,endpoint=client,dir=/root/spice_x509-sJF \
-drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=nbd://hp-dl388g8-16.rhts.eng.pek2.redhat.com:9000,file.tls-creds=tls0 \
Package:
qemu-kvm-rhev-2.9.0-12.el7.x86_64
Hi Paolo, Any other test do i need to run? Thanks Suqin No, it's okay. Thanks! Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:2392 |
Description of problem: Export a disk image using qemu-nbd, enabling use of TLS. For example $ qemu-nbd --object tls-creds-x509,id=tls0,endpoint=server,dir=/home/berrange/security/qemutls --tls-creds tls0 /home/berrange/Fedora-Workstation-Live-x86_64-25-1.3.iso -p 9000 -t Now run QEMU, pointing it to this disk image $ qemu-system-x86_64 -object tls-creds-x509,id=tls0,endpoint=client,dir=/home/berrange/security/qemutls -drive driver=nbd,host=127.0.0.1,port=9000,id=drv0,if=none,tls-creds=tls0 -device virtio-blk-pci,id=image1,drive=drv0,bootindex=0 QEMU will hang at this point, never getting as far as running the main loop. A GDB stack trace is: #0 0x00007f1e1b335aff in __GI_ppoll (fds=0x55c0411e3600, nfds=1, timeout=<optimized out>, timeout@entry=0x0, sigmask=sigmask@entry=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:56 #1 0x000055c03c97d3fb in qemu_poll_ns (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77 #2 0x000055c03c97d3fb in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=-1) at util/qemu-timer.c:322 #3 0x000055c03c97f0b5 in aio_poll (ctx=ctx@entry=0x55c03f955700, blocking=<optimized out>) at util/aio-posix.c:622 #4 0x000055c03c8ffa64 in blk_prw (blk=blk@entry=0x55c03f980000, offset=offset@entry=0, buf=buf@entry=0x7ffc4f03f2c0 "@", bytes=bytes@entry=512, co_entry=co_entry@entry=0x55c03c900f40 <blk_read_entry>, flags=flags@entry=0) at block/block-backend.c:1052 #5 0x000055c03c90115a in blk_pread_unthrottled (count=512, buf=0x7ffc4f03f2c0, offset=0, blk=0x55c03f980000) at block/block-backend.c:1201 ---Type <return> to continue, or q <return> to quit--- #6 0x000055c03c90115a in blk_pread_unthrottled (blk=blk@entry=0x55c03f980000, offset=offset@entry=0, buf=buf@entry=0x7ffc4f03f2c0 "@", count=count@entry=512) at block/block-backend.c:1069 #7 0x000055c03c7c8e9f in guess_disk_lchs (blk=blk@entry=0x55c03f980000, pcylinders=pcylinders@entry=0x7ffc4f03f50c, pheads=pheads@entry=0x7ffc4f03f510, psectors=psectors@entry=0x7ffc4f03f514) at hw/block/hd-geometry.c:70 #8 0x000055c03c7c9007 in hd_geometry_guess (blk=0x55c03f980000, pcyls=pcyls@entry=0x55c03f977974, pheads=pheads@entry=0x55c03f977978, psecs=psecs@entry=0x55c03f97797c, ptrans=ptrans@entry=0x55c03f977990) at hw/block/hd-geometry.c:135 #9 0x000055c03c7c8b82 in blkconf_geometry (conf=conf@entry=0x55c03f977958, ptrans=ptrans@entry=0x55c03f977990, cyls_max=cyls_max@entry=65535, heads_max=heads_max@entry=16, secs_max=secs_max@entry=255, errp=errp@entry=0x7ffc4f03f5f0) at hw/block/block.c:123 #10 0x000055c03c8123cf in ide_dev_initfn (dev=0x55c03f9778e0, kind=IDE_HD) at hw/ide/qdev.c:194 #11 0x000055c03c7d8944 in device_realize (dev=0x55c03f9778e0, errp=0x7ffc4f03f680) at hw/core/qdev.c:228 #12 0x000055c03c7da1c1 in device_set_realized (obj=<optimized out>, value=<optimized out>, errp=0x7ffc4f03f770) at hw/core/qdev.c:939 #13 0x000055c03c8c069e in property_set_bool (obj=0x55c03f9778e0, v=<optimized out>, name=<optimized out>, opaque=0x55c040960a70, errp=0x7ffc4f03f770) at qom/object.c:1860 #14 0x000055c03c8c435f in object_property_set_qobject (obj=0x55c03f9778e0, value=<optimized out>, name=0x55c03c9ea28b "realized", errp=0x7ffc4f03f770) at qom/qom-qobject.c:27 #15 0x000055c03c8c21d0 in object_property_set_bool (obj=0x55c03f9778e0, value=<optimized out>, name=0x55c03c9ea28b "realized", errp=0x7ffc4f03f770) at qom/object.c:1163 #16 0x000055c03c7d8f92 in qdev_init_nofail (dev=dev@entry=0x55c03f9778e0) at hw/core/qdev.c:373 #17 0x000055c03c812794 in ide_create_drive (bus=bus@entry=0x55c0411989f0, unit=unit@entry=0, drive=0x55c03f91b400) at hw/ide/qdev.c:132 #18 0x000055c03c812f9e in pci_ide_create_devs (dev=dev@entry=0x55c041198000, hd_table=hd_table@entry=0x7ffc4f03f870) at hw/ide/pci.c:430 #19 0x000055c03c8136e5 in pci_piix3_ide_init (bus=<optimized out>, hd_table=0x7ffc4f03f870, devfn=<optimized out>) at hw/ide/piix.c:231 #20 0x000055c03c711ce5 in pc_init1 (machine=0x55c03f95e3c0, pci_type=0x55c03c9b094a "i440FX", host_type=0x55c03c9b0951 "i440FX-pcihost") at /usr/src/debug/qemu-2.9.0/hw/i386/pc_piix.c:249 #21 0x000055c03c66cb60 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4579 Current upstream GIT master suffers the same flaw, and git bisect shows this is a regression introduced in the 2.9.0 release with this commit: commit ff82911cd3f69f028f2537825c9720ff78bc3f19 Author: Paolo Bonzini <pbonzini> Date: Mon Feb 13 14:52:24 2017 +0100 nbd: convert to use qio_channel_yield In the client, read the reply headers from a coroutine, switching the read side between the "read header" coroutine and the I/O coroutine that reads the body of the reply. In the server, if the server can read more requests it will create a new "read request" coroutine as soon as a request has been read. Otherwise, the new coroutine is created in nbd_request_put. Reviewed-by: Stefan Hajnoczi <stefanha> Signed-off-by: Paolo Bonzini <pbonzini> Reviewed-by: Fam Zheng <famz> Reviewed-by: Daniel P. Berrange <berrange> Message-id: 20170213135235.12274-8-pbonzini Signed-off-by: Stefan Hajnoczi <stefanha> Version-Release number of selected component (if applicable): qemu-kvm-rhev-2.9.0-10.el7.x86_64 How reproducible: Always Steps to Reproduce: 1. $ qemu-nbd --object tls-creds-x509,id=tls0,endpoint=server,dir=/home/berrange/security/qemutls --tls-creds tls0 /home/berrange/Fedora-Workstation-Live-x86_64-25-1.3.iso -p 9000 -t 2. $ qemu-system-x86_64 -object tls-creds-x509,id=tls0,endpoint=client,dir=/home/berrange/security/qemutls -drive driver=nbd,host=127.0.0.1,port=9000,id=drv0,if=none,tls-creds=tls0 -device virtio-blk-pci,id=image1,drive=drv0,bootindex=0 Actual results: QEMU hangs when initializing machine Expected results: QEMU runs normally