Bug 1503480
| Summary: | src qemu with gluster and data-plane core dumped when do block mirror to NBD driver. | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Longxiang Lyu <lolyu> | |
| Component: | qemu-kvm-rhev | Assignee: | Paolo Bonzini <pbonzini> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | CongLi <coli> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | high | |||
| Version: | 7.5 | CC: | aliang, areis, chayang, coli, eblake, juzhang, knoel, lolyu, mtessun, ngu, pbonzini, qizhu, qzhang, stefanha, virt-maint | |
| Target Milestone: | rc | Keywords: | Regression, TestOnly, ZStream | |
| Target Release: | --- | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1635986 (view as bug list) | Environment: | ||
| Last Closed: | 2019-12-13 13:38:47 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1558351, 1635986 | |||
could reproduce in qemu-kvm-rhev-2.9.0-16.el7_4.10.x86_64 add keyword: regression because the failure to reproduce in 7.4.z: kernel-3.10.0-693.5.1.el7.x86_64 qemu-kvm-rhev-2.9.0-16.el7_4.9.x86_64 Hi, Could you try reproducing with the latest version? There have been many dataplane fixes since this was originally reported, and there is a good chance that the bug has since been fixed. Hi,
reproduced in 1/8 with the following components:
qemu-kvm-rhev-2.10.0-16.el7
kernel-3.10.0-830.el7.x86_64
debug info of dump file:
# gdb -q -c core.25173
[New LWP 25175]
[New LWP 25177]
[New LWP 25179]
[New LWP 25184]
[New LWP 25183]
[New LWP 25178]
[New LWP 25186]
[New LWP 25182]
[New LWP 25181]
[New LWP 25185]
[New LWP 25174]
[New LWP 25193]
[New LWP 25187]
[New LWP 25188]
[New LWP 25190]
[New LWP 25173]
[New LWP 25192]
[New LWP 25189]
Reading symbols from /usr/libexec/qemu-kvm...Reading symbols from /usr/lib/debug/usr/libexec/qemu-kvm.debug...done.
done.
warning: .dynamic section for "/usr/lib64/libudev.so.1.6.2" is not at the expected address (wrong library or version mismatch?)
warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/libexec/qemu-kvm -name guest=test-virt0 -machine pc-i440fx-rhel7.4.0,accel'.
Program terminated with signal 6, Aborted.
#0 0x00007fcbc97431f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0 0x00007fcbc97431f7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fcbc97448e8 in __GI_abort () at abort.c:90
#2 0x00007fcbc973c266 in __assert_fail_base (fmt=0x7fcbc988ee68 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x5616d2094722 "ctx->poll_disable_cnt == 0", file=file@entry=0x5616d20946d3 "util/aio-posix.c", line=line@entry=527, function=function@entry=0x5616d2094860 <__PRETTY_FUNCTION__.25907> "run_poll_handlers") at assert.c:92
#3 0x00007fcbc973c312 in __GI___assert_fail (assertion=assertion@entry=0x5616d2094722 "ctx->poll_disable_cnt == 0", file=file@entry=0x5616d20946d3 "util/aio-posix.c", line=line@entry=527, function=function@entry=0x5616d2094860 <__PRETTY_FUNCTION__.25907> "run_poll_handlers") at assert.c:101
#4 0x00005616d1fa6727 in aio_poll (max_ns=4000, ctx=0x5616d479fd40) at util/aio-posix.c:527
#5 0x00005616d1fa6727 in aio_poll (blocking=true, ctx=0x5616d479fd40) at util/aio-posix.c:562
#6 0x00005616d1fa6727 in aio_poll (ctx=0x5616d479fd40, blocking=blocking@entry=true)
at util/aio-posix.c:602
#7 0x00005616d1d997c6 in iothread_run (opaque=0x5616d47733a0) at iothread.c:59
#8 0x00007fcbc9ad8e25 in start_thread (arg=0x7fcbb9237700) at pthread_create.c:308
#9 0x00007fcbc980634d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
This actually seems to be a problem with the AioContext polling mode and the locking around it. Stefan and Paolo should know this code.
This is the assertion that fails at the start of run_poll_handlers():
assert(ctx->poll_disable_cnt == 0);
The caller of this function is try_poll_mode(), which only calls run_poll_handlers() if the assertion will hold true. So something must change the value in the short piece of code between the check and the call:
if (blocking && ctx->poll_max_ns && ctx->poll_disable_cnt == 0) {
/* See qemu_soonest_timeout() uint64_t hack */
int64_t max_ns = MIN((uint64_t)aio_compute_timeout(ctx),
(uint64_t)ctx->poll_ns);
if (max_ns) {
poll_set_started(ctx, true);
if (run_poll_handlers(ctx, max_ns)) {
return true;
}
}
}
This could be either something called by poll_set_started(), or possibly another thread if the locking isn't correct.
I'll just reassign the bug because I'd have to read and understand the whole polling mode related code first, while I assume Paolo already knows which interactions are possible at this point.
The assertion is wrong, it's valid for another thread to call aio_set_fd_handler at any time and change poll_disable_cnt. I'll remove it and check on every iteration of run_poll_handlers. Reproduced on qemu-kvm-rhev-2.12.0-15.el7.x86_64:
Core was generated by `/usr/libexec/qemu-kvm -S -name avocado-vt-vm1 -sandbox off -machine pc -nodefau'.
Program terminated with signal 11, Segmentation fault.
#0 aio_co_wake (co=0x0) at util/async.c:467
467 ctx = atomic_read(&co->ctx);
Missing separate debuginfos, use: debuginfo-install mesa-libgbm-18.0.5-3.el7.x86_64
(gdb) bt
#0 0x0000560644a25bb3 in aio_co_wake (co=0x0) at util/async.c:467
#1 0x0000560644a27d88 in aio_dispatch_handlers (ctx=ctx@entry=0x560647e57900)
at util/aio-posix.c:406
#2 0x0000560644a287fa in aio_poll (ctx=0x560647e57900, blocking=blocking@entry=true)
at util/aio-posix.c:693
#3 0x00005606447f90fe in iothread_run (opaque=0x560647e75ce0) at iothread.c:64
#4 0x00007fbe80e96dd5 in start_thread (arg=0x7fbe78b45700) at pthread_create.c:307
#5 0x00007fbe80bbcead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Tested with latest qemu package qemu-kvm-rhev-2.12.0-18.el7.x86_64, the assertion failure has been fixed, no core dump file generate. But if shutdown the src guest after mirror, the src qemu still hang as comment 25 mentioned. Hi Ademar, The assertion issue has been fixed as Paolo mentioned, but the qmeu hang issue still existed. And bz1525827 seems a dup of this one. Do we need set this bug to 'VERIFIED' to track the assertion issue, and open a new bug to track the hang issue? some trace log of src qemu: futex(0x5628b4291d20, FUTEX_WAKE_PRIVATE, 1) = 0 sendmsg(51, {msg_name(0)=NULL, msg_iov(1)=[{"{\"timestamp\": {\"seconds\": 153801"..., 147}], msg_controllen=0, msg_flags=0}, 0) = 147 sendmsg(51, {msg_name(0)=NULL, msg_iov(1)=[{"{\"timestamp\": {\"seconds\": 153801"..., 147}], msg_controllen=0, msg_flags=0}, 0) = 147 sendmsg(51, {msg_name(0)=NULL, msg_iov(1)=[{"{\"timestamp\": {\"seconds\": 153801"..., 205}], msg_controllen=0, msg_flags=0}, 0) = 205 sendmsg(51, {msg_name(0)=NULL, msg_iov(1)=[{"{\"timestamp\": {\"seconds\": 153801"..., 149}], msg_controllen=0, msg_flags=0}, 0) = 149 sendmsg(51, {msg_name(0)=NULL, msg_iov(1)=[{"{\"timestamp\": {\"seconds\": 153801"..., 144}], msg_controllen=0, msg_flags=0}, 0) = 144 write(11, "\1\0\0\0\0\0\0\0", 8) = 8 rt_sigprocmask(SIG_BLOCK, NULL, [BUS USR1 ALRM IO], 8) = 0 mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f563dac1000 mprotect(0x7f563dac1000, 4096, PROT_NONE) = 0 rt_sigprocmask(SIG_SETMASK, [BUS USR1 ALRM IO], [BUS USR1 ALRM IO], 8) = 0 write(11, "\1\0\0\0\0\0\0\0", 8) = 8 futex(0x5628b4291d20, FUTEX_WAKE_PRIVATE, 1) = 1 futex(0x5628b4291d20, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable) futex(0x5628b4291d20, FUTEX_WAKE_PRIVATE, 1) = 0 rt_sigprocmask(SIG_BLOCK, NULL, [BUS USR1 ALRM IO], 8) = 0 mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f563d9c0000 mprotect(0x7f563d9c0000, 4096, PROT_NONE) = 0 rt_sigprocmask(SIG_SETMASK, [BUS USR1 ALRM IO], [BUS USR1 ALRM IO], 8) = 0 write(11, "\1\0\0\0\0\0\0\0", 8) = 8 futex(0x5628b4291d20, FUTEX_WAKE_PRIVATE, 1) = 1 ppoll([{fd=8, events=POLLIN|POLLERR|POLLHUP}], 1, NULL, NULL, 8) = 1 ([{fd=8, revents=POLLIN}]) read(8, "\1\0\0\0\0\0\0\0", 512) = 8 sendmsg(57, {msg_name(0)=NULL, msg_iov(1)=[{"%`\225\23\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 28}], msg_controllen=0, msg_flags=0}, 0) = 28 shutdown(57, SHUT_RDWR) = 0 futex(0x5628b4291d20, FUTEX_WAKE_PRIVATE, 1) = 1 ppoll([{fd=8, events=POLLIN|POLLERR|POLLHUP}], 1, NULL, NULL, 8 Thanks. *** Bug 1525827 has been marked as a duplicate of this bug. *** (In reply to CongLi from comment #28) > Tested with latest qemu package qemu-kvm-rhev-2.12.0-18.el7.x86_64, the > assertion failure has been fixed, no core dump file generate. > > But if shutdown the src guest after mirror, the src qemu still hang as > comment 25 mentioned. > > Hi Ademar, > > The assertion issue has been fixed as Paolo mentioned, but the qmeu hang > issue still existed. > > And bz1525827 seems a dup of this one. > Done. > Do we need set this bug to 'VERIFIED' to track the assertion issue, and open > a new bug to track the hang issue? Yes, let's do this to avoid confusion. Thanks. Thanks Ademar. Based on comment comment 25, comment 26, comment 28 and comment 30, set this bug to verified. And will report a new bug for the hang issue with data plane. Thanks. (In reply to Ademar Reis from comment #30) > (In reply to CongLi from comment #28) > > > Do we need set this bug to 'VERIFIED' to track the assertion issue, and open > > a new bug to track the hang issue? > > Yes, let's do this to avoid confusion. Thanks. Bug 1634219 will track the src qemu hang issue. Thanks. |
Description of problem: src qemu with gluster backend core dumped when do block mirror to NBD driver. Version-Release number of selected component (if applicable): kernel-3.10.0-702.el7.x86_64 qemu-kvm-rhev-2.9.0-16.el7_4.4.x86_64 glusterfs-3.8.4-45.el7rhgs.x86_64 guest: rhel7.4 How reproducible: 3/12 Steps to Reproduce: 1. boot up dest guest an empty disk and export it as NBD server. # qemu-img create -f qcow2 mirror.qcow2 20G #!/bin/bash /usr/libexec/qemu-kvm \ -name guest=test-virt1 \ -machine pc-i440fx-rhel7.4.0,accel=kvm,usb=off,vmport=off,dump-guest-core=off \ -cpu SandyBridge \ -m 4G \ -smp 4,sockets=4,cores=1,threads=1 \ -boot strict=on \ -object iothread,id=iothread0 \ -device virtio-scsi-pci,bus=pci.0,addr=0x5,iothread=iothread0,id=scsi0 \ -drive file=mirror.qcow2,format=qcow2,snapshot=off,cache=none,if=none,aio=native,id=img0 \ -device scsi-hd,bus=scsi0.0,drive=img0,scsi-id=0,lun=0,id=scsi-disk0,bootindex=0 \ -netdev tap,id=hostnet0,vhost=on \ -device virtio-net-pci,netdev=hostnet0,id=net0,mac=51:54:12:b3:20:61,bus=pci.0,addr=0x3 \ -device qxl-vga \ -vnc :2 \ -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 \ -monitor stdio \ -qmp tcp:0:5555,server,nowait \ -usbdevice tablet \ -incoming tcp:0:6666 \ # telnet 127.0.0.1 5555 { "execute": "qmp_capabilities" } { "execute": "nbd-server-start", "arguments": { "addr": { "type": "inet","data": { "host":"10.66.11.1", "port": "9000" } } } } {"execute":"nbd-server-add","arguments":{"device":"img0", "writable": true}} 2. boot up src guest with gluster backend. #!/bin/bash /usr/libexec/qemu-kvm \ -name guest=test-virt0 \ -machine pc-i440fx-rhel7.4.0,accel=kvm,usb=off,vmport=off,dump-guest-core=off \ -cpu SandyBridge \ -m 4G \ -smp 4,sockets=4,cores=1,threads=1 \ -boot strict=on \ -object iothread,id=iothread0 \ -device virtio-scsi-pci,bus=pci.0,addr=0x5,iothread=iothread0,id=scsi0 \ -drive file=gluster://10.73.199.197/gv0/rhel74-64-virtio.qcow2,format=qcow2,snapshot=off,cache=none,if=none,aio=native,id=img0 \ -device scsi-hd,bus=scsi0.0,drive=img0,scsi-id=0,lun=0,id=scsi-disk0,bootindex=0 \ -netdev tap,id=hostnet0,vhost=on \ -device virtio-net-pci,netdev=hostnet0,id=net0,mac=51:54:12:b3:20:61,bus=pci.0,addr=0x3 \ -device qxl-vga \ -vnc :1 \ -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 \ -monitor stdio \ -qmp tcp:0:4444,server,nowait \ -usbdevice tablet \ 3. do block mirror to NBD server from src. # telnet 127.0.0.1 4444 { "execute": "qmp_capabilities" } { "execute": "drive-mirror", "arguments": { "device": "img0", "target": "nbd://10.66.11.1:9000/img0", "sync": "full", "format": "raw", "mode": "existing" } } Actual results: src qemu: src qemu core dumped. qmp outputs: { "execute": "drive-mirror", "arguments": { "device": "img0", "target": "nbd://10.66.11.1:9000/img0", "sync": "full", "format": "raw", "mode": "existing" } } {"return": {}} Connection closed by foreign host. dest qemu: (qemu) qemu-kvm: Disconnect client, due to: Unable to read from socket: Connection reset by peer qemu-kvm: Disconnect client, due to: End of file Expected results: block mirror finishes with no segmentation fault. Additional info: gdb info of the dump file: # gdb -q -c core.3070 [New LWP 3072] [New LWP 3082] [New LWP 3115] [New LWP 3089] [New LWP 3081] [New LWP 3195] [New LWP 3083] [New LWP 3131] [New LWP 3121] [New LWP 3087] [New LWP 3196] [New LWP 3091] [New LWP 3071] [New LWP 3086] [New LWP 3088] [New LWP 3090] [New LWP 3116] [New LWP 3070] Reading symbols from /usr/libexec/qemu-kvm...Reading symbols from /usr/lib/debug/usr/libexec/qemu-kvm.debug...done. done. [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib64/libthread_db.so.1". Core was generated by `/usr/libexec/qemu-kvm -name guest=test-virt0 -machine pc-i440fx-rhel7.4.0,accel'. Program terminated with signal 6, Aborted. #0 0x00007faa7b4411f7 in raise () from /lib64/libc.so.6 Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 celt051-0.5.1.3-8.el7.x86_64 cyrus-sasl-gssapi-2.1.26-21.el7.x86_64 cyrus-sasl-lib-2.1.26-21.el7.x86_64 cyrus-sasl-md5-2.1.26-21.el7.x86_64 cyrus-sasl-plain-2.1.26-21.el7.x86_64 cyrus-sasl-scram-2.1.26-21.el7.x86_64 elfutils-libelf-0.168-8.el7.x86_64 elfutils-libs-0.168-8.el7.x86_64 glib2-2.50.3-3.el7.x86_64 glibc-2.17-196.el7.x86_64 glusterfs-3.8.4-45.el7rhgs.x86_64 glusterfs-api-3.8.4-45.el7rhgs.x86_64 glusterfs-client-xlators-3.8.4-45.el7rhgs.x86_64 glusterfs-libs-3.8.4-45.el7rhgs.x86_64 gmp-6.0.0-15.el7.x86_64 gnutls-3.3.26-9.el7.x86_64 gperftools-libs-2.4-8.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-8.el7.x86_64 libacl-2.2.51-12.el7.x86_64 libaio-0.3.109-13.el7.x86_64 libattr-2.4.46-12.el7.x86_64 libblkid-2.23.2-43.el7.x86_64 libcacard-2.5.2-2.el7.x86_64 libcap-2.22-9.el7.x86_64 libcom_err-1.42.9-10.el7.x86_64 libcurl-7.29.0-42.el7.x86_64 libdb-5.3.21-20.el7.x86_64 libffi-3.0.13-18.el7.x86_64 libgcc-4.8.5-16.el7.x86_64 libgcrypt-1.5.3-14.el7.x86_64 libgpg-error-1.12-3.el7.x86_64 libibverbs-13-7.el7.x86_64 libidn-1.28-4.el7.x86_64 libiscsi-1.9.0-7.el7.x86_64 libjpeg-turbo-1.2.90-5.el7.x86_64 libmount-2.23.2-43.el7.x86_64 libnl3-3.2.28-4.el7.x86_64 libpng-1.5.13-7.el7_2.x86_64 librados2-12.2.0-2.el7cp.x86_64 librbd1-12.2.0-2.el7cp.x86_64 librdmacm-13-7.el7.x86_64 libseccomp-2.3.1-3.el7.x86_64 libselinux-2.5-11.el7.x86_64 libssh2-1.4.3-10.el7_2.1.x86_64 libstdc++-4.8.5-16.el7.x86_64 libtasn1-4.10-1.el7.x86_64 libunwind-1.2-2.el7.x86_64 libusbx-1.0.20-1.el7.x86_64 libuuid-2.23.2-43.el7.x86_64 lttng-ust-2.4.1-4.el7.x86_64 lzo-2.06-8.el7.x86_64 nettle-2.7.1-8.el7.x86_64 nspr-4.13.1-1.0.el7_3.x86_64 nss-3.28.4-8.el7.x86_64 nss-softokn-freebl-3.28.3-6.el7.x86_64 nss-util-3.28.4-3.el7.x86_64 numactl-libs-2.0.9-6.el7_2.x86_64 openldap-2.4.44-5.el7.x86_64 openssl-libs-1.0.2k-8.el7.x86_64 p11-kit-0.23.5-3.el7.x86_64 pcre-8.32-17.el7.x86_64 pixman-0.34.0-1.el7.x86_64 snappy-1.1.0-3.el7.x86_64 spice-server-0.12.8-2.el7.1.x86_64 sssd-client-1.15.2-50.el7.x86_64 systemd-libs-219-42.el7.x86_64 usbredir-0.7.1-2.el7.x86_64 userspace-rcu-0.7.16-1.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64 (gdb) bt #0 0x00007faa7b4411f7 in raise () from /lib64/libc.so.6 #1 0x00007faa7b4428e8 in abort () from /lib64/libc.so.6 #2 0x00007faa7b43a266 in __assert_fail_base () from /lib64/libc.so.6 #3 0x00007faa7b43a312 in __assert_fail () from /lib64/libc.so.6 #4 0x000055b4d6b8b2d7 in run_poll_handlers (max_ns=4000, ctx=0x55b4d81a1ac0) at util/aio-posix.c:520 #5 try_poll_mode (blocking=true, ctx=0x55b4d81a1ac0) at util/aio-posix.c:555 #6 aio_poll (ctx=0x55b4d81a1ac0, blocking=blocking@entry=true) at util/aio-posix.c:595 #7 0x000055b4d6982dfe in iothread_run (opaque=0x55b4d813b680) at iothread.c:59 #8 0x00007faa7b7d6e25 in start_thread () from /lib64/libpthread.so.0 #9 0x00007faa7b50434d in clone () from /lib64/libc.so.6 (gdb)