Bug 1329145

Summary: | qemu-kvm-rhev sometimes gets SIGABRT when doing continuous blockcommit operations | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Han Han <hhan>
Component: | qemu-kvm-rhev | Assignee: | John Snow <jsnow>
Status: | CLOSED ERRATA | QA Contact: | Qianqian Zhu <qizhu>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 7.3 | CC: | chayang, dyuan, eblake, hhan, huding, juzhang, knoel, mrezanin, qizhu, virt-maint, xuzhang, yanyang
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | qemu-2.7.0 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-08-01 23:32:13 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | gdb backtrace info (attachment 1149398) | |
It is more easily reproduced on:
libvirt-1.3.3-3.fc25.x86_64
qemu-kvm-2.6.0-0.1.rc2.fc25.x86_64

(In reply to Han Han from comment #0)
> 2. Run the following script to reproduce the bug:

Translating the libvirt actions into QMP commands, we have:

> #!/bin/bash -x
> DOM=ABRT
> OS_IMAGE=/var/lib/libvirt/images/V.qcow2
> WORK_DIR=/nfs/$DOM
> SCRIPT_DIR=`pwd`
> ABRT_STAMP=`date +%s`
> prep ()
> {
>     mkdir -p $WORK_DIR/{a..d}
>     if [ ! -f $WORK_DIR/a/a ]; then
>         cp $OS_IMAGE $WORK_DIR/a/a
>     fi
>     cd $WORK_DIR/b
>     qemu-img create -f qcow2 -o backing_file=../a/a,backing_fmt=qcow2 b
>     cd $WORK_DIR/c
>     qemu-img create -f qcow2 -o backing_file=../b/b,backing_fmt=qcow2 c
>     cd $WORK_DIR/d
>     qemu-img create -f qcow2 -o backing_file=../c/c,backing_fmt=qcow2 d
>     cd $SCRIPT_DIR

...create a backing chain (a/a <- b/b <- c/c <- d/d), where b/b, c/c, and d/d are all initially empty overlays of a/a, then...

>     virsh define $DOM.xml && virsh start $DOM && sleep 30

...start the domain, presuming that d/d will be modified by the guest getting 30 seconds of runtime, then...

>     for i in {1..3}; do
>         virsh -k0 blockcommit --active --verbose $DOM vda --shallow --pivot --keep-relative

...run three shallow active commits: iteration 1 commits d/d into c/c then makes c/c active, iteration 2 commits c/c into b/b then makes b/b active, and iteration 3 commits b/b into a/a then makes a/a active.

The first iteration looks roughly like:

{"execute":"block-commit",
 "arguments":{ "device":$name_of_vda,
               "top":"d/d", "base":"c/c" } }

to start the commit. Libvirt then waits for completion by probing query-block-jobs while also waiting for events, until the job has reached the ready state (that is, where 'offset' and 'len' have the same non-zero value), and then calls block-job-complete to perform the pivot.

>     done
> }
>
> cleanup ()
> {
>     virsh destroy $DOM
>     virsh undefine $DOM --snapshots-metadata
>     rm $WORK_DIR/{b..d} -rf
> } # ---------- end of function cleanup ----------
>
> while true; do
>     prep
>     cleanup
> done
>
> 3. When the script hangs at "Block commit: [100 %]", you can find the
> qemu-kvm-rhev SIGABRT in abrt-cli:
> # abrt-cli ls
> id 299764e1370d59ef9cac85382813d3e5e68963fa
> reason: qemu-kvm killed by SIGABRT

It sounds like there may be some sort of race in qemu: perhaps it is reporting (whether via an event or via a response to query-block-jobs) that the job appears to be complete, but when libvirt then attempts block-job-complete, the job is not actually complete, and this triggers an assertion in qemu. I'm not sure how to debug the issue further, beyond being available to offer assistance in trying to reproduce the problem using just qemu, without libvirt in the mix.
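For anyone who wants to drive the sequence above by hand, to take libvirt's own job tracking out of the mix, the same QMP commands can be issued through libvirt's monitor passthrough. A minimal sketch of the first iteration, assuming the device alias drive-virtio-disk0 and the image paths shown elsewhere in this report; the grep on "ready":true and the one-second poll are illustrative, not how libvirt itself waits:

# Start a shallow active commit of d/d into c/c.
virsh qemu-monitor-command ABRT '{"execute":"block-commit",
  "arguments":{"device":"drive-virtio-disk0",
               "top":"/nfs/ABRT/d/d","base":"/nfs/ABRT/c/c"}}'

# Poll query-block-jobs until the commit job reports ready, then pivot.
while ! virsh qemu-monitor-command ABRT '{"execute":"query-block-jobs"}' \
        | grep -q '"ready":true'; do
    sleep 1
done
virsh qemu-monitor-command ABRT '{"execute":"block-job-complete",
  "arguments":{"device":"drive-virtio-disk0"}}'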
(In reply to Eric Blake from comment #3)
> [...]
> to start the commit. Libvirt then waits for completion by probing
> query-block-jobs while also waiting for events, until the job has reached
> the ready state (that is, where 'offset' and 'len' have the same non-zero
> value), and then calls block-job-complete to perform the pivot.

Er, libvirt uses len == offset to determine 'READY' instead of waiting for the QMP event? ... That makes me sorta nervous.

> It sounds like there may be some sort of race in qemu [...] when libvirt
> then attempts block-job-complete, the job is not actually complete, and
> this triggers an assertion in qemu.

Yeah, it looks like pending requests exist at the time we invoke bdrv_replace_in_backing_chain, and we fail assert(!bdrv_requests_pending(old)):

#3  0x00007fd48092d612 in __GI___assert_fail (
        assertion=assertion@entry=0x7fd49aa58817 "!bdrv_requests_pending(old)",
        file=file@entry=0x7fd49aa1f11a "block.c", line=line@entry=2092,
        function=function@entry=0x7fd49aa599b0 <__PRETTY_FUNCTION__.35035>
            "bdrv_replace_in_backing_chain") at assert.c:101

> I'm not sure how to debug the issue further, beyond being available to
> offer assistance in trying to reproduce the problem using just qemu,
> without libvirt in the mix.

Han: the code surrounding the draining of requests has changed a good deal from 2.6.0 to 2.7.0. Could I be very, very obnoxious and ask you to re-test using a 2.7.0 package from Fedora? Either of these two should do it:

qemu-2.7.0-0.2.rc3.fc25
qemu-2.7.0-0.2.rc3.fc26

Otherwise, I will work on reproducing it myself tomorrow.

Thank you,
--js
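On the polling concern above: qemu also emits a BLOCK_JOB_READY event when an active commit becomes ready to pivot, so a test driver can block on the event instead of comparing offset to len. A minimal sketch using virsh's event support (the 60-second timeout is arbitrary, and the exact event text printed varies by libvirt version):

# Wait for the next block-job event from the domain; for an active
# commit this should include the ready notification, after which the
# pivot can be requested.
virsh event ABRT --event block-job --timeout 60
virsh qemu-monitor-command ABRT '{"execute":"block-job-complete",
  "arguments":{"device":"drive-virtio-disk0"}}'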
Hi John,
I ran the script for thousands of loops on libvirt-2.2.0-1.fc26.x86_64 and qemu-2.7.0-0.2.rc3.fc26.x86_64; the bug was not reproduced.

(In reply to Han Han from comment #5)
> I ran the script for thousands of loops on libvirt-2.2.0-1.fc26.x86_64
> and qemu-2.7.0-0.2.rc3.fc26.x86_64; the bug was not reproduced.

Great, thanks! I spent some time trying to reproduce this on 2.6.0, but couldn't. As such, I'm not readily able to pick out which commit in the 2.6.0 -> 2.7.0 timeframe may have fixed the issue for you.

This isn't marked as a blocker or a regression; at this point I need to postpone it to 7.4 unless there is a very compelling reason otherwise.

It's not clear to me which commit specifically may have changed the locking/draining behavior, so I was hoping to do a git bisect. However, I can't reproduce the crash, and I assume it's not a great use of QE's time to bisect for me. So for now I'm going to shuffle this to 7.4.0, and I will return to it shortly.

Since we rebased, this is likely fixed for us, as we couldn't reproduce it in 2.7. We'll collect the apparent fix in the new version. (I think POST is correct in this case.)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2392
Created attachment 1149398
gdb backtrace info

Description of problem:
As the subject says.

Version-Release number of selected component (if applicable):
libvirt-1.3.3-2.el7.x86_64
qemu-kvm-rhev-2.5.0-4.el7.x86_64

How reproducible:
10%

Steps to Reproduce:
1. Prepare a guest XML (ABRT.xml) with a disk like the following:

...
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none'/>
  <source file='/nfs/ABRT/d/d'/>
  <target dev='vda' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</disk>
...

2. Run the following script to reproduce the bug:

#!/bin/bash -x
DOM=ABRT
OS_IMAGE=/var/lib/libvirt/images/V.qcow2
WORK_DIR=/nfs/$DOM
SCRIPT_DIR=`pwd`
ABRT_STAMP=`date +%s`

prep ()
{
    mkdir -p $WORK_DIR/{a..d}
    if [ ! -f $WORK_DIR/a/a ]; then
        cp $OS_IMAGE $WORK_DIR/a/a
    fi
    cd $WORK_DIR/b
    qemu-img create -f qcow2 -o backing_file=../a/a,backing_fmt=qcow2 b
    cd $WORK_DIR/c
    qemu-img create -f qcow2 -o backing_file=../b/b,backing_fmt=qcow2 c
    cd $WORK_DIR/d
    qemu-img create -f qcow2 -o backing_file=../c/c,backing_fmt=qcow2 d
    cd $SCRIPT_DIR
    virsh define $DOM.xml && virsh start $DOM && sleep 30
    for i in {1..3}; do
        virsh -k0 blockcommit --active --verbose $DOM vda --shallow --pivot --keep-relative
    done
}

cleanup ()
{
    virsh destroy $DOM
    virsh undefine $DOM --snapshots-metadata
    rm $WORK_DIR/{b..d} -rf
}  # ---------- end of function cleanup ----------

while true; do
    prep
    cleanup
done
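As a sanity check on the reproducer, the chain that prep() builds, and the pivot performed by each blockcommit iteration, can be observed from the host. A minimal sketch (run the qemu-img check before starting the guest, so it is not inspecting an image in use):

# Before virsh start: confirm the chain a/a <- b/b <- c/c <- d/d.
qemu-img info --backing-chain /nfs/ABRT/d/d

# While the loop runs: after each successful pivot, the source for vda
# should step down the chain (d/d -> c/c -> b/b -> a/a).
virsh domblklist ABRT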
3. When the script hangs at "Block commit: [100 %]", you can find the qemu-kvm-rhev SIGABRT in abrt-cli:

# abrt-cli ls
id 299764e1370d59ef9cac85382813d3e5e68963fa
reason:         qemu-kvm killed by SIGABRT
time:           Thu 21 Apr 2016 11:23:38 AM CST
cmdline:        /usr/libexec/qemu-kvm -name ABRT,debug-threads=on -S -machine pc-i440fx-rhel7.2.0,accel=kvm,usb=off,vmport=off -cpu Penryn -m 1024 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid f30b74d0-142e-41f4-9cc9-aa2fcc785865 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-503-ABRT/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x6.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x6 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x6.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x6.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/nfs/ABRT/d/d,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=28,id=hostnet0,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:14:62:61,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-503-ABRT/org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel1,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 -device usb-tablet,id=input0 -spice port=5903,tls-port=5904,addr=0.0.0.0,disable-ticketing,x509-dir=/etc/pki/libvirt-spice,image-compression=off,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on
package:        qemu-kvm-rhev-2.5.0-4.el7
uid:            107 (qemu)
Directory:      /var/spool/abrt/ccpp-2016-04-20-23:23:38-26919
Run 'abrt-cli report /var/spool/abrt/ccpp-2016-04-20-23:23:38-26919' for creating a case in Red Hat Customer Portal

Actual results:
As in step 2.

Expected results:
No SIGABRT.

Additional info:
After the bug is reproduced, libvirt shows the VM as running, but it has actually exited with SIGABRT.

# virsh list
 Id    Name                 State
----------------------------------------------------
 428   intel-no1            running

# virsh destroy intel-no1
error: Failed to destroy domain intel-no1
error: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainBlockJobAbort)

# ps uax | grep intel-no1
root      5356  0.0  0.0 113120  1504 pts/8   S+  17:01  0:00 /bin/bash -x ./intel-no1.sh
root     12458  0.0  0.0 338972  6420 pts/8   Sl+ 17:24  0:00 virsh -k0 blockcommit --active --verbose intel-no1 vda --shallow --pivot --keep-relative
root     19472  0.0  0.0 149284  5296 pts/24  S+  17:29  0:00 vim intel-no1.sh
root     31593  0.0  0.0 112648   972 pts/13  S+  17:41  0:00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn intel-no1