Bug 1329145

Summary: | qemu-kvm-rhev sometimes gets SIGABRT when doing continuous blockcommit operations | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Han Han <hhan>
Component: | qemu-kvm-rhev | Assignee: | John Snow <jsnow>
Status: | CLOSED ERRATA | QA Contact: | Qianqian Zhu <qizhu>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 7.3 | CC: | chayang, dyuan, eblake, hhan, huding, juzhang, knoel, mrezanin, qizhu, virt-maint, xuzhang, yanyang
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | qemu-2.7.0 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-08-01 23:32:13 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | gdb backtrace info (attachment 1149398) | |
It is more easily reproduced on:
libvirt-1.3.3-3.fc25.x86_64
qemu-kvm-2.6.0-0.1.rc2.fc25.x86_64

(In reply to Han Han from comment #0)
> 2. Run the following script to reproduce the bug:

Translating the libvirt actions into QMP commands, we have:

> #!/bin/bash -x
> DOM=ABRT
> OS_IMAGE=/var/lib/libvirt/images/V.qcow2
> WORK_DIR=/nfs/$DOM
> SCRIPT_DIR=`pwd`
> ABRT_STAMP=`date +%s`
> prep ()
> {
>     mkdir -p $WORK_DIR/{a..d}
>     if [ ! -f $WORK_DIR/a/a ]; then
>         cp $OS_IMAGE $WORK_DIR/a/a
>     fi
>     cd $WORK_DIR/b
>     qemu-img create -f qcow2 -o backing_file=../a/a,backing_fmt=qcow2 b
>     cd $WORK_DIR/c
>     qemu-img create -f qcow2 -o backing_file=../b/b,backing_fmt=qcow2 c
>     cd $WORK_DIR/d
>     qemu-img create -f qcow2 -o backing_file=../c/c,backing_fmt=qcow2 d
>     cd $SCRIPT_DIR

...create a backing chain (a/a <- b/b <- c/c <- d/d), where b/b, c/c, and d/d are all initially empty overlays of a/a, then...

>     virsh define $DOM.xml && virsh start $DOM && sleep 30

...start the domain, presuming that d/d will be modified by the guest getting 30 seconds of runtime, then...

>     for i in {1..3}; do
>         virsh -k0 blockcommit --active --verbose $DOM vda --shallow --pivot --keep-relative

...run three shallow active commits: iteration 1 commits d/d into c/c then makes c/c active, iteration 2 commits c/c into b/b then makes b/b active, and iteration 3 commits b/b into a/a then makes a/a active.

The first iteration looks roughly like:

{"execute":"block-commit",
 "arguments":{ "device":$name_of_vda,
               "top":"d/d", "base":"c/c" } }

to start the commit. Libvirt then waits for completion by probing query-block-jobs while also waiting for events, until the job has reached the ready state (that is, where 'offset' and 'len' have the same non-zero value), and then calls block-job-complete to perform the pivot.

>     done
> }
>
> cleanup ()
> {
>     virsh destroy $DOM
>     virsh undefine $DOM --snapshots-metadata
>     rm $WORK_DIR/{b..d} -rf
> } # ---------- end of function cleanup ----------
>
> while true; do
>     prep
>     cleanup
> done
>
> 3. When the script hangs at "Block commit: [100 %]", you can find the
> qemu-kvm-rhev SIGABRT in abrt-cli:
> # abrt-cli ls
> id 299764e1370d59ef9cac85382813d3e5e68963fa
> reason: qemu-kvm killed by SIGABRT

It sounds like there may be some sort of race in qemu: perhaps it is reporting (whether via an event or via a response to query-block-jobs) that the job appears to be complete, but when libvirt then attempts block-job-complete, the job is not actually complete, and this triggers an assertion in qemu. I'm not sure how to debug the issue further, beyond being available to offer assistance in trying to reproduce the problem using just qemu, without libvirt in the mix.
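For anyone who wants to drive the sequence above by hand, to take libvirt's own job tracking out of the mix, the same QMP commands can be issued through libvirt's monitor passthrough. A minimal sketch of the first iteration, assuming the device alias drive-virtio-disk0 and the image paths shown elsewhere in this report; the grep on "ready":true and the one-second poll are illustrative, not how libvirt itself waits:

# Start a shallow active commit of d/d into c/c.
virsh qemu-monitor-command ABRT '{"execute":"block-commit",
  "arguments":{"device":"drive-virtio-disk0",
               "top":"/nfs/ABRT/d/d","base":"/nfs/ABRT/c/c"}}'

# Poll query-block-jobs until the commit job reports ready, then pivot.
while ! virsh qemu-monitor-command ABRT '{"execute":"query-block-jobs"}' \
        | grep -q '"ready":true'; do
    sleep 1
done
virsh qemu-monitor-command ABRT '{"execute":"block-job-complete",
  "arguments":{"device":"drive-virtio-disk0"}}'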
(In reply to Eric Blake from comment #3)
> [...]
> to start the commit. Libvirt then waits for completion by probing
> query-block-jobs while also waiting for events, until the job has reached
> the ready state (that is, where 'offset' and 'len' have the same non-zero
> value), and then calls block-job-complete to perform the pivot.

Er, libvirt uses len == offset to determine 'READY' instead of waiting for the QMP event? ... That makes me sorta nervous.

> It sounds like there may be some sort of race in qemu [...] when libvirt
> then attempts block-job-complete, the job is not actually complete, and
> this triggers an assertion in qemu.

Yeah, it looks like pending requests exist at the time we invoke bdrv_replace_in_backing_chain, and we fail assert(!bdrv_requests_pending(old)):

#3  0x00007fd48092d612 in __GI___assert_fail (
        assertion=assertion@entry=0x7fd49aa58817 "!bdrv_requests_pending(old)",
        file=file@entry=0x7fd49aa1f11a "block.c", line=line@entry=2092,
        function=function@entry=0x7fd49aa599b0 <__PRETTY_FUNCTION__.35035>
            "bdrv_replace_in_backing_chain") at assert.c:101

> I'm not sure how to debug the issue further, beyond being available to
> offer assistance in trying to reproduce the problem using just qemu,
> without libvirt in the mix.

Han: the code surrounding the draining of requests has changed a good deal from 2.6.0 to 2.7.0. Could I be very, very obnoxious and ask you to re-test using a 2.7.0 package from Fedora? Either of these two should do it:

qemu-2.7.0-0.2.rc3.fc25
qemu-2.7.0-0.2.rc3.fc26

Otherwise, I will work on reproducing it myself tomorrow.

Thank you,
--js
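On the polling concern above: qemu also emits a BLOCK_JOB_READY event when an active commit becomes ready to pivot, so a test driver can block on the event instead of comparing offset to len. A minimal sketch using virsh's event support (the 60-second timeout is arbitrary, and the exact event text printed varies by libvirt version):

# Wait for the next block-job event from the domain; for an active
# commit this should include the ready notification, after which the
# pivot can be requested.
virsh event ABRT --event block-job --timeout 60
virsh qemu-monitor-command ABRT '{"execute":"block-job-complete",
  "arguments":{"device":"drive-virtio-disk0"}}'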
Hi John,
I ran the script for thousands of loops on libvirt-2.2.0-1.fc26.x86_64 and qemu-2.7.0-0.2.rc3.fc26.x86_64; the bug was not reproduced.

(In reply to Han Han from comment #5)
> I ran the script for thousands of loops on libvirt-2.2.0-1.fc26.x86_64
> and qemu-2.7.0-0.2.rc3.fc26.x86_64; the bug was not reproduced.

Great, thanks! I spent some time trying to reproduce this on 2.6.0, but couldn't. As such, I'm not readily able to pick out which commit in the 2.6.0 -> 2.7.0 timeframe may have fixed the issue for you.

This isn't marked as a blocker or a regression; at this point I need to postpone it to 7.4 unless there is a very compelling reason otherwise.

It's not clear to me which commit specifically may have changed the locking/draining behavior, so I was hoping to do a git bisect. However, I can't reproduce the crash, and I assume it's not a great use of QE's time to bisect for me. So for now I'm going to shuffle this to 7.4.0, and I will return to it shortly.

Since we rebased, this is likely fixed for us, as we couldn't reproduce it in 2.7. We'll collect the apparent fix in the new version. (I think POST is correct in this case.)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2392
Created attachment 1149398
gdb backtrace info

Description of problem:
As the subject says.

Version-Release number of selected component (if applicable):
libvirt-1.3.3-2.el7.x86_64
qemu-kvm-rhev-2.5.0-4.el7.x86_64

How reproducible:
10%

Steps to Reproduce:
1. Prepare a guest XML (ABRT.xml) with a disk like the following:

...
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2' cache='none'/>
  <source file='/nfs/ABRT/d/d'/>
  <target dev='vda' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</disk>
...

2. Run the following script to reproduce the bug:

#!/bin/bash -x
DOM=ABRT
OS_IMAGE=/var/lib/libvirt/images/V.qcow2
WORK_DIR=/nfs/$DOM
SCRIPT_DIR=`pwd`
ABRT_STAMP=`date +%s`

prep ()
{
    mkdir -p $WORK_DIR/{a..d}
    if [ ! -f $WORK_DIR/a/a ]; then
        cp $OS_IMAGE $WORK_DIR/a/a
    fi
    cd $WORK_DIR/b
    qemu-img create -f qcow2 -o backing_file=../a/a,backing_fmt=qcow2 b
    cd $WORK_DIR/c
    qemu-img create -f qcow2 -o backing_file=../b/b,backing_fmt=qcow2 c
    cd $WORK_DIR/d
    qemu-img create -f qcow2 -o backing_file=../c/c,backing_fmt=qcow2 d
    cd $SCRIPT_DIR
    virsh define $DOM.xml && virsh start $DOM && sleep 30
    for i in {1..3}; do
        virsh -k0 blockcommit --active --verbose $DOM vda --shallow --pivot --keep-relative
    done
}

cleanup ()
{
    virsh destroy $DOM
    virsh undefine $DOM --snapshots-metadata
    rm $WORK_DIR/{b..d} -rf
}  # ---------- end of function cleanup ----------

while true; do
    prep
    cleanup
done
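As a sanity check on the reproducer, the chain that prep() builds, and the pivot performed by each blockcommit iteration, can be observed from the host. A minimal sketch (run the qemu-img check before starting the guest, so it is not inspecting an image in use):

# Before virsh start: confirm the chain a/a <- b/b <- c/c <- d/d.
qemu-img info --backing-chain /nfs/ABRT/d/d

# While the loop runs: after each successful pivot, the source for vda
# should step down the chain (d/d -> c/c -> b/b -> a/a).
virsh domblklist ABRT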
3. When the script hangs at "Block commit: [100 %]", you can find the qemu-kvm-rhev SIGABRT in abrt-cli:

# abrt-cli ls
id 299764e1370d59ef9cac85382813d3e5e68963fa
reason:         qemu-kvm killed by SIGABRT
time:           Thu 21 Apr 2016 11:23:38 AM CST
cmdline:        /usr/libexec/qemu-kvm -name ABRT,debug-threads=on -S -machine pc-i440fx-rhel7.2.0,accel=kvm,usb=off,vmport=off -cpu Penryn -m 1024 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid f30b74d0-142e-41f4-9cc9-aa2fcc785865 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-503-ABRT/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x6.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x6 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x6.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x6.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 -drive file=/nfs/ABRT/d/d,format=qcow2,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=28,id=hostnet0,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:14:62:61,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/domain-503-ABRT/org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -chardev spicevmc,id=charchannel1,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 -device usb-tablet,id=input0 -spice port=5903,tls-port=5904,addr=0.0.0.0,disable-ticketing,x509-dir=/etc/pki/libvirt-spice,image-compression=off,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -msg timestamp=on
package:        qemu-kvm-rhev-2.5.0-4.el7
uid:            107 (qemu)
Directory:      /var/spool/abrt/ccpp-2016-04-20-23:23:38-26919
Run 'abrt-cli report /var/spool/abrt/ccpp-2016-04-20-23:23:38-26919' for creating a case in Red Hat Customer Portal

Actual results:
As in step 2.

Expected results:
No SIGABRT.

Additional info:
After the bug is reproduced, libvirt shows the VM as running, but it has actually exited with SIGABRT.

# virsh list
 Id    Name                 State
----------------------------------------------------
 428   intel-no1            running

# virsh destroy intel-no1
error: Failed to destroy domain intel-no1
error: Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainBlockJobAbort)

# ps uax | grep intel-no1
root      5356  0.0  0.0 113120  1504 pts/8   S+  17:01  0:00 /bin/bash -x ./intel-no1.sh
root     12458  0.0  0.0 338972  6420 pts/8   Sl+ 17:24  0:00 virsh -k0 blockcommit --active --verbose intel-no1 vda --shallow --pivot --keep-relative
root     19472  0.0  0.0 149284  5296 pts/24  S+  17:29  0:00 vim intel-no1.sh
root     31593  0.0  0.0 112648   972 pts/13  S+  17:41  0:00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn intel-no1