Bug 1249718 - Segfault occurred at Dst VM while completed migration upon ENOSPC
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm
Version: 7.1
Hardware: Unspecified Unspecified
Priority: medium Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Stefan Hajnoczi
QA Contact: Virtualization Bugs
Depends On: 1160169 1249740
Blocks:
 
Reported: 2015-08-03 11:54 EDT by Stefan Hajnoczi
Modified: 2015-11-19 00:12 EST

Fixed In Version: qemu-kvm-1.5.3-99.el7
Doc Type: Bug Fix
Clone Of: 1160169
Last Closed: 2015-11-19 00:12:11 EST
Type: Bug
Comment 1 Miroslav Rezanina 2015-08-06 00:32:28 EDT
Fix included in qemu-kvm-1.5.3-99.el7
Comment 3 Qian Guo 2015-08-13 03:25:01 EDT
I can reproduce this bug with qemu-kvm-rhev-2.1.2-21.el7.x86_64.
Steps:
1. Boot the guest as:
/usr/libexec/qemu-kvm \
    -name rhel7.0 \
    -S \
    -machine pc \
    -cpu Penryn \
    -m 4096 \
    -realtime mlock=off \
    -smp 4,sockets=1,cores=4,threads=1 \
    -uuid fbf54917-5833-48f2-b3fb-5ce2ad294d93 \
    -no-user-config \
    -nodefaults \
    -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/rhel7cp1.0.monitor,server,nowait \
    -mon chardev=charmonitor,id=monitor,mode=control \
    -rtc base=utc,driftfix=slew \
    -global kvm-pit.lost_tick_policy=discard \
    -no-hpet \
    -no-shutdown \
    -boot menu=on \
    -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 \
    -drive file=/home/rhel72qiguo.qcow2,snapshot=off,cache=none,if=none,id=drive-virtio-disk0,format=qcow2 \
    -device virtio-blk-pci,bus=pci.0,addr=0x7,id=test1,drive=drive-virtio-disk0 \
    -netdev tap,vhost=on,script=/etc/qemu-ifup,id=hostnet0 \
    -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:0b:02:81,bus=pci.0,addr=0x3 \
    -chardev pty,id=charserial0 \
    -device isa-serial,chardev=charserial0,id=serial0 \
    -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channel/target/rhel7cp1.0.org.qemu.guest_agent.0,server,nowait \
    -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \
    -chardev spicevmc,id=charchannel1,name=vdagent \
    -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=com.redhat.spice.0 \
    -spice port=5901,disable-ticketing,seamless-migration=on \
    -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vgamem_mb=16,bus=pci.0,addr=0x2 \
    -device intel-hda,id=sound0,bus=pci.0,addr=0x4 \
    -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 \
    -msg timestamp=on \
    -monitor stdio \
    -qmp unix:/tmp/q1,server,nowait \
    -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 \
    -drive file=/mnt/test.qcow2,snapshot=off,cache=none,if=none,id=drive-virtio-disk1,format=qcow2 \
    -device virtio-scsi-pci,bus=pci.0,addr=0xe,id=scsi1 \
    -device scsi-hd,drive=drive-virtio-disk1,bus=scsi1.0

2. Migrate the guest:
(qemu) migrate -d tcp:0:4444

3. Trigger ENOSPC on the SCSI disk:
(qemu) block I/O error in device 'drive-virtio-disk1': No space left on device (28)

Result:
After migration, the destination QEMU crashed:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff0a9ac50 in __memcpy_ssse3 () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff0a9ac50 in __memcpy_ssse3 () from /lib64/libc.so.6
#1  0x00005555558418a6 in memcpy (__len=51, __src=<optimized out>, __dest=<optimized out>)
    at /usr/include/bits/string3.h:51
#2  iov_to_buf (iov=iov@entry=0x555556722f70, iov_cnt=<optimized out>, offset=<optimized out>, offset@entry=0, 
    buf=buf@entry=0x555556726f94, bytes=bytes@entry=51) at util/iov.c:49
#3  0x0000555555636287 in virtio_scsi_parse_req (req=req@entry=0x55555671af10, req_size=51, resp_size=108)
    at /usr/src/debug/qemu-2.1.2/hw/scsi/virtio-scsi.c:152
#4  0x0000555555636450 in virtio_scsi_load_request (f=0x55555657ff50, sreq=0x555556569020)
    at /usr/src/debug/qemu-2.1.2/hw/scsi/virtio-scsi.c:243
#5  0x000055555578e0fa in get_scsi_requests (f=0x55555657ff50, pv=0x55555647c210, size=<optimized out>)
    at hw/scsi/scsi-bus.c:1905
#6  0x00005555556cf518 in vmstate_load_state (f=f@entry=0x55555657ff50, 
    vmsd=0x555555c107c0 <vmstate_scsi_device>, opaque=0x55555647c210, version_id=1) at vmstate.c:105
#7  0x00005555556cf4c4 in vmstate_load_state (f=0x55555657ff50, vmsd=0x555555c0f6c0 <vmstate_scsi_disk_state>, 
    opaque=0x55555647c210, version_id=1) at vmstate.c:102
#8  0x00005555556183aa in qemu_loadvm_state (f=f@entry=0x55555657ff50) at /usr/src/debug/qemu-2.1.2/savevm.c:1008
#9  0x00005555556cda86 in process_incoming_migration_co (opaque=0x55555657ff50) at migration.c:97
#10 0x000055555580007a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>)
    at coroutine-ucontext.c:118
#11 0x00007ffff09a00f0 in ?? () from /lib64/libc.so.6
#12 0x00007fffffffcaf0 in ?? ()
#13 0x0000000000000000 in ?? ()


However, I cannot reproduce this bug with qemu-kvm-1.5.3-86.el7.x86_64 or qemu-kvm-1.5.3-82.el7.x86_64.

I also tested with the latest qemu-kvm build, qemu-kvm-1.5.3-100.el7.x86_64, using the same steps as above.

Result:

After migration, QEMU is paused with status io-error and does not crash.
(qemu) info status
VM status: paused (io-error)
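
The same state can also be read over the QMP socket that the command line above opens with `-qmp unix:/tmp/q1,server,nowait`. The following is a minimal Python sketch, assuming the usual QMP exchange (server greeting, `qmp_capabilities`, then `query-status`). The helper names `qmp_command`, `vm_status` and `status_from_socket` are illustrative only and are not part of QEMU.

```python
import json
import socket


def qmp_command(stream, name):
    """Send one QMP command and return the value of its "return" key."""
    stream.write((json.dumps({"execute": name}) + "\n").encode())
    stream.flush()
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("QMP connection closed")
        msg = json.loads(line)
        if "return" in msg:
            return msg["return"]
        if "error" in msg:
            raise RuntimeError(msg["error"].get("desc", "QMP error"))
        # Anything else (the greeting, asynchronous events) is skipped.


def vm_status(sock):
    """Negotiate capabilities, then return the run state as a string."""
    with sock.makefile("rwb") as stream:
        qmp_command(stream, "qmp_capabilities")
        return qmp_command(stream, "query-status")["status"]


def status_from_socket(path):
    """Connect to a QMP UNIX socket (for example /tmp/q1) and query it."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(path)
        return vm_status(s)
```

With the guest from this comment running, `status_from_socket("/tmp/q1")` would be expected to return `"io-error"` in the state shown above.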

Hi Juzhang,

To summarize: with the same steps and environment (images and hosts), this bug reproduces on the unfixed qemu-kvm-rhev build but not on the unfixed qemu-kvm builds.

Neither the fixed qemu-kvm nor the fixed qemu-kvm-rhev shows the issue now. Do you think we can set this bug as verified?

Thanks
Comment 4 juzhang 2015-08-13 03:34:38 EDT
Good question, and a strange result. I need to ask you two questions first, since some issues cannot be reproduced every time.
1. How many times did you try with the qemu-kvm-rhev build? Is it reproduced 100% of the time?
2. How many times did you try with the qemu-kvm build? Is it never reproduced?

If the answer to both is yes, the only reason I can think of is that this issue was originally found with qemu-kvm-rhev and might not be reproducible with qemu-kvm. In any case, we might need to set needinfo on Stefan and ask whether there is an efficient way for QE to reproduce and verify this issue.

Hope this is useful.

Best Regards,
Junyi
Comment 5 Qian Guo 2015-08-13 03:46:59 EDT
(In reply to juzhang from comment #4)
> Good question, and a strange result. I need to ask you two questions first,
> since some issues cannot be reproduced every time.
> 1. How many times did you try with the qemu-kvm-rhev build? Is it
> reproduced 100% of the time?

Yes, it is reproduced 100% of the time with the qemu-kvm-rhev build.

> 2. How many times did you try with the qemu-kvm build? Is it never
> reproduced?

I tried more than 5 times with the qemu-kvm-82 build, and it behaves the same as the fixed build.

I tried qemu-kvm-85 once and could not reproduce it.

> If the answer to both is yes, the only reason I can think of is that this
> issue was originally found with qemu-kvm-rhev and might not be reproducible
> with qemu-kvm. In any case, we might need to set needinfo on Stefan and ask
> whether there is an efficient way for QE to reproduce and verify this issue.

OK, agreed.
Thanks :)
Comment 6 Qian Guo 2015-08-13 03:48:52 EDT
Hi Stefan,

Could you check comment 3? Do you have any suggestions for how QE can reproduce and verify this issue?

Thanks,
qian
Comment 7 Stefan Hajnoczi 2015-09-08 13:31:11 EDT
(In reply to Qian Guo from comment #6)
> Could you check comment 3? Do you have any suggestions for how QE can
> reproduce and verify this issue?

At the source code level, the bug is present in qemu-kvm versions before qemu-kvm-1.5.3-99.el7.  The fixed version is needed.
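
The version boundary stated here can be written as a small check. This is a simplified Python sketch that only understands build strings from the qemu-kvm 1.5.3 el7 stream quoted in this report. It is not a full RPM version comparison, and the name `has_fix` is made up for illustration.

```python
import re

# Release number taken from "Fixed In Version: qemu-kvm-1.5.3-99.el7".
FIXED_RELEASE = 99

# Matches e.g. "qemu-kvm-1.5.3-98.el7.x86_64" and captures the release (98).
_BUILD = re.compile(r"^qemu-kvm-1\.5\.3-(\d+)\.el7\S*$")


def has_fix(build):
    """Return True if a qemu-kvm 1.5.3 el7 build is at or past release 99.

    Raises ValueError for any other package stream (such as qemu-kvm-rhev),
    because this simplified check cannot compare those versions.
    """
    match = _BUILD.match(build)
    if match is None:
        raise ValueError("not a qemu-kvm-1.5.3 el7 build: %r" % build)
    return int(match.group(1)) >= FIXED_RELEASE


for build in ("qemu-kvm-1.5.3-98.el7.x86_64", "qemu-kvm-1.5.3-102.el7.x86_64"):
    print(build, "has fix" if has_fix(build) else "needs update")
```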

I reproduced your results where unfixed qemu-kvm doesn't crash.  This happens because there is another bug.  After live migration, the failed request does not re-execute on the destination host.

This BZ can be marked verified.  I will raise another BZ to describe the additional bug that you have found (it seems to have existed for a while).
Comment 8 juzhang 2015-09-08 22:19:09 EDT
(In reply to Stefan Hajnoczi from comment #7)
Thanks for the reply. Could you share that BZ ID with QE?

Best Regards,
Junyi
Comment 9 Stefan Hajnoczi 2015-09-09 04:22:16 EDT
(In reply to juzhang from comment #8)
> Thanks for the reply. Could you share that BZ ID with QE?

I investigated further this morning and was able to reproduce the segfault with qemu-kvm-1.5.3-98.el7 after all.

Yesterday I wasn't able to reproduce the crash because debugging code I added made QEMU misbehave.  That was my mistake.

I have not raised a new BZ since there is no new issue to look into.

Here are the steps to reproduce the crash:

1. Ensure the source QEMU sees ENOSPC so the request fails and the guest pauses.  I use the blkdebug feature to simulate the error:

$ cat blkdebug.conf
[inject-error]
event = "read_aio"
errno = "28"
$ qemu-img create test.raw 1G
$ qemu-system-x86_64 -enable-kvm -m 1024 -cpu host \
                     -device virtio-scsi-pci \
                     -drive if=none,id=drive0,rerror=stop,file=blkdebug:blkdebug.conf:test.raw \
                     -device scsi-hd,drive=drive0

2. Launch the destination QEMU without blkdebug so I/O requests can complete:

$ gdb --args qemu-system-x86_64 -enable-kvm -m 1024 -cpu host \
                                -device virtio-scsi-pci \
                                -drive if=none,id=drive0,rerror=stop,file=test.raw,format=raw \
                                -device scsi-hd,drive=drive0 \
                                -incoming tcp::1234
(gdb) r

3. Migrate the VM between the QEMUs:

(source qemu) migrate tcp:127.0.0.1:1234

Destination QEMU should crash now:

Program received signal SIGSEGV, Segmentation fault.
0x000055555577a5ef in virtio_scsi_command_complete (r=0x5555565f1110, status=0, resid=0) at /home/stefanha/qemu-kvm/hw/scsi/virtio-scsi.c:316
316	    req->resp.cmd->response = VIRTIO_SCSI_S_OK;
(gdb) bt
#0  0x000055555577a5ef in virtio_scsi_command_complete (r=0x5555565f1110, status=0, resid=0) at /home/stefanha/qemu-kvm/hw/scsi/virtio-scsi.c:316
#1  0x0000555555696234 in scsi_req_complete (req=0x5555565f1110, status=<optimized out>) at hw/scsi/scsi-bus.c:1655
#2  0x0000555555699175 in scsi_dma_complete_noio (opaque=0x5555565f1110, ret=0) at hw/scsi/scsi-disk.c:276
#3  0x0000555555629c21 in dma_complete (dbs=dbs@entry=0x555556785240, ret=ret@entry=0) at dma-helpers.c:124
#4  0x0000555555629e42 in dma_bdrv_cb (opaque=0x555556785240, ret=0) at dma-helpers.c:152
#5  0x00005555555e223e in bdrv_co_em_bh (opaque=0x555556795c90) at block.c:4670
#6  0x00005555555d5967 in aio_bh_poll (ctx=ctx@entry=0x55555653f850) at async.c:81
#7  0x00005555555d5599 in aio_poll (ctx=0x55555653f850, blocking=blocking@entry=false) at aio-posix.c:185
#8  0x00005555555d5870 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at async.c:200
#9  0x00007ffff65b6a8a in g_main_context_dispatch (context=0x555556540a00) at gmain.c:3122
#10 0x00007ffff65b6a8a in g_main_context_dispatch (context=context@entry=0x555556540a00) at gmain.c:3737
#11 0x00005555556c334a in main_loop_wait () at main-loop.c:187
#12 0x00005555556c334a in main_loop_wait (timeout=<optimized out>) at main-loop.c:232
#13 0x00005555556c334a in main_loop_wait (nonblocking=<optimized out>) at main-loop.c:464
#14 0x00005555555d0c24 in main () at vl.c:1989
#15 0x00005555555d0c24 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4381

Note that the crash depends on the memory layout of the QEMU process.  If, by chance, the guest RAM is at the same location in both QEMU processes, then you will not see a crash.
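
One way to picture this note, as an illustration only: if a request restored on the destination still refers to a host address that was computed in the source process, that address is only meaningful when guest RAM happens to be mapped at the same base in both processes. The thread does not describe the actual fix, so the Python sketch below is a toy model with made-up base addresses and offsets, not QEMU code.

```python
def host_addr(ram_base, guest_offset):
    """Host virtual address of a guest RAM offset within one process."""
    return ram_base + guest_offset


def in_guest_ram(addr, ram_base, ram_size):
    """True if a host address falls inside this process's guest RAM."""
    return ram_base <= addr < ram_base + ram_size


RAM_SIZE = 1024 * 1024 * 1024   # -m 1024, as in the steps above
SRC_BASE = 0x7F0000000000       # hypothetical mapping in the source QEMU
DST_BASE = 0x7F4000000000       # hypothetical mapping in the destination QEMU
RESP_OFFSET = 0x1234000         # hypothetical guest offset of a response buffer

# Address computed on the source and carried over unchanged.
stale = host_addr(SRC_BASE, RESP_OFFSET)

# Address recomputed on the destination from the guest offset.
remapped = host_addr(DST_BASE, RESP_OFFSET)

print(in_guest_ram(stale, DST_BASE, RAM_SIZE))     # False: outside guest RAM
print(in_guest_ram(remapped, DST_BASE, RAM_SIZE))  # True
# If the destination had mapped RAM at SRC_BASE, the stale address would
# still be valid, which matches the "no crash by chance" case.
print(in_guest_ram(stale, SRC_BASE, RAM_SIZE))     # True
```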

This crash does not occur with qemu-kvm-1.5.3-99.el7 since it has the backported fix.

Please try verifying these steps again.
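
For repeated runs, the setup from steps 1 and 2 can be generated by a small script. The Python sketch below only writes the blkdebug configuration and builds the two command lines using the options quoted above. The function names and the `binary` parameter are illustrative, and the script does not launch QEMU or perform the migration itself.

```python
import shlex

# Same content as the blkdebug.conf shown in step 1.
BLKDEBUG_CONF = '[inject-error]\nevent = "read_aio"\nerrno = "28"\n'


def write_blkdebug_conf(path="blkdebug.conf"):
    """Write the error-injection config and return its path."""
    with open(path, "w") as f:
        f.write(BLKDEBUG_CONF)
    return path


def qemu_argv(image, blkdebug_conf=None, incoming=None,
              binary="qemu-system-x86_64"):
    """Build the source (with blkdebug) or destination (with -incoming) argv."""
    if blkdebug_conf:
        # Source side: route I/O through blkdebug so reads fail with ENOSPC.
        drive = "if=none,id=drive0,rerror=stop,file=blkdebug:%s:%s" % (
            blkdebug_conf, image)
    else:
        # Destination side: plain raw image so I/O requests can complete.
        drive = "if=none,id=drive0,rerror=stop,file=%s,format=raw" % image
    argv = [binary, "-enable-kvm", "-m", "1024", "-cpu", "host",
            "-device", "virtio-scsi-pci",
            "-drive", drive,
            "-device", "scsi-hd,drive=drive0"]
    if incoming:
        argv += ["-incoming", incoming]
    return argv


def as_shell(argv):
    """Render an argv list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(arg) for arg in argv)


print(as_shell(qemu_argv("test.raw", blkdebug_conf="blkdebug.conf")))
print(as_shell(qemu_argv("test.raw", incoming="tcp::1234")))
```

The image itself still has to be created first with `qemu-img create test.raw 1G`, and the destination command can be prefixed with `gdb --args` as in step 2.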
Comment 10 Qian Guo 2015-09-10 02:08:40 EDT
Reproduced this bug with qemu-kvm-1.5.3-98.el7.x86_64.

Steps, as in comment 9:
# cat blkdebug.cfg 
[inject-error]
event = "read_aio"
errno = "28"

src:
# /usr/libexec/qemu-kvm -enable-kvm -m 1024 -cpu host \
    -device virtio-scsi-pci \
    -drive if=none,id=drive0,rerror=stop,file=blkdebug:blkdebug.cfg:test.raw \
    -device scsi-hd,drive=drive0 \
    -monitor stdio
QEMU 1.5.3 monitor - type 'help' for more information
(qemu) VNC server running on `::1:5900'
block I/O error in device 'drive0': No space left on device (28)



dst:
# gdb --args /usr/libexec/qemu-kvm -enable-kvm -m 1024 -cpu host \
    -device virtio-scsi-pci \
    -drive if=none,id=drive0,rerror=stop,file=test.raw \
    -device scsi-hd,drive=drive0 \
    -monitor stdio \
    -incoming tcp:0:4444

(gdb) r


After migration, the destination crashed:
Program received signal SIGSEGV, Segmentation fault.
0x000055555575986f in virtio_scsi_command_complete (r=0x555556ce4000, status=0, resid=0)
    at /usr/src/debug/qemu-1.5.3/hw/scsi/virtio-scsi.c:316
316	    req->resp.cmd->response = VIRTIO_SCSI_S_OK;

(gdb) bt
#0  0x000055555575986f in virtio_scsi_command_complete (r=0x555556ce4000, status=0, resid=0)
    at /usr/src/debug/qemu-1.5.3/hw/scsi/virtio-scsi.c:316
#1  0x000055555567cc52 in scsi_req_complete (req=0x555556ce4000, status=<optimized out>)
    at hw/scsi/scsi-bus.c:1655
#2  0x000055555567f085 in scsi_dma_complete_noio (opaque=0x555556ce4000, ret=0) at hw/scsi/scsi-disk.c:276
#3  0x000055555561fed2 in dma_complete (dbs=0x555557511820, ret=0) at dma-helpers.c:124
#4  0x00005555556200e2 in dma_bdrv_cb (opaque=0x555557511820, ret=0) at dma-helpers.c:152
#5  0x00005555555db3be in bdrv_co_em_bh (opaque=0x555556cf0030) at block.c:4670
#6  0x00005555555cdf77 in aio_bh_poll (ctx=ctx@entry=0x555556d0e000) at async.c:81
#7  0x00005555555cdbc8 in aio_poll (ctx=0x555556d0e000, blocking=blocking@entry=false) at aio-posix.c:185
#8  0x00005555555cde80 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, 
    user_data=<optimized out>) at async.c:200
#9  0x00007ffff635579a in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#10 0x00005555556aa0da in glib_pollfds_poll () at main-loop.c:187
#11 os_host_main_loop_wait (timeout=<optimized out>) at main-loop.c:232
#12 main_loop_wait (nonblocking=<optimized out>) at main-loop.c:464
#13 0x00005555555c99a0 in main_loop () at vl.c:1989
#14 main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4381


So this bug is reproduced.

Verified this bug with qemu-kvm-1.5.3-102.el7.x86_64.

Steps: same as above.

Result: after migration, the destination works well:
(qemu) info status
VM status: running



So this bug is fixed.
Comment 11 juzhang 2015-09-10 22:28:34 EDT
Thanks all. Setting this issue as verified.
Comment 13 errata-xmlrpc 2015-11-19 00:12:11 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2213.html
