Bug 1048575

Summary: Segmentation fault occurs after migrating a guest (using a SCSI disk, under stress) to the destination machine
Product: Red Hat Enterprise Linux 7
Reporter: langfang <flang>
Component: qemu-kvm
Assignee: Kevin Wolf <kwolf>
Status: CLOSED CURRENTRELEASE
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Docs Contact:
Priority: high
Version: 7.0
CC: acathrow, coli, flang, hhuang, juzhang, knoel, kwolf, lijin, qiguo, qzhang, sluo, virt-maint, xfu, zhzhang
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: qemu-kvm-1.5.3-54.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-06-13 12:46:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1076185

Attachments:
- Proof-of-concept forward port of the RHEL 6 patch (flags: none)
- Test script for upstream (flags: none)

Description langfang 2014-01-05 11:20:57 UTC
Description of problem:
Segmentation fault occurs after migrating a guest (using a SCSI disk, under stress) to the destination machine.
 
Version-Release number of selected component (if applicable):
Host A and Host B:
# uname -r
3.10.0-64.el7.x86_64
# rpm -q qemu-kvm-rhev
qemu-kvm-rhev-1.5.3-30.el7.x86_64

Guest:
3.10.0-64.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Boot the guest with a SCSI disk on the source machine:
 /usr/libexec/qemu-kvm -cpu Penryn -m 2G -smp 2,sockets=1,cores=2,threads=1,maxvcpu=8 -enable-kvm -device piix3-usb-uhci,id=usb -name rhel7 -nodefaults -nodefconfig -device virtio-balloon-pci,id=balloon0,addr=0x6 -spice port=5800,disable-ticketing -vga qxl -global qxl-vga.vram_size=67108864 -global qxl-vga.revision=3 -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -monitor stdio  -drive file=/mnt/rhel7-newtree.qcow2-new_v3,if=none,media=disk,format=qcow2,rerror=stop,werror=stop,aio=native,id=scsi-disk0 -device virtio-scsi-pci,id=bus2,addr=0x8 -device scsi-hd,bus=bus2.0,drive=scsi-disk0,id=disk0 -device intel-hda,id=sound0,bus=pci.0,addr=0x9 -device hda-duplex -qmp tcp:0:4446,server,nowait -serial unix:/tmp/tty0,server,nowait

(qemu) client_migrate_info spice 10.66.5.251 5900
2. Boot the destination in listening mode:

...-incoming tcp:0:5999

3. Add stress in the guest:
# dd if=/dev/zero of=lang.txt

4. Migrate the guest to the destination machine:

{"execute": "migrate","arguments":{"uri": "tcp:10.66.5.251:5999"}}
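For convenience, the QMP exchange above can be scripted. The following is a minimal sketch, not part of the original report; it assumes the QMP socket from the step 1 command line (`-qmp tcp:0:4446,server,nowait`), and the host and destination URI are this report's values, to be substituted for your own:

```python
import json
import socket

def qmp_command(execute, **arguments):
    """Serialize one QMP command as a JSON line."""
    cmd = {"execute": execute}
    if arguments:
        cmd["arguments"] = arguments
    return json.dumps(cmd)

def start_migration(qmp_host="127.0.0.1", qmp_port=4446,
                    dest_uri="tcp:10.66.5.251:5999"):
    """Connect to the source VM's QMP socket and kick off migration.
    Host/port/URI defaults mirror the setup in this report."""
    with socket.create_connection((qmp_host, qmp_port)) as sock:
        f = sock.makefile("rw", encoding="utf-8")
        f.readline()  # QMP greeting banner
        # QMP requires leaving capabilities-negotiation mode first.
        f.write(qmp_command("qmp_capabilities") + "\n")
        f.flush()
        f.readline()  # {"return": {}}
        f.write(qmp_command("migrate", uri=dest_uri) + "\n")
        f.flush()
        return f.readline()  # reply to the migrate command
```

Progress can then be polled with `query-migrate` over the same connection.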

Actual results: after the migration finishes
Src:
(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off auto-converge: off zero-blocks: off 
Migration status: completed
total time: 35278 milliseconds
downtime: 130 milliseconds
setup: 43 milliseconds
transferred ram: 1156642 kbytes
throughput: 268.73 mbps
remaining ram: 0 kbytes
total ram: 2228576 kbytes
duplicate: 725521 pages
skipped: 0 pages
normal: 287005 pages
normal bytes: 1148020 kbytes
(qemu) info status
VM status: paused (postmigrate)


Des:
...
[New Thread 0x7fffeaeb4700 (LWP 22403)]
red_dispatcher_set_cursor_peer: 
inputs_connect: inputs channel client create
qcow2: Preventing invalid write on metadata (overlaps with refcount block); image marked as corrupt.
block I/O error in device 'scsi-disk0': Input/output error (5)
[New Thread 0x7fff525fe700 (LWP 22409)]
[New Thread 0x7fff51dfd700 (LWP 22410)]
[New Thread 0x7fff515fc700 (LWP 22411)]
[New Thread 0x7fff50dfb700 (LWP 22412)]
[New Thread 0x7fff43fff700 (LWP 22413)]
[New Thread 0x7fff437fe700 (LWP 22414)]
[New Thread 0x7fff42ffd700 (LWP 22415)]

Program received signal SIGSEGV, Segmentation fault.
0x000055555562c097 in copy_sectors (n_end=<optimized out>, n_start=1008, 
    cluster_offset=<optimized out>, start_sect=<optimized out>, 
    bs=0x555556549130) at block/qcow2-cluster.c:377
377	    ret = bs->drv->bdrv_co_readv(bs, start_sect + n_start, n, &qiov);
(gdb) bt
#0  0x000055555562c097 in copy_sectors (n_end=<optimized out>, n_start=1008, 
    cluster_offset=<optimized out>, start_sect=<optimized out>, 
    bs=0x555556549130) at block/qcow2-cluster.c:377
#1  perform_cow (bs=bs@entry=0x555556549130, r=r@entry=0x55555675a610, 
    m=0x55555675a5d0, m=0x55555675a5d0) at block/qcow2-cluster.c:664
#2  0x000055555562c60d in qcow2_alloc_cluster_link_l2 (
    bs=bs@entry=0x555556549130, m=0x55555675a5d0) at block/qcow2-cluster.c:701
#3  0x0000555555632128 in qcow2_co_writev (bs=0x555556549130, 
    sector_num=9092992, remaining_sectors=1008, qiov=0x55555659ae88)
    at block/qcow2.c:1075
#4  0x000055555561ad32 in bdrv_co_do_writev (bs=0x555556549130, 
    sector_num=9092992, nb_sectors=1008, qiov=0x55555659ae88, 
    flags=flags@entry=(unknown: 0)) at block.c:2797
#5  0x000055555561b628 in bdrv_co_do_rw (opaque=0x55555659c0e0) at block.c:4061
#6  0x000055555565250a in coroutine_trampoline (i0=<optimized out>, 
    i1=<optimized out>) at coroutine-ucontext.c:118
#7  0x00007ffff2cc44f0 in ?? () from /lib64/libc.so.6
#8  0x00007fffffffd430 in ?? ()
#9  0x0000000000000000 in ?? ()
(gdb) 


Expected results:
Migration succeeds.

Additional info:

Comment 2 Juan Quintela 2014-02-11 23:14:47 UTC
*** Bug 1053432 has been marked as a duplicate of this bug. ***

Comment 3 Kevin Wolf 2014-02-12 08:52:41 UTC
Can you run qemu-img check on the image? This could give us a hint if there is a
real on-disk corruption or if the qcow2 driver was operating on stale data in
memory.

Comment 4 Juan Quintela 2014-02-12 09:39:02 UTC
Please, could you:
- try to reproduce with cache=none
- post your NFS server & client configuration?

Thanks

Comment 5 langfang 2014-02-17 10:19:36 UTC
In reply to comment 3 and comment 4:

1)NFS configuration:
Src machine:
# cat /etc/exports
/home *(rw,no_root_squash)
#service nfs start

#mount -o hard,rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,proto=tcp,timeo=600,retrans=2,sec=sys 10.66.6.14:/home/test /mnt

Des machine:
#mount -o hard,rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,proto=tcp,timeo=600,retrans=2,sec=sys 10.66.6.14:/home/test /mnt

2) After hitting the problem, check the image with qemu-img check:
..
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182010000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182020000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000001823d0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000001823e0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000001823f0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182400000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182410000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182420000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182430000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182030000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182040000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182050000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182060000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182070000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182080000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182090000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000001820a0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000001820b0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000001820c0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000001820d0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000001820e0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000001820f0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182100000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182110000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182120000 refcount=0

1661 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.
99119/327680 = 30.25% allocated, 4.95% fragmented, 0.00% compressed clusters
Image end offset: 6497435648
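As an aside, check output like the above lends itself to mechanical triage. A small illustrative sketch (not from the report; the sample lines are copied from the check output above) that tallies the OFLAG_COPIED inconsistencies:

```python
import re

# Three sample lines copied from the `qemu-img check` output above.
CHECK_OUTPUT = """\
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182010000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182020000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000001823d0000 refcount=0
"""

OFLAG_COPIED = re.compile(
    r"ERROR OFLAG_COPIED data cluster: l2_entry=([0-9a-f]+) refcount=(\d+)")

def count_oflag_copied_errors(text):
    """Return (error count, list of offending L2 entries) from check output."""
    matches = OFLAG_COPIED.findall(text)
    return len(matches), [int(entry, 16) for entry, _refcount in matches]

# Each entry has bit 63 (the "copied" flag) set while the refcount is 0:
# the L2 table claims exclusive ownership of a cluster that the refcount
# table says is free -- the inconsistency behind the corruption report.
```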


Also test on new version:
Host A(Src)
# uname -r
3.10.0-88.el7.x86_64
# rpm -q qemu-kvm-rhev
qemu-kvm-rhev-1.5.3-47.el7.x86_64
# rpm -q seabios
seabios-1.7.2.2-11.el7.x86_64

Host B(Des)
# uname -r
3.10.0-86.el7.x86_64
# rpm -q qemu-kvm-rhev
qemu-kvm-rhev-1.5.3-45.el7.x86_64

Steps same as comment 0, but boot the guest with "cache=none":

...-drive file=/mnt/RHEL-Server-7.0-64-virtio.qcow2,if=none,media=disk,format=qcow2,rerror=stop,werror=stop,aio=native,id=scsi-disk0,cache=none -device virtio-scsi-pci,id=bus2,addr=0x8 -device scsi-hd,bus=bus2.0,drive=scsi-disk0,id=disk0...

Result: tried 10 times, hit this problem 2 times; not 100% reproducible.

Comment 6 Mike Cao 2014-02-20 08:51:41 UTC
*** Bug 1067319 has been marked as a duplicate of this bug. ***

Comment 7 Kevin Wolf 2014-03-07 22:22:44 UTC
For me, the following sequence is a 100% reproducer:

1. Create a fresh qcow2 image
2. Start the qemu process with -incoming first
3. Start a guest OS installation (Win 7 in my case, but it shouldn't matter)
4. After letting it copy some data to the hard disk, migrate

This results in the destination seeing a corrupted image. The corruption
prevention patches detect this and close the image. Because of a second bug,
this leads to a NULL pointer dereference, i.e. a segfault (this is the same as
in the original report):

qcow2: Preventing invalid write on metadata (overlaps with refcount block); image marked as corrupt.
block I/O error in device 'virtio0': Input/output error (5)
Program received signal SIGSEGV, Segmentation fault.

Comment 8 Kevin Wolf 2014-03-07 22:35:05 UTC
Created attachment 872040 [details]
Proof-of-concept forward port of the RHEL 6 patch

An experimental forward port of the RHEL 6 patch that reopens all images after
an incoming migration has completed fixes the problem for me for some unknown
reason. I'm attaching the corresponding patch file.

Comment 9 Kevin Wolf 2014-03-07 22:45:14 UTC
Created attachment 872045 [details]
Test script for upstream

Also, for upstream qemu there is an easier way to trigger the condition. The
attached shell script starts two VMs and creates the problematic situation using
the 'qemu-io' HMP command (this command is the reason why the script does _not_
work on RHEL 7; it has not been backported).

You have reproduced the bug if the output of the script contains a line like:
qcow2: Preventing invalid write on metadata (overlaps with active L2 table); image marked as corrupt.
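When running the script repeatedly, that success criterion can be checked mechanically. A trivial sketch (not part of the attachment; the marker strings are taken verbatim from the messages quoted in this report):

```python
# Overlap-check messages quoted elsewhere in this report; either one
# indicates the qcow2 corruption was reproduced.
CORRUPTION_MARKERS = (
    "Preventing invalid write on metadata (overlaps with active L2 table)",
    "Preventing invalid write on metadata (overlaps with refcount block)",
)

def reproduced(log_text):
    """Return True if the qemu output shows the qcow2 overlap check firing."""
    return any(marker in log_text for marker in CORRUPTION_MARKERS)
```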

Comment 10 Kevin Wolf 2014-03-10 10:10:33 UTC
(In reply to Kevin Wolf from comment #8)
> Created attachment 872040 [details]
> Proof-of-concept forward port of the RHEL 6 patch
>
> An experimental forward port of the RHEL 6 patch that reopens all images
> after an incoming migration has completed fixes the problem for me for some
> unknown reason. I'm attaching the corresponding patch file.

As requested by Amit, I did a Brew build with this patch applied:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=7174667

Comment 14 Kevin Wolf 2014-03-10 13:49:10 UTC
Just wanted to mention that my forward port is definitely not more than a
proof of concept. When using it to work around this problem while reproducing
something else, I noticed that I/O throttling settings get lost when the migration
completes.

Comment 15 langfang 2014-03-11 03:06:01 UTC
(In reply to Kevin Wolf from comment #10)
> (In reply to Kevin Wolf from comment #8)
> > Created attachment 872040 [details]
> > Proof-of-concept forward port of the RHEL 6 patch
> >
> > An experimental forward port of the RHEL 6 patch that reopens all images
> > after an incoming migration has completed fixes the problem for me for some
> > unknown reason. I'm attaching the corresponding patch file.
> 
> As requested by Amit, I did a Brew build with this patch applied:
> http://brewweb.devel.redhat.com/brew/taskinfo?taskID=7174667


Tested this build; did not hit the problem anymore.

Version:
Host:
# uname -r
3.10.0-98.el7.x86_64
# rpm -q qemu-kvm
qemu-kvm-1.5.3-52.el7.migration_reopen.x86_64


Steps same as comment 0.

Results:

Tried more than 15 times without hitting the core dump. Migration finished successfully, and after migration the guest works well.

Comment 16 Kevin Wolf 2014-03-11 10:47:53 UTC
I prepared a new Brew build which uses a different, upstreamable patch. It
should solve the problem as well, without falling back to the downstream-only
implementation of RHEL 6:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=7186243

Can you check whether it fixes the problem as well?

Comment 19 Miroslav Rezanina 2014-03-19 10:35:03 UTC
Fix included in qemu-kvm-1.5.3-54.el7

Comment 21 Sibiao Luo 2014-03-21 02:36:06 UTC
Reproduced this issue with the same steps as comment #0.
host info:
# uname -r && rpm -q qemu-kvm
3.10.0-113.el7.x86_64
qemu-kvm-1.5.3-53.el7.x86_64

Steps:
comment #0.

Results:
Migration fails with a QEMU segmentation fault (core dumped) on the destination.
(qemu) qcow2: Preventing invalid write on metadata (overlaps with refcount block); image marked as corrupt.
block I/O error in device 'scsi-disk0': Input/output error (5)
Segmentation fault (core dumped)

-------------------------------------------------------

Verified this issue with the same steps as comment #0.
host info:
# uname -r && rpm -q qemu-kvm
3.10.0-113.el7.x86_64
qemu-kvm-1.5.3-55.el7.x86_64

Steps:
comment #0.

Results:
Migration finishes successfully without any core dump; the VM is in running status normally.
(qemu) info status 
VM status: running

Based on the above, this issue has been fixed correctly; moving to VERIFIED status.

Best Regards,
sluo

Comment 25 Kevin Wolf 2014-04-01 15:39:53 UTC
*** Bug 1081326 has been marked as a duplicate of this bug. ***

Comment 26 Kevin Wolf 2014-04-02 11:00:13 UTC
*** Bug 971214 has been marked as a duplicate of this bug. ***

Comment 27 Ludek Smid 2014-06-13 12:46:25 UTC
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.

Comment 29 Kevin Wolf 2014-07-15 13:31:46 UTC
*** Bug 1074992 has been marked as a duplicate of this bug. ***