Bug 1048575

| Field | Value |
|---|---|
| Summary: | Segmentation fault occurs after migrate guest(use scsi disk and add stress) to des machine |
| Product: | Red Hat Enterprise Linux 7 |
| Component: | qemu-kvm |
| Version: | 7.0 |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | high |
| Priority: | high |
| Reporter: | langfang <flang> |
| Assignee: | Kevin Wolf <kwolf> |
| QA Contact: | Virtualization Bugs <virt-bugs> |
| CC: | acathrow, coli, flang, hhuang, juzhang, knoel, kwolf, lijin, qiguo, qzhang, sluo, virt-maint, xfu, zhzhang |
| Target Milestone: | rc |
| Target Release: | --- |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Fixed In Version: | qemu-kvm-1.5.3-54.el7 |
| Doc Type: | Bug Fix |
| Story Points: | --- |
| Last Closed: | 2014-06-13 12:46:25 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| Category: | --- |
| oVirt Team: | --- |
| Cloudforms Team: | --- |
| Bug Blocks: | 1076185 |

*** Bug 1053432 has been marked as a duplicate of this bug. ***

Can you run qemu-img check on the image? This could give us a hint if there is a real on-disk corruption or if the qcow2 driver was operating on stale data in memory.

Please, could you:
- try to reproduce with cache=none
- post your NFS server & client configuration?

Thanks

Reply to Comment 4 and Comment 3

1) NFS configuration:

Src machine:

    # cat /etc/exports
    /home *(rw,no_root_squash)
    # service nfs start
    # mount -o hard,rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,proto=tcp,timeo=600,retrans=2,sec=sys 10.66.6.14:/home/test /mnt

Des machine:

    # mount -o hard,rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,proto=tcp,timeo=600,retrans=2,sec=sys 10.66.6.14:/home/test /mnt

2) After hitting the problem, checked the image:

    ..
    ERROR OFLAG_COPIED data cluster: l2_entry=8000000182010000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=8000000182020000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=80000001823d0000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=80000001823e0000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=80000001823f0000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=8000000182400000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=8000000182410000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=8000000182420000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=8000000182430000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=8000000182030000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=8000000182040000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=8000000182050000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=8000000182060000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=8000000182070000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=8000000182080000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=8000000182090000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=80000001820a0000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=80000001820b0000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=80000001820c0000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=80000001820d0000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=80000001820e0000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=80000001820f0000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=8000000182100000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=8000000182110000 refcount=0
    ERROR OFLAG_COPIED data cluster: l2_entry=8000000182120000 refcount=0

    1661 errors were found on the image.
    Data may be corrupted, or further writes to the image may corrupt it.
    99119/327680 = 30.25% allocated, 4.95% fragmented, 0.00% compressed clusters
    Image end offset: 6497435648

Also tested on a newer version:

Host A (Src):

    # uname -r
    3.10.0-88.el7.x86_64
    # rpm -q qemu-kvm-rhev
    qemu-kvm-rhev-1.5.3-47.el7.x86_64
    # rpm -q seabios
    seabios-1.7.2.2-11.el7.x86_64

Host B (Des):

    # uname -r
    3.10.0-86.el7.x86_64
    # rpm -q qemu-kvm-rhev
    qemu-kvm-rhev-1.5.3-45.el7.x86_64

Steps are the same as comment 0, but boot the guest with "cache=none":

    ...-drive file=/mnt/RHEL-Server-7.0-64-virtio.qcow2,if=none,media=disk,format=qcow2,rerror=stop,werror=stop,aio=native,id=scsi-disk0,cache=none -device virtio-scsi-pci,id=bus2,addr=0x8 -device scsi-hd,bus=bus2.0,drive=scsi-disk0,id=disk0...

Result: Tried 10 times, hit this problem 2 times; not 100% reproducible.

*** Bug 1067319 has been marked as a duplicate of this bug. ***

For me, the following sequence is a 100% reproducer:

1. Create a fresh qcow2 image
2. Start the qemu process with -incoming first
3. Start a guest OS installation (Win 7 in my case, but shouldn't matter)
4. After letting it copy some data to the hard disk, migrate

This results in the destination seeing a corrupted image. The corruption prevention patches detect this and close the image. Because of a second bug, this leads to a NULL pointer dereference, i.e. segfault (this is the same as in the original report):

    qcow2: Preventing invalid write on metadata (overlaps with refcount block); image marked as corrupt.
    block I/O error in device 'virtio0': Eingabe-/Ausgabefehler (5)
    Program received signal SIGSEGV, Segmentation fault.

Created attachment 872040
Proof-of-concept forward port of the RHEL 6 patch
An experimental forward port of the RHEL 6 patch that reopens all images after
an incoming migration has completed fixes the problem for me for some unknown
reason. I'm attaching the corresponding patch file.
Created attachment 872045
Test script for upstream
Also, for upstream qemu there is an easier way to trigger the condition. The
attached shell script starts two VMs and creates the problematic situation using
the 'qemu-io' HMP command (this command is the reason why the script does _not_
work on RHEL 7, it has not been backported).
You have reproduced the bug if the output of the script contains a line like:
qcow2: Preventing invalid write on metadata (overlaps with active L2 table); image marked as corrupt.
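
The check described above could be automated by scanning the captured script output for that message. The following is a minimal Python sketch and is not part of the attached script; the function name `find_corruption_reports` and the sample text are illustrative assumptions, and the pattern is based only on the two message variants quoted in this report ("refcount block" and "active L2 table").

```python
import re

# Matches the qcow2 corruption-prevention message quoted in this report and
# captures which metadata structure the rejected write overlapped with.
CORRUPT_RE = re.compile(
    r"qcow2: Preventing invalid write on metadata "
    r"\(overlaps with (?P<what>[^)]+)\); image marked as corrupt\."
)


def find_corruption_reports(log_text):
    """Return the overlapped metadata structures named in the log, in order."""
    return [m.group("what") for m in CORRUPT_RE.finditer(log_text)]


# Hypothetical captured output, assembled from lines quoted in this report.
sample = (
    "qcow2: Preventing invalid write on metadata "
    "(overlaps with active L2 table); image marked as corrupt.\n"
    "block I/O error in device 'scsi-disk0': Input/output error (5)\n"
)

hits = find_corruption_reports(sample)
print("reproduced" if hits else "not reproduced", hits)
```

An empty result only means the message was absent from the captured text; it does not prove the image is consistent.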
(In reply to Kevin Wolf from comment #8)
> Created attachment 872040
> Proof-of-concept forward port of the RHEL 6 patch
>
> An experimental forward port of the RHEL 6 patch that reopens all images
> after an incoming migration has completed fixes the problem for me for some
> unknown reason. I'm attaching the corresponding patch file.

As requested by Amit, I did a Brew build with this patch applied:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=7174667

Just wanted to mention that my forward port is definitely not more than a proof of concept. When using it to work around this problem while reproducing something else, I noticed that I/O throttling settings get lost when the migration completes.

(In reply to Kevin Wolf from comment #10)
> (In reply to Kevin Wolf from comment #8)
> > Created attachment 872040
> > Proof-of-concept forward port of the RHEL 6 patch
> >
> > An experimental forward port of the RHEL 6 patch that reopens all images
> > after an incoming migration has completed fixes the problem for me for some
> > unknown reason. I'm attaching the corresponding patch file.
>
> As requested by Amit, I did a Brew build with this patch applied:
> http://brewweb.devel.redhat.com/brew/taskinfo?taskID=7174667

Tested this build; did not hit the problem anymore.

Version:
Host:

    # uname -r
    3.10.0-98.el7.x86_64
    # rpm -q qemu-kvm
    qemu-kvm-1.5.3-52.el7.migration_reopen.x86_64

Steps: same as comment 0.

Results: Tried more than 15 times, did not hit the core dump problem. Migration finished successfully; after migration, the guest works well.

I prepared a new Brew build which uses a different, upstreamable patch. It should solve the problem as well, without falling back to the downstream-only implementation of RHEL 6:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=7186243

Can you try if it fixes the problem as well?

Fix included in qemu-kvm-1.5.3-54.el7

Reproduced this issue with the same steps as comment #0.

Host info:

    # uname -r && rpm -q qemu-kvm
    3.10.0-113.el7.x86_64
    qemu-kvm-1.5.3-53.el7.x86_64

Steps: comment #0.

Results: migration fails with a QEMU segmentation fault (core dumped) on the destination.

    (qemu) qcow2: Preventing invalid write on metadata (overlaps with refcount block); image marked as corrupt.
    block I/O error in device 'scsi-disk0': Input/output error (5)
    Segmentation fault (core dumped)

-------------------------------------------------------

Verified this issue with the same steps as comment #0.

Host info:

    # uname -r && rpm -q qemu-kvm
    3.10.0-113.el7.x86_64
    qemu-kvm-1.5.3-55.el7.x86_64

Steps: comment #0.

Results: migration finishes successfully without any core dump, and the VM is in running status normally.

    (qemu) info status
    VM status: running

Based on the above, this issue has been fixed correctly; moving to VERIFIED status.

Best Regards,
sluo

*** Bug 1081326 has been marked as a duplicate of this bug. ***

*** Bug 971214 has been marked as a duplicate of this bug. ***

This request was resolved in Red Hat Enterprise Linux 7.0. Contact your manager or support representative in case you have further questions about the request.

*** Bug 1074992 has been marked as a duplicate of this bug. ***

Description of problem:
Segmentation fault occurs after migrating a guest (using a scsi disk, with stress added) to the des machine.

Version-Release number of selected component (if applicable):
HostA and HostB:

    # uname -r
    3.10.0-64.el7.x86_64
    # rpm -q qemu-kvm-rhev
    qemu-kvm-rhev-1.5.3-30.el7.x86_64

Guest: 3.10.0-64.el7.x86_64

How reproducible:
100%

Steps to Reproduce:

1. Boot the guest using a scsi disk on the src machine

    /usr/libexec/qemu-kvm -cpu Penryn -m 2G \
      -smp 2,sockets=1,cores=2,threads=1,maxvcpu=8 -enable-kvm \
      -device piix3-usb-uhci,id=usb -name rhel7 -nodefaults -nodefconfig \
      -device virtio-balloon-pci,id=balloon0,addr=0x6 \
      -spice port=5800,disable-ticketing -vga qxl \
      -global qxl-vga.vram_size=67108864 -global qxl-vga.revision=3 \
      -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 \
      -monitor stdio \
      -drive file=/mnt/rhel7-newtree.qcow2-new_v3,if=none,media=disk,format=qcow2,rerror=stop,werror=stop,aio=native,id=scsi-disk0 \
      -device virtio-scsi-pci,id=bus2,addr=0x8 \
      -device scsi-hd,bus=bus2.0,drive=scsi-disk0,id=disk0 \
      -device intel-hda,id=sound0,bus=pci.0,addr=0x9 -device hda-duplex \
      -qmp tcp:0:4446,server,nowait -serial unix:/tmp/tty0,server,nowait

    (qemu) client_migrate_info spice 10.66.5.251 5900

2. Boot the destination in listen mode

    ...-incoming tcp:0:5999

3. Add stress in the guest

    # dd if=/dev/zero of=lang.txt

4. Migrate the guest to the des machine

    {"execute": "migrate","arguments":{"uri": "tcp:10.66.5.251:5999"}}

Actual results:
After the migration finished:

Src:

    (qemu) info migrate
    capabilities: xbzrle: off x-rdma-pin-all: off auto-converge: off zero-blocks: off
    Migration status: completed
    total time: 35278 milliseconds
    downtime: 130 milliseconds
    setup: 43 milliseconds
    transferred ram: 1156642 kbytes
    throughput: 268.73 mbps
    remaining ram: 0 kbytes
    total ram: 2228576 kbytes
    duplicate: 725521 pages
    skipped: 0 pages
    normal: 287005 pages
    normal bytes: 1148020 kbytes
    (qemu) info status
    VM status: paused (postmigrate)

Des:

    ...
    [New Thread 0x7fffeaeb4700 (LWP 22403)]
    red_dispatcher_set_cursor_peer:
    inputs_connect: inputs channel client create
    qcow2: Preventing invalid write on metadata (overlaps with refcount block); image marked as corrupt.
    block I/O error in device 'scsi-disk0': Input/output error (5)
    [New Thread 0x7fff525fe700 (LWP 22409)]
    [New Thread 0x7fff51dfd700 (LWP 22410)]
    [New Thread 0x7fff515fc700 (LWP 22411)]
    [New Thread 0x7fff50dfb700 (LWP 22412)]
    [New Thread 0x7fff43fff700 (LWP 22413)]
    [New Thread 0x7fff437fe700 (LWP 22414)]
    [New Thread 0x7fff42ffd700 (LWP 22415)]

    Program received signal SIGSEGV, Segmentation fault.
    0x000055555562c097 in copy_sectors (n_end=<optimized out>, n_start=1008, cluster_offset=<optimized out>, start_sect=<optimized out>, bs=0x555556549130) at block/qcow2-cluster.c:377
    377 ret = bs->drv->bdrv_co_readv(bs, start_sect + n_start, n, &qiov);
    (gdb) by
    Undefined command: "by". Try "help".
    (gdb) bt
    #0 0x000055555562c097 in copy_sectors (n_end=<optimized out>, n_start=1008, cluster_offset=<optimized out>, start_sect=<optimized out>, bs=0x555556549130) at block/qcow2-cluster.c:377
    #1 perform_cow (bs=bs@entry=0x555556549130, r=r@entry=0x55555675a610, m=0x55555675a5d0, m=0x55555675a5d0) at block/qcow2-cluster.c:664
    #2 0x000055555562c60d in qcow2_alloc_cluster_link_l2 (bs=bs@entry=0x555556549130, m=0x55555675a5d0) at block/qcow2-cluster.c:701
    #3 0x0000555555632128 in qcow2_co_writev (bs=0x555556549130, sector_num=9092992, remaining_sectors=1008, qiov=0x55555659ae88) at block/qcow2.c:1075
    #4 0x000055555561ad32 in bdrv_co_do_writev (bs=0x555556549130, sector_num=9092992, nb_sectors=1008, qiov=0x55555659ae88, flags=flags@entry=(unknown: 0)) at block.c:2797
    #5 0x000055555561b628 in bdrv_co_do_rw (opaque=0x55555659c0e0) at block.c:4061
    #6 0x000055555565250a in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at coroutine-ucontext.c:118
    #7 0x00007ffff2cc44f0 in ?? () from /lib64/libc.so.6
    #8 0x00007fffffffd430 in ?? ()
    #9 0x0000000000000000 in ?? ()
    (gdb)

Expected results:
Migration completes successfully.

Additional info:
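
As a reading aid for the `qemu-img check` errors quoted in the comments: in the qcow2 format, bit 63 of an L2 entry is the COPIED flag, which should be set only when the referenced cluster has a refcount of exactly 1, and bits 9-55 hold the host cluster offset. The sketch below decodes one of the reported entries under that layout. It is a minimal illustration, not code from qemu; the constant names are modeled on the format's flag names and the helper functions are hypothetical.

```python
# Bit layout of a standard (uncompressed) qcow2 L2 entry.
QCOW_OFLAG_COPIED = 1 << 63           # refcount of the cluster is exactly 1
QCOW_OFLAG_COMPRESSED = 1 << 62       # cluster is compressed
L2E_OFFSET_MASK = 0x00FFFFFFFFFFFE00  # bits 9-55: host cluster offset


def decode_l2_entry(entry):
    """Split a 64-bit L2 entry into its flags and host offset."""
    return {
        "copied": bool(entry & QCOW_OFLAG_COPIED),
        "compressed": bool(entry & QCOW_OFLAG_COMPRESSED),
        "host_offset": entry & L2E_OFFSET_MASK,
    }


def copied_flag_consistent(entry, refcount):
    """The COPIED flag must be set if and only if the refcount is exactly 1."""
    return decode_l2_entry(entry)["copied"] == (refcount == 1)


# First entry from the reported output: l2_entry=8000000182010000 refcount=0
entry = 0x8000000182010000
info = decode_l2_entry(entry)
print(hex(info["host_offset"]))                    # 0x182010000
print(copied_flag_consistent(entry, refcount=0))   # False
```

Under this reading, each reported line describes an L2 entry that claims exclusive ownership of a cluster which the refcount table records as unreferenced.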