Bug 1048575
Summary: | Segmentation fault occurs after migrate guest(use scsi disk and add stress) to des machine | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | langfang <flang>
Component: | qemu-kvm | Assignee: | Kevin Wolf <kwolf>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Virtualization Bugs <virt-bugs>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 7.0 | CC: | acathrow, coli, flang, hhuang, juzhang, knoel, kwolf, lijin, qiguo, qzhang, sluo, virt-maint, xfu, zhzhang
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | qemu-kvm-1.5.3-54.el7 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2014-06-13 12:46:25 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1076185 | |
Attachments: | | |
Description
langfang
2014-01-05 11:20:57 UTC
*** Bug 1053432 has been marked as a duplicate of this bug. ***

Can you run qemu-img check on the image? This could give us a hint whether there is real on-disk corruption or whether the qcow2 driver was operating on stale data in memory.

Please, could you:
- try to reproduce with cache=none
- post your NFS server & client configuration?
Thanks

Reply to comment 4 and comment 3:

1) NFS configuration:

Src machine:
# cat /etc/exports
/home *(rw,no_root_squash)
# service nfs start
# mount -o hard,rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,proto=tcp,timeo=600,retrans=2,sec=sys 10.66.6.14:/home/test /mnt

Des machine:
# mount -o hard,rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,proto=tcp,timeo=600,retrans=2,sec=sys 10.66.6.14:/home/test /mnt

2) After hitting the problem, checked the image:
..
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182010000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182020000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000001823d0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000001823e0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000001823f0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182400000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182410000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182420000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182430000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182030000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182040000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182050000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182060000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182070000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182080000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182090000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000001820a0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000001820b0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000001820c0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000001820d0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000001820e0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=80000001820f0000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182100000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182110000 refcount=0
ERROR OFLAG_COPIED data cluster: l2_entry=8000000182120000 refcount=0

1661 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.
99119/327680 = 30.25% allocated, 4.95% fragmented, 0.00% compressed clusters
Image end offset: 6497435648

Also tested on a new version:

Host A (Src)
# uname -r
3.10.0-88.el7.x86_64
# rpm -q qemu-kvm-rhev
qemu-kvm-rhev-1.5.3-47.el7.x86_64
# rpm -q seabios
seabios-1.7.2.2-11.el7.x86_64

Host B (Des)
# uname -r
3.10.0-86.el7.x86_64
# rpm -q qemu-kvm-rhev
qemu-kvm-rhev-1.5.3-45.el7.x86_64

Steps: same as comment 0, but boot the guest with "cache=none":
...-drive file=/mnt/RHEL-Server-7.0-64-virtio.qcow2,if=none,media=disk,format=qcow2,rerror=stop,werror=stop,aio=native,id=scsi-disk0,cache=none -device virtio-scsi-pci,id=bus2,addr=0x8 -device scsi-hd,bus=bus2.0,drive=scsi-disk0,id=disk0...

Result: tried 10 times, hit this problem 2 times; it does not reproduce 100% of the time.

*** Bug 1067319 has been marked as a duplicate of this bug. ***

For me, the following sequence is a 100% reproducer:
1. Create a fresh qcow2 image
2. Start the qemu process with -incoming first
3. Start a guest OS installation (Win 7 in my case, but shouldn't matter)
4. After letting it copy some data to the hard disk, migrate

This results in the destination seeing a corrupted image. The corruption prevention patches detect this and close the image. Because of a second bug, this leads to a NULL pointer dereference, i.e. a segfault (this is the same as in the original report):

qcow2: Preventing invalid write on metadata (overlaps with refcount block); image marked as corrupt.
block I/O error in device 'virtio0': Eingabe-/Ausgabefehler (5)
Program received signal SIGSEGV, Segmentation fault.

Created attachment 872040 [details]
Proof-of-concept forward port of the RHEL 6 patch
An experimental forward port of the RHEL 6 patch that reopens all images after
an incoming migration has completed fixes the problem for me for some unknown
reason. I'm attaching the corresponding patch file.
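For readers interpreting the `qemu-img check` output quoted earlier in this bug, the following is a minimal sketch of how the reported `l2_entry` values can be decoded. It assumes the standard qcow2 L2 entry layout for uncompressed clusters (bit 63 is the COPIED flag, bits 9-55 hold the host cluster offset); the function name `decode_check_errors` is hypothetical and not part of any QEMU tool.

```python
import re

# Bit layout assumed from the qcow2 format specification (standard clusters).
QCOW_OFLAG_COPIED = 1 << 63           # bit 63: refcount is claimed to be exactly one
L2E_OFFSET_MASK = 0x00FFFFFFFFFFFE00  # bits 9-55: host cluster offset

# Matches the "l2_entry=<hex> refcount=<n>" part of each reported error line.
ERROR_RE = re.compile(r"l2_entry=([0-9a-fA-F]+)\s+refcount=(\d+)")


def decode_check_errors(text):
    """Return (host_offset, copied_flag, refcount) for each reported L2 entry."""
    results = []
    for match in ERROR_RE.finditer(text):
        entry = int(match.group(1), 16)
        results.append((
            entry & L2E_OFFSET_MASK,
            bool(entry & QCOW_OFLAG_COPIED),
            int(match.group(2)),
        ))
    return results


sample = (
    "ERROR OFLAG_COPIED data cluster: l2_entry=8000000182010000 refcount=0\n"
    "ERROR OFLAG_COPIED data cluster: l2_entry=8000000182020000 refcount=0\n"
)
for offset, copied, refcount in decode_check_errors(sample):
    print(f"host offset {offset:#x} copied={copied} refcount={refcount}")
```

Each decoded entry has the COPIED flag set while the refcount reported by the check is 0, which is the inconsistency the check reports. The decoded offsets in the quoted output all lie below the reported image end offset of 6497435648.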
Created attachment 872045 [details]
Test script for upstream
Also, for upstream qemu there is an easier way to trigger the condition. The
attached shell script starts two VMs and creates the problematic situation using
the 'qemu-io' HMP command (this command is the reason why the script does _not_
work on RHEL 7: it has not been backported there).
You have reproduced the bug if the output of the script contains a line like:
qcow2: Preventing invalid write on metadata (overlaps with active L2 table); image marked as corrupt.
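Checking a captured log for this message can be automated. The sketch below is an illustration only: it assumes the QEMU output has been captured as text, and the helper name `find_overlap_reports` is hypothetical. It returns the metadata structure named in each corruption-prevention message, so it matches both the "active L2 table" and the "refcount block" variants quoted in this bug.

```python
import re

# Pattern built from the message text quoted in this bug report.
CORRUPT_RE = re.compile(
    r"qcow2: Preventing invalid write on metadata "
    r"\(overlaps with ([^)]+)\); image marked as corrupt\."
)


def find_overlap_reports(output):
    """Return the overlapped metadata structure named in each matching line."""
    return CORRUPT_RE.findall(output)


log = (
    "qcow2: Preventing invalid write on metadata "
    "(overlaps with active L2 table); image marked as corrupt.\n"
    "block I/O error in device 'virtio0': Input/output error (5)\n"
)
hits = find_overlap_reports(log)
print("reproduced" if hits else "not reproduced", hits)
```

An empty result means the message did not appear in the captured output; it does not by itself prove that the image is consistent.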
(In reply to Kevin Wolf from comment #8)
> Created attachment 872040 [details]
> Proof-of-concept forward port of the RHEL 6 patch
>
> An experimental forward port of the RHEL 6 patch that reopens all images
> after an incoming migration has completed fixes the problem for me for some
> unknown reason. I'm attaching the corresponding patch file.

As requested by Amit, I did a Brew build with this patch applied:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=7174667

Just wanted to mention that my forward port is definitely not more than a proof of concept. When using it to work around this problem while reproducing something else, I noticed that I/O throttling settings get lost when the migration completes.

(In reply to Kevin Wolf from comment #10)
> (In reply to Kevin Wolf from comment #8)
> > Created attachment 872040 [details]
> > Proof-of-concept forward port of the RHEL 6 patch
> >
> > An experimental forward port of the RHEL 6 patch that reopens all images
> > after an incoming migration has completed fixes the problem for me for some
> > unknown reason. I'm attaching the corresponding patch file.
>
> As requested by Amit, I did a Brew build with this patch applied:
> http://brewweb.devel.redhat.com/brew/taskinfo?taskID=7174667

Tested this build; did not hit the problem anymore.

Version:
Host:
# uname -r
3.10.0-98.el7.x86_64
# rpm -q qemu-kvm
qemu-kvm-1.5.3-52.el7.migration_reopen.x86_64

Steps: same as comment 0.

Results:
Tried more than 15 times; did not hit the core dump problem. Migration finished successfully, and after migration the guest works well.

I prepared a new Brew build which uses a different, upstreamable patch. It should solve the problem as well, without falling back to the downstream-only implementation of RHEL 6:
http://brewweb.devel.redhat.com/brew/taskinfo?taskID=7186243
Can you check whether it fixes the problem as well?

Fix included in qemu-kvm-1.5.3-54.el7

Reproduced this issue with the same steps as comment #0.
host info:
# uname -r && rpm -q qemu-kvm
3.10.0-113.el7.x86_64
qemu-kvm-1.5.3-53.el7.x86_64
Steps: comment #0.
Results: migration fails with a QEMU segmentation fault (core dumped) on the destination.
(qemu) qcow2: Preventing invalid write on metadata (overlaps with refcount block); image marked as corrupt.
block I/O error in device 'scsi-disk0': Input/output error (5)
Segmentation fault (core dumped)
-------------------------------------------------------
Verified this issue with the same steps as comment #0.
host info:
# uname -r && rpm -q qemu-kvm
3.10.0-113.el7.x86_64
qemu-kvm-1.5.3-55.el7.x86_64
Steps: comment #0.
Results: migration finishes successfully without any core dump, and the VM is in running status.
(qemu) info status
VM status: running

Based on the above, this issue has been fixed correctly; moving to VERIFIED status.

Best Regards,
sluo

*** Bug 1081326 has been marked as a duplicate of this bug. ***

*** Bug 971214 has been marked as a duplicate of this bug. ***

This request was resolved in Red Hat Enterprise Linux 7.0.
Contact your manager or support representative in case you have further questions about the request.

*** Bug 1074992 has been marked as a duplicate of this bug. ***
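The reproduce/verify results in this bug (crash on release 53, clean on release 55) sit on either side of the Fixed In Version, qemu-kvm-1.5.3-54.el7. The sketch below shows one way to check an `rpm -q` string against that version. It is a deliberate simplification: it only compares the leading integer of the release field for the same package name and version, whereas RPM's real version comparison is more involved. The names `parse_nvr` and `has_fix` are hypothetical.

```python
import re

FIXED_IN = "qemu-kvm-1.5.3-54.el7"  # "Fixed In Version" field of this bug

# name-version-release, where the release starts with an integer (e.g. "54.el7").
NVR_RE = re.compile(r"^(?P<name>.+)-(?P<version>[^-]+)-(?P<release>\d+)[^-]*$")


def parse_nvr(nvr):
    """Split a package string into (name, version, leading release number)."""
    match = NVR_RE.match(nvr)
    if match is None:
        raise ValueError(f"unrecognized package string: {nvr!r}")
    return match.group("name"), match.group("version"), int(match.group("release"))


def has_fix(installed, fixed_in=FIXED_IN):
    """Simplified check: same name and version, release number >= the fixed one."""
    name, version, release = parse_nvr(installed)
    fixed_name, fixed_version, fixed_release = parse_nvr(fixed_in)
    if (name, version) != (fixed_name, fixed_version):
        raise ValueError("this sketch only compares releases of the same version")
    return release >= fixed_release


for pkg in ("qemu-kvm-1.5.3-53.el7.x86_64", "qemu-kvm-1.5.3-55.el7.x86_64"):
    print(pkg, "has fix:", has_fix(pkg))
```

Scratch builds such as qemu-kvm-1.5.3-52.el7.migration_reopen.x86_64 carry patches that the release number does not reflect, so a check like this says nothing about them.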