Bug 618940

Summary:

Source host's qemu-kvm process dumps core when migrating while an image is being copied from host to guest

Product:

Red Hat Enterprise Linux 6

Reporter:

juzhang <juzhang>

Component:

qemu-kvm

Assignee:

Kevin Wolf <kwolf>

Status:

CLOSED WORKSFORME

QA Contact:

Virtualization Bugs <virt-bugs>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

6.0

CC:

amit.shah, ddumas, kcao, michen, mjenner, mkenneth, quintela, snagar, tburke, virt-maint

Target Milestone:

Keywords:

TestBlocker

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Last Closed:

2010-08-11 03:52:40 UTC

Attachments:

Description	Flags
The detailed info about image after migration failed	none
The detailed info about image after migration completed with qemu-kvm-0.12.1.2-2.106.el6.x86_64	none

Description juzhang 2010-07-28 06:01:34 UTC

Description of problem:
Boot the guest on HostA, then scp a large qcow2 image (RHEL-Server-6.0.qcow2) into the guest. While the scp is running, migrate from HostA to HostB. About 30 seconds later, the source host's qemu-kvm process aborts with the error "(qemu) qemu-kvm: block/qcow2.c:613: qcow_aio_write_cb: Assertion `(acb->cluster_offset & 511) == 0' failed." The destination host's qemu-kvm aborts with the error "qemu: warning: error while loading state section id 4 load of migration failed".

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Boot the guest on HostA.
# /usr/libexec/qemu-kvm -m 2G -smp 2 -drive file=/mnt/RHEL-Server-6.0-64-virtio.qcow2,if=none,id=test,boot=on,cache=none,format=qcow2 -device virtio-blk-pci,drive=test -cpu qemu64,+sse2,+x2apic,-kvmclock -monitor stdio -drive file=/root/zhangjunyi/boot.iso,if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw -device ide-drive,drive=drive-ide0-1-0 -boot order=cdn,menu=on -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=22:11:22:45:66:94 -vnc :9 -qmp tcp:0:4444,server,nowait -incoming tcp:0:5555

2. Start qemu-kvm in listening mode on the remote HostB.
#<commandLine> -incoming tcp:0:5555

3. After the guest has booted, scp RHEL-Server-6.0.qcow2 from HostA to the guest.
#scp RHEL-Server-6.0.qcow2 10.66.91.95:/root

4. While the scp is running, migrate from HostA to HostB.
#{"execute": "migrate","arguments":{"uri": "tcp:10.66.91.123:5555"}}
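
The QMP command in step 4 can also be sent from a script. The following is a minimal sketch, not the reporter's method: it assumes the QMP socket from step 1 (-qmp tcp:0:4444,server,nowait) is reachable over TCP, and the function names and the qmp_host parameter are hypothetical. QMP expects a qmp_capabilities command before any other command, so the sketch sends that first.

```python
import json
import socket


def build_migrate_messages(uri):
    """Return the QMP commands to send, in order, as JSON strings."""
    return [
        # QMP requires capabilities negotiation before any other command.
        json.dumps({"execute": "qmp_capabilities"}),
        # Same command as in step 4 of the report.
        json.dumps({"execute": "migrate", "arguments": {"uri": uri}}),
    ]


def read_reply(stream):
    """Read lines until a command reply arrives, skipping async events."""
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("QMP connection closed")
        message = json.loads(line)
        if "return" in message or "error" in message:
            return message


def send_migrate(qmp_host, qmp_port, uri, timeout=10.0):
    """Connect to a QMP TCP socket and ask qemu to migrate to `uri`."""
    with socket.create_connection((qmp_host, qmp_port), timeout=timeout) as sock:
        stream = sock.makefile("rw", encoding="utf-8")
        json.loads(stream.readline())  # greeting banner sent by QMP on connect
        replies = []
        for command in build_migrate_messages(uri):
            stream.write(command + "\n")
            stream.flush()
            replies.append(read_reply(stream))
        return replies
```

A call would look like send_migrate("<HostA address>", 4444, "tcp:10.66.91.123:5555"). HostA's address is not given in this report, so it is left as a placeholder.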

Actual results:
About 30 seconds after step 4, the source host's qemu-kvm process aborts with the error "(qemu) qemu-kvm: block/qcow2.c:613: qcow_aio_write_cb: Assertion `(acb->cluster_offset & 511) == 0' failed."
The destination host's qemu-kvm aborts with the error "qemu: warning: error while loading state section id 4 load of migration failed".

Expected results:
Migration succeeds. At the very least, the qemu-kvm process does not dump core.

Additional info:
1. HostA's qemu-kvm dump

(gdb) bt
#0  0x000000322ec329b5 in raise () from /lib64/libc.so.6
#1  0x000000322ec34195 in abort () from /lib64/libc.so.6
#2  0x000000322ec2b945 in __assert_fail () from /lib64/libc.so.6
#3  0x000000000048d3a2 in qcow_aio_write_cb (opaque=0x7f4663b45300, ret=0) at block/qcow2.c:613
#4  0x000000000048d3e4 in qcow_aio_writev (bs=<value optimized out>, sector_num=<value optimized out>, qiov=<value optimized out>, 
    nb_sectors=<value optimized out>, cb=<value optimized out>, opaque=<value optimized out>) at block/qcow2.c:665
#5  0x0000000000479143 in bdrv_aio_writev (bs=0x2286410, sector_num=10880040, qiov=0x7f466005bc20, nb_sectors=4032, cb=<value optimized out>, 
    opaque=<value optimized out>) at block.c:1871
#6  0x0000000000479f4c in bdrv_aio_multiwrite (bs=0x2286410, reqs=0x7f466d78a5f0, num_reqs=<value optimized out>) at block.c:2080
#7  0x000000000041db8e in do_multiwrite (bs=<value optimized out>, blkreq=0x7f466d78a5f0, num_writes=4)
    at /usr/src/debug/qemu-kvm-0.12.1.2/hw/virtio-blk.c:236
#8  0x000000000041e238 in virtio_blk_handle_output (vdev=0x23213e0, vq=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/hw/virtio-blk.c:363
#9  0x000000000042a8b9 in kvm_handle_io (env=0x22d8010) at /usr/src/debug/qemu-kvm-0.12.1.2/kvm-all.c:538
#10 kvm_run (env=0x22d8010) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:975
#11 0x000000000042a959 in kvm_cpu_exec (env=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1658
#12 0x000000000042b57f in kvm_main_loop_cpu (_env=0x22d8010) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1900
#13 ap_main_loop (_env=0x22d8010) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1950
#14 0x000000322f4077e1 in start_thread () from /lib64/libpthread.so.0
#15 0x000000322ece151d in clone () from /lib64/libc.so.6

2. Without the "scp RHEL-Server-6.0.qcow2 from HostA to guest" operation, migration works.

3. After the migration failed, I checked the guest image and found a lot of errors. The details are in the attachment; only the error summary is pasted here.

#qemu-img check RHEL-Server-6.0-64-virtio.qcow2 >imgerrorinfo.txt
77649 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.

30976 leaked clusters were found on the image.
This means waste of disk space, but no harm to data.

5133 internal errors have occurred during the check.

An error has occurred during the check: Success
The check is not complete and may have missed error.
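
For reference, the failed assertion in frame #3 checks that acb->cluster_offset is a multiple of 512, i.e. that its low 9 bits are zero. Below is a minimal sketch of the same bit test. The helper names and the offsets are made up for illustration; the actual value of cluster_offset at the time of the crash is not in this report.

```python
SECTOR_SIZE = 512  # bytes; the backtrace counts I/O in sectors (sector_num, nb_sectors)


def low_bits(offset):
    """Return the bits that the assertion requires to be zero."""
    return offset & (SECTOR_SIZE - 1)  # SECTOR_SIZE - 1 == 511


def is_sector_aligned(offset):
    """Same test as `(acb->cluster_offset & 511) == 0` in block/qcow2.c."""
    return low_bits(offset) == 0


# Hypothetical offsets, not taken from the core dump.
aligned_example = 0x50000
unaligned_example = 0x50001
```

With these values, is_sector_aligned(aligned_example) holds and is_sector_aligned(unaligned_example) does not, which is the condition that would make qemu-kvm abort.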

Comment 2 juzhang 2010-07-28 06:06:42 UTC

Created attachment 434926 [details]
The detailed info about image after migration failed

Comment 3 RHEL Program Management 2010-07-28 06:17:43 UTC

This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 4 Dor Laor 2010-07-28 11:35:03 UTC

It might be the same issue as bz 618601

Comment 5 Amit Shah 2010-07-28 12:03:40 UTC

(In reply to comment #4)
> It might be the same issue as bz 618601    

Unlikely; there's no src host qemu crash in that case.

Comment 6 Kevin Wolf 2010-07-28 12:49:18 UTC

Might still be the same cause. It feels so completely wrong to have the image opened in two qemu instances at the same time.

If we can't delay opening the images on the destination until after the migration has completed, maybe we can open it read-only until then at least?

Comment 7 juzhang 2010-07-29 02:37:39 UTC

Just a reminder, in case it is useful:
1. Without the "scp RHEL-Server-6.0.qcow2 from HostA to guest" operation, migration works.
2. HostA and HostB both use shared iSCSI storage.

Comment 8 Dor Laor 2010-08-03 09:53:04 UTC

Please retest with qemu-kvm-0.12.1.2-2.106.el6 that contains the close-to-open consistency fix.

Comment 9 juzhang 2010-08-04 03:44:05 UTC

Retested with qemu-kvm-0.12.1.2-2.106.el6.
Using the same steps as in comment 0, the migration completes and the source host's qemu-kvm process does not dump core. However, two problems remain.

1. Copying the image from host to guest cannot complete. It fails with "Received disconnect from UNKNOWN: 2: Packet corrupt" followed by "lost connection".

#scp RHEL-Server-6.0-64-virtio.qcow2 10.66.91.95:/root
root@10.66.91.95's password:
RHEL-Server-6.0-64-virtio.qcow2 75% 1481MB 7.6MB/s 01:02 ETAReceived disconnect from UNKNOWN: 2: Packet corrupt
lost connection

2. Checking the image.
2.1 Before migration:
#qemu-img check RHEL-Server-6.0-64-virtio.qcow2
No errors were found on the image.
2.2 After the migration completed, I checked the guest image and found many errors (19437). That is far fewer than in comment 0 (77649 errors were found on the image). The details are in the attachment; only the error summary is pasted here.
#qemu-img check RHEL-Server-6.0-64-virtio.qcow2 >qemu2.106imgerrorinfo.txt

19437 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.

24455 leaked clusters were found on the image.
This means waste of disk space, but no harm to data.

139 internal errors have occurred during the check.

An error has occurred during the check: Success
The check is not complete and may have missed error.

Additional info:
Without the "scp RHEL-Server-6.0.qcow2 from HostA to guest" operation, the migration completes, and checking the image afterwards finds no errors.
#qemu-img check RHEL-Server-6.0-64-virtio.qcow2
No errors were found on the image.
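
To compare check results across builds, the summary counts can be pulled out of the qemu-img check output with a small script. This is a sketch only: the function name is hypothetical, and the patterns assume the exact wording pasted in comment 0 and in this comment, which other qemu-img versions may not share.

```python
import re

# One pattern per summary line, matching the wording pasted in this report.
_PATTERNS = {
    "errors": re.compile(
        r"^(\d+) errors were found on the image\.", re.MULTILINE),
    "leaked_clusters": re.compile(
        r"^(\d+) leaked clusters were found on the image\.", re.MULTILINE),
    "internal_errors": re.compile(
        r"^(\d+) internal errors have occurred during the check\.", re.MULTILINE),
}


def parse_check_summary(text):
    """Return the three summary counts; a missing line counts as 0."""
    counts = {}
    for name, pattern in _PATTERNS.items():
        match = pattern.search(text)
        counts[name] = int(match.group(1)) if match else 0
    return counts


# Summaries copied from comment 0 (build 2.99) and comment 9 (build 2.106).
SUMMARY_2_99 = """77649 errors were found on the image.
30976 leaked clusters were found on the image.
5133 internal errors have occurred during the check.
"""
SUMMARY_2_106 = """19437 errors were found on the image.
24455 leaked clusters were found on the image.
139 internal errors have occurred during the check.
"""
```

Calling parse_check_summary on each saved output file gives dictionaries that can be compared field by field.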

Comment 10 juzhang 2010-08-04 03:48:31 UTC

Created attachment 436427 [details]
The detailed info about image after migration completed with qemu-kvm-0.12.1.2-2.106.el6.x86_64

Comment 11 Kevin Wolf 2010-08-04 15:32:45 UTC

(In reply to comment #7)
> Just kindly reminder,maybe useful.
> 1.if no "scp RHEL-Server-6.0.qcow2 from HostA to guest." operation,migration is
> ok.
> 2. HostA and HostB both shared iscsi storage.    

Does it only happen with iscsi? Tried to reproduce with local files and NFS, but haven't been successful.

Second question is if you really need scp or if it's just about writing lots of data (for example, is dd from /dev/urandom or /dev/zero to a file enough?)

Comment 12 juzhang 2010-08-05 03:42:54 UTC

> Does it only happen with iscsi? Tried to reproduce with local files and NFS,
> but haven't been successful.
> 
The source host's qemu-kvm process dumps core only with iSCSI, on qemu-kvm-0.12.1.2-2.99.el6. On qemu-kvm-0.12.1.2-2.106.el6 the behavior is as described in comment 9.

For NFS:
Tested on qemu-kvm-0.12.1.2-2.106.el6. The migration completes, but problems similar to the iSCSI case remain.

1. Copying the image from host to guest cannot complete. It fails with "Received disconnect from UNKNOWN: 2: Packet corrupt" followed by "lost connection".
#scp RHEL-Server-6.0-64-virtio.qcow2 10.66.91.95:/root/
root@10.66.91.95's password:
RHEL-Server-6.0-64-virtio.qcow2                                                                                             43%  859MB   5.1MB/s   03:37 ETAReceived disconnect from UNKNOWN: 2: Packet corrupt
lost connection

2. Checking the image.
2.1 Before migration:
#qemu-img check RHEL-Server-6.0-64-virtio.qcow2
No errors were found on the image.
2.2 After the migration completed, I checked the guest image and found some leaked clusters, but no errors.

1711 leaked clusters were found on the image.
This means waste of disk space, but no harm to data.

For local files:
The migration completes; however, copying the image from host to guest cannot complete.
#scp RHEL-Server-6.0-64-virtio.qcow2 10.66.91.95:/root/
root@10.66.91.95's password:
RHEL-Server-6.0-64-virtio.qcow2                                                                                             28%  567MB  19.7KB/s - stalled -Write failed: Broken pipe
lost connection

> Second question is if you really need scp or if it's just about writing lots of
> data (for example, is dd from /dev/urandom or /dev/zero to a file enough?)    
dd from /dev/urandom or /dev/zero only generates I/O inside the host or inside the guest. It cannot generate I/O from host to guest or from guest to host. I mainly want I/O between host and guest, so I used scp.

I also ran dd if=/dev/vda of=/dev/zero bs=1M count=20480 in the guest during migration. The migration completed and no errors were found in the OS image.
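
The question in comment 11 was whether plain write load inside the guest is enough to trigger the problem. The dd run above reads from /dev/vda, so it does not cover writes. A rough stand-in for dd from /dev/zero to a file could look like the sketch below. The function name, target path, and sizes are hypothetical, and this was not run as part of this report.

```python
import os


def write_load(path, total_bytes, chunk_size=1024 * 1024):
    """Write `total_bytes` of zeros to `path`; return the bytes written."""
    chunk = bytes(chunk_size)  # a buffer of zeros, like reading /dev/zero
    written = 0
    with open(path, "wb") as out:
        while written < total_bytes:
            step = min(chunk_size, total_bytes - written)
            out.write(chunk[:step])
            written += step
        out.flush()
        os.fsync(out.fileno())  # ask the OS to flush the data to the disk
    return written
```

Inside the guest, something like write_load("/root/load.bin", 20480 * 1024 * 1024) during the migration would roughly match the bs=1M count=20480 size used above.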

Comment 13 Dor Laor 2010-08-05 21:14:35 UTC

So no more core dump?

Is it the same as reported in 618601? Check the end of that bugzilla.

Comment 14 juzhang 2010-08-06 01:50:24 UTC

(In reply to comment #13)
> So no more core dump?
> 
> Is it the same as reported in 618601 - check the end of the bugzila there.    

Yes, no more core dump.
I am not sure my scenario is the same as the one bcao described. However, I can confirm that the guest loses its connection after the migration completes, as I mentioned in comment 12.

Comment 15 Kevin Wolf 2010-08-11 03:47:08 UTC

I think we're actually facing two independent problems here: One related to block code that leads to cluster leakage (however no disk corruption any more), and another related to network code that lets the connection abort during migration.

I haven't really had a look at the network one, but I tried and couldn't reproduce the cluster leakage yet.

Comment 16 Dor Laor 2010-08-11 03:52:40 UTC

QE, please re-test with the .62 kernel. I think the problem is a duplicate of bug 619002. I'll close this one since the core dump no longer occurs.

Comment 17 juzhang 2010-08-12 02:21:23 UTC

(In reply to comment #16)
> QE, please re-test with .62 kernel I think the problem is a duplicated of bug
> 619002. I'll close this one since the core does not exist no more    

Retested on 2.6.32-62.el6.x86_64 with qemu-kvm-0.12.1.2-2.109.el6.x86_64.
Still hit the network issue: "Received disconnect from 10.66.91.95: 2: Packet corrupt" followed by "lost connection".

I will paste my test results into bz 619002.