Bug 681472 - qcow2 image corrupted after ping-pong live migration while scp file from host to guest
Summary: qcow2 image corrupted after ping-pong live migration while scp file from host to guest
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: qemu-kvm
Version: 6.1
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Juan Quintela
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-03-02 09:58 UTC by Mike Cao
Modified: 2013-01-09 23:36 UTC
CC List: 12 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-04-19 08:40:17 UTC
Target Upstream Version:


Attachments
see picture name (24.74 KB, image/png)
2011-04-14 12:41 UTC, Chao Yang
pci info on the host (17.88 KB, text/plain)
2011-04-14 13:00 UTC, Chao Yang

Description Mike Cao 2011-03-02 09:58:51 UTC
Description of problem:


Version-Release number of selected component (if applicable):
# uname -r
2.6.32-118.el6.x86_64
# rpm -q qemu-kvm
qemu-kvm-0.12.1.2-2.148.el6.x86_64

How reproducible:
only 1 time

Steps to Reproduce:
1. Start a VM (RHEL6 guest) on the src host:
eg:/usr/libexec/qemu-kvm -enable-kvm -m 4G -smp 4 -name rhel6U1 -uuid adcbfb49-3411-1701-3c36-6bdbc00bedb9 -rtc base=utc,clock=host,driftfix=slew -boot c -drive file=/dev/s2/share,if=none,id=mike_d1,format=qcow2,cache=none,werror=stop,rerror=stop -device virtio-blk-pci,drive=mike_d1,id=mike_d1 -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=a2:54:50:a4:c2:c1 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -device usb-tablet,id=input0 -vnc :2 -device virtio-balloon-pci,id=ballooning -monitor stdio
2. Start QEMU on the dst host with a listening port for the incoming migration (see the sketch after these steps).
3. scp a file from host to guest in a loop:
#ssh-copy-id <guest ip>
#for ((i=1;i<=10000;i++)); do
    scp -l 102400 /tt1 <guest-ip>:/;    # -l 102400 caps the bandwidth at 102400 Kbit/s
    ssh <guest-ip> "rm -rf /tt1";
    echo "$i times completed";
done
4. Do live migration between the 2 hosts back and forth.
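
(A minimal sketch of steps 2 and 4, assuming the dst host runs the same command line as step 1; port 5888 and the host names are illustrative, not from the original report:)

on the dst host, start qemu-kvm with a listening port for the migration stream:
/usr/libexec/qemu-kvm <same options as step 1> -incoming tcp:0:5888

on the src host's monitor, start the migration and watch its progress:
(qemu) migrate -d tcp:<dst-host-ip>:5888
(qemu) info migrate

Repeat in the opposite direction for each round, restarting the receiving side with -incoming each time.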
  
Actual results:

After several rounds of ping-pong migration, the guest responds very slowly.
On the host, the scp process stalls while transferring the file to the guest, although the guest's network is fine:
#tt1                                            60%  904MB   0.0KB/s - stalled

Then reboot the guest:
#(qemu) system_reset

The image is corrupted:
#qemu-img check /dev/s2/share
...
ERROR cluster 134730 refcount=1 reference=2
ERROR cluster 134731 refcount=1 reference=2

260 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.

Expected results:


Additional info:
1. Only occurred once; cannot reproduce with other qcow2 images.
2. The corrupted image was installed this week; only some stress tools have ever been run on it.

Comment 2 Kevin Wolf 2011-03-02 21:12:30 UTC
Juan, do we close all images or at least bdrv_flush them before migrating? I think we only call qemu_aio_flush, which is not enough.

Comment 3 Chao Yang 2011-03-10 06:36:22 UTC
Hit the same issue with a Windows 2008 R2 guest.

Comment 4 Juan Quintela 2011-03-21 10:51:23 UTC
We do a bdrv_flush(), so we should be good here.

Comment 5 Dor Laor 2011-04-07 13:13:19 UTC
Can QE help analyze this case?
We probably have a block IO issue while migrating. So please don't do networking; just run block IO and live migration in a loop without a guest reboot, and report what happens.
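
(A hypothetical driver for such a loop, assuming each QEMU exposes its monitor over TCP, e.g. -monitor tcp:0:4444,server,nowait, instead of -monitor stdio; host names, ports, and delays are illustrative:)

for ((i=1;i<=10;i++)); do
    # hostB has been (re)started with -incoming tcp:0:5888
    echo "migrate -d tcp:hostB:5888" | nc hostA 4444
    sleep 120    # allow the migration to converge
    # hostA has been (re)started with -incoming tcp:0:5888
    echo "migrate -d tcp:hostA:5888" | nc hostB 4444
    sleep 120
    echo "round $i completed"
done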

Comment 6 Jes Sorensen 2011-04-11 11:08:59 UTC
What Ethernet hardware is in that host?

If it is rtl8169 based, could you please try and disable hw
checksumming support? I have at least one system here, where
hw csum on the rtl8169 is bad and corrupts NFS if I do not
disable it.

/sbin/ethtool -K eth0 rx off
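
To double-check, the current offload settings can be listed with the lowercase
query form:

/sbin/ethtool -k eth0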

If it is not rtl8169, please ignore this comment.

Jes

Comment 7 Jes Sorensen 2011-04-11 15:11:54 UTC
Mike,

Can you please confirm or not, if this problem was seen on systems with
rtl8169 hardware, or if it has been seen on other systems as well?

Thanks,
Jes

Comment 8 Mike Cao 2011-04-12 03:32:17 UTC
(In reply to comment #7)
> Mike,
> 
> Can you please confirm or not, if this problem was seen on systems with
> rtl8169 hardware, or if it has been seen on other systems as well?
> 
> Thanks,
> Jes

I found this bug on a host with rtl8169 hardware. Will try on other systems with e1000e.
Since one of the hosts with rtl8169 hardware is broken now, I will do local live migration instead to try to reproduce it.

Comment 9 Jes Sorensen 2011-04-12 06:18:45 UTC
Mike,

Any chance you can get access to a set of machines with rtl8169, and
try to reproduce it the 'old way'? Once that is done, try and disable
the checksums as described above and see if the problem goes away?

That would be the ideal test to determine if this is rtl8169 related.

Thanks,
Jes

Comment 10 Juan Quintela 2011-04-12 09:02:34 UTC
I have reproduced it once.  Whole host networking died.  Trying to reproduce it with normal console & serial console to see what is happening.

Comment 11 Chao Yang 2011-04-13 06:22:01 UTC
Hit the same issue with the steps in comment #0 on an AMD host.
CLI:
/usr/libexec/qemu-kvm -M rhel6.1.0 -enable-kvm -m 4096 -smp 4 -name rhel5.6-32 -uuid `uuidgen` -rtc base=utc,clock=host,driftfix=slew -no-kvm-pit-reinjection -boot dc -drive file=/dev/chayang/rhel5.6-32,if=none,id=drive-virtio0-0-0,media=disk,format=qcow2,cache=none,werror=stop,rerror=stop -device virtio-blk-pci,drive=drive-virtio0-0-0,id=virt0-0-0 -netdev tap,id=hostnet1 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:40:81:11:53 -usb -device usb-tablet,id=input1 -vnc :0 -monitor stdio -balloon none
# qemu-img check /dev/chayang/rhel5.6-32 
ERROR OFLAG_COPIED: offset=80000000c1400000 refcount=0
ERROR OFLAG_COPIED: offset=80000000c1410000 refcount=0
ERROR OFLAG_COPIED: offset=80000000c1420000 refcount=0
ERROR OFLAG_COPIED: offset=80000000c1430000 refcount=0
ERROR OFLAG_COPIED: offset=80000000c1440000 refcount=0
ERROR OFLAG_COPIED: offset=80000000c1450000 refcount=0
ERROR OFLAG_COPIED: offset=80000000c1460000 refcount=0
ERROR OFLAG_COPIED: offset=80000000c1470000 refcount=0
ERROR OFLAG_COPIED: offset=80000000c1480000 refcount=0
ERROR OFLAG_COPIED: offset=80000000c1490000 refcount=0
ERROR OFLAG_COPIED: offset=80000000c14a0000 refcount=0
...
ERROR OFLAG_COPIED: offset=80000000d6f20000 refcount=0
ERROR OFLAG_COPIED: offset=80000000d6f30000 refcount=0
ERROR OFLAG_COPIED: offset=80000000d6f40000 refcount=0
ERROR OFLAG_COPIED: offset=80000000d6f50000 refcount=0
ERROR OFLAG_COPIED: offset=80000000d6ef0000 refcount=0
ERROR cluster 49472 refcount=0 reference=1
ERROR cluster 49473 refcount=0 reference=1
ERROR cluster 49474 refcount=0 reference=1
ERROR cluster 49475 refcount=0 reference=1
ERROR cluster 49476 refcount=0 reference=1
ERROR cluster 49477 refcount=0 reference=1
ERROR cluster 49478 refcount=0 reference=1
ERROR cluster 49479 refcount=0 reference=1
ERROR cluster 49480 refcount=0 reference=1
...

11116 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.

Additional info:
# brctl show
bridge name	bridge id		STP enabled	interfaces
switch		8000.0024217fb7f9	no		eth0
							tap0
# ethtool -i eth0
driver: tg3
version: 3.113
firmware-version: 5754-v3.26
bus-info: 0000:3f:00.0

# lspci -vvv -s 3f:00.0
3f:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5754 Gigabit Ethernet PCI Express (rev 02)
	Subsystem: Hewlett-Packard Company Device 3029
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 25
	Region 0: Memory at f0200000 (64-bit, non-prefetchable) [size=64K]
	Expansion ROM at <ignored> [disabled]
	Capabilities: [48] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] Vital Product Data
		Product Name: Broadcom NetLink Gigabit Ethernet Controller
		Read-only fields:
			[PN] Part number: BCM95754
			[EC] Engineering changes: 106679-15
			[SN] Serial number: 0123456789
			[MN] Manufacture ID: 31 34 65 34
			[RV] Reserved: checksum good, 30 byte(s) reserved
		Read/write fields:
			[YA] Asset tag: XYZ01234567
			[RW] Read-write area: 107 byte(s) free
		End
	Capabilities: [58] Vendor Specific Information <?>
	Capabilities: [e8] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee0f00c  Data: 4181
	Capabilities: [d0] Express (v1) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 4096 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x1, ASPM L0s, Latency L0 <4us, L1 <64us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
	Capabilities: [100] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [13c] Virtual Channel <?>
	Capabilities: [160] Device Serial Number 00-24-21-ff-fe-7f-b7-f9
	Capabilities: [16c] Power Budgeting <?>
	Kernel driver in use: tg3
	Kernel modules: tg3

Comment 12 Jes Sorensen 2011-04-13 12:10:59 UTC
Hi,

I notice that so far we have only seen this bug when using iSCSI
for storage.

We need to try and narrow down further what causes this bug - ie.
is it networking, is it iSCSI, or is it something else.

1) As discussed per irc earlier, could you try and reproduce the problem
using NFS backed storage?

2) Could you also try to reproduce it using LVM storage and migration
on just one host?

3) Could you try to reproduce it without guest networking, ie. using
file copy inside the guest instead of scp, but on iSCSI. If it fails
on iSCSI, try NFS as well, and last on LVM as in 2).

Thanks,
Jes

Comment 13 Chao Yang 2011-04-14 12:40:10 UTC
(In reply to comment #5)
> Can QE help analyze this case?
> We probably have a block IO issue while migrating. So please don't do
> networking, just block IO and live migration in a loop without guest reboot and
> report what happens.

Hi dor,
I have tested iSCSI as storage, doing ping-pong migration about 10 times, and launched 3 iozone -a processes in the guest instead of scp'ing files from host to guest. After the ping-pong migration I checked the block image; the following is what I got on the src host and the dst host. BTW, when booting from the block image again, the guest fails to launch X (please take a look at the screenshot).
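
(For reference, a minimal sketch of launching the three concurrent iozone runs inside the guest; the test-file paths are illustrative assumptions:)

for i in 1 2 3; do
    iozone -a -f /tmp/iozone.$i &    # -a: full automatic mode; separate test file per process
done
wait    # block until all three runs finish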

Before migration:
checked the block image both in src and dst host, no errors were found on the image.

After 10 times migration:
src host:
# qemu-img check /dev/chayang/rhel5.6-21-bac 
ERROR OFLAG_COPIED: offset=800000020e520000 refcount=0
ERROR OFLAG_COPIED: offset=800000020e530000 refcount=0
ERROR OFLAG_COPIED: offset=800000020e570000 refcount=0
ERROR OFLAG_COPIED: offset=800000020e580000 refcount=0
...
ERROR OFLAG_COPIED: offset=8000000210910000 refcount=0
ERROR OFLAG_COPIED: offset=8000000210920000 refcount=0
ERROR OFLAG_COPIED: offset=8000000210930000 refcount=0
ERROR OFLAG_COPIED: offset=8000000210940000 refcount=0
ERROR cluster 118364 refcount=1 reference=2
ERROR cluster 118365 refcount=1 reference=2
ERROR cluster 118366 refcount=1 reference=2
ERROR cluster 118367 refcount=1 reference=2
ERROR cluster 118368 refcount=1 reference=2
ERROR cluster 118369 refcount=1 reference=2
ERROR cluster 118370 refcount=1 reference=2
...
ERROR cluster 135315 refcount=0 reference=1
ERROR cluster 135316 refcount=0 reference=1

1262 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.


dst host:
# qemu-img check /dev/chayang/rhel5.6-21-bac 
ERROR OFLAG_COPIED: offset=800000020db70000 refcount=0
ERROR OFLAG_COPIED: offset=800000020db80000 refcount=0
ERROR OFLAG_COPIED: offset=800000020db90000 refcount=0
ERROR OFLAG_COPIED: offset=800000020dba0000 refcount=0
...
ERROR OFLAG_COPIED: offset=8000000210940000 refcount=0
ERROR OFLAG_COPIED: offset=800000020e400000 refcount=0
ERROR cluster 114205 refcount=1 reference=2
ERROR cluster 114206 refcount=1 reference=2
ERROR cluster 114207 refcount=1 reference=2
ERROR cluster 114208 refcount=1 reference=2
...
ERROR cluster 135315 refcount=0 reference=1
ERROR cluster 135316 refcount=0 reference=1

2313 errors were found on the image.
Data may be corrupted, or further writes to the image may corrupt it.

Comment 14 Chao Yang 2011-04-14 12:41:34 UTC
Created attachment 492084 [details]
see picture name

Comment 15 Chao Yang 2011-04-14 13:00:23 UTC
Created attachment 492100 [details]
pci info on the host

Comment 16 Dor Laor 2011-04-14 20:36:08 UTC
Thanks. The next stage is to see whether this happens w/o qcow2 by using a raw image instead.

Comment 17 Mike Cao 2011-04-15 06:41:33 UTC
(In reply to comment #7)
> Mike,
> 
> Can you please confirm or not, if this problem was seen on systems with
> rtl8169 hardware, or if it has been seen on other systems as well?
> 
> Thanks,
> Jes
(In reply to comment #12)

> 2) Could you also try to reproduce it using LVM storage and migration
> on just one host?
> 


Tried on this machine with the following steps:
1. Start KVM with the image on an LV (qcow2 format)
2. scp a file from host to guest in a loop
3. Do ping-pong live migration
4. Shut down the VM and run #qemu-img check

Actual Results:
No errors found by #qemu-img check; cannot reproduce this issue.

Additional info:
chayang tried on an AMD host (host info in comment #15); could not reproduce either.

Comment 18 Jes Sorensen 2011-04-15 09:29:20 UTC
Hi,

So we are still down to iSCSI and QCOW2 - what type of iSCSI server
are you using?

Thanks,
Jes

Comment 19 Mike Cao 2011-04-15 09:35:52 UTC
(In reply to comment #18)
> Hi,
> 
> So we are still down to iSCSI and QCOW2 - what type of iSCSI server
> are you using?
> 
> Thanks,
> Jes

We made a local disk partition into an iSCSI target, then used scsi-target-utils to configure it as follows:

setenforce 0
iptables -F
tgtadm --lld iscsi --mode target --op new --tid 1 --target iqn.mike.com:s1
tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 --backing-store /dev/sdb1
tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address ALL
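
(For reference, the resulting target and LUN layout can be inspected with:)

tgtadm --lld iscsi --mode target --op show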

Comment 20 Jes Sorensen 2011-04-15 09:51:01 UTC
Interesting!

Just to clarify, so if you use iSCSI on the host, and run QCOW2, then do the
ping pong on the same host, you can reproduce the problem? If you use LVM
instead, you cannot reproduce it?

Thanks,
Jes

Comment 21 Chao Yang 2011-04-15 10:05:36 UTC
(In reply to comment #12)
> Hi,
> 
> I notice that so far we have only seen this bug when using iSCSI
> for storage.
> 
> We need to try and narrow down further what causes this bug - ie.
> is it networking, is it iSCSI, or is it something else.
> 
> 1) As discussed per irc earlier, could you try and reproduce the problem
> using NFS backed storage?
Hi Jes,
  Tried to reproduce using NFS-backed storage: scp files from host to guest; after ping-ponging 10 times on two hosts, checked the qcow2 image.
# qemu-img check /mnt/RHEL-Server-5.6-32.qcow2 
No errors were found on the image.


> 2) Could you also try to reproduce it using LVM storage and migration
> on just one host?
Please refer to comment #17


> 3) Could you try to reproduce it without guest networking, ie. using
> file copy inside the guest instead of scp, but on iSCSI. If it fails
Please refer to comment #13

> on iSCSI, try NFS as well, and last on LVM as in 2). 
Running 3 iozone processes in the guest; after ping-ponging 10 times, checked the image:
# qemu-img check /mnt/RHEL-Server-5.6-32.qcow2 
No errors were found on the image.


> Thanks,
> Jes

CLI I am using for test.
/usr/libexec/qemu-kvm -M rhel6.1.0 -enable-kvm -m 4096 -smp 4 -name rhel5.6-32 -uuid `uuidgen` -rtc base=utc,clock=host,driftfix=slew -no-kvm-pit-reinjection -boot c -drive file=/mnt/RHEL-Server-5.6-32.qcow2,if=none,id=drive-virtio0-0-0,media=disk,format=qcow2,cache=none -device virtio-blk-pci,drive=drive-virtio0-0-0,id=virt0-0-0 -netdev tap,id=hostnet1 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:40:81:11:53 -usb -device usb-tablet,id=input1 -vnc :0 -monitor stdio -balloon none -incoming tcp:0:7000

Comment 22 Chao Yang 2011-04-15 11:00:37 UTC
(In reply to comment #20)
> Interesting!
> 
> Just to clarify, so if you use iSCSI on the host, and run QCOW2, then do the
> ping pong on the same host, you can reproduce the problem? If you use LVM
> instead, you cannot reproduce it?
> 
> Thanks,
> Jes


Hi Jes,
 I tried iSCSI+qcow2 on one host, scp'ing files from host to guest; after ping-ponging 5 times, checked the image:
# qemu-img check /dev/chayang/test-ping-pong
No errors were found on the image.

Comment 23 Jes Sorensen 2011-04-15 12:27:34 UTC
Chayang,

How many times did you have to ping-pong before you saw
the error? Were 5 ping-pongs enough?

Thanks,
Jes

Comment 24 Jes Sorensen 2011-04-15 12:36:15 UTC
Hi,

Kevin and I were chatting about this on irc, and we are not quite sure.
When you export the disk locally via iSCSI, how is it accessed? I.e., is your
configuration:

1) /dev/sdX on host A is exported as an iSCSI target to host B, accessed
   directly on host A, but over iSCSI from host B.
2) /dev/sdX on host A is exported as an iSCSI target to both host A and host B,
   and imported as an iSCSI device on both host A and B?

Thanks,
Jes

Comment 25 Mike Cao 2011-04-16 11:10:50 UTC
(In reply to comment #24)
> Hi,
> Kevin and I were chatting about this on irc, and we are not quite sure.
> When you export the disk locally via iSCSI, how is it accessed. Ie. is your
> configuration:
> 1) /dev/sdX on host A is exported as an iSCSI target to host B, accessed
>    directly on host A, but over iSCSI from host B.
> 2) /dev/sdX on host A is exported as an iSCSI target to both host A and host B,
>    and imported as an iSCSI device on both host A and B?
> Thanks,
> Jes

Hi,Jes,

Seems like the 1st one, because we cannot use the iSCSI initiator to connect to hostA (the scsi-target-utils host) itself; if we do, it reports a duplicate VG error. The following are my steps.

1. Configure the iSCSI target
on hostA:
#tgtadm --lld iscsi --mode target --op new --tid 1 --target iqn.mike.com:s1
#tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 --backing-store /dev/sdb1
#tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address ALL

2. Configure the iSCSI initiator (see the discovery note after these steps)
on hostB:
#iscsiadm -m node -T iqn.mike.com:s1 -p [host A] -l

3. On hostA (or hostB):
#pvcreate /dev/sdb1
#vgcreate vgtest /dev/sdb1
#lvcreate -L 20G -n RHEL5u6 vgtest
#qemu-img create -f qcow2 /dev/vgtest/RHEL5u6

4. On hostB (if step 3 used hostB, use hostA here instead):
#vgscan
#lvscan
#lvchange -ay /dev/vgtest/RHEL5u6 

5. On host A: <commandLine> -drive file=/dev/vgtest/RHEL5u6
   on host B: <commandLine> -drive file=/dev/vgtest/RHEL5u6 -incoming tcp:0:5888
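
Discovery note for step 2: if the initiator does not already know the target,
a sendtargets discovery would normally precede the login (portal address
illustrative):

#iscsiadm -m discovery -t sendtargets -p [host A]
#iscsiadm -m node -T iqn.mike.com:s1 -p [host A] -l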

Comment 26 Jes Sorensen 2011-04-18 09:05:30 UTC
Hi Mike,

Thank you for the explanation. Given that you are using 1), we should focus
on two things: see if we can reproduce it on raw as Dor requested in #16,
and also see if we can reproduce it against an external iSCSI server.

Thanks,
Jes

Comment 27 chellwig@redhat.com 2011-04-18 09:14:26 UTC
Note that using buffered I/O with shared storage will cause exactly this kind of corruption.  From the list of tools above it looks like you use the userspace "tgt" iSCSI target.  What does the configuration for the affected LUN look like?  If it does not include the "direct" argument it will use buffered I/O and thus cause corruption like this.
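
For illustration: if tgtd were driven by /etc/tgt/targets.conf rather than raw tgtadm calls, the equivalent stanza would presumably look like the following (this syntax is an assumption based on scsi-target-utils conventions, not taken from the reporter's setup):

<target iqn.mike.com:s1>
    backing-store /dev/sdb1
    bsoflags direct
</target>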

Comment 28 Mike Cao 2011-04-18 09:22:44 UTC
(In reply to comment #27)
> Note that using buffered I/O with shared storage will cause exactly this kind
> of corruption.  From the list of tools above it looks like you use the
> userspace "tgt" iSCSI target.  What does the configuration for the affected LUN
> look like?  If it does not include the "direct" argument it will use buffered
> I/O and thus cause corruption like this.

The configuration is in comment #25.

Comment 29 Jes Sorensen 2011-04-18 10:25:30 UTC
Based on Christoph's findings, it looks like the default is for tgtd to use
buffered IO on the iSCSI target, which could explain this corruption.

Please try running the tests specifying --bsoflags=direct for the
exported devices and see if it makes a difference to the corruption.

Thanks,
Jes

Comment 30 Chao Yang 2011-04-18 12:39:18 UTC
Using the following CLI to export a target, I cannot reproduce bug 681472 after ping-ponging 10 times with a qcow2 image:
tgtadm --lld iscsi --mode target --op new --tid 1 --targetname iqn.chayang.com:test
tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 --backing-store /dev/sda5  --bsoflags=direct
tgtadm --lld iscsi --op bind --mode target --tid 1 --initiator-address ALL
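
(To verify the flag took effect, check the LUN entry printed by
tgtadm --lld iscsi --mode target --op show; on tgt versions that report it,
the entry should include a line like "Backing store flags: direct". The exact
output format here is an assumption.)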

Comment 32 Mike Cao 2011-04-19 08:15:10 UTC
Tried 2 more times with --bsoflags=direct (referring to comment #30).

steps:
1. In the guest: #for ((;;)); do iozone -a; done
2. On the host, scp a file to the guest in a loop
3. Ping-pong live migration 10 times

Actual Results:

Tried 2 times; could not reproduce.

Comment 33 Jes Sorensen 2011-04-19 08:40:17 UTC
This is excellent news!

It looks like this was an iSCSI configuration issue in the end, rather
than a QEMU bug.

Mike, please make sure to document this within QE so everybody knows to
use --bsoflags=direct when using iSCSI locally like this.

Closing

Jes

Comment 34 Dor Laor 2011-04-19 20:58:52 UTC
It is actually good practice for the future to refrain from doing this and to use a completely different server for the shared storage.
Even if the above is used, there are always other things that might get in our way.

This bug took a lot of resources from us all.

Jes/Juan/Kevin/Christoph and QE, thanks for your efforts in closing the bug.

