Bug 1243225
Summary: | Transfer data through chardev while do migration cause data broken | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Zhengtong <zhengtli> |
Component: | qemu-kvm-rhev | Assignee: | Laurent Vivier <lvivier> |
Status: | CLOSED WONTFIX | QA Contact: | Virtualization Bugs <virt-bugs> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 7.2 | CC: | amit.shah, hannsj_uhl, huding, juzhang, knoel, michen, ngu, pbonzini, qzhang, thuth, virt-maint, xuhan, zhengtli |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2016-07-04 16:21:15 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1308609, 1359843 |
When you open the host side on the src, the host side will cache data that is sent by the guest. That means you'll have to read out all the data on the src, then read out all the data on the dest, concatenate the outputs, and then compare the md5sum.

Please do that and let us know if that worked.

(In reply to Amit Shah from comment #6)
> When you open the host side on the src, the host side will cache data that
> is sent by the guest. That means you'll have to read out all the data on
> the src, then read out all the data on the dest, concatenate the outputs,
> and then compare the md5sum.
>
> Please do that and let us know if that worked.

There is no environment to do this test now, I will do this as soon as I got one.

And still there is a problem, I don't know how to get the cached data on the source host, do you have any idea about this? I just connect the socket with the python module:

>>> import socket
>>>
>>> sock = socket.socket(socket.AF_UNIX)
>>> sock.connect("/tmp/foo")
>>>

thanks

(In reply to Zhengtong from comment #7)
> And still there is a problem, I don't know how to got the cache data on
> source host, do you have any idea about this?

After migration is complete, actually read from the src host and put contents into a file, e.g. dump1.

You should already have contents from dest host in a file, e.g. dump2.

Then, do something like:

cat dump1 dump2 > dump3

and then check the md5 of dump3 and guest file.

(In reply to Amit Shah from comment #8)
> (In reply to Zhengtong from comment #7)
> > And still there is a problem, I don't know how to got the cache data on
> > source host, do you have any idea about this?
>
> After migration is complete, actually read from the src host and put
> contents into a file, e.g. dump1.
>
> You should already have contents from dest host in a file, e.g. dump2.
>
> Then, do something like:
>
> cat dump1 dump2 > dump3
>
> and then check the md5 of dump3 and guest file.

Hi, Amit. Actual I want to know how to get dump1 file.
Anyway, I have tried with the method below:

1. On src host, recv the channel data after migration finished.
>>> f=open("dump1",'wb')
>>> f.write(sock.recv())
>>> f.close()

2. Copy dump1 file to host2, concatenate the data with dump2 as you said.
# cat dump1 dump2 > dump3

and the result is that the md5sum is not the same; even the size is not the same.

Details:

Received data:
[root@ibm-p8-kvm-02-qe dump]# ls -ls
total 4080168
     16 -rw-r--r--. 1 root root      16384 Aug 14 08:06 dump1
2040068 -rw-r--r--. 1 root root 2089025536 Aug 14 07:15 dump2
2040084 -rw-r--r--. 1 root root 2089041920 Aug 14 08:27 dump3

Sent data:
[root@dhcp71-167 ~]# ls -s 2g
2048000 2g

The dump1 is only 16K in size, but the gap between 2g (sent data in guest) and dump2 (received on dst host) is 7.9M (2048000-2040084).

(In reply to Zhengtong from comment #9)
> Hi, Amit. Actual I want to know how to get dump1 file.
> Anyway , I have tried with method as below:
> 1. On src host. recv the channel data. after migration finished.

Yes, that's what I meant.

> >>> f=open("dump1",'wb')
> >>> f.write(sock.recv())

What's the bufsize you give to recv()? Keep receiving till you get EOF.

I didn't modified the bufsize during my test.
Check again in my host for default value.
>>> import socket
>>> sock=socket.socket(socket.AF_UNIX)
>>> sock.getsockopt(socket.SOL_SOCKET,socket.SO_RCVBUF)
229376
So. the bufsize is 229376.
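
What "keep receiving till you get EOF" amounts to in practice is a loop around sock.recv(): a single recv() call only returns whatever happens to be buffered at that moment, regardless of the bufsize passed. A minimal sketch (the socket path and output filename follow the examples above; the chunk size is arbitrary):

import socket

sock = socket.socket(socket.AF_UNIX)
sock.connect("/tmp/foo")

with open("dump1", "wb") as f:
    while True:
        data = sock.recv(65536)   # chunk size is not critical
        if not data:              # empty result = peer closed the socket (EOF)
            break
        f.write(data)

Note that recv() only returns an empty result once QEMU closes the socket (for example when the source VM is quit), which is the same point made with the flush.py loop later in this report.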
(In reply to Zhengtong from comment #5)
> I tried on X86 host. But there is a different issue. Here is the step and
> result test on X86.
> And the migration failed with the error msg in destination as below:
>
> (qemu) red_dispatcher_loadvm_commands:
> KVM internal error. Suberror: 1
> emulation failure
> Then the guest freeze.
>
>
> I tried without transferring data b/w host and guest. the migration could
> finish successfully.

That's strange; can you try this again? This could be due to a mismatch in the two host cpu types. Transferring data shouldn't trigger such an emulation failure.

(In reply to Zhengtong from comment #12)
> I didn't modified the bufsize during my test.
> Check again in my host for default value.
>
> >>> import socket
> >>> sock=socket.socket(socket.AF_UNIX)
> >>> sock.getsockopt(socket.SOL_SOCKET,socket.SO_RCVBUF)
> 229376
>
>
> So. the bufsize is 229376.

I mean from here:

(In reply to Amit Shah from comment #11)
> (In reply to Zhengtong from comment #9)
>
> > Hi, Amit. Actual I want to know how to get dump1 file.
> > Anyway , I have tried with method as below:
> > 1. On src host. recv the channel data. after migration finished.
>
> Yes, that's what I meant.
>
> > >>> f=open("dump1",'wb')
> > >>> f.write(sock.recv())

sock.recv() needs to be passed a bufsize parameter. What param do you use here?

In any case:

> Keep receiving till you get EOF.

just keep reading from sock.recv() till there's no more data left in the recv queue. Only that will ensure all data has been received and written to the file on disk.

Ok, I see. In my test before, I didn't set sock.recv() with bufsize; the command you saw here is exactly the command I ran in the test ("f.write(sock.recv())").

I tested again with bufsize 65536000, and the md5sum is still not right.

Steps:
1. Boot guest on source host and connect with python tool:
>>> import socket
>>> sock = socket.socket(socket.AF_UNIX)
>>> sock.connect("/root/test_home/liuzt/manually_test/serial-socket")
2. In guest, run #cat 2G > /dev/vport1p1
3. do migration
4. In destination host, get the file by #nc -U $socket > dump2
5. In source host, get the cached data by sock.recv(65536000)
>>> f=open("/tmp/dump1",'wb')
>>> f.write(sock.recv(65536000))
wait 5 seconds (just to ensure the data is received)
>>> f.close()
# ls -l /tmp/dump1
-rw-r--r--. 1 root root 131072 Nov 30 05:44 /tmp/dump1
6. copy /tmp/dump1 file to destination host, and do #cat dump1 dump2 > dump3

Compare the md5sum of dump3 in dest host and the 2G file in guest; the md5sum is still different.

[root@ibm-p8-kvm-01-qe ~]# md5sum dump3
4375ef49dafdb13f977259ce3807c440 dump3
[root@dhcp71-100 ~]# md5sum 2G
f3444be3893c1b9b6fa9dd1acde671a9 2G

I can't reproduce the issue in comment #5 any more. Not sure which part I did was wrong then. It seems like the problem is that I emulated a Sandybridge cpu on my AMD host. I switched the cpu type to "Opteron_G3" and tested; there are no problems any more.

In my test on X86 the result is good, I mean the md5sum of the file in the guest is the same as that in the Host. So the problem only occurs on Powerpc.

That's strange - maybe it's something in the virtio layer acting up on power? But still, that's a stretch.

Assigning to David -- he might have more ideas.

I think I've been able to reproduce the problem using only one host for simplicity:

- start guest A with:
...
-device virtio-serial-pci,id=serial0 \
-chardev socket,id=channel0,path=/tmp/serial_socket_1,server,nowait \
-device virtserialport,bus=serial0.0,nr=1,chardev=channel0,name=org.linux-kvm.port.1,id=port1 \

- start guest B with:
...
-device virtio-serial-pci,id=serial0 \
-chardev socket,id=channel0,path=/tmp/serial_socket_2,server,nowait \
-device virtserialport,bus=serial0.0,nr=1,chardev=channel0,name=org.linux-kvm.port.1,id=port1 \
-incoming tcp:0:4444

-> we use two different unix sockets.

- on guest A:
1- dd if=/dev/zero of=2g bs=1M count=2000
2- cat 2g > /dev/vport1p1
3- in QEMU monitor: migrate tcp:localhost:4444

once migration is terminated, on host:
- sudo nc -U /tmp/serial_socket_2 > file2

Wait until the end of the "cat" on guest B (the shell is displayed again).

The 2 files, 2g (guest) and file2 (host), have the same size, 2097152000, but the file received on the host is filled with 0 at the beginning of the file, exactly 0x40000 bytes of 0:

00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00040000 e0 ff 81 fb 78 2b b9 7c 26 80 90 7d e8 ff a1 fb |....x+.|&..}....|
00040010 f0 ff c1 fb 79 33 de 7c 78 23 9d 7c f8 ff e1 fb |....y3.|x#.|....|
00040020 c0 ff 01 fb 78 1b 7f 7c 78 3b fc 7c d0 ff 41 fb |....x..|x;.|..A.|

I've tested this with qemu-kvm-rhev-2.3.0-31.el7_2.8 and with upstream QEMU (331ac65) on POWER and the result is the same.

I didn't use the python script to open the chardev to have this result.

The behavior is exactly the same with QEMU x86_64 upstream (331ac65): the beginning of the file is overwritten by 0x40000 bytes of 0.

Forget my two previous comments (#18 and #19): in fact my file is corrupted during the migration (it should be filled entirely with 0), because I'm using "snapshot=on" so some disk data are not transferred from a guest to another...

With snapshot=off I'm not able to reproduce the bug on POWER or x86_64 (but I'm not using the python script to open the first serial port, I'll do more tests with that).

Re-reading the BZ description, it seems you are also on the same host, but for POWER you use two different socket names (serial-socket2 and serial-socket1), whereas on x86 you are using only one socket name (serial_socket).

Could you recheck this BZ on POWER and X86?

Thanks

I've tested again, this time using the python script, and I've been able to reproduce it, but I think the problem is with the python script.

Amit was right in comment 14: we must keep reading on sock.recv(); playing with the bufsize parameter is not enough to flush all the buffers:

- we must loop on the sock.recv() and f.write()
- we must quit the first guest once the migration is done (to exit the loop and flush all intermediate buffers).

I'm using now this python script (flush.py):

-------------------8<------------------
import socket

sock = socket.socket(socket.AF_UNIX)
sock.connect("/tmp/serial_socket_1")

raw_input("Press Enter to continue...")

f=open("file1",'wb')
while 1:
    data = sock.recv(4096)
    if not data:
        break
    f.write(data)
f.close()
-------------------8<------------------

the test is:

0- clean all files (to be sure we don't re-use an existing unix socket):
sudo rm -f file* /tmp/serial_socket_*

1- start guest A with:
...
-device virtio-serial-pci,id=serial0 \
-chardev socket,id=channel0,path=/tmp/serial_socket_1,server,nowait \
-device virtserialport,bus=serial0.0,nr=1,chardev=channel0,name=org.linux-kvm.port.1,id=port1 \

2- start guest B with:
...
-device virtio-serial-pci,id=serial0 \
-chardev socket,id=channel0,path=/tmp/serial_socket_2,server,nowait \
-device virtserialport,bus=serial0.0,nr=1,chardev=channel0,name=org.linux-kvm.port.1,id=port1 \
-incoming tcp:0:4444

3- on the host, start the python script "flush.py":
sudo python flush.py
You should have the following message: "Press Enter to continue..." but don't press enter! This opens the socket but doesn't read any data until enter is pressed.

4- on guest A: log in as root
dd if=/dev/zero of=2g bs=1M count=2000
cat 2g >/dev/vport1p1

5- on guest A: switch to monitor
migrate tcp:localhost:4444

6- wait for the end of the migration

7- on host:
sudo nc -U /tmp/serial_socket_2 > file2

8- on guest B: wait for the end of "cat" (shell prompt is displayed)

9- in the terminal of the python script flush.py, press enter (to flush the data)

10- in guest A monitor, type "quit" (this allows the python script to exit from its loop)

11- cat file1 file2 > file3

12- compare "md5sum 2g" and "md5sum file3"

f3444be3893c1b9b6fa9dd1acde671a9 file3
f3444be3893c1b9b6fa9dd1acde671a9 2g

So, I think this is not a bug, we just had some unflushed data (as guessed by Amit).

Note: without the loop in the python script, my file1 size is 131072 (as in comment 15) whereas with the loop the size is 4325376.

(In reply to Laurent Vivier from comment #20)
> Forget my two previous comments (#18 and #19): in fact my file is corrupted
> during the migration (it should be filled entirely with 0), because I'm
> using "snapshot=on" so some disk datas are not transferred from a guest to
> another...
>
> With snapshot=off I'm not able to reproduce the bug on POWER or x86_64 (but
> I'm not using the python script to open the first serial port, I'll do more
> tests with that).
>
> Re-reading BZ description, it seems you are also on the same host, but for
> POWER you use two different socket names (serial-socket2 and
> serial-socket1), but on x86 you are using only one socket name
> (serial_socket).
>
> Could you recheck this BZ on POWER and X86?
>
> Thanks

Hi Laurent,
For POWERPC, I only use the serial-socket1 port in this case test, although I attached two sockets during guest boot. Maybe this made you confused; I use this suite of commands to test not only this case but also other test cases which need multiple serial sockets, and I just pasted all command lines here. For this case, I tested between two hosts and only used serial-socket1.

Regarding comment #21: I tested on my side following your steps, but I do migration between two physical hosts. I didn't get the right result. I keep trying more times.

On my side, the process:

1. Boot guest in hostA and hostB
A:
qemu-kvm ...
-chardev socket,path=/root/test_home/liuzt/Manuall_test/virtio-serail/serial-socket1,id=channel0,server,nowait \
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=channel0,name=org.linux-kvm.port.1,id=port1 \
...
B:
qemu-kvm ..
-chardev socket,path=/root/test_home/liuzt/Manuall_test/virtio-serail/serial-socket1,id=channel0,server,nowait \
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=channel0,name=org.linux-kvm.port.1,id=port1 \
...
-incoming tcp:0:5980

2. Run the python script provided by you on host A:
[root@ibm-p8-firestone-01 script]# python test.py
Press Enter to continue...
(I didn't press Enter)

3. In guest
# dd if=/dev/zero of=./2G bs=1M count=2000
# cat 2G > /dev/vport1p1

4. Do migrate on Host A (hmp)
(qemu) migrate -d tcp:10.19.112.10:5980

5. While migration process finished,
press "Enter" to trigger the receive action in python scripts. And wait for a long time when the file1's size doesn't increase, then terminate post-migration vm in Host A with "quit" command in hmp. 6. in Host B, get the data through serial socket. # nc -U /root/test_home/liuzt/Manuall_test/virtio-serail/serial-socket1 > file2 till the process(cat 2G>/dev/vport1p1) in guest got finished. 7." Ctrl + c " to terminate nc process in Host B. 8. scp file1 from Host A to Host B, and do data connection. cat file1 file2 > file3 9. compare md5sum [root@dhcp71-153 ~]# md5sum 2G f3444be3893c1b9b6fa9dd1acde671a9 2G [root@ibm-p8-rhevm-18 ~]# md5sum file3 268542feafe56109e1167d747b15fa66 file3 I am not sure where should the problem should be. Any ideas about this ? the file size: [root@ibm-p8-rhevm-18 ~]# ls -sl file* 4224 -rw-r--r--. 1 root root 4325376 Mar 18 02:47 file1 2040064 -rw-r--r--. 1 root root 2089025536 Mar 18 02:46 file2 2044288 -rw-r--r--. 1 root root 2093350912 Mar 18 02:47 file3 [root@dhcp71-153 ~]# ls -sl 2G 2048000 -rw-r--r--. 1 root root 2097152000 Mar 18 14:25 2G (In reply to Zhengtong from comment #22) > the file size: > [root@ibm-p8-rhevm-18 ~]# ls -sl file* > 4224 -rw-r--r--. 1 root root 4325376 Mar 18 02:47 file1 > 2040064 -rw-r--r--. 1 root root 2089025536 Mar 18 02:46 file2 > 2044288 -rw-r--r--. 1 root root 2093350912 Mar 18 02:47 file3 I have: -rw-r--r--. 1 root root 4325376 18 mars 06:41 file1 -rw-r--r--. 1 lvivier users 2092826624 18 mars 06:41 file2 -rw-r--r--. 1 lvivier users 2097152000 18 mars 06:41 file3 So the problem seems to be with file2, the file generated by the "nc" process. I've been able to reproduce [1] this on POWER _and_ on x86_64 with upstream QEMU (2538039 + [2]) , so change "architecture" flag to "all". [1] as described in comment 21 and comment 22 [2] https://patchwork.ozlabs.org/patch/599628/ I've added some traces in hw/char/virtio-console.c:flush_buf(). What happens is: - on guest A side, data are consumed only when the flush.py is connected (not reading), - on guest B side, just after the end of migration and while nothing is connected ("nc" is not started), data are consumed by qemu_chr_fe_write() and lost. Exactly 4194304 bytes (4 MiB). Then, when we start "nc" remaining data are read until the end and written in the file. The data are lost in tcp_chr_write(). If the socket is not connected, the data are trashed: static int tcp_chr_write(CharDriverState *chr, const uint8_t *buf, int len) ... if (s->connected) { ... } else { /* XXX: indicate an error ? */ return len; } } On migration, the guest B should check if data can be sent to the front-end before trying to write them (qemu_chr_fe_write()). Amit, any idea how to do that? (In reply to Laurent Vivier from comment #26) > The data are lost in tcp_chr_write(). > > If the socket is not connected, the data are trashed: > > static int tcp_chr_write(CharDriverState *chr, const uint8_t *buf, int len) > ... > if (s->connected) { > ... > } else { > /* XXX: indicate an error ? */ > return len; > } > } > > On migration, the guest B should check if data can be sent to the front-end > before trying to write them (qemu_chr_fe_write()). > > Amit, any idea how to do that? To test this, how about dropping 'nowait' from the chardev attributes? That way, qemu will only accept incoming connection after we connect the socket. Then, if the guest does a write, we will have all the data the guest sent. 
What could also be happening is this:

If the dest host socket is not connected, upon migration end, the 'host_connected' property of the virtio-serial port becomes false (hw/char/virtio-serial-bus.c:668). And so, any data that is sent by the guest is discarded (hw/char/virtio-serial-bus.c:478). I think the only way to mitigate this would be to ensure we don't use 'nowait' on the dest.

Another interesting experiment would be to try this using libvirt. libvirt will do all the setup before accepting migration, so we won't actually see this there.

(In reply to Amit Shah from comment #27)
> (In reply to Laurent Vivier from comment #26)
> > The data are lost in tcp_chr_write().
> >
> > If the socket is not connected, the data are trashed:
> >
> > static int tcp_chr_write(CharDriverState *chr, const uint8_t *buf, int len)
> > ...
> >     if (s->connected) {
> >         ...
> >     } else {
> >         /* XXX: indicate an error ? */
> >         return len;
> >     }
> > }
> >
> > On migration, the guest B should check if data can be sent to the front-end
> > before trying to write them (qemu_chr_fe_write()).
> >
> > Amit, any idea how to do that?
>
> To test this, how about dropping 'nowait' from the chardev attributes? That
> way, qemu will only accept incoming connection after we connect the socket.
> Then, if the guest does a write, we will have all the data the guest sent.

If we remove the 'nowait', we can't migrate as the guest B is waiting for the connection and is not ready to receive migration data.

And in fact, data are lost even if there is a "reader" on the socket of the guest B (with or without "nowait") while the migration is being done. The only case we don't lose data during migration is if the socket is not opened by a reader on guest A.

(In reply to Laurent Vivier from comment #28)
> (In reply to Amit Shah from comment #27)
> > (In reply to Laurent Vivier from comment #26)
> > > The data are lost in tcp_chr_write().
> > >
> > > If the socket is not connected, the data are trashed:
> > >
> > > static int tcp_chr_write(CharDriverState *chr, const uint8_t *buf, int len)
> > > ...
> > >     if (s->connected) {
> > >         ...
> > >     } else {
> > >         /* XXX: indicate an error ? */
> > >         return len;
> > >     }
> > > }
> > >
> > > On migration, the guest B should check if data can be sent to the front-end
> > > before trying to write them (qemu_chr_fe_write()).
> > >
> > > Amit, any idea how to do that?
> >
> > To test this, how about dropping 'nowait' from the chardev attributes? That
> > way, qemu will only accept incoming connection after we connect the socket.
> > Then, if the guest does a write, we will have all the data the guest sent.
>
> If we remove the 'nowait', we can't migrate as the guest B is waiting the
> connection and is not ready to receive migration data.
>
> And in fact, data are lost even if there is a "reader" on the socket of the
> guest B (with or without "nowait") while the migration is being done. The
> only case we don't loose data during migration is if socket is not opened by
> a reader on guest A.

A hint can be given by what happens with qemu upstream when a reader is already connected on the guest B socket: while we lose some data on migration with qemu-kvm-rhev, we have a crash of guest B in the case of upstream qemu (de1d099):

virtio: unexpected memory split

Paolo, as this error is raised by code you have added [1], do you understand what happens?
[1] commit 3b3b0628217e2726069990ff9942a5d6d9816bd7
Author: Paolo Bonzini <pbonzini>
Date: Sun Jan 31 11:29:01 2016 +0100

    virtio: slim down allocation of VirtQueueElements

    Build the addresses and s/g lists on the stack, and then copy them to
    a VirtQueueElement that is just as big as required to contain this
    particular s/g list. The cost of the copy is minimal compared to that
    of a large malloc.

    When virtqueue_map is used on the destination side of migration or on
    loadvm, the iovecs have already been split at memory region boundary,
    so we can just reuse the out_num/in_num we find in the file.

Wow, so I have actually improved the situation! :-O

The "memory split" error means that a descriptor spans multiple non-adjacent RAM regions. It means that the memory map is somehow different between the source and destination.

Here is how VirtQueueElements are built and migrated.

The source uses virtqueue_map_desc to convert a descriptor (which is a s/g entry, i.e. address+length) into a series of "struct iov". The iov_base field is a host address, thus it differs between source and destination. But we cannot just rebuild the array of "struct iov" because the device might be storing indices in the source's array. In fact this is true of virtio-serial, which migrates an (iov_idx, iov_offset) pair together with each VirtQueueElement!

So the destination uses virtqueue_map_iovec to fill in that field. virtqueue_map_iovec takes the base address from the descriptors and the length from the "struct iov"s. It then uses cpu_physical_memory_map to retrieve the iov_base. If cpu_physical_memory_map returns a different length than the iov_len, it means that it behaved differently between virtqueue_map_desc and virtqueue_map_iovec. That's a "memory split" and, well, it's unexpected. :)

Before I introduced this new algorithm, virtio would rebuild the entire iovec. I introduced the check almost as an assertion failure, to simplify the loading process, and it would be in principle possible to rewrite qemu_get_virtqueue_element to deal with a memory split. But then we'd go back to the situation where iov_idx and iov_offset now point in the woods: the failure of this check can definitely explain the older corruption and I wonder if there are weird security consequences.

The new algorithm is not _necessary_. We can remove the error and fix virtio-serial: use the saved VirtQueueElement to map (iov_idx,iov_offset) to a byte offset, and reconstruct a new index/offset pair as it is loaded. But this would not be too easy and we might have other bugs looming in other virtio devices. So I would keep the check and appreciate the unexpected checking it provides.

This is also because, while really weird, the failure should be easy to debug. Just call "info mtree" on the source; on the destination, invoke "call mtree_info(fprintf, stdout)" from the debugger when it errors out. You can then correlate.

It might happen because qemu_get_virtqueue_element is called before the memory map has been transferred completely (it is split between multiple devices). One way to fix it would be to move the virtqueue_map call from qemu_get_virtqueue_element to virtio_vmstate_change.

(In reply to Paolo Bonzini from comment #30)
> Wow, so I have actually improved the situation! :-O

Thank you for your help and this detailed explanation.

Sadly, I'm not able to reproduce the crash anymore (on x86_64 or ppc64le), even with the same commit number...

The problem looks like a flow control problem on the virtio-serial port, rather than a migration one.
It appears with migration because migration induces latency in the flow.

I've written a little C program to see which data are lost:

/* cc -o gen gen.c */
#include <unistd.h>

static unsigned buf[1048576]; /* 1 MiB */

int main(void)
{
    unsigned i, j;

    for (i = 0; i < 2000; i++) { /* 2000 MiB */
        for (j = 0; j < 1048576 / sizeof(unsigned); j++) {
            buf[j] = i * 1048576 / sizeof(unsigned) + j;
        }
        write(STDOUT_FILENO, buf, 1048576);
    }
    return 0;
}

But as this program runs much faster than the "cat 2g" (x6), we have corrupted data after the first block (a block has the size of the I/O buffer), i.e. after the first 32768 bytes.

on the host:
$ sudo nc -U /tmp/serial_socket_1 > file1

on the guest:
$ cc -o gen gen.c
$ time ./gen >/dev/vport1p1
real 0m1.353s
user 0m1.343s
sys 0m0.009s

Compare result on the host:
$ ./gen > 2g
$ ls -l 2g file1
... 2097152000 ... 2g
... 65536000 ... file1
$ cmp 2g file1
2g file1 differ: byte 32770, line 289

FYI:
time cat 2g > /dev/vport1p1
real 0m8.007s
user 0m0.015s
sys 0m1.281s

Amit, do you know if this behavior is normal? Should we expect to have flow control?

That's very interesting.

My first reaction to this was that it's a guest bug. Trying it out, it's suspicious that my file size too is 65536000.

In the guest:

# cat /sys/kernel/debug/virtio-ports/vport0p5
name:
guest_connected: 0
host_connected: 1
outvq_full: 0
bytes_sent: 65536000
bytes_received: 6564
bytes_discarded: 0
is_console: no
console_vtermno: 1

which shows that indeed the guest has only written out those many bytes to the port, and qemu has written them all out to the socket. So I think we can, for the purpose of this test, say qemu is not at fault.

Now on to the guest: there's no reason for the guest to drop any buffers, so I was quite sure that write() returns an error when the vq is full, and we aren't retrying those writes. So I added a do..while loop to retry on error, but the result was the same. I then checked the return value from write(), and found the culprit: we only write 32768 bytes per write(), and this program drops the remaining bytes. 32768 * 2000 = 65536000, so this explains it.

It's legal for the kernel to do short writes, and the userspace app has to ensure it writes everything else, so we're within our rights to do that in the guest kernel. If we didn't do that, we'd have to allocate huge buffers in the kernel to do our writes, which isn't always a good idea. I think 32K per call is fine, but if someone thinks we should do better, I'm all ears.

However, that's a different problem.

Thank you Amit.

To be sure we manage the "write()" stuff correctly, I dump the result of "./gen" into a file, and then send it to /dev/vport1p1 with cat.

I test the case where the "nc" is already attached to guest 2 before the migration.

The result is now:

... 4423680 ... file1
... 2096922624 ... file2
... 2101346304 ... file3

$ cmp 2g file1
cmp: EOF on file1

So file1 has been transferred correctly. We can check the end of the file:

$ hexdump -e '"%07.7_ax " 4/4 "%08x " "\n"' file1|tail -2
0437fe0 0010dff8 0010dff9 0010dffa 0010dffb
0437ff0 0010dffc 0010dffd 0010dffe 0010dfff

and 10dffc * 4 = 437ff0.
But if we check the start of file2, the result is very strange:

$ hexdump -e '"%07.7_ax " 4/4 "%08x " "\n"' file2|head -2
0000000 0000e000 0000e001 0000e002 0000e003
0000010 0000e004 0000e005 0000e006 0000e007

we should have, instead of e000: (437ff0 + 10) / 4 = 10e000

But the end of the file is OK:

$ hexdump -e '"%07.7_ax " 4/4 "%08x " "\n"' 2g|tail -2
7cffffe0 1f3ffff8 1f3ffff9 1f3ffffa 1f3ffffb
7cfffff0 1f3ffffc 1f3ffffd 1f3ffffe 1f3fffff

$ hexdump -e '"%07.7_ax " 4/4 "%08x " "\n"' file2|tail -2
7cfc7fe0 1f3ffff8 1f3ffff9 1f3ffffa 1f3ffffb
7cfc7ff0 1f3ffffc 1f3ffffd 1f3ffffe 1f3fffff

and:

$ ls -l 2g file3
... 2097152000 ... 2g
... 2101346304 ... file3

and 2101346304 - 2097152000 = 0x400000, 0x400000 / 4 = 0x100000, and so it looks like the second guest re-emits the 0x40000 last bytes of guest 1.

Hi Zhengtong,

I've tried to reproduce this issue with latest qemu [1] on ppc64le, but the problem seems fixed. Could you try to reproduce it on your own, please?

[1] qemu-kvm-rhev-2.6.0-8.el7.ppc64le
    kernel-3.10.0-445.el7.ppc64le

Hi, Laurent, I will give the result next week.

hi Laurent, I tried in my test environment here. The result seems no different: the problem is still there. I tested following the steps in comment #22. Still, the md5sum is not the same.

Guest:
[root@dhcp113-129 ~]# md5sum 2G
f3444be3893c1b9b6fa9dd1acde671a9 2G

Host2:
[root@dhcp113-127 home]# md5sum file3
268542feafe56109e1167d747b15fa66 file3
[root@dhcp113-127 home]# ls -lS
total 4088580
-rw-r--r--. 1 root root 2093350912 Jun 27 02:05 file3
-rw-r--r--. 1 root root 2089025536 Jun 27 02:04 file2
-rw-r--r--. 1 root root    4325376 Jun 27 02:05 file1

Both hosts kernel: 3.10.0-445.el7.ppc64le
qemu-kvm-rhev-2.6.0-8.el7

Have you tried with the steps in comment #22 . Is the steps in comment #22 is correct , laurent ?

(In reply to Zhengtong from comment #37)
> Have you tried with the steps in comment #22 . Is the steps in comment #22
> is correct , laurent ?

Yes / Yes

But it seems the probability of hitting the problem is not 100%.

I'm able to reproduce it, thanks.

Data are lost when the transfer is resumed on the destination guest.

Data seem to be lost in hw/char/virtio-serial-bus.c:handle_output() because port->host_connected is false and virtqueue elements are dropped.

Data are dropped since this commit:

commit 3ecb45f893d09a97b8f24399b5e40808a708261f
Author: Amit Shah <amit.shah>
Date: Tue Apr 27 18:04:04 2010 +0530

    virtio-serial: Send out guest data to ports only if port is opened

    Data should be written only when ports are open.

    Signed-off-by: Amit Shah <amit.shah>
    Signed-off-by: Anthony Liguori <aliguori.com>

diff --git a/hw/virtio-serial-bus.c b/hw/virtio-serial-bus.c
index 3a09f0d..6befd4d 100644
--- a/hw/virtio-serial-bus.c
+++ b/hw/virtio-serial-bus.c
@@ -335,6 +335,11 @@ static void handle_output(VirtIODevice *vdev, VirtQueue *vq)
             goto next_buf;
         }

+        if (!port->host_connected) {
+            ret = 0;
+            goto next_buf;
+        }
+
         /*
          * A port may not have any handler registered for consuming the
          * data that the guest sends or it may not have a chardev associated

And cleaned up by:

a69c760 virtio-serial: Discard data that guest sends us when ports aren't connected

Amit, do you think we should not drop data, or is this behavior normal (we can lose data if the host is not ready)?
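
As an aside to the short-write behaviour discussed a few comments above (each write() to the virtio-serial port may accept only 32768 bytes, and the sender must retry the remainder), the userspace fix is simply to loop until the whole buffer has been written. A sketch in Python, assuming the /dev/vport1p1 node and the 2g file used in these tests:

import os

def write_all(fd, data):
    # os.write() may perform a short write; keep writing the rest
    # until the whole buffer has been handed to the port.
    while data:
        written = os.write(fd, data)
        data = data[written:]

fd = os.open("/dev/vport1p1", os.O_WRONLY)
with open("2g", "rb") as src:
    while True:
        chunk = src.read(1 << 20)  # 1 MiB at a time
        if not chunk:
            break
        write_all(fd, chunk)
os.close(fd)

This is effectively what "cat" does, which is why dumping the generated data to a file and sending it with cat (as done above) avoids the truncation.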
I think what happens is:

1- the kernel of the source guest fills the virtqueue with data to transfer (because host_connected == true)
2- the guest is migrated to the destination guest
3- qemu reads data from the virtqueue, and as the port is not connected to anything (host_connected == false), the data are discarded.

I don't think there is any solution to this bug:

1- the source guest virtio-serial port is hanging because the port is connected but data are not read,

2- the destination guest virtio-serial port cannot flush its data because there is nothing connected to the port,

I think this configuration is not a viable one because the guests' configuration is not symmetric: if there is a consumer connected to the source guest port, we should have a consumer connected to the destination guest port.

[It's like migrating from a source guest with a disk backend to a destination guest without a disk backend. Is it possible?]

I don't know if we can apply this to the virtio-serial port, but normally a serial port is not reliable: data can be lost... and this is what happens in this case.

I propose to close this BZ as WONTFIX.

Any comments?

(In reply to Laurent Vivier from comment #42)
> I don't think there is any solution to this bug:
>
> 1- the source guest virtio-serial port is hanging because the port is
> connected
> but data are not read,
>
> 2- the destination guest virtio-serial port cannot flush its data because
> there
> is nothing connected to the port,
>
> I think this configuration is a not a viable one because the guests
> configuration is not symmetric: if there is a consumer connected to the
> source guest port, we should have a consumer connected to the destination
> guest port.

Right. If there were a consumer on the host on the dest, any data it reads would not be consistent unless the consumer on the src had read everything out from its buffers (i.e. any buffering the src Linux does -- not the virtqueue buffers!).

We could let the data remain in the vq, and we do do that in case the consumer isn't reading -- but as you note, this is an inconsistent configuration on the dest host during migration, and we really can't provide guarantees this way.

> [It's like migrating from a source guest with a disk backend to a
> destination guest without a disk backend. Is it possible?]
>
> I don't know if we can apply this to the virtio-serial port, but normally a
> serial port is not reliable: data can be lost... and this is what happens in
> this case.

It is supposed to be reliable, however since this is an inconsistent config, I agree we could close it as WONTFIX.

Thank you Amit.

So I close this BZ.
Description of problem:
Transferring data between host and guest while doing migration causes the data to be corrupted.

Version-Release number of selected component (if applicable):
Host kernel: 3.10.0-292.el7.ppc64le
qemu: qemu-kvm-rhev-2.3.0-9.el7

How reproducible:
3/3

Steps to Reproduce:
1. Boot up guest with chardev attached

/usr/libexec/qemu-kvm -name liuzt-RHEL-7.1-20150219.1_LE -machine pseries,accel=kvm,usb=off -m 32768 -realtime mlock=off -smp 64,sockets=1,cores=16,threads=4 \
-monitor stdio \
-monitor unix:tt,server,nowait \
-rtc base=localtime,clock=host \
-no-shutdown \
-boot strict=on \
-device usb-ehci,id=usb,bus=pci.0,addr=0x2 \
-device pci-ohci,id=usb1,bus=pci.0,addr=0x1 \
-device spapr-vscsi,id=scsi0,reg=0x1000 \
-drive file=/root/test_home/liuzt/vdisk/rhel_le.img,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none \
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 \
-serial pty \
-device usb-kbd,id=input0 \
-device usb-mouse,id=input1 \
-device usb-tablet,id=input2 \
-vnc 0:16 -device VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x4 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 \
-msg timestamp=on \
-netdev tap,id=hostnet0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
-device spapr-vlan,netdev=hostnet0,id=net0,mac=52:54:00:c4:e7:83,reg=0x2000 \
-device virtio-serial-pci,id=virtio-serial0 \
-chardev socket,path=/root/test_home/liuzt/Manuall_test/virtio-serail/serial-socket1,id=channel0,server,nowait \
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=channel0,name=org.linux-kvm.port.1,id=port1 \
-chardev socket,path=/root/test_home/liuzt/Manuall_test/virtio-serail/serial-socket2,id=channel1,server,nowait \
-device virtserialport,bus=virtio-serial0.0,nr=2,chardev=channel1,name=org.linux-kvm.port.2,id=port2 \
-qmp tcp:0:4444,server,nowait \

2. On the host, open the chardev but don't read from it.

python
Python 2.7.3 (default, Aug 9 2012, 17:23:57)
[GCC 4.7.1 20120720 (Red Hat 4.7.1-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>>
>>> sock = socket.socket(socket.AF_UNIX)
>>> sock.connect("/tmp/foo")
>>>

3. Create a 2G file in the guest
#dd if=/dev/zero of=2g bs=1M count=2000
send data from the guest
#cat 2g >/dev/vport0p1

4. Boot up the destination guest B

/usr/libexec/qemu-kvm -name liuzt-RHEL-7.1-20150219.1_LE -machine pseries,accel=kvm,usb=off -m 32768 -realtime mlock=off -smp 64,sockets=1,cores=16,threads=4 \
-monitor stdio \
-rtc base=localtime,clock=host \
-no-shutdown \
-boot strict=on \
-device usb-ehci,id=usb,bus=pci.0,addr=0x2 \
-device pci-ohci,id=usb1,bus=pci.0,addr=0x1 \
-device spapr-vscsi,id=scsi0,reg=0x1000 \
-drive file=/root/test_home/liuzt/vdisk/rhel_le.img,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none \
-drive file=/root/test_home/liuzt/iso/RHEL-LE-7.1-20150219.1-Server-ppc64le-dvd1.iso,if=none,id=drive-scsi0-0-1-0,readonly=on,format=raw,cache=none \
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 \
-device scsi-cd,bus=scsi0.0,channel=0,scsi-id=1,lun=0,drive=drive-scsi0-0-1-0,id=scsi0-0-1-0,bootindex=2 \
-serial pty \
-device usb-kbd,id=input0 \
-device usb-mouse,id=input1 \
-device usb-tablet,id=input2 \
-vnc 0:17 -device VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x4 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 \
-msg timestamp=on \
-netdev tap,id=hostnet0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
-device spapr-vlan,netdev=hostnet0,id=net0,mac=52:54:00:c4:e7:83,reg=0x2000 \
-device virtio-serial-pci,id=virtio-serial0 \
-chardev socket,path=/root/test_home/liuzt/Manuall_test/virtio-serail/serial-socket1,id=channel0,server,nowait \
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=channel0,name=org.linux-kvm.port.1,id=port1 \
-chardev socket,path=/root/test_home/liuzt/Manuall_test/virtio-serail/serial-socket2,id=channel1,server,nowait \
-device virtserialport,bus=virtio-serial0.0,nr=2,chardev=channel1,name=org.linux-kvm.port.2,id=port2 \
-qmp tcp:0:4454,server,nowait \
-incoming tcp:0:5980

5. Migrate from A to B
(qemu) migrate -d tcp:10.16.67.19:5980

6. Start a listener on B
#nc -U $host_socket > 2g-host

7. Do md5sum for 2g and 2g-host.

Actual results:
md5sum is not the same

Expected results:
md5sum should be the same

Additional info:
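
For reference, the read-out in steps 2 and 6 and the checksum in step 7 can be combined into one small script. This is an illustrative sketch (the socket path and output filename are placeholders, not part of the original report), and as discussed in the comments above it only sees EOF once QEMU closes the socket:

import hashlib
import socket

SOCKET_PATH = "/tmp/serial_socket"   # placeholder: the chardev socket path of the guest
OUT_FILE = "2g-host"

sock = socket.socket(socket.AF_UNIX)
sock.connect(SOCKET_PATH)

md5 = hashlib.md5()
with open(OUT_FILE, "wb") as out:
    while True:
        data = sock.recv(65536)
        if not data:                 # QEMU closed the socket
            break
        out.write(data)
        md5.update(data)

print(md5.hexdigest())               # compare with "md5sum 2g" in the guest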