Bug 713392

Summary:	Increasing migration max_downtime and/or speed causes guest stalls.
Product:	Red Hat Enterprise Linux 5	Reporter:	Mike Cao <bcao>
Component:	kvm	Assignee:	Juan Quintela <quintela>
Status:	CLOSED ERRATA	QA Contact:	Virtualization Bugs <virt-bugs>
Severity:	high	Docs Contact:
Priority:	high
Version:	5.7	CC:	bcao, bgollahe, cww, dyasny, ehabkost, gcosta, iheim, juzhang, knoel, llim, lyarwood, michen, mkalinin, mkenneth, mshao, quintela, syeghiay, tburke, virt-maint, vromanov, xfu
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	kvm-83-239.el5	Doc Type:	Bug Fix
Doc Text:	Due to a regression, when the values for maximum downtime or maximum speed were increased during a migration, the guests experienced heavy stalls and the migration did not finish in a reasonable time. With this update, a patch has been provided and the migration process finishes successfully in the described scenario.	Story Points:	---
Clone Of:	690521	Environment:
Last Closed:	2011-07-21 08:50:05 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	690521
Bug Blocks:	580949, 696155, 707606

Comment 3 Mike Cao 2011-06-22 07:59:13 UTC

I tried several times, with the image located either on an NFS server or on LVM, and still can *not* reproduce the problem. My steps were:

1. Start the guest with -m 1G -smp 4
eg:/usr/libexec/qemu-kvm -m 1G -smp 4,sockets=4,cores=1,threads=1 -name RHEL5u7 -uuid 13bd47ff-7458-a214-9c43-d311ed5ca5a3 -monitor stdio -no-kvm-pit-reinjection -boot c -drive file=/mnt/RHEL5.7-virtio.qcow2,if=virtio,format=qcow2,cache=none,boot=on -net nic,macaddr=54:52:00:52:ed:61,vlan=0,model=virtio -net tap,script=/etc/qemu-ifup,downscript=no,vlan=0 -serial pty -parallel none -usb -vnc :1 -k en-us -vga cirrus -balloon none -M rhel5.6.0 -usbdevice tablet
2.in the guest 
#ping 8.8.8.8 -i 0.1
#stress -c 1 -m 1
3.(qemu)migrate_set_speed 1G
4.(qemu) migrate -d tcp:<hostB>:5888
5. During the migration, I kept performing operations in the guest, e.g. moving the mouse, typing some letters, double-clicking files.

Actual Results:
I tried more than 10 times. During migration, no stalls happened in the guest and no packets were lost by the ping command.

bcao--->Juan
May I ask at which stage you found the stalls: during the migration process, near the end of the migration, or after the migration completed (with the stalls happening on the destination host)?

Best Regards,
Mike

Comment 4 Juan Quintela 2011-06-22 12:46:47 UTC

The test is half wrong; you need to do both:
ping <guest> -> outside
ping <outside> -> guest

With outside -> guest you see the stalls easily. With inside -> outside it only happens sometimes (the guest is paused, after all).

Later, Juan.
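
A minimal sketch of how the outside -> guest ping log could be checked for a stall, assuming iputils-style output in which each reply line carries an icmp_seq=N field. The function name and the sample lines are illustrative and not taken from this report.

```python
import re


def longest_gap(ping_output: str) -> int:
    """Return the longest run of consecutive missing icmp_seq numbers."""
    # Collect the sequence number of every reply that was actually received.
    seqs = [int(m.group(1)) for m in re.finditer(r"icmp_seq=(\d+)", ping_output)]
    longest = 0
    # A jump from seq 2 to seq 27 means replies 3..26 never arrived (24 lost).
    for prev, cur in zip(seqs, seqs[1:]):
        longest = max(longest, cur - prev - 1)
    return longest


# Hypothetical sample: replies 3..26 are missing.
sample = (
    "64 bytes from 192.0.2.10: icmp_seq=1 ttl=64 time=0.41 ms\n"
    "64 bytes from 192.0.2.10: icmp_seq=2 ttl=64 time=0.39 ms\n"
    "64 bytes from 192.0.2.10: icmp_seq=27 ttl=64 time=0.44 ms\n"
)
print(longest_gap(sample))
```

A long run of consecutive missing replies points at a single stall, whereas the same number of losses scattered over the log would not.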

Comment 5 Mike Cao 2011-06-23 02:42:11 UTC

Tried the steps provided by Juan

Actual Results:
from guest side: ping 8.8.8.8 -i 0.2  ----> no packets lost
from host side: ping <guest ip> -i 0.2 ---> 24 packets lost

bcao--->Juan

Does the packet loss above mean I reproduced this issue?

Best Regards,
Mike

Comment 6 Mike Cao 2011-06-23 11:04:36 UTC

Talked with Juan via IRC. The result in comment #5 means the issue was reproduced, because the guest was not able to answer a ping for so long (24*0.2=4.8 sec).

Based on the above, providing qa_ack+.
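
The 24*0.2 estimate can be written as a small helper. This is only a sketch: it assumes all lost replies were consecutive and that the ping interval was honoured exactly, so the result is an approximation of the stall length. The function name is made up for illustration.

```python
def stall_seconds(lost_packets: int, interval_s: float) -> float:
    """Approximate stall length from consecutive lost pings and the -i interval."""
    return lost_packets * interval_s


# Values from comment #5: 24 lost replies at an interval of 0.2 s.
print(f"{stall_seconds(24, 0.2):.1f} s")
```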

Comment 11 Mike Cao 2011-07-07 06:28:25 UTC

Tried on kvm-83-239.el5. I found that Bug 690521 has regressed, which blocks me from verifying this bug.

Steps:
I tried several times, with the image located either on an NFS server or on LVM, and still can *not* reproduce. My steps were:

1. Start the guest with -m 1G -smp 4
eg:/usr/libexec/qemu-kvm -m 1G -smp 4,sockets=4,cores=1,threads=1 -name RHEL5u7 -uuid 13bd47ff-7458-a214-9c43-d311ed5ca5a3 -monitor stdio -no-kvm-pit-reinjection -boot c -drive file=/mnt/RHEL5.7-virtio.qcow2,if=virtio,format=qcow2,cache=none,boot=on -net nic,macaddr=54:52:00:52:ed:61,vlan=0,model=virtio -net tap,script=/etc/qemu-ifup,downscript=no,vlan=0 -serial pty -parallel none -usb -vnc :1 -k en-us -vga cirrus -balloon none -M rhel5.6.0 -usbdevice tablet
2.in the guest 
#ping 8.8.8.8 -i 0.1
#stress -c 1 -m 1
3.(qemu)migrate_set_speed 1G
4.(qemu) migrate -d tcp:<hostB>:5888

Actual Results:
I waited for more than 30 minutes and the migration never finished, so I cannot verify this bug.

Comment 12 Mike Cao 2011-07-07 06:38:13 UTC

Juan, could you give some suggestions on how to verify this issue without Bug 690521 being fixed?

Best Regards,
Mike

Comment 13 FuXiangChun 2011-07-07 07:32:40 UTC

When doing a local migration, the default migration transfer speed is about 35M/sec.
After changing migrate_set_speed to 1G, the migration transfer speed is about 160M/sec.

default speed info:
 
(qemu) info migrate
Migration status: active
transferred ram: 90881 kbytes
remaining ram: 3993092 kbytes
total ram: 4214796 kbytes
QEMU 0.9.1 monitor - type 'help' for more information

(qemu) info migrate
Migration status: active
transferred ram: 127135 kbytes
remaining ram: 3956908 kbytes
total ram: 4214796 kbytes
QEMU 0.9.1 monitor - type 'help' for more information

(qemu) info migrate
Migration status: active
transferred ram: 179874 kbytes
remaining ram: 3904272 kbytes
total ram: 4214796 kbytes

QEMU 0.9.1 monitor - type 'help' for more information
(qemu) info migrate
Migration status: active
transferred ram: 212834 kbytes
remaining ram: 3871376 kbytes
total ram: 4214796 kbytes

QEMU 0.9.1 monitor - type 'help' for more information
(qemu) info migrate
Migration status: active
transferred ram: 245791 kbytes
remaining ram: 3838484 kbytes
total ram: 4214796 kbytes

QEMU 0.9.1 monitor - type 'help' for more information
(qemu) info migrate
Migration status: active
transferred ram: 291936 kbytes
remaining ram: 3792428 kbytes
total ram: 4214796 kbytes

QEMU 0.9.1 monitor - type 'help' for more information
(qemu) info migrate
Migration status: active
transferred ram: 324897 kbytes
remaining ram: 3759532 kbytes
total ram: 4214796 kbytes

QEMU 0.9.1 monitor - type 'help' for more information
(qemu) info migrate
Migration status: active
transferred ram: 361151 kbytes
remaining ram: 3723348 kbytes
total ram: 4214796 kbytes

QEMU 0.9.1 monitor - type 'help' for more information
(qemu) info migrate
Migration status: active
transferred ram: 397149 kbytes
remaining ram: 3634812 kbytes
total ram: 4214796 kbytes


after setting migrate speed 1G, migration info:

(qemu) info migrate
Migration status: active
transferred ram: 782433 kbytes
remaining ram: 3237796 kbytes
total ram: 4214796 kbytes

QEMU 0.9.1 monitor - type 'help' for more information
(qemu) info migrate
Migration status: active
transferred ram: 944833 kbytes
remaining ram: 3074260 kbytes
total ram: 4214796 kbytes

QEMU 0.9.1 monitor - type 'help' for more information
(qemu) info migrate
Migration status: active
transferred ram: 1165022 kbytes
remaining ram: 2854524 kbytes
total ram: 4214796 kbytes

QEMU 0.9.1 monitor - type 'help' for more information
(qemu) info migrate
Migration status: active
transferred ram: 1301184 kbytes
remaining ram: 2718660 kbytes
total ram: 4214796 kbytes

QEMU 0.9.1 monitor - type 'help' for more information
(qemu) info migrate
Migration status: active
transferred ram: 1456158 kbytes
remaining ram: 2564552 kbytes
total ram: 4214796 kbytes

QEMU 0.9.1 monitor - type 'help' for more information
(qemu) info migrate
Migration status: active
transferred ram: 1596031 kbytes
remaining ram: 2424972 kbytes
total ram: 4214796 kbytes

QEMU 0.9.1 monitor - type 'help' for more information
(qemu) info migrate
Migration status: active
transferred ram: 1749662 kbytes
remaining ram: 2271892 kbytes
total ram: 4214796 kbytes

QEMU 0.9.1 monitor - type 'help' for more information
(qemu) info migrate
Migration status: active
transferred ram: 1898016 kbytes
remaining ram: 2124208 kbytes
total ram: 4214796 kbytes
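
A rough sketch of how the two figures above can be checked against the "transferred ram" samples. The report does not say how often "info migrate" was issued, so the sketch only computes the average growth per sample; reading that as a per-second rate assumes roughly one sample per second. The function names are illustrative.

```python
import re
from typing import List


def transferred_kbytes(monitor_log: str) -> List[int]:
    """Extract every 'transferred ram' value (in kbytes) from pasted monitor output."""
    return [int(v) for v in re.findall(r"transferred ram:\s*(\d+)\s*kbytes", monitor_log)]


def mean_delta_kbytes(samples: List[int]) -> float:
    """Average growth in transferred kbytes between consecutive samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    return (samples[-1] - samples[0]) / (len(samples) - 1)


# 'transferred ram' values copied from the two listings in this comment.
default_run = [90881, 127135, 179874, 212834, 245791, 291936, 324897, 361151, 397149]
fast_run = [782433, 944833, 1165022, 1301184, 1456158, 1596031, 1749662, 1898016]

print(mean_delta_kbytes(default_run))  # 38283.5 kbytes per sample
print(mean_delta_kbytes(fast_run))     # 159369.0 kbytes per sample
```

Under the one-sample-per-second assumption these averages are in line with the quoted figures of about 35M/sec and about 160M/sec.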

Comment 14 Dor Laor 2011-07-07 09:39:16 UTC

(In reply to comment #12)
> Juan ,Could you gave some suggestions how to verify this issue without Bug
> 690521 fixed ?
> 
> Best Regards,
> Mike

Does 'https://bugzilla.redhat.com/show_bug.cgi?id=690521#c70' help?

Comment 15 Mike Cao 2011-07-07 09:55:10 UTC

(In reply to comment #14)
> (In reply to comment #12)
> > Juan ,Could you gave some suggestions how to verify this issue without Bug
> > 690521 fixed ?
> > 
> > Best Regards,
> > Mike
> 
> Does 'https://bugzilla.redhat.com/show_bug.cgi?id=690521#c70' help?

I am afraid not. The reason is as follows:

This bug was mainly about *increasing the maximum migration speed* causing the migration downtime to increase as well (in my results it increased to 4s), which is the reason the customer's application failed.

I cannot verify this bug by increasing the maximum migration downtime: the migration may finish, but that is not the original bug.

In fact, QE found that after running (qemu) migrate_set_speed 1G, the speed increase on kvm-239 was much smaller than on kvm-238.

Mike

Comment 16 Dor Laor 2011-07-07 10:46:39 UTC

(In reply to comment #15)
> (In reply to comment #14)
> > (In reply to comment #12)
> > > Juan ,Could you gave some suggestions how to verify this issue without Bug
> > > 690521 fixed ?
> > > 
> > > Best Regards,
> > > Mike
> > 
> > Does 'https://bugzilla.redhat.com/show_bug.cgi?id=690521#c70' help?
> 
> I am afraid not , following is the reason :
> 
> this bug was mainly about *increasing migration_max_speed* costs migration
> downtime also increased itself(from my result ,it increase to 4s) ,that't the
> reason  cause customers application failed .
> 
> If I can not verify this Bug via increase migration_down_time.migration may
> finish ,but it is not the original Bug.
> 
> Actually ,QE really found that ,doing (qemu)migration_set_speed 1G  the speed
> increased on kvm-239 was much smaller than that on kvm-238

That's because we reverted a buggy change. Before that revert, the bandwidth and the migration convergence looked fine, but when you expected a downtime of 0.1s you actually got several seconds of downtime.

With the patch reverted, a migration configured with 0.1s takes a lot of time because we're accurate. If you want to achieve similar convergence/bandwidth, you'll need to increase the configured values.

> 
> Mike
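
A minimal sketch of why an accurate downtime estimate slows convergence. It assumes a generic pre-copy model in which the final stop-and-copy phase starts only once the remaining dirty memory can be sent within the configured maximum downtime. The function names and numbers are illustrative and are not taken from the kvm-83 sources.

```python
def expected_downtime_s(remaining_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Time needed to send the remaining dirty memory while the guest is paused."""
    return remaining_bytes / bandwidth_bytes_per_s


def can_finish(remaining_bytes: int, bandwidth_bytes_per_s: float, max_downtime_s: float) -> bool:
    """True when pausing the guest now would respect the configured maximum downtime."""
    return expected_downtime_s(remaining_bytes, bandwidth_bytes_per_s) <= max_downtime_s


MIB = 2 ** 20

# Hypothetical numbers: 400 MiB still dirty at 100 MiB/s would pause the guest
# for about 4 s, so a 0.1 s limit forces the migration to keep iterating.
print(can_finish(400 * MIB, 100 * MIB, 0.1))  # False
# With only 5 MiB left, the pause is about 0.05 s and the migration may complete.
print(can_finish(5 * MIB, 100 * MIB, 0.1))    # True
```

In this model, a busy guest that keeps dirtying memory may never get below the threshold at 0.1 s, which is consistent with the advice to raise the configured values when faster convergence is wanted.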

Comment 19 Tomas Capek 2011-07-19 09:00:16 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Due to a regression, when the values for maximum downtime or maximum speed were increased during a migration, the guests experienced heavy stalls and the migration did not finish in a reasonable time. With this update, a patch has been provided and the migration process finishes successfully in the described scenario.

Comment 20 errata-xmlrpc 2011-07-21 08:50:05 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1068.html

Comment 21 errata-xmlrpc 2011-07-21 11:49:42 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-1068.html