Bug 1053699 - Backport Cancelled race condition fixes
Summary: Backport Cancelled race condition fixes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Dr. David Alan Gilbert
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Duplicates: 965991
Depends On:
Blocks:
 
Reported: 2014-01-15 16:34 UTC by Dr. David Alan Gilbert
Modified: 2014-06-18 03:45 UTC (History)
CC List: 9 users

Fixed In Version: qemu-kvm-1.5.3-40.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-06-13 13:13:37 UTC
Target Upstream Version:
Embargoed:



Description Dr. David Alan Gilbert 2014-01-15 16:34:06 UTC
Description of problem:

There are a couple of races around cancelling a migration, already fixed upstream.
Paolo suggests backporting.


Version-Release number of selected component (if applicable):


How reproducible:
Racy (timing-dependent).

Steps to Reproduce:
1. Cancel the migration just as it is about to complete.

Actual results:
The migration can end up in the "cancelled" state even though it actually completed.

Expected results:
A migration that has completed stays in the "completed" state; cancelling it afterwards has no effect.


Additional info:

Upstream fixes:
51cf4c1a99a172679c2949a2d58a2a4ee307b557 - introduce MIG_STATE_CANCELLING state
6f2b811a61810a7fd9f9a5085de223f66b823342 - avoid a bogus COMPLETED->CANCELLED transition

Note that these introduce a new migration state (MIG_STATE_CANCELLING), but it is never exposed to the management stack, which still sees cancelling reported as "active".
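
For illustration only, here is a minimal, self-contained C sketch of the idea behind the two fixes; the state names and helpers are illustrative and are not the actual QEMU patches:

#include <stdio.h>

enum mig_state { MIG_SETUP, MIG_ACTIVE, MIG_CANCELLING, MIG_CANCELLED, MIG_COMPLETED };

/* Cancelling only moves SETUP/ACTIVE to the new intermediate CANCELLING
 * state; a migration that has already COMPLETED is left untouched, so the
 * bogus COMPLETED->CANCELLED transition can no longer happen. */
static void cancel_migration(enum mig_state *state)
{
    if (*state == MIG_SETUP || *state == MIG_ACTIVE) {
        *state = MIG_CANCELLING;   /* later cleanup moves this to CANCELLED */
    }
}

/* The intermediate state is hidden from the management stack: CANCELLING is
 * still reported as "active" until cleanup has finished. */
static const char *state_for_monitor(enum mig_state state)
{
    switch (state) {
    case MIG_SETUP:
    case MIG_ACTIVE:
    case MIG_CANCELLING:
        return "active";
    case MIG_CANCELLED:
        return "cancelled";
    case MIG_COMPLETED:
        return "completed";
    }
    return "unknown";
}

int main(void)
{
    enum mig_state state = MIG_COMPLETED;
    cancel_migration(&state);                               /* no effect */
    printf("after cancel: %s\n", state_for_monitor(state)); /* "completed" */
    return 0;
}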

Comment 1 FuXiangChun 2014-01-16 07:06:53 UTC
Reproduced this bug with:

1. On the source host:
/usr/libexec/qemu-kvm -M pc -cpu SandyBridge -enable-kvm -m 4096 -smp 4,sockets=2,cores=2,threads=1,maxcpus=160 -usb -device usb-tablet,id=input0 -name gpu -uuid 990ea161-6b67-47b2-b803-19fb01d30d30 -rtc base=localtime,clock=host,driftfix=slew -drive file=/mnt/rhel6.5-ga-new.qcow2,if=none,id=drive-virtio-disk,format=qcow2,cache=none,aio=native,werror=stop,rerror=stop -device ide-drive,drive=drive-virtio-disk,id=virtio-disk,bootindex=1 -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -k en-us -boot menu=on -serial unix:/tmp/ttyS0,server,nowait -vnc :5 -monitor stdio -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup -device rtl8139,netdev=hostnet0,id=virtio-net-pci0,mac=00:01:02:B6:41:29,bus=pci.0,addr=0x5

2. On the destination host:

/usr/libexec/qemu-kvm -M pc -cpu SandyBridge -enable-kvm -m 4096 -smp 4,sockets=2,cores=2,threads=1,maxcpus=160 -usb -device usb-tablet,id=input0 -name gpu -uuid 990ea161-6b67-47b2-b803-19fb01d30d30 -rtc base=localtime,clock=host,driftfix=slew -drive file=/mnt/rhel6.5-ga-new.qcow2,if=none,id=drive-virtio-disk,format=qcow2,cache=none,aio=native,werror=stop,rerror=stop -device ide-drive,drive=drive-virtio-disk,id=virtio-disk,bootindex=1 -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -k en-us -boot menu=on -serial unix:/tmp/ttyS0,server,nowait -vnc :5 -monitor stdio -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup -device rtl8139,netdev=hostnet0,id=virtio-net-pci0,mac=00:01:02:B6:41:29,bus=pci.0,addr=0x5 -incoming tcp:0:6666

3. Migrate the guest from the source host to the destination host.

4. Cancel the migration after it has completed.

Result:
qemu-kvm and the guest work well on the destination host, so cancelling fails if the migration has already completed.

Comment 2 Miroslav Rezanina 2014-01-22 07:09:59 UTC
Fix included in qemu-kvm-1.5.3-40.el7

Comment 4 Qunfang Zhang 2014-01-28 08:29:48 UTC
Tested on the unfixed version qemu-kvm-1.5.3-39.el7:

After the migration finished, I cancelled it via the "migrate_cancel" command. The migration status changed to "cancelled".

(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off auto-converge: off zero-blocks: off 
Migration status: completed
total time: 12332 milliseconds
downtime: 37 milliseconds
setup: 6 milliseconds
transferred ram: 392124 kbytes
throughput: 77.40 mbps
remaining ram: 0 kbytes
total ram: 2114264 kbytes
duplicate: 434035 pages
skipped: 0 pages
normal: 96888 pages
normal bytes: 387552 kbytes
(qemu)  
(qemu) migrate_cancel 
(qemu)
(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off auto-converge: off zero-blocks: off 
Migration status: cancelled
total time: 0 milliseconds

===============

On the fixed version qemu-kvm-1.5.3-41.el7:

(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off auto-converge: off zero-blocks: off 
Migration status: completed
total time: 17429 milliseconds
downtime: 38 milliseconds
setup: 6 milliseconds
transferred ram: 656231 kbytes
throughput: 830.65 mbps
remaining ram: 0 kbytes
total ram: 2114264 kbytes
duplicate: 472658 pages
skipped: 0 pages
normal: 162701 pages
normal bytes: 650804 kbytes
(qemu)      
(qemu) migrate_cancel 
(qemu) 
(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off auto-converge: off zero-blocks: off 
Migration status: completed
total time: 17429 milliseconds
downtime: 38 milliseconds
setup: 6 milliseconds
transferred ram: 656231 kbytes
throughput: 830.65 mbps
remaining ram: 0 kbytes
total ram: 2114264 kbytes
duplicate: 472658 pages
skipped: 0 pages
normal: 162701 pages
normal bytes: 650804 kbytes
(qemu) 


Hi Dr. David,

Could you take a look at the above results? Is this the expected behavior, and do I still need to test anything else?

Thanks a lot!
Qunfang

Comment 5 Dr. David Alan Gilbert 2014-01-28 09:15:31 UTC
Hi Qunfang,
  Yes I think those results are correct - you shouldn't be able to cancel a migration that's completed.

I think that's the only case you can easily test; there is another case described here:

http://lists.gnu.org/archive/html/qemu-devel/2013-11/msg00325.html

I think that one is trickier to test: you would need to break the network connection during the migration, cancel the migration, and then try to start another migration within a few seconds. If I understand the original report correctly, it would try to start a new migration while the other one hasn't quite finished being cancelled yet.
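
For illustration only, a minimal, self-contained C sketch of that second case; the names below are illustrative, not the actual QEMU code. The point of the intermediate CANCELLING state is that a new migration is refused until the previous one has completely finished being torn down:

#include <stdbool.h>
#include <stdio.h>

enum mig_state { MIG_NONE, MIG_SETUP, MIG_ACTIVE, MIG_CANCELLING, MIG_CANCELLED, MIG_COMPLETED };

/* A new migration is refused while the previous one is still SETUP, ACTIVE
 * or CANCELLING; only after its cleanup has finished may a new one start,
 * so the new migration cannot race with the old one's teardown. */
static bool can_start_new_migration(enum mig_state state)
{
    return state != MIG_SETUP && state != MIG_ACTIVE && state != MIG_CANCELLING;
}

int main(void)
{
    printf("%d\n", can_start_new_migration(MIG_CANCELLING)); /* 0: still busy */
    printf("%d\n", can_start_new_migration(MIG_CANCELLED));  /* 1: ok to start */
    return 0;
}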

Comment 6 Qunfang Zhang 2014-01-28 09:32:23 UTC
Hi, David

Thanks a lot for the feedback. I will try it as you suggested. Our holiday starts tomorrow for me, so if I cannot finish the test today, I will continue it after the holiday.

Thanks,
Qunfang

Comment 7 Juan Quintela 2014-02-11 17:10:29 UTC
*** Bug 965991 has been marked as a duplicate of this bug. ***

Comment 8 Qunfang Zhang 2014-02-13 11:08:44 UTC
Reproduced the bug on qemu-kvm-10:1.5.3-39.el7. 

1. Boot up the guest on the source host and start qemu-kvm in listening (-incoming) mode on the destination host.

2. Start migration
(qemu) migrate -d tcp:$dst_host_ip:5800

3. After step 2, immediately bring down the destination host network:
# ifdown switch

4. On source host, repeat:
(qemu) info migrate

The output shows that "transferred ram" and "remaining ram" do not change.

5. Cancel the migration and then bring the destination host network back up *immediately*.
On the source host:
(qemu) migrate_cancel
(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off auto-converge: off zero-blocks: off 
Migration status: cancelled


On the destination host:
# ifup switch

6. Once the destination host network is back up, start the migration again at once.

Result:

After step 6, on the source host:

(qemu) 
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff593fc700 (LWP 2270)]
qemu_file_get_error (f=0x0) at /usr/src/debug/qemu-1.5.3/savevm.c:571
571	    return f->last_error;

(gdb) bt
#0  qemu_file_get_error (f=0x0) at /usr/src/debug/qemu-1.5.3/savevm.c:571
#1  0x0000555555705b67 in migration_thread (opaque=0x555555c869c0 <current_migration.27387>)
    at migration.c:609
#2  0x00007ffff604ddf3 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ffff2d5939d in clone () from /lib64/libc.so.6


Sometimes it shows a different log:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff5abff700 (LWP 2330)]
qemu_bh_schedule (bh=0x0) at async.c:103
103	    if (bh->scheduled)

(gdb) bt
#0  qemu_bh_schedule (bh=0x0) at async.c:103
#1  0x0000555555705d93 in migration_thread (opaque=0x555555c869c0 <current_migration.27387>)
    at migration.c:653
#2  0x00007ffff604ddf3 in start_thread () from /lib64/libpthread.so.0
#3  0x00007ffff2d5939d in clone () from /lib64/libc.so.6
(gdb) 

=========================

Verified the bug on qemu-kvm-10:1.5.3-47.el7.

Steps: The same as above.

Result:

After step 5: When the migration is cancelled on the source host side, "info migrate" shows that the migration status is still "active". (The destination host network is down at this moment.)

(qemu) migrate_cancel 
(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off auto-converge: off zero-blocks: off 
Migration status: active
total time: 16984 milliseconds
expected downtime: 30 milliseconds
setup: 2 milliseconds
transferred ram: 140978 kbytes
throughput: 268.57 mbps
remaining ram: 317516 kbytes
total ram: 2114264 kbytes
duplicate: 414471 pages
skipped: 0 pages
normal: 34715 pages
normal bytes: 138860 kbytes

Then bring the destination host network back up, quit the listening-mode qemu command line on the destination host, and start it up again *immediately*.

Then cancel the migration again on the source host side and repeat the migration. All of these actions need to be done continuously and immediately.

(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off auto-converge: off zero-blocks: off 
Migration status: cancelled
total time: 0 milliseconds
(qemu) 
(qemu) migrate -d tcp:10.66.4.229:5800
capabilities: xbzrle: off x-rdma-pin-all: off auto-converge: off zero-blocks: off 
Migration status: active
total time: 1596 milliseconds
expected downtime: 30 milliseconds
setup: 2 milliseconds
transferred ram: 52457 kbytes
throughput: 268.56 mbps
remaining ram: 1934804 kbytes
total ram: 2114264 kbytes
duplicate: 446317 pages
skipped: 0 pages
normal: 47791 pages
normal bytes: 191164 kbytes

(qemu) info migrate
capabilities: xbzrle: off x-rdma-pin-all: off auto-converge: off zero-blocks: off 
Migration status: completed
total time: 16768 milliseconds
downtime: 37 milliseconds
setup: 2 milliseconds
transferred ram: 549678 kbytes
throughput: 268.57 mbps
remaining ram: 0 kbytes
total ram: 2114264 kbytes
duplicate: 839682 pages
skipped: 0 pages
normal: 170991 pages
normal bytes: 683964 kbytes


So, on the latest version, the segmentation fault issue no longer occurs. This bug is fixed.

Comment 9 Ludek Smid 2014-06-13 13:13:37 UTC
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.

