Bug 1441684

Summary:	Re-enable op blocker assertions
Product:	Red Hat Enterprise Linux 7	Reporter:	Kevin Wolf <kwolf>
Component:	qemu-kvm-rhev	Assignee:	Kevin Wolf <kwolf>
Status:	CLOSED ERRATA	QA Contact:	xianwang <xianwang>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	7.3	CC:	chayang, coli, dgilbert, hannsj_uhl, hreitz, juzhang, kwolf, michen, mrezanin, ngu, qzhang, virt-maint
Target Milestone:	rc
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	qemu-kvm-rhev-2.10.0-1.el7	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1452148		Environment:
Last Closed:	2018-04-11 00:16:25 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1452148
Bug Blocks:

Description Kevin Wolf 2017-04-12 13:13:49 UTC

In commit e3e0003a, upstream qemu disabled the op blocker assertions for the
2.9 release because some bugs could not be fixed in time. After rebasing to
2.9, we'll want to revert the commit and include proper fixes for the bugs.
Without the bugs fixed, op blockers can't keep the promises they are making.

Known problems with op blockers so far that need to be fixed before the commit
can be safely reverted:

* Old style block migration (migrate -b) triggers an assertion because it
  reuses the guest device's BlockBackend. During migration, this BlockBackend
  is not ready to be used yet (its real permissions are only enabled in
  blk_resume_after_migration() immediately before the guest starts to run).
  Block migration needs to use its own BlockBackend here.

* Postcopy migration. Commit d35ff5e6 added blk_resume_after_migration() in two
  places, but postcopy migration uses loadvm_postcopy_handle_run_bh(), which is
  the third one. In order to avoid assertion failures, the call needs to be
  added there as well. Without this fix, the guest device's op blockers are
  ineffective after postcopy migration.

Comment 1 juzhang 2017-06-08 11:05:38 UTC

Hi Cong,

Feel free to update the QE contact.

Comment 2 Kevin Wolf 2017-10-09 08:08:59 UTC

This is fixed in upstream qemu 2.10.

Postcopy migration was fixed with commit 0042fd36.
Old-style block migration was fixed with the series leading to commit 49695eeb.
Assertions were re-enabled in commit 362b3786.

Comment 4 xianwang 2017-12-13 07:34:05 UTC

Hi Kevin,
Could you please provide steps to verify this bug? Thanks.

Comment 5 Kevin Wolf 2017-12-13 09:07:36 UTC

Please just verify that old style block migration (migrate -b) and postcopy migration are working and not causing any assertion failure.

Comment 6 Dr. David Alan Gilbert 2017-12-13 10:39:14 UTC

Kevin: Note that on RHEL 7 we disable outgoing old-style block migration anyway.

Comment 7 xianwang 2017-12-14 07:03:42 UTC

As Dave said, 'migrate -b' (old style block migration) is not supported now, as shown below:
(qemu) migrate -b -d tcp:10.16.47.10:5801 
migrate: unsupported option -d

After confirming with Kevin, this bug only needs to be tested with general postcopy migration. The bug passes verification on qemu-kvm-rhev-2.10.0-12.el7, as follows:
Host:
3.10.0-820.el7.ppc64le
qemu-kvm-rhev-2.10.0-12.el7.ppc64le
SLOF-20170724-5.git89f519f.el8.ppc64le

Guest:
3.10.0-800.el7.ppc64le

1. Boot a guest on the src host with this qemu cli:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1'  \
    -sandbox off  \
    -nodefaults  \
    -machine pseries-rhel7.5.0 \
    -vga std \
    -uuid 8aeab7e2-f341-4f8c-80e8-59e2968d85c2 \
    -device virtio-serial-pci,id=virtio_serial_pci0,bus=pci.0,addr=03 \
    -device virtio-scsi-pci,id=scsi1,bus=pci.0,addr=0x4 \
    -device spapr-vscsi,id=scsi2 \
    -chardev socket,id=console0,path=/tmp/console0,server,nowait \
    -device spapr-vty,chardev=console0 \
    -device nec-usb-xhci,id=usb1,bus=pci.0,addr=05 \
    -drive file=/home/rhel75-ppc64le-virtio-scsi.qcow2,format=qcow2,if=none,cache=none,id=drive_blk1,werror=stop,rerror=stop \
    -device virtio-blk-pci,drive=drive_blk1,id=blk-disk1,bootindex=0,bus=pci.0,addr=06 \
    -drive file=/home/r1.qcow2,format=qcow2,if=none,cache=none,id=drive_data1,werror=stop,rerror=stop \
    -device virtio-blk-pci,drive=drive_data1,id=blk-data,bus=pci.0,addr=07 \
    -device virtio-net-pci,mac=9a:7b:7c:7d:7e:72,id=id9HRc5V,vectors=4,netdev=idjlQN53,bus=pci.0,addr=10 \
    -netdev tap,id=idjlQN53,vhost=off,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
    -m 4G \
    -smp 4 \
    -device usb-kbd \
    -device usb-mouse \
    -qmp tcp:0:8881,server,nowait \
    -vnc :1  \
    -msg timestamp=on \
    -rtc base=localtime,clock=vm,driftfix=slew  \
    -monitor stdio \
    -boot order=cdn,once=c,menu=on,strict=off \
    -enable-kvm

In the guest, format the data disk, mount it on /mnt and run "dd":
#mkfs.ext4 /dev/vdb
#mount /dev/vdb /mnt
#while true;do dd if=/dev/zero of=/mnt/file2 bs=10M count=10;done

2. Start qemu in listening mode on the dst host; the qemu cli is the same as on the src host, with "-incoming tcp:0:5801" appended.

3. On the src host, do postcopy migration:
(qemu) migrate_set_capability postcopy-ram on
(qemu) migrate -d tcp:10.16.67.19:5801

4. Result:
Migration completes, the vm works well, and "dd" keeps running in the guest the whole time after postcopy migration. I can also stop "dd" and write to that disk again.

src->dst: postcopy migration succeeds and the vm works well, including writing to disk;
dst->src: postcopy migration succeeds and the vm works well, including writing to disk.

So I think this bug is fixed.
Kevin, do you think this verification is ok? Thanks.

Comment 8 Kevin Wolf 2017-12-14 08:58:19 UTC

Don't you need a "migrate_start_postcopy" command on the source to actually
switch into postcopy mode?

Dave, the important thing is that loadvm_postcopy_handle_run_bh() runs, so that we actually test commit 0042fd36. Is the "migrate_start_postcopy" necessary for that? I assume so, but you can probably say something definite.

Comment 9 Dr. David Alan Gilbert 2017-12-14 09:43:36 UTC

Kevin is correct, that test hasn't actually done postcopy.
You need to:
  a) Start a heavy memory using job in the guest, e.g. the 'stress' command
  b) migrate_set_capability postcopy-ram on
  c) Start the migrate;   migrate -d tcp:host:port
  d) do an 'info migrate' you should see the status as 'active'
  e) migrate_start_postcopy
  f) another 'info migrate' you should see the status as 'postcopy-active'
  g) You should find the destination is responsive
  h) Wait until 'info migrate' returns complete

Include the output of (h) in the results, you should see a count of postcopy-requests.
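
The HMP steps above could also be driven over the QMP socket (the test guest in comment 7 was started with -qmp tcp:0:8881). Below is a minimal Python sketch that only builds the command sequence for steps (b) to (f). The function name and the example URI are hypothetical, and the sketch assumes the QMP command names migrate-set-capabilities, migrate, query-migrate and migrate-start-postcopy. A real driver would also have to wait until query-migrate reports 'active' before sending migrate-start-postcopy.

```python
import json


def postcopy_qmp_commands(uri):
    """Build the QMP commands for steps (b) to (f) (hypothetical helper).

    Assumption: 'uri' is the destination address, e.g. "tcp:host:port".
    """
    return [
        # QMP requires capability negotiation before any other command.
        {"execute": "qmp_capabilities"},
        # (b) migrate_set_capability postcopy-ram on
        {"execute": "migrate-set-capabilities",
         "arguments": {"capabilities": [
             {"capability": "postcopy-ram", "state": True}]}},
        # (c) start the migration
        {"execute": "migrate", "arguments": {"uri": uri}},
        # (d) status is expected to be 'active' here
        {"execute": "query-migrate"},
        # (e) switch to postcopy mode
        {"execute": "migrate-start-postcopy"},
        # (f) status is expected to be 'postcopy-active' here
        {"execute": "query-migrate"},
    ]


if __name__ == "__main__":
    # Print one JSON object per line, the form in which QMP accepts commands.
    for command in postcopy_qmp_commands("tcp:host:port"):
        print(json.dumps(command))
```

The sketch deliberately stops at generating the commands: sending them, skipping asynchronous events and polling for completion as in step (h) are left to the test harness.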

Comment 10 xianwang 2017-12-14 11:02:46 UTC

(In reply to Dr. David Alan Gilbert from comment #9)
> Kevin is correct, that test hasn't actually done postcopy.
> You need to:
>   a) Start a heavy memory using job in the guest, e.g. the 'stress' command
>   b) migrate_set_capability postcopy-ram on
>   c) Start the migrate;   migrate -d tcp:host:port
>   d) do an 'info migrate' you should see the status as 'active'
>   e) migrate_start_postcopy
>   f) another 'info migrate' you should see the status as 'postcopy-active'
>   g) You should find the destination is responsive
>   h) Wait until 'info migrate' returns complete
> 
> Include the output of (h) in the results, you should see a count of
> postcopy-requests.

Hi Kevin and Dave,
I am very sorry. I did follow the steps Dave described, but I forgot to write them in the bug comment. I have run the test again with "stress", and it also passes, as follows:

stress in guest:
# stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M
#while true;do dd if=/dev/zero of=/dev/vdb bs=10M count=10;done

(qemu) migrate_set_capability postcopy-ram on
(qemu) migrate -d tcp:10.16.67.19:5801
(qemu) info migrate
Migration status: active
(qemu) migrate_start_postcopy
(qemu) info migrate
Migration status: postcopy-active
(qemu) info migrate
Migration status: postcopy-active
....
(qemu) info migrate
Migration status: completed

Migration succeeds and the vm works well on the destination.
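
A monitor log like the one above can be checked mechanically for the point raised in comment 8, namely that the migration really entered postcopy before completing. The following is a small Python sketch, assuming each status appears on a line starting with "Migration status:" as in the log above; the helper names are hypothetical.

```python
def statuses_from_monitor_log(text):
    """Collect the values of all 'Migration status:' lines, in order."""
    prefix = "Migration status:"
    return [line[len(prefix):].strip()
            for line in text.splitlines()
            if line.startswith(prefix)]


def went_through_postcopy(statuses):
    """True if 'active', 'postcopy-active' and 'completed' occur in that order.

    Other or repeated statuses in between are allowed.
    """
    expected = ["active", "postcopy-active", "completed"]
    remaining = iter(statuses)
    # Each expected status must be found in what is left of the sequence.
    return all(any(s == want for s in remaining) for want in expected)


SAMPLE_LOG = """\
(qemu) info migrate
Migration status: active
(qemu) migrate_start_postcopy
(qemu) info migrate
Migration status: postcopy-active
(qemu) info migrate
Migration status: completed
"""

if __name__ == "__main__":
    print(went_through_postcopy(statuses_from_monitor_log(SAMPLE_LOG)))
```

A log that goes straight from 'active' to 'completed', as in a plain precopy migration, would be rejected by this check.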

Comment 11 xianwang 2017-12-14 11:06:32 UTC

Kevin and Dave,
So if neither of you has any other concerns and you agree with this verification, I will change this bug to "verified". Thanks.

Comment 12 Dr. David Alan Gilbert 2017-12-14 11:53:51 UTC

As a postcopy test I think that's fine.

Comment 14 errata-xmlrpc 2018-04-11 00:16:25 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1104