|Summary:||Re-enable op blocker assertions|
|Product:||Red Hat Enterprise Linux 7||Reporter:||Kevin Wolf <kwolf>|
|Component:||qemu-kvm-rhev||Assignee:||Kevin Wolf <kwolf>|
|Status:||CLOSED ERRATA||QA Contact:||xianwang <xianwang>|
|Version:||7.3||CC:||chayang, coli, dgilbert, hannsj_uhl, juzhang, kwolf, michen, mreitz, mrezanin, ngu, qzhang, virt-maint|
|Fixed In Version:||qemu-kvm-rhev-2.10.0-1.el7||Doc Type:||If docs needed, set a value|
|Doc Text:||Story Points:||---|
|:||1452148 (view as bug list)||Environment:|
|Last Closed:||2018-04-11 00:16:25 UTC||Type:||Bug|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Cloudforms Team:||---||Target Upstream Version:|
|Bug Depends On:||1452148|
Description Kevin Wolf 2017-04-12 13:13:49 UTC
In commit e3e0003a, upstream qemu disabled the op blocker assertions for the 2.9 release because some bugs could not be fixed in time. After rebasing to 2.9, we'll want to revert the commit and include proper fixes for the bugs. Without the bugs fixed, op blockers can't keep the promises they are making.

Known problems with op blockers so far that need to be fixed before the commit can be safely reverted:

* Old style block migration (migrate -b) triggers an assertion because it reuses the guest device's BlockBackend. During migration, this BlockBackend is not ready to be used yet (its real permissions are only enabled in blk_resume_after_migration() immediately before the guest starts to run). Block migration needs to use its own BlockBackend here.

* Postcopy migration. Commit d35ff5e6 added blk_resume_after_migration() in two places, but postcopy migration uses loadvm_postcopy_handle_run_bh(), which is the third one. In order to avoid assertion failures, the call needs to be added there as well. Without this fix, the guest device's op blockers are ineffective after postcopy migration.
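The assertions in question enforce QEMU's block-layer permission model: each user of a block node requests the permissions it needs (perm) and declares which permissions it tolerates in other users (shared_perm); two users conflict when one requests a permission the other does not share. As an illustration only, here is a toy Python model (not QEMU code; the constant names merely mirror QEMU's, and the example users are hypothetical) of the kind of conflict the assertion caught when block migration reused the guest device's BlockBackend:

```python
# Toy model of QEMU's block-layer permission checks (illustration only;
# the real logic is C code in block.c / block-backend.c).

BLK_PERM_CONSISTENT_READ = 0x01
BLK_PERM_WRITE = 0x02

class User:
    """One user of a block node, with requested and shared permissions."""
    def __init__(self, name, perm, shared_perm):
        self.name = name
        self.perm = perm                # permissions this user needs
        self.shared_perm = shared_perm  # permissions it allows others to hold

def conflict(a, b):
    """Two users conflict if either requests a permission the other
    does not declare as shared."""
    return bool(a.perm & ~b.shared_perm) or bool(b.perm & ~a.shared_perm)

# A guest device that needs write access and shares only reads conflicts
# with a second writer on the same node; a read-only user that shares
# writes (e.g. a migration job using its own properly configured
# BlockBackend) does not.
guest = User("guest-device",
             BLK_PERM_WRITE | BLK_PERM_CONSISTENT_READ,
             BLK_PERM_CONSISTENT_READ)
second_writer = User("second-writer", BLK_PERM_WRITE, 0)
reader = User("migration-job",
              BLK_PERM_CONSISTENT_READ,
              BLK_PERM_WRITE | BLK_PERM_CONSISTENT_READ)

print(conflict(guest, second_writer))  # conflicting: exclusive write vs write
print(conflict(guest, reader))         # compatible
```

The sketch only shows the pairwise check; QEMU additionally walks the whole graph of users and, with the assertions re-enabled, aborts when an operation is attempted without the matching permission having been granted.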
Comment 1 juzhang 2017-06-08 11:05:38 UTC
Hi Cong, feel free to update the QE contact.
Comment 2 Kevin Wolf 2017-10-09 08:08:59 UTC
This is fixed in upstream qemu 2.10. Postcopy migration was fixed with commit 0042fd36. Old-style block migration was fixed with the series leading to commit 49695eeb. Assertions were re-enabled in commit 362b3786.
Comment 4 xianwang 2017-12-13 07:34:05 UTC
Hi Kevin, could you give steps on how to verify this bug? Thanks
Comment 5 Kevin Wolf 2017-12-13 09:07:36 UTC
Please just verify that old style block migration (migrate -b) and postcopy migration are working and not causing any assertion failure.
Comment 6 Dr. David Alan Gilbert 2017-12-13 10:39:14 UTC
Kevin: note that, this being RHEL 7, we disable outgoing old-style block migration anyway
Comment 7 xianwang 2017-12-14 07:03:42 UTC
As Dave said, 'migrate -b' (old style block migration) is not supported now, as follows:

(qemu) migrate -b -d tcp:10.16.47.10:5801
migrate: unsupported option -d

After confirming with Kevin, this bug just needs a test of general postcopy migration. The bug is verified as PASS on qemu-kvm-rhev-2.10.0-12.el7, as follows:

Host:
3.10.0-820.el7.ppc64le
qemu-kvm-rhev-2.10.0-12.el7.ppc64le
SLOF-20170724-5.git89f519f.el8.ppc64le
Guest:
3.10.0-800.el7.ppc64le

1. Boot a guest on the src host with this qemu command line:
/usr/libexec/qemu-kvm \
-name 'avocado-vt-vm1' \
-sandbox off \
-nodefaults \
-machine pseries-rhel7.5.0 \
-vga std \
-uuid 8aeab7e2-f341-4f8c-80e8-59e2968d85c2 \
-device virtio-serial-pci,id=virtio_serial_pci0,bus=pci.0,addr=03 \
-device virtio-scsi-pci,id=scsi1,bus=pci.0,addr=0x4 \
-device spapr-vscsi,id=scsi2 \
-chardev socket,id=console0,path=/tmp/console0,server,nowait \
-device spapr-vty,chardev=console0 \
-device nec-usb-xhci,id=usb1,bus=pci.0,addr=05 \
-drive file=/home/rhel75-ppc64le-virtio-scsi.qcow2,format=qcow2,if=none,cache=none,id=drive_blk1,werror=stop,rerror=stop \
-device virtio-blk-pci,drive=drive_blk1,id=blk-disk1,bootindex=0,bus=pci.0,addr=06 \
-drive file=/home/r1.qcow2,format=qcow2,if=none,cache=none,id=drive_data1,werror=stop,rerror=stop \
-device virtio-blk-pci,drive=drive_data1,id=blk-data,bus=pci.0,addr=07 \
-device virtio-net-pci,mac=9a:7b:7c:7d:7e:72,id=id9HRc5V,vectors=4,netdev=idjlQN53,bus=pci.0,addr=10 \
-netdev tap,id=idjlQN53,vhost=off,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
-m 4G \
-smp 4 \
-device usb-kbd \
-device usb-mouse \
-qmp tcp:0:8881,server,nowait \
-vnc :1 \
-msg timestamp=on \
-rtc base=localtime,clock=vm,driftfix=slew \
-monitor stdio \
-boot order=cdn,once=c,menu=on,strict=off \
-enable-kvm

In the guest, format the data disk, mount it to /mnt and run "dd":
# mkfs.ext4 /dev/vdb
# mount /dev/vdb /mnt
# while true; do dd if=/dev/zero of=/mnt/file2 bs=10M count=10; done

2. Launch listening mode on the dst host; the qemu command line is the same as on the src host, appending "-incoming tcp:0:5801".

3. On the src host, do postcopy migration:
(qemu) migrate_set_capability postcopy-ram on
(qemu) migrate -d tcp:10.16.67.19:5801

4. Result:
Migration completes, the VM works well, and "dd" keeps running in the guest after postcopy migration; I can also stop "dd" and write to that disk again.
src->dst: postcopy migration succeeds and the VM works well, including writing to disk;
dst->src: postcopy migration succeeds and the VM works well, including writing to disk.

So I think this bug is fixed. Kevin, do you think this verification is OK? Thanks
Comment 8 Kevin Wolf 2017-12-14 08:58:19 UTC
Don't you need a "migrate_start_postcopy" command on the source to actually switch into postcopy mode?

Dave, the important thing is that loadvm_postcopy_handle_run_bh() runs, so that we actually test commit 0042fd36. Is the "migrate_start_postcopy" necessary for that? I assume so, but you can probably say something definitive.
Comment 9 Dr. David Alan Gilbert 2017-12-14 09:43:36 UTC
Kevin is correct, that test hasn't actually done postcopy. You need to:
a) Start a heavy memory-using job in the guest, e.g. the 'stress' command
b) migrate_set_capability postcopy-ram on
c) Start the migration: migrate -d tcp:host:port
d) Do an 'info migrate'; you should see the status as 'active'
e) migrate_start_postcopy
f) Another 'info migrate'; you should see the status as 'postcopy-active'
g) You should find the destination is responsive
h) Wait until 'info migrate' returns complete

Include the output of (h) in the results; you should see a count of postcopy-requests.
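For reference, the HMP steps above have direct QMP equivalents, which is how automated tests usually drive this. A minimal sketch of the command sequence (building the JSON only; the destination URI is the one from this bug's test run, and actually sending these over the QMP socket is left out):

```python
import json

def qmp_postcopy_sequence(uri):
    """QMP equivalents of HMP steps (b), (c), (e), plus the
    'info migrate' polling from (d)/(f)/(h) above."""
    return [
        # b) migrate_set_capability postcopy-ram on
        {"execute": "migrate-set-capabilities",
         "arguments": {"capabilities": [
             {"capability": "postcopy-ram", "state": True}]}},
        # c) migrate -d tcp:host:port (QMP 'migrate' is non-blocking)
        {"execute": "migrate", "arguments": {"uri": uri}},
        # e) migrate_start_postcopy
        {"execute": "migrate-start-postcopy"},
        # d/f/h) poll query-migrate until the status goes
        # active -> postcopy-active -> completed
        {"execute": "query-migrate"},
    ]

for cmd in qmp_postcopy_sequence("tcp:10.16.67.19:5801"):
    print(json.dumps(cmd))
```

The reply to the final query-migrate carries the ram statistics, including the postcopy-requests counter that step (h) asks to record.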
Comment 10 xianwang 2017-12-14 11:02:46 UTC
(In reply to Dr. David Alan Gilbert from comment #9)
> Kevin is correct, that test hasn't actually done postcopy.
> You need to:
> a) Start a heavy memory using job in the guest, e.g. the 'stress' command
> b) migrate_set_capability postcopy-ram on
> c) Start the migrate; migrate -d tcp:host:port
> d) do an 'info migrate' you should see the status as 'active'
> e) migrate_start_postcopy
> f) another 'info migrate' you should see the status as 'postcopy-active'
> g) You should find the destination is responsive
> h) Wait until 'info migrate' returns complete
>
> Include the output of (h) in the results, you should see a count of
> postcopy-requests.

Hi Kevin and Dave, I am very sorry: I did do that as Dave said, but I just forgot to write it into the bug comment. I have done it again with "stress", and the result is also PASS, as follows:

stress in the guest:
# stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M
# while true; do dd if=/dev/zero of=/dev/vdb bs=10M count=10; done

(qemu) migrate_set_capability postcopy-ram on
(qemu) migrate -d tcp:10.16.67.19:5801
(qemu) info migrate
Migration status: active
(qemu) migrate_start_postcopy
(qemu) info migrate
Migration status: postcopy-active
(qemu) info migrate
Migration status: postcopy-active
....
(qemu) info migrate
Migration status: completed

Migration succeeds and the VM works well on the destination.
Comment 11 xianwang 2017-12-14 11:06:32 UTC
Kevin and Dave, if you two don't have other concerns and agree with this verification, I will move this bug to "verified". Thanks
Comment 12 Dr. David Alan Gilbert 2017-12-14 11:53:51 UTC
As a postcopy test I think that's fine.
Comment 14 errata-xmlrpc 2018-04-11 00:16:25 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:1104