Description of problem:
Migration fails when the "compress" capability is enabled.

Version-Release number of selected component (if applicable):
Host:
qemu-kvm-3.1.0-3.module+el8+2614+d714d2bb.x86_64
kernel-4.18.0-57.el8.x86_64
seabios-1.11.1-3.module+el8+2603+0a5231c4.x86_64
Guest: rhel8

How reproducible:
5/5

Steps to Reproduce:
1. Boot a guest on the src end.
2. Boot an incoming guest on the dst end.
3. Enable compression on both the src and dst ends:
(qemu) migrate_set_capability compress on
4. In the guest, run "stress":
# stress --cpu 1 --io 1 --vm 4 --vm-bytes 128M
5. Start the migration (a QMP equivalent is sketched after this description):
(qemu) migrate -d tcp:10.73.72.88:1234

Actual results:
Migration fails and qemu quits on the dst end.

On src, migration failed:
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: on events: off postcopy-ram: off x-colo: off release-ram: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off
Migration status: failed
total time: 0 milliseconds

On dst, qemu quit:
(qemu) qemu-kvm: decompress data failed
qemu-kvm: error while loading state section id 1(ram)
qemu-kvm: load of migration failed: Operation not permitted

Expected results:
Migration completes and the vm works well.

Additional info:
(1) With compress off, I can't reproduce this issue; the migration status stays active and the migration never completes, but the vm keeps running well on the source end.
(2) The guest is booted with the following command:
/usr/libexec/qemu-kvm \
-M q35,accel=kvm,kernel-irqchip=split \
-device intel-iommu,intremap=on \
-cpu Haswell-noTSX,enforce \
-nodefaults -rtc base=utc \
-name debug-threads=on \
-m 8G \
-smp 4,sockets=4,cores=1,threads=1 \
-enable-kvm \
-uuid 990ea161-6b67-47b2-b803-19fb01d30d12 \
-k en-us \
-nodefaults \
-boot menu=on \
-qmp tcp:0:6667,server,nowait \
-vga qxl \
-device pcie-root-port,bus=pcie.0,id=root0,slot=1 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=root0 \
-blockdev driver=qcow2,cache.direct=off,cache.no-flush=on,file.filename=/mnt/rhel80-64-virtio-scsi.qcow2,node-name=my_disk,file.driver=file \
-device scsi-hd,drive=my_disk,bus=virtio_scsi_pci0.0 \
-device pcie-root-port,bus=pcie.0,id=root1,slot=2 \
-device virtio-net-pci,netdev=tap10,mac=9a:6a:6b:6c:6d:6e,bus=root1 -netdev tap,id=tap10 \
-device pcie-root-port,bus=pcie.0,id=root2,slot=3 \
-device e1000e,netdev=tap11,mac=9a:6a:6b:6c:6d:6a,bus=root2 -netdev tap,id=tap11 \
-device pcie-root-port,bus=pcie.0,id=root3,slot=4 \
-blockdev driver=file,cache.direct=off,cache.no-flush=on,filename=/mnt/block.qcow2,node-name=virtio_block \
-device virtio-blk-pci,drive=virtio_block,bus=root3 \
-device pcie-root-port,bus=pcie.0,id=root4,slot=5 \
-device nec-usb-xhci,id=usb1,bus=root4 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-monitor stdio \
-vnc :1 \
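For reference, the same steps can also be driven over QMP instead of the HMP monitor. The following is only an illustrative Python sketch (not part of the original report): it assumes the QMP socket opened by "-qmp tcp:0:6667,server,nowait" in the command line above and the destination URI from step 5; as in step 3, the compress capability has to be enabled on the destination qemu as well.

#!/usr/bin/env python3
# Illustrative only: enable the "compress" capability and start the migration
# over QMP (the equivalents of HMP steps 3 and 5). Host/port are assumed from
# "-qmp tcp:0:6667,server,nowait"; the URI is the one used in the report.
import json
import socket

QMP_HOST, QMP_PORT = "127.0.0.1", 6667
DEST_URI = "tcp:10.73.72.88:1234"

def qmp(f, execute, arguments=None):
    # Send one QMP command and return its reply, skipping asynchronous events.
    msg = {"execute": execute}
    if arguments is not None:
        msg["arguments"] = arguments
    f.write(json.dumps(msg) + "\n")
    f.flush()
    while True:
        reply = json.loads(f.readline())
        if "return" in reply or "error" in reply:
            return reply

with socket.create_connection((QMP_HOST, QMP_PORT)) as sock:
    f = sock.makefile("rw")
    json.loads(f.readline())            # consume the QMP greeting banner
    qmp(f, "qmp_capabilities")          # leave capabilities-negotiation mode
    qmp(f, "migrate-set-capabilities",
        {"capabilities": [{"capability": "compress", "state": True}]})
    qmp(f, "migrate", {"uri": DEST_URI})
    print(qmp(f, "query-migrate"))      # status shows "failed" when the bug hits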
Hi Juan,
I can sometimes reproduce this bz on rhel8.0.1 and rhel8.1.0 hosts, though not always. Do I need to clone this bz for rhel8.0.1 or rhel8.1.0? This bz is reported against qemu-kvm, and the qemu-kvm versions on rhel8.0, rhel8.0.1 and rhel8.1.0 are different.
Best regards,
Li Xiaohui
Hi,
multifd + compress don't work together. I posted upstream a new way to do compression on top of multifd. I will improve the error message.
Hi, I misread the previous commit. This has nothing to do with multifd; I am investigating what happens with compression.
Hi all,
I can sometimes reproduce this bz on rhel8.1-av (kernel-4.18.0-129.el8.x86_64 & qemu-img-4.1.0-1.module+el8.1.0+3966+4a23dca1.x86_64); the guest is kernel-4.18.0-130.el8.x86_64. Thanks.
Hi,
Compression is really difficult to support, and as far as we know we don't use it. The current implementation is only useful if you are migrating over a really, really slow link; otherwise the amount of traffic that is saved is small, and the amount of CPU needed to make it work doesn't help here.
Notice that the compression RHV uses is "XBZRLE" (the xbzrle capability in "info migrate"). The compress capability is a completely different beast, based on zlib, which we don't support.
There are two compression methods in qemu:
- xbzrle
- zlib (this one is older, so it got the "compression" name)

With xbzrle (the one we support in RHV), we keep a big cache of memory and save a copy of (some of) the transferred pages. If a page gets dirty again, we just send its difference against the previously sent copy, so we transmit fewer bits.

With the zlib compression (which we don't support), we copy the memory somewhere else, start a thread to do the compression (a slow operation in itself), and copy the result back to the main thread. This is really very slow. It was introduced because at some point there were going to be Intel processors able to do this compression fast, but they never appeared. So we don't support it; we know it is very slow in its current incarnation, and that is why we don't support it.

There are patches posted on the qemu list, to be integrated in upstream qemu, that use zlib (and zstd) on top of multifd; they are faster and we can support them. But that is for future versions.

One line summary: we don't support zlib compression because we know that it is not reliable.
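To make the contrast concrete, here is a toy Python sketch of the two ideas. It is illustrative only and is not QEMU's actual wire format or implementation (the real code lives under migration/ in the qemu tree): it compares a cached-page delta in the spirit of xbzrle with straight zlib compression of a whole dirty page.

import zlib

PAGE_SIZE = 4096

def xbzrle_like_delta(old_page, new_page):
    # Toy delta in the spirit of xbzrle: record only the byte runs that
    # changed against the cached copy, as (offset, changed_bytes) pairs.
    delta, i = [], 0
    while i < len(new_page):
        if new_page[i] != old_page[i]:
            start = i
            while i < len(new_page) and new_page[i] != old_page[i]:
                i += 1
            delta.append((start, new_page[start:i]))
        else:
            i += 1
    return delta

# A page we already sent (and cached), plus a newly dirtied version of it
# that differs in only a few bytes.
cached = bytes(PAGE_SIZE)
dirty = bytearray(cached)
dirty[100:108] = b"\xff" * 8
dirty[2000:2004] = b"\xaa" * 4
dirty = bytes(dirty)

# xbzrle-style: send just the changed runs instead of the whole page.
delta = xbzrle_like_delta(cached, dirty)
print("xbzrle-like delta:", sum(len(d) for _, d in delta), "bytes instead of", PAGE_SIZE)

# compress capability: every page goes through zlib, which QEMU runs in
# separate compression threads on the source (and decompression threads
# on the destination, where this bug reports "decompress data failed").
compressed = zlib.compress(dirty)
print("zlib-compressed page:", len(compressed), "bytes")
assert zlib.decompress(compressed) == dirty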
Hi Juan & Hai,
I can reproduce this bz on the latest rhel8.1.1-av test. From Juan's comments above, if zlib compression isn't supported, could it be disabled altogether? Then QE won't test and trace related problems. Thanks.
Hi

We will do it upstream. But not for 8.1.1.

Later, Juan.
(In reply to Juan Quintela from comment #9)
> Hi
>
> We will do it upstream. But not for 8.1.1.
>
Will it be for rhel8.2.0?
> Later, Juan.
Yes.
QEMU has been recently split into sub-components and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks
(In reply to Li Xiaohui from comment #10)
> (In reply to Juan Quintela from comment #9)
> > Hi
> >
> > We will do it upstream. But not for 8.1.1.
> >
> Will it be for rhel8.2.0?

Has this upstream work been included in 8.2 or 8.3? If yes, could we move this bug forward now?
No product uses compression on RHEL. There is no solution upstream; as said, we have a compression solution on top of multifd that is easier to maintain and much faster. So we postpone it.
I will try at the beginning of January 2021, since I have recently been busy with other things and will be on PTO next week.
Hi Amnon,
I have tested this issue on RHEL-8.4.0-AV (kernel-4.18.0-262.el8.dt3.x86_64 & qemu-img-5.2.0-2.module+el8.4.0+9186+ec44380f.x86_64) and can still reproduce it (though not always). If we plan to close it as WONTFIX, could you or Juan confirm to QE that QE needn't test multi-thread-compression anymore and needn't track related bzs? Thank you.
Sorry, thank you Ariel.
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.
Closing this bz as WONTFIX, since multi-thread-compression has been dropped from the migration test plan.