Bug 1968509
| Field | Value | Field | Value |
|---|---|---|---|
| Summary | Use MSG_ZEROCOPY on QEMU Live Migration | | |
| Product | Red Hat Enterprise Linux 9 | Reporter | Leonardo Bras <leobras> |
| Component | qemu-kvm | Assignee | Leonardo Bras <leobras> |
| Sub component | Live Migration | QA Contact | Li Xiaohui <xiaohli> |
| Status | CLOSED ERRATA | Severity | unspecified |
| Priority | urgent | CC | ailan, cconte, chayang, coli, dgilbert, djdumas, fdeutsch, fjin, jinzhao, juzhang, mdean, nkoenig, peterx, virt-maint, xiaohli, yafu |
| Version | 9.0 | Keywords | RFE, Triaged |
| Target Milestone | rc | Target Release | --- |
| Hardware | Unspecified | OS | Unspecified |
| Fixed In Version | qemu-kvm-7.0.0-8.el9 | Doc Type | If docs needed, set a value |
| Last Closed | 2022-11-15 09:53:23 UTC | Type | Bug |
| Regression | --- | Bug Blocks | 2089431, 2092752 |
**Description** (Leonardo Bras, 2021-06-07 13:18:40 UTC)
I have created a simple C application to use / test this feature:
- It sends / receives 8 GB in 64 KiB packets over TCP.
- One single-threaded process sends; another single-threaded process receives.
- It can run in both MSG_ZEROCOPY and default mode.

I ran this application with my laptop as both server and client, and got the following 'time' results for the sending part:

# Default mode:
bash -c "time ./test 1"
Sent : 8589934592
real    0m2.469s
user    0m0.009s
sys     0m1.856s

# With MSG_ZEROCOPY:
bash -c "time ./test 0"
Sent : 8589934592
real    0m3.179s
user    0m0.006s
sys     0m0.523s

Looking at the results, real time rose by ~28% with MSG_ZEROCOPY, even though user time dropped by 33% and system time dropped to less than a third (~28%) of the default copying behavior. If I understand this correctly, far fewer system resources are being used (the buffer copy is avoided), but a lot of time is spent on sleeping locks, so maybe we can improve the real time by doing the send()ing from more threads (as more compute time is available).

Next steps:
- Understand how QEMU uses the socket interface in live migration.
- Try to change it to use MSG_ZEROCOPY.
- Check whether there is any performance improvement.
- Try to improve on it.

I have sent a v1 upstream: http://patchwork.ozlabs.org/project/qemu-devel/list/?series=260325&archive=both&state=*

With a synthetic load, I measured a 15x CPU usage reduction in __sys_sendmsg() and an overall 13-18% reduction in migration time. The observed improvement should be bigger on heavily loaded hosts. It may be worth trying this with a real workload to get more useful numbers.

Move RHEL-AV bugs to RHEL9. If it is necessary to resolve this in RHEL8, clone it to the current RHEL8 release.
v9 upstream patch sent: https://patchwork.kernel.org/project/qemu-devel/list/?series=635469&state=%2A&archive=both
v10 upstream patch sent: https://patchwork.kernel.org/project/qemu-devel/list/?series=635946&state=%2A&archive=both
v11 upstream patch: https://patchwork.kernel.org/project/qemu-devel/list/?series=638427&state=%2A&archive=both
v12 upstream patch: https://patchwork.kernel.org/project/qemu-devel/list/?series=639296&archive=both&state=*
v13 upstream patch: https://patchwork.kernel.org/project/qemu-devel/list/?series=641244&archive=both&state=*

Merged upstream in 54b592c427ca25751870.

Merged on qemu/master under commit 5b1d9bab2d. Working on a brew build for testing.

(In reply to Dr. David Alan Gilbert from comment #7)
> Merged upstream in 54b592c427ca25751870

Sorry, I did not see your message. That's correct: the merge commit is 54b592c427ca25751870, and the commit for my patch 8/8 is 5b1d9bab2d.

Hi Leonardo, thanks for the update of DTM and ITM. Just a reminder that DTM 18 is July 4; can we get qemu-kvm available downstream before that? QE also needs at least one week to test this new feature. I would suggest moving ITR from 9.1.0 to 9.2.0 if we still can't make qemu-kvm available downstream before DTM 18/19. Thanks for your understanding!

QE bot (pre-verify): set 'Verified:Tested,SanityOnly' as gating/tier1 tests pass.

Hi Leonardo, I have some questions about testing zero-copy-send; can you help answer them?
1. If we want to test the zero-copy-send capability, we only need to enable it on the source host, not on the destination host, right?
2.
When I enable the zero-copy-send capability, qemu gives the following prompt:
{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "zero-copy-send", "state": true}]}, "id": "bTWGP5i2"}
{"id": "bTWGP5i2", "error": {"class": "GenericError", "desc": "Zero copy only available for non-compressed non-TLS multifd migration"}}

If I understand right, we can only enable zero-copy-send when multifd is enabled, and TLS and compression are not supported with zero-copy-send enabled, right?

So for zero-copy-send testing, we can now test these combined migration scenarios: 1) multifd with zero-copy-send enabled; 2) multifd + auto-converge with zero-copy-send enabled.

(In reply to Li Xiaohui from comment #15)
> 1. If we want to test the zero-copy-send capability, we only need to enable it on the source host, not on the destination host, right?

It only needs to be enabled on the source/sending host.

> 2. When I enable the zero-copy-send capability, qemu gives the following prompt:
> {"id": "bTWGP5i2", "error": {"class": "GenericError", "desc": "Zero copy only available for non-compressed non-TLS multifd migration"}}
> If I understand right, we can only enable zero-copy-send when multifd is enabled, and TLS and compression are not supported with zero-copy-send enabled, right?

Correct. For now zero-copy-send only works with multifd, without TLS or compression.

> So for zero-copy-send testing, we can now test: 1) multifd with zero-copy-send enabled; 2) multifd + auto-converge with zero-copy-send enabled.

Test (1) is the scenario I have been testing for a while now, so it should be fine.

TBH, I haven't worked with auto-converge yet, so what I say here is based on a quick look I took at the code: auto-converge appears to only throttle the vcpus, making them run slower. As this changes nothing in the sending mechanism, test (2) should also be fine. If there is anything else I can help you with, please let me know.

Thanks, Leonardo. Today I tested some parts of zerocopy and recorded the results in the zerocopy test doc below; can you help confirm them? I have added a needinfo for you.
https://docs.google.com/document/d/1AINki8qsX3WX7YDam1W-NwH9J9ySBCKZnuXr6buXheY

I replied to every question there. Please let me know if anything else is needed.

Hi Camilla, can you help move the bug to ON_QA, since gating/tier1 tests pass for the fixed qemu-kvm according to Comment 14?

Hi Leonardo, when I migrate with multifd + zerocopy + xbzrle, migration succeeds. I thought xbzrle was a type of compression, so it should fail when zerocopy is enabled with xbzrle on. Can you explain which compression methods zerocopy doesn't support? Please list all the unsupported compression methods here if possible. Thanks in advance.

Verified this bug on hosts (host info: kernel-5.14.0-121.el9.x86_64 & qemu-kvm-7.0.0-8.el9.x86_64).

Test scenarios (refer to doc: https://docs.google.com/document/d/1AINki8qsX3WX7YDam1W-NwH9J9ySBCKZnuXr6buXheY)

Reported bugs:
Bug 2107466 - zerocopy capability can be enabled when set migrate capabilities with multifd and compress together
Bug 2106726 - Qemu on destination host crashed if migrate with postcopy and multifd enabled

Problems that need to be confirmed:
1. When trying to migrate with tls + multifd + zerocopy (note: tls certs have been set up on the source and destination hosts):
a. If we set tls creds on the src and dst hosts first, then enable multifd and zerocopy, we get an error prompt, which is the expectation:
{"execute": "migrate-set-parameters", "arguments": {"tls-creds": "tls0"}, "id": "hdTiagq5"}
...
{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "zero-copy-send", "state": true}]}, "id": "wgr8gT3T"}
{"id": "wgr8gT3T", "error": {"class": "GenericError", "desc": "Zero copy only available for non-compressed non-TLS multifd migration"}}
b. But if we enable multifd and zerocopy first, then set tls creds, everything succeeds; when migration starts, however, it fails like this:
{"execute": "query-migrate", "id": "qUqoTVuL"}
{"return": {"status": "failed", "error-desc": "Requested Zero Copy feature is not available: Invalid argument"}, "id": "qUqoTVuL"}

My question: for situation b, should we prevent setting tls creds when zerocopy is enabled, or give a more accurate error prompt than the error-desc above, such as "zerocopy is enabled, but tls migration is not supported under zerocopy"?

2. Same question as Comment 22: though zerocopy doesn't support compression, we can still migrate successfully with multifd + zerocopy + xbzrle and with multifd + zerocopy + multi-thread-compression. Please give some explanation of why these succeed, and please list all unsupported compression methods.

Delaying ITM from 20 to 21, as I haven't received the dev's reply to Comment 23.

(In reply to Li Xiaohui from comment #22)
> Hi Leonardo, when I migrate with multifd + zerocopy + xbzrle, migration succeeds. I thought xbzrle was a type of compression, so it should fail when zerocopy is enabled with xbzrle on.
> Can you explain which compression methods zerocopy doesn't support? Please list all the unsupported compression methods here if possible. Thanks in advance.

zero-copy does not support any kind of compression.
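Pulling the constraints discussed in this thread together, a valid zero-copy setup on the source host looks like the following QMP sequence. This mirrors the commands quoted in the bug; the multifd-channels value of 8 and the `<destination>`/`<port>` placeholders are arbitrary examples, not values taken from the test environment:

```json
{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [
    {"capability": "multifd", "state": true},
    {"capability": "zero-copy-send", "state": true}]}}

{"execute": "migrate-set-parameters", "arguments": {"multifd-channels": 8}}

{"execute": "migrate", "arguments": {"uri": "tcp:<destination>:<port>"}}
```

Per the discussion above, this only works if tls-creds is left unset and no compression is enabled, and zero-copy-send is needed only on the source side.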
(In reply to Li Xiaohui from comment #23)
> Reported bugs:
> Bug 2107466 - zerocopy capability can be enabled when set migrate capabilities with multifd and compress together

That's a bug. I have already proposed a solution for it; I just need testing to validate the solution.

> Bug 2106726 - Qemu on destination host crashed if migrate with postcopy and multifd enabled

I don't quite see how this one is related to MSG_ZEROCOPY. Could you please elaborate?

> My question: for situation b, should we prevent setting tls creds when zerocopy is enabled, or give a more accurate error prompt than the error-desc above, such as "zerocopy is enabled, but tls migration is not supported under zerocopy"?

TBH, I don't quite understand whether there is a recommended order of operations in migration, but having it fail in different spots depending on the order of commands does not seem like a healthy interface. I have an idea of why this currently works the way it does, and I will provide a fix soon.

> 2. Same question as Comment 22: though zerocopy doesn't support compression, we can still migrate successfully with multifd + zerocopy + xbzrle and with multifd + zerocopy + multi-thread-compression. Please give some explanation of why these succeed, and please list all unsupported compression methods.

No compression is supported with zero-copy as of today. If this currently migrates, or produces an error output during migration, please let me know so I can fix it.

IIUC, the parameter/capability combinations you mention that do not print an error are currently:
1- multifd + zero-copy + compression (bz2107466)
2- multifd + zero-copy + tls (comment#23)
3- multifd + zero-copy + xbzrle
4- multifd + zero-copy + multi-thread-compression
Is that right?

By multi-thread-compression, do you mean multifd + zlib and multifd + zstd?

And none of the above fail before migration, regardless of the order of commands — is that correct?

Li Xiaohui, I think I have a brew build that fixes all the above issues. Could you please help me by testing whether it reproduces any of them in your scenario?
https://kojihub.stream.rdu2.redhat.com/koji/taskinfo?taskID=1300898

(In reply to Leonardo Bras from comment #25)
> > Can you explain which compression methods zerocopy doesn't support?
> zero-copy does not support any kind of compression.
> > Bug 2106726 - Qemu on destination host crashed if migrate with postcopy and multifd enabled
> I don't quite see how this one is related to MSG_ZEROCOPY. Could you please elaborate?

It is not directly related to zerocopy. While testing zerocopy I also covered the multifd + zerocopy + postcopy scenario, found this bug, and recorded it here.

> IIUC, the parameter/capability combinations you mention that do not print an error are currently:
> 1- multifd + zero-copy + compression (bz2107466)
> 2- multifd + zero-copy + tls (comment#23)
> 3- multifd + zero-copy + xbzrle
> 4- multifd + zero-copy + multi-thread-compression
> Is that right?

Basically correct. Note: by "compress" here I mean multi-thread-compression, so 1 is the same as 4. When we use multi-thread-compression, we just need to enable the compress capability.

> By multi-thread-compression, do you mean multifd + zlib and multifd + zstd?

multi-thread-compression has been deprecated. It does not include multifd.
> And none of the above fail before migration, regardless of the order of commands — is that correct?

Not completely right. xbzrle is an exception: regardless of the order, enabling it always succeeds.

I see. Thanks for the info! The brew build I sent in Comment #26 should fix all those errors. Please let me know if any bug remains in that build.

Thanks, Leonardo. I will mark this bug verified per Comment 23 and Comment 25. The issues I found can be tracked in the other bugs (I will provide the test results of the scratch build from Comment 26 in those bugs too). I will add test cases to the migration test plan later, so I am removing "SanityOnly" from Verified first.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: qemu-kvm security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7967