Bug 2046606
Summary: | [RFE] Postcopy Preemption | |
---|---|---|---
Product: | Red Hat Enterprise Linux 9 | Reporter: | Peter Xu <peterx>
Component: | qemu-kvm | Assignee: | Peter Xu <peterx>
qemu-kvm sub component: | Live Migration | QA Contact: | Li Xiaohui <xiaohli>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | medium | |
Priority: | medium | CC: | chayang, coli, jinzhao, juzhang, leobras, mdean, mrezanin, nilal, peterx, quintela, virt-maint, xiaohli, ymankad
Version: | unspecified | Keywords: | FutureFeature, Triaged
Target Milestone: | rc | |
Target Release: | 9.3 | |
Hardware: | Unspecified | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | qemu-kvm-8.0.0-1.el9 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2023-11-07 08:26:38 UTC | Type: | Feature Request
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 2180898 | |
Bug Blocks: | 2138176 | |
Description
Peter Xu
2022-01-27 04:03:45 UTC
QE bot (pre-verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Li Xiaohui (comment #15):

Hi Peter, I'm trying to verify this bug and thinking about test scenarios. Please help check whether they're sufficient and correct.

1. For the postcopy-preempt feature, I read https://lore.kernel.org/qemu-devel/20220119080929.39485-1-peterx@redhat.com, and I plan to test the function that speeds up handling of postcopy page requests. If we want to see this optimization by comparing vanilla postcopy with postcopy-preempt, can you give some suggestions about the minimum CPU and memory configuration when starting the VM, how much stress load to apply, and which stress tool to use? Generally speaking, migration testing runs on common machines, so if we want to test this function for each future version, we need to know the environment setup and add detailed test steps.

Also, can I use your uffd-latency tool to compare the speedup? How do I use it, which data should I look at, and what result is expected?
https://github.com/xzpeter/small-stuffs/blob/master/tools/huge_vm/uffd-latency.bpf

2. BTW, I would test some combination scenarios with other migration features and functions:
(1) Postcopy preempt + TLS encryption
(2) Postcopy preempt + XBZRLE
(3) Postcopy preempt + multifd
(4) Postcopy preempt + zerocopy
(5) Postcopy preempt recovery after a) network failure; b) migrate-pause
(6) Postcopy preempt under a weak network/socket ordering race
(7) Postcopy preempt with Numa pinned and Hugepage pinned guest file backend
(8) Ping-pong migration with postcopy preempt

I found one issue when enabling postcopy preempt on the source host but keeping it disabled on the destination host (note that postcopy is enabled on both the src and dst hosts): when switching to postcopy mode, the migration process hangs on the src host, and the qemu process on the dst host is hung (it can't accept or execute any hmp commands). As I understand it, we need to enable postcopy preempt on both the src and dst hosts.
But if only one side enables it, we should get a migration failure: the dst qemu should automatically quit with some errors, as multifd and postcopy do. Shall we file a bug for it?

Peter Xu (comment #16, in reply to Li Xiaohui from comment #15):

> Also can I use your uffd-latency tool to compare the speedup? How can I use
> it? Which data shall I notice from it and what result is expected?
> https://github.com/xzpeter/small-stuffs/blob/master/tools/huge_vm/uffd-latency.bpf

Good question. The bpf program is probably good enough in many cases, but for an end-to-end test we may want to run a memory workload inside the guest and measure the memory latencies directly (e.g., remember timestamp1, access memory that is missing, remember timestamp2, then see how long the access took). I don't yet have a tool for that. I'm not sure whether you or anyone on the QE team would like to work on it, if any of you are interested. :) You can always ask me for review or ideas in that case. Maybe it can be a new parameter for "mig_mon mm_dirty" or a totally separate tool (I also thought about oslat, but it may not be easy to control which page of the memory set gets accessed there; the time measurement would be similar).
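The end-to-end measurement Peter describes (timestamp, touch a possibly-missing page, timestamp again, bucket the delay) can be sketched as follows. This is only an illustration of the idea, not mig_mon's actual code; the buffer here is ordinary process memory rather than guest RAM under postcopy, so locally all accesses land in the smallest buckets.

```python
# Sketch of the end-to-end latency idea: touch pages, time each access,
# and count latencies into power-of-two (us) bins, similar in spirit to
# the report format mig_mon's "-L" option produces. Illustration only.
import random
import time

PAGE = 4096

def access_latencies(buf, rounds, rng=None):
    """Touch random pages in buf; return a power-of-two latency histogram (us)."""
    rng = rng or random.Random(0)
    npages = len(buf) // PAGE
    hist = {}
    for _ in range(rounds):
        page = rng.randrange(npages)
        t0 = time.monotonic_ns()
        buf[page * PAGE] = 1          # the access that may fault
        dt_us = (time.monotonic_ns() - t0) // 1000
        # Round up to the next power-of-two bucket, minimum 1 us.
        bucket = 1
        while bucket < dt_us:
            bucket *= 2
        hist[bucket] = hist.get(bucket, 0) + 1
    return hist

if __name__ == "__main__":
    mem = bytearray(64 * 1024 * 1024)   # stand-in for guest memory
    for bucket, count in sorted(access_latencies(mem, 100000).items()):
        print(f"{bucket} (us): {count}")
```

Under postcopy, accesses to not-yet-migrated pages would show up in the large buckets; the point of postcopy-preempt is to shrink those.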
I can also try to write one myself; I've actually had it on my todo list for a long time but just didn't get the time, so it's a matter of time. Maybe we can start with the bpf program just to verify it works for RHEL binaries, and then replace it with an end-to-end tool some day?

> 2. BTW, would test some combination scenarios with other migration features
> and functions:
> (1) Postcopy preempt + TLS encryption
> (2) Postcopy preempt + XBZRLE
> (3) Postcopy preempt + multifd

This one is not yet supported; Leo is looking at it. We can ignore it for now.

> (4) Postcopy preempt + zerocopy

Same for this one, because zerocopy relies on multifd.

> (5) Postcopy preempt recovery after a) network failure; b) migrate-pause
> (6) Postcopy preempt under a weak network/socket ordering race
> (7) Postcopy preempt with Numa pinned and Hugepage pinned guest--file backend

Since postcopy still generally doesn't work well with huge pages (mostly due to slowness), we can ignore this one too, IMHO.

> (8) Ping-pong migration with postcopy preempt

The rest (1, 2, 5, 6, 8) all look sane. I'm not sure how important XBZRLE is for (2), but I guess it'll still just work.

> I found one issue when enabling postcopy preempt on the source host, but
> keeping disable on the destination host [...]
> Shall we file a bug for it?

IIUC, all migration capabilities need to be either enabled or disabled on both sides, or we don't support it. That's how libvirt uses qemu for now. So I'd suggest we just keep setting capabilities the same on src/dst when testing. Thanks.

Peter Xu (in reply to his own comment #16):

> For that I don't yet have a tool for it.

Okay, now we have the tool. I decided to write it just now, and it was even less code than I expected to extend mig_mon for it. This should make the measurements more accurate and reflect exactly what a customer will see; e.g., compared to the eBPF approach, this also includes guest page fault handling, EPT page table setup / MMU shadowing, etc.

Xiaohui, see this for how to use it. It should be extremely easy: just attach the new "-L" parameter to your old mig_mon command line, and it'll dump the report when you stop the mig_mon program (either by "kill $pid" or Ctrl-C):
https://github.com/xzpeter/mig_mon/tree/devel#memory-access-latency-measurement

I wanted to try it out myself with a real postcopy workload, but unfortunately my beaker host broke again after just a single month (Ah!!!). I requested a HW reset just now and I'll try it too after the host comes back to life.

Let me know if you have issues using it, and feel free to try running it with vanilla postcopy, then with postcopy preempt enabled, to see the latency distribution. The tool needs to be started before migration starts and stopped after postcopy completes, so it captures all memory accesses during that period.

The link should work, but I had only pushed it to the "devel" tree. Now it's on "main" too (in case a plain "git pull" wouldn't get you the code by default):
https://github.com/xzpeter/mig_mon#memory-access-latency-measurement

Li Xiaohui: Thank you very much, I will try today.

Li Xiaohui (comment #20):

Hello Peter, I tried the memory-access-latency-measurement tool on RHEL 9.3.0 (kernel-5.14.0-316.el9.x86_64 && qemu-kvm-8.0.0-4.el9.x86_64); please help check the data below.

Guest config: 20G guest, 40 vcpus
Host config: 1 Gbps and 200 Gbps host NICs attached between src/dst

Test scenarios:
1. Migrate through the 1 Gbps NIC; check memory latencies under 1) vanilla postcopy and 2) postcopy-preempt.
2. Migrate through the 200 Gbps NIC (but set max-bandwidth and max-postcopy-bandwidth to 10 Gbps before migration); check memory latencies under 1) vanilla postcopy and 2) postcopy-preempt.

Test steps:
1. Boot a VM on the src host, and a VM with '-incoming defer' on the dst host.
2. Set the relevant migration parameters and capabilities according to the test scenario.
3. Before migration, run the memory-access-latency-measurement tool in the VM:
# ./mig_mon mm_dirty -m 16G -L
4. Start the migration and switch to postcopy mode.
5. After migration completes, stop the command from step 3.

Test results, Scenario 1 memory latencies:

Bucket (us) | Vanilla postcopy | Postcopy-preempt
---|---|---
1 | 144834275 | 78181316
2 | 9189499 | 5900869
4 | 1649528 | 1041870
8 | 142315 | 13833
16 | 37116 | 17406
32 | 3369 | 1296
64 | 280 | 299
128 | 30504 | 1534
256 | 14 | 22
512 | 14 | 7
1024 | 2 | 4
2048 | 2 | 3
4096 | 0 | 42
8192 | 4 | 11
16384 to 1048576 | 0 | 0

Test results, Scenario 2 memory latencies:

Bucket (us) | Vanilla postcopy | Postcopy-preempt
---|---|---
1 | 44996027 | 44846120
2 | 7267219 | 4156281
4 | 1611035 | 789524
8 | 44922 | 13394
16 | 11033 | 9824
32 | 2547 | 2191
64 | 987 | 860
128 | 45 | 729
256 | 5 | 194
512 | 5 | 166
1024 | 0 | 1
2048 | 3 | 0
4096 to 1048576 | 0 | 0

Peter Xu (comment #21):

The result definitely wasn't expected, probably because you only tested the sequential default workload, which can match the precopy stream exactly even without postcopy requesting any page.

Why limit the bandwidth for the 200G NIC? I would suspect that even without limiting it, it won't go beyond 10 Gbps due to CPU limitations.

Could you try the same test again with "-p random" in the mig_mon parameters? Thanks!

Li Xiaohui (comment #22, in reply to Peter Xu from comment #21):

Retested scenario 2 with these changes: no bandwidth limit for either precopy or postcopy, and '-p random' added to the mig_mon command. The results:

Bucket (us) | Vanilla postcopy | Postcopy-preempt
---|---|---
1 | 14557556 | 28759372
2 | 3270733 | 4621274
4 | 1007054 | 451693
8 | 10947 | 29310
16 | 24092 | 48462
32 | 669 | 3063
64 | 67 | 1222
128 | 30 | 757
256 | 13 | 453
512 | 4 | 52
1024 | 2 | 5
2048 | 2 | 0
4096 to 1048576 | 0 | 0

BTW, the results in Comment 22 and Comment 20 were produced manually.
I don't know whether delays from the manual testing (e.g., several seconds of delay in two places: 1) starting the migration after mig_mon starts; 2) cancelling mig_mon after migration completes) would affect the memory latency data.

Peter Xu (comment #24, in reply to Li Xiaohui from comment #23):

> BTW, the results of Comment 22 and Comment 20 are tested manually. I don't
> know if some delay due to manual tests [...] would affect the data of Memory
> Latencies

It shouldn't affect much - most of the before/after-migration accesses will fall into the small buckets anyway, and we mostly care about the larger buckets. The numbers still don't look right compared with what I tested using the bpf script.

Maybe it means the latency is just as good on a 100G NIC, but that doesn't explain why random access (which should be the norm) is even faster than sequential; maybe that's caused by the bandwidth limitation.

I'll try it out on my host too when it's back. Before that, would you try the 10 Gbps NIC one more time, with the same mig_mon setting (random) but also running the bpftrace script alongside? Thanks!

Li Xiaohui:

According to Comment 15 and Comment 16, I tested the combination scenarios with other migration features and functions on RHEL 9.3.0 (kernel-5.14.0-316.el9.x86_64 && qemu-kvm-8.0.0-4.el9.x86_64):
(1) Postcopy preempt + TLS encryption -- PASS
(2) Postcopy preempt + XBZRLE -- PASS
(3) Postcopy preempt + multifd -- ERROR
(4) Postcopy preempt + zerocopy -- ERROR
(5) Postcopy preempt recovery after a) network failure -- ERROR; b) migrate-pause -- PASS
(6) Postcopy preempt under a weak network/socket ordering race -- PASS
(7) Postcopy preempt with Numa pinned and Hugepage pinned guest file backend -- PASS
(8) Ping-pong migration with postcopy preempt -- PASS

As discussed, the failure of postcopy-preempt + multifd/zerocopy is expected. When doing postcopy-preempt + multifd migration, I found two issues: 1) the expected number of multifd threads (4) can't be found on the dst host while migration is active; 2) with no workload in the VM, migration still can't converge within 600s.

The network failure during postcopy preempt recovery should be a new bug, because I tried a similar scenario under vanilla postcopy and it succeeds. Here is the new bug:
Bug 2210788 - Postcopy preempt can't recover after handling a network failure

Updating ITM from 13 to 14, as we still have some issues with migration function optimization according to Comment 22-24.

Li Xiaohui (in reply to Peter Xu from comment #24):
Here is the latest data for scenario 2 (see Comment 20), with the mig_mon "random" setting and the bpftrace script running:

Scenario 2 memory latencies (mig_mon):

Bucket (us) | Vanilla postcopy | Postcopy-preempt
---|---|---
1 | 25038788 | 23626012
2 | 3388668 | 3249279
4 | 463367 | 576874
8 | 9189 | 39737
16 | 23135 | 10055
32 | 310 | 1741
64 | 48 | 744
128 | 16 | 292
256 | 11 | 151
512 | 5 | 222
1024 | 1 | 2
2048 to 1048576 | 0 | 0

bpftrace data, vanilla postcopy (average: 8355 us):

```
@delay_us:
[4, 8)         4
[8, 16)        7
[16, 32)       3
[32, 64)       3
[64, 128)      1
[128, 256)    32
[256, 512)     9
[512, 1K)     28
[1K, 2K)       6
[2K, 4K)      16
[4K, 8K)     581
[8K, 16K)   3026
```

bpftrace data, postcopy-preempt (average: 203 us):

```
@delay_us:
[4, 8)         2
[8, 16)       13
[16, 32)      45
[32, 64)    5216
[64, 128)  13370
[128, 256) 25156
[256, 512) 12385
[512, 1K)    749
[1K, 2K)      22
[2K, 4K)      34
[4K, 8K)     121
```

Peter Xu:

Yeah, so everything seems to be alright except the new tool, which seems buggy. Would you please pull the mig_mon git repo again and retry with v0.2.3? Sorry.

For the initial runs, please run both the mig_mon tool and the bpftrace script so we can quickly spot anything wrong. Logically the results should look similar; the mig_mon tool is an end-to-end measurement, so it can be slightly slower, but in most cases it should still be at the us or tens-of-us level, I believe, for KVM to install the page tables.
The bpftrace program is still low overhead (a few us at most, I believe), so it should be good enough to trace anything we'd like to see (normally >100 us for preempt; for vanilla it should be mostly >1000 us).

Li Xiaohui:

Retried with mig_mon v0.2.3:

Scenario 2 memory latencies (mig_mon):

Bucket (us) | Vanilla postcopy | Postcopy-preempt
---|---|---
1 | 826885 | 651018
2 | 221901 | 57501
4 | 323257 | 247086
8 | 17867 | 49118
16 | 4907 | 48646
32 | 776 | 4191
64 | 130 | 141
128 | 10 | 17769
256 | 38 | 11605
512 | 2 | 4248
1024 | 17 | 249
2048 | 18 | 32
4096 | 1116 | 36
8192 | 143 | 30
16384 | 0 | 2
32768 to 1048576 | 0 | 0

bpftrace data, vanilla postcopy (average: 3234 us):

```
@delay_us:
[4, 8)        7
[8, 16)       6
[16, 32)      7
[32, 64)      0
[64, 128)    18
[128, 256)    2
[256, 512)   55
[512, 1K)    80
[1K, 2K)    107
[2K, 4K)   1440
[4K, 8K)    323
```

bpftrace data, postcopy-preempt (average: 222 us):

```
@delay_us:
[4, 8)         1
[8, 16)       13
[16, 32)      33
[32, 64)    1485
[64, 128)  20796
[128, 256) 11160
[256, 512)  6598
[512, 1K)   3947
[1K, 2K)     216
[2K, 4K)      42
[4K, 8K)     149
[8K, 16K)      1
[16K, 32K)     4
```

Peter Xu:

Yes, it looks much better now and more or less matches my expectation. There are still a few outliers I can probably look at more closely later, but they shouldn't affect the overall / average result, I believe, so I think it's fine for now. Thanks.

Li Xiaohui:

Thanks Peter. Then I will mark this bug verified. We can discuss how to add cases to the migration test plan later.
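For readers unfamiliar with the @delay_us histograms quoted above: the authoritative probe is Peter's uffd-latency.bpf script (linked earlier in this thread). As a rough, hypothetical sketch of that kind of measurement (assuming bpftrace and a kernel exposing handle_userfault; the real script may differ), the shape is:

```bpftrace
// Sketch only -- the real script is uffd-latency.bpf from the link above.
// Times each handle_userfault() call and emits a power-of-two histogram
// in microseconds, like the @delay_us output quoted in this thread.
kprobe:handle_userfault { @start[tid] = nsecs; }
kretprobe:handle_userfault /@start[tid]/ {
    @delay_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}
END { clear(@start); }
```

Run as root on the destination host during the postcopy phase; bpftrace prints the histogram when the script exits, which is why the thread's dumps appear after migration completes.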
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: qemu-kvm security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6368