Description of problem:
When using multifd and the network is slow or saturated, the migration thread busy-waits for a channel to become ready.

Version-Release number of selected component (if applicable):
All qemu-kvm versions with multifd enabled.

How reproducible:
100%. The network needs to be very slow.

Steps to Reproduce:
1. Configure a slow network.
2. Configure the number of multifd channels to a big number.
3. Set the migration bandwidth to a very small number.

Actual results:
The migration thread busy-waits for a channel to become ready.

Expected results:
The migration thread waits on a semaphore for a channel to become ready, without wasting CPU.

Additional info:
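For reference, a minimal sketch of how steps 2 and 3 could be done from the HMP monitor. The capability and parameter names used here (multifd, multifd-channels, max-bandwidth) are assumptions based on recent QEMU and should be checked against the installed version; the multifd capability has to be enabled on both the source and the destination:

(qemu) migrate_set_capability multifd on          <- run on both src and dst
(qemu) migrate_set_parameter multifd-channels 16  <- a big number of channels
(qemu) migrate_set_parameter max-bandwidth 1M     <- throttle hard to mimic a slow network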
There is an up-to-date patchset to fix this issue: https://lists.gnu.org/archive/html/qemu-devel/2023-04/msg04562.html
Upstream commit:
commit d2026ee117147893f8d80f060cede6d872ecbd7f
Author: Juan Quintela <quintela>
Date:   Wed Apr 26 12:20:36 2023 +0200

    multifd: Fix the number of channels ready
Discussed the reproduction steps with Juan over gchat:
1. Set a small migration bandwidth, e.g. 1MB/s;
2. Set very few multifd channels (1-2);
Before the fix, the main migration thread is busy waiting, i.e. CPU = 100%. After the fix, the CPU usage of the main migration thread should be small. I will test following the above steps before and after the fix. Thank you Juan.
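To observe this, per-thread CPU usage can be watched on the source host while the migration is running; a sketch, assuming the QEMU process is named qemu-kvm (live_migration and multifdsend_* are the names QEMU gives these threads):

# top -H -p $(pidof qemu-kvm)

Before the fix the live_migration thread should sit near 100% CPU; after the fix it should stay low.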
Hi Leonardo, what's our fix plan for this bug? I see the ITR is set to RHEL 9.3.0. Can you help set a proper DTM?
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.
Extending ITM to 20 as the reproduction steps are not clear; I need more time to test and to get confirmation from Juan / Leonardo.
Hi all, I did some tests on qemu-kvm-8.0.0-1.el9.x86_64 and qemu-kvm-8.0.0-7.el9.x86_64.

Test steps:
1. Enable the multifd capability and set the multifd channels to 1 on the src and dst hosts;
2. Set 0 (no limitation) for the migration bandwidth;
3. Run stressapptest in the VM;
# stressapptest -M 10000 -s 1000000
4. Start to migrate the VM from the src host to the dst host.
Note: the NIC supports 200G bandwidth.

Before the fix (qemu-kvm-8.0.0-1.el9.x86_64), the main migration thread on the src host is busy: the live_migration thread is at 85.0% CPU and the multifdsend thread at 17.9%; after 1 second the live_migration thread drops to 18.0% and the multifdsend thread to 4.0%.

After the fix (qemu-kvm-8.0.0-7.el9.x86_64), the CPU usage of the main migration thread is small: the live_migration thread is at 8.3% and the multifdsend thread at 9.7%.

I also tested the above scenario (but with the multifd channels set to 10) on qemu-kvm-8.0.0-7.el9.x86_64; the CPU usage is as follows: the live_migration thread is at 9.3%, 4 multifdsend threads are at 2.3%, 3 multifdsend threads at 2.0%, and 3 multifdsend threads at 1.7%.

Per the above test results, I think we can mark this bug verified. Juan, Leonardo, what do you think?
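If the per-thread CPU numbers need to be recorded over time rather than read off top, pidstat from the sysstat package is an alternative; a sketch, assuming a single qemu-kvm process on the host:

# pidstat -t -p $(pidof qemu-kvm) 1

The -t option lists the individual threads, so the live_migration thread and the multifdsend_* threads each get their own %CPU line, sampled every second.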
It looks correct. Thanks very much.
Thanks for the review. Marking the bug verified per Comment 11 and Comment 12. I will add a case later to monitor the CPU usage of the live_migration thread and the multifdsend threads.