Bug 2196295 - Multifd flushes its channels 10 times per second [NEEDINFO]
Summary: Multifd flushes its channels 10 times per second
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.3
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
: ---
Assignee: Juan Quintela
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-05-08 16:18 UTC by Juan Quintela
Modified: 2023-08-10 09:42 UTC (History)
9 users

Fixed In Version: qemu-kvm-8.0.0-9.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Target Upstream Version:
Embargoed:
xiaohli: needinfo? (jdenemar)
xiaohli: needinfo? (quintela)
xiaohli: needinfo? (quintela)




Links
System ID Private Priority Status Summary Last Updated
Gitlab redhat/centos-stream/src qemu-kvm merge_requests 186 0 None opened Multifd flushes its channels 10 times per second 2023-07-18 08:02:03 UTC
Red Hat Issue Tracker RHELPLAN-156631 0 None None None 2023-05-08 16:21:11 UTC

Description Juan Quintela 2023-05-08 16:18:58 UTC
Description of problem:

Multifd sync optimization


Version-Release number of selected component (if applicable):

9.3

How reproducible:

It is an optimization; released versions always ship with the unoptimized ("pessimized") behavior.

Steps to Reproduce:
1. Do a normal migration with multifd enabled


Actual results:

We synchronize every channel 10 times per second (once for each RAM section)

Expected results:

We synchronize once per full pass through all of guest memory (i.e., every several seconds/minutes depending on guest RAM size and network speed)
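The gap between the actual and expected behavior can be sketched with a back-of-the-envelope calculation. The guest size and bandwidth figures below are illustrative assumptions, not measurements from this bug:

```python
# Back-of-the-envelope sketch of how often multifd channels are flushed
# before vs. after the fix.  All figures are illustrative assumptions.

GIB = 1024 ** 3  # bytes per GiB


def pass_seconds(guest_ram_bytes: int, bandwidth_bytes_per_s: int) -> float:
    """Time for one full iteration over guest RAM at a given effective bandwidth."""
    return guest_ram_bytes / bandwidth_bytes_per_s


guest_ram = 1024 * GIB   # assume a 1 TiB guest
bandwidth = 10 * GIB     # assume ~10 GiB/s of effective migration throughput

one_pass = pass_seconds(guest_ram, bandwidth)  # 102.4 seconds per pass

# Old behavior: every channel is flushed ~10 times per second
# (once per RAM section), regardless of progress through guest memory.
flushes_old = int(one_pass * 10)

# New behavior: channels are flushed once per full pass over guest memory.
flushes_new = 1

print(f"one pass: {one_pass:.1f}s, old: {flushes_old} flushes, new: {flushes_new}")
```

Under these assumptions the old behavior performs on the order of a thousand channel flushes per pass over RAM, versus a single flush with the fix.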

Additional info:

Comment 1 Juan Quintela 2023-05-08 16:19:36 UTC
This is the upstream patchset that implements it:

https://lists.gnu.org/archive/html/qemu-devel/2023-04/msg01488.html

Comment 2 Juan Quintela 2023-05-08 16:24:38 UTC
Commit IDs:

commit 294e5a4034e81b3d8db03b4e0f691386f20d6ed3
Author: Juan Quintela <quintela>
Date:   Tue Jun 21 13:36:11 2022 +0200

    multifd: Only flush once each full round of memory
    
commit b05292c237030343516d073b1a1e5f49ffc017a8
Author: Juan Quintela <quintela>
Date:   Tue Jun 21 12:21:32 2022 +0200

    multifd: Protect multifd_send_sync_main() calls
    
commit 77c259a4cb1c9799754b48f570301ebf1de5ded8
Author: Juan Quintela <quintela>
Date:   Tue Jun 21 12:13:14 2022 +0200

    multifd: Create property multifd-flush-after-each-section

Comment 5 Li Xiaohui 2023-06-19 03:05:42 UTC
Hi Juan,
Can you check whether the fix has landed downstream? If not, when will it land? Please also reset a proper DTM, since the current one has passed.

Comment 7 Juan Quintela 2023-06-27 09:47:41 UTC
Not downstream yet; I am going to post it this week, sorry.

Comment 8 Li Xiaohui 2023-07-04 10:28:20 UTC
Hi Juan, I have added the RFE keyword to this bug, since it is an optimization per your description.

Please correct me if that is not right.

Comment 10 Li Xiaohui 2023-07-06 06:51:01 UTC
Adding a needinfo for Nitesh so we don't miss the ITM and release+ per Comment 9, since I don't know when Juan will be back.

Comment 11 Nitesh Narayan Lal 2023-07-06 10:55:30 UTC
(In reply to Li Xiaohui from comment #10)
> Adding a needinfo for Nitesh so we don't miss the ITM and release+ per
> Comment 9, since I don't know when Juan will be back.

I think we should wait for an update till next week.
If we don't have any updates by then, we can discuss whether it's still possible to get this into 9.3 or whether we want to move it to 9.4.
What do you think?

Comment 12 Li Xiaohui 2023-07-06 11:24:28 UTC
(In reply to Nitesh Narayan Lal from comment #11)
> (In reply to Li Xiaohui from comment #10)
> > Adding a needinfo for Nitesh so we don't miss the ITM and release+ per
> > Comment 9, since I don't know when Juan will be back.
> 
> I think we should wait for an update till next week.
> If we don't have any updates by then, we can discuss whether it's still
> possible to get this into 9.3 or whether we want to move it to 9.4.
> What do you think?

No problem

Comment 14 Juan Quintela 2023-07-17 12:27:29 UTC
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=53970406

Here is the brew build.  I think it is done.

Comment 15 Juan Quintela 2023-07-17 14:10:02 UTC
Xiaohui, could you test, please?

Comment 17 Li Xiaohui 2023-07-18 08:18:15 UTC
(In reply to Juan Quintela from comment #15)
> Xiaohui, could you test, please?

Of course. I will provide the test results after testing.


Juan, can you give guidance on how to test the fix?
What should we expect to see after the fix, and what would we have seen before it?

Comment 18 Juan Quintela 2023-07-18 12:00:39 UTC
Hi Xiaohui

If you are using multiple channels (say 16), and especially if the network is fast and you have lots of memory, you should see that not all of the available network bandwidth is being used.

Right now, we flush and synchronize all channels 10 times a second.  With the change we will flush only once per iteration over RAM (for a 1 TB guest, probably every minute or so).

Zero copy should be affected more than normal multifd.

If this is enough to show a difference, great.  Otherwise let me know and I will try to come up with a better test.

Thanks, Juan.

Comment 20 Yanan Fu 2023-07-25 05:34:08 UTC
QE bot(pre verify): Set 'Verified:Tested,SanityOnly' as gating/tier1 test pass.

Comment 24 Juan Quintela 2023-07-26 14:34:43 UTC
Hi
you need to add "-global migration.multifd-flush-after-each-section=off"
to the command line.

Since we decided not to add a 9.3 machine type, we can't enable it by default.

property=off -> new behaviour
property=on -> old behaviour

Comment 25 Li Xiaohui 2023-07-27 07:52:52 UTC
Hi Juan,

(In reply to Juan Quintela from comment #24)
> Hi
> you need to add "-global migration.multifd-flush-after-each-section=off"
> to the command line.
> 
> Since we decided not to add a 9.3 machine type, we can't enable it by
> default.

So you mean the rhel9.3 machine type will set multifd-flush-after-each-section to off by default in the future?
But right now we don't have the rhel9.3 machine type on RHEL 9.3 hosts; the highest machine type is rhel9.2.
multifd-flush-after-each-section is on by default with the rhel9.2 machine type, so we need to set it to off explicitly, right?

Note: by 'by default' I mean we don't need to add the qemu cmd 'multifd-flush-after-each-section=xx'.


What's more, will libvirt support multifd-flush-after-each-section=off on RHEL 9.3? I want to know when libvirt will support this optimization.


> 
> property=off -> new behaviour
> property=on -> old behaviour

Comment 26 Li Xiaohui 2023-07-27 10:56:53 UTC
Hi Juan, 

I have tried to do some tests on qemu-kvm-8.0.0-7.el9.x86_64 && qemu-kvm-8.0.0-9.el9.x86_64

Hosts info: 256 cpus and 1510G memory, nic speed between src and dst host is 200G
Guest info: VM with 400G memory and 64 cpus 
In VM, use stressapptest to load some memory workload: # stressapptest -M 20000 -s 1000000

QMP cmds: enable the multifd capability and set multifd channels to 20. Also, set max-bandwidth to 0 on the src host
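For reference, the setup above roughly corresponds to the following QMP commands (a sketch: the channel and bandwidth values mirror the description; the actual test harness may differ):

```json
{ "execute": "migrate-set-capabilities",
  "arguments": { "capabilities": [ { "capability": "multifd", "state": true } ] } }

{ "execute": "migrate-set-parameters",
  "arguments": { "multifd-channels": 20, "max-bandwidth": 0 } }
```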

1. On the fixed version -> qemu-kvm-8.0.0-9.el9.x86_64
(1) with '-global migration.multifd-flush-after-each-section=off' qemu cmd:
I could see migration finish quickly (total time: 51701 ms)
(2) with '-global migration.multifd-flush-after-each-section=on' qemu cmd:
Migration took much longer to finish than with multifd-flush-after-each-section=off -> total time: 305990 ms

Comparing throughput while migration is active, I can see that (1) utilizes bandwidth better than (2). But (1) doesn't use all of the available network bandwidth (200G); the max throughput is nearly 70G


2. On the unfixed version -> qemu-kvm-8.0.0-7.el9.x86_64
(1) When I try to boot the VM with -global migration.multifd-flush-after-each-section=xx, qemu reports:
(qemu) 2023-07-27T10:11:19.936315Z qemu-kvm: can't apply global migration.multifd-flush-after-each-section=on: Property 'migration.multifd-flush-after-each-section' not found

So the parameter migration.multifd-flush-after-each-section is new, introduced in qemu-kvm-8.0.0-9.el9?

(2) Without -global migration.multifd-flush-after-each-section=xx, the migration total time and bandwidth utilization are similar to the result of '-global migration.multifd-flush-after-each-section=off' on qemu-kvm-8.0.0-9.el9.x86_64.
migration total time: 48392 ms; the bandwidth utilization also looks good.

Regarding the test results of 1-(1) and 2-(2), I don't see any migration performance improvement (total time or bandwidth utilization) on the fixed qemu version (qemu-kvm-8.0.0-9.el9.x86_64).
I had expected migration performance to be better after the fix, but there is no performance difference. I don't quite understand; can you help explain?

So why do we introduce the new parameter migration.multifd-flush-after-each-section?  What I can see is that on the fixed qemu version (qemu-kvm-8.0.0-9.el9.x86_64), migration performance is better with multifd-flush-after-each-section set to off than to on.

Comment 27 Juan Quintela 2023-07-27 12:31:31 UTC
(In reply to Li Xiaohui from comment #25)
> Hi Juan,
> 
> (In reply to Juan Quintela from comment #24)
> > Hi
> > you need to add "-global migration.multifd-flush-after-each-section=off"
> > to the command line.
> > 
> > Since we decided not to add a 9.3 machine type, we can't enable it by
> > default.
> 
> So you mean the rhel9.3 machine type will set the
> multifd-flush-after-each-section to off by default in the future?

Yes.  Until we have a new machine type, we can't enable it by default; it would break migration from previous qemu.

> But now we don't have the rhel9.3 machine type on RHEL 9.3 hosts, the
> highest machine type is rhel9.2. 
> multifd-flush-after-each-section is on by default on rhel9.2 machine type,
> so we need to set it to off, right?

To use it, we need to do that.

> Note: 'by default' I mean we don't need to add the qemu cmd
> 'multifd-flush-after-each-section=xx'. 

Exactly.  This "improvement" is not used by default until we have a new machine type.  Until then, it needs to be enabled by setting
'multifd-flush-after-each-section=off'.
 
> What's more, will libvirt support multifd-flush-after-each-section=off on
> RHEL 9.3? 

I don't think we are going to use that, unless CNV/OpenStack are going to use it.

> I want to know when will libvirt support this optimiation.

We don't know.  @jdenemar any idea?

I know we are asking late, but it was "surprising" that we don't have the new machine type.

Comment 28 Juan Quintela 2023-07-27 12:39:59 UTC
(In reply to Li Xiaohui from comment #26)
> Hi Juan, 
> 
> I have tried to do some tests on qemu-kvm-8.0.0-7.el9.x86_64 &&
> qemu-kvm-8.0.0-9.el9.x86_64
> 
> Hosts info: 256 cpus and 1510G memory, nic speed between src and dst host is
> 200G
> Guest info: VM with 400G memory and 64 cpus 
> In VM, use stressapptest to load some memory workload: # stressapptest -M
> 20000 -s 1000000
> 
> QMP cmd: enable multifd capability and set multifd channel to 20. BTW, set
> max-bandwidth to 0 on src host
> 
> 1. On the fixed version -> qemu-kvm-8.0.0-9.el9.x86_64
> (1) with '-global migration.multifd-flush-after-each-section=off' qemu cmd:
> I could see migration finish immediately (total time: 51701 ms)
> (2) with '-global migration.multifd-flush-after-each-section=on' qemu cmd:
> Migration needs some time to finish. The total time is much longer than
> multifd-flush-after-each-section=off -> total time: 305990 ms
> 
> Comparing throughput while migration is active, I can see that (1) utilizes
> bandwidth better than (2). But (1) doesn't use all of the available network
> bandwidth (200G); the max throughput is nearly 70G

Two things.  First, the speedup is considerable: 51 seconds vs 305 seconds.

Second, could you add the downtime for both cases,
and tell me how long you wait before launching the migrate command?
 
> 2. On the unfixed version -> qemu-kvm-8.0.0-7.el9.x86_64
> (1) When I try to boot VM with -global
> migration.multifd-flush-after-each-section=xx, qemu would prompt:
> (qemu) 2023-07-27T10:11:19.936315Z qemu-kvm: can't apply global
> migration.multifd-flush-after-each-section=on: Property
> 'migration.multifd-flush-after-each-section' not found
> 
> So the parameter migration.multifd-flush-after-each-section is new,
> introduced in qemu-kvm-8.0.0-9.el9?

Yes, it is new; it was added with the series for this bugzilla.


> (2) Without -global migration.multifd-flush-after-each-section=xx, the
> migration total time and bandwidth utilization are similar to the result of
> '-global migration.multifd-flush-after-each-section=off' on
> qemu-kvm-8.0.0-9.el9.x86_64.
> migration total time: 48392 ms; the bandwidth utilization also looks good.

Can you check migration from old <-> new with

migration.multifd-flush-after-each-section=on

and see that it works?


> Regarding the test results of 1-(1) and 2-(2), I don't see any migration
> performance improvement (total time or bandwidth utilization) on the fixed
> qemu version (qemu-kvm-8.0.0-9.el9.x86_64).
> I had expected migration performance to be better after the fix, but there
> is no performance difference. I don't quite understand; can you help explain?

I will look at this.  Either I messed something up during the backport, or this doesn't make any sense.

> So why do we introduce the new parameter
> migration.multifd-flush-after-each-section?  What I can see is that on the
> fixed qemu version (qemu-kvm-8.0.0-9.el9.x86_64), migration performance is
> better with multifd-flush-after-each-section set to off than to on.

The optimization comes from setting the variable to "off".

For your previous result, two things.

Could you tell me how long the migration takes for:

qemu-kvm-8.0.0-7.el9.x86_64

and see whether it gives you ~50 seconds or around 300 seconds?

Comment 29 Li Xiaohui 2023-07-27 14:23:01 UTC
Hi Juan, the following is the migration information for 1-(1), 1-(2) and 2-(2) of Comment 28.
I'm not sure if this data answers your questions; add comments again if it doesn't.

1-(1)
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
Migration status: completed
total time: 51701 ms
downtime: 1643 ms
setup: 1382 ms
transferred ram: 52067927 kbytes
throughput: 8476.88 mbps
remaining ram: 0 kbytes
total ram: 419451592 kbytes
duplicate: 97827864 pages
skipped: 0 pages
normal: 12769250 pages
normal bytes: 51077000 kbytes
dirty sync count: 5
page size: 4 kbytes
multifd bytes: 51208105 kbytes
pages-per-second: 1988368
precopy ram: 859704 kbytes
downtime ram: 118 kbytes


1-(2)
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
Migration status: completed
total time: 305990 ms
downtime: 8383 ms
setup: 1966 ms
transferred ram: 86113793 kbytes
throughput: 2320.38 mbps
remaining ram: 0 kbytes
total ram: 419451592 kbytes
duplicate: 98056257 pages
skipped: 0 pages
normal: 21249711 pages
normal bytes: 84998844 kbytes
dirty sync count: 8
page size: 4 kbytes
multifd bytes: 85251960 kbytes
pages-per-second: 1979446
precopy ram: 861627 kbytes
downtime ram: 205 kbytes


2-(2)
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
Migration status: completed
total time: 48392 ms
downtime: 3920 ms
setup: 679 ms
transferred ram: 49725847 kbytes
throughput: 8537.76 mbps
remaining ram: 0 kbytes
total ram: 419451592 kbytes
duplicate: 97840241 pages
skipped: 0 pages
normal: 12179958 pages
normal bytes: 48719832 kbytes
dirty sync count: 4
page size: 4 kbytes
multifd bytes: 48865917 kbytes
pages-per-second: 1916735
precopy ram: 859724 kbytes
downtime ram: 205 kbytes

Comment 30 Li Xiaohui 2023-07-27 14:25:51 UTC
Keep the needinfo to @jdenemar about Comment 27

Comment 31 Li Xiaohui 2023-07-28 03:19:41 UTC
Hi, I also tested without the migration.multifd-flush-after-each-section parameter on qemu-kvm-8.0.0-9.el9.x86_64, and found the migration total time is nearly 5 mins; the overall data looks like the result of migration.multifd-flush-after-each-section=on.

So I guess migration.multifd-flush-after-each-section is on by default in qemu-kvm-8.0.0-9.el9.x86_64.

Comment 32 Li Xiaohui 2023-07-28 03:20:34 UTC
The migration info of Comment 31:
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
clear-bitmap-shift: 18
Migration status: completed
total time: 281721 ms
downtime: 1158 ms
setup: 1128 ms
transferred ram: 76548070 kbytes
throughput: 2234.87 mbps
remaining ram: 0 kbytes
total ram: 419451592 kbytes
duplicate: 98003091 pages
skipped: 0 pages
normal: 18864139 pages
normal bytes: 75456556 kbytes
dirty sync count: 6
page size: 4 kbytes
multifd bytes: 75686704 kbytes
pages-per-second: 2015168
precopy ram: 861211 kbytes
downtime ram: 154 kbytes

Comment 33 Juan Quintela 2023-07-31 07:04:47 UTC
(In reply to Li Xiaohui from comment #31)
> Hi, I also tested without the migration.multifd-flush-after-each-section
> parameter on qemu-kvm-8.0.0-9.el9.x86_64, and found the migration total time
> is nearly 5 mins; the overall data looks like the result of
> migration.multifd-flush-after-each-section=on.
> 
> So I guess migration.multifd-flush-after-each-section is on by default in
> qemu-kvm-8.0.0-9.el9.x86_64.

Then it is right.

Comment 34 Juan Quintela 2023-07-31 07:06:02 UTC
Ok, seeing all the comments, I think this bug is fixed and correct, no?

Comment 35 Juan Quintela 2023-07-31 08:41:47 UTC
/me rereads all the numbers again. Note that I have rearranged them:

1 - On the unfixed version -> qemu-kvm-8.0.0-7.el9.x86_64:

total time: 48392 ms
downtime: 3920 ms
transferred ram: 49725847 kbytes
throughput: 8537.76 mbps
duplicate: 97840241 pages
normal: 12179958 pages
dirty sync count: 4
multifd bytes: 48865917 kbytes

2a - On the fixed version -> qemu-kvm-8.0.0-9.el9.x86_64

with '-global migration.multifd-flush-after-each-section=off'

total time: 51701 ms
downtime: 1643 ms
transferred ram: 52067927 kbytes
throughput: 8476.88 mbps
duplicate: 97827864 pages
normal: 12769250 pages
dirty sync count: 5
multifd bytes: 51208105 kbytes

2b - with '-global migration.multifd-flush-after-each-section=off

total time: 305990 ms
downtime: 8383 ms
transferred ram: 86113793 kbytes
throughput: 2320.38 mbps
duplicate: 98056257 pages
normal: 21249711 pages
normal bytes: 84998844 kbytes
dirty sync count: 8
multifd bytes: 85251960 kbytes

1 and 2b should be around the same values, but I see that 2b is way, way worse.
Can I ask how many times the test has been run?  Just wondering if the test is not good enough to detect this problem.

The diffs between 2a and 2b are what I would expect from the change.  But between 1 and 2b there should be almost no difference, so I am taking another look at the values.  I would appreciate it if you could run each of 1 and 2b three or four times and see how consistent the values are between iterations.  I am taking a look at the code right now.

Comment 36 Li Xiaohui 2023-07-31 09:02:14 UTC
(In reply to Juan Quintela from comment #35)
> /me rereads all the numbers again. Notice that I have rearranged it:
> 
> 1 - On the unfixed version -> qemu-kvm-8.0.0-7.el9.x86_64:
> 
> total time: 48392 ms
> downtime: 3920 ms
> transferred ram: 49725847 kbytes
> throughput: 8537.76 mbps
> duplicate: 97840241 pages
> normal: 12179958 pages
> dirty sync count: 4
> multifd bytes: 48865917 kbytes
> 
> 2a - On the fixed version -> qemu-kvm-8.0.0-9.el9.x86_64
> 
> with '-global migration.multifd-flush-after-each-section=off
> 
> total time: 51701 ms
> downtime: 1643 ms
> transferred ram: 52067927 kbytes
> throughput: 8476.88 mbps
> duplicate: 97827864 pages
> normal: 12769250 pages
> dirty sync count: 5
> multifd bytes: 51208105 kbytes
> 
> 2b - with '-global migration.multifd-flush-after-each-section=off

I think that is a typo on your part.
2b - with '-global migration.multifd-flush-after-each-section=on'

> 
> total time: 305990 ms
> downtime: 8383 ms
> transferred ram: 86113793 kbytes
> throughput: 2320.38 mbps
> duplicate: 98056257 pages
> normal: 21249711 pages
> normal bytes: 84998844 kbytes
> dirty sync count: 8
> multifd bytes: 85251960 kbytes
> 
> 1 and 2b should be around the same values, but I see that 2b is way, way
> worse.
> Can I ask how many times the test has been run?  Just wondering if the
> test is not good enough to detect this problem.
> 
> The diffs between 2a and 2b are what I would expect from the change.  But
> between 1 and 2b there should be almost no difference, so I am taking another
> look at the values.  I would appreciate it if you could run each of 1 and 2b
> three or four times and see how consistent the values are between iterations.

I would do it now.

> I am taking a look at the code right now.

Comment 37 Li Xiaohui 2023-07-31 14:11:30 UTC
Hi Juan,

(In reply to Li Xiaohui from comment #36)
> (In reply to Juan Quintela from comment #35)
> > /me rereads all the numbers again. Notice that I have rearranged it:
> > 
> > 1 - On the unfixed version -> qemu-kvm-8.0.0-7.el9.x86_64:
> > 
> > total time: 48392 ms
> > downtime: 3920 ms
> > transferred ram: 49725847 kbytes
> > throughput: 8537.76 mbps
> > duplicate: 97840241 pages
> > normal: 12179958 pages
> > dirty sync count: 4
> > multifd bytes: 48865917 kbytes
> > 
> > 2a - On the fixed version -> qemu-kvm-8.0.0-9.el9.x86_64
> > 
> > with '-global migration.multifd-flush-after-each-section=off
> > 
> > total time: 51701 ms
> > downtime: 1643 ms
> > transferred ram: 52067927 kbytes
> > throughput: 8476.88 mbps
> > duplicate: 97827864 pages
> > normal: 12769250 pages
> > dirty sync count: 5
> > multifd bytes: 51208105 kbytes
> > 
> > 2b - with '-global migration.multifd-flush-after-each-section=off
> 
> I think that is a typo on your part.
> 2b - with '-global migration.multifd-flush-after-each-section=on'
> 
> > 
> > total time: 305990 ms
> > downtime: 8383 ms
> > transferred ram: 86113793 kbytes
> > throughput: 2320.38 mbps
> > duplicate: 98056257 pages
> > normal: 21249711 pages
> > normal bytes: 84998844 kbytes
> > dirty sync count: 8
> > multifd bytes: 85251960 kbytes
> > 
> > 1 and 2b should be around the same values, but I see that 2b is way, way
> > worse.
> > Can I ask how many times the test has been run?  Just wondering if the
> > test is not good enough to detect this problem.
> > 
> > The diffs between 2a and 2b are what I would expect from the change.  But
> > between 1 and 2b there should be almost no difference, so I am taking another
> > look at the values.  I would appreciate it if you could run each of 1 and 2b
> > three or four times and see how consistent the values are between iterations.
> 

Repeat 6 times for 1 (qemu-kvm-8.0.0-7.el9.x86_64) and 2b (qemu-kvm-8.0.0-9.el9.x86_64 with '-global migration.multifd-flush-after-each-section=on')

For 1:
Total time: 49438 Downtime: 4900
Total time: 42173 Downtime: 6692
Total time: 51555 Downtime: 632
Total time: 72926 Downtime: 8296
Total time: 76313 Downtime: 5750
Total time: 69685 Downtime: 6320


For 2b:
Total time: 50232 Downtime: 3383
Total time: 56592 Downtime: 7028
Total time: 263880 Downtime: 872
Total time: 289918 Downtime: 665
Total time: 72666 Downtime: 1242
Total time: 251911 Downtime: 813


> 
> > I am taking a look at the code right now.

Comment 39 Juan Quintela 2023-08-01 10:24:12 UTC
(In reply to Li Xiaohui from comment #37)
> Hi Juan,
> 
> (In reply to Li Xiaohui from comment #36)
> > (In reply to Juan Quintela from comment #35)
> > > /me rereads all the numbers again. Notice that I have rearranged it:
> > > 
> > > 1 - On the unfixed version -> qemu-kvm-8.0.0-7.el9.x86_64:
> > > 
> > > total time: 48392 ms
> > > downtime: 3920 ms
> > > transferred ram: 49725847 kbytes
> > > throughput: 8537.76 mbps
> > > duplicate: 97840241 pages
> > > normal: 12179958 pages
> > > dirty sync count: 4
> > > multifd bytes: 48865917 kbytes
> > > 
> > > 2a - On the fixed version -> qemu-kvm-8.0.0-9.el9.x86_64
> > > 
> > > with '-global migration.multifd-flush-after-each-section=off
> > > 
> > > total time: 51701 ms
> > > downtime: 1643 ms
> > > transferred ram: 52067927 kbytes
> > > throughput: 8476.88 mbps
> > > duplicate: 97827864 pages
> > > normal: 12769250 pages
> > > dirty sync count: 5
> > > multifd bytes: 51208105 kbytes
> > > 
> > > 2b - with '-global migration.multifd-flush-after-each-section=off
> > 
> > I think that is a typo on your part.
> > 2b - with '-global migration.multifd-flush-after-each-section=on'
> > 
> > > 
> > > total time: 305990 ms
> > > downtime: 8383 ms
> > > transferred ram: 86113793 kbytes
> > > throughput: 2320.38 mbps
> > > duplicate: 98056257 pages
> > > normal: 21249711 pages
> > > normal bytes: 84998844 kbytes
> > > dirty sync count: 8
> > > multifd bytes: 85251960 kbytes
> > > 
> > > 1 and 2b should be around the same values, but I see that 2b is way, way
> > > worse.
> > > Can I ask how many times the test has been run?  Just wondering if the
> > > test is not good enough to detect this problem.
> > > 
> > > The diffs between 2a and 2b are what I would expect from the change.  But
> > > between 1 and 2b there should be almost no difference, so I am taking another
> > > look at the values.  I would appreciate it if you could run each of 1 and 2b
> > > three or four times and see how consistent the values are between iterations.
> > 
> 
> Repeat 6 times for 1 (qemu-kvm-8.0.0-7.el9.x86_64) and 2b
> (qemu-kvm-8.0.0-9.el9.x86_64 with '-global
> migration.multifd-flush-after-each-section=on')
> 
> For 1:
> Total time: 49438 Downtime: 4900
> Total time: 42173 Downtime: 6692
> Total time: 51555 Downtime: 632
> Total time: 72926 Downtime: 8296
> Total time: 76313 Downtime: 5750
> Total time: 69685 Downtime: 6320
> 
> 
> For 2b:
> Total time: 50232 Downtime: 3383
> Total time: 56592 Downtime: 7028
> Total time: 263880 Downtime: 872
> Total time: 289918 Downtime: 665
> Total time: 72666 Downtime: 1242
> Total time: 251911 Downtime: 813

Completely unstable, so I have to think of another way to test it.

Sniff.

> 
> > 
> > > I am taking a look at the code right now.

Comment 42 Li Xiaohui 2023-08-04 08:12:16 UTC
Hi Juan,
Can we mark this bug FailQA and reassign it to you? 
The fix isn't working well per the current test results.

Comment 43 Li Xiaohui 2023-08-10 06:59:12 UTC
Juan has replied to the question in Comment 42 through Slack, so I will reassign this bug to Juan.

Xiaohui Li
Hi, how is https://bugzilla.redhat.com/show_bug.cgi?id=2196295 going now?
4:09
Can we mark this bug FailQA, as I don't think the current fix works well?

Juan Quintela
7:33 PM
It is not urgent.
7:33
we can move it to 9.4
7:33
I still think that the problem is in the test, but I haven't come up with anything better yet.

Comment 44 Li Xiaohui 2023-08-10 07:02:02 UTC
Hi Juan, 
Can you help change the ITR to 9.4.0?

