Description of problem:
In a network packet loss environment, start a multifd migration, then recover from the packet loss; the multifd migration continues and succeeds, but no multifd threads are found.

Version-Release number of selected component (if applicable):
hosts info: kernel-5.14.0-316.el9.x86_64 && qemu-kvm-8.0.0-4.el9.x86_64
guest info: kernel-5.14.0-312.el9.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start a VM on the source host
2. Start a VM on the destination host with '-incoming defer' appended
3. Enable the multifd capability on the src and dst hosts (see the QMP sketch under "Additional info" below)
4. Create network packet loss on the src host:
# tc qdisc add dev switch root netem loss 40%
5. Migrate the VM from the src to the dst host
6. Remove the network packet loss, then let the multifd migration continue:
# tc qdisc delete dev switch root netem loss 40%

Actual results:
After network recovery, no multifd threads are found while multifd migration is active:
Notes: multifd migration still succeeds, and the VM works well after migration.

2023-05-24-04:26:28: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:26:28: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:26:28: 0
2023-05-24-04:26:28: Host(10.73.2.80) Sending qmp command : {"execute": "query-migrate", "id": "JytSD1w4"}
2023-05-24-04:26:28: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 325809, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 7030912, "pages-per-second": 6, "downtime-bytes": 0, "page-size": 4096, "remaining": 4301185024, "postcopy-bytes": 0, "mbps": 0.21188531903454091, "transferred": 7049591, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 18679, "duplicate": 2072, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 6995968, "normal": 1708}}, "id": "JytSD1w4"}
...
2023-05-24-04:26:49: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:26:49: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:26:49: 0
2023-05-24-04:26:49: Host(10.73.2.80) Sending qmp command : {"execute": "query-migrate", "id": "5vxpJ2gL"}
2023-05-24-04:26:49: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 346983, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 7030912, "pages-per-second": 6, "downtime-bytes": 0, "page-size": 4096, "remaining": 4301185024, "postcopy-bytes": 0, "mbps": 0.21188531903454091, "transferred": 7049591, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 18679, "duplicate": 2072, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 6995968, "normal": 1708}}, "id": "5vxpJ2gL"}
2023-05-24-04:26:50: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:26:50: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:26:50: 0
2023-05-24-04:26:50: Host(10.73.2.80) Sending qmp command : {"execute": "query-migrate", "id": "RQ4QbnvV"}
2023-05-24-04:26:50: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 347991, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 7030912, "pages-per-second": 6, "downtime-bytes": 0, "page-size": 4096, "remaining": 4301185024, "postcopy-bytes": 0, "mbps": 0.21188531903454091, "transferred": 7049591, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 18679, "duplicate": 2072, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 6995968, "normal": 1708}}, "id": "RQ4QbnvV"}
2023-05-24-04:26:51: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:26:51: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:26:51: 0
2023-05-24-04:26:51: Host(10.73.2.80) Sending qmp command : {"execute": "query-migrate", "id": "Qprg16x2"}
2023-05-24-04:26:51: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 349000, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 122946176, "pages-per-second": 27974, "downtime-bytes": 0, "page-size": 4096, "remaining": 3087269888, "postcopy-bytes": 0, "mbps": 919.48127731092438, "transferred": 125377224, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 2431048, "duplicate": 270097, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 123076608, "normal": 30048}}, "id": "Qprg16x2"}
2023-05-24-04:26:52: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:26:52: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:26:52: 0
2023-05-24-04:26:52: Host(10.73.2.80) Sending qmp command : {"execute": "query-migrate", "id": "AxMYJNt4"}
2023-05-24-04:26:52: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 350008, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 241992064, "pages-per-second": 28698, "downtime-bytes": 0, "page-size": 4096, "remaining": 2968485888, "postcopy-bytes": 0, "mbps": 943.2609655172414, "transferred": 244423464, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 2431400, "duplicate": 270121, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 241762304, "normal": 59024}}, "id": "AxMYJNt4"}
...
2023-05-24-04:27:08: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:27:08: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:27:08: 0
2023-05-24-04:27:08: Host(10.73.2.80) Sending qmp command : {"execute": "query-migrate", "id": "CrLrH587"}
2023-05-24-04:27:08: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 366150, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 2141654272, "pages-per-second": 40800, "downtime-bytes": 0, "page-size": 4096, "remaining": 667672576, "postcopy-bytes": 0, "mbps": 1101.2626399999999, "transferred": 2144981949, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 3327677, "duplicate": 369462, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 2135674880, "normal": 521405}}, "id": "CrLrH587"}
2023-05-24-04:27:09: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:27:09: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:27:09: 0
2023-05-24-04:27:09: Host(10.73.2.80) Sending qmp command : {"execute": "query-migrate", "id": "RwwQ4my6"}
2023-05-24-04:27:09: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 367160, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1, "multifd-bytes": 2262454528, "pages-per-second": 27312, "downtime-bytes": 0, "page-size": 4096, "remaining": 528969728, "postcopy-bytes": 0, "mbps": 881.67276800000002, "transferred": 2265822489, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 3367961, "duplicate": 373922, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 2256109568, "normal": 550808}}, "id": "RwwQ4my6"}
2023-05-24-04:27:10: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:27:10: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:27:10: 0
2023-05-24-04:27:10: Host(10.73.2.80) Sending qmp command : {"execute": "query-migrate", "id": "RdHaR1Fs"}
2023-05-24-04:27:10: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300, "status": "active", "setup-time": 1060, "total-time": 368169, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 2, "multifd-bytes": 2376975936, "pages-per-second": 27512, "downtime-bytes": 0, "page-size": 4096, "remaining": 53698560, "postcopy-bytes": 0, "mbps": 904.28323966942162, "transferred": 2381207903, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 4231967, "duplicate": 469906, "dirty-pages-rate": 21, "skipped": 0, "normal-bytes": 2370240512, "normal": 578672}}, "id": "RdHaR1Fs"}
2023-05-24-04:27:11: ----- Check the multifd threads on the src and dst host. -----
2023-05-24-04:27:11: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l
2023-05-24-04:27:11: 0
2023-05-24-04:27:11: Host(10.73.2.80) Sending qmp command : {"execute": "query-migrate", "id": "uR4McgzU"}
2023-05-24-04:27:12: Host(10.73.2.80) Responding qmp command: {"return": {"status": "completed", "setup-time": 1060, "downtime": 156, "total-time": 368569, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 4, "multifd-bytes": 2409007744, "pages-per-second": 41974, "downtime-bytes": 41598, "page-size": 4096, "remaining": 0, "postcopy-bytes": 0, "mbps": 52.540865252279538, "transferred": 2413308246, "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 4258904, "duplicate": 477508, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 2401693696, "normal": 586351}}, "id": "uR4McgzU"}

Expected results:
Besides the multifd migration succeeding and the VM working well after migration, we should also find 2 multifd threads while the multifd migration is active.

Additional info:
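For reference, a minimal sketch of the QMP sequence behind steps 3 and 5; the channel count, port and destination address are illustrative placeholders, not the exact values used in this run.

On both the src and dst QMP monitors, enable the multifd capability (the channel count defaults to 2):
{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "multifd", "state": true}]}}
{"execute": "migrate-set-parameters", "arguments": {"multifd-channels": 2}}

On the dst monitor (started with '-incoming defer'), open the listening port, then start the migration from the src monitor:
{"execute": "migrate-incoming", "arguments": {"uri": "tcp:0.0.0.0:4000"}}
{"execute": "migrate", "arguments": {"uri": "tcp:<dst-host>:4000"}}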
Nitesh - sending this untriaged one your way since it has had no assignee for quite some time.
This is normal behaviour.

When using migration (not postcopy), if we have a big problem on the destination (in this case network loss), we just drop the destination qemu, and if the user fixes the problem, they should launch the qemu on the destination again.

The only thing that we should improve is that if the channels die, the whole qemu should also die.

This is not a regression and I think we can drop it or move it to the next version.
What is happening here is:
- create destination qemu
- create source qemu
- stop the network
- launch the migrate command
- it gets lost because the network is dropping packets
- fix the network
- the packet doesn't reappear (it is lost), so migration never starts on the destination

The right thing to do at this point would have been:
- migrate_cancel
- relaunch the migrate command
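A minimal sketch of that recovery sequence on the source QMP monitor, assuming a TCP migration URI (the destination address and port are placeholders, not the ones from this run):

{"execute": "migrate_cancel"}
{"execute": "query-migrate"}   (repeat until "status" reports "cancelled")
{"execute": "migrate", "arguments": {"uri": "tcp:<dst-host>:<port>"}}

The destination qemu would typically need to be relaunched as well, as described above, before re-issuing the migrate command.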
> 2023-05-24-04:27:10: [root@host ~]# cat /proc/73671/task/*/comm | grep
> multifdsend | wc -l
> 2023-05-24-04:27:10: 0
> 2023-05-24-04:27:10: Host(10.73.2.80) Sending qmp command : {"execute":
> "query-migrate", "id": "RdHaR1Fs"}
> 2023-05-24-04:27:10: Host(10.73.2.80) Responding qmp command: {"return":
> {"expected-downtime": 300, "status": "active", "setup-time": 1060,
> "total-time": 368169, "ram": {"total": 4316667904, "postcopy-requests": 0,
> "dirty-sync-count": 2, "multifd-bytes": 2376975936, "pages-per-second":
> 27512, "downtime-bytes": 0, "page-size": 4096, "remaining": 53698560,
> "postcopy-bytes": 0, "mbps": 904.28323966942162, "transferred": 2381207903,
> "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 4231967, "duplicate":
> 469906, "dirty-pages-rate": 21, "skipped": 0, "normal-bytes": 2370240512,
> "normal": 578672}}, "id": "RdHaR1Fs"}
> 2023-05-24-04:27:11: ----- Check the multifd threads on the src and dst
> host. -----
> 2023-05-24-04:27:11: [root@host ~]# cat /proc/73671/task/*/comm | grep
> multifdsend | wc -l

Could you double check that 73671 is the right process id?

> 2023-05-24-04:27:11: 0
> 2023-05-24-04:27:11: Host(10.73.2.80) Sending qmp command : {"execute":
> "query-migrate", "id": "uR4McgzU"}
> 2023-05-24-04:27:12: Host(10.73.2.80) Responding qmp command: {"return":
> {"status": "completed", "setup-time": 1060, "downtime": 156, "total-time":
> 368569, "ram": {"total": 4316667904, "postcopy-requests": 0,
> "dirty-sync-count": 4, "multifd-bytes": 2409007744,

Because we are clearly using multifd, otherwise this value should be 0.

> "pages-per-second":
> 41974, "downtime-bytes": 41598, "page-size": 4096, "remaining": 0,
> "postcopy-bytes": 0, "mbps": 52.540865252279538, "transferred": 2413308246,
> "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 4258904, "duplicate":
> 477508, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 2401693696,
> "normal": 586351}}, "id": "uR4McgzU"}
(In reply to Juan Quintela from comment #2)
> This is normal behaviour.
> 
> When using migration (not postcopy), if we have a big problem on the
> destination (in this case network loss), we just drop the destination
> qemu, and if the user fixes the problem, they should launch the qemu on
> the destination again.
> 
> The only thing that we should improve is that if the channels die, the
> whole qemu should also die.
> 
> This is not a regression and I think we can drop it or move it to the
> next version.

Hi Juan, in one previous relevant bug we fixed one error so that multifd migration can go on after network recovery until it succeeds, see https://bugzilla.redhat.com/show_bug.cgi?id=2137740#c8
Bug 2137740 - Multifd migration fails under a weak network/socket ordering race

Adding a needinfo for Peter here because he fixed that bug.
Peter, please add comments if you have any.
(In reply to Juan Quintela from comment #4)
...
> > 2023-05-24-04:27:11: ----- Check the multifd threads on the src and dst
> > host. -----
> > 2023-05-24-04:27:11: [root@host ~]# cat /proc/73671/task/*/comm | grep
> > multifdsend | wc -l
> 
> Could you double check that 73671 is the right process id?

It should be right because we use automation to obtain the process id and monitor the multifdsend threads.

> > 2023-05-24-04:27:11: 0
> > 2023-05-24-04:27:11: Host(10.73.2.80) Sending qmp command : {"execute":
> > "query-migrate", "id": "uR4McgzU"}
> > 2023-05-24-04:27:12: Host(10.73.2.80) Responding qmp command: {"return":
> > {"status": "completed", "setup-time": 1060, "downtime": 156, "total-time":
> > 368569, "ram": {"total": 4316667904, "postcopy-requests": 0,
> > "dirty-sync-count": 4, "multifd-bytes": 2409007744,
> 
> Because we are clearly using multifd, otherwise this value should be 0.
(In reply to Li Xiaohui from comment #6)
> (In reply to Juan Quintela from comment #4)
> ...
> > > 2023-05-24-04:27:11: ----- Check the multifd threads on the src and dst
> > > host. -----
> > > 2023-05-24-04:27:11: [root@host ~]# cat /proc/73671/task/*/comm | grep
> > > multifdsend | wc -l
> >
> > Could you double check that 73671 is the right process id?
> 
> It should be right because we use automation to obtain the process id and
> monitor the multifdsend threads.

Then can you try to reproduce?

It makes zero sense, because we have:
- no multifd_send threads
- all data is sent through multifd threads

So at least one of these two things needs to be false.

This is the first output that you show:

> After network recovery, no multifd threads are found while multifd migration is active:
> Notes: multifd migration still succeeds, and the VM works well after migration.
> 2023-05-24-04:26:28: ----- Check the multifd threads on the src and dst host. -----
> 2023-05-24-04:26:28: [root@host ~]# cat /proc/73671/task/*/comm | grep multifdsend | wc -l

No multifd send threads.

> 2023-05-24-04:26:28: 0
> 2023-05-24-04:26:28: Host(10.73.2.80) Sending qmp command : {"execute": "query-migrate", "id": "JytSD1w4"}
> 2023-05-24-04:26:28: Host(10.73.2.80) Responding qmp command: {"return": {"expected-downtime": 300,
> "status": "active",

Migration is still active.

> "setup-time": 1060, "total-time": 325809, "ram": {"total": 4316667904, "postcopy-requests": 0, "dirty-sync-count": 1,
> "multifd-bytes": 7030912,

7MB transferred through multifd.

> "pages-per-second": 6, "downtime-bytes": 0, "page-size": 4096,
> "remaining": 4301185024,

Still a lot of data to transfer.

> "postcopy-bytes": 0,
> "mbps": 0.21188531903454091

Almost nothing is being transferred; 0.2MB/s is basically stopped.

> "transferred": 7049591,

Basically everything is sent through multifd (as it should be): 7049591 - 7030912 = 18679 ~ 18KB sent through the main channel.

> "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 18679, "duplicate": 2072, "dirty-pages-rate": 0, "skipped": 0,
> "normal-bytes": 6995968,
> "normal": 1708}}, "id": "JytSD1w4"}

We go to the last one:

> 2023-05-24-04:27:11: Host(10.73.2.80) Sending qmp command : {"execute":
> "query-migrate", "id": "uR4McgzU"}
> 2023-05-24-04:27:12: Host(10.73.2.80) Responding qmp command: {"return":
> {"status": "completed",

Migration finished.

> "setup-time": 1060, "downtime": 156, "total-time":
> 368569, "ram": {"total": 4316667904, "postcopy-requests": 0,
> "dirty-sync-count": 4, "multifd-bytes": 2409007744,

Lots of multifd traffic (as it should be).

> "pages-per-second":
> 41974, "downtime-bytes": 41598, "page-size": 4096, "remaining": 0,
> "postcopy-bytes": 0, "mbps": 52.540865252279538,

52MB/s network speed, reasonable.

> "transferred": 2413308246,
> "dirty-sync-missed-zero-copy": 0, "precopy-bytes": 4258904, "duplicate":
> 477508, "dirty-pages-rate": 0, "skipped": 0, "normal-bytes": 2401693696,
> "normal": 586351}}, "id": "uR4McgzU"}

So I am really at a loss as to why you don't find the multifd_send threads. That is why I am asking whether there is a possibility of a testing error.
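One way to double-check this by hand, independently of the automation, is to re-resolve the QEMU PID right before counting the threads and to also look at the receive side. A rough sketch, assuming the guests run as qemu-kvm processes and using placeholder pgrep patterns:

# On the src host: re-resolve the source QEMU PID, then count multifd send threads
src_pid=$(pgrep -f 'qemu-kvm.*<src-vm-name>')
cat /proc/"$src_pid"/task/*/comm | grep multifdsend | wc -l

# On the dst host: the receive threads are named multifdrecv_*
dst_pid=$(pgrep -f 'qemu-kvm.*-incoming')
cat /proc/"$dst_pid"/task/*/comm | grep multifdrecv | wc -l

With the multifd capability enabled and the default of 2 channels, both counts should be 2 while the migration is active.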
(In reply to Li Xiaohui from comment #5)
> (In reply to Juan Quintela from comment #2)
> > This is normal behaviour.
> > 
> > When using migration (not postcopy), if we have a big problem on the
> > destination (in this case network loss), we just drop the destination
> > qemu, and if the user fixes the problem, they should launch the qemu on
> > the destination again.
> > 
> > The only thing that we should improve is that if the channels die, the
> > whole qemu should also die.
> > 
> > This is not a regression and I think we can drop it or move it to the
> > next version.
> 
> Hi Juan, in one previous relevant bug we fixed one error so that multifd
> migration can go on after network recovery until it succeeds, see
> https://bugzilla.redhat.com/show_bug.cgi?id=2137740#c8
> Bug 2137740 - Multifd migration fails under a weak network/socket ordering
> race

That is a different bug. The problem there was that the channels (due to the network dropping packets) were created in the wrong order.

As I explained in the previous comment, here we have:
- you are not able to find the multifd threads
- we are transferring all the data through multifd threads

Thanks, Juan.

> 
> Add one needinfo to Peter here because he fixed that bug.
> Peter, please add comments if you have any.
Xiaohui, I see that the discussion is ongoing and Juan seems to have answered. I'm removing the needinfo for now. If you still need me for anything, just set it again or drop me a message. Thanks!