Created attachment 1609309 [details]
libvirtd and qemu log, libvirtd backtrace, vm xml

Description of problem:
libvirtd deadlocks when the destination QEMU crashes during live migration. I have only reproduced this bug in an RDMA environment so far; I will try to find a reproducer in a non-RDMA environment.

Version-Release number of selected component (if applicable):
libvirt-5.6.0-2.virtcov.el8.x86_64
qemu-kvm-4.1.0-5.module+el8.1.0+4076+b5e41ebc.x86_64

How reproducible:
>50%

Steps to Reproduce:
1. Prepare an RDMA migration environment
2. Start a vm with a low memory hard limit on the source host
...
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <memtune>
    <hard_limit unit='KiB'>1048576</hard_limit>
    <swap_hard_limit unit='KiB'>1048576</swap_hard_limit>
  </memtune>
...
3. Migrate the vm with --rdma-pin-all (expected result: it will fail because the memory limit is too low):
# virsh migrate vm2 qemu+ssh://dell-per730-36.lab.eng.pek2.redhat.com/system --live --verbose --p2p --migrateuri rdma://192.168.100.4 --listen-address 0.0.0.0 --rdma-pin-all

Actual results:
libvirtd deadlocks after step 3

Expected results:
libvirtd should not deadlock

Additional info:
Thread 3 (Thread 0x7f11b5831700 (LWP 12251)): #0 0x00007f11c9bc047c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f11cc409665 in virCondWait (c=c@entry=0x7f1148001fb8, m=m@entry=0x7f1148001f90) at util/virthread.c:154 #2 0x00007f11768111d3 in qemuMonitorSend (mon=mon@entry=0x7f1148001f80, msg=msg@entry=0x7f11b582fe00) at qemu/qemu_monitor.c:1075 #3 0x00007f1176825c72 in qemuMonitorJSONCommandWithFd (mon=mon@entry=0x7f1148001f80, cmd=cmd@entry=0x7f11a806efa0, scm_fd=scm_fd@entry=-1, reply=reply@entry=0x7f11b582fe98) at qemu/qemu_monitor_json.c:334 #4 0x00007f1176825df6 in qemuMonitorJSONCommand (mon=mon@entry=0x7f1148001f80, cmd=cmd@entry=0x7f11a806efa0, reply=reply@entry=0x7f11b582fe98) at qemu/qemu_monitor_json.c:359 #5 0x00007f117682c29c in qemuMonitorJSONGetMigrationStats (mon=mon@entry=0x7f1148001f80, stats=stats@entry=0x7f11b582ff00, error=error@entry=0x7f11b5830020) at qemu/qemu_monitor_json.c:3571 #6 0x00007f1176816a66 in qemuMonitorGetMigrationStats (mon=0x7f1148001f80, stats=stats@entry=0x7f11b582ff00, error=error@entry=0x7f11b5830020) at qemu/qemu_monitor.c:2604 #7 0x00007f11767f9dac in qemuMigrationAnyFetchStats (driver=driver@entry=0x7f115c0f88b0, vm=vm@entry=0x7f115c1f45f0, asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_MIGRATION_OUT, jobInfo=jobInfo@entry=0x7f11a8003cd0, error=error@entry=0x7f11b5830020) at qemu/qemu_migration.c:1387 #8 0x00007f11767fa63c in qemuMigrationJobCheckStatus (asyncJob=QEMU_ASYNC_JOB_MIGRATION_OUT, vm=0x7f115c1f45f0, driver=0x7f115c0f88b0) at qemu/qemu_migration.c:1438 #9 qemuMigrationAnyCompleted (driver=driver@entry=0x7f115c0f88b0, vm=vm@entry=0x7f115c1f45f0, asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_MIGRATION_OUT, dconn=dconn@entry=0x7f11a8004680, flags=flags@entry=8) at qemu/qemu_migration.c:1504 #10 0x00007f11767fae52 in qemuMigrationSrcWaitForCompletion (driver=driver@entry=0x7f115c0f88b0, vm=vm@entry=0x7f115c1f45f0, asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_MIGRATION_OUT, dconn=dconn@entry=0x7f11a8004680, flags=flags@entry=8) at qemu/qemu_migration.c:1615 #11 0x00007f117680305b in qemuMigrationSrcRun (driver=driver@entry=0x7f115c0f88b0, vm=vm@entry=0x7f115c1f45f0, persist_xml=persist_xml@entry=0x0, cookiein=cookiein@entry=0x7f11a804c4f0 "<qemu-migration>\n <name>vm2</name>\n <uuid>1b3338d6-b599-406b-a14c-33b000b15b6c</uuid>\n <hostname>dell-per730-36.lab.eng.pek2.redhat.com</hostname>\n 
<hostuuid>4c4c4544-0057-5010-8051-b7c04f584432</"..., cookieinlen=cookieinlen@entry=666, cookieout=cookieout@entry=0x7f11b58304f0, cookieoutlen=<optimized out>, flags=<optimized out>, resource=<optimized out>, spec=<optimized out>, dconn=<optimized out>, graphicsuri=<optimized out>, nmigrate_disks=<optimized out>, migrate_disks=<optimized out>, migParams=<optimized out>) at qemu/qemu_migration.c:3608 #12 0x00007f1176803e74 in qemuMigrationSrcPerformNative (driver=driver@entry=0x7f115c0f88b0, vm=vm@entry=0x7f115c1f45f0, persist_xml=persist_xml@entry=0x0, uri=uri@entry=0x7f11a8001cf0 "rdma://192.168.100.4:49152", cookiein=0x7f11a804c4f0 "<qemu-migration>\n <name>vm2</name>\n <uuid>1b3338d6-b599-406b-a14c-33b000b15b6c</uuid>\n <hostname>dell-per730-36.lab.eng.pek2.redhat.com</hostname>\n <hostuuid>4c4c4544-0057-5010-8051-b7c04f584432</"..., cookieinlen=666, cookieout=0x7f11b58304f0, cookieoutlen=0x7f11b58304cc, flags=16387, resource=0, dconn=0x7f11a8004680, graphicsuri=0x0, nmigrate_disks=0, migrate_disks=0x0, migParams=0x7f11a8002c80) at qemu/qemu_migration.c:3804 #13 0x00007f11768054b7 in qemuMigrationSrcPerformPeer2Peer3 (flags=<optimized out>, useParams=true, bandwidth=0, migParams=0x7f11a8002c80, nbdPort=0, migrate_disks=0x0, nmigrate_disks=0, listenAddress=0x7f11a8001000 "0.0.0.0", graphicsuri=0x0, uri=<optimized out>, dname=0x0, persist_xml=0x0, xmlin=<optimized out>, vm=0x7f115c1f45f0, dconnuri=0x7f11a8001bd0 "qemu+ssh://dell-per730-36.lab.eng.pek2.redhat.com/system", dconn=<optimized out>, sconn=0x7f1194001aa0, driver=0x7f115c0f88b0) at qemu/qemu_migration.c:4221 #14 qemuMigrationSrcPerformPeer2Peer (v3proto=<synthetic pointer>, resource=0, dname=0x0, flags=16387, migParams=0x7f11a8002c80, nbdPort=0, migrate_disks=0x0, nmigrate_disks=0, listenAddress=0x7f11a8001000 "0.0.0.0", graphicsuri=0x0, uri=0x7f11a8002000 "rdma://192.168.100.4", dconnuri=0x7f11a8001bd0 "qemu+ssh://dell-per730-36.lab.eng.pek2.redhat.com/system", persist_xml=0x0, xmlin=<optimized out>, vm=0x7f115c1f45f0, sconn=0x7f1194001aa0, driver=0x7f115c0f88b0) at qemu/qemu_migration.c:4522 #15 qemuMigrationSrcPerformJob (driver=driver@entry=0x7f115c0f88b0, conn=conn@entry=0x7f1194001aa0, vm=vm@entry=0x7f115c1f45f0, xmlin=xmlin@entry=0x0, persist_xml=persist_xml@entry=0x0, dconnuri=dconnuri@entry=0x7f11a8001bd0 "qemu+ssh://dell-per730-36.lab.eng.pek2.redhat.com/system", uri=0x7f11a8002000 "rdma://192.168.100.4", graphicsuri=0x0, listenAddress=0x7f11a8001000 "0.0.0.0", nmigrate_disks=0, migrate_disks=0x0, nbdPort=0, migParams=0x7f11a8002c80, cookiein=0x0, cookieinlen=0, cookieout=0x7f11b5830880, cookieoutlen=0x7f11b5830874, flags=16387, dname=0x0, resource=0, v3proto=<optimized out>) at qemu/qemu_migration.c:4600 #16 0x00007f1176806a58 in qemuMigrationSrcPerform (driver=driver@entry=0x7f115c0f88b0, conn=0x7f1194001aa0, vm=0x7f115c1f45f0, xmlin=0x0, persist_xml=0x0, dconnuri=dconnuri@entry=0x7f11a8001bd0 "qemu+ssh://dell-per730-36.lab.eng.pek2.redhat.com/system", uri=0x7f11a8002000 "rdma://192.168.100.4", graphicsuri=0x0, listenAddress=0x7f11a8001000 "0.0.0.0", nmigrate_disks=0, migrate_disks=0x0, nbdPort=0, migParams=0x7f11a8002c80, cookiein=0x0, cookieinlen=0, cookieout=0x7f11b5830880, cookieoutlen=0x7f11b5830874, flags=16387, dname=0x0, resource=0, v3proto=true) at qemu/qemu_migration.c:4771 #17 0x00007f117684e985 in qemuDomainMigratePerform3Params (dom=0x7f11a8002310, dconnuri=0x7f11a8001bd0 "qemu+ssh://dell-per730-36.lab.eng.pek2.redhat.com/system", params=<optimized out>, nparams=2, cookiein=0x0, cookieinlen=0, 
cookieout=0x7f11b5830880, cookieoutlen=0x7f11b5830874, flags=16387) at qemu/qemu_driver.c:13288 #18 0x00007f11cc67465f in virDomainMigratePerform3Params (domain=domain@entry=0x7f11a8002310, dconnuri=0x7f11a8001bd0 "qemu+ssh://dell-per730-36.lab.eng.pek2.redhat.com/system", params=0x7f11a8002490, nparams=2, cookiein=0x0, cookieinlen=0, cookieout=0x7f11b5830880, cookieoutlen=0x7f11b5830874, flags=16387) at libvirt-domain.c:4968 #19 0x000055b106657e6e in remoteDispatchDomainMigratePerform3Params (ret=0x7f11a8001c20, args=0x7f11a8001e60, rerr=0x7f11b5830940, msg=0x55b108067b80, client=<optimized out>, server=<optimized out>) at remote/remote_daemon_dispatch.c:5668 #20 remoteDispatchDomainMigratePerform3ParamsHelper (server=<optimized out>, client=<optimized out>, msg=0x55b108067b80, rerr=0x7f11b5830940, args=0x7f11a8001e60, ret=0x7f11a8001c20) at remote/remote_daemon_dispatch_stubs.h:8826 #21 0x00007f11cc55b974 in virNetServerProgramDispatchCall (msg=0x55b108067b80, client=0x55b108071ab0, server=0x55b108018f80, prog=0x55b108055ac0) at rpc/virnetserverprogram.c:435 #22 virNetServerProgramDispatch (prog=0x55b108055ac0, server=server@entry=0x55b108018f80, client=client@entry=0x55b108071ab0, msg=msg@entry=0x55b108067b80) at rpc/virnetserverprogram.c:302 #23 0x00007f11cc562b47 in virNetServerProcessMsg (srv=srv@entry=0x55b108018f80, client=0x55b108071ab0, prog=<optimized out>, msg=0x55b108067b80) at rpc/virnetserver.c:137 #24 0x00007f11cc562fa9 in virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x55b108018f80) at rpc/virnetserver.c:158 #25 0x00007f11cc40a65a in virThreadPoolWorker (opaque=opaque@entry=0x55b1080163b0) at util/virthreadpool.c:163 #26 0x00007f11cc4092a6 in virThreadHelper (data=<optimized out>) at util/virthread.c:206 #27 0x00007f11c9bba2de in start_thread () from /lib64/libpthread.so.0 #28 0x00007f11c9098133 in clone () from /lib64/libc.so.6 Thread 2 (Thread 0x7f11b6032700 (LWP 12250)): #0 0x00007f11c9bc07ca in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f11cc4096eb in virCondWaitUntil (c=c@entry=0x7f115c1f4738, m=m@entry=0x7f115c1f4600, whenms=whenms@entry=1567068885042) at util/virthread.c:169 #2 0x00007f1176791a14 in qemuDomainObjBeginJobInternal (driver=driver@entry=0x7f115c0f88b0, obj=obj@entry=0x7f115c1f45f0, job=job@entry=QEMU_JOB_QUERY, agentJob=agentJob@entry=QEMU_AGENT_JOB_NONE, asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_NONE, nowait=nowait@entry=false) at qemu/qemu_domain.c:7809 #3 0x00007f1176797364 in qemuDomainObjBeginJob (driver=driver@entry=0x7f115c0f88b0, obj=obj@entry=0x7f115c1f45f0, job=job@entry=QEMU_JOB_QUERY) at qemu/qemu_domain.c:7973 #4 0x00007f1176861804 in qemuDomainGetJobStatsInternal (driver=driver@entry=0x7f115c0f88b0, vm=0x7f115c1f45f0, completed=completed@entry=false, jobInfo=jobInfo@entry=0x7f11b6031670) at qemu/qemu_driver.c:13973 #5 0x00007f1176861fcb in qemuDomainGetJobInfo (dom=0x7f11b0002460, info=0x7f11b6031820) at qemu/qemu_driver.c:14025 #6 0x00007f11cc682509 in virDomainGetJobInfo (domain=domain@entry=0x7f11b0002460, info=info@entry=0x7f11b6031820) at libvirt-domain.c:8766 #7 0x000055b10666ac68 in remoteDispatchDomainGetJobInfo (ret=0x7f11b0002020, args=0x7f11b0001e40, rerr=0x7f11b6031940, msg=0x55b108052e90, client=<optimized out>, server=0x55b108018f80) at remote/remote_daemon_dispatch_stubs.h:6397 #8 remoteDispatchDomainGetJobInfoHelper (server=0x55b108018f80, client=<optimized out>, msg=0x55b108052e90, rerr=0x7f11b6031940, args=0x7f11b0001e40, ret=0x7f11b0002020) at 
remote/remote_daemon_dispatch_stubs.h:6371 #9 0x00007f11cc55b974 in virNetServerProgramDispatchCall (msg=0x55b108052e90, client=0x55b108071ab0, server=0x55b108018f80, prog=0x55b108055ac0) at rpc/virnetserverprogram.c:435 #10 virNetServerProgramDispatch (prog=0x55b108055ac0, server=server@entry=0x55b108018f80, client=client@entry=0x55b108071ab0, msg=msg@entry=0x55b108052e90) at rpc/virnetserverprogram.c:302 #11 0x00007f11cc562b47 in virNetServerProcessMsg (srv=srv@entry=0x55b108018f80, client=0x55b108071ab0, prog=<optimized out>, msg=0x55b108052e90) at rpc/virnetserver.c:137 #12 0x00007f11cc562fa9 in virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x55b108018f80) at rpc/virnetserver.c:158 #13 0x00007f11cc40a65a in virThreadPoolWorker (opaque=opaque@entry=0x55b108018330) at util/virthreadpool.c:163 #14 0x00007f11cc4092a6 in virThreadHelper (data=<optimized out>) at util/virthread.c:206 #15 0x00007f11c9bba2de in start_thread () from /lib64/libpthread.so.0 #16 0x00007f11c9098133 in clone () from /lib64/libc.so.6
This happens because the source QEMU can't respond to QMP commands after --rdma-pin-all with a low memory limit. This is a corner-case test scenario, so maybe this bug should be closed.
I looked at the full backtrace and there's no sign of any deadlock. Of course, the thread handling migration is stuck waiting for a response from QEMU and there's one more thread waiting for a job on the domain (for the virDomainGetJobStats API), but that's it. That is, nothing special for a non-responsive QEMU. Thus the only thing we can investigate is the reason QEMU does not respond on the monitor.

The migration fails on the destination with

Failed to register local dest ram block! : Cannot allocate memory
2019-08-29T08:54:14.755230Z qemu-kvm: rdma migration: error dest registering ram blocks
2019-08-29T08:54:14.755239Z qemu-kvm: error while loading state for instance 0x0 of device 'ram'
2019-08-29T08:54:14.755594Z qemu-kvm: load of migration failed: Operation not permitted
2019-08-29T08:54:14.755632Z qemu-kvm: Early error. Sending error.
2019-08-29T08:54:14.755657Z qemu-kvm: rdma migration: send polling control error

The source libvirtd gets a MIGRATION event with "failed" status and sends a query-migrate command:

2019-08-29 08:54:14.753+0000: 12251: info : qemuMonitorSend:1072 : QEMU_MONITOR_SEND_MSG: mon=0x7f1148001f80 msg={"execute":"query-migrate","id":"libvirt-22"}

and never gets any response. It looks like QEMU cannot recover from a failed migration in this case.
Yes, reproduced - it's quite delicate because often it just fails with an out of memory first. source main qemu thread is stuck in: #0 0x00007f6a68b918dd in __lll_lock_wait () from target:/lib64/libpthread.so.0 #1 0x00007f6a68b8aaf9 in pthread_mutex_lock () from target:/lib64/libpthread.so.0 #2 0x0000555c32e605bd in qemu_mutex_lock_impl (mutex=0x555c336f1f00 <rcu_sync_lock>, file=0x555c33006510 "util/rcu.c", line=144) at util/qemu-thread-posix.c:66 #3 0x0000555c32e72495 in synchronize_rcu () at util/rcu.c:144 #4 0x0000555c32d3029e in qio_channel_rdma_close (ioc=<optimized out>, errp=<optimized out>) at migration/rdma.c:3036 #5 0x0000555c32d2b101 in channel_close (opaque=<optimized out>) at migration/qemu-file-channel.c:106 #6 0x0000555c32d2a43c in qemu_fclose (f=0x555c35901780) at migration/qemu-file.c:330 #7 0x0000555c32d20003 in migrate_fd_cleanup (s=s@entry=0x555c3573d450) at migration/migration.c:1527 #8 0x0000555c32d200dd in migrate_fd_cleanup_bh (opaque=0x555c3573d450) at migration/migration.c:1559 #9 0x0000555c32e5ac46 in aio_bh_call (bh=0x555c3575b130) at util/async.c:117 #10 aio_bh_poll (ctx=ctx@entry=0x555c35755e20) at util/async.c:117 #11 0x0000555c32e5e084 in aio_dispatch (ctx=0x555c35755e20) at util/aio-posix.c:459 #12 0x0000555c32e5ab22 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at util/async.c:260 #13 0x00007f6a6d66f72d in g_main_context_dispatch () from target:/lib64/libglib-2.0.so.0 #14 0x0000555c32e5d138 in glib_pollfds_poll () at util/main-loop.c:218 #15 os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:241 #16 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:517 #17 0x0000555c32c46169 in main_loop () at vl.c:1809 #18 0x0000555c32af5fd3 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4506 (gdb) thread 2 [Switching to thread 2 (Thread 0x7f6a61e39700 (LWP 21414))] #0 0x00007f6a688b3b4d in syscall () from target:/lib64/libc.so.6 (gdb) where #0 0x00007f6a688b3b4d in syscall () from target:/lib64/libc.so.6 #1 0x0000555c32e60cff in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at util/qemu-thread-posix.c:438 #2 qemu_event_wait (ev=ev@entry=0x555c336f1ec0 <rcu_gp_event>) at util/qemu-thread-posix.c:442 #3 0x0000555c32e725b7 in wait_for_readers () at util/rcu.c:134 #4 synchronize_rcu () at util/rcu.c:170 #5 0x0000555c32e72895 in call_rcu_thread (opaque=<optimized out>) at util/rcu.c:267 #6 0x0000555c32e604d4 in qemu_thread_start (args=0x555c3565d190) at util/qemu-thread-posix.c:502 #7 0x00007f6a68b882de in start_thread () from target:/lib64/libpthread.so.0 #8 0x00007f6a688b92e3 in clone () from target:/lib64/libc.so.6 (gdb) thread 11 [Switching to thread 11 (Thread 0x7f6a3effd700 (LWP 21427))] #0 0x00007f6a68b918dd in __lll_lock_wait () from target:/lib64/libpthread.so.0 (gdb) where #0 0x00007f6a68b918dd in __lll_lock_wait () from target:/lib64/libpthread.so.0 #1 0x00007f6a68b8aaf9 in pthread_mutex_lock () from target:/lib64/libpthread.so.0 #2 0x0000555c32e605bd in qemu_mutex_lock_impl (mutex=0x555c336bcf60 <qemu_global_mutex>, file=0x555c32efa088 "/builddir/build/BUILD/qemu-4.1.0/exec.c", line=3301) at util/qemu-thread-posix.c:66 #3 0x0000555c32b4139e in qemu_mutex_lock_iothread_impl (file=<optimized out>, line=<optimized out>) at /usr/src/debug/qemu-kvm-4.1.0-5.el8.bz1746790a.x86_64/cpus.c:1859 #4 0x0000555c32af98f9 in prepare_mmio_access (mr=<optimized out>, mr=<optimized out>) at 
/usr/src/debug/qemu-kvm-4.1.0-5.el8.bz1746790a.x86_64/exec.c:3301 #5 0x0000555c32afa990 in flatview_write_continue (fv=0x7f6a380e9900, addr=980, attrs=..., buf=0x7f6a6de39000 "\016!\002\200", len=2, addr1=<optimized out>, l=<optimized out>, mr=0x555c37c19990) at /usr/src/debug/qemu-kvm-4.1.0-5.el8.bz1746790a.x86_64/exec.c:3332 #6 0x0000555c32afab46 in flatview_write (fv=0x7f6a380e9900, addr=980, attrs=..., buf=0x7f6a6de39000 "\016!\002\200", len=2) at /usr/src/debug/qemu-kvm-4.1.0-5.el8.bz1746790a.x86_64/exec.c:3376 #7 0x0000555c32afed6f in address_space_write (as=<optimized out>, addr=<optimized out>, attrs=..., buf=<optimized out>, len=<optimized out>) at /usr/src/debug/qemu-kvm-4.1.0-5.el8.bz1746790a.x86_64/exec.c:3466 #8 0x0000555c32b5c544 in kvm_handle_io (count=1, size=2, direction=<optimized out>, data=<optimized out>, attrs=..., port=980) at /usr/src/debug/qemu-kvm-4.1.0-5.el8.bz1746790a.x86_64/accel/kvm/kvm-all.c:2042 #9 kvm_cpu_exec (cpu=<optimized out>) at /usr/src/debug/qemu-kvm-4.1.0-5.el8.bz1746790a.x86_64/accel/kvm/kvm-all.c:2288 #10 0x0000555c32b4156e in qemu_kvm_cpu_thread_fn (arg=0x555c358db4d0) at /usr/src/debug/qemu-kvm-4.1.0-5.el8.bz1746790a.x86_64/cpus.c:1285 #11 0x0000555c32e604d4 in qemu_thread_start (args=0x555c358fe2c0) at util/qemu-thread-posix.c:502 #12 0x00007f6a68b882de in start_thread () from target:/lib64/libpthread.so.0 #13 0x00007f6a688b92e3 in clone () from target:/lib64/libc.so.6

Looks like an RCU screwup of some type.

[root@virtlab413 ~]# virsh migrate bz1746787rdma qemu+ssh://ibpair/system --live --verbose --p2p --migrateuri rdma://192.168.99.14 --listen-address 192.168.99.14 --rdma-pin-all
I think the problem here is that address_space_write in the CPU thread takes the RCU read lock but then decides it needs to take the main lock. Meanwhile, in the close-down path of the rdma code we do a synchronize_rcu - but this is with the main lock held. It's not normally a problem, but in the case where the migration fails, the CPUs on the source are still running, so you've got a vCPU thread sitting inside its RCU read-side critical section waiting for the main lock, while the main thread holds that lock and waits in synchronize_rcu for the read-side critical section to end.
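To make the ordering concrete, here is a minimal, self-contained pthread sketch of that lock inversion. It is a toy model only (QEMU's real RCU read side is lock-free, and the "big lock" below stands in for the BQL/iothread lock); the point is just the cycle: the vCPU thread is an RCU reader waiting for the big lock, while the main thread holds the big lock and waits for RCU readers. Built with gcc -pthread, it simply hangs, matching the stuck traces above.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Toy RCU: a reader count protected by a mutex (QEMU's real implementation differs). */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;  /* stands in for the BQL */
static pthread_mutex_t rcu_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  rcu_cv   = PTHREAD_COND_INITIALIZER;
static int readers;

static void toy_rcu_read_lock(void)
{
    pthread_mutex_lock(&rcu_lock);
    readers++;
    pthread_mutex_unlock(&rcu_lock);
}

static void toy_rcu_read_unlock(void)
{
    pthread_mutex_lock(&rcu_lock);
    readers--;
    pthread_cond_broadcast(&rcu_cv);
    pthread_mutex_unlock(&rcu_lock);
}

static void toy_synchronize_rcu(void)
{
    /* Wait until no reader is left inside a read-side critical section. */
    pthread_mutex_lock(&rcu_lock);
    while (readers > 0)
        pthread_cond_wait(&rcu_cv, &rcu_lock);
    pthread_mutex_unlock(&rcu_lock);
}

/* vCPU thread: like address_space_write() -> prepare_mmio_access(). */
static void *cpu_thread(void *arg)
{
    (void)arg;
    toy_rcu_read_lock();              /* enter the read-side critical section...         */
    pthread_mutex_lock(&big_lock);    /* ...then need the big lock for MMIO: blocks here */
    pthread_mutex_unlock(&big_lock);
    toy_rcu_read_unlock();
    return NULL;
}

int main(void)
{
    pthread_t t;

    pthread_mutex_lock(&big_lock);    /* the main loop already holds the big lock...      */
    pthread_create(&t, NULL, cpu_thread, NULL);
    sleep(1);                         /* let the vCPU thread become an RCU reader         */
    printf("main: waiting for RCU readers while holding the big lock\n");
    toy_synchronize_rcu();            /* ...and waits for readers: deadlock               */
    pthread_mutex_unlock(&big_lock);
    pthread_join(t, NULL);
    return 0;
}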
Posted upstream:

Subject: [PATCH 1/2] migration/rdma: Don't moan about disconnects at the end
Subject: [PATCH 2/2] migration/rdma.c: Swap synchronize_rcu for call_rcu
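For context, the second patch replaces the blocking wait with a deferred callback, so the main loop no longer sits in synchronize_rcu() with the main lock held. A rough sketch of the shape of that change (simplified from the patch subject rather than the literal upstream diff; the struct and helper names here are illustrative, while call_rcu() and struct rcu_head are QEMU's RCU API from include/qemu/rcu.h):

/* Package the RDMA state to be freed and let the RCU thread run the
 * destructor once all readers have left their critical sections. */
struct rdma_close_rcu {
    struct rcu_head rcu;        /* call_rcu() requires this member at offset 0 */
    RDMAContext *rdmain;
    RDMAContext *rdmaout;
};

static void qio_channel_rdma_close_rcu(struct rdma_close_rcu *r)
{
    qemu_rdma_cleanup(r->rdmain);
    qemu_rdma_cleanup(r->rdmaout);
    g_free(r);
}

/* In the close path, instead of synchronize_rcu() followed by the cleanup: */
struct rdma_close_rcu *r = g_new(struct rdma_close_rcu, 1);
r->rdmain = rdmain;
r->rdmaout = rdmaout;
call_rcu(r, qio_channel_rdma_close_rcu, rcu);   /* returns immediately; cleanup runs later */

The close path can then drop its references and return without waiting, so a vCPU stuck on the main lock inside an RCU read-side critical section no longer blocks the main loop.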
I tried to reproduce this bz on a host with kernel-4.18.0-141.el8.x86_64, qemu-img-4.1.0-5.module+el8.1.0+4076+b5e41ebc.x86_64 and libvirt-5.6.0-6.module+el8.1.0+4244+9aa4e6bb.x86_64, but I only got the error "Failed to register local dest ram block!: Cannot allocate memory" and couldn't hit the source QEMU crash after more than 30 tries.

For the libvirt commands and the source & destination libvirt & guest logs, please see this link:
http://fileshare.englab.nay.redhat.com/pub/section2/coredump/bz1746787/

Notes: I use the same libvirt commands as fjin's, only changing the guest system disk.

[root@dhcp-8-195 xiaohli]# virsh migrate vm2 qemu+ssh://10.66.65.110/system --live --verbose --p2p --migrateuri rdma://192.168.0.21 --listen-address 0.0.0.0 --rdma-pin-all
error: internal error: qemu unexpectedly closed the monitor: dest_init RDMA Device opened: kernel name mlx4_0 uverbs device name uverbs0, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs0, infiniband class device path /sys/class/infiniband/mlx4_0, transport: (1) Infiniband
Failed to register local dest ram block! : Cannot allocate memory
2019-09-25T12:03:06.958940Z qemu-kvm: rdma migration: error dest registering ram blocks
2019-09-25T12:03:06.958949Z qemu-kvm: error while loading state for instance 0x0 of device 'ram'
2019-09-25T12:03:06.959772Z qemu-kvm: load of migration failed: Operation not permitted
2019-09-25T12:03:06.959855Z qemu-kvm: Early error. Sending error.
2019-09-25T12:03:06.959876Z qemu-kvm: rdma migration: send polling control error
[root@dhcp-8-195 xiaohli]#

Dave, I check whether the source QEMU has crashed using the following points; are these checking points right?
(1) the qemu process is still running on the source host
(2) the guest is still working well on the source host.
For me it would hang about 1 time in 5 or 1 in 10 - the same as for Fangge; but it is a race, so it's not guaranteed to fail.
These patches are now upstream, so they should be in QEMU 4.2 for 8.2 AV:

de8434a35a7871f5f09ff1b22af2dad40a7a0fba migration/rdma: Don't moan about disconnects at the end
d46a4847ca868aaf537df2b87ce07dcbcad6a224 migration/rdma.c: Swap synchronize_rcu for call_rcu

RDMA migration users are pretty rare, so I'll let this just float into 8.2 on the rebase rather than backporting.
QEMU has been recently split into sub-components and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.
I have tried this issue on RHEL-AV 8.4.0 (kernel-4.18.0-304.el8.x86_64 & qemu-kvm-5.2.0-14.module+el8.4.0+10425+ad586fa5.x86_64) via libvirt, with the same steps as Comment 0:

1. Failed to boot the vm when it has 1G of memory and the memtune limits:
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>
  <memtune>
    <hard_limit unit='KiB'>1048576</hard_limit>
    <swap_hard_limit unit='KiB'>1048576</swap_hard_limit>
  </memtune>

2. Could boot the vm without memtune but with 1G memory:
  <memory unit='KiB'>1048576</memory>
  <currentMemory unit='KiB'>1048576</currentMemory>

***************************************************
But it can't migrate without a memory hard limit:

[root@dhcp-8-122 home]# virsh migrate rhel8.4 qemu+ssh://10.66.8.146/system --live --verbose --p2p --migrateuri rdma://192.168.0.21 --listen-address 0.0.0.0 --rdma-pin-all
error: Invalid required operation: cannot start RDMA migration with no memory hard limit set

Hi Fangge, could you help see why I couldn't start the guest in Scenario 1? Thanks.
Discussed with Fangge; she and I both agree to keep this bz as WONTFIX, considering that the scenario in this bz doesn't make much sense.