Bug 1869015
| Summary: | Qemu core dump on src host when network recover + migration if mistake to migrate before handle network failure | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Li Xiaohui <xiaohli> |
| Component: | qemu-kvm | Assignee: | Peter Xu <peterx> |
| qemu-kvm sub component: | Live Migration | QA Contact: | Li Xiaohui <xiaohli> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | low | ||
| Priority: | low | CC: | ailan, chayang, dgilbert, jinzhao, juzhang, lcapitulino, mrezanin, peterx, quintela, qzhang, virt-maint, yfu |
| Version: | 9.0 | Keywords: | Triaged |
| Target Milestone: | rc | Flags: | pm-rhel: mirror+ |
| Target Release: | 9.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | qemu-kvm-6.1.0-1.el9 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-05-17 12:23:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description (Li Xiaohui, 2020-08-15 11:34:22 UTC)
List of QEMU command lines:

```shell
/usr/libexec/qemu-kvm \
    -name "mouse-vm",debug-threads=on \
    -sandbox off \
    -machine q35 \
    -cpu EPYC \
    -nodefaults \
    -device VGA \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1,server,nowait \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pcie-root-port,port=0x10,chassis=1,id=root0,bus=pcie.0,multifunction=on,addr=0x2 \
    -device pcie-root-port,port=0x11,chassis=2,id=root1,bus=pcie.0,addr=0x2.0x1 \
    -device pcie-root-port,port=0x12,chassis=3,id=root2,bus=pcie.0,addr=0x2.0x2 \
    -device pcie-root-port,port=0x13,chassis=4,id=root3,bus=pcie.0,addr=0x2.0x3 \
    -device pcie-root-port,port=0x14,chassis=5,id=root4,bus=pcie.0,addr=0x2.0x4 \
    -device pcie-root-port,port=0x15,chassis=6,id=root5,bus=pcie.0,addr=0x2.0x5 \
    -device pcie-root-port,port=0x16,chassis=7,id=root6,bus=pcie.0,addr=0x2.0x6 \
    -device pcie-root-port,port=0x17,chassis=8,id=root7,bus=pcie.0,addr=0x2.0x7 \
    -device nec-usb-xhci,id=usb1,bus=root0 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=root1 \
    -device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 \
    -device virtio-net-pci,mac=9a:8a:8b:8c:8d:8e,id=net0,vectors=4,netdev=tap0,bus=root2 \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -blockdev driver=file,cache.direct=on,cache.no-flush=off,filename=/mnt/nfs/rhel830-64-virtio-scsi.qcow2,node-name=drive_sys1 \
    -blockdev driver=qcow2,node-name=drive_image1,file=drive_sys1 \
    -netdev tap,id=tap0,vhost=on \
    -m 4096 \
    -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
    -vnc :10 \
    -rtc base=utc,clock=host \
    -boot menu=off,strict=off,order=cdn,once=c \
    -enable-kvm \
    -qmp tcp:0:3333,server,nowait \
    -serial tcp:0:4444,server,nowait \
    -monitor stdio
```

Dr. David Alan Gilbert (comment #3):

I can't persuade gdb to give me a backtrace off this core; can you try and get a full backtrace from it please?
Also, it says it died during an abort; when it dies, can you give us any messages?

(In reply to Dr. David Alan Gilbert from comment #3)
> I can't persuade gdb to give me a backtrace off this core; can you try and
> get a full backtrace from it please?

```
(gdb) t a a bt full

Thread 10 (LWP 545052):
#0  0x00007f0423e912fc in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 9 (LWP 545148):
#0  0x00007f0423e93bd6 in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.
...
```

> Also, it says it died during an abort - when it dies can you give us any
> messages?

It prints this when qemu dies:

```
(qemu) qemu-kvm: /builddir/build/BUILD/qemu-5.0.0/migration/migration.c:3484: migrate_fd_connect: Assertion `s->cleanup_bh' failed.
```

Peter, does this ring any bell? Thanks, Juan.

Xiaohui, is the error prompted like below when you tried the 2nd time, but before the 3rd time when it crashes?

```
(qemu) migrate_recover tcp:$IP:$PORT
Error: Migrate recovery is triggered already
```

(In reply to Peter Xu from comment #12)

Hi Peter, sorry for the late reply. I have machines available to test this bz now. It is rhel9 now and the qemu version is qemu-kvm-6.0.0-6.el9.x86_64, so I think it will show the same issue if tested on the latest rhelav-8.5.0.

> Xiaohui, is the error prompted like below when you tried the 2nd time but
> before the 3rd time when it crashes?
>
> (qemu) migrate_recover tcp:$IP:$PORT
> Error: Migrate recovery is triggered already

Now I would say:
1. When tested on qemu-kvm-5.1.0 (rhelav-8.3.0), I don't think I received any error prompt between the 2nd and the 3rd try, because if I had got the above error I would not have gone on migrating.
2. But now, tested on qemu-kvm-6.0, I get the same error as you did above, and the src qemu hits a core dump if we go on executing migration a 2nd time.
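For reference, the QMP command pair exercised in each recovery attempt can be sketched in Python. The command shapes follow the QMP schema (`migrate-recover` on the destination via `exec-oob`, `migrate` with `resume` on the source); the helper names and the address are illustrative only, not from the report:

```python
import json

def qmp_migrate_recover(uri):
    # Out-of-band command issued on the *destination* so it listens again
    # for the interrupted postcopy stream. "exec-oob" requires that the
    # QMP greeting negotiated the "oob" capability.
    return {"exec-oob": "migrate-recover", "arguments": {"uri": uri}}

def qmp_migrate_resume(uri):
    # Issued on the *source* to reconnect and resume the paused postcopy
    # migration instead of starting a fresh one.
    return {"execute": "migrate", "arguments": {"uri": uri, "resume": True}}

if __name__ == "__main__":
    uri = "tcp:192.168.11.11:1235"
    print(json.dumps(qmp_migrate_recover(uri)))
    print(json.dumps(qmp_migrate_resume(uri)))
```

The bug was triggered by repeating this pair while the network was still down, and then once more after it recovered.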
(dst qmp):
```
{"exec-oob":"migrate-recover", "arguments":{"uri":"tcp:192.168.11.11:1235"}}
{"error": {"class": "GenericError", "desc": "Migrate recovery is triggered already"}}
```

(src qmp):
```
{"execute":"migrate", "arguments":{"uri":"tcp:192.168.11.11:1235", "resume":true}}
```

Actual result:
```
(qemu) qemu-kvm: Unable to write to socket: Broken pipe
qemu-kvm: ../util/yank.c:107: yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
1.sh: line 38: 32077 Aborted (core dumped)
```

From the above error info, I guess the different errors between rhelav-8.3 and rhelav-8.5.0 are caused by the yank code introduced in qemu-kvm-6.0, and this bz seems blocked by the bz below:

Bug 1974366 - Fail to set migrate incoming for 2nd time after the first time failed

Didn't hit any issues when testing on qemu-img-6.0.0-6.el9.postcopy_recover_v2.x86_64.

Regarding Comment 17: the v2 build works well, thanks Xiaohui. The first two patches are merged; I posted the latter three patches upstream:
https://lore.kernel.org/qemu-devel/20210708190653.252961-1-peterx@redhat.com/
I'll do a backport once they are all reviewed and landed.

(In reply to Peter Xu from comment #23)
> https://gitlab.com/redhat/rhel/src/qemu-kvm/qemu-kvm/-/merge_requests/14

Peter, could you set ITR, devel_ack+, and DTM if we want to fix this on 8.5? Then I could set ITM accordingly. Thanks.

(In reply to Li Xiaohui from comment #24)
> Peter, could you set ITR, devel_ack+, and DTM if we want to fix this on
> 8.5? Then I could set ITM accordingly. Thanks.

Sorry, I meant 9, not 8.5.

Done, hopefully in the right way. :)

(In reply to Peter Xu from comment #26)
> Done, hopefully in the right way. :)

Peter, the most important flag is missing: Internal Target Release -> ITR. For example, setting ITR to 9-beta means we will fix the bz on rhel9-beta. Please help set ITR.
BTW, ITR, devel_ack+, and qa_ack+ are the three necessary elements to trigger release+. Only once we get release+ can the bz go on to the next steps (the build can go downstream, if I'm right) until verification.

(In reply to Li Xiaohui from comment #27)
> BTW, ITR and devel_ack+ and qa_ack+ are the three necessary elements to
> trigger release+.

Correcting my words above: one more flag is needed: Internal Target Milestone -> ITM. ITR, devel_ack+, qa_ack+, and ITM are the four necessary elements to trigger release+.

(In reply to Li Xiaohui from comment #28)
> Correcting my words above: one more flag is needed: Internal Target
> Milestone -> ITM.

I thought dev set up DTM and qe set up ITM (normally 1-2 weeks later than DTM), or am I wrong? I'm setting it anyway; feel free to correct me. Thanks.

Set 'Verified:Tested,SanityOnly' as the gating test passes with qemu-kvm-6.1.0-1.el9.

Verified the bz on the latest rhel9.0.0 (kernel-5.14.0-1.7.1.el9.x86_64 & qemu-kvm-6.1.0-6.el9.x86_64) according to the Description and Comment 17: postcopy migration succeeds and the vm works well after migration. But I hit a small issue here. Peter, could you confirm whether we need to fix it?
Question: Shall we get some error info in src qmp when we continue starting migration before the network issue is fixed?

(dst qmp):
```
{"exec-oob":"migrate-recover", "arguments":{"uri":"tcp:192.168.130.222:1235"}}
{"timestamp": {"seconds": 1635429226, "microseconds": 893450}, "event": "MIGRATION", "data": {"status": "setup"}}
{"error": {"class": "GenericError", "desc": "Failed to bind socket: Cannot assign requested address"}}
```

(src qmp):
```
{"execute":"migrate", "arguments":{"uri":"tcp:192.168.130.222:1235", "resume":true}}
{"return": {}}
```

I can only see an error in src hmp; I expect a similar error in src qmp rather than '{"return": {}}':
```
(qemu) 2021-10-28T10:05:02.980359Z qemu-kvm: Failed to connect to '192.168.130.222:1235': No route to host
```

(In reply to Li Xiaohui from comment #38)
> Question: Shall we get some error info in src qmp when we continue starting
> migration before the network issue is fixed?

Right, I think that will happen too if we try to migrate to an address that does not exist, and it should have nothing to do with postcopy recovery, or even postcopy. But I agree with you: ideally the qmp "migrate" command should still wait for the socket initialization and grab the error if there is one.
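Even when the synchronous `migrate` reply is an empty `{"return": {}}`, the source can still surface the failure by polling `query-migrate`. A hedged sketch of inspecting its 'return' payload follows; the helper name is illustrative, while `status` and `error-desc` are real fields of the QMP MigrationInfo structure (`error-desc` is only populated when QEMU recorded an error):

```python
def migration_error(info):
    """Summarize a failure from a query-migrate 'return' payload.

    Returns None while the migration is healthy; otherwise a short
    "status: detail" string.
    """
    status = info.get("status")
    if status in ("failed", "postcopy-paused"):
        # 'error-desc' may be absent, e.g. when postcopy merely paused.
        return "%s: %s" % (status, info.get("error-desc", "no error recorded"))
    return None

if __name__ == "__main__":
    # Shapes mirror the transcript above: the source pauses without a
    # recorded error, while a hard connect failure carries a description.
    print(migration_error({"status": "postcopy-paused"}))
    print(migration_error({
        "status": "failed",
        "error-desc": "Failed to connect to '192.168.130.222:1235': No route to host",
    }))
```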
I think we can consider opening a bug for that, but even so it would be very low priority: firstly, qmp query-migrate will also show the error; secondly, we can also enable migration events, and then an event should be generated in qmp when the connection fails, at least showing that the migration failed.

To enable the event, we can either use "-global migration.x-events=on" when booting qemu, or enable it explicitly, e.g. via "(HMP) migrate_set_capability events on". Feel free to try.

(In reply to Peter Xu from comment #39)
> I think we can consider opening a bug for that, but even so it would be
> very low priority.

No new event is generated after enabling the events capability; I only get the migration status postcopy-paused via "query-migrate". Thanks Peter, I have filed a bug to track this issue.
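For reference, the QMP form of the events capability mentioned above can be sketched as follows. `migrate-set-capabilities` and the MIGRATION event are real QMP interfaces; the helper names and the sample event stream are illustrative:

```python
import json

def qmp_enable_migration_events():
    # QMP equivalent of HMP "migrate_set_capability events on": once set,
    # QEMU emits a MIGRATION event on each migration state change.
    return {"execute": "migrate-set-capabilities",
            "arguments": {"capabilities": [{"capability": "events",
                                            "state": True}]}}

def migration_states(raw_lines):
    # Reduce a stream of newline-delimited QMP messages to the sequence
    # of migration state transitions, ignoring replies and other events.
    states = []
    for line in raw_lines:
        msg = json.loads(line)
        if msg.get("event") == "MIGRATION":
            states.append(msg["data"]["status"])
    return states
```

With this in place, a connection failure during resume would be expected to show up as a state transition rather than requiring a query-migrate poll.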
You can go there for more information:

Bug 2018404 - Source host resuming postcopy gets no error prompt under postcopy-paused and migration network down

> To enable the event, we can either use "-global migration.x-events=on" when
> booting qemu, or enable it explicitly, e.g. via "(HMP) migrate_set_capability
> events on". Feel free to try.

I would mark this bz verified per Comment 38, Comment 39, and Comment 40 above.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (new packages: qemu-kvm), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2307