Bug 1869015 - QEMU core dump on src host after network recovery + migration when migrate_recover is mistakenly run before the network failure is handled
Summary: QEMU core dump on src host after network recovery + migration when migrate_recover is mistakenly run before the network failure is handled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: qemu-kvm
Version: 9.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: rc
Target Release: 9.0
Assignee: Peter Xu
QA Contact: Li Xiaohui
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-15 11:34 UTC by Li Xiaohui
Modified: 2022-05-17 12:24 UTC
CC List: 12 users

Fixed In Version: qemu-kvm-6.1.0-1.el9
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-17 12:23:22 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
Gitlab redhat/centos-stream/src qemu-kvm merge_requests 23 - last updated 2021-07-15 15:30:50 UTC
Red Hat Product Errata RHBA-2022:2307 - last updated 2022-05-17 12:24:06 UTC

Description Li Xiaohui 2020-08-15 11:34:22 UTC
Description of problem:
While the network is down, mistakenly executing migrate_recover on the dst QEMU gives the prompt:
Error: Failed to bind socket: Cannot assign requested address

After the network failure is handled, execute these commands:
(1) run "migrate_recover" again on the dst QEMU;
(2) run "migrate -r tcp:192.168.0.46:6666" on the src QEMU.

After trying (1) & (2) once, migration does not start; after trying (1) & (2) a second time, QEMU on the src host core dumps (http://fileshare.englab.nay.redhat.com/pub/logs/xiaohli/core.qemu-kvm.432651.hp-dl385g10-09.lab.eng.pek2.redhat.com.1597303610)


Version-Release number of selected component (if applicable):
host info: kernel-4.18.0-232.el8.x86_64 & qemu-kvm-5.1.0-2.module+el8.3.0+7652+b30e6901.x86_64
guest info: kernel-4.18.0-232.el8.x86_64


How reproducible:
100%


Steps to Reproduce:
1. Set TCP timeouts on the src & dst hosts:
# cd /proc/sys/net/ipv4/
# echo 3 > tcp_keepalive_probes  
# echo 3 > tcp_keepalive_intvl 
# echo 1 > tcp_retries1 
# echo 1 > tcp_retries2 
# echo 2 > tcp_fin_timeout 
2. Boot a guest on the src host.
3. Boot a guest with "-incoming defer" on the dst host.
4. Enable postcopy mode on both src & dst hosts, and set the postcopy speed on the src host.
5. Migrate the guest from src to dst; during migration, switch migration into the postcopy phase.
6. During the postcopy phase, take the migration network card down on the dst host:
# nmcli con down ens6f0 
7. Mistakenly try to recover migration before the network is back up:
(qemu) migrate_recover tcp:192.168.0.46:6666
8. Then fix the network failure:
# nmcli con up ens6f0 
9. Try (1) & (2) below once; migration does not start. Try (1) & (2) again; QEMU on the src host core dumps (a consolidated sketch of steps 6-9 follows this list):
(1) run "migrate_recover tcp:192.168.0.46:6666" again on the dst QEMU;
(2) run "migrate -r tcp:192.168.0.46:6666" on the src QEMU.


Actual results:
As described in the Description and the test steps above.


Expected results:
Migration can go on after step 8


Additional info:
Also found this issue on qemu-img-5.0.0-0.module+el8.3.0+6620+5d5e1420.x86_64

Comment 1 Li Xiaohui 2020-08-15 13:57:45 UTC
List QEMU command lines:
/usr/libexec/qemu-kvm  \
-name "mouse-vm",debug-threads=on \
-sandbox off \
-machine q35 \
-cpu EPYC \
-nodefaults  \
-device VGA \
-chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1,server,nowait \
-chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor,server,nowait \
-mon chardev=qmp_id_qmpmonitor1,mode=control \
-mon chardev=qmp_id_catch_monitor,mode=control \
-device pcie-root-port,port=0x10,chassis=1,id=root0,bus=pcie.0,multifunction=on,addr=0x2 \
-device pcie-root-port,port=0x11,chassis=2,id=root1,bus=pcie.0,addr=0x2.0x1 \
-device pcie-root-port,port=0x12,chassis=3,id=root2,bus=pcie.0,addr=0x2.0x2 \
-device pcie-root-port,port=0x13,chassis=4,id=root3,bus=pcie.0,addr=0x2.0x3 \
-device pcie-root-port,port=0x14,chassis=5,id=root4,bus=pcie.0,addr=0x2.0x4 \
-device pcie-root-port,port=0x15,chassis=6,id=root5,bus=pcie.0,addr=0x2.0x5 \
-device pcie-root-port,port=0x16,chassis=7,id=root6,bus=pcie.0,addr=0x2.0x6 \
-device pcie-root-port,port=0x17,chassis=8,id=root7,bus=pcie.0,addr=0x2.0x7 \
-device nec-usb-xhci,id=usb1,bus=root0 \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=root1 \
-device scsi-hd,id=image1,drive=drive_image1,bus=virtio_scsi_pci0.0,channel=0,scsi-id=0,lun=0,bootindex=0 \
-device virtio-net-pci,mac=9a:8a:8b:8c:8d:8e,id=net0,vectors=4,netdev=tap0,bus=root2 \
-device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
-blockdev driver=file,cache.direct=on,cache.no-flush=off,filename=/mnt/nfs/rhel830-64-virtio-scsi.qcow2,node-name=drive_sys1 \
-blockdev driver=qcow2,node-name=drive_image1,file=drive_sys1 \
-netdev tap,id=tap0,vhost=on \
-m 4096 \
-smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
-vnc :10 \
-rtc base=utc,clock=host \
-boot menu=off,strict=off,order=cdn,once=c \
-enable-kvm  \
-qmp tcp:0:3333,server,nowait \
-serial tcp:0:4444,server,nowait \
-monitor stdio \
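
For completeness, a hedged sketch of the dst-side counterpart and the HMP commands used to drive the postcopy migration (not part of the original comment; the address and the bandwidth value are illustrative, reusing 192.168.0.46:6666 from the steps above):

# dst host: same qemu-kvm command line as above, plus deferred incoming
/usr/libexec/qemu-kvm ... -incoming defer

# dst HMP: enable postcopy and start listening for the incoming migration
(qemu) migrate_set_capability postcopy-ram on
(qemu) migrate_incoming tcp:192.168.0.46:6666

# src HMP: enable postcopy, optionally cap postcopy bandwidth, start migration,
# then switch into the postcopy phase
(qemu) migrate_set_capability postcopy-ram on
(qemu) migrate_set_parameter max-postcopy-bandwidth 104857600
(qemu) migrate -d tcp:192.168.0.46:6666
(qemu) migrate_start_postcopy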

Comment 3 Dr. David Alan Gilbert 2020-08-18 15:55:45 UTC
I can't persuade gdb to give me a backtrace off this core; can you try and get a full backtrace from it please?
Also, it says it died during an abort - when it dies can you give us any messages?

Comment 4 Li Xiaohui 2020-08-19 10:24:41 UTC
(In reply to Dr. David Alan Gilbert from comment #3)
> I can't persuade gdb to give me a backtrace off this core; can you try and
> get a full backtrace from it please?
(gdb) t a a bt full

Thread 10 (LWP 545052):
#0  0x00007f0423e912fc in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 9 (LWP 545148):
#0  0x00007f0423e93bd6 in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.
...

> Also, it says it died during an abort - when it dies can you give us any
> messages?
QEMU printed this when it died:
(qemu) qemu-kvm: /builddir/build/BUILD/qemu-5.0.0/migration/migration.c:3484: migrate_fd_connect: Assertion `s->cleanup_bh' failed.
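
As an aside, a minimal sketch of how a symbolized backtrace could be obtained from such a core, assuming the qemu-kvm debuginfo packages matching the installed build are available (the core file path below is illustrative):

# install debug symbols matching the running qemu-kvm build
# dnf debuginfo-install qemu-kvm

# open the core against the binary that produced it and dump all threads
# gdb /usr/libexec/qemu-kvm /path/to/core.qemu-kvm.<pid>
(gdb) set pagination off
(gdb) thread apply all bt full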

Comment 11 Juan Quintela 2021-06-29 10:57:54 UTC
Peter, does this ring any bell?

Thanks, Juan.

Comment 12 Peter Xu 2021-06-29 15:52:19 UTC
Xiaohui, was an error like the one below prompted when you tried the 2nd time, before the 3rd time when it crashed?

(qemu) migrate_recover tcp:$IP:$PORT
Error: Migrate recovery is triggered already

Comment 13 Li Xiaohui 2021-07-02 09:01:15 UTC
(In reply to Peter Xu from comment #12)

Hi Peter,
Sorry for the late reply. I have machines available to test this bz now.
It's RHEL 9 now, and the qemu version is qemu-kvm-6.0.0-6.el9.x86_64; I think the same issue would show up when testing on the latest rhelav-8.5.0.

> Xiaohui, is the error prompted like below when you tried the 2nd time but
> before the 3rd time when it crashes?
> 
> (qemu) migrate_recover tcp:$IP:$PORT
> Error: Migrate recovery is triggered already

Now I would say:
1. When testing on qemu-kvm-5.1.0 (rhelav-8.3.0), I don't think I received any error prompt on the 2nd try, before the 3rd; if I had got the error above, I wouldn't have gone on migrating.
2. But testing now on qemu-kvm-6.0, I get the same error as above, and the src QEMU hits a core dump if we go on and execute migration a 2nd time.
(dst_qmp){"exec-oob":"migrate-recover", "arguments":{"uri":"tcp:192.168.11.11:1235"}}
(dst_qmp){"error": {"class": "GenericError", "desc": "Migrate recovery is triggered already"}}
(src_qmp){"execute":"migrate", "arguments":{"uri":"tcp:192.168.11.11:1235", "resume":true}} 

Actual result:
(qemu) qemu-kvm: Unable to write to socket: Broken pipe
qemu-kvm: ../util/yank.c:107: yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)' failed.
1.sh: line 38: 32077 Aborted                 (core dumped)


From the error info above, I guess the different errors between rhelav-8.3 and rhelav-8.5.0 are caused by the yank code introduced in qemu-kvm-6.0, and this bz seems to be blocked by the bz below:
Bug 1974366 - Fail to set migrate incoming for 2nd time after the first time failed

Comment 19 Li Xiaohui 2021-07-08 12:32:32 UTC
Didn't hit any issues when testing qemu-img-6.0.0-6.el9.postcopy_recover_v2.x86_64 per Comment 17; the v2 build works well.

Comment 20 Peter Xu 2021-07-08 19:20:37 UTC
Thanks Xiaohui.

The first two patches are merged; I posted the latter three patches upstream:

https://lore.kernel.org/qemu-devel/20210708190653.252961-1-peterx@redhat.com/

I'll do a backport once they are all reviewed and landed.

Comment 24 Li Xiaohui 2021-07-15 02:11:11 UTC
(In reply to Peter Xu from comment #23)
> https://gitlab.com/redhat/rhel/src/qemu-kvm/qemu-kvm/-/merge_requests/14

Peter, could you set ITR and devel_ack+ and DTM if we want to fix this on 8.5? Then I could set ITM accordingly. Thanks.

Comment 25 Li Xiaohui 2021-07-15 02:12:32 UTC
(In reply to Li Xiaohui from comment #24)
> (In reply to Peter Xu from comment #23)
> > https://gitlab.com/redhat/rhel/src/qemu-kvm/qemu-kvm/-/merge_requests/14
> 
> Peter, could you set ITR and devel_ack+ and DTM if we want to fix this on
> 8.5? Then I could set ITM accordingly. Thanks.

Sorry, I meant 9, not 8.5.

Comment 26 Peter Xu 2021-07-15 13:29:47 UTC
Done, hopefully in the right way. :)

Comment 27 Li Xiaohui 2021-07-15 13:53:56 UTC
(In reply to Peter Xu from comment #26)
> Done, hopefully in the right way. :)

Peter, the most important flag is missing: Internal Target Release -> ITR. For example, setting ITR to 9-beta means we will fix the bz in rhel9-beta. Please help by setting the ITR.

BTW, ITR, devel_ack+, and qa_ack+ are the three necessary elements to trigger release+. Only once we get release+ can the bz go to the next steps (the build can go downstream, if I'm right) until verification.

Comment 28 Li Xiaohui 2021-07-15 13:56:50 UTC
(In reply to Li Xiaohui from comment #27)
> (In reply to Peter Xu from comment #26)
> > Done, hopefully in the right way. :)
> 
> Peter, lack of the most important flag: Internal Target Release -> ITR, for
> example if we set ITR to 9-beta, means we will fix bz on rhel9-beta, please
> help setting for ITR.
> 
> BTW, ITR and devel_ack+ and qa_ack+ are the three necessary elements to
> trigger release+. 

Correcting my words above: one more flag is missing: Internal Target Milestone -> ITM.
ITR, devel_ack+, qa_ack+, and ITM are the four necessary elements to trigger release+.

> Only we get release+, then the bz could go to next
> steps(the build can go to downstream if I'm right) until verify.

Comment 29 Peter Xu 2021-07-15 14:26:11 UTC
(In reply to Li Xiaohui from comment #28)
> Correct above words, lack of one flag: Internal Target Milestone -> ITM.
> ITR and devel_ack+ and qa_ack+ and ITM are the four necessary elements to
> trigger release+

I thought dev sets up the DTM and QE sets up the ITM (normally 1-2 weeks later than the DTM), or am I wrong?

I'm setting it anyway, feel free to correct me.  Thanks.

Comment 37 Yanan Fu 2021-10-12 06:46:14 UTC
Setting 'Verified:Tested,SanityOnly' since the gating test passed with qemu-kvm-6.1.0-1.el9.

Comment 38 Li Xiaohui 2021-10-29 03:50:48 UTC
Verified the bz on the latest rhel9.0.0 (kernel-5.14.0-1.7.1.el9.x86_64 & qemu-kvm-6.1.0-6.el9.x86_64) according to the Description and Comment 17: postcopy migration succeeds and the VM works well after migration.



I hit a small issue here. Peter, could you confirm whether we need to fix it?

Question: should we get some error info in the src QMP when we continue starting migration before fixing the network issue? For example:
(dst qmp):
{"exec-oob":"migrate-recover", "arguments":{"uri":"tcp:192.168.130.222:1235"}}
{"timestamp": {"seconds": 1635429226, "microseconds": 893450}, "event": "MIGRATION", "data": {"status": "setup"}}
{"error": {"class": "GenericError", "desc": "Failed to bind socket: Cannot assign requested address"}}
(src qmp):
{"execute":"migrate", "arguments":{"uri":"tcp:192.168.130.222:1235", "resume":true}}
{"return": {}}

I can only see an error in the src HMP; I would expect a similar error in the src QMP rather than '{"return": {}}':
(qemu) 2021-10-28T10:05:02.980359Z qemu-kvm: Failed to connect to '192.168.130.222:1235': No route to host

Comment 39 Peter Xu 2021-10-29 04:17:48 UTC
(In reply to Li Xiaohui from comment #38)
> Question: Shall we get some error info in src qmp when continue starting
> migration before fixing network issue:
> (dst qmp):
> {"exec-oob":"migrate-recover",
> "arguments":{"uri":"tcp:192.168.130.222:1235"}}
> {"timestamp": {"seconds": 1635429226, "microseconds": 893450}, "event":
> "MIGRATION", "data": {"status": "setup"}}
> {"error": {"class": "GenericError", "desc": "Failed to bind socket: Cannot
> assign requested address"}}
> (src qmp):
> {"execute":"migrate", "arguments":{"uri":"tcp:192.168.130.222:1235",
> "resume":true}}
> {"return": {}}
> 
> I could only see some error in src hmp, expect get similar error in src qmp
> rather than '{"return": {}}':
> (qemu) 2021-10-28T10:05:02.980359Z qemu-kvm: Failed to connect to
> '192.168.130.222:1235': No route to host

Right, I think that'll happen too if we try to migrate to an address that does not exist. And it should have nothing to do with postcopy recovery, or even with postcopy.

But I agree with you, ideally the qmp "migrate" command should still wait for the socket initialization and grab the error if there is.

I think we can consider opening a bug for that, but even so it'll be very low priority, because firstly QMP query-migrate will also show the error; meanwhile, I think we can also enable migration events, and then there should be an event generated in QMP when the connection fails, at least showing that migration has failed.

To enable the event, we can use either "-global migration.x-events=on" when booting qemu, or enable it explicitly e.g. via "(HMP) migrate_set_capability events on".  Feel free to try.
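
A minimal sketch of the two ways to enable the events capability mentioned above (the QMP form and the sample event are illustrative; either way, QMP should then emit MIGRATION state-change events):

# option 1: enable migration events when booting qemu
/usr/libexec/qemu-kvm ... -global migration.x-events=on

# option 2: enable the capability at runtime
(qemu) migrate_set_capability events on    <- HMP
{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [{"capability": "events", "state": true}]}}    <- QMP

# once enabled, a failed connection should produce an event such as:
{"timestamp": {...}, "event": "MIGRATION", "data": {"status": "failed"}}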

Comment 40 Li Xiaohui 2021-10-29 07:44:58 UTC
(In reply to Peter Xu from comment #39)
> (In reply to Li Xiaohui from comment #38)
> > Question: Shall we get some error info in src qmp when continue starting
> > migration before fixing network issue:

> > 
> > I could only see some error in src hmp, expect get similar error in src qmp
> > rather than '{"return": {}}':
> > (qemu) 2021-10-28T10:05:02.980359Z qemu-kvm: Failed to connect to
> > '192.168.130.222:1235': No route to host
> 
> Right, I think that'll happen too if we try to migrate to an address that
> does not exist.  And it should have nothing to do with postcopy recovery
> even postcopy.
> 
> But I agree with you, ideally the qmp "migrate" command should still wait
> for the socket initialization and grab the error if there is.
> 
> I think we can consider open a bug for that, but even so it'll be with very
> low priority, because firstly qmp query-migrate will also show the error,
> meanwhile I think we can also enable migration events then there should be
> an event generated when the connection failed in qmp at least showing
> migration is failed.

No new event is generated after enabling the events capability; I only get migration status "postcopy-paused" via "query-migrate".

Thanks Peter, I have filed a bug to track this issue. You could go there to get more information:
Bug 2018404 - Source host resuming postcopy gets no error prompt under postcopy-paused and migration network down

> 
> To enable the event, we can use either "-global migration.x-events=on" when
> booting qemu, or enable it explicitly e.g. via "(HMP) migrate_set_capability
> events on".  Feel free to try.

Comment 41 Li Xiaohui 2021-10-29 07:46:35 UTC
I will mark this bz verified per Comment 38, Comment 39, and Comment 40 above.

Comment 44 errata-xmlrpc 2022-05-17 12:23:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (new packages: qemu-kvm), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:2307

