Bug 1589730 - libvirtd crashed sometimes if the guest crash on the source host at the end of migration
Summary: libvirtd crashed sometimes if the guest crash on the source host at the end o...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.6
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: rc
: ---
Assignee: Jiri Denemark
QA Contact: Fangge Jin
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-06-11 09:33 UTC by yafu
Modified: 2018-10-30 09:57 UTC (History)
5 users (show)

Fixed In Version: libvirt-4.5.0-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-30 09:56:58 UTC
Target Upstream Version:


Attachments (Terms of Use)
libvirtd log on source and target host (178.25 KB, application/x-gzip)
2018-06-11 11:29 UTC, yafu
no flags Details
libvirtd log - for comment 9 (587.35 KB, application/x-bzip)
2018-07-26 04:23 UTC, Fangge Jin
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:3113 None None None 2018-10-30 09:57:59 UTC

Description yafu 2018-06-11 09:33:46 UTC
Description of problem:
libvirtd crashed sometimes if the guest crash on the source host at the end of migration

Version-Release number of selected component (if applicable):
libvirt-4.3.0-1.el7.x86_64

How reproducible:
occasional

Steps to Reproduce:
1.Do migration(I use a issue guest and the guest will be crashed sometimes during migration):
#virsh migrate iommu1 qemu+ssh://10.66.5.76/system --live --verbose --p2p --tunnelled
Migration: [100 %]error: Disconnected from qemu:///system due to end of file
error: End of file while reading data: Input/output error

2.Check the qemu log on the source host:
#cat /var/log/libvirt/qemu/iommu1.log
2018-06-11T03:42:28.152233Z qemu-kvm: warning: TSC frequency mismatch between VM (2593995 kHz) and host (3311999 kHz), and TSC scaling unavailable
2018-06-11T03:42:28.152322Z qemu-kvm: warning: TSC frequency mismatch between VM (2593995 kHz) and host (3311999 kHz), and TSC scaling unavailable
2018-06-11T03:42:28.152382Z qemu-kvm: warning: TSC frequency mismatch between VM (2593995 kHz) and host (3311999 kHz), and TSC scaling unavailable
2018-06-11T03:42:28.152454Z qemu-kvm: warning: TSC frequency mismatch between VM (2593995 kHz) and host (3311999 kHz), and TSC scaling unavailable
main_channel_handle_message: agent start
2018-06-11 03:42:47.825+0000: shutting down, reason=crashed

3.The backtrace of libvirtd:
Core was generated by `/usr/sbin/libvirtd'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fa573ca819b in qemuMigrationEatCookie (driver=driver@entry=0x7fa5441115f0, dom=dom@entry=0x7fa5441b9f20, 
    cookiein=cookiein@entry=0x7fa594015af0 "<qemu-migration>\n  <name>iommu1</name>\n  <uuid>1b3268d6-b59c-406b-a14c-33b000b15b6c</uuid>\n  <hostname>fjin-4-101</hostname>\n  <hostuuid>600aa6af-c0f5-11e6-9c43-bc0000340000</hostuuid>\n  <statistics>\n"..., cookieinlen=cookieinlen@entry=1119, flags=flags@entry=32) at qemu/qemu_migration_cookie.c:1534
1534	        mig->jobInfo->operation = priv->job.current->operation;

(gdb) bt
#0  0x00007fa573ca819b in qemuMigrationEatCookie (driver=driver@entry=0x7fa5441115f0, dom=dom@entry=0x7fa5441b9f20, 
    cookiein=cookiein@entry=0x7fa594015af0 "<qemu-migration>\n  <name>iommu1</name>\n  <uuid>1b3268d6-b59c-406b-a14c-33b000b15b6c</uuid>\n  <hostname>fjin-4-101</hostname>\n  <hostuuid>600aa6af-c0f5-11e6-9c43-bc0000340000</hostuuid>\n  <statistics>\n"..., cookieinlen=cookieinlen@entry=1119, flags=flags@entry=32) at qemu/qemu_migration_cookie.c:1534
#1  0x00007fa573c98a71 in qemuMigrationSrcConfirmPhase (driver=driver@entry=0x7fa5441115f0, vm=vm@entry=0x7fa5441b9f20, 
    cookiein=0x7fa594015af0 "<qemu-migration>\n  <name>iommu1</name>\n  <uuid>1b3268d6-b59c-406b-a14c-33b000b15b6c</uuid>\n  <hostname>fjin-4-101</hostname>\n  <hostuuid>600aa6af-c0f5-11e6-9c43-bc0000340000</hostuuid>\n  <statistics>\n"..., cookieinlen=1119, flags=7, retcode=retcode@entry=0) at qemu/qemu_migration.c:2805
#2  0x00007fa573ca4527 in qemuMigrationSrcPerformPeer2Peer3 (flags=7, useParams=<optimized out>, bandwidth=0, migParams=0x7fa5940008c0, nbdPort=0, migrate_disks=<optimized out>, nmigrate_disks=0, 
    listenAddress=<optimized out>, graphicsuri=<optimized out>, uri=<optimized out>, dname=0x0, persist_xml=0x0, xmlin=<optimized out>, vm=0x7fa5441b9f20, 
    dconnuri=0x7fa59401e010 "qemu+ssh://10.66.4.101/system", dconn=0x7fa59402d520, sconn=0x7fa5900347a0, driver=0x7fa5441115f0) at qemu/qemu_migration.c:4207
#3  qemuMigrationSrcPerformPeer2Peer (v3proto=<synthetic pointer>, resource=0, dname=<optimized out>, flags=7, migParams=0x7fa5940008c0, nbdPort=0, migrate_disks=<optimized out>, nmigrate_disks=0, 
    listenAddress=<optimized out>, graphicsuri=<optimized out>, uri=0x0, dconnuri=0x7fa59401e010 "qemu+ssh://10.66.4.101/system", persist_xml=0x0, xmlin=<optimized out>, vm=0x7fa5441b9f20, 
    sconn=0x7fa5900347a0, driver=0x7fa5441115f0) at qemu/qemu_migration.c:4398
#4  qemuMigrationSrcPerformJob (driver=driver@entry=0x7fa5441115f0, conn=conn@entry=0x7fa5900347a0, vm=vm@entry=0x7fa5441b9f20, xmlin=xmlin@entry=0x0, persist_xml=persist_xml@entry=0x0, 
    dconnuri=dconnuri@entry=0x7fa59401e010 "qemu+ssh://10.66.4.101/system", uri=uri@entry=0x0, graphicsuri=graphicsuri@entry=0x0, listenAddress=listenAddress@entry=0x0, nmigrate_disks=nmigrate_disks@entry=0, 
    migrate_disks=migrate_disks@entry=0x0, nbdPort=nbdPort@entry=0, migParams=migParams@entry=0x7fa5940008c0, cookiein=cookiein@entry=0x0, cookieinlen=cookieinlen@entry=0, 
    cookieout=cookieout@entry=0x7fa5a22c4a88, cookieoutlen=cookieoutlen@entry=0x7fa5a22c4a7c, flags=flags@entry=7, dname=dname@entry=0x0, resource=resource@entry=0, v3proto=<optimized out>, v3proto@entry=true)
    at qemu/qemu_migration.c:4478
#5  0x00007fa573ca5394 in qemuMigrationSrcPerform (driver=driver@entry=0x7fa5441115f0, conn=0x7fa5900347a0, vm=0x7fa5441b9f20, xmlin=0x0, persist_xml=0x0, 
    dconnuri=dconnuri@entry=0x7fa59401e010 "qemu+ssh://10.66.4.101/system", uri=0x0, graphicsuri=0x0, listenAddress=0x0, nmigrate_disks=nmigrate_disks@entry=0, migrate_disks=0x0, nbdPort=0, 
    migParams=migParams@entry=0x7fa5940008c0, cookiein=cookiein@entry=0x0, cookieinlen=cookieinlen@entry=0, cookieout=cookieout@entry=0x7fa5a22c4a88, cookieoutlen=cookieoutlen@entry=0x7fa5a22c4a7c, 
    flags=flags@entry=7, dname=0x0, resource=0, v3proto=v3proto@entry=true) at qemu/qemu_migration.c:4661
#6  0x00007fa573cea065 in qemuDomainMigratePerform3Params (dom=0x7fa59401dfd0, dconnuri=0x7fa59401e010 "qemu+ssh://10.66.4.101/system", params=<optimized out>, nparams=<optimized out>, cookiein=0x0, 
    cookieinlen=0, cookieout=0x7fa5a22c4a88, cookieoutlen=0x7fa5a22c4a7c, flags=7) at qemu/qemu_driver.c:12781
#7  0x00007fa5b242e26d in virDomainMigratePerform3Params (domain=domain@entry=0x7fa59401dfd0, dconnuri=0x7fa59401e010 "qemu+ssh://10.66.4.101/system", params=0x7fa594001e20, nparams=0, cookiein=0x0, 
    cookieinlen=0, cookieout=cookieout@entry=0x7fa5a22c4a88, cookieoutlen=cookieoutlen@entry=0x7fa5a22c4a7c, flags=7) at libvirt-domain.c:4976
#8  0x000055a77cbf6276 in remoteDispatchDomainMigratePerform3Params (server=0x55a77de12f20, msg=0x55a77de703b0, ret=0x7fa594001d50, args=0x7fa594001d70, rerr=0x7fa5a22c4bc0, client=<optimized out>)
    at remote/remote_daemon_dispatch.c:5436
#9  remoteDispatchDomainMigratePerform3ParamsHelper (server=0x55a77de12f20, client=<optimized out>, msg=0x55a77de703b0, rerr=0x7fa5a22c4bc0, args=0x7fa594001d70, ret=0x7fa594001d50)
    at remote/remote_daemon_dispatch_stubs.h:7944
#10 0x00007fa5b23193b5 in virNetServerProgramDispatchCall (msg=0x55a77de703b0, client=0x55a77de71ab0, server=0x55a77de12f20, prog=0x55a77de68c70) at rpc/virnetserverprogram.c:437
#11 virNetServerProgramDispatch (prog=0x55a77de68c70, server=server@entry=0x55a77de12f20, client=client@entry=0x55a77de71ab0, msg=msg@entry=0x55a77de703b0) at rpc/virnetserverprogram.c:304
#12 0x00007fa5b232217a in virNetServerProcessMsg (srv=srv@entry=0x55a77de12f20, client=0x55a77de71ab0, prog=<optimized out>, msg=0x55a77de703b0) at rpc/virnetserver.c:145
#13 0x00007fa5b23225c8 in virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x55a77de12f20) at rpc/virnetserver.c:166
#14 0x00007fa5b22069d1 in virThreadPoolWorker (opaque=opaque@entry=0x55a77de127a0) at util/virthreadpool.c:167
#15 0x00007fa5b22057a0 in virThreadHelper (data=<optimized out>) at util/virthread.c:206
#16 0x00007fa5af523dd5 in start_thread (arg=0x7fa5a22c5700) at pthread_create.c:308
#17 0x00007fa5af24db3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113


Actual results:


Expected results:


Additional info:

Comment 2 yafu 2018-06-11 11:29:49 UTC
Created attachment 1449966 [details]
libvirtd log on  source and target host

Comment 3 Jiri Denemark 2018-06-26 12:05:19 UTC
Steps to reproduce this bug without relying on QEMU to crash:

1. start a new domain
2. attach gdb to libvirtd on the destination host and
   set breakpoint to qemuMigrationDstFinish
3. migrate $DOM --p2p ...
4. wait until gdb hits the breakpoint
5. kill QEMU process on the source host
6. run "continue" command in gdb

Comment 4 Jiri Denemark 2018-06-26 12:08:19 UTC
Caused by commit v3.4.0-36-g49d30bc2e2, specifically by

+    if (flags & QEMU_MIGRATION_COOKIE_STATS && mig->jobInfo)
+        mig->jobInfo->operation = priv->job.current->operation;

in qemuMigrationEatCookie because once libvirtd sees EOF from QEMU monitor
while it is executing a remote API, it will clear the migration job.

Comment 5 Jiri Denemark 2018-06-28 14:34:39 UTC
Patches sent upstream for review: https://www.redhat.com/archives/libvir-list/2018-June/msg01815.html

Comment 6 Jiri Denemark 2018-07-02 10:25:07 UTC
Fixed upstream by

commit 5f998219111327dc4f66586ce8953a6d7ecfb619
Refs: v4.5.0-rc2-2-g5f99821911
Author:     Jiri Denemark <jdenemar@redhat.com>
AuthorDate: Thu Jun 28 11:38:52 2018 +0200
Commit:     Jiri Denemark <jdenemar@redhat.com>
CommitDate: Mon Jul 2 11:53:21 2018 +0200

    qemu_migration: Check for active domain after talking to remote daemon

    Once we called qemuDomainObjEnterRemote to talk to the destination
    daemon during a peer to peer migration, the vm lock is released and we
    only hold an async job. If the source domain dies at this point the
    monitor EOF callback is allowed to do its job and (among other things)
    clear all private data irrelevant for stopped domain. Thus when we call
    qemuDomainObjExitRemote, the domain may already be gone and we should
    avoid touching runtime private data (such as current job info).

    In other words after acquiring the lock in qemuDomainObjExitRemote, we
    need to check the domain is still alive. Unless we're doing offline
    migration.

    https://bugzilla.redhat.com/show_bug.cgi?id=1589730

    Signed-off-by: Jiri Denemark <jdenemar@redhat.com>

Comment 8 Fangge Jin 2018-07-26 03:47:48 UTC
Reproduced with libvirt-4.3.0-1.el7.x86_64

Verify with libvirt-4.5.0-4.el7.x86_64, refer to the steps in comment 3, migration succeeds, source libvirtd doesn't crash.


Also tested offline migration with p2p, migration succeeds:
# virsh -k0 migrate rhel7-min qemu+ssh://dell-per730-37.lab.eng.pek2.redhat.com/system --offline --persistent --p2p

Comment 9 Fangge Jin 2018-07-26 04:22:47 UTC
But I found sometimes guest image ownership is changed to root:root after several rounds of migration. I'm not sure whether it's caused by the patch.


Source to target(kill source qemu during migration as comment 3):
[root@fjin-5-190 ~]# virsh -k0 migrate rhel7-min qemu+ssh://dell-per730-37.lab.eng.pek2.redhat.com/system --verbose --live --p2p


Target to source:
# virsh migrate rhel7-min qemu+ssh://10.66.5.190/system --live --verbose --p2p --migrateuri tcp://10.66.5.190
Migration: [100 %]

Source to target(kill source qemu during migration as comment 3):
# virsh -k0 migrate rhel7-min qemu+ssh://dell-per730-37.lab.eng.pek2.redhat.com/system --verbose --live --p2p 
Migration: [100 %]


# ll /nfs/RHEL-7.5-x86_64-latest.qcow2 
-rw-r--r--. 1 root root 1403650048 Jul 25 23:49 /nfs/RHEL-7.5-x86_64-latest.qcow2

Comment 10 Fangge Jin 2018-07-26 04:23:38 UTC
Created attachment 1470583 [details]
libvirtd log - for comment 9

Comment 11 Jiri Denemark 2018-07-26 12:22:19 UTC
Please, file a separate bug for this and explain what steps you made. It's not
very clear from comment 9.

Comment 12 Fangge Jin 2018-07-26 14:04:00 UTC
I filed new Bug 1608931  for issue in comment 9.

Comment 14 errata-xmlrpc 2018-10-30 09:56:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3113


Note You need to log in before you can comment on or make changes to this bug.