Bug 1430634
Summary: | Attach lun type disk report error and crash guest
---|---
Product: | Red Hat Enterprise Linux 7
Reporter: | weizhang <weizhan>
Component: | libvirt
Assignee: | Michal Privoznik <mprivozn>
Status: | CLOSED CURRENTRELEASE
QA Contact: | Virtualization Bugs <virt-bugs>
Severity: | high
Docs Contact: |
Priority: | medium
Version: | 7.4-Alt
CC: | abologna, drjones, dzheng, gsun, jsuchane, michen, mprivozn, pkrempa, rbalakri, weizhan
Target Milestone: | rc
Target Release: | ---
Hardware: | aarch64
OS: | Linux
Whiteboard: |
Fixed In Version: | libvirt-3.2.0-1.el7
Doc Type: | If docs needed, set a value
Doc Text: |
Story Points: | ---
Clone Of: |
: | 1431224
Environment: |
Last Closed: | 2017-08-02 07:44:59 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
CRM: |
Verified Versions: |
Category: | ---
oVirt Team: | ---
RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | ---
Target Upstream Version: |
Embargoed: |
Bug Depends On: |
Bug Blocks: | 1431224, 1432057
Attachments: |
Description
weizhang
2017-03-09 07:43:17 UTC
Libvirt debug logs would be fine, as well as qemu logs of the guest. Also, I assume that the same command succeeded on x86. It is not clear to me what you mean by 'guest crash'. Was there a qemu process crash? Or was the guest stopped due to a guest kernel panic? All in all, there might be no issue in libvirt unless the mount namespace is responsible. Adding Michal and Peter to the CC list. Thanks.

Created attachment 1261872 [details]
libvirtd.log
Hi Jaroslav,
Sorry for not describing this clearly: I mean that the qemu process crashed. I cannot see any crash info on the console. The libvirtd.log is attached.
And also if you could attach the domain status XML, that'd be great. You can find it under /var/run/libvirt/qemu/avocado-vt-vm1.xml

Ah, so after careful examination of the logs, I think this is what is happening here:

0) the avocado VM is started with namespaces enabled
1) libvirt starts the hotplug routine
2) qemu dies right in the middle of it:

2017-03-10 09:23:15.853+0000: 22172: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0xffff60005f80 buf={"execute":"device_add","arguments":{"driver":"virtio-blk-pci","scsi":"on","bus":"pci.2","addr":"0x0","drive":"drive-virtio-disk1","id":"virtio-disk1"},"id":"libvirt-12"} len=172 ret=172 errno=0
2017-03-10 09:23:16.011+0000: 22172: error : qemuAgentIO:652 : internal error: End of file from agent monitor

3) libvirt tries to roll back. Because it still thinks that the domain is using namespaces, it calls a function to enter the namespace of the qemu process and do all the work there. The namespace, however, no longer exists: the kernel cleaned it up (it always does when the last process in the namespace dies). Therefore our rollback attempts fail, because we are trying to enter a non-existent namespace.

So there are two bugs here:
1) libvirt shouldn't try to use namespace routines once a domain dies,
2) qemu should not crash on device_add.

Working on fixing the libvirt issue.

Created attachment 1261915 [details]
avocado-vt-vm1.xml
Hi Michal,
I have attached the XML as well to help you confirm the problem :)
Patch proposed upstream:

https://www.redhat.com/archives/libvir-list/2017-March/msg00447.html

Patch pushed upstream:

commit e915942b05d3c97b9b2b412b0cce43045a5628d1
Author:     Michal Privoznik <mprivozn>
AuthorDate: Fri Mar 10 13:34:15 2017 +0100
Commit:     Michal Privoznik <mprivozn>
CommitDate: Fri Mar 10 16:02:34 2017 +0100

    qemuProcessHandleMonitorEOF: Disable namespace for domain

    https://bugzilla.redhat.com/show_bug.cgi?id=1430634

    If a qemu process has died, we get EOF on its monitor. At this
    point, since qemu process was the only one running in the
    namespace kernel has already cleaned the namespace up. Any
    attempt of ours to enter it has to fail.

    This really happened in the bug linked above. We've tried to
    attach a disk to qemu and while we were in the monitor talking to
    qemu it just died. Therefore our code tried to do some roll back
    (e.g. deny the device in cgroups again, restore labels, etc.).
    However, during the roll back (esp. when restoring labels) we
    still thought that domain has a namespace. So we used secdriver's
    transactions. This failed as there is no namespace to enter.

    Signed-off-by: Michal Privoznik <mprivozn>

v3.1.0-104-ge915942b0
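
For readers who want to see the failure mode in isolation, here is a minimal Python sketch of the logic described in the commit message. It is a toy model, not libvirt code: the `Domain` class, `handle_monitor_eof`, `restore_label`, `hotplug_rollback`, and the `/dev/sdb` path are all hypothetical names chosen for illustration. The only thing it tries to show is why clearing the "domain uses a namespace" flag on monitor EOF lets the rollback proceed instead of failing.

```python
class NamespaceGone(Exception):
    """Raised when trying to enter a namespace that no longer exists."""


class Domain:
    """Toy stand-in for a domain object (hypothetical, not libvirt's API)."""

    def __init__(self, name):
        self.name = name
        self.process_alive = True
        self.namespace_enabled = True  # domain was started with namespaces

    def enter_namespace(self):
        # The kernel removes a namespace once its last process has exited,
        # so entering it after qemu died cannot succeed.
        if not self.process_alive:
            raise NamespaceGone(f"{self.name}: namespace no longer exists")


def handle_monitor_eof(domain, disable_namespace):
    """Model of the EOF handler; disable_namespace=True mirrors the fix."""
    domain.process_alive = False
    if disable_namespace:
        domain.namespace_enabled = False


def restore_label(domain, path):
    """Roll back a label, entering the namespace only if one is in use."""
    if domain.namespace_enabled:
        domain.enter_namespace()
        return f"restored {path} in namespace"
    return f"restored {path} on host"


def hotplug_rollback(disable_namespace):
    """Simulate qemu dying mid-hotplug, followed by the rollback."""
    domain = Domain("avocado-vt-vm1")
    handle_monitor_eof(domain, disable_namespace)
    try:
        return restore_label(domain, "/dev/sdb")  # hypothetical disk path
    except NamespaceGone as exc:
        return f"rollback failed: {exc}"


if __name__ == "__main__":
    print("before fix:", hotplug_rollback(disable_namespace=False))
    print("after fix: ", hotplug_rollback(disable_namespace=True))
```

With `disable_namespace=False` the rollback reports a failure because it still tries to enter the dead process's namespace; with `disable_namespace=True` it skips that step and completes.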
> So there are two bugs here:
> 1) libvirt shouldn't try to use namespace routines once a domain dies,
> 2) qemu should not crash on device_add.
So we need a qemu clone too, right?
Thanks.
I'm curious what makes this AArch64/Pegas specific? Is it the machine model? Does this reproduce with the q35 model? I just want to be sure we're targeting the right builds with the fix.

Thanks,
drew

(In reply to Andrew Jones from comment #11)
> I'm curious what makes this AArch64/Pegas specific? Is it the machine model?
> Does this reproduce with the q35 model?

If you look at Bug 1431224, Comment 7 you'll see the QEMU crash couldn't be reproduced on x86. With QEMU exiting cleanly instead of crashing, libvirt had a chance to clean up after itself properly: that's why the issue could only be reproduced on aarch64.