Bug 1431224 - Attach lun type disk report error and crash guest
Summary: Attach lun type disk report error and crash guest
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.4
Hardware: All
OS: Linux
medium
high
Target Milestone: rc
: ---
Assignee: Fam Zheng
QA Contact: aihua liang
URL:
Whiteboard:
Depends On: 1430634
Blocks: 1432057
Reported: 2017-03-10 16:22 UTC by Jaroslav Suchanek
Modified: 2017-08-02 03:39 UTC
21 users

Fixed In Version: qemu-kvm-rhev-2.9.0-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1430634
: 1432057
Environment:
Last Closed: 2017-08-02 03:39:56 UTC




Links
System: Red Hat Product Errata
ID: RHSA-2017:2392
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: qemu-kvm-rhev security, bug fix, and enhancement update
Last Updated: 2017-08-01 20:04:36 UTC

Description Jaroslav Suchanek 2017-03-10 16:22:07 UTC
+++ This bug was initially created as a clone of Bug #1430634 +++

Description of problem:
Attach lun type disk report error and crash guest
Not sure whether this is a libvirt or a qemu-kvm-rhev issue, so reporting it against libvirt first.

Version-Release number of selected component (if applicable):
kernel-4.9.0-10.el7.aarch64
qemu-kvm-rhev-2.8.0-6.el7.aarch64
libvirt-3.1.0-2.el7.aarch64


How reproducible:
100%

Steps to Reproduce:
1. virsh attach-disk --domain avocado-vt-vm1 --source /dev/sdb --target vdb --driver qemu --type lun
error: Failed to attach disk
error: internal error: child reported: Kernel does not provide mount namespace: No such file or directory
2. The guest is then found to be destroyed.

Actual results:
Attaching fails and the guest crashes.

Expected results:
The attach passes without a crash.

Additional info:

--- Additional comment from Jaroslav Suchanek on 2017-03-09 10:50:28 CET ---

Libvirt debug logs would be helpful, as would the guest's qemu logs.

Also, I assume that the same command succeeded on x86. It is not clear to me what you mean by 'guest crash'. Did the qemu process crash, or was the guest stopped due to a guest kernel panic? All in all, there might be no issue in libvirt unless the mount namespace is involved. Adding Michal and Peter to the CC list.

Thanks.

--- Additional comment from weizhang on 2017-03-10 10:28 CET ---

Hi Jaroslav,

Sorry for not describing it clearly: I mean the qemu process crashed. I cannot see any crash information on the console. I will attach libvirtd.log.

--- Additional comment from Michal Privoznik on 2017-03-10 11:22:44 CET ---

Ah, so after careful examination of the logs, I think this is what is happening here:

0) the avocado VM is started with namespaces enabled
1) libvirt starts the hotplug routine
2) qemu dies right in the middle of it:

2017-03-10 09:23:15.853+0000: 22172: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0xffff60005f80 buf={"execute":"device_add","arguments":{"driver":"virtio-blk-pci","scsi":"on","bus":"pci.2","addr":"0x0","drive":"drive-virtio-disk1","id":"virtio-disk1"},"id":"libvirt-12"}
 len=172 ret=172 errno=0
2017-03-10 09:23:16.011+0000: 22172: error : qemuAgentIO:652 : internal error: End of file from agent monitor

3) libvirt tries to roll back. Because it still thinks that the domain is using namespaces, it calls the function that enters the namespace of the qemu process to do all the work there. That namespace, however, no longer exists: the kernel cleaned it up, as it always does when the last process in the namespace dies. Therefore our rollback attempts fail, because we are trying to enter a non-existent namespace.
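The rollback failure in step 3 can be illustrated with a small C sketch. It assumes that libvirt's helper opens /proc/<pid>/ns/mnt before calling setns(). That assumption is inferred only from the error text "Kernel does not provide mount namespace: No such file or directory". Once the qemu process has exited and been reaped, that path is gone and open() fails with ENOENT. The functions mount_ns_openable(), spawn_and_reap() and rollback_device() are hypothetical and are not libvirt's code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Try to open the mount-namespace handle of `pid`, the step that has to
 * succeed before a helper can setns() into it. Returns 0 or -errno. */
static int mount_ns_openable(pid_t pid)
{
    char path[64];
    snprintf(path, sizeof(path), "/proc/%lld/ns/mnt", (long long)pid);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -errno;   /* ENOENT once the process and its /proc entry are gone */
    close(fd);
    return 0;
}

/* Hypothetical stand-in for a qemu process that dies mid-hotplug:
 * fork a child that exits at once, then reap it so its PID disappears. */
static pid_t spawn_and_reap(void)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(0);                 /* child: "qemu" dies immediately */
    if (pid > 0)
        waitpid(pid, NULL, 0);    /* parent: reap it, freeing /proc/<pid> */
    return pid;
}

/* Hypothetical guarded rollback: if the process is gone, the kernel has
 * already destroyed its namespace, so there is nothing to undo inside it. */
static int rollback_device(pid_t qemu_pid)
{
    int rc = mount_ns_openable(qemu_pid);
    if (rc == -ENOENT)
        return 0;        /* skip namespace work instead of failing */
    if (rc < 0)
        return rc;       /* some other error: report it */
    /* ... setns() into the namespace and clean up the device node here ... */
    return 0;
}
```

The guard in rollback_device() mirrors the first of the two bugs listed below: skip the namespace routines once the domain is gone. How libvirt actually implemented that is not shown in this report.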

So there are two bugs here:
1) libvirt shouldn't try to use namespace routines once a domain dies,
2) qemu should not crash on device_add.

Working on fixing libvirt issue.

Comment 1 Hai Huang 2017-03-13 14:20:37 UTC
Wei,
In your original BZ description,
   Bug 1430634 - Attach lun type disk report error and crash guest

the following was mentioned in the description:


0) the avocado VM is started with namespaces enabled
1) libvirt starts the hotplug routine
2) qemu dies right in the middle of it:
   ^^^^^^^^^

2017-03-10 09:23:15.853+0000: 22172: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0xffff60005f80 buf={"execute":"device_add","arguments":{"driver":"virtio-blk-pci","scsi":"on","bus":"pci.2","addr":"0x0","drive":"drive-virtio-disk1","id":"virtio-disk1"},"id":"libvirt-12"}
 len=172 ret=172 errno=0
2017-03-10 09:23:16.011+0000: 22172: error : qemuAgentIO:652 : internal error: End of file from agent monitor


So there are two bugs here:
1) libvirt shouldn't try to use namespace routines once a domain dies,
2) qemu should not crash on device_add.
   ^^^^^^^^^^^^^^^^^^^^^

Apparently, qemu crashed on the device_add operation.
Would it be possible for you to provide the
qemu stack trace from when it crashed?

Thanks.

Comment 2 weizhang 2017-03-14 01:51:19 UTC
Hi Hai,

Please check whether this is what you need.

# gdb -p `pidof qemu-kvm`
(gdb) c
Continuing.
[New Thread 0xfffe73a8ec30 (LWP 22058)]

Program received signal SIGABRT, Aborted.
0x0000ffff8cc541b8 in raise () from /lib64/libc.so.6

(gdb) bt
#0  0x0000ffff8cc541b8 in raise () from /lib64/libc.so.6
#1  0x0000ffff8cc55848 in abort () from /lib64/libc.so.6
#2  0x0000ffff8cc4d8d4 in __assert_fail_base () from /lib64/libc.so.6
#3  0x0000ffff8cc4d98c in __assert_fail () from /lib64/libc.so.6
#4  0x0000aaaada0766c8 in error_setv ()
#5  0x0000aaaada076808 in error_setg_internal ()
#6  0x0000aaaad9f7d998 in virtio_pci_device_plugged ()
#7  0x0000aaaad9f7ef84 in virtio_bus_device_plugged ()
#8  0x0000aaaad9e783b8 in virtio_device_realize ()
#9  0x0000aaaad9f451ac in device_set_realized ()
#10 0x0000aaaad9fca460 in property_set_bool ()
#11 0x0000aaaad9fcbf28 in object_property_set ()
#12 0x0000aaaad9fce3d8 in object_property_set_qobject ()
#13 0x0000aaaad9fcc0e4 in object_property_set_bool ()
#14 0x0000aaaad9f7e770 in virtio_pci_realize ()
#15 0x0000aaaad9f6211c in pci_qdev_realize ()
#16 0x0000aaaad9f7c058 in virtio_pci_dc_realize ()
#17 0x0000aaaad9f451ac in device_set_realized ()
#18 0x0000aaaad9fca460 in property_set_bool ()
#19 0x0000aaaad9fcbf28 in object_property_set ()
#20 0x0000aaaad9fce3d8 in object_property_set_qobject ()
#21 0x0000aaaad9fcc0e4 in object_property_set_bool ()
#22 0x0000aaaad9eede0c in qdev_device_add ()
#23 0x0000aaaad9eee408 in qmp_device_add ()
#24 0x0000aaaada0699e0 in qmp_dispatch ()
#25 0x0000aaaad9e3244c in handle_qmp_command ()
#26 0x0000aaaada06ef48 in json_message_process_token ()
#27 0x0000aaaada086340 in json_lexer_feed_char ()
#28 0x0000aaaada086428 in json_lexer_feed ()
#29 0x0000aaaad9e30d30 in monitor_qmp_read ()
#30 0x0000aaaad9ef38d8 in qemu_chr_be_write ()
#31 0x0000aaaad9ef3cfc in tcp_chr_read ()
#32 0x0000aaaada03695c in qio_channel_fd_source_dispatch ()
#33 0x0000ffff8d07ebb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#34 0x0000aaaad9fd7ba4 in main_loop_wait ()
#35 0x0000aaaad9df7af8 in main ()

Comment 3 Andrew Jones 2017-03-14 11:20:33 UTC
I think there's another libvirt bug here, as well as a qemu bug. Based on the stack trace, it appears we're calling error_setg from virtio_pci_device_plugged with a preexisting error (that's the qemu bug). The preexisting error comes from virtio_blk_get_features (that's the libvirt bug). Here's the sequence:

virtio_bus_device_plugged       -- *errp is NULL (no error yet)
  virtio_blk_get_features       -- sets *errp and returns 0, so vdev->host_features = 0
  virtio_pci_device_plugged
    if !ignore_backend_features && !virtio_has_feature(vdev->host_features, VIRTIO_F_VERSION_1)
       && !legacy
      error_setg(errp, ...)

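All of these conditions are true. In particular, the VIRTIO_F_VERSION_1 check fails because the failed virtio_blk_get_features call left vdev->host_features at zero. As a result, error_setg is called a second time on an errp that already holds an error. The assertion in error_setv then fails and QEMU aborts, as frames #0 to #4 of the backtrace in comment 2 show.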
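The following toy C model shows why the second error_setg aborts. It assumes QEMU's contract for Error ** parameters. errp may be NULL, meaning the caller ignores errors. If errp is not NULL, *errp must still be NULL when an error is set, and error_setv() asserts this. The toy_* functions, buggy_plug_would_abort() and the error strings are hypothetical stand-ins. Instead of aborting, the model reports the point at which QEMU would abort.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_F_VERSION_1 32   /* a feature *bit number*: test with 1ULL << bit */

/* Toy error_setg(): the error is just a message pointer here. */
static bool toy_error_setg(const char **errp, const char *msg)
{
    if (errp == NULL)
        return true;            /* caller ignores errors */
    if (*errp != NULL)
        return false;           /* QEMU: assert(*errp == NULL) fails -> abort() */
    *errp = msg;
    return true;
}

/* Stand-in for virtio_blk_get_features() with scsi=on under virtio 1.0. */
static uint64_t toy_blk_get_features(const char **errp)
{
    toy_error_setg(errp, "scsi=on conflicts with virtio 1.0 (illustrative)");
    return 0;                   /* failure path: no features at all */
}

/* Stand-in for the check in virtio_pci_device_plugged(), with
 * ignore_backend_features assumed false. Returns false where QEMU aborts. */
static bool toy_pci_device_plugged(uint64_t host_features, bool legacy,
                                   const char **errp)
{
    if (!legacy && !(host_features & (1ULL << VIRTIO_F_VERSION_1)))
        return toy_error_setg(errp, "no VIRTIO_F_VERSION_1, legacy off (illustrative)");
    return true;
}

/* Pre-fix ordering in virtio_bus_device_plugged(): both steps share errp,
 * and the second runs even though the first failed. */
static bool buggy_plug_would_abort(bool legacy)
{
    const char *err = NULL;                          /* *errp is NULL on entry */
    uint64_t features = toy_blk_get_features(&err);  /* sets err, returns 0 */
    return !toy_pci_device_plugged(features, legacy, &err);
}
```

With legacy mode disabled (as on the aarch64 PCIe setup), buggy_plug_would_abort(false) returns true. With legacy enabled, the second error_setg is never reached.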

To fix the QEMU bug, I don't think we should call virtio_pci_device_plugged when virtio_blk_get_features fails; rather, we should propagate the error.

To fix the libvirt bug, libvirt should not request the configuration that makes virtio_blk_get_features fail. So what makes it fail? Commit efb8206ca7f1. That commit states that we can't have both scsi=on and disable-modern=off,disable-legacy=on at the same time, because SCSI passthrough is no longer supported in virtio 1.0.
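The rule that makes virtio_blk_get_features fail can be sketched as a feature-bit check. This is a simplified model, not QEMU's code. It encodes only what this report states. A device offering VIRTIO_F_VERSION_1 (feature bit 32, i.e. modern/virtio 1.0) cannot also have scsi=on. On conflict, the function reports the error seen in comment 7 and returns 0 features. check_blk_features() and has_feature() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_F_VERSION_1 32   /* feature bit number for virtio 1.0 */

static const char SCSI_V1_ERR[] =
    "Please set scsi=off for virtio-blk devices in order to use virtio 1.0";

/* Test a feature bit number against a 64-bit feature mask. */
static bool has_feature(uint64_t features, unsigned bit)
{
    return (features >> bit) & 1u;
}

/* Simplified model of the check:
 * - virtio 1.0 offered and scsi=on: set the error and return 0 features;
 * - otherwise: return the features unchanged (any legacy-mode
 *   adjustments in QEMU's version are omitted). */
static uint64_t check_blk_features(uint64_t features, bool scsi,
                                   const char **errp)
{
    if (has_feature(features, VIRTIO_F_VERSION_1) && scsi) {
        if (errp != NULL)
            *errp = SCSI_V1_ERR;
        return 0;
    }
    return features;
}
```

The zero return value is what leaves vdev->host_features without VIRTIO_F_VERSION_1. That missing bit later trips the check in virtio_pci_device_plugged.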

Comment 4 Fam Zheng 2017-03-17 11:50:12 UTC
Thank you, Drew. Following your analysis, I posted this fix upstream:

https://lists.gnu.org/archive/html/qemu-devel/2017-03/msg03533.html

Comment 5 Fam Zheng 2017-03-28 12:43:17 UTC
Merged for upstream QEMU 2.9:

commit a77690c41da67d85bab1e784a9f24f18bc63dbd9
Author: Fam Zheng <famz@redhat.com>
Date:   Fri Mar 17 20:32:42 2017 +0800

    virtio: Fix error handling in virtio_bus_device_plugged
    
    For one thing we shouldn't continue if an error happened, for the other
    two steps failing can cause an abort() in error_setg because we reuse
    the same errp blindly.
    
    Add error handling checks to fix both issues.
    
    Signed-off-by: Fam Zheng <famz@redhat.com>
    Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
    Reviewed-by: Andrew Jones <drjones@redhat.com>
    Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
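The commit message describes QEMU's usual local-error pattern. Each step reports into a fresh local_err. On failure, the caller propagates that error and returns before the next step runs. The toy C model below sketches this control flow; it is not the commit's diff. The toy_* names, plug_and_report() and the second error string are hypothetical. toy_error_propagate() is a simplified stand-in for QEMU's error_propagate(): it keeps an existing error and drops the new one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_F_VERSION_1 32   /* feature bit number */

static const char BLK_SCSI_ERR[] =
    "Please set scsi=off for virtio-blk devices in order to use virtio 1.0";

static int pci_plugged_calls;   /* how many times the second step ran */

/* Toy error_propagate(): move local_err into *dst unless the caller ignores
 * errors or already holds one. */
static void toy_error_propagate(const char **dst, const char *local_err)
{
    if (local_err != NULL && dst != NULL && *dst == NULL)
        *dst = local_err;
}

/* Stand-in for the get_features step; fails as virtio-blk does with scsi=on. */
static uint64_t toy_get_features(bool scsi_on, const char **errp)
{
    if (scsi_on) {
        if (errp != NULL)
            *errp = BLK_SCSI_ERR;
        return 0;
    }
    return 1ULL << VIRTIO_F_VERSION_1;
}

/* Stand-in for the device_plugged step; asserts like error_setv() does. */
static void toy_pci_device_plugged(uint64_t features, const char **errp)
{
    pci_plugged_calls++;
    if (!(features & (1ULL << VIRTIO_F_VERSION_1)) && errp != NULL) {
        assert(*errp == NULL);
        *errp = "modern-only device lacks VIRTIO_F_VERSION_1 (illustrative)";
    }
}

/* Fixed flow: each step writes into local_err, and a failure is propagated
 * before any later step can reuse the same errp. */
static void toy_bus_device_plugged(bool scsi_on, const char **errp)
{
    const char *local_err = NULL;

    uint64_t features = toy_get_features(scsi_on, &local_err);
    if (local_err != NULL) {
        toy_error_propagate(errp, local_err);
        return;                 /* do not continue after an error */
    }

    toy_pci_device_plugged(features, &local_err);
    if (local_err != NULL) {
        toy_error_propagate(errp, local_err);
        return;
    }
}

/* Convenience wrapper: run the plug path and return the reported error. */
static const char *plug_and_report(bool scsi_on)
{
    const char *err = NULL;
    toy_bus_device_plugged(scsi_on, &err);
    return err;
}
```

With scsi=on, the caller now receives the virtio-blk error and the second step never runs. This matches the post-fix result in comment 16: device_add fails with an error message instead of aborting QEMU.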

Comment 7 aihua liang 2017-04-14 06:29:54 UTC
The bug can't be reproduced on x86.

kernel version:3.10.0-588.el7.x86_64
qemu-kvm-rhev version:qemu-kvm-rhev-2.8.0-6.el7.x86_64
libvirt version:libvirt-2.0.0-10.el7.x86_64


Test steps:
  1. Start domain "test" with the following qemu command line:
    /usr/libexec/qemu-kvm \
    -name guest=test,debug-threads=on \
    -S \
    -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-test/master-key.aes \
    -machine pc-i440fx-rhel7.4.0,accel=kvm,usb=off \
    -cpu Penryn \
    -m 4096 \
    -realtime mlock=off \
    -smp 1,sockets=1,cores=1,threads=1 \
    -uuid 96cd2ce0-b126-40bf-9557-2f33aa559621 \
    -no-user-config -nodefaults \
    -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-test/monitor.sock,server,nowait \
    -mon chardev=charmonitor,id=monitor,mode=control \
    -rtc base=utc,driftfix=slew \
    -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown \
    -global PIIX4_PM.disable_s3=1 \
    -global PIIX4_PM.disable_s4=1 \
    -boot strict=on \
    -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7 \
    -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5 \
    -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1 \
    -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x5.0x2 \
    -drive file=/home/kvm_autotest_root/images/rhel74-64-virtio.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 \
    -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
    -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=28 \
    -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:22:2c:4d,bus=pci.0,addr=0x3 \
    -chardev pty,id=charserial0 \
    -device isa-serial,chardev=charserial0,id=serial0 \
    -vnc 0.0.0.0:0 \
    -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 \
    -msg timestamp=on


  2. Pass through the block device with scsi=on:
    virsh attach-disk --domain test --source /dev/sdb --target vdb --driver qemu --type lun

Test Result:
  Attaching the device failed with this error message:
   error: Failed to attach disk
   error: internal error: unable to execute QEMU command 'device_add': Please set scsi=off for virtio-blk devices in order to use virtio 1.0

Comment 8 aihua liang 2017-04-14 06:39:22 UTC
Hi, Fam

  I can't reproduce the bug on x86. Can you help check whether the test steps in comment 7 are right?


Thanks,
aliang

Comment 9 Fam Zheng 2017-04-17 03:37:40 UTC
Wei seems to have a reproducer in comment 2. Could you help, Wei?

Comment 10 weizhang 2017-04-17 03:56:05 UTC
Hi Fam and aihua, 

It cannot be reproduced on x86; the bug I originally reported is on aarch64.

Comment 11 aihua liang 2017-04-17 05:46:10 UTC
Hi Wei,

 Got it, thanks.

BR,
aliang

Comment 13 Fam Zheng 2017-04-17 07:00:01 UTC
Not sure what the needinfo is for, but the plan in comment 12 makes sense to me.

Comment 14 aihua liang 2017-04-19 09:37:41 UTC
This bug can be reproduced on aarch64.

kernel version:4.10.0-11.el7.aarch64
libvirt version:libvirt-3.2.0-2.el7.aarch64
AAVMF version:AAVMF-20170228-3.gitc325e41585e3.el7.noarch
qemu-kvm-rhev:qemu-kvm-rhev-2.8.0-6.el7.aarch64


Test steps:
1. Start the guest with the following qemu command line:
 /usr/libexec/qemu-kvm \
-name guest=te,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-3-te/master-key.aes \
-machine virt-rhel7.4.0,accel=kvm,usb=off,dump-guest-core=off \
-cpu host -drive file=/usr/share/AAVMF/AAVMF_CODE.fd,if=pflash,format=raw,unit=0,readonly=on \
-drive file=/var/lib/libvirt/qemu/nvram/te_VARS.fd,if=pflash,format=raw,unit=1 \
-m 4096 \
-realtime mlock=off \
-smp 1,sockets=1,cores=1,threads=1 \
-uuid 39f5ef2a-44c5-4a18-b0b2-2f28e063405e \
-no-user-config -nodefaults \
-chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-3-te/monitor.sock,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on \
-device ioh3420,port=0x8,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x1 \
-device ioh3420,port=0x9,chassis=2,id=pci.2,bus=pcie.0,addr=0x1.0x1 \
-device ioh3420,port=0xa,chassis=3,id=pci.3,bus=pcie.0,addr=0x1.0x2 -device ioh3420,port=0xb,chassis=4,id=pci.4,bus=pcie.0,addr=0x1.0x3 \
-drive file=/home/te.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 \
-device virtio-blk-pci,scsi=off,bus=pci.2,addr=0x0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=28 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:1f:67:bf,bus=pci.1,addr=0x0 \
-serial pty \
-vnc 0.0.0.0:0 -device virtio-gpu-pci,id=video0,bus=pci.3,addr=0x0 \
-msg timestamp=on

2. Pass through a disk:
 virsh attach-disk --domain te --source /dev/sdb --target vdb --driver qemu --type lun

Test Result:
error: Failed to attach disk
error: internal error: child reported: Kernel does not provide mount namespace: No such file or directory

Comment 16 aihua liang 2017-05-03 08:27:12 UTC
Verified: the issue has been resolved, so I am setting the status to "Verified".

****************Details**************
Test Version:
  kernel version:4.10.0-13.el7.aarch64
  qemu-kvm-rhev version:qemu-kvm-rhev-2.9.0-2.el7.aarch64
  libvirt version:libvirt-3.2.0-3.el7.aarch64


Test steps:
  1. Start the guest with the following qemu command line:
/usr/libexec/qemu-kvm \
-name guest=te,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-3-te/master-key.aes \
-machine virt-rhel7.4.0,accel=kvm,usb=off,dump-guest-core=off \
-cpu host -drive file=/usr/share/AAVMF/AAVMF_CODE.fd,if=pflash,format=raw,unit=0,readonly=on \
-drive file=/var/lib/libvirt/qemu/nvram/te_VARS.fd,if=pflash,format=raw,unit=1 \
-m 4096 \
-realtime mlock=off \
-smp 1,sockets=1,cores=1,threads=1 \
-uuid 39f5ef2a-44c5-4a18-b0b2-2f28e063405e \
-no-user-config -nodefaults \
-chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-3-te/monitor.sock,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on \
-device ioh3420,port=0x8,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x1 \
-device ioh3420,port=0x9,chassis=2,id=pci.2,bus=pcie.0,addr=0x1.0x1 \
-device ioh3420,port=0xa,chassis=3,id=pci.3,bus=pcie.0,addr=0x1.0x2 -device ioh3420,port=0xb,chassis=4,id=pci.4,bus=pcie.0,addr=0x1.0x3 \
-drive file=/home/te.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 \
-device virtio-blk-pci,scsi=off,bus=pci.2,addr=0x0,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=28 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:1f:67:bf,bus=pci.1,addr=0x0 \
-serial pty \
-vnc 0.0.0.0:0 -device virtio-gpu-pci,id=video0,bus=pci.3,addr=0x0 \
-msg timestamp=on

 2. Pass through a disk:
 virsh attach-disk --domain te --source /dev/sdb --target vdb --driver qemu --type lun

Test Result:
  error: Failed to attach disk
  error: internal error: unable to execute QEMU command 'device_add': Please set scsi=off for virtio-blk devices in order to use virtio 1.0

Comment 22 errata-xmlrpc 2017-08-02 03:39:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:2392

