Bug 689002
Summary: | guest with assigned nic got kernel panic when send system_reset signal in QEMU monitor | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Alex Williamson <alex.williamson> |
Component: | libvirt | Assignee: | Alex Williamson <alex.williamson> |
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 6.1 | CC: | ajia, alex.williamson, chayang, dallan, ddutile, eblake, jdenemar, juzhang, michen, mjenner, mkenneth, mzhan, syeghiay, tburke, virt-maint, yoyzhang |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | libvirt-0.8.7-14.el6 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | 685147 | Environment: | |
Last Closed: | 2011-05-19 13:29:21 UTC | Type: | --- |
Bug Depends On: | 685147 | ||
Bug Blocks: |
Comment 2
Eric Blake
2011-03-18 19:36:28 UTC
(In reply to comment #2)
> If I'm reading this bz correctly, then the libvirt piece of this patch is
> already upstream and just needs to be backported:

Yes, exactly.

I can reproduce the bug with the above test environment (qemu-kvm-0.12.1.2-2.150.el6.x86_64; the NIC is a BCM5709 with MSI capability), and the bug has been verified on RHEL 6.1 (2.6.32-122.el6.x86_64) with qemu-kvm-0.12.1.2-2.152.el6.x86_64.

From the libvirt point of view, I can't reset the device with the virsh command:

    # virsh nodedev-dettach pci_0000_01_00_1
    Device pci_0000_01_00_1 dettached

    # ls /sys/bus/pci/drivers/bnx2/
    0000:02:00.0  0000:02:00.1  bind  module  new_id  remove_id  uevent  unbind

    # ls /sys/bus/pci/drivers/pci-stub/
    0000:01:00.0  0000:01:00.1  0000:09:00.0  bind  new_id  remove_id  uevent  unbind

    # virsh nodedev-reset pci_0000_01_00_1
    error: Failed to reset device pci_0000_01_00_1
    error: this function is not supported by the connection driver: Unable to reset PCI device 0000:01:00.1: this function is not supported by the connection driver: Active 0000:01:00.0 devices on bus with 0000:01:00.1, not doing bus reset

Hot-plugging or cold-plugging the device into the guest with virt-manager also fails, with the same error:

    Error starting domain: this function is not supported by the connection driver: Unable to reset PCI device 0000:01:00.1: this function is not supported by the connection driver: Active 0000:01:00.0 devices on bus with 0000:01:00.1, not doing bus reset

    Traceback (most recent call last):
      File "/usr/share/virt-manager/virtManager/asyncjob.py", line 45, in cb_wrapper
        callback(asyncjob, *args, **kwargs)
      File "/usr/share/virt-manager/virtManager/engine.py", line 956, in asyncfunc
        vm.startup()
      File "/usr/share/virt-manager/virtManager/domain.py", line 1048, in startup
        self._backend.create()
      File "/usr/lib64/python2.6/site-packages/libvirt.py", line 325, in create
        if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
    libvirtError: this function is not supported by the connection driver: Unable to reset PCI device 0000:01:00.1: this function is not supported by the connection driver: Active 0000:01:00.0 devices on bus with 0000:01:00.1, not doing bus reset

Because the NIC can't be assigned to the guest, I can't use virsh qemu-monitor-command to send the system_reset command to the guest.

    # uname -r
    2.6.32-122.el6.x86_64
    # rpm -q qemu-kvm
    qemu-kvm-0.12.1.2-2.152.el6.x86_64
    # rpm -q libvirt
    libvirt-0.8.7-14.el6.x86_64
    # rpm -q virt-manager
    virt-manager-0.8.6-3.el6.noarch
    # rpm -q python-virtinst
    python-virtinst-0.500.5-2.el6.noarch

I need to clarify the following comment:

"Because the NIC can't be assigned to the guest, I can't use virsh qemu-monitor-command to send the system_reset command to the guest."

I mean that I could not recreate the previous test environment, so although I can use virsh qemu-monitor-command to send the system_reset command to the guest, doing so doesn't make sense.

Alex

Hi Alex,

Sorry, I'm still confused why this is going back to ON_DEV. Apologies if libvirt has a different bug life cycle that I'm not familiar with.

With respect to nodedev-reset, this patch makes no changes to the behavior of that interface. For hot-plug/cold-plug failing, is this a regression caused by this change, or is this a pre-existing condition? The changes for this patch should only affect the permissions of an extra PCI sysfs file for the device and should not change whether or not a device can be assigned.

To verify the libvirt side of things, I think it would be sufficient to check the permissions of the file /sys/bus/pci/devices/ssss:bb:dd.f/reset before and after the patch is applied. Before, the file should be owned by root, after by the qemu user. If using the corresponding qemu-kvm from bz685147, the reset should be triggered any time the guest reboots, or if a reset is triggered via the virsh qemu-monitor-command --hmp system_reset.
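The ownership check described above could be scripted roughly as follows. This is only a sketch: the helper names `file_owner` and `reset_file_path` are hypothetical and not part of libvirt or qemu-kvm, and the sysfs layout is assumed to match the path quoted in the comment.

```python
import os
import pwd


def reset_file_path(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Build the path of the sysfs reset file for a PCI address (ssss:bb:dd.f)."""
    return os.path.join(sysfs_root, pci_address, "reset")


def file_owner(path):
    """Return the name of the user that owns `path`.

    Falls back to the numeric uid (as a string) when the uid has no
    entry in the password database.
    """
    uid = os.stat(path).st_uid
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)
```

Calling `file_owner(reset_file_path("0000:09:00.1"))` once before the guest starts and once afterwards would be one way to compare the two states; with the fix applied, the expectation stated above is `root` before and `qemu` after.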
Please clarify what you're seeing and whether you expect any further fixes from ON_DEV.

Thanks,
Alex

(In reply to comment #6)
> # virsh nodedev-reset pci_0000_01_00_1
> error: Failed to reset device pci_0000_01_00_1
> error: this function is not supported by the connection driver: Unable to reset
> PCI device 0000:01:00.1: this function is not supported by the connection
> driver: Active 0000:01:00.0 devices on bus with 0000:01:00.1, not doing bus
> reset
>
> Hot-plugging or cold-plugging the device into the guest with virt-manager
> also fails, with the same error:
>
> Error starting domain: this function is not supported by the connection driver:
> Unable to reset PCI device 0000:01:00.1: this function is not supported by the
> connection driver: Active 0000:01:00.0 devices on bus with 0000:01:00.1, not
> doing bus reset

I believe these are telling you that the device you're trying to assign does not support any method of reset other than a secondary bus reset of the parent PCI bridge, but that option is unavailable because it's a multi-function device. This is not a bug, unless you want to file one for the clarity of the error message. I would suggest testing with a device known to work with assignment, such as an 82576, or even most e1000 variants.
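To see in advance whether a device is likely to hit this situation, one could inspect sysfs along the following lines. This is a rough sketch, not libvirt's actual logic: the function names are hypothetical, it assumes the kernel exposes a `reset` file only for functions it knows how to reset individually, and it lists every other device on the same bus rather than only the ones libvirt would treat as active.

```python
import os
import re

# Matches a full PCI address such as 0000:01:00.1 (domain:bus:slot.function).
PCI_ADDR = re.compile(r"^([0-9a-f]{4}):([0-9a-f]{2}):([0-9a-f]{2})\.([0-7])$")


def same_bus_devices(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """List the other PCI devices that share a domain and bus with pci_address."""
    address = pci_address.lower()
    match = PCI_ADDR.match(address)
    if match is None:
        raise ValueError("not a PCI address: %r" % pci_address)
    prefix = "%s:%s:" % (match.group(1), match.group(2))
    return sorted(
        name for name in os.listdir(sysfs_root)
        if name.startswith(prefix) and name != address
    )


def reset_options(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return (has_function_reset, neighbours) for a PCI device.

    has_function_reset: True if a per-function sysfs reset file exists.
    neighbours: other devices on the same bus, which a secondary bus
    reset of the parent bridge would also affect.
    """
    reset_file = os.path.join(sysfs_root, pci_address.lower(), "reset")
    return os.path.exists(reset_file), same_bus_devices(pci_address, sysfs_root)
```

A result of `(False, [...])` with a non-empty neighbour list corresponds to the case described above: no per-function reset is available, and a bus reset would disturb the other function.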
(In reply to comment #14)
> (In reply to comment #6)
> > # virsh nodedev-reset pci_0000_01_00_1
> > error: Failed to reset device pci_0000_01_00_1
> > error: this function is not supported by the connection driver: Unable to reset
> > PCI device 0000:01:00.1: this function is not supported by the connection
> > driver: Active 0000:01:00.0 devices on bus with 0000:01:00.1, not doing bus
> > reset
> >
> > Hot-plugging or cold-plugging the device into the guest with virt-manager
> > also fails, with the same error:
> >
> > Error starting domain: this function is not supported by the connection driver:
> > Unable to reset PCI device 0000:01:00.1: this function is not supported by the
> > connection driver: Active 0000:01:00.0 devices on bus with 0000:01:00.1, not
> > doing bus reset
>
> I believe these are telling you that the device you're trying to assign does
> not support any method of reset other than a secondary bus reset of the parent
> PCI bridge, but that option is unavailable because it's a multi-function
> device. This is not a bug, unless you want to file one for the clarity of the
> error message. I would suggest testing with a device known to work with
> assignment, such as an 82576, or even most e1000 variants.

Hi Alex,

As you said, the BCM5709 NIC is a multi-function device, so it has no reset file under /sys/bus/pci/devices/ssss:bb:dd.f/. The Intel 82576 NIC works, so I can only use the Intel 82576 to verify the bug again.

Thanks,
Alex Jia

The bug has been verified on RHEL 6.1 (2.6.32-122.el6.x86_64) with qemu-kvm-0.12.1.2-2.152.el6.x86_64 and libvirt-0.8.7-14.el6.x86_64.
    # lspci |grep 82576
    09:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
    09:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)

    # virsh nodedev-dumpxml pci_0000_09_10_1
    <device>
      <name>pci_0000_09_10_1</name>
      <parent>pci_0000_00_09_0</parent>
      <capability type='pci'>
        <domain>0</domain>
        <bus>9</bus>
        <slot>16</slot>
        <function>1</function>
        <product id='0x10ca'>82576 Virtual Function</product>
        <vendor id='0x8086'>Intel Corporation</vendor>
        <capability type='phys_function'>
          <address domain='0x0000' bus='0x09' slot='0x00' function='0x1'/>
        </capability>
      </capability>
    </device>

Add the following XML to the guest XML configuration, or run virsh attach-device VM pf.xml while the guest is running:

    # cat pf.xml
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address bus='0x09' slot='0x0' function='0x1'/>
      </source>
    </hostdev>

    # virsh edit vr-rhel6u1-x86_64-kvm
    Domain vr-rhel6u1-x86_64-kvm XML configuration edited.

    # ll /sys/bus/pci/devices/0000\:09\:00.1/reset
    --w-------. 1 root root 4096 Mar 31 01:55 /sys/bus/pci/devices/0000:09:00.1/reset

    # virsh start vr-rhel6u1-x86_64-kvm
    Domain vr-rhel6u1-x86_64-kvm started

    # ll /sys/bus/pci/devices/0000\:09\:00.1/reset
    --w-------. 1 qemu qemu 4096 Mar 31 01:55 /sys/bus/pci/devices/0000:09:00.1/reset

The ownership does change from root to qemu once the NIC is assigned to the guest. According to your advice this result seems sufficient; if so, I will change the bug status to VERIFIED. Otherwise, do I need to run virsh qemu-monitor-command and then check the permissions of /sys/bus/pci/devices/0000\:09\:00.1/reset again?

In addition, some messages appear when I run the following virsh command; of course, this may be a separate issue:

    # virsh qemu-monitor-command vr-rhel6u1-x86_64-kvm --hmp system_reset

    #
    Message from syslogd@amd-6168-16-1 at Mar 31 02:29:20 ...
     kernel:Uhhuh. NMI received for unknown reason 21 on CPU 0.

    Message from syslogd@amd-6168-16-1 at Mar 31 02:29:20 ...
     kernel:Do you have a strange power saving mode enabled?

    Message from syslogd@amd-6168-16-1 at Mar 31 02:29:20 ...
     kernel:Dazed and confused, but trying to continue

Alex Jia

Hi Alex,

There are questions in comment 17. I'm not sure whether the result is sufficient for this bug, so we need your help to confirm.

Thanks,
Alex Jia

(In reply to comment #17)
> The ownership does change from root to qemu once the NIC is assigned to the
> guest. According to your advice this result seems sufficient; if so, I will
> change the bug status to VERIFIED. Otherwise, do I need to run virsh
> qemu-monitor-command and then check the permissions of
> /sys/bus/pci/devices/0000\:09\:00.1/reset again?

The file permissions aren't going to be changed by qemu. I think seeing that it's now owned by qemu is sufficient. bz685147 is the qemu side of the patch, where it has already been verified that when qemu does have access to the reset file, it does what it's supposed to do. This bz is primarily around setting up those permissions.

> In addition, some messages appear when I run the following virsh command; of
> course, this may be a separate issue:
> # virsh qemu-monitor-command vr-rhel6u1-x86_64-kvm --hmp system_reset
>
> #
> Message from syslogd@amd-6168-16-1 at Mar 31 02:29:20 ...
> kernel:Uhhuh. NMI received for unknown reason 21 on CPU 0.
>
> Message from syslogd@amd-6168-16-1 at Mar 31 02:29:20 ...
> kernel:Do you have a strange power saving mode enabled?
>
> Message from syslogd@amd-6168-16-1 at Mar 31 02:29:20 ...
> kernel:Dazed and confused, but trying to continue

Did the device continue to function correctly after the guest was reset? I'm guessing this is an HP test system. ISTR something with the hpwdt driver. Do you still get these messages if you unload/blacklist the hpwdt module?

> The file permissions aren't going to be changed by qemu. I think seeing that
> it's now owned by qemu is sufficient. bz685147 is the qemu side of the patch
> that has already been verified that when qemu does have access to the reset
> file, it does what it's supposed to do. This bz is primarily around setting
> up those permissions.

According to the above comment, I am setting the bug status to VERIFIED.

> Did the device continue to function correctly after the guest was reset? I'm
> guessing this is an HP test system. ISTR something with the hpwdt driver. Do
> you still get these messages if you unload/blacklist the hpwdt module?

Hi Alex,

It is a Dell machine, not HP, and I no longer have access to it. If I get it again, I will try your suggestion.

Thanks,
Alex Jia

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0596.html