Bug 609437

Summary:	Guest OS shutoff after hot unplugging PCI device
Product:	Red Hat Enterprise Linux 6	Reporter:	wangyimiao <yimwang>
Component:	libvirt	Assignee:	Jiri Denemark <jdenemar>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Virtualization Bugs <virt-bugs>
Severity:	high	Docs Contact:
Priority:	low
Version:	6.0	CC:	berrange, clalance, dallan, ddumas, ddutile, dyuan, jdenemar, juzhang, llim, weizhan, xen-maint, yoyzhang
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	libvirt-0_8_1-22_el6	Doc Type:	Bug Fix
Last Closed:	2010-11-11 14:50:13 UTC	Type:	---

Description wangyimiao 2010-06-30 10:13:02 UTC

Description of problem:
The guest OS shuts off after the PF/VF is hot-unplugged.

Version-Release number of selected component (if applicable):
-libvirt-0.8.1-12.el6.x86_64
-qemu-kvm-0.12.1.2-2.90.el6.x86_64
-qemu-img-0.12.1.2-2.90.el6.x86_64
-kernel-2.6.32-38.el6.x86_64


How reproducible:
5/5

Steps to Reproduce:

1.# virsh nodedev-list --tree
computer
 |
  +- net_lo_00_00_00_00_00_00
  +- pci_0000_00_00_0
  +- pci_0000_00_01_0
  |   |
  |   +- pci_0000_03_00_0
  |   |   |
  |   |   +- net_eth1_00_1b_21_39_8b_18
  |   |
  |   +- pci_0000_03_00_1
  |   |   |
  |   |   +- net_eth3
  |   |

2.# virsh nodedev-dumpxml pci_0000_03_00_0
<device>
  <name>pci_0000_03_00_0</name>
  <parent>pci_0000_00_01_0</parent>
  <driver>
    <name>igb</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>3</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x10c9'>82576 Gigabit Network Connection</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
  </capability>
</device>

3.# cat PF.xml
<hostdev mode='subsystem' type='pci'>
  <source>
    <address bus='3' slot='0' function='0'/>
  </source>
</hostdev>

4.# virsh nodedev-dettach pci_0000_03_00_0
Device pci_0000_03_00_0 dettached

5.#  virsh attach-device rhel6 PF.xml
Device attached successfully

6.#  virsh detach-device rhel6 PF.xml
Device detached successfully

7.# tail /var/log/messages
Jun 30 17:12:33 dhcp-66-70-6 kernel: pci-stub 0000:03:10.0: enabling device (0000 -> 0002)
Jun 30 17:12:54 dhcp-66-70-6 dnsmasq-dhcp[2206]: DHCPREQUEST(virbr0) 192.168.122.166 52:54:00:c4:d7:00
Jun 30 17:12:54 dhcp-66-70-6 dnsmasq-dhcp[2206]: DHCPACK(virbr0) 192.168.122.166 52:54:00:c4:d7:00
Jun 30 17:14:03 dhcp-66-70-6 dnsmasq-dhcp[2206]: DHCPREQUEST(virbr0) 192.168.122.166 52:54:00:c4:d7:00
Jun 30 17:14:03 dhcp-66-70-6 dnsmasq-dhcp[2206]: DHCPACK(virbr0) 192.168.122.166 52:54:00:c4:d7:00
Jun 30 17:15:31 dhcp-66-70-6 libvirtd: 17:15:31.549: error : qemuMonitorJSONCommandWithFd:242 : cannot send monitor command '{"execute":"query-balloon"}': Connection reset by peer
Jun 30 17:15:31 dhcp-66-70-6 avahi-daemon[1784]: Withdrawing address record for fe80::3063:4eff:fea4:587b on vnet0.
Jun 30 17:15:31 dhcp-66-70-6 kernel: virbr0: port 1(vnet0) entering disabled state
Jun 30 17:15:31 dhcp-66-70-6 kernel: device vnet0 left promiscuous mode
Jun 30 17:15:31 dhcp-66-70-6 kernel: virbr0: port 1(vnet0) entering disabled state

8.# cat /var/log/libvirt/qemu/rhel6.log
............
............
LC_ALL=C PATH=/sbin:/usr/sbin:/bin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -S -M rhel6.0.0 -enable-kvm -m 2048 -smp 1,sockets=1,cores=1,threads=1 -name rhel6 -uuid e164d940-2dd7-6749-3452-87800f467373 -nodefconfig -nodefaults -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/rhel6.monitor,server,nowait -mon chardev=monitor,mode=control -rtc base=utc -boot c -drive file=/var/lib/libvirt/images/RHEL-Server-6-64-virtio.qcow2,if=none,id=drive-ide0-0-0,boot=on,format=qcow2 -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=21,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:c4:d7:00,bus=pci.0,addr=0x4 -chardev pty,id=serial0 -device isa-serial,chardev=serial0 -usb -vnc 127.0.0.1:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3
char device redirected to /dev/pts/1


  
Actual results:
At step 5 the guest OS was running and showed the device via the 'lspci' command, but the guest OS shut off after the PF/VF was unplugged.

Expected results:
The PF should be unplugged successfully, and the device should no longer be visible in the guest OS.

Comment 2 RHEL Program Management 2010-06-30 10:43:08 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 3 wangyimiao 2010-07-05 10:52:59 UTC

ADDED NOTE:

  My test Guest OS is 'el6.0'

Comment 4 Dave Allan 2010-07-06 16:38:12 UTC

Is this problem specific to SRIOV devices?  Can you reproduce it on an ordinary PCI device?

Comment 5 wangyimiao 2010-07-07 02:00:11 UTC

Yes, it can also be reproduced on an ordinary PCI device.

Comment 6 dyuan 2010-07-15 08:29:17 UTC

Tested with libvirt-0.8.1-15.el6.
With managed='yes' set in the XML, the guest does not shut off after hot unplug.

Comment 7 Dave Allan 2010-07-15 18:43:58 UTC

When you use managed='no' you really have to reset the device before assigning it to the guest.  There may be some situations in which that's not strictly necessary, but you really have to know the state of the hardware very well.  If you do a nodedev-reset before attaching the device to the guest, does the problem go away?

Comment 8 wangyimiao 2010-07-16 01:54:23 UTC

After using 'nodedev-reset' to reset the PCI device before attaching it to the guest, the problem still exists.

Comment 9 Daniel Berrangé 2010-07-20 16:15:27 UTC

Can you confirm that everything works correctly if you use managed="yes" and do not use any of the 'virsh nodedev-detach/reset/reattach' commands?

If using managed=no there are only two supported usage scenarios:

Coldplug:

   virsh nodedev-dettach DEVICE
   virsh nodedev-reset DEVICE
   start KVM guest with managed=no
   stop KVM guest
   virsh nodedev-reset DEVICE
   virsh nodedev-reattach DEVICE

Hotplug:

   start KVM guest with no host device
   virsh nodedev-dettach DEVICE
   virsh nodedev-reset DEVICE
   virsh attach-device with managed=no
   virsh detach-device 
   virsh nodedev-reset DEVICE
   virsh nodedev-reattach DEVICE
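
The managed=no hotplug sequence above could be scripted roughly as follows. This is only a sketch: the names run, hotplug_cycle and DRY_RUN are hypothetical helpers introduced for illustration, and the virsh subcommands are the ones listed in this comment (including the nodedev-dettach spelling used by this libvirt version). It assumes the guest is already running with no host device assigned.

```shell
# Sketch of the managed=no hotplug sequence; not an official tool.
# run, hotplug_cycle and DRY_RUN are hypothetical names.

run() {
    # In dry-run mode only print the command, otherwise execute it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

hotplug_cycle() {
    local guest=$1 nodedev=$2 xml=$3
    # Host side: detach from the host driver and reset before handing over.
    run virsh nodedev-dettach "$nodedev" || return 1
    run virsh nodedev-reset "$nodedev" || return 1
    # Guest side: hot plug, then hot unplug (XML uses managed=no).
    run virsh attach-device "$guest" "$xml" || return 1
    run virsh detach-device "$guest" "$xml" || return 1
    # Host side: reset again and return the device to the host driver.
    run virsh nodedev-reset "$nodedev" || return 1
    run virsh nodedev-reattach "$nodedev" || return 1
}
```

For example, DRY_RUN=1 hotplug_cycle rhel6 pci_0000_03_00_0 PF.xml prints the six commands in order without touching the host.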

Comment 10 wangyimiao 2010-07-21 01:54:40 UTC

Hi 'DB',

Steps:
1.# cat pf.xml
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address bus='0' slot='25' function='0'/>
  </source>
</hostdev>
2.# virsh attach-device rhel6 pf.xml 
Device attached successfully

3.# virsh detach-device rhel6 pf.xml 
Device detached successfully

After these steps, everything works fine.

Comment 11 Jiri Denemark 2010-07-21 14:26:28 UTC

wangyimiao confirmed on IRC that only the coldplug scenario was working with managed=no, while the problem could still be observed in the hotplug case. Not sure what to think about it; I'm trying to reproduce it myself.

Comment 12 Jiri Denemark 2010-07-21 16:14:37 UTC

I can't reproduce, it all works for me with managed=no. Even crazy things like starting a guest with PCI device attached and hot-unplugging and plugging back work.

Comment 13 wangyimiao 2010-07-22 01:54:03 UTC

Hi 'JD',
  
Hotplug:

   start KVM guest with no host device
   virsh nodedev-dettach DEVICE
   virsh nodedev-reset DEVICE
   virsh attach-device with managed=no
   virsh detach-device 
   virsh nodedev-reset DEVICE
   virsh nodedev-reattach DEVICE   

This is my PCI device info.

1.# virsh nodedev-dumpxml pci_0000_00_19_0
.......
    <product id='0x10bd'>82566DM-2 Gigabit Network Connection</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
........

2. After the step "virsh attach-device with managed=no", please log in to the guest and check that the PCI device has obtained a valid IP address and works properly. This may help you reproduce the issue.

Comment 14 Jiri Denemark 2010-07-22 08:38:22 UTC

The following line was printed by qemu into rhel6.log on the machine where this bug reproduces:

assigned_dev_pci_write_config: pwrite failed, ret = -1 errno = 13

Comment 16 Jiri Denemark 2010-07-28 19:19:14 UTC

Could you please download and install test packages from http://people.redhat.com/clalance/bz617116/ and check if that makes any difference?

Comment 17 dyuan 2010-07-29 10:26:14 UTC

Reproduced the problem with:
libvirt-0.8.1-18clalance.el6.x86_64
qemu-kvm-0.12.1.2-2.90.el6.x86_64
kernel-2.6.32-53.el6
and with:
libvirt-0.8.1-18clalance.el6.x86_64
qemu-kvm-0.12.1.2-2.104.el6.x86_64
kernel-2.6.32-53.el6

Comment 18 dyuan 2010-07-29 11:33:11 UTC

Tested with:
libvirt-0.8.1-20.el6.x86_64
qemu-kvm-0.12.1.2-2.104.el6.x86_64
kernel-2.6.32-53.el6

Reproduced it with a rhel6 guest, but could not reproduce it with a rhel5 guest.

Comment 19 Chris Lalancette 2010-08-03 19:50:17 UTC

I talked to Don Dutile a bit about this bug, and we now have a theory as to why this works with managed=yes but not managed=no.

The first thing to realize is that the 82576 is an SR-IOV capable device.  As such, it is required by the SR-IOV spec to support FLR (function-level reset).  For historical reasons, when libvirt detects that a device supports FLR, it does *absolutely nothing* when a reset is requested.  In the case for starting or stopping a virtual machine, this is actually the correct behavior; qemu/kvm will do the FLR on our behalf, and all will be right with the world.  This is why "managed=yes" works just fine.

However, when you do a "virsh nodedev-reset", there is no qemu active to do the reset for us.  So in this case, we will come into the same code path, determine that the device supports FLR, and still do nothing.  But in this case, there is no qemu in the mix to actually do the reset on our behalf, so nothing at all happens.  Now we start the guest, but nothing has ever reset the device, so things get confused.

We may be able to confirm this theory by playing around with /sys/bus/pci/devices/<pciid>/reset; writing a "1" into this file should cause a FLR to occur (though only for SR-IOV VF devices).  After the nodedev-reset step in the above reproducers, try to echo 1 > /sys/bus/pci/devices/<pciid>/reset, and then follow the rest of the steps to see if you can still reproduce the problem.

Chris Lalancette
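
The sysfs experiment suggested in this comment could be wrapped in a small helper like the one below. It is a sketch only: pci_sysfs_reset and the SYSFS_PCI_ROOT override are hypothetical names added for illustration, and the device ID (for example 0000:03:00.0 for pci_0000_03_00_0) has to be supplied by the tester. Whether the write actually resets the device depends on the hardware, as the comment notes.

```shell
# Sketch: write "1" to a PCI device's sysfs reset file.
# pci_sysfs_reset and SYSFS_PCI_ROOT are hypothetical names; the
# override only exists so the helper can be tried on a scratch directory.
pci_sysfs_reset() {
    local pciid=$1
    local reset_file="${SYSFS_PCI_ROOT:-/sys/bus/pci/devices}/$pciid/reset"
    # Fail early if the device has no reset file or it is not writable.
    if [ ! -w "$reset_file" ]; then
        echo "no writable reset file for $pciid" >&2
        return 1
    fi
    echo 1 > "$reset_file"
}
```

In the reproducer this would run as root right after the nodedev-reset step, e.g. pci_sysfs_reset 0000:03:00.0, before attaching the device to the guest.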

Comment 20 Don Dutile (Red Hat) 2010-08-04 14:02:38 UTC

Alex Williamson has made a couple of bug fixes in qemu-kvm and/or kvm support for hot-unplug.
Please update to the latest qemu-kvm (we're up to something like .100+) and kernel (like .55+) and retest.

Comment 21 weizhang 2010-08-05 10:29:57 UTC

I tested it with the latest qemu-kvm-0.12.1.2-2.108.el6.x86_64 and kernel 2.6.32-59.el6.x86_64, and the error still exists.

libvirt-0.8.1-21.el6.x86_64

Comment 22 Jiri Denemark 2010-08-05 13:43:05 UTC

Finally I found out the core of this bug. It's not related to any particular network card. The Chinese boxes are just lucky enough to hit the bug, unlike other boxes I tried to test this on.

The problem is this:

With svirt enabled, libvirt relabels /sys files corresponding to given PCI device so that qemu-kvm can access them and then it instructs qemu-kvm to attach the device. On detach, libvirt issues 'device_del' qemu command and after it returns, it relabels the files back, which prevents qemu-kvm from accessing them.

Unfortunately, device_del is not synchronous for PCI devices; it merely asks the guest to release the device and returns. If the guest doesn't release it quickly enough, libvirt relabels the files before the guest/qemu-kvm is finished with the PCI device, which may result in the EACCES failure (errno = 13) mentioned in comment 14. qemu-kvm's reaction to such a failure is exit(), and the guest disappears.

Comment 23 Jiri Denemark 2010-08-05 13:44:41 UTC

There's no way we could fix this properly for 6.0. We can only hack around it...

Comment 26 Jiri Denemark 2010-08-06 15:09:16 UTC

I think qemudReattachManagedDevice() might be the reason why we didn't see this bug in managed=yes case. The function is called after device_del returns but before any relabeling is done. The function looks like this:

    int retries = 100;

    if (pciDeviceGetManaged(dev)) {
        while (pciWaitForDeviceCleanup(dev, "kvm_assigned_device")
               && retries) {
            usleep(100*1000);
            retries--;
        }
        if (pciReAttachDevice(dev) < 0) {
            virErrorPtr err = virGetLastError();
            VIR_ERROR(_("Failed to re-attach PCI device: %s"),
                      err ? err->message : "");
            virResetError(err);
        }
    }

In managed=yes case we wait up to 10 seconds in the first loop until the device is actually freed by kvm. Should we do that for non-managed devices as well?

The bad thing is I wasn't able to prove that this code is what helps. When I commented out the call to this function in the hope of reproducing the bug with managed=yes, it didn't reproduce even for managed=no.
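
The 10 second figure follows from the loop in the function above: 100 retries with a 100 ms sleep each. The same bounded-polling pattern can be sketched in shell as below. This is an illustration, not libvirt code: wait_for_release, RETRIES and INTERVAL are hypothetical names, and the check passed as arguments is assumed to succeed while the device is still held, which is what the C loop condition implies.

```shell
# Sketch: poll a caller-supplied "still busy?" check with a bounded
# number of retries. All names here are hypothetical.
wait_for_release() {
    local retries=${RETRIES:-100} interval=${INTERVAL:-0.1}
    # "$@" must exit 0 while the device is still in use.
    while "$@"; do
        # Give up once the retry budget is used up.
        [ "$retries" -gt 0 ] || return 1
        sleep "$interval"
        retries=$((retries - 1))
    done
    # The check reported the device as released.
    return 0
}
```

With the defaults this waits at most about 100 x 0.1 s = 10 s. A caller would continue with relabeling only when the function returns 0.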

Comment 27 Jiri Denemark 2010-08-06 15:16:14 UTC

Hmm, actually, after rebooting the host, I reproduced the bug for managed=yes with the call to qemudReattachManagedDevice() commented out...

Comment 29 Dave Allan 2010-08-11 02:49:24 UTC

libvirt-0_8_1-22_el6 has been built in RHEL-6-candidate with the fix.

Dave

Comment 30 wangyimiao 2010-08-12 02:50:12 UTC

Verified PASSED with 
libvirt-client-0.8.1-22.el6.x86_64
qemu-kvm-0.12.1.2-2.108.el6.x86_64
qemu-img-0.12.1.2-2.108.el6.x86_64

Comment 31 releng-rhel@redhat.com 2010-11-11 14:50:13 UTC

Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.