Bug 1261708
| Summary: | Guest gets paused after unplugging a PCI device | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Dan Zheng <dzheng> |
| Component: | libvirt | Assignee: | Andrea Bolognani <abologna> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 7.2 | CC: | abologna, dgibson, dyuan, dzheng, gklein, gsun, hannsj_uhl, jsuchane, lmiksik, mzhan, rbalakri, zhwang |
| Target Milestone: | rc | Keywords: | Reopened, TestOnly |
| Target Release: | --- | ||
| Hardware: | ppc64le | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2015-11-19 06:54:17 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1259556 | ||
| Bug Blocks: | 1201513, 1277183, 1277184 | ||
Retest will be executed after 1259556 is on QA.

Can you post the output of # virsh nodedev-dumpxml pci_0002_01_00_0 please?

Ran the command with the packages below:
libvirt-daemon-1.2.17-9.el7.ppc64le
qemu-kvm-rhev-2.3.0-24.el7.ppc64le
kernel-3.10.0-316.el7.ppc64le
# virsh nodedev-dumpxml pci_0002_01_00_0
<device>
<name>pci_0002_01_00_0</name>
<path>/sys/devices/pci0002:00/0002:00:00.0/0002:01:00.0</path>
<parent>pci_0002_00_00_0</parent>
<driver>
<name>be2net</name>
</driver>
<capability type='pci'>
<domain>2</domain>
<bus>1</bus>
<slot>0</slot>
<function>0</function>
<product id='0xe220'>OneConnect NIC (Lancer)</product>
<vendor id='0x10df'>Emulex Corporation</vendor>
<iommuGroup number='1'>
<address domain='0x0002' bus='0x01' slot='0x00' function='0x0'/>
<address domain='0x0002' bus='0x01' slot='0x00' function='0x1'/>
<address domain='0x0002' bus='0x01' slot='0x00' function='0x2'/>
<address domain='0x0002' bus='0x01' slot='0x00' function='0x3'/>
<address domain='0x0002' bus='0x01' slot='0x00' function='0x4'/>
<address domain='0x0002' bus='0x01' slot='0x00' function='0x5'/>
</iommuGroup>
<numa node='0'/>
<pci-express>
<link validity='cap' port='0' speed='8' width='8'/>
<link validity='sta' speed='8' width='8'/>
</pci-express>
</capability>
</device>
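
The `<iommuGroup>` element in the output above is what matters for this bug, since every function in the group has to be handled together for VFIO passthrough. As a minimal sketch (not part of the original report), the group membership could be pulled out of the `nodedev-dumpxml` output with the Python standard library. The function name `iommu_group_members` and the trimmed `NODEDEV_XML` sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the nodedev-dumpxml output quoted above (illustrative sample).
NODEDEV_XML = """<device>
<name>pci_0002_01_00_0</name>
<capability type='pci'>
<iommuGroup number='1'>
<address domain='0x0002' bus='0x01' slot='0x00' function='0x0'/>
<address domain='0x0002' bus='0x01' slot='0x00' function='0x1'/>
<address domain='0x0002' bus='0x01' slot='0x00' function='0x2'/>
<address domain='0x0002' bus='0x01' slot='0x00' function='0x3'/>
<address domain='0x0002' bus='0x01' slot='0x00' function='0x4'/>
<address domain='0x0002' bus='0x01' slot='0x00' function='0x5'/>
</iommuGroup>
</capability>
</device>"""


def iommu_group_members(nodedev_xml):
    """Return (group number, ['dddd:bb:ss.f', ...]) for a PCI node device.

    Hypothetical helper: returns (None, []) when no <iommuGroup> is present.
    """
    root = ET.fromstring(nodedev_xml)
    group = root.find("./capability[@type='pci']/iommuGroup")
    if group is None:
        return None, []
    members = []
    for addr in group.findall("address"):
        # Attribute values are hex strings such as '0x0002'.
        domain = int(addr.get("domain"), 16)
        bus = int(addr.get("bus"), 16)
        slot = int(addr.get("slot"), 16)
        function = int(addr.get("function"), 16)
        members.append(f"{domain:04x}:{bus:02x}:{slot:02x}.{function:x}")
    return int(group.get("number")), members


if __name__ == "__main__":
    number, members = iommu_group_members(NODEDEV_XML)
    print(f"IOMMU group {number}: {', '.join(members)}")
```

The addresses are formatted the same way `lspci` prints them, so the result can be compared directly against the host's device list.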
Thanks.

Is any of the ports assigned to the host?
Or is it using a different Ethernet card altogether?

(In reply to Andrea Bolognani from comment #5)
> Thanks.
>
> Is any of the ports assigned to the host?
> Or is it using a different Ethernet card altogether?

Andrea,
Before starting the guest, I ran nodedev-detach on all the PCI devices in this iommuGroup to detach them from the host, then started the guest. After that, I detached one of them.
That host has two Ethernet cards, and I used one of them. The 6 PCI devices in that iommuGroup all belong to one card.
Does the above answer your questions?

# lspci
...
0002:01:00.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
0002:01:00.1 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
0002:01:00.2 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
0002:01:00.3 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
0002:01:00.4 Fibre Channel: Emulex Corporation OneConnect FCoE Initiator (Lancer) (rev 10)
0002:01:00.5 Fibre Channel: Emulex Corporation OneConnect FCoE Initiator (Lancer) (rev 10)
...
0003:09:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0003:09:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0003:09:00.2 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0003:09:00.3 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

*************************************************************
Today I tested again, and the same error happened.
The host is installed with the snapshot 2 tree RHEL-7.2-20150917.0 Server.
libvirt-1.2.17-9.el7.ppc64le
qemu-kvm-rhev-2.3.0-23.el7.ppc64le (replaces qemu-kvm-rhev-2.3.0-24.el7.ppc64le due to bug 1264845)
kernel-3.10.0-316.el7.ppc64le
Guest: kernel-3.10.0-316.el7.ppc64le

The host has only one Ethernet card.
# lspci
...
0003:03:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller (rev 02)
0003:09:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0003:09:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0003:09:00.2 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0003:09:00.3 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

<device>
<name>pci_0003_09_00_0</name>
<path>/sys/devices/pci0003:00/0003:00:00.0/0003:01:00.0/0003:02:09.0/0003:09:00.0</path>
<parent>pci_0003_02_09_0</parent>
<driver>
<name>vfio-pci</name>
</driver>
<capability type='pci'>
<domain>3</domain>
<bus>9</bus>
<slot>0</slot>
<function>0</function>
<product id='0x1657'>NetXtreme BCM5719 Gigabit Ethernet PCIe</product>
<vendor id='0x14e4'>Broadcom Corporation</vendor>
<iommuGroup number='1'>
<address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
<address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
<address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
<address domain='0x0003' bus='0x09' slot='0x00' function='0x3'/>
</iommuGroup>
<numa node='0'/>
<pci-express>
<link validity='cap' port='0' speed='2.5' width='4'/>
<link validity='sta' speed='2.5' width='4'/>
</pci-express>
</capability>
</device>

Start the guest with below four pci devices which will all be detached from host automatically as managed=yes.
<hostdev mode='subsystem' type='pci' managed='yes'>
<driver name='vfio'/>
<source>
<address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
</source>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<driver name='vfio'/>
<source>
<address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
</source>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<driver name='vfio'/>
<source>
<address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
</source>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<driver name='vfio'/>
<source>
<address domain='0x0003' bus='0x09' slot='0x00' function='0x3'/>
</source>
</hostdev>

The guest is running. Log in to the guest.
# lspci
00:06.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
00:07.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
00:08.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
00:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

On the host, detaching PCI device 0003:09:00.2 from the guest succeeds.
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
</source>
</hostdev>

Checking dumpxml shows that the XML has already been updated to remove this PCI device, but the guest is paused with the same error messages as before. I also got the error below on the host.
# lspci
pcilib: Cannot open /sys/bus/pci/devices/0003:09:00.3/config
lspci: Unable to read the standard configuration space header of device 0003:09:00.3
...
0003:09:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0003:09:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
0003:09:00.2 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
----Below missing---
0003:09:00.3
...
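
The `<hostdev>` entries in the comment above differ only in the `function` attribute of the source address. A minimal sketch of generating such an element with the Python standard library follows. The helper name `hostdev_element` is hypothetical, and the attribute set simply mirrors the XML quoted in this report rather than the full libvirt schema.

```python
import xml.etree.ElementTree as ET


def hostdev_element(domain, bus, slot, function, managed=True):
    """Build a <hostdev> element for VFIO PCI passthrough (illustrative only)."""
    hostdev = ET.Element(
        "hostdev",
        {
            "mode": "subsystem",
            "type": "pci",
            "managed": "yes" if managed else "no",
        },
    )
    ET.SubElement(hostdev, "driver", {"name": "vfio"})
    source = ET.SubElement(hostdev, "source")
    # Hex formatting matches the style used in the report, e.g. domain='0x0003'.
    ET.SubElement(
        source,
        "address",
        {
            "domain": f"0x{domain:04x}",
            "bus": f"0x{bus:02x}",
            "slot": f"0x{slot:02x}",
            "function": f"0x{function:x}",
        },
    )
    return hostdev


if __name__ == "__main__":
    # One element per function of 0003:09:00.x, as in the guest definition above.
    for fn in range(4):
        print(ET.tostring(hostdev_element(3, 9, 0, fn), encoding="unicode"))
```

The serialized output uses double-quoted attributes and no line breaks, which is equivalent XML but not byte-identical to the snippets pasted in this bug.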
Yes, that's exactly the information I was looking for.

I just wanted to make sure that there's no obvious reason why the setup you're using wouldn't work, and that doesn't seem to be the case.

I'm now confident the issues you're facing will go away as soon as Bug 1259556 has been fixed.

Thanks for your help.

Product Management has reviewed and declined this request. You may appeal this decision by reopening this request.

Test with packages below:
libvirt-1.2.17-13.el7.ppc64le
qemu-kvm-rhev-2.3.0-29.el7.ppc64le
kernel-3.10.0-322.el7.ppc64le
Guest kernel: kernel-3.10.0-322.el7.ppc64le
1. Detach the device pci_0003_09_00_0 from the host.
# virsh nodedev-dumpxml pci_0003_09_00_0
<device>
<name>pci_0003_09_00_0</name>
<path>/sys/devices/pci0003:00/0003:00:00.0/0003:01:00.0/0003:02:09.0/0003:09:00.0</path>
<parent>pci_0003_02_09_0</parent>
<driver>
<name>tg3</name>
</driver>
<capability type='pci'>
<domain>3</domain>
<bus>9</bus>
<slot>0</slot>
<function>0</function>
<product id='0x1657'>NetXtreme BCM5719 Gigabit Ethernet PCIe</product>
<vendor id='0x14e4'>Broadcom Corporation</vendor>
<iommuGroup number='1'>
<address domain='0x0003' bus='0x09' slot='0x00' function='0x0'/>
<address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
<address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
<address domain='0x0003' bus='0x09' slot='0x00' function='0x3'/>
</iommuGroup>
<numa node='0'/>
<pci-express>
<link validity='cap' port='0' speed='2.5' width='4'/>
<link validity='sta' speed='2.5' width='4'/>
</pci-express>
</capability>
</device>
# virsh nodedev-detach pci_0003_09_00_0
...Successful.
# virsh nodedev-reset pci_0003_09_00_0
...Successful.
2. Start the guest with 3 host PCI devices. The guest is running.
<hostdev mode='subsystem' type='pci' managed='yes'>
<driver name='vfio'/>
<source>
<address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
</source>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<driver name='vfio'/>
<source>
<address domain='0x0003' bus='0x09' slot='0x00' function='0x2'/>
</source>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<driver name='vfio'/>
<source>
<address domain='0x0003' bus='0x09' slot='0x00' function='0x3'/>
</source>
</hostdev>
3. Check that the PCI devices are displayed in the guest: yes, they are.
4. Detach/attach a PCI device from/to the guest.
unplug.xml:
<hostdev mode='subsystem' type='pci' managed='yes'>
<driver name='vfio'/>
<source>
<address domain='0x0003' bus='0x09' slot='0x00' function='0x1'/>
</source>
</hostdev>
# virsh detach-device virt-tests-vm1 unplug.xml ---> using pci_0003_09_00_1
Successful.
# virsh attach-device virt-tests-vm1 unplug.xml ---> using pci_0003_09_00_0
Successful.
5. Check dumpxml of the guest and it does get updated.
6. Check the lspci within the guest and it does get updated.
7. Repeat steps 4 - 6 with the other PCI devices in the same IOMMU group, such as pci_0003_09_00_3 and pci_0003_09_00_2. This works as expected, apart from the unexpected guest crash and reboot, which is tracked by bug 1270636.
The guest no longer gets paused.
So I am marking this as verified, as the original issue does not happen any more.
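
The verification steps above switch between node device names such as pci_0003_09_00_1 and the `<address>` attributes used in unplug.xml. As a rough sketch, the mapping between the two can be written as below. It assumes the `pci_DDDD_BB_SS_F` naming seen in this report, and the function name `nodedev_to_address` is made up for illustration.

```python
import re

# Matches names like pci_0003_09_00_1: domain, bus, slot, function (all hex).
_NODEDEV_RE = re.compile(
    r"^pci_([0-9a-fA-F]{4})_([0-9a-fA-F]{2})_([0-9a-fA-F]{2})_([0-7])$"
)


def nodedev_to_address(name):
    """Map a PCI node device name to <address> attribute values.

    Hypothetical helper; raises ValueError for names it does not recognize.
    """
    match = _NODEDEV_RE.match(name)
    if match is None:
        raise ValueError(f"not a PCI node device name: {name!r}")
    domain, bus, slot, function = match.groups()
    return {
        "domain": f"0x{domain}",
        "bus": f"0x{bus}",
        "slot": f"0x{slot}",
        "function": f"0x{function}",
    }


if __name__ == "__main__":
    for dev in ("pci_0003_09_00_1", "pci_0003_09_00_2", "pci_0003_09_00_3"):
        print(dev, nodedev_to_address(dev))
```

A check like this makes it easier to confirm that the XML file passed to `virsh detach-device` names the same function as the step description.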
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2202.html
The guest gets paused after unplugging a PCI device, and an unrecoverable error is detected.

Packages:
kernel-3.10.0-313.el7.ppc64le
qemu-kvm-rhev-2.3.0-22.el7.ppc64le
libvirt-daemon-1.2.17-7.el7.ppc64le
Guest: kernel-3.10.0-313.el7.ppc64le

Steps:
0. Prepare the environment on the host.
# modprobe vfio
# modprobe vfio_spapr_eeh
# modprobe vfio_iommu_spapr_tce
# modprobe vfio_pci
1. Start a guest with 3 PCI devices passed through to the guest.
# virsh start dzhengvm2
Domain dzhengvm2 started
# virsh dumpxml dzhengvm2|grep hostdev -A5
<hostdev mode='subsystem' type='pci' managed='yes'>
<driver name='vfio'/>
<source>
<address domain='0x0002' bus='0x01' slot='0x00' function='0x0'/>
</source>
<alias name='hostdev0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<driver name='vfio'/>
<source>
<address domain='0x0002' bus='0x01' slot='0x00' function='0x1'/>
</source>
<alias name='hostdev1'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
<driver name='vfio'/>
<source>
<address domain='0x0002' bus='0x01' slot='0x00' function='0x3'/>
</source>
<alias name='hostdev2'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
</hostdev>
2. Check within the guest; three PCI devices are displayed.
[root@localhost ~]# lspci
...
00:08.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
00:09.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
00:0a.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer) (rev 10)
3. Unplug one PCI device pci_0002_01_00_1
unplugPF.xml:
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0002' bus='0x01' slot='0x00' function='0x3'/>
</source>
</hostdev>
# virsh detach-device dzhengvm2 unplugPF.xml
Device detached successfully
4. Check guest state
# virsh list --all
Id Name State
----------------------------------------------------
21 d1 paused
5. Check qemu log
2015-09-06T08:19:44.390589Z qemu-kvm: vfio_err_notifier_handler(0002:01:00.1) Unrecoverable error detected. Please collect any data possible and then kill the guest
2015-09-06T08:19:44.413427Z qemu-kvm: vfio_err_notifier_handler(0002:01:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest

Additional information:
This is separated from bug 1245004 and this bug also blocks the reproduction of bug 1245004.
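
The qemu log lines quoted in the description identify which passed-through functions reported the error. A small sketch for extracting them is shown below. The regular expression is an assumption based only on the two log lines in this report, and `vfio_unrecoverable_errors` is a hypothetical helper name.

```python
import re

# The two log lines quoted in the bug description.
QEMU_LOG = (
    "2015-09-06T08:19:44.390589Z qemu-kvm: vfio_err_notifier_handler(0002:01:00.1) "
    "Unrecoverable error detected. Please collect any data possible and then kill the guest\n"
    "2015-09-06T08:19:44.413427Z qemu-kvm: vfio_err_notifier_handler(0002:01:00.0) "
    "Unrecoverable error detected. Please collect any data possible and then kill the guest\n"
)

# Captures the timestamp and the PCI address inside the parentheses.
_ERR_RE = re.compile(
    r"^(\S+) qemu-kvm: vfio_err_notifier_handler\(([0-9a-fA-F:.]+)\) "
    r"Unrecoverable error detected"
)


def vfio_unrecoverable_errors(log_text):
    """Return a list of (timestamp, pci_address) for matching log lines."""
    hits = []
    for line in log_text.splitlines():
        match = _ERR_RE.match(line.strip())
        if match:
            hits.append((match.group(1), match.group(2)))
    return hits


if __name__ == "__main__":
    for timestamp, address in vfio_unrecoverable_errors(QEMU_LOG):
        print(f"{timestamp}: unrecoverable VFIO error on {address}")
```

In this report the affected addresses are 0002:01:00.1 and 0002:01:00.0, both functions that were still in the IOMMU group used by the guest.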