Bug 1346627
| Summary: | qemu discards EEH ioctl results | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | David Gibson <dgibson> |
| Component: | qemu-kvm-rhev | Assignee: | Laurent Vivier <lvivier> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 7.3 | CC: | dgibson, gwshan, hannsj_uhl, knoel, lvivier, qzhang, virt-maint, xuma |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | ppc64le | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | qemu-kvm-rhev-2.6.0-8.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-11-07 21:17:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1266833, 1359843 | ||
Description
David Gibson
2016-06-15 06:30:02 UTC
The patch proposed upstream works fine: the guest kernel correctly detects the EEH error.

Just a note: is it normal to always get QEMU errors while the card is "frozen"?

[ 258.886246] EEH: Frozen PHB#0-PE#1 detected
[ 258.886296] EEH: PE location: N/A, PHB location: N/A
[ 258.886336] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.0-422.el7.ppc64le #1
[ 258.886383] Call Trace:
...
2016-06-15T07:56:15.062352Z qemu-system-ppc64: vfio_pci_read_config(0003:09:00.0, 0x3d, 0x1) failed: No such device
2016-06-15T07:56:15.062469Z qemu-system-ppc64: vfio_pci_read_config(0003:09:00.1, 0x3d, 0x1) failed: No such device
2016-06-15T07:56:15.062541Z qemu-system-ppc64: vfio_pci_read_config(0003:09:00.2, 0x3d, 0x1) failed: No such device
2016-06-15T07:56:15.062609Z qemu-system-ppc64: vfio_pci_read_config(0003:09:00.3, 0x3d, 0x1) failed: No such device
...
[ 263.053839] EEH: Notify device drivers the completion of reset
[ 263.062763] EEH: Notify device driver to resume

Fix included in qemu-kvm-rhev-2.6.0-8.el7

Laurent, I'm not really sure if that behaviour is normal. Let's see if Gavin can answer that.

It's expected behaviour if 0003:09:00.0 is a Broadcom network adapter. In the early stage of EEH recovery, its config space has to be blocked because of a hardware defect. QEMU gets an error while accessing the config space, then prints those logs.

Reproduced the issue on the old version:
Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.6.0-4.el7.ppc64le
SLOF-20160223-4.gitdbbfda4.el7.noarch
host:3.10.0-453.el7.ppc64le
BE guest:3.10.0-445.el7.ppc64
LE guest:3.10.0-445.el7.ppc64le
Steps to Reproduce:
1. Load the related modules on the host:
# modprobe vfio
# modprobe vfio_spapr_eeh
# modprobe vfio_iommu_spapr_tce
# modprobe vfio_pci
2. Unbind the device from its host driver and bind it to vfio-pci:
# lspci -ns 0003:09:00.0
0003:09:00.0 0200: 14e4:1657 (rev 01)
# echo "14e4 1657" > /sys/bus/pci/drivers/vfio-pci/new_id
# echo 0003:09:00.0 > /sys/bus/pci/devices/0003\:09\:00.0/driver/unbind
# echo 0003:09:00.1 > /sys/bus/pci/devices/0003\:09\:00.1/driver/unbind
# echo 0003:09:00.2 > /sys/bus/pci/devices/0003\:09\:00.2/driver/unbind
# echo 0003:09:00.3 > /sys/bus/pci/devices/0003\:09\:00.3/driver/unbind
# echo 0003:09:00.0 > /sys/bus/pci/drivers/vfio-pci/bind
# echo 0003:09:00.1 > /sys/bus/pci/drivers/vfio-pci/bind
# echo 0003:09:00.2 > /sys/bus/pci/drivers/vfio-pci/bind
# echo 0003:09:00.3 > /sys/bus/pci/drivers/vfio-pci/bind
3. Boot the guest with the vfio-pci device:
/usr/libexec/qemu-kvm \
-name xuma-test \
-smp 4 \
-m 1024 \
-rtc base=utc,clock=vm \
-vnc :20 \
-qmp tcp:0:4444,server,nowait \
-usb \
-usbdevice tablet \
-nographic \
\
-device virtio-scsi-pci,bus=pci.0 \
\
-device scsi-hd,id=scsi-hd0,drive=scsi-hd0-dr0,bootindex=0 \
-drive file=minimal.qcow2,if=none,id=scsi-hd0-dr0,format=qcow2,cache=none \
\
-device scsi-cd,id=scsi-cd1,drive=scsi-cd1-dr1,bootindex=1 \
-drive file=RHEL-7.2-20150924.0-Server-ppc64le-dvd1.iso,if=none,id=scsi-cd1-dr1,readonly=on,format=raw,cache=none \
\
-device virtio-serial,id=virtio-serial0 \
-chardev socket,path=/tmp/qga.sock,server,nowait,id=qga0 \
-device virtserialport,bus=virtio-serial0.0,chardev=qga0,id=qemu-ga0,name=org.qemu.guest_agent.0 \
\
-device spapr-pci-vfio-host-bridge,id=vfiohost,iommu=1,index=0x1 \
-device vfio-pci,host=0003:09:00.0,bus=vfiohost.0,addr=0x1,id=vfio_dev
4. Check the vfio device in the guest and obtain an IP address with dhclient:
(guest)# lspci
0000:00:00.0 VGA compatible controller: Device 1234:1111 (rev 02)
0000:00:01.0 USB controller: Apple Inc. KeyLargo/Intrepid USB
0000:00:02.0 SCSI storage controller: Red Hat, Inc Virtio SCSI
0000:00:03.0 Communication controller: Red Hat, Inc Virtio console
0001:00:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
(guest)# dhclient enP1p0s1
5. Ping the guest from another host and trigger EEH on the host 6 times, until the vfio device goes offline:
(host)# echo 2:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0003/err_injct && lspci -ns 0003:09:00.0
Wait until pinging the guest resumes, then trigger EEH again:
(host)# echo 2:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0003/err_injct && lspci -ns 0003:09:00.0
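The per-function unbind/bind commands in step 2 could be folded into a loop. The following is only a sketch: the function name `rebind_slot_to_vfio` and the `SYSFS` override variable are hypothetical and not part of the original procedure, and it assumes the vendor/device ID has already been written to vfio-pci's `new_id` as in step 2.

```shell
#!/bin/sh
# Sketch: move every listed function of one PCI slot to the vfio-pci driver.
# SYSFS is a hypothetical override so the loop can be tried on a scratch tree.
SYSFS="${SYSFS:-/sys}"

rebind_slot_to_vfio() {
    slot="$1"            # domain:bus:device, e.g. 0003:09:00
    shift
    for fn in "$@"; do   # function numbers, e.g. 0 1 2 3
        dev="$slot.$fn"
        unbind="$SYSFS/bus/pci/devices/$dev/driver/unbind"
        # Detach from the current host driver only if one is bound.
        if [ -e "$unbind" ]; then
            echo "$dev" > "$unbind"
        fi
        # Hand the function over to vfio-pci.
        echo "$dev" > "$SYSFS/bus/pci/drivers/vfio-pci/bind"
    done
}

# Example (as root, on the host): rebind_slot_to_vfio 0003:09:00 0 1 2 3
```

All four functions are rebound because they belong to the same adapter, matching what the manual commands in step 2 do.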
Actual results:
The guest lost its network connection after EEH was triggered once, but the network card was still present.
The dmesg log shows the following:
[ 475.038247] Call Trace:
[ 475.038249] [c00000003ffdfcb0] [c0000000007ef478] dev_watchdog+0x388/0x3a0 (unreliable)
[ 475.038263] [c00000003ffdfd50] [c0000000000e8120] call_timer_fn+0x60/0x170
[ 475.038264] [c00000003ffdfdf0] [c0000000000ea5fc] run_timer_softirq+0x2bc/0x3c0
[ 475.038266] [c00000003ffdfea0] [c0000000000de734] __do_softirq+0x154/0x380
[ 475.038268] [c00000003ffdff90] [c000000000024fb8] call_do_softirq+0x14/0x24
[ 475.038270] [c000000001123950] [c000000000011760] do_softirq+0x120/0x170
[ 475.038272] [c000000001123990] [c0000000000decb4] irq_exit+0x1e4/0x1f0
[ 475.038273] [c0000000011239d0] [c00000000001f274] timer_interrupt+0xa4/0xe0
[ 475.038276] [c000000001123a00] [c000000000002914] decrementer_common+0x114/0x180
[ 475.038279] --- Exception: 901 at plpar_hcall_norets+0x8c/0xdc
LR = shared_cede_loop+0xb8/0xd0
[ 475.038284] [c000000001123cf0] [c000000080000000] 0xc000000080000000 (unreliable)
[ 475.038286] [c000000001123d60] [c00000000074d33c] cpuidle_idle_call+0x11c/0x3d0
[ 475.038288] [c000000001123dd0] [c000000000096148] pseries_lpar_idle+0x18/0x60
[ 475.038290] [c000000001123e30] [c000000000018138] arch_cpu_idle+0x68/0x160
[ 475.038292] [c000000001123e60] [c00000000015d670] cpu_startup_entry+0x290/0x300
[ 475.038294] [c000000001123ee0] [c00000000000c9ac] rest_init+0x9c/0xb0
[ 475.038311] [c000000001123f00] [c000000000c23dfc] start_kernel+0x4b4/0x4d0
[ 475.038313] [c000000001123f90] [c000000000009b6c] start_here_common+0x20/0xa8
[ 475.038314] Instruction dump:
[ 475.038315] 994d02a4 4bfffe40 7fc3f378 4bfd36f1 60000000 7fc4f378 7fe6fb78 7c651b78
[ 475.038318] 3c62ffa8 386353c0 48165929 60000000 <0fe00000> 39200001 3d02fff8 99288ae5
[ 475.038321] ---[ end trace 7536ed557dd64f30 ]---
[ 475.038326] tg3 0001:00:01.0 enP1p0s1: transmit timed out, resetting
[ 475.122979] tg3 0001:00:01.0 enP1p0s1: 0x00000000: 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff
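When repeating this test, a saved guest dmesg log could be sorted by the two signatures that appear in this report: the `EEH: Frozen` message seen when the guest detects the error, and the tg3 `transmit timed out` message seen above. This is a rough sketch; the function name `classify_guest_log` and its output labels are made up for illustration, and the matched strings are taken only from the logs quoted in this bug.

```shell
#!/bin/sh
# Sketch: classify a saved guest dmesg log by the messages quoted in this report.
classify_guest_log() {
    log="$1"   # path to a file containing guest dmesg output
    if grep -q 'EEH: Frozen' "$log"; then
        # The guest kernel noticed the frozen PE.
        echo "eeh-detected"
    elif grep -q 'transmit timed out' "$log"; then
        # The NIC driver timed out and no EEH detection was logged.
        echo "timeout-without-eeh"
    else
        echo "clean"
    fi
}

# Example (in the guest): dmesg > /tmp/guest.log; classify_guest_log /tmp/guest.log
```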
Verified the issue on the latest build:
Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.6.0-8.el7.ppc64le
SLOF-20160223-4.gitdbbfda4.el7.noarch
host:3.10.0-453.el7.ppc64le
BE guest:3.10.0-445.el7.ppc64
LE guest:3.10.0-445.el7.ppc64le
Steps to Verify:
1. Load the related modules on the host:
# modprobe vfio
# modprobe vfio_spapr_eeh
# modprobe vfio_iommu_spapr_tce
# modprobe vfio_pci
2. Unbind the device from its host driver and bind it to vfio-pci:
# lspci -ns 0003:09:00.0
0003:09:00.0 0200: 14e4:1657 (rev 01)
# echo "14e4 1657" > /sys/bus/pci/drivers/vfio-pci/new_id
# echo 0003:09:00.0 > /sys/bus/pci/devices/0003\:09\:00.0/driver/unbind
# echo 0003:09:00.1 > /sys/bus/pci/devices/0003\:09\:00.1/driver/unbind
# echo 0003:09:00.2 > /sys/bus/pci/devices/0003\:09\:00.2/driver/unbind
# echo 0003:09:00.3 > /sys/bus/pci/devices/0003\:09\:00.3/driver/unbind
# echo 0003:09:00.0 > /sys/bus/pci/drivers/vfio-pci/bind
# echo 0003:09:00.1 > /sys/bus/pci/drivers/vfio-pci/bind
# echo 0003:09:00.2 > /sys/bus/pci/drivers/vfio-pci/bind
# echo 0003:09:00.3 > /sys/bus/pci/drivers/vfio-pci/bind
3. Boot the guest with the vfio-pci device:
/usr/libexec/qemu-kvm \
-name xuma-test \
-smp 4 \
-m 1024 \
-rtc base=utc,clock=vm \
-vnc :20 \
-qmp tcp:0:4444,server,nowait \
-usb \
-usbdevice tablet \
-nographic \
\
-device virtio-scsi-pci,bus=pci.0 \
\
-device scsi-hd,id=scsi-hd0,drive=scsi-hd0-dr0,bootindex=0 \
-drive file=minimal.qcow2,if=none,id=scsi-hd0-dr0,format=qcow2,cache=none \
\
-device scsi-cd,id=scsi-cd1,drive=scsi-cd1-dr1,bootindex=1 \
-drive file=RHEL-7.2-20150924.0-Server-ppc64le-dvd1.iso,if=none,id=scsi-cd1-dr1,readonly=on,format=raw,cache=none \
\
-device virtio-serial,id=virtio-serial0 \
-chardev socket,path=/tmp/qga.sock,server,nowait,id=qga0 \
-device virtserialport,bus=virtio-serial0.0,chardev=qga0,id=qemu-ga0,name=org.qemu.guest_agent.0 \
\
-device spapr-pci-vfio-host-bridge,id=vfiohost,iommu=1,index=0x1 \
-device vfio-pci,host=0003:09:00.0,bus=vfiohost.0,addr=0x1,id=vfio_dev
4. Check the vfio device in the guest and obtain an IP address with dhclient:
(guest)# lspci
0000:00:00.0 VGA compatible controller: Device 1234:1111 (rev 02)
0000:00:01.0 USB controller: Apple Inc. KeyLargo/Intrepid USB
0000:00:02.0 SCSI storage controller: Red Hat, Inc Virtio SCSI
0000:00:03.0 Communication controller: Red Hat, Inc Virtio console
0001:00:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
(guest)# dhclient enP1p0s1
5. Ping the guest from another host and trigger EEH on the host 6 times, until the vfio device goes offline:
(host)# echo 2:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0003/err_injct && lspci -ns 0003:09:00.0
Wait until pinging the guest resumes, then trigger EEH again:
(host)# echo 2:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0003/err_injct && lspci -ns 0003:09:00.0
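Step 5 (inject, wait for ping to resume, inject again) could be scripted roughly as below. This is a sketch under assumptions: the helper names `inject_eeh` and `wait_for_guest` and the `DEBUGFS` override are hypothetical, the injection string is copied verbatim from the step above, and `ping -c 1 -W 1` assumes the Linux iputils `ping`.

```shell
#!/bin/sh
# Sketch: helpers for the "inject EEH, wait for the guest to answer ping" cycle.
# DEBUGFS is a hypothetical override so the helper can be tried on a scratch tree.
DEBUGFS="${DEBUGFS:-/sys/kernel/debug}"

inject_eeh() {
    phb="$1"   # PHB directory name, e.g. PCI0003
    # Same injection string as used in the manual step.
    echo "2:0:4:0:0" > "$DEBUGFS/powerpc/$phb/err_injct"
}

wait_for_guest() {
    guest="$1"   # guest IP address or hostname
    tries="$2"   # maximum number of one-second attempts
    i=0
    while [ "$i" -lt "$tries" ]; do
        if ping -c 1 -W 1 "$guest" >/dev/null 2>&1; then
            return 0   # guest answered
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1           # guest never answered within the limit
}

# Example (as root, on the host; GUEST_IP is a placeholder):
#   for n in 1 2 3 4 5 6; do
#       inject_eeh PCI0003
#       wait_for_guest "$GUEST_IP" 60 || break
#   done
```

The loop in the example stops early if the guest does not come back, which is the condition the manual procedure watches for by hand.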
Actual results:
The network disconnects for several seconds after EEH is triggered, then reconnects.
The guest's vfio device goes offline after EEH has been triggered 6 times and comes back online after the guest is rebooted.
There are no error messages in the dmesg log.
The bug has been fixed in qemu-kvm-rhev-2.6.0-8.el7.ppc64le
Hi David,

Could you help check whether the scenario we tested in comment 6 above is exactly what you fixed in this BZ? And could we call it verified now?

Thanks!
Qunfang

The procedure described in comment 6 should exercise the fix applied for this BZ. It's not the only way to trigger it, and it's not a minimal case to trigger it, but it should be sufficient.

Okay, thanks for the confirmation.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2673.html