1346627 – qemu discards EEH ioctl results

Summary: qemu discards EEH ioctl results

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	qemu-kvm-rhev
Sub Component:
Version:	7.3
Hardware:	ppc64le
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Laurent Vivier
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1266833 RHV4.1PPC

Reported:	2016-06-15 06:30 UTC by David Gibson
Modified:	2016-11-07 21:17 UTC
CC List:	8 users
Fixed In Version:	qemu-kvm-rhev-2.6.0-8.el7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2016-11-07 21:17:09 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2016:2673	0	normal	SHIPPED_LIVE	qemu-kvm-rhev bug fix and enhancement update	2016-11-08 01:06:13 UTC

Description David Gibson 2016-06-15 06:30:02 UTC

Description of problem:

When invoking EEH functions to handle guest EEH operations on a pass-through device, qemu discards any non-error return values from the ioctl().  This breaks those functions which rely on non-zero, non-error returns from ioctl().  There aren't many of those, but that's enough to break things.
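
To illustrate the failure mode, here is a minimal C sketch of the pattern described above. It is not the actual QEMU code: fake_eeh_ioctl, the FAKE_OP_* codes, and both wrapper names are hypothetical, and the fake stands in for the real ioctl() so that the example runs without VFIO hardware. The only assumption is the usual ioctl() convention of -1 plus errno on error and a non-negative value otherwise.

```c
#include <errno.h>

/* Hypothetical stand-in for ioctl(): returns -1 and sets errno on error,
 * otherwise a non-negative result that may carry information. */
typedef int (*eeh_ioctl_fn)(int fd, unsigned long op);

/* Hypothetical op codes, for illustration only (not the VFIO values). */
enum { FAKE_OP_ENABLE = 1, FAKE_OP_GET_STATE = 2, FAKE_OP_BAD = 3 };

/* Fake backend: GET_STATE reports its answer as the positive value 2. */
int fake_eeh_ioctl(int fd, unsigned long op)
{
    (void)fd; /* unused in the fake */
    switch (op) {
    case FAKE_OP_ENABLE:
        return 0;
    case FAKE_OP_GET_STATE:
        return 2;
    default:
        errno = ENODEV;
        return -1;
    }
}

/* Problem pattern: every successful call collapses to 0, so a
 * non-zero, non-error result never reaches the caller. */
int eeh_op_discarding(eeh_ioctl_fn do_ioctl, int fd, unsigned long op)
{
    int ret = do_ioctl(fd, op);
    if (ret < 0) {
        return -errno;
    }
    return 0;
}

/* Corrected pattern: errors are still mapped to -errno, but a
 * successful result is passed through unchanged. */
int eeh_op_preserving(eeh_ioctl_fn do_ioctl, int fd, unsigned long op)
{
    int ret = do_ioctl(fd, op);
    if (ret < 0) {
        return -errno;
    }
    return ret;
}
```

With the fake backend, eeh_op_discarding(fake_eeh_ioctl, -1, FAKE_OP_GET_STATE) yields 0 while eeh_op_preserving() yields 2, so only the second lets a caller see the value the ioctl reported. Both behave identically for operations that return 0 on success, which is consistent with only a few functions being affected.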

Version-Release number of selected component (if applicable):

qemu-kvm-rhev-2.6.0-5.el7.ppc64le

How reproducible:

100%, but requires very specific setup and operations.

Steps to Reproduce:

See bug 1266833 for one example which hits this bug.

Additional info:

An upstream patch has just been posted for this and merged to my ppc-for-2.7 staging tree.  Once it's merged upstream we'll want to pull it downstream.

Comment 1 Laurent Vivier 2016-06-15 08:02:21 UTC

The patch proposed upstream works fine:

the guest kernel correctly detects the EEH error.

Just a note:

Is it normal to always get QEMU errors while the card is "frozen"?


[  258.886246] EEH: Frozen PHB#0-PE#1 detected
[  258.886296] EEH: PE location: N/A, PHB location: N/A
[  258.886336] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.0-422.el7.ppc64le #1
[  258.886383] Call Trace:
...
2016-06-15T07:56:15.062352Z qemu-system-ppc64: vfio_pci_read_config(0003:09:00.0, 0x3d, 0x1) failed: No such device
2016-06-15T07:56:15.062469Z qemu-system-ppc64: vfio_pci_read_config(0003:09:00.1, 0x3d, 0x1) failed: No such device
2016-06-15T07:56:15.062541Z qemu-system-ppc64: vfio_pci_read_config(0003:09:00.2, 0x3d, 0x1) failed: No such device
2016-06-15T07:56:15.062609Z qemu-system-ppc64: vfio_pci_read_config(0003:09:00.3, 0x3d, 0x1) failed: No such device
...
[  263.053839] EEH: Notify device drivers the completion of reset
[  263.062763] EEH: Notify device driver to resume

Comment 2 Miroslav Rezanina 2016-06-21 07:04:15 UTC

Fix included in qemu-kvm-rhev-2.6.0-8.el7

Comment 4 David Gibson 2016-06-28 02:43:56 UTC

Laurent,  I'm not really sure if that behaviour is normal.  Let's see if Gavin can answer that.

Comment 5 Gavin Shan 2016-06-28 07:12:23 UTC

This is expected behaviour if 0003:09:00.0 is a Broadcom network adapter. In the early stage of EEH recovery, its config space has to be blocked because of a hardware defect. QEMU gets an error when accessing the config space and then prints those log messages.

Comment 6 Xujun Ma 2016-07-06 05:31:14 UTC

Reproduced the issue on the old version:

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.6.0-4.el7.ppc64le
SLOF-20160223-4.gitdbbfda4.el7.noarch
host:3.10.0-453.el7.ppc64le
BE guest:3.10.0-445.el7.ppc64
LE guest:3.10.0-445.el7.ppc64le

Steps to Reproduce:
1. Load the related modules on the host:
# modprobe vfio
# modprobe vfio_spapr_eeh
# modprobe vfio_iommu_spapr_tce
# modprobe vfio_pci

2. Unbind the device from the host driver and bind it to the vfio-pci driver:
# lspci -ns 0003:09:00.0
0003:09:00.0 0200: 14e4:1657 (rev 01)

echo "14e4 1657" > /sys/bus/pci/drivers/vfio-pci/new_id
echo 0003:09:00.0 >/sys/bus/pci/devices/0003\:09\:00.0/driver/unbind
echo 0003:09:00.1 >/sys/bus/pci/devices/0003\:09\:00.1/driver/unbind
echo 0003:09:00.2 >/sys/bus/pci/devices/0003\:09\:00.2/driver/unbind
echo 0003:09:00.3 >/sys/bus/pci/devices/0003\:09\:00.3/driver/unbind
echo 0003:09:00.0 >/sys/bus/pci/drivers/vfio-pci/bind
echo 0003:09:00.1 >/sys/bus/pci/drivers/vfio-pci/bind
echo 0003:09:00.2 >/sys/bus/pci/drivers/vfio-pci/bind
echo 0003:09:00.3 >/sys/bus/pci/drivers/vfio-pci/bind

3. Boot the guest with the vfio-pci device:
/usr/libexec/qemu-kvm \
 -name xuma-test \
 -smp 4 \
 -m 1024 \
 -rtc base=utc,clock=vm \
 -vnc :20 \
 -qmp tcp:0:4444,server,nowait \
 -usb \
 -usbdevice tablet \
 -nographic \
\
 -device virtio-scsi-pci,bus=pci.0 \
\
 -device scsi-hd,id=scsi-hd0,drive=scsi-hd0-dr0,bootindex=0 \
 -drive file=minimal.qcow2,if=none,id=scsi-hd0-dr0,format=qcow2,cache=none \
\
 -device scsi-cd,id=scsi-cd1,drive=scsi-cd1-dr1,bootindex=1 \
 -drive file=RHEL-7.2-20150924.0-Server-ppc64le-dvd1.iso,if=none,id=scsi-cd1-dr1,readonly=on,format=raw,cache=none \
\
 -device virtio-serial,id=virtio-serial0 \
 -chardev socket,path=/tmp/qga.sock,server,nowait,id=qga0 \
 -device virtserialport,bus=virtio-serial0.0,chardev=qga0,id=qemu-ga0,name=org.qemu.guest_agent.0 \
\
 -device spapr-pci-vfio-host-bridge,id=vfiohost,iommu=1,index=0x1 \
 -device vfio-pci,host=0003:09:00.0,bus=vfiohost.0,addr=0x1,id=vfio_dev

4. Check the vfio device in the guest and obtain an IP address with dhclient:
(guest)# lspci
0000:00:00.0 VGA compatible controller: Device 1234:1111 (rev 02)
0000:00:01.0 USB controller: Apple Inc. KeyLargo/Intrepid USB
0000:00:02.0 SCSI storage controller: Red Hat, Inc Virtio SCSI
0000:00:03.0 Communication controller: Red Hat, Inc Virtio console
0001:00:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

# dhclient enP1p0s1


5. Ping the guest from another host and trigger EEH on the host 6 times, until the vfio device goes offline:
(host)# echo 2:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0003/err_injct && lspci -ns 0003:09:00.0
Wait until pinging the guest resumes, then trigger EEH again:
# echo 2:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0003/err_injct && lspci -ns 0003:09:00.0

Actual results:
The guest lost its network connection after EEH was triggered once, but the network card still exists.
The dmesg log shows the following:
[  475.038247] Call Trace:
[  475.038249] [c00000003ffdfcb0] [c0000000007ef478] dev_watchdog+0x388/0x3a0 (unreliable)
[  475.038263] [c00000003ffdfd50] [c0000000000e8120] call_timer_fn+0x60/0x170
[  475.038264] [c00000003ffdfdf0] [c0000000000ea5fc] run_timer_softirq+0x2bc/0x3c0
[  475.038266] [c00000003ffdfea0] [c0000000000de734] __do_softirq+0x154/0x380
[  475.038268] [c00000003ffdff90] [c000000000024fb8] call_do_softirq+0x14/0x24
[  475.038270] [c000000001123950] [c000000000011760] do_softirq+0x120/0x170
[  475.038272] [c000000001123990] [c0000000000decb4] irq_exit+0x1e4/0x1f0
[  475.038273] [c0000000011239d0] [c00000000001f274] timer_interrupt+0xa4/0xe0
[  475.038276] [c000000001123a00] [c000000000002914] decrementer_common+0x114/0x180
[  475.038279] --- Exception: 901 at plpar_hcall_norets+0x8c/0xdc
    LR = shared_cede_loop+0xb8/0xd0
[  475.038284] [c000000001123cf0] [c000000080000000] 0xc000000080000000 (unreliable)
[  475.038286] [c000000001123d60] [c00000000074d33c] cpuidle_idle_call+0x11c/0x3d0
[  475.038288] [c000000001123dd0] [c000000000096148] pseries_lpar_idle+0x18/0x60
[  475.038290] [c000000001123e30] [c000000000018138] arch_cpu_idle+0x68/0x160
[  475.038292] [c000000001123e60] [c00000000015d670] cpu_startup_entry+0x290/0x300
[  475.038294] [c000000001123ee0] [c00000000000c9ac] rest_init+0x9c/0xb0
[  475.038311] [c000000001123f00] [c000000000c23dfc] start_kernel+0x4b4/0x4d0
[  475.038313] [c000000001123f90] [c000000000009b6c] start_here_common+0x20/0xa8
[  475.038314] Instruction dump:
[  475.038315] 994d02a4 4bfffe40 7fc3f378 4bfd36f1 60000000 7fc4f378 7fe6fb78 7c651b78 
[  475.038318] 3c62ffa8 386353c0 48165929 60000000 <0fe00000> 39200001 3d02fff8 99288ae5 
[  475.038321] ---[ end trace 7536ed557dd64f30 ]---
[  475.038326] tg3 0001:00:01.0 enP1p0s1: transmit timed out, resetting
[  475.122979] tg3 0001:00:01.0 enP1p0s1: 0x00000000: 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff



Verified the fix on the latest build:
Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.6.0-8.el7.ppc64le
SLOF-20160223-4.gitdbbfda4.el7.noarch
host:3.10.0-453.el7.ppc64le
BE guest:3.10.0-445.el7.ppc64
LE guest:3.10.0-445.el7.ppc64le


Steps to Verify:
1. Load the related modules on the host:
# modprobe vfio
# modprobe vfio_spapr_eeh
# modprobe vfio_iommu_spapr_tce
# modprobe vfio_pci

2. Unbind the device from the host driver and bind it to the vfio-pci driver:
# lspci -ns 0003:09:00.0
0003:09:00.0 0200: 14e4:1657 (rev 01)

echo "14e4 1657" > /sys/bus/pci/drivers/vfio-pci/new_id
echo 0003:09:00.0 >/sys/bus/pci/devices/0003\:09\:00.0/driver/unbind
echo 0003:09:00.1 >/sys/bus/pci/devices/0003\:09\:00.1/driver/unbind
echo 0003:09:00.2 >/sys/bus/pci/devices/0003\:09\:00.2/driver/unbind
echo 0003:09:00.3 >/sys/bus/pci/devices/0003\:09\:00.3/driver/unbind
echo 0003:09:00.0 >/sys/bus/pci/drivers/vfio-pci/bind
echo 0003:09:00.1 >/sys/bus/pci/drivers/vfio-pci/bind
echo 0003:09:00.2 >/sys/bus/pci/drivers/vfio-pci/bind
echo 0003:09:00.3 >/sys/bus/pci/drivers/vfio-pci/bind

3. Boot the guest with the vfio-pci device:
/usr/libexec/qemu-kvm \
 -name xuma-test \
 -smp 4 \
 -m 1024 \
 -rtc base=utc,clock=vm \
 -vnc :20 \
 -qmp tcp:0:4444,server,nowait \
 -usb \
 -usbdevice tablet \
 -nographic \
\
 -device virtio-scsi-pci,bus=pci.0 \
\
 -device scsi-hd,id=scsi-hd0,drive=scsi-hd0-dr0,bootindex=0 \
 -drive file=minimal.qcow2,if=none,id=scsi-hd0-dr0,format=qcow2,cache=none \
\
 -device scsi-cd,id=scsi-cd1,drive=scsi-cd1-dr1,bootindex=1 \
 -drive file=RHEL-7.2-20150924.0-Server-ppc64le-dvd1.iso,if=none,id=scsi-cd1-dr1,readonly=on,format=raw,cache=none \
\
 -device virtio-serial,id=virtio-serial0 \
 -chardev socket,path=/tmp/qga.sock,server,nowait,id=qga0 \
 -device virtserialport,bus=virtio-serial0.0,chardev=qga0,id=qemu-ga0,name=org.qemu.guest_agent.0 \
\
 -device spapr-pci-vfio-host-bridge,id=vfiohost,iommu=1,index=0x1 \
 -device vfio-pci,host=0003:09:00.0,bus=vfiohost.0,addr=0x1,id=vfio_dev

4. Check the vfio device in the guest and obtain an IP address with dhclient:
(guest)# lspci
0000:00:00.0 VGA compatible controller: Device 1234:1111 (rev 02)
0000:00:01.0 USB controller: Apple Inc. KeyLargo/Intrepid USB
0000:00:02.0 SCSI storage controller: Red Hat, Inc Virtio SCSI
0000:00:03.0 Communication controller: Red Hat, Inc Virtio console
0001:00:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

# dhclient enP1p0s1


5. Ping the guest from another host and trigger EEH on the host 6 times, until the vfio device goes offline:
(host)# echo 2:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0003/err_injct && lspci -ns 0003:09:00.0
Wait until pinging the guest resumes, then trigger EEH again:
# echo 2:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0003/err_injct && lspci -ns 0003:09:00.0

Actual results:
The network disconnects for several seconds after EEH is triggered, then reconnects.
The guest's vfio device goes offline after EEH is triggered 6 times, and comes back online after the guest is rebooted.
There are no error messages in the dmesg log.

The bug has been fixed in qemu-kvm-rhev-2.6.0-8.el7.ppc64le.

Comment 7 Qunfang Zhang 2016-07-11 05:56:49 UTC

Hi, David

Could you help check whether the scenario we tested in comment 6 above is exactly what you fixed in this BZ? And can we call it verified now?

Thanks!
Qunfang

Comment 8 David Gibson 2016-07-12 01:24:44 UTC

The procedure described in comment 6 should exercise the fix applied for this BZ.  It's not the only way to trigger it, and it's not a minimal case to trigger it, but it should be sufficient.

Comment 9 Qunfang Zhang 2016-07-12 01:34:01 UTC

Okay, thanks for the confirmation.

Comment 11 errata-xmlrpc 2016-11-07 21:17:09 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2673.html
