Bug 1373802
Summary: | Network can't recover when trigger EEH one time | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Xujun Ma <xuma> |
Component: | qemu-kvm-rhev | Assignee: | David Gibson <dgibson> |
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 7.3 | CC: | alex.williamson, bugproxy, dgibson, hannsj_uhl, knoel, lmiksik, lvivier, michen, mrezanin, qzhang, thuth, virt-maint |
Target Milestone: | rc | Keywords: | Regression |
Target Release: | 7.3 | ||
Hardware: | ppc64le | ||
OS: | Linux | ||
Fixed In Version: | qemu-kvm-rhev-2.6.0-26.el7 | Doc Type: | If docs needed, set a value |
Last Closed: | 2016-11-07 21:35:30 UTC | Type: | Bug |
Bug Blocks: | 1359843 |
Description
Xujun Ma
2016-09-07 07:38:08 UTC
Xujun, Qunfang,

1) The example uses spapr-pci-vfio-host-bridge. This is deprecated in RHEL7.3 and should no longer be used. However, it's unlikely to be the cause of this problem.

2) I'm a little confused by the problem description. Step 5 says you injected the EEH error 6 times, until the device went offline. But elsewhere it says the problem occurs when EEH is triggered one time? Can you please clarify the situation?

(In reply to David Gibson from comment #4)
> Xujun, Qunfang,
>
> 1)
>
> The example uses spapr-pci-vfio-host-bridge. This is deprecated in RHEL7.3
> and should no longer be used. However, it's unlikely to be the cause of
> this problem.

Right, we highlighted it within team and updated in test plan to use spapr-pci-host-bridge, this may be a mis-copypaste from bug 1266833.

> 2)
>
> I'm a little confused by the problem description. Step 5 says you injected
> the EEH error 6 times, until the device went offline. But elsewhere it says
> the problem occurs when EEH is triggered one time?
>
> Can you please clarify the situation?

Yes, I confirmed this with Xujun earlier this morning as well since it's confusing, he could reproduce it *at the first time* when EEH is triggered, just pasted the test steps from Bug 1266833 which uses a "6 times" step. He said he wanted to trigger it 6 times however he reproduced the bug at the first attempt.

Ok, I've finally had a chance to investigate this in some depth. I've reproduced the problem for myself. Interestingly it triggers for the Broadcom BCM5719 NIC, but *not* for the Emulex OneConnect NIC in the same machine. This makes investigation harder, since the Broadcom NIC is used for the host's networking. Still looking..

I've tracked the regression to downstream commit 4f78268 "kvm-irqchip: simplify kvm_irqchip_add_msi_route". There was a bug that was noticed upstream in that patch, but the fix for it has already been backported. This looks to be a different bug, still investigating...
I've confirmed the regression happens with the upstream version of the patch, d1f6af6a17a66f58c238e1c26b928cf71c0c11da, as well. I had a lengthy IRC discussion with Peter Xu (author of the patch causing the regression) and Gavin Shan (IBM EEH expert). None of us have yet figured out how the patch could be causing the regression.

I've now confirmed that this still works if we set kernel_irqchip=off (with the host kernel fixed for bug 1375778). So it looks like qemu is doing something wrong wiring up the interrupt routing on the kvm irqchip.

Finally tracked this down to a really subtle behavioural change in the handling of the "dummy" msi irq. Will post upstream fix shortly.

Upstream patch posted. Brewing a preliminary downstream fix at:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11756232

Alex Williamson has now sent a pull request upstream with this fix. New brew based on the upstream version at:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11764155

Fix included in qemu-kvm-rhev-2.6.0-26.el7

Reproduced the issue on the old version:

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.6.0-25.el7.ppc64le
SLOF-20160223-6.gitdbbfda4.el7.noarch
host: kernel-3.10.0-506.el7.ppc64le
guest: kernel-3.10.0-506.el7.ppc64le

Steps to Reproduce:

1. Load the related modules on the host:
# modprobe vfio
# modprobe vfio_spapr_eeh
# modprobe vfio_iommu_spapr_tce
# modprobe vfio_pci

2. Unbind the device from its host driver and bind it to vfio-pci:
# lspci -ns 0003:09:00.0
0003:09:00.0 0200: 14e4:1657 (rev 01)
# echo "14e4 1657" > /sys/bus/pci/drivers/vfio-pci/new_id
# echo 0003:09:00.0 > /sys/bus/pci/devices/0003\:09\:00.0/driver/unbind
# echo 0003:09:00.1 > /sys/bus/pci/devices/0003\:09\:00.1/driver/unbind
# echo 0003:09:00.2 > /sys/bus/pci/devices/0003\:09\:00.2/driver/unbind
# echo 0003:09:00.3 > /sys/bus/pci/devices/0003\:09\:00.3/driver/unbind
# echo 0003:09:00.0 > /sys/bus/pci/drivers/vfio-pci/bind
# echo 0003:09:00.1 > /sys/bus/pci/drivers/vfio-pci/bind
# echo 0003:09:00.2 > /sys/bus/pci/drivers/vfio-pci/bind
# echo 0003:09:00.3 > /sys/bus/pci/drivers/vfio-pci/bind

3. Boot up the guest with the vfio-pci device:
/usr/libexec/qemu-kvm \
 -name xuma-test \
 -smp 4 \
 -m 1024 \
 -rtc base=utc,clock=vm \
 -vnc :20 \
 -qmp tcp:0:4444,server,nowait \
 -usb \
 -usbdevice tablet \
 -nographic \
 -device virtio-scsi-pci,bus=pci.0 \
 -device scsi-hd,id=scsi-hd0,drive=scsi-hd0-dr0,bootindex=0 \
 -drive file=minimal.qcow2,if=none,id=scsi-hd0-dr0,format=qcow2,cache=none \
 -device spapr-pci-host-bridge,id=vfiohost,index=0x1 \
 -device vfio-pci,host=0003:09:00.0,bus=vfiohost.0,addr=0x1,id=vfio_dev

4. Check the vfio device in the guest and get an IP address with dhclient:
(guest)# lspci
00:00.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:01.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 03)
00:02.0 SCSI storage controller: Red Hat, Inc Virtio block device
0001:00:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
(guest)# dhclient eth1

5. Ping the guest from another host and trigger EEH from the host 6 times, until the vfio device goes offline:
(host)# echo 2:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0003/err_injct && lspci -ns 0003:09:00.0
(wait for the ping to the guest to resume)
(host)# echo 2:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0003/err_injct && lspci -ns 0003:09:00.0

Actual results:
The network can't recover after EEH is triggered one time.

Verified the issue on the latest build:

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.6.0-26.el7.ppc64le
SLOF-20160223-6.gitdbbfda4.el7.noarch
host: kernel-3.10.0-506.el7.ppc64le
guest: kernel-3.10.0-506.el7.ppc64le

Steps to Verify:
The same steps as above.

Actual results:
The network recovers each time while EEH is triggered 6 times; the network card then goes offline and comes back after the guest reboots.

Based on the above results, the bug has been fixed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2673.html
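
For anyone re-running the reproduction steps: step 2 repeats the same sysfs write for each of the four functions of the BCM5719. Below is a minimal sketch of that sequence as a loop, assuming the same slot (0003:09:00) and vendor/device IDs (14e4 1657) as in the report. The function names `sysfs_write` and `rebind_to_vfio` and the `DRY_RUN` / `SYSFS_ROOT` variables are hypothetical and not part of any tool mentioned in this bug. By default it only prints the writes it would perform.

```shell
#!/bin/sh
# Hypothetical helper, not from the original report.
# Writes a value to a sysfs file, or only prints the write when DRY_RUN=1.
sysfs_write() {
    # $1 = value to write, $2 = target sysfs path
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf 'echo "%s" > %s\n' "$1" "$2"
    else
        printf '%s\n' "$1" > "$2"
    fi
}

# Rebind every function of one PCI slot to vfio-pci, in the same order as
# step 2 of the report: register the IDs, unbind all functions, bind them.
rebind_to_vfio() {
    slot=$1            # domain:bus:device without the function, e.g. 0003:09:00
    ids=$2             # "vendor device" as shown by lspci -n, e.g. "14e4 1657"
    nfuncs=${3:-4}     # number of functions on the device (assumed 4 here)
    sysfs=${SYSFS_ROOT:-/sys}

    sysfs_write "$ids" "$sysfs/bus/pci/drivers/vfio-pci/new_id"

    fn=0
    while [ "$fn" -lt "$nfuncs" ]; do
        sysfs_write "$slot.$fn" "$sysfs/bus/pci/devices/$slot.$fn/driver/unbind"
        fn=$((fn + 1))
    done

    fn=0
    while [ "$fn" -lt "$nfuncs" ]; do
        sysfs_write "$slot.$fn" "$sysfs/bus/pci/drivers/vfio-pci/bind"
        fn=$((fn + 1))
    done
}

# Dry run for the device used in this bug.
DRY_RUN=1
rebind_to_vfio 0003:09:00 "14e4 1657" 4
```

With `DRY_RUN=1` the call prints one `new_id` write, four unbind writes and four bind writes. Setting `DRY_RUN=0` and running as root on the host performs the writes; the vfio modules from step 1 still need to be loaded first.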