Bug 1563525 - Cisco VIC caused VM paused forever
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Alex Williamson
QA Contact: xiywang
URL:
Whiteboard:
Duplicates: 1563524
Depends On:
Blocks:
Reported: 2018-04-04 05:53 UTC by Chen
Modified: 2019-11-21 03:05 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-11 22:25:59 UTC
Target Upstream Version:



Description Chen 2018-04-04 05:53:33 UTC
Description of problem:

A Cisco VIC assigned to the guest through vfio-pci caused the VM to be paused and never resumed.

Snippet from /var/log/libvirt/qemu/rhel7.4.log:

2018-04-04T04:34:53.304846Z qemu-kvm: vfio_err_notifier_handler(0000:0b:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest
2018-04-04T04:34:53.405020Z qemu-kvm: vfio_err_notifier_handler(0000:0b:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest
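
For triage, the affected device addresses can be pulled out of the libvirt guest log with a short script. This is only a sketch: the regular expression is written against the two message lines quoted above, and the function name vfio_error_events is made up for illustration.

```python
import re

# Pattern derived from the two log lines quoted in this report; other QEMU
# versions may format the message differently (assumption).
VFIO_ERR = re.compile(
    r"^(?P<ts>\S+) qemu-kvm: vfio_err_notifier_handler\((?P<bdf>[0-9a-fA-F:.]+)\)"
)


def vfio_error_events(log_text):
    """Return (timestamp, PCI address) pairs, one per vfio error notification."""
    events = []
    for line in log_text.splitlines():
        match = VFIO_ERR.match(line.strip())
        if match:
            events.append((match.group("ts"), match.group("bdf")))
    return events
```

Fed the contents of /var/log/libvirt/qemu/rhel7.4.log, this should list one entry per notification, each naming 0000:0b:00.0 for the lines shown here.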

$ cat lspci | grep 0b:00
0b:00.0 Ethernet controller [0200]: Cisco Systems Inc VIC Ethernet NIC [1137:0043] (rev a2)

The customer does not see similar symptoms with an Intel NIC.

Version-Release number of selected component (if applicable):

vfio-pci
Cisco VIC
kernel-3.10.0-693.17.1.el7.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64

How reproducible:

Quite frequently

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 4 Chen 2018-04-04 06:33:49 UTC
*** Bug 1563524 has been marked as a duplicate of this bug. ***

Comment 5 Chen 2018-04-04 06:34:55 UTC
Hi Jun Yi,

Sorry, I closed 1563524 as a duplicate of this one.

Best Regards,
Chen

Comment 8 Alex Williamson 2018-04-04 14:29:46 UTC
The device generated an uncorrected AER error as seen in messages:

Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0000
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0420(Receiver ID)
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0:   device [10b5:8632] error status/mask=00200000/00100000
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0:    [21] ACS Violation          (First)
Apr  4 13:34:53 rhel7-bare kvm: 5 guests now active
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0: AER: Device recovery failed
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0000
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0420(Receiver ID)
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0:   device [10b5:8632] error status/mask=00200000/00100000
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0:    [21] ACS Violation          (First)
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0: AER: Device recovery failed
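
The status/mask and id fields in these messages can be decoded by hand. The sketch below shows one way to do it, assuming the standard bit layout of the PCIe AER Uncorrectable Error Status register and the usual bus/device/function packing of a 16-bit PCI ID; the table and function names are illustrative and do not come from the kernel.

```python
# Bit positions of the AER Uncorrectable Error Status register, per the PCIe
# base specification. Only commonly seen bits are listed (not exhaustive).
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_status(value):
    """List the names of the error bits set in a status or mask value."""
    return [name for bit, name in sorted(UNCORRECTABLE_BITS.items())
            if value & (1 << bit)]


def decode_id(value):
    """Turn a 16-bit PCI ID (bus[15:8], device[7:3], function[2:0]) into bb:dd.f."""
    bus = (value >> 8) & 0xFF
    dev = (value >> 3) & 0x1F
    fn = value & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"
```

With the values from the log, decode_status(0x00200000) names the ACS Violation bit, the mask 0x00100000 corresponds to Unsupported Request, and decode_id(0x0420) resolves to 04:04.0, the port that logged the error.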

There are multiple levels of switches used in this system:

00:03.0->{03:00.0->04:04.0}->{06:00.0->07:01.0}->{09:00.0->0a:00.0}->0b:00.0
             PLX switch         Cisco switch        Cisco switch

The ACS violation seems to be detected by the downstream port of the PLX switch and forwarded up to the PCIe root port.
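
A chain like the one above can usually be read from sysfs, because the resolved path of a PCI device contains every upstream bridge as a path component. The following is a sketch under that assumption; the example path in the comment is reconstructed from the topology in this comment and was not captured from the customer system, and the function names are hypothetical.

```python
import os
import re

# A full PCI address: domain:bus:device.function
BDF = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")


def upstream_chain(sysfs_path):
    """Return the PCI addresses in a resolved sysfs device path, root port first."""
    return [part for part in sysfs_path.split("/") if BDF.fullmatch(part)]


def chain_for_device(bdf):
    """Resolve /sys/bus/pci/devices/<bdf> and list the bridges above the device."""
    return upstream_chain(os.path.realpath(f"/sys/bus/pci/devices/{bdf}"))


# Path reconstructed from the topology described above (assumption):
# /sys/devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:04.0/
#   0000:06:00.0/0000:07:01.0/0000:09:00.0/0000:0a:00.0/0000:0b:00.0
```

On the host, chain_for_device("0000:0b:00.0") would be expected to show 0000:04:04.0, the port reporting the ACS violation, partway between the root port and the VIC.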

This is a hardware issue, not a software issue.  QEMU pauses the VM for data collection when it receives an uncorrected AER error.  The customer should work with the hardware vendors to determine the cause of the violation.

Comment 10 Alex Williamson 2018-06-01 16:21:56 UTC
The customer case has been closed.  Can we also close this bz?

