Bug 1563525 - Cisco VIC caused VM paused forever
Summary: Cisco VIC caused VM paused forever
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: qemu-kvm-rhev
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Alex Williamson
QA Contact: xiywang
URL:
Whiteboard:
Duplicates: 1563524
Depends On:
Blocks:
Reported: 2018-04-04 05:53 UTC by Chen
Modified: 2021-06-10 15:39 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-11 22:25:59 UTC
Target Upstream Version:
Embargoed:



Description Chen 2018-04-04 05:53:33 UTC
Description of problem:

A passed-through Cisco VIC causes the VM to be paused indefinitely.

Snip of /var/log/libvirt/qemu/rhel7.4.log

2018-04-04T04:34:53.304846Z qemu-kvm: vfio_err_notifier_handler(0000:0b:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest
2018-04-04T04:34:53.405020Z qemu-kvm: vfio_err_notifier_handler(0000:0b:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest

$ cat lspci | grep 0b:00
0b:00.0 Ethernet controller [0200]: Cisco Systems Inc VIC Ethernet NIC [1137:0043] (rev a2)

The customer doesn't see similar symptoms with an Intel NIC.

Version-Release number of selected component (if applicable):

vfio-pci
Cisco VIC
kernel-3.10.0-693.17.1.el7.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64

How reproducible:

Quite frequently

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 4 Chen 2018-04-04 06:33:49 UTC
*** Bug 1563524 has been marked as a duplicate of this bug. ***

Comment 5 Chen 2018-04-04 06:34:55 UTC
Hi Jun Yi,

Sorry, I closed 1563524 as a duplicate of this one.

Best Regards,
Chen

Comment 8 Alex Williamson 2018-04-04 14:29:46 UTC
The device generated an uncorrected AER error as seen in messages:

Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0000
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0420(Receiver ID)
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0:   device [10b5:8632] error status/mask=00200000/00100000
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0:    [21] ACS Violation          (First)
Apr  4 13:34:53 rhel7-bare kvm: 5 guests now active
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0: AER: Device recovery failed
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0000
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0420(Receiver ID)
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0:   device [10b5:8632] error status/mask=00200000/00100000
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0:    [21] ACS Violation          (First)
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0: AER: Device recovery failed
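
As an illustration of how the status value 00200000 and the mask value 00100000 reported in the log above map to named errors, here is a minimal Python sketch. It is not part of the original report. The bit names follow the Uncorrectable Error Status register layout of the PCIe AER capability as I understand it, and the function name `decode_aer_uncor` is hypothetical.

```python
# Bit positions of the AER Uncorrectable Error Status/Mask registers.
# Names are taken from the PCIe AER capability layout; treat the table as
# an assumption to verify against the PCIe specification.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_aer_uncor(value):
    """Return (bit, name) pairs for every bit set in a 32-bit register value."""
    return [
        (bit, AER_UNCOR_BITS.get(bit, "unknown/reserved"))
        for bit in range(32)
        if value & (1 << bit)
    ]


status = int("00200000", 16)  # status field from the log
mask = int("00100000", 16)    # mask field from the log

print("status:", decode_aer_uncor(status))
print("mask:  ", decode_aer_uncor(mask))
```

Under these assumptions the status decodes to bit 21, which agrees with the "[21] ACS Violation" line the kernel printed, while the mask has only bit 20 set, so ACS violations are not masked on this port.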

There are multiple levels of switches used in this system:

00:03.0->{03:00.0->04:04.0}->{06:00.0->07:01.0}->{09:00.0->0a:00.0}->0b:00.0
             PLX switch         Cisco switch        Cisco switch

The ACS violation seems to be detected by the downstream port of the PLX switch and forwarded up to the PCIe root port.
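
To show how the "id=0420" field in the log ties back to the PLX downstream port, the following Python sketch decodes a 16-bit PCIe ID using the standard bus/device/function packing (8 bits of bus, 5 bits of device, 3 bits of function). It is an illustrative sketch rather than anything from the kernel, and the function name `decode_pci_id` is hypothetical.

```python
def decode_pci_id(pci_id):
    """Split a 16-bit PCIe ID into bus:device.function notation."""
    bus = (pci_id >> 8) & 0xFF   # upper 8 bits: bus number
    dev = (pci_id >> 3) & 0x1F   # next 5 bits: device number
    fn = pci_id & 0x07           # lowest 3 bits: function number
    return f"{bus:02x}:{dev:02x}.{fn:x}"


print(decode_pci_id(0x0420))
```

With this packing, 0420 corresponds to 04:04.0, the same address as the pcieport device that logged the ACS violation.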

This is a hardware issue, not a software issue.  QEMU pauses the VM for data collection when it receives an uncorrected AER error.  The customer should work with the hardware vendors to determine the cause of the violation.

Comment 10 Alex Williamson 2018-06-01 16:21:56 UTC
The customer case has been closed. Can we also close this bz?

