Bug 1563525

Summary: Cisco VIC caused VM paused forever
Product: Red Hat Enterprise Linux 7 Reporter: Chen <cchen>
Component: qemu-kvm-rhev Assignee: Alex Williamson <alex.williamson>
Status: CLOSED INSUFFICIENT_DATA QA Contact: xiywang
Severity: medium Docs Contact:
Priority: unspecified    
Version: 7.4 CC: cchen, chayang, juzhang, knoel, michen, pezhang, rbalakri, siliu, virt-maint, xiywang
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-06-11 22:25:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Chen 2018-04-04 05:53:33 UTC
Description of problem:

A Cisco VIC caused the VM to be paused indefinitely.

Snip of /var/log/libvirt/qemu/rhel7.4.log

2018-04-04T04:34:53.304846Z qemu-kvm: vfio_err_notifier_handler(0000:0b:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest
2018-04-04T04:34:53.405020Z qemu-kvm: vfio_err_notifier_handler(0000:0b:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest

$ cat lspci | grep 0b:00
0b:00.0 Ethernet controller [0200]: Cisco Systems Inc VIC Ethernet NIC [1137:0043] (rev a2)

The customer does not see similar symptoms with an Intel NIC.

Version-Release number of selected component (if applicable):

vfio-pci
Cisco VIC
kernel-3.10.0-693.17.1.el7.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64

How reproducible:

Quite frequent

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 4 Chen 2018-04-04 06:33:49 UTC
*** Bug 1563524 has been marked as a duplicate of this bug. ***

Comment 5 Chen 2018-04-04 06:34:55 UTC
Hi Jun Yi,

Sorry, I closed 1563524 as a duplicate of this one.

Best Regards,
Chen

Comment 8 Alex Williamson 2018-04-04 14:29:46 UTC
The device generated an uncorrected AER error as seen in messages:

Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0000
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0420(Receiver ID)
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0:   device [10b5:8632] error status/mask=00200000/00100000
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0:    [21] ACS Violation          (First)
Apr  4 13:34:53 rhel7-bare kvm: 5 guests now active
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0: AER: Device recovery failed
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0000
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0420(Receiver ID)
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0:   device [10b5:8632] error status/mask=00200000/00100000
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0:    [21] ACS Violation          (First)
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0: AER: Device recovery failed

There are multiple levels of switches used in this system:

00:03.0->{03:00.0->04:04.0}->{06:00.0->07:01.0}->{09:00.0->0a:00.0}->0b:00.0
             PLX switch         Cisco switch        Cisco switch
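
The chain above reads as a root port, then three switches each written as {upstream port->downstream port}, then the VIC endpoint. A minimal Python sketch that splits the string into those hops; the function name and the output structure are illustrative assumptions, not part of any existing tool:

```python
import re

# Topology string exactly as written in the comment above.
CHAIN = "00:03.0->{03:00.0->04:04.0}->{06:00.0->07:01.0}->{09:00.0->0a:00.0}->0b:00.0"

def parse_chain(chain):
    """Split the path into hops.

    Each braced switch becomes an (upstream, downstream) tuple; the root
    port and the endpoint stay as plain 'bus:dev.fn' strings.
    """
    hops = []
    pattern = r"\{([^}]*)\}|([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
    for braced, single in re.findall(pattern, chain):
        if braced:
            upstream, downstream = braced.split("->")
            hops.append((upstream, downstream))
        else:
            hops.append(single)
    return hops

hops = parse_chain(CHAIN)
for hop in hops:
    print(hop)
```

Read this way, 04:04.0 (the device named in the AER messages) is the downstream port of the first switch in the chain, three switch levels above the assigned device 0b:00.0.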

The ACS violation seems to be detected by the downstream port of the PLX switch and forwarded up to the PCIe root port.
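
The fields in the log can be decoded by hand. The following Python sketch is an illustration only: the helper names are hypothetical, the bit table is a subset of the PCIe AER Uncorrectable Error Status register layout, and the status/mask value is taken from the log with the wrapped line joined (00200000/00100000).

```python
# Subset of bit positions in the PCIe AER Uncorrectable Error
# Status/Mask registers. The helper names below are illustrative.
UNCOR_BITS = {
    4: "Data Link Protocol",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC",
    20: "Unsupported Request",
    21: "ACS Violation",
}

def decode_bits(value):
    """Return the names of the known error bits set in value."""
    return [name for bit, name in sorted(UNCOR_BITS.items()) if value & (1 << bit)]

def decode_id(rid):
    """Convert a 16-bit PCIe ID (bus[15:8], device[7:3], function[2:0])
    to 'bus:dev.fn' notation."""
    return "%02x:%02x.%x" % (rid >> 8, (rid >> 3) & 0x1F, rid & 0x7)

status, mask = 0x00200000, 0x00100000  # from "error status/mask=" in the log

print("status:", decode_bits(status))
print("mask:  ", decode_bits(mask))
print("id:    ", decode_id(0x0420))
```

The status value has only bit 21 set, which matches the "[21] ACS Violation" line, and id=0420 decodes to 04:04.0, the port that reported the error.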

This is a hardware issue, not a software issue. QEMU pauses the VM for data collection when it receives an uncorrected AER error. The customer should work with the hardware vendors to determine the cause of the violation.

Comment 10 Alex Williamson 2018-06-01 16:21:56 UTC
The customer case has been closed. Can we also close this bz?