Bug 1563525

Summary:	Cisco VIC caused VM paused forever
Product:	Red Hat Enterprise Linux 7	Reporter:	Chen <cchen>
Component:	qemu-kvm-rhev	Assignee:	Alex Williamson <alex.williamson>
Status:	CLOSED INSUFFICIENT_DATA	QA Contact:	xiywang
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	7.4	CC:	cchen, chayang, juzhang, knoel, michen, pezhang, rbalakri, siliu, virt-maint, xiywang
Target Milestone:	rc
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-06-11 22:25:59 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Chen 2018-04-04 05:53:33 UTC

Description of problem:

A VM with a Cisco VIC assigned through vfio-pci is paused and never resumes.

Excerpt from /var/log/libvirt/qemu/rhel7.4.log:

2018-04-04T04:34:53.304846Z qemu-kvm: vfio_err_notifier_handler(0000:0b:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest
2018-04-04T04:34:53.405020Z qemu-kvm: vfio_err_notifier_handler(0000:0b:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest

$ cat lspci | grep 0b:00
0b:00.0 Ethernet controller [0200]: Cisco Systems Inc VIC Ethernet NIC [1137:0043] (rev a2)
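
For triage, the address in the QEMU message can be matched against the lspci output mechanically. The following is a minimal Python sketch, not part of the original report; the function names `failing_devices` and `describe` are hypothetical, and it assumes the log and lspci output have already been read into strings.

```python
import re

# Captures the PCI address inside vfio_err_notifier_handler(...) in a QEMU log line.
VFIO_ERR = re.compile(
    r"vfio_err_notifier_handler\("
    r"([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\)"
)


def failing_devices(qemu_log_text):
    """Return the distinct PCI addresses reported by the handler, in log order."""
    seen = []
    for match in VFIO_ERR.finditer(qemu_log_text):
        addr = match.group(1).lower()
        if addr not in seen:
            seen.append(addr)
    return seen


def describe(addr, lspci_text):
    """Return the lspci line for addr, or None if the device is not listed.

    Plain lspci output omits the 0000: domain prefix, so it is dropped here.
    """
    short = addr.split(":", 1)[1]  # "0000:0b:00.0" -> "0b:00.0"
    for line in lspci_text.splitlines():
        if line.startswith(short + " "):
            return line
    return None
```

Applied to the two log lines and the lspci line above, this would report a single failing device, 0000:0b:00.0, and return the Cisco VIC entry for it.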

The customer does not see similar symptoms with an Intel NIC.

Version-Release number of selected component (if applicable):

vfio-pci
Cisco VIC
kernel-3.10.0-693.17.1.el7.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.13.x86_64

How reproducible:

Quite frequently

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 juzhang 2018-04-04 06:29:51 UTC

Are https://bugzilla.redhat.com/show_bug.cgi?id=1563524 and https://bugzilla.redhat.com/show_bug.cgi?id=1563525 the same issue?

Comment 4 Chen 2018-04-04 06:33:49 UTC

*** Bug 1563524 has been marked as a duplicate of this bug. ***

Comment 5 Chen 2018-04-04 06:34:55 UTC

Hi Jun Yi,

Sorry, I closed 1563524 as a duplicate of this one.

Best Regards,
Chen

Comment 8 Alex Williamson 2018-04-04 14:29:46 UTC

The device generated an uncorrected AER error as seen in messages:

Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0000
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0420(Receiver ID)
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0:   device [10b5:8632] error status/mask=00200000/00100000
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0:    [21] ACS Violation          (First)
Apr  4 13:34:53 rhel7-bare kvm: 5 guests now active
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0: AER: Device recovery failed
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0000
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0420(Receiver ID)
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0:   device [10b5:8632] error status/mask=00200000/00100000
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0:    [21] ACS Violation          (First)
Apr  4 13:34:53 rhel7-bare kernel: pcieport 0000:04:04.0: AER: Device recovery failed
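
The status/mask words and the Receiver ID in these messages can be decoded by hand; the following Python sketch shows the arithmetic. It is an illustration only: the function names are hypothetical, and the bit names other than bit 21 (which the log itself labels) are taken from the PCIe AER Uncorrectable Error Status register layout rather than from this system.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status / Mask registers.
# Only bit 21 is confirmed by the log above; the rest follow the AER layout.
UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_bits(value):
    """Name every set bit in a 32-bit status or mask word."""
    return [
        UNCOR_BITS.get(bit, f"bit {bit}")
        for bit in range(32)
        if (value >> bit) & 1
    ]


def decode_id(rid):
    """Split a 16-bit PCIe ID into bus:device.function."""
    bus = (rid >> 8) & 0xFF
    dev = (rid >> 3) & 0x1F
    fn = rid & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"
```

With the values logged here, status 0x00200000 decodes to ACS Violation only, the mask 0x00100000 covers Unsupported Request and so does not suppress it, and id=0420 decodes to 04:04.0, the port that logged the error.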

There are multiple levels of switches used in this system:

00:03.0->{03:00.0->04:04.0}->{06:00.0->07:01.0}->{09:00.0->0a:00.0}->0b:00.0
             PLX switch         Cisco switch        Cisco switch

The ACS violation seems to be detected by the downstream port of the PLX switch and forwarded up to the PCIe root port.
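
A chain like the one above can usually be read from sysfs, because the resolved path of a PCI device lists every bridge above it. The sketch below is an assumption-laden illustration, not something run on the customer system: the function names are hypothetical, and the example path in the usage note is reconstructed from the chain in this comment rather than captured from the host.

```python
import re

# A full PCI address as it appears in a sysfs path component, e.g. 0000:0b:00.0.
BDF = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")


def upstream_chain(sysfs_realpath):
    """Return the PCI addresses along a resolved sysfs path, root port first."""
    return [part for part in sysfs_realpath.split("/") if BDF.match(part)]


def below(chain, port):
    """Return the devices downstream of port in the chain (empty if absent)."""
    if port not in chain:
        return []
    return chain[chain.index(port) + 1:]
```

On a live host the input would come from os.path.realpath("/sys/bus/pci/devices/0000:0b:00.0"). For the topology described here, everything from 0000:06:00.0 down to 0000:0b:00.0 sits below the PLX downstream port 0000:04:04.0 that reported the violation.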

This is a hardware issue, not a software issue.  QEMU pauses the VM for data collection when it receives an uncorrected AER error.  The customer should work with the hardware vendors to determine the cause of the violation.

Comment 10 Alex Williamson 2018-06-01 16:21:56 UTC

The customer case has been closed. Can we also close this bz?