Can this be reproduced using the PCI sysfs reset file directly? For example, bind the device to pci-stub, then do the following without assigning the device to a guest:
# echo 1 > /sys/bus/pci/devices/0000:06:00.0/reset
If that also fails (expected) then this is probably more appropriately a kernel issue and we'll have to create a reset quirk for this device.
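For reference, the pci-stub binding step could look roughly like the sketch below. It assumes the pci-stub driver is available on the host. PCI_BUS_ROOT and bind_to_pci_stub are hypothetical names introduced here; the override exists only so the function can be tried against a scratch directory instead of the real sysfs tree.

```shell
# Sketch only: hand a PCI device over to pci-stub through sysfs.
# PCI_BUS_ROOT is a hypothetical override, not a kernel or distro setting.
PCI_BUS_ROOT="${PCI_BUS_ROOT:-/sys/bus/pci}"

bind_to_pci_stub() {
    bdf="$1"
    dev_dir="$PCI_BUS_ROOT/devices/$bdf"
    stub_dir="$PCI_BUS_ROOT/drivers/pci-stub"

    [ -d "$dev_dir" ]  || { echo "no such device: $bdf" >&2; return 1; }
    [ -d "$stub_dir" ] || { echo "pci-stub driver not present" >&2; return 1; }

    # Read the IDs from the device itself rather than hard-coding them.
    vendor=$(cat "$dev_dir/vendor") || return 1
    device=$(cat "$dev_dir/device") || return 1

    # Teach pci-stub about this vendor/device pair.
    echo "$vendor $device" > "$stub_dir/new_id" || return 1

    # Detach the current driver (e1000e in this bug), if there is one.
    if [ -e "$dev_dir/driver/unbind" ]; then
        echo "$bdf" > "$dev_dir/driver/unbind" || return 1
    fi

    # Bind to pci-stub; tolerate a failure if the stub already claimed the device.
    echo "$bdf" > "$stub_dir/bind" 2>/dev/null || [ -e "$stub_dir/$bdf" ]
}

# Usage (as root):
#   bind_to_pci_stub 0000:06:00.0
```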
(In reply to comment #6)
> Can this be reproduced using the PCI sysfs reset file directly? For example,
> bind the device to pci-stub, then do the following without assigning the device
> to a guest:
>
> # echo 1 > /sys/bus/pci/devices/0000:06:00.0/reset
>
> If that also fails (expected) then this is probably more appropriately a kernel
> issue and we'll have to create a reset quirk for this device.
After binding the device to pci-stub and writing to the PCI sysfs reset file directly, I got the following error:
# echo 1 > /sys/bus/pci/devices/0000\:06\:00.0/reset
-bash: echo: write error: Invalid argument
If I boot a guest with the device assigned and then shut it down, the write to the PCI sysfs reset file succeeds, but it *won't* trigger a kernel panic.
I can trigger this without KVM being involved. The mistake I made in the Comment 6 instructions was to bind the device to pci-stub. In fact, we should leave the device bound to e1000e, then we just need to echo 1 to the pci-sysfs reset file (sometimes more than once) and the system will hit the same panic. I'd like to see if it's avoided if we boot with noaer, but I seem to have lost the test system.
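Since the reset sometimes has to be written more than once, a small loop makes the reproducer easier to repeat. This is only a sketch: pci_reset_repeat and the PCI_SYSFS_ROOT override are hypothetical names made up here, and the attempt count and delay are arbitrary defaults, not values taken from the failing system.

```shell
# Sketch only: write 1 to a device's sysfs reset file several times.
# PCI_SYSFS_ROOT is a hypothetical override for dry runs against a scratch tree.
PCI_SYSFS_ROOT="${PCI_SYSFS_ROOT:-/sys/bus/pci/devices}"

pci_reset_repeat() {
    bdf="$1"
    attempts="${2:-3}"   # arbitrary default
    delay="${3:-1}"      # seconds between attempts, arbitrary default
    reset_file="$PCI_SYSFS_ROOT/$bdf/reset"

    if [ ! -e "$reset_file" ]; then
        echo "no reset file at $reset_file" >&2
        return 1
    fi

    i=1
    while [ "$i" -le "$attempts" ]; do
        echo "reset attempt $i of $attempts on $bdf"
        echo 1 > "$reset_file" || return 1
        sleep "$delay"
        i=$((i + 1))
    done
}

# Usage (as root, device still bound to e1000e; on the affected
# hardware this is expected to bring the host down):
#   pci_reset_repeat 0000:06:00.0 3
```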
pci=noaer makes no difference since this is an APEI error. Booting with ghes.disable=1 keeps the OS from printing the error and calling panic, but we still get an unknown NMI and the device doesn't work afterwards.
Based on the errata for this 82574L, I also tried masking unsupported request errors in the advanced error reporting capability register (config offset 0x108, bit 20). This does seem to prevent the APEI error, but the device is unable to obtain an IP address via DHCP after reset. This possibly relates back to Specification Clarification note 2 in the specification update for this device, which indicates that a D3->D0 transition resets the PHY.
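To make the masking experiment concrete, the helper below computes the register value with bit 20 set, and the commented lines show how it might be applied with setpci. Treat it as a sketch: aer_mask_with_ur is a hypothetical helper name, the 06:00.0 address and offset 0x108 are simply the ones quoted in this bug, and the setpci lines have not been verified on the affected system.

```shell
# Sketch only: set the unsupported request bit (bit 20) in a 32-bit mask value.
UR_MASK_BIT=20

aer_mask_with_ur() {
    # $1 = current register value, e.g. 0x00000000
    printf '0x%08x\n' "$(( $1 | (1 << UR_MASK_BIT) ))"
}

# Possible use with setpci (as root; setpci prints and takes hex without 0x):
#   cur=$(setpci -s 06:00.0 108.L)
#   new=$(aer_mask_with_ur "0x$cur")
#   setpci -s 06:00.0 108.L="${new#0x}"
```

Doing the read-modify-write this way leaves any mask bits that the BIOS already set untouched.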
I also attempted to do a secondary bus reset on the parent bridge to this device from userspace, but met another APEI error for an unsupported request from the bridge.
Adding Matthew Garrett and Don Zickus to the cc since they may be able to shed some light on the ACPI and APEI aspects of this.
I did verify this system is running the latest released BIOS.
If it's generating an NMI even without any APEI/AER support, I think it's pretty clear that the hardware is doing something nasty to the bus and we're just the messenger.
(In reply to comment #15)
> If it's generating an NMI even without any APEI/AER support, I think it's
> pretty clear that the hardware is doing something nasty to the bus and we're
> just the messenger.
Right, or the hardware's firmware just croaked. Either way an NMI is being generated. In the APEI case, the BIOS firmware traps it, grabs some info and wraps a pretty little ACPI bow on it. Without APEI, the NMI just goes to the OS to decipher, which we do not do a good job of currently.
There isn't much we can do except ask for a firmware update on the hardware or just blacklist the device from being used with a guest :-)
Cheers,
Don
(In reply to comment #16)
>
> There isn't much we can do except ask for a firmware update on the hardware or
> just blacklist the device from being used with a guest :-)
Or create a device specific reset for it if we can figure out how to make it play nicely (though we can't really protect against userspace deciding to do a D0->D3hot->D0 on its own, which is all it takes to hit this).
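To illustrate how little userspace needs for that D0->D3hot->D0 cycle, here is a sketch that works on the power state field (bits 1:0) of the power management control/status register. pmcsr_set_state is a hypothetical helper name, and the commented setpci lines are an untested assumption about how one might drive it on the 06:00.0 device from this bug.

```shell
# Sketch only: replace the power state field (bits 1:0) of a PMCSR value.
# State 0 is D0, state 3 is D3hot.
pmcsr_set_state() {
    # $1 = current PMCSR value, $2 = target state (0..3)
    printf '0x%04x\n' "$(( ($1 & ~3) | ($2 & 3) ))"
}

# Possible use with setpci (as root; expected to trigger the failure
# on affected hardware). The value:mask form only changes bits 1:0.
#   setpci -s 06:00.0 CAP_PM+4.w=3:3   # D0 -> D3hot
#   sleep 1
#   setpci -s 06:00.0 CAP_PM+4.w=0:3   # D3hot -> D0
```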
Created attachment 523522: guest device message