| Summary: | kvm host crash when abusing passed through nic in guest | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Igor Mammedov <imammedo> |
| Component: | kernel | Assignee: | Alex Williamson <alex.williamson> |
| Status: | CLOSED WORKSFORME | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.2 | | |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2011-11-08 22:28:57 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |
Description
Igor Mammedov
2011-09-07 16:28:51 UTC
Created attachment 521938
guest config
Since RHEL 6.2 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

The host console log with the APEI error looks like what we get for some other Intel NICs when something goes wrong on the bus: APEI hands off to the BIOS, and the BIOS tells the host that a fatal error has occurred and it must reboot. Officially, Intel doesn't support assignment of PFs, so I wonder if we can make this happen with VFs. I'll try to reproduce on a system w/o APEI and see what happens.

I notice that this is not the same system bug 713221 was found on. Can we test whether the fixes implemented there on Xen also resolve the problem on this system, which may implement an overzealous APEI?

I can't reproduce on my system with either the igb PF or VF. I patched igbvf in the same way as igb; when I remove the module, nothing interesting happens. Trying to reload the module fails in the guest since the interrupt was never unregistered. If the guest is rebooted, the device works again. The PF device using the modified igb driver behaves exactly the same. I suspect Igor's test system may have an overactive APEI layer causing the reboot.

On closer reading of bug 713221, I see now that it was a dom0/PV pass-through issue, not reproducible with HVM, so there's really nothing to leverage from that bug. I don't think attempting to reproduce that bug on this system would tell us anything. The VF test is still interesting, though. With Xen HVM or KVM, the guest failing to disable interrupts shouldn't have any adverse effects on the host. It's still possible, however, for the device to send out a bogus transaction that the chipset and BIOS can overreact to. In this case we're getting a report of an unsupported transaction, which my system w/o APEI could simply be discarding while continuing along happily. APEI seems to put all of the decisions about recovery in the hands of the BIOS; we should probably investigate whether there's an opportunity to just kill the guest attached to the offending device and take the device offline.

Created attachment 531834
igbvf patch
This is what I used to try to evoke the same behavior with igbvf. Patch against RHEL5.7.
(In reply to comment #6) Tried to reproduce with both the VF and the PF. In both cases the guest crashes. In the PF case the host receives an NMI but stays alive, so the bug is no longer reproducible. Changes since the host last crashed:
- The host motherboard was replaced (after several host crashes it permanently refused to initialize the igb NIC).
- The igb NIC was moved to another slot.

The host probably crashed originally because of the faulty motherboard after all. Since there is no way to reproduce it now, let's close the bug. If someone sees a similar crash, feel free to reopen.