Created attachment 521937 [details]
host console log
Description of problem:
Version-Release number of selected component (if applicable):
host kernel 2.6.32-192.el6.x86_64
Steps to Reproduce:
1. Create an x86_64 guest and install RHEL5.7 x86_64 xen.
2. Pass the igb NIC through to the guest.
In my case:
# virsh nodedev-dettach pci_0000_04_00_0
# virsh nodedev-dettach pci_0000_04_00_1
Then add these devices to the guest using virt-manager.
3. Build a custom igb module for the RHEL5.7 guest using the patch
"Reproducer using igb driver" from bug 713221 comment 16,
and execute the commands from that comment.
The last command crashes the host.
Only the guest should crash, not the host.
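For anyone repeating step 2, it may help to confirm that both functions were actually released by the host igb driver before assigning them. The sketch below is not part of the original reproducer. It assumes the usual virsh naming scheme pci_DDDD_BB_SS_F and the standard sysfs layout under /sys/bus/pci/devices; the helper names nodedev_to_bdf and bound_driver are made up for illustration. After a successful nodedev-dettach the reported driver would typically be a stub driver such as pci-stub rather than igb.

```python
import os
import re

# virsh node device names for PCI functions look like pci_DDDD_BB_SS_F
# (domain, bus, slot, function). This pattern is an assumption based on
# the names used in this report, e.g. pci_0000_04_00_0.
_NODEDEV_RE = re.compile(
    r"^pci_([0-9a-f]{4})_([0-9a-f]{2})_([0-9a-f]{2})_([0-7])$"
)


def nodedev_to_bdf(name):
    """Convert a virsh node device name to a PCI address string."""
    m = _NODEDEV_RE.match(name)
    if m is None:
        raise ValueError("not a PCI node device name: %r" % name)
    domain, bus, slot, func = m.groups()
    return "%s:%s:%s.%s" % (domain, bus, slot, func)


def bound_driver(name, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the host driver bound to the device, or None.

    Reads the 'driver' symlink in the device's sysfs directory. None
    means the device is unbound or not present on this machine.
    """
    link = os.path.join(sysfs_root, nodedev_to_bdf(name), "driver")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    # The two functions of the igb NIC from this report.
    for dev in ("pci_0000_04_00_0", "pci_0000_04_00_1"):
        print(dev, nodedev_to_bdf(dev), bound_driver(dev))
```

Run on the host after the two virsh commands above; if either line still shows igb, the detach did not take effect.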
Created attachment 521938 [details]
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.
Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.
The host console log with the APEI error looks like what we get for some other Intel NICs when something goes wrong on the bus: APEI hands off to the BIOS, and the BIOS tells the host that a fatal error has occurred and that it should reboot. Officially, Intel doesn't support assignment of PFs, so I wonder if we can make this happen with VFs. I'll try to reproduce on a system w/o APEI and see what happens.
I notice that this is not the same system bug 713221 was found on. Can we test whether the fixes implemented there on Xen also resolve the problem on this system, which may implement an overzealous APEI?
I can't reproduce on my system with either the igb PF or VF. I patched igbvf in the same way as igb; when I remove the module, nothing interesting happens. Trying to reload the module fails in the guest since the interrupt was never unregistered. If the guest is rebooted, the device works again. The PF device using the modified igb driver behaves exactly the same. I suspect Igor's test system may have an overactive APEI layer causing the reboot.
On closer reading of bug 713221, I see now that it was a dom0/pv pass-through issue, not reproducible with hvm, so there's really nothing to potentially leverage from that bug. I don't think it would tell us anything to attempt to reproduce that bug on this system. The VF test is still interesting though.
With xen hvm or kvm, the guest failing to disable interrupts shouldn't have any adverse effects on the host. It's still possible though for the device to send out a bogus transaction which the chipset and BIOS can overreact to. In this case we're getting a report of an unsupported transaction, which my system w/o APEI could simply be discarding and continuing along happily. APEI seems to put all of the decisions about recovery in the hands of the BIOS. We should probably investigate whether there's an opportunity to just kill the guest attached to the offending device and take the device offline.
Created attachment 531834 [details]
This is what I used to try to evoke the same behavior with igbvf. Patch against RHEL5.7.
(In reply to comment #6)
Tried to reproduce with both the VF and the PF. The guest crashes in both cases. In the PF case the host receives an NMI but stays alive. So the bug is no longer reproducible.
Changes since the last time the host crashed:
- The host motherboard was replaced (it permanently declined to initialize the igb NIC after several host crashes).
- The igb NIC was moved to another slot.
The host probably crashed originally because of a faulty motherboard after all.
Since there is no way to reproduce it now, let's close the bug.
If anyone sees a similar crash, feel free to reopen.