Bug 736421 - kvm host crash when abusing passed through nic in guest
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Alex Williamson
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-09-07 16:28 UTC by Igor Mammedov
Modified: 2011-11-08 22:28 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-11-08 22:28:57 UTC


Attachments
host console log (3.81 KB, text/plain)
2011-09-07 16:28 UTC, Igor Mammedov
guest config (2.41 KB, application/xml)
2011-09-07 16:35 UTC, Igor Mammedov
igbvf patch (952 bytes, patch)
2011-11-04 19:21 UTC, Alex Williamson

Description Igor Mammedov 2011-09-07 16:28:51 UTC
Created attachment 521937 [details]
host console log

Description of problem:


Version-Release number of selected component (if applicable):

host kernel 2.6.32-192.el6.x86_64
qemu-kvm-0.12.1.2-2.183.el6.x86_64
qemu-img-0.12.1.2-2.183.el6.x86_64
gpxe-roms-qemu-0.9.7-6.7.el6.noarch


How reproducible:
Always

Steps to Reproduce:
1. Create an x86_64 guest and install RHEL5.7 x86_64 xen.

2. Pass the igb NIC through to the guest.
In my case:
# virsh nodedev-dettach pci_0000_04_00_0
# virsh nodedev-dettach pci_0000_04_00_1
Then add these devices to the guest using virt-manager.

3. Build a custom igb module for the RHEL5.7 guest using the patch
"Reproducer using igb driver" from bug 713221 comment 16,
and execute the commands from that comment.
The last command crashes the host.
  
Actual results:
The host crashes.

Expected results:
Only the guest should crash, not the host.

Comment 1 Igor Mammedov 2011-09-07 16:35:10 UTC
Created attachment 521938 [details]
guest config

Comment 3 RHEL Product and Program Management 2011-10-07 15:47:38 UTC
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 4 Alex Williamson 2011-11-04 15:04:19 UTC
The host console log with the APEI error looks like what we get for some other Intel NICs when something goes wrong on the bus: APEI hands off to the BIOS, and the BIOS tells the host that a fatal error has occurred and it must reboot.  Officially, Intel doesn't support assignment of PFs, so I wonder if we can make this happen with VFs.  I'll try to reproduce on a system w/o APEI and see what happens.

I notice that this is not the same system bug 713221 was found on. Can we test whether the fixes implemented there on Xen also resolve the problem on this system, which may implement an overzealous APEI?

Comment 5 Alex Williamson 2011-11-04 17:50:44 UTC
I can't reproduce on my system with either the igb PF or VF.  I patched igbvf in the same way as igb; when I remove the module, nothing interesting happens.  Trying to reload the module fails in the guest since the interrupt was never unregistered.  If the guest is rebooted, the device works again.  The PF device using the modified igb driver behaves exactly the same.  I suspect Igor's test system may have an overactive APEI layer causing the reboot.

Comment 6 Alex Williamson 2011-11-04 19:09:20 UTC
On closer reading of bug 713221, I see now that it was a dom0/pv pass-through issue, not reproducible with hvm, so there's really nothing to potentially leverage from that bug.  I don't think it would tell us anything to attempt to reproduce that bug on this system.  The VF test is still interesting, though.

With xen hvm or kvm, the guest failing to disable interrupts shouldn't have any adverse effects on the host.  It's still possible, though, for the device to send out a bogus transaction to which the chipset and BIOS can overreact.  In this case we're getting a report of an unsupported transaction, which my system w/o APEI could simply be discarding before continuing along happily.  APEI seems to put all of the decisions about recovery in the hands of the BIOS. We should probably investigate whether there's an opportunity to just kill the guest attached to the offending device and take the device offline.

Comment 7 Alex Williamson 2011-11-04 19:21:54 UTC
Created attachment 531834 [details]
igbvf patch

This is what I used to try to evoke the same behavior with igbvf.  Patch against RHEL5.7.

Comment 8 Igor Mammedov 2011-11-08 22:28:57 UTC
(In reply to comment #6)
Tried to reproduce with both VF and PF. The guest crashes in both cases. In the PF case the host receives an NMI but stays alive, so the bug is no longer reproducible.
Changes since the last time the host crashed:
   - The host motherboard was replaced (it permanently refused to initialize the igb NIC after several host crashes).
   - The igb NIC was moved to another slot.

The host probably crashed originally due to a faulty motherboard after all.

Since there is no way to reproduce it now, let's close the bug.
If someone sees a similar crash, feel free to reopen.

