Bug 740203 - Host crash when pass-through fails
Summary: Host crash when pass-through fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Laszlo Ersek
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 745726 790780
Reported: 2011-09-21 09:44 UTC by Yuyu Zhou
Modified: 2013-01-10 00:21 UTC (History)

Fixed In Version: kernel-2.6.18-294.el5
Doc Type: Bug Fix
Doc Text:
A previously applied patch (introduced as a fix in CVE-2011-1898) prevented PCI pass-through inside the assign_device domctl via a security check. Because the security check was not included in the test_assign_device domctl as well, qemu-dm may have started to encounter failures in the assign_device domctl, ultimately causing an HVM guest to have a partly accessible PCI device, which in some cases resulted in a crash of the host machine. With this update, the security check introduced in CVE-2011-1898 has been replicated in the test_assign_device domctl, thus fixing this issue.
Clone Of:
Environment:
Last Closed: 2012-02-21 03:56:27 UTC
Target Upstream Version:
Embargoed:


Attachments
screen log (1.06 MB, image/png)
2011-09-22 11:21 UTC, Yuyu Zhou
1/2 Propagate target domain within XEN_DOMCTL_test_assign_device (6.08 KB, patch)
2011-09-23 18:30 UTC, Laszlo Ersek
2/2 make the test_assign_device domctl dependent on intremap hardware (3.74 KB, patch)
2011-09-23 18:31 UTC, Laszlo Ersek
1/2 Propagate target domain within XEN_DOMCTL_test_assign_device (v2) (6.10 KB, patch)
2011-09-26 10:03 UTC, Laszlo Ersek
2/2 make the test_assign_device domctl dependent on intremap hardware (v2) (3.87 KB, patch)
2011-09-26 10:05 UTC, Laszlo Ersek
redo in one patch (4.19 KB, patch)
2011-09-27 13:53 UTC, Laszlo Ersek
1/2 Propagate target domain within XEN_DOMCTL_test_assign_device (v3) (6.19 KB, patch)
2011-09-28 11:20 UTC, Laszlo Ersek


Links
System: Red Hat Product Errata
ID: RHSA-2012:0150
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Enterprise Linux 5.8 kernel update
Last Updated: 2012-02-21 07:35:24 UTC

Description Yuyu Zhou 2011-09-21 09:44:10 UTC
Description of problem:
The host crashes when VFs of an 82576 NIC are passed through to two guests.

Version-Release number of selected component (if applicable):
kernel-xen-2.6.18-284.el5
xen-3.0.3-134.el5

How reproducible:
100%

Steps to Reproduce:
1- Boot the Xen host with iommu and pci_pt_e820_access enabled
   Add "iommu=1" to the kernel line
2- Enable VFs in Dom0
   # modprobe -r igb
   # modprobe igb max_vfs=7
3- Hide the VFs from Domain0 via the pciback driver
4- Boot two guests with VFs assigned
  
Actual results:
Host crash

Expected results:
No host crash, and the VFs work well in the guests.

Additional info:
It works fine in kernel-xen-2.6.18-283.el5.

Comment 5 Laszlo Ersek 2011-09-21 13:31:39 UTC
The patch for bug 716302 (commit 92ff425) triggers on the machine in comment 0:

(XEN) [VT-D]iommu.c:1716: Interrupt Remapping hardware not found
(XEN) [VT-D]iommu.c:1718: Device assignment will be disabled for security
      reasons (CVE-2011-1898).
(XEN) [VT-D]iommu.c:1720: Use iommu=no-intremap to override.

When starting the first HVM domain, the other message was also printed:

(XEN) Interrupt Remapping hardware not found, passing devices
(XEN) to unprivileged domains is insecure.  If you really want
(XEN) to do this, please boot with "iommu=no-intremap".
(XEN) domctl.c:560:d0 XEN_DOMCTL_assign_device: assign device (3:11:0) failed

The host crashed some time afterwards, even without starting the second domU.

Comment 9 RHEL Program Management 2011-09-21 14:11:16 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 14 Yuyu Zhou 2011-09-22 11:21:23 UTC
Created attachment 524385 [details]
screen log

Comment 21 Paolo Bonzini 2011-09-22 20:03:12 UTC
I'm thinking of solving this in userspace.  After all we do report an error from the hypervisor, but xm goes on anyway and creates a domain with a perfectly useless NIC.  It may still be worthwhile to investigate what's going on, but failing this early is a very logical thing to do even without the crash.

Comment 22 Laszlo Ersek 2011-09-23 07:49:57 UTC
I believed the hypervisor was the one to ignore the error somewhere farther down (or up) the way. But if the error is propagated to userspace, then I agree this would be a good solution. It is not the first case that xend ignores failed hypercalls, IIRC.

If I'm not mistaken, domain creation is also denied when the vm config specifies an invalid (non-assignable) device in the pci = [ ... ] stanza. The above remains consistent with that, it only adds another reason why the device can't be assigned.

Comment 23 Paolo Bonzini 2011-09-23 08:16:37 UTC
Looked at the code now.  In the qemu-dm.*.log files on the lab machine (not the one that crashes) I see

register_real_device: Assigning real physical device 01:00.1 ...
register_real_device: Error: xc_assign_device error -1
pt_register_regions: IO region registered (size=0x00020000 base_addr=0xfb940000)
pt_register_regions: IO region registered (size=0x00020000 base_addr=0xfb920000)
pt_register_regions: IO region registered (size=0x00000020 base_addr=0x0000e001)
pt_register_regions: IO region registered (size=0x00004000 base_addr=0xfba40000)
pt_register_regions: Expansion ROM registered (size=0x00020000 base_addr=0xfb900000)
pt_msix_init: get MSI-X table bar base fba40000
pt_msix_init: mapping physical MSI-X table to 2aaaac068000
register_real_device: Real physical device 01:00.1 registered successfuly!

So the assignment fails, but still the physical device is registered successfully.  We could fix this in qemu-dm, but we should not even reach this point.  Here's the twist now, that means that the bug is indeed in the hypervisor: QEMU does the real assignment, but userspace does a "dry run" with a test_assign_device hypercall to check that assignment would pass.  Of course this is as racy as it can be, but I digress.

assign_device goes through the hd->platform_ops->assign_device function pointer, while test_assign_device only checks that the device is owned by dom0 right now (function device_assigned in drivers/passthrough/vtd/iommu.c).  We probably need to make a function pointer for test_assign_device.
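
To make the mismatch concrete, here is a minimal Python model of the two domctls as described in this comment. It is not the hypervisor's C code: the Host class, its fields and the method names are illustrative assumptions. Only the behaviour comes from the thread: a dry run that checks dom0 ownership alone, while the real assignment also applies the interrupt-remapping check.

```python
from dataclasses import dataclass, field

DOM0 = 0


@dataclass
class Host:
    """Toy model of the hypervisor state relevant to device assignment."""
    has_intremap: bool                 # interrupt remapping hardware present?
    override: bool = False             # models booting with iommu=no-intremap
    owner: dict = field(default_factory=dict)  # BDF string -> owning domain id

    def intremap_ok(self) -> bool:
        # Models the security check added for CVE-2011-1898.
        return self.has_intremap or self.override

    def domctl_test_assign_old(self, bdf: str) -> bool:
        # Pre-fix dry run: only checks that dom0 still owns the device.
        return self.owner.get(bdf) == DOM0

    def domctl_test_assign_fixed(self, bdf: str) -> bool:
        # Post-fix dry run: ownership check plus the security check.
        return self.owner.get(bdf) == DOM0 and self.intremap_ok()

    def domctl_assign(self, bdf: str, domid: int) -> bool:
        # Real assignment: refused when either check fails.
        if self.owner.get(bdf) != DOM0 or not self.intremap_ok():
            return False
        self.owner[bdf] = domid
        return True
```

On a host without interrupt remapping, the old dry run reports success and the real assignment then fails, which is the window in which qemu-dm went on to register the device anyway. With the fixed dry run, both calls agree.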

Comment 24 Laszlo Ersek 2011-09-23 14:01:20 UTC
I think device_assigned() is in the wrong file. It shouldn't be in "drivers/passthrough/vtd/iommu.c": it's not Intel specific either by contents or by invocation. It should have been placed in "drivers/passthrough/iommu.c".

This is the current path for XEN_DOMCTL_assign_device (ie. 2nd phase):

arch_do_domctl() [arch/x86/domctl.c]
  -> assign_device() [drivers/passthrough/iommu.c]
    -> intel_iommu_assign_device() [drivers/passthrough/vtd/iommu.c],
       via "intel_iommu_ops.assign_device"
or
    -> amd_iommu_assign_device() [drivers/passthrough/amd/pci_amd_iommu.c],
       via "amd_iommu_ops.assign_device"

So (even without any bugs) "drivers/passthrough/iommu.c" should have implemented
device_assigned(). (The prototype is correctly in the central "include/xen/iommu.h" file.) Starting with XEN_DOMCTL_test_assign_device (ie. 1st phase):

arch_do_domctl() [arch/x86/domctl.c]
  -> device_assigned() [*should* be drivers/passthrough/iommu.c]

planned:

    -> intel_device_assignable() via funcptr [drivers/passthrough/vtd/iommu.c]
or
    -> nothing (for AMD)

Comment 25 Paolo Bonzini 2011-09-23 17:12:16 UTC
Agreed. Likely it was placed there just because VT-d came first and pcifront doesn't need test_assign_device (obviously: there xend does everything).

Comment 26 Paolo Bonzini 2011-09-23 17:13:05 UTC
Actually it should call device_assigned first and, if that succeeds, the function pointer.
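
The ordering proposed here can be sketched as follows. This is a Python model under stated assumptions, not the actual patch: the function signature, the owner_of mapping and the vendor_hook parameter are hypothetical. A vendor_hook of None stands in for the AMD case, where nothing further is checked.

```python
import errno
from typing import Callable, Dict, Optional

DOM0 = 0


def device_assignable(
    owner_of: Dict[str, int],
    bdf: str,
    target_domid: int,
    vendor_hook: Optional[Callable[[int, str], bool]] = None,
) -> int:
    """Return 0 if the device may be assigned, a negative errno otherwise."""
    # Generic check first: the device must exist and still belong to dom0.
    if owner_of.get(bdf) != DOM0:
        return -errno.EINVAL
    # Vendor-specific check second, and only if one is registered.
    if vendor_hook is not None and not vendor_hook(target_domid, bdf):
        return -errno.EINVAL
    return 0
```

Running the generic check first means the vendor hook is never consulted for a device that is already assigned or does not exist.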

Comment 27 Laszlo Ersek 2011-09-23 18:30:48 UTC
Created attachment 524670 [details]
1/2 Propagate target domain within XEN_DOMCTL_test_assign_device

Move device_assigned() from the Intel-specific "iommu.c" to the generic
"iommu.c". Rename it to device_assignable(), make it take a domain
pointer (see later on), and turn the error retval into -EINVAL.

In XEN_DOMCTL_test_assign_device(), look up target domain and pass it on
to device_assignable(). Since that makes most of test_assign_device and
assign_device identical, extract the common parts into prep_assign_dev().

--o--

To format this patch, I passed -U4 -Oordfile to "git format-patch"; where "ordfile" contains

include/xen/iommu.h
drivers/passthrough/vtd/iommu.c
drivers/passthrough/iommu.c
arch/x86/domctl.c

IMHO this facility can be used to make patches more understandable.

Comment 28 Laszlo Ersek 2011-09-23 18:31:48 UTC
Created attachment 524671 [details]
2/2 make the test_assign_device domctl dependent on intremap hardware

Comment 29 Laszlo Ersek 2011-09-26 10:03:10 UTC
Created attachment 524875 [details]
1/2 Propagate target domain within XEN_DOMCTL_test_assign_device (v2)

v1->v2: rename prep_assign_dev() to prep_assign_device()

Comment 30 Laszlo Ersek 2011-09-26 10:05:37 UTC
Created attachment 524876 [details]
2/2 make the test_assign_device domctl dependent on intremap hardware (v2)

v1->v2: in device_assignable(), check the domain-specific assignable() op *after* the more generic check (... whether dom0 owns the device)

Comment 32 Yuyu Zhou 2011-09-26 11:32:39 UTC
Hello Laszlo,

Unfortunately, the test failed.
Hiding the VFs from Domain0 via the pciback driver works, but I cannot boot two guests with VFs assigned.
[root@localhost boot]# xm pci-list-a
0000:03:10.1
0000:03:10.0
0000:03:10.2
0000:03:10.3
0000:03:10.4
0000:03:10.5
0000:03:10.6
0000:03:10.7
0000:03:11.0
0000:03:11.1
0000:03:11.2
0000:03:11.3
0000:03:11.4
0000:03:11.5
[root@localhost boot]# xm cr hvm-R5-1.cfg
Using config file "./hvm-R5-1.cfg".
Error: failed to assign device(3:11.0): maybe it has already been assigned to other domain, or maybe it doesn't exist.

Comment 33 Laszlo Ersek 2011-09-26 11:42:17 UTC
Hi Yuyu,

that was actually a successful test. The host does not have interrupt remapping hardware, and thus PCI passthrough should be prevented. If you check xm dmesg:

(XEN) Interrupt Remapping hardware not found, passing devices
(XEN) to unprivileged domains is insecure.  If you really want
(XEN) to do this, please boot with "iommu=no-intremap".
(XEN) domctl.c:549:d0 XEN_DOMCTL_test_assign_device: 3:11:0 already assigned, or
      non-existent, or denied

(XEN) Interrupt Remapping hardware not found, passing devices
(XEN) to unprivileged domains is insecure.  If you really want
(XEN) to do this, please boot with "iommu=no-intremap".
(XEN) domctl.c:549:d0 XEN_DOMCTL_test_assign_device: 3:11:1 already assigned, or
      non-existent, or denied

Please try to reboot with "iommu=no-intremap" on the xen.gz command line (please reuse my locally-built hv), and then it should work (and be insecure). Thanks!
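
When checking xm dmesg against xm pci-list-a, it can help to convert the device triple in the hypervisor message into the address format that xm prints. The helper below is a small sketch. It assumes the three fields are bus, device and function in hexadecimal, that the message is on a single line, and that the PCI domain is 0000. The function name is made up for illustration.

```python
import re
from typing import Optional

# Matches the "bus:dev:func" triple after the domctl name in the log message.
_DENIED = re.compile(
    r"XEN_DOMCTL_test_assign_device: "
    r"([0-9a-fA-F]+):([0-9a-fA-F]+):([0-9a-fA-F]+) "
)


def denied_bdf(line: str) -> Optional[str]:
    """Return the denied device as 'DDDD:BB:DD.F', or None if no match."""
    m = _DENIED.search(line)
    if m is None:
        return None
    # Assumption: all three fields are hexadecimal.
    bus, dev, func = (int(x, 16) for x in m.groups())
    # Assumption: PCI domain (segment) 0000, as in the xm pci-list-a output.
    return f"0000:{bus:02x}:{dev:02x}.{func:x}"
```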

Comment 34 Yuyu Zhou 2011-09-26 12:08:38 UTC
Hello, Laszlo,
there are still some problems here:

First, it seems that not all VFs can get an IP address and work well in the guest.

Second, the guest becomes a zombie after shutdown or destroy.
[root@localhost boot]# xm list
Name                                      ID Mem(MiB) VCPUs State   Time(s)
Domain-0                                   0     5879     4 r-----     72.4
Zombie-RHEL5.7-64-hvm-1                    1     1024     1 --ps-d     70.0
Zombie-RHEL5.7-64-hvm-2                    2     1024     4 --p--d     25.5
Zombie-RHEL5.7-64-hvm-2                    3     1024     4 --p--d     32.5

Third, a released VF (one whose guest has been destroyed) cannot be assigned to another guest.

Comment 42 Laszlo Ersek 2011-09-27 12:12:22 UTC
The diff between the qemu-dm logs (left side: clean shutdown, right side: zombie):

< xs_read(): vncpasswd get error. /vm/95680ad9-727c-f5e5-a86a-b33b37ce74da/vncpasswd.
---
> xs_read(): vncpasswd get error. /vm/27ce54cb-9931-f2c4-f241-48c00ebb35a2/vncpasswd.
11c11
< xs_read(/vm/95680ad9-727c-f5e5-a86a-b33b37ce74da/rtc/timeoffset): read error
---
> xs_read(/vm/27ce54cb-9931-f2c4-f241-48c00ebb35a2/rtc/timeoffset): read error
74,78c74
< pt_pci_read_config: Warning: Return ALL F from libpci read. [00:06.0][Offset:00h][Length:4]
< pt_msix_update_one: now update msix entry 0 with pirq ff gvec b1
< pt_msix_update_one: now update msix entry 1 with pirq fe gvec b9
< pt_msix_update_one: now update msix entry 2 with pirq fd gvec c1
< xs_write(/vm/95680ad9-727c-f5e5-a86a-b33b37ce74da/rtc/timeoffset, rtc/timeoffset): write error
---
> xs_write(/vm/27ce54cb-9931-f2c4-f241-48c00ebb35a2/rtc/timeoffset, rtc/timeoffset): write error

Comment 43 Paolo Bonzini 2011-09-27 12:54:40 UTC
That looks like the device assignment is failing with your patch despite the iommu=no-intremap.

The xenbus messages seem fine, but perhaps they are indeed a problem.  Look with the -286 (working) hypervisor at /local/domain/0/backend/pci/DOM-ID/0/state and /local/domain/DOM-ID/devices/pci/0/state.  If they are 2 and 1 (InitWait and Initialising), that's fine.  I think that should be the case, since pcifront is not running in the guest (pciback is only being used to hide the device, basically).
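
For reference, the numeric values in those state nodes can be decoded as below. The thread only confirms 1 (Initialising) and 2 (InitWait). The remaining values follow the XenbusState enumeration in Xen's public xenbus header as I recall it, so treat them as an assumption. The helper name is hypothetical, and it takes the state values as strings because that is how xenstore hands them back.

```python
from enum import IntEnum


class XenbusState(IntEnum):
    UNKNOWN = 0
    INITIALISING = 1
    INIT_WAIT = 2
    INITIALISED = 3
    CONNECTED = 4
    CLOSING = 5
    CLOSED = 6


def pciback_only_hiding(backend_state: str, frontend_state: str) -> bool:
    """True if the states match the 'pciback hides, no pcifront' case."""
    backend = XenbusState(int(backend_state))
    frontend = XenbusState(int(frontend_state))
    return (backend is XenbusState.INIT_WAIT
            and frontend is XenbusState.INITIALISING)
```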

Comment 44 Laszlo Ersek 2011-09-27 13:53:51 UTC
Created attachment 525135 [details]
redo in one patch

I must be cursed.

I tried to redo the patch as simply as I possibly could. No refactoring, just adding the necessary bits. ping worked in the guest (again, booted with iommu=no-intremap), but the domain turned into a zombie again!

(In reply to comment #43)
> That looks like the device assignment is failing with your patch despite the
> iommu=no-intremap.

I checked the guest eth0 driver (igbvf) and ping worked. No other interfaces except lo were up, and I also checked the routing table.

> The xenbus messages seem fine, but perhaps they are indeed a problem.

They are not printed when shutting down the guest under pristine 286.

>  Look
> with the -286 (working) hypervisor at
> /local/domain/0/backend/pci/DOM-ID/0/state and
> /local/domain/DOM-ID/devices/pci/0/state.  If they are 2 and 1 (InitWait and
> Initialising), that's fine.  I think that should be the case, since pcifront is
> not running in the guest (pciback is only being used to hide the device,
> basically).

When should I look?

Thanks.

Comment 47 Laszlo Ersek 2011-09-28 11:20:48 UTC
Created attachment 525321 [details]
1/2 Propagate target domain within XEN_DOMCTL_test_assign_device (v3)

v2->v3:
- Remove domain reference at the end of the test_assign_device domctl
- reformat controlling expressions so they conform more
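
The first v3 item is worth a short illustration. The thread does not say so explicitly, but a domain reference taken during the dry run and never released would plausibly explain the zombies seen with the earlier versions: a domain cannot be freed while references to it remain. The Python model below is a sketch under that assumption. The Domain class, its counters and the release_ref switch are all hypothetical and only mimic get/put style reference counting.

```python
class Domain:
    """Toy reference-counted domain."""

    def __init__(self, domid: int) -> None:
        self.domid = domid
        self.refcnt = 1      # reference held for as long as the domain lives
        self.dying = False
        self.freed = False

    def get(self) -> None:
        self.refcnt += 1

    def put(self) -> None:
        self.refcnt -= 1
        if self.refcnt == 0:
            self.freed = True   # last reference gone: resources released

    def destroy(self) -> None:
        self.dying = True
        self.put()              # drop the "alive" reference

    @property
    def zombie(self) -> bool:
        # Destroyed, but something still holds a reference.
        return self.dying and not self.freed


def domctl_test_assign_device(dom: Domain, release_ref: bool) -> int:
    dom.get()                   # look up the target domain
    # ... the assignability checks would run here ...
    if release_ref:
        dom.put()               # v3 behaviour: drop the reference at the end
    return 0
```

With release_ref=False the domain stays a zombie after destroy(). With release_ref=True it is freed normally.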

Comment 68 Yuyu Zhou 2011-10-20 09:23:50 UTC
Hello, Laszlo,
With "iommu=no-intremap", it works fine now.
Thanks.
Yuyu Zhou

Comment 69 Jarod Wilson 2011-10-27 13:12:50 UTC
Patch(es) available in kernel-2.6.18-294.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 71 Martin Prpič 2011-11-29 17:57:35 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
A previously applied patch (introduced as a fix in CVE-2011-1898) prevented PCI pass-through inside the assign_device domctl via a security check. Because the security check was not included in the test_assign_device domctl, qemu-dm could not handle any failures in the test_assign_device domctl, ultimately causing an HVM guest to have a partly accessible PCI device, which in come cases resulted in a crash of the host machine. With this update, the security check introduced in CVE-2011-1898 has been replicated in the test_assign_device domctl, thus fixing this issue.

Comment 72 Laszlo Ersek 2011-11-29 20:48:23 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-A previously applied patch (introduced as a fix in CVE-2011-1898) prevented PCI pass-through inside the assign_device domctl via a security check. Because the security check was not included in the test_assign_device domctl, qemu-dm could not handle any failures in the test_assign_device domctl, ultimately causing an HVM guest to have a partly accessible PCI device, which in come cases resulted in a crash of the host machine. With this update, the security check introduced in CVE-2011-1898 has been replicated in the test_assign_device domctl, thus fixing this issue.
+A previously applied patch (introduced as a fix in CVE-2011-1898) prevented PCI pass-through inside the assign_device domctl via a security check. Because the security check was not included in the test_assign_device domctl as well, qemu-dm may have started to encounter failures in the assign_device domctl, ultimately causing an HVM guest to have a partly accessible PCI device, which in some cases resulted in a crash of the host machine. With this update, the security check introduced in CVE-2011-1898 has been replicated in the test_assign_device domctl, thus fixing this issue.

Comment 73 Yuyu Zhou 2011-12-13 05:14:52 UTC
Reproduced the bug with kernel-xen-2.6.18-284.el5.
Verified the fix with kernel-xen-2.6.18-300.el5, xen-3.0.3-135.el5.

On an HP-Z400, without the "iommu=no-intremap" parameter, I tried to create a
guest with 82576 VFs assigned.
Got the following message in xm dmesg:
(XEN) Interrupt Remapping hardware not found, passing devices
(XEN) to unprivileged domains is insecure.  If you really want
(XEN) to do this, please boot with "iommu=no-intremap".
(XEN) domctl.c:550:d0 XEN_DOMCTL_test_assign_device: 3:10:0 already assigned,
or non-existent, or denied
The guest is not created.

With the "iommu=no-intremap" parameter, a guest with 82576 VFs can be
created successfully and the VFs can be used in the guest. There was no
host crash and no zombies were left during the testing.

So change this bug to VERIFIED.

Comment 76 errata-xmlrpc 2012-02-21 03:56:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0150.html

