Red Hat Bugzilla – Bug 254208
Xen: Mapping random guest pages returns the wrong information
Last modified: 2008-05-21 10:53:50 EDT
+++ This bug was initially created as a clone of Bug #249409 +++
+++ This bug was initially created as a clone of Bug #248947 +++
-- Additional comment from email@example.com on 2007-07-24 10:56 EST --
Created an attachment (id=159853)
Simple test program to map and read arbitrary mfns
Drop into tools/xenfb, compile with
$ gcc -g -O0 -Wall -D__XEN_TOOLS__ -D_LARGEFILE_SOURCE \
    -D_LARGEFILE64_SOURCE -I../../tools/libxc -I../../tools/xenstore \
    -I../../linux-2.6-xen-sparse/include -o crash249409 crash249409.c
Then this crashes a freshly booted dom0 not running any guests:
# ./crash249409 1 3435973664
The crash happens on access of the mapped page.
mm.c:1923:d0 Unknown domain '1'
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at mm/memory.c:2290
invalid opcode: 0000  SMP
last sysfs file: /devices/pci0000:00/0000:00:08.0/irq
Modules linked in: netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat
xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables
x_tables bridge autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 dm_multipath
video sbs backlight i2c_ec button battery asus_acpi ac parport_pc lp parport
snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event sg
snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm forcedeth ide_cd
serio_raw snd_timer pcspkr snd k8_edac edac_mc shpchp floppy cdrom soundcore
8250_pnp 8250 serial_core k8temp hwmon snd_page_alloc i2c_nforce2 i2c_core
dm_snapshot dm_zero dm_mirror dm_mod sata_nv libata sd_mod scsi_mod ext3 jbd
ehci_hcd ohci_hcd uhci_hcd
Pid: 4401, comm: crash249409 Not tainted 2.6.18-34.el5xen #1
RIP: e030:[<ffffffff80208b30>] [<ffffffff80208b30>]
RSP: e02b:ffff8800ce7b9de8 EFLAGS: 00010202
RAX: ffffffff80514840 RBX: 0000000000000560 RCX: 00003ffffffff000
RDX: 00000000ce399560 RSI: 0000000000000067 RDI: ffff8800ea4320c0
RBP: ffff8800ea4320c0 R08: 0000000000000003 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
R13: ffff8800ce399560 R14: 00002aaaaaaac000 R15: ffff8800ce378138
FS: 00002aaaaaac3670(0000) GS:ffffffff80599000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Process crash249409 (pid: 4401, threadinfo ffff8800ce7b8000, task
Stack: 00000000fffffff2 0000000080275de1 ffff8800ea4320c0 ffff8800cefc7aa8
0000000000001000 00002aaaaaaac000 ffff8800ea4320c0 ffffffff80261889
Code: 0f 0b 68 ee 50 47 80 c2 f2 08 49 8b 87 90 00 00 00 48 c7 44
RIP [<ffffffff80208b30>] __handle_mm_fault+0x379/0xf46
<0>Kernel panic - not syncing: Fatal exception
-- Additional comment from firstname.lastname@example.org on 2007-08-24 09:33 EST --
Created an attachment (id=172412)
Patch to fix this crash
This patch is the same one attached to BZ 253583, but I'm going to use this BZ
to track the kernel changes. While the patch does prevent the crash, Markus
correctly points out that with it applied, an invalid xc_map_foreign_batch call
maps pages of zeros instead of cleanly failing. What should really happen is
that the mapping fails outright rather than appearing to succeed. Nevertheless,
I believe we need this patch, and we should open another BZ to track the proper
failure behavior.
This is that "other BZ", to track fixing the mapping for real. While the
current patch prevents the crash, it doesn't do the right thing. What should
really happen is that the HV detects the improper mapping and passes the error
to the dom0 kernel, which would pass it on to the userland app as a failure.
Created attachment 182941 [details]
Simple test program to map and read arbitrary mfns
The foreign map is implemented as an ioctl in privcmd_ioctl(). It uses
direct_remap_pfn_range() to do the actual work. direct_remap_pfn_range()
diagnoses failed hypercalls correctly and returns an error code.
privcmd_ioctl() case IOCTL_PRIVCMD_MMAP, which implements
xc_map_foreign_range(), returns the error code. Case IOCTL_PRIVCMD_MMAPBATCH,
which implements xc_map_foreign_batch(), does not. Instead, it changes the
mfns that failed in the argument array. This appears to be undocumented. Some
users of xc_map_foreign_batch() check the argument array for failure, some
don't.
I'm inclined to classify this as broken-as-designed and close NOTABUG.
Revised test program attached.
On further reflection, I think we have a couple of issues here, and the bug
should remain open to track them.
One is the behavior of IOCTL_PRIVCMD_MMAPBATCH: when it can't map a page, it
changes the mfn to indicate the error, but the exact error code is not
communicated. That's indeed broken as designed, and any fix needs to come from
upstream. It also maps a page of zeros, which doesn't make much sense. Upstream
code is different, and Chris Lalancette suspects that it doesn't map anything
in that case. We had better find out.
The other issue is the users of xc_map_foreign_batch() that neglect to check
for failed mappings. I'm working with upstream to get them fixed there.
OK, I've been looking at this in more detail. First, I took a Xensource 3.1
i386 kernel, and ran it with a RHEL-5.1 HV and RHEL-5.1 userspace tools. In
terms of basic functionality, things work, so it is a good test. When running
the "crash249409" program on the upstream Xen kernel, it is indeed the case that
the offending program gets a SIGBUS when it tries to access these invalid pages.
So this seems to be the behavior we want to shoot for.
Now, in RHEL-5, what I think we actually want to do is to put the do_no_page
handler back into privcmd, and remove VM_PFNMAP from the vma having to do with
privcmd. This fixes the problem on x86_64, and should also cause the
appropriate SIGBUS to be returned to a bad process (as in upstream Xen). I have
to make sure this is the case.
However, when we do what I said above, all guests stop working on i386 (both PV
and HVM). Interestingly, dom0 boots up absolutely fine. However, when trying
to start a PV guest, it always fails with:
[root@dhcp83-141 xenfb]# xm create -c rhel5u1pv
Using config file "/etc/xen/rhel5u1pv".
Error: (1, 'Internal error', 'arch_setup_bootlate: pin_table failed (pfn 0xe73,
Digging deeper, what ends up happening is that the hypercall to pin_table() is
failing. That hypercall is MMUEXT, which gets handled by the hypervisor:
do_mmuext_op() -> get_page_and_type_from_pagenr() -> get_page_type() ->
alloc_page_type() -> alloc_l3_table()
In alloc_l3_table(), there is:
    if ( is_pv_32bit_domain(d) && (i == 3) )
        if ( !(l3e_get_flags(pl3e[i]) & _PAGE_PRESENT) ||
             (l3e_get_flags(pl3e[i]) & l3_disallow_mask(d)) ||
After adding some more debugging to this, what's happening is that the flags of
this page don't have _PAGE_PRESENT set; so this test fails, the hypercall fails,
and the guest creation fails. Note that with the Xensource kernel and our HV
and tools, this works fine, so it is pretty clearly a bug in the RHEL-5 kernel.
I've been combing through the page table stuff in RHEL-5, but I haven't found
the problem yet. It seems pretty clear that the kernel needs to mark the pages
as _PAGE_PRESENT once they get mmap'ed and the IOCTL is run, but I don't see
how or where that would be done yet.
After some further investigation, it is actually userland's responsibility to
fill in the initial pagetables and set _PAGE_PRESENT on the entries, which it
is, in fact, doing. I dumped the table right before hypercalling, and it looked
correct. However, once in the hypervisor, that same page was filled with 0's.
It seems to me that the dom0 kernel is somehow not mapping the pages properly
and is actually faulting in a dom0 page on the writes, so that it is a
*different* page than the one the HV looks at later. So I'm trying to find out
what the difference between our kernel and the Xensource one is in this regard.
Created attachment 217751 [details]
Use the nopfn() handler where nopage() was being used on privcmd
This patch (which needs the previous fix that was included in RHEL-5 to be
applied first) changes the privcmd mmap vma to use nopfn(), the handler that is
supposed to be used for VM_PFNMAP vmas.
This should make the program receive a SIGBUS (as the original nopage() handler
was supposed to deliver) instead of getting an all-zeros page.
Patch untested, as I don't have a RHEL-5 system for testing here right now.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update release.
(In reply to comment #4)
> After some further investigation, it is actually userland's responsibility to
> fill in the initial pagetables and set _PAGE_PRESENT on the entries, which it
> is, in fact, doing. I dumped the table right before hypercalling, and it
> looked correct. However, once in the hypervisor, that same page was filled
> with 0's.
This happens because of the missing VM_PFNMAP flag, which is necessary. It
doesn't happen upstream because upstream doesn't have this patch, present on
RHEL-5:
VM_PFNMAP was already supposed to be set on the mapping upstream, but the
dirty-page-tracking changes make its absence fatal. If the flag is not set,
the VM doesn't know it is a special mapping and tries to catch writes to the
mapping by write-protecting it. Then do_wp_page() gets confused and maps a
zeroed page into the privcmd vma.
The patch in comment #5 applied to 2.6.18-54.el5 works for me: the test program
gets SIGBUS, and I can still start guests.
You can download this test kernel from http://people.redhat.com/dzickus/el5
The -62 kernel compiled by myself works fine with the test program from
comment #1.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.