Bug 254208

Summary: Xen: Mapping random guest pages returns the wrong information
Product: Red Hat Enterprise Linux 5 Reporter: Chris Lalancette <clalance>
Component: kernel-xenAssignee: Markus Armbruster <armbru>
Status: CLOSED ERRATA QA Contact: Martin Jenner <mjenner>
Severity: high Docs Contact:
Priority: medium    
Version: 5.1CC: xen-maint
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2008-0314 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-05-21 14:53:50 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 249409    
Bug Blocks:    
Attachments:
Description Flags
Simple test program to map and read arbitrary mfns
none
Use the nopfn() handler where nopage() was being used on privcmd none

Description Chris Lalancette 2007-08-24 19:16:25 UTC
+++ This bug was initially created as a clone of Bug #249409 +++

+++ This bug was initially created as a clone of Bug #248947 +++

-- Additional comment from armbru on 2007-07-24 10:56 EST --
Created an attachment (id=159853)
Simple test program to map and read arbitrary mfns

Drop into tools/xenfb, compile with

    $ gcc -g -O0 -Wall	-D__XEN_TOOLS__  -D_LARGEFILE_SOURCE
-D_LARGEFILE64_SOURCE  -I../../tools/libxc -I../../tools/xenstore
-I../../linux-2.6-xen-sparse/include -o crash249409 crash249409.c
../libxc/libxenctrl.a ../xenstore/libxenstore.a

Then this crashes a freshly booted dom0 not running any guests:
    # ./crash249409 1 3435973664

The crash happens on access of the mapped page.

Serial console:
mm.c:1923:d0 Unknown domain '1'
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at mm/memory.c:2290
invalid opcode: 0000 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:08.0/irq
CPU 0 
Modules linked in: netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat
xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables
x_tables bridge autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 dm_multipath
video sbs backlight i2c_ec button battery asus_acpi ac parport_pc lp parport
snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event sg
snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm forcedeth ide_cd
serio_raw snd_timer pcspkr snd k8_edac edac_mc shpchp floppy cdrom soundcore
8250_pnp 8250 serial_core k8temp hwmon snd_page_alloc i2c_nforce2 i2c_core
dm_snapshot dm_zero dm_mirror dm_mod sata_nv libata sd_mod scsi_mod ext3 jbd
ehci_hcd ohci_hcd uhci_hcd
Pid: 4401, comm: crash249409 Not tainted 2.6.18-34.el5xen #1
RIP: e030:[<ffffffff80208b30>]	[<ffffffff80208b30>]
__handle_mm_fault+0x379/0xf46
RSP: e02b:ffff8800ce7b9de8  EFLAGS: 00010202
RAX: ffffffff80514840 RBX: 0000000000000560 RCX: 00003ffffffff000
RDX: 00000000ce399560 RSI: 0000000000000067 RDI: ffff8800ea4320c0
RBP: ffff8800ea4320c0 R08: 0000000000000003 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
R13: ffff8800ce399560 R14: 00002aaaaaaac000 R15: ffff8800ce378138
FS:  00002aaaaaac3670(0000) GS:ffffffff80599000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process crash249409 (pid: 4401, threadinfo ffff8800ce7b8000, task
ffff8800ea4ab7a0)
Stack:	00000000fffffff2  0000000080275de1  ffff8800ea4320c0  ffff8800cefc7aa8 

 0000000000001000  00002aaaaaaac000  ffff8800ea4320c0  ffffffff80261889 
 ffff8800ea432128  ffffffff80221ee8 
Call Trace:
 [<ffffffff80261889>] _spin_lock_irqsave+0x9/0x14
 [<ffffffff80221ee8>] __up_read+0x19/0x7f
 [<ffffffff802641db>] do_page_fault+0xe48/0x11dc
 [<ffffffff8030b071>] file_has_perm+0x94/0xa3
 [<ffffffff8025d823>] error_exit+0x0/0x6e


Code: 0f 0b 68 ee 50 47 80 c2 f2 08 49 8b 87 90 00 00 00 48 c7 44 
RIP  [<ffffffff80208b30>] __handle_mm_fault+0x379/0xf46
 RSP <ffff8800ce7b9de8>
 <0>Kernel panic - not syncing: Fatal exception


-- Additional comment from clalance on 2007-08-24 09:33 EST --
Created an attachment (id=172412)
Patch to fix this crash

This patch is the same one that is attached to BZ 253583, but I'm going to use
this BZ to track the kernel changes.  While this patch does prevent the crash,
Markus correctly points out that with this patch, an invalid
xc_map_foreign_batch will map pages of zeros, instead of cleanly failing.  What
should really happen is a failure of mapping, instead of the successful
mapping.  Nevertheless, I believe we need this patch, and we should open
another BZ to track the appropriate failure scenario.

------------------------------------------------------------------------------

This is that "other BZ", to track fixing the mapping for real.  While the
current patch prevents the crash, it doesn't do the right thing.  What should
really happen is that the HV should detect this improper mapping, pass the error
to the dom0 kernel, which would pass it on to the userland app as a failure case.

Comment 1 Markus Armbruster 2007-08-31 08:07:46 UTC
Created attachment 182941 [details]
Simple test program to map and read arbitrary mfns

The foreign map is implemented as ioctl in privcmd_ioctl().  It uses
direct_remap_pfn_range() to do the actual work. direct_remap_pfn_range()
diagnoses failed hypercalls correctly and returns an error code. 
privcmd_ioctl() case IOCTL_PRIVCMD_MMAP, which implements
xc_map_foreign_range(), returns the error code.  Case IOCTL_PRIVCMD_MMAPBATCH,
which implements xc_map_foreign_batch(), does not.  Instead, it changes the
mfns that failed in the argument array.  This appears to be undocumented.  Some
users of xc_map_foreign_batch() check the argument array for failure, some
don't.

I'm inclined to classify this broken-as-designed and close NOTABUG.

Revised test program attached.

Comment 2 Markus Armbruster 2007-09-05 15:14:32 UTC
On further reflection, I think we have a couple of issues here, and the bug
should remain open to track them.

One is behavior of IOCTL_PRIVCMD_MMAPBATCH: when it can't map a page, it changes
the mfn to indicate the error, but the exact error code is not communicated. 
That's indeed broken as designed, and any fix needs to come from upstream.  It
also maps a page of zeros, which doesn't make much sense.  Upstream code is
different, and Chris Lalancette suspects that it doesn't map anything in that
case.  We better find out.

Another are the users of xc_map_foreign_batch() that neglect to check for failed
mappings.  I'm working with upstream to get them fixed there.

Comment 3 Chris Lalancette 2007-09-13 14:45:36 UTC
OK, I've been looking at this in more detail.  First, I took a Xensource 3.1
i386 kernel, and ran it with a RHEL-5.1 HV and RHEL-5.1 userspace tools.  In
terms of basic functionality, things work, so it is a good test.  When running
the "crash249409" program on the upstream Xen kernel, it is indeed the case that
the offending program gets a SIGBUS when it tries to map these invalid pages. 
So this seems to be the behavior we want to shoot for.

Now, in RHEL-5, what I think we actually want to do is to put the do_no_page
handler back into privcmd, and remove VM_PFNMAP from the vma having to do with
privcmd.  This fixes the problem on x86_64, and should also cause the
appropriate SIGBUS to be returned to a bad process (as in upstream Xen).  I have
to make sure this is the case.

However, when we do what I said above, all guests stop working on i386 (both PV
and HVM).  Interestingly, dom0 boots up absolutely fine.  However, when trying
to start a PV guest, they always fail with:

[root@dhcp83-141 xenfb]# xm create -c rhel5u1pv
Using config file "/etc/xen/rhel5u1pv".
Error: (1, 'Internal error', 'arch_setup_bootlate: pin_table failed (pfn 0xe73,
rc=1)\n')
[root@dhcp83-141 xenfb]# 

Digging deeper, what ends up happening is that the hypercall to pin_table() is
failing.  That hypercall is MMUEXT, which gets handled by the hypervisor:

do_mmuext_op() -> get_page_and_type_from_pagenr() -> get_page_type() ->
alloc_page_type() -> alloc_l3_table()

In alloc_l3_table(), there is:

        if ( is_pv_32bit_domain(d) && (i == 3) )
        {
            if ( !(l3e_get_flags(pl3e[i]) & _PAGE_PRESENT) ||
                 (l3e_get_flags(pl3e[i]) & l3_disallow_mask(d)) ||
                 !get_page_and_type_from_pagenr(l3e_get_pfn(pl3e[i]),
                                                PGT_l2_page_table |
                                                PGT_pae_xen_l2,
                                                d) )
                goto fail;

After adding some more debugging to this, what's happening is that the flags of
this page don't have _PAGE_PRESENT set; so this test fails, the hypercall fails,
and the guest creation fails.  Note that with the Xensource kernel and our HV
and tools, this works fine, so it is pretty clearly a bug in the RHEL-5 kernel.  

I've been combing through the page table stuff in RHEL-5, and I haven't found
the problem, but I haven't found it yet.  It seems pretty clear that the kernel
needs to mark the pages as PAGE_PRESENT once they get mmap'ed and the IOCTL is
run; but I don't see how or where that would be done yet.

Chris Lalancette

Comment 4 Chris Lalancette 2007-09-14 19:47:02 UTC
After some further investigation, it is actually userland's responsibility to
fill in the initial pagetables and set _PAGE_PRESENT on the entries, which it
is, in fact, doing.  I dumped the table right before hypercalling, and it looked
correct.  However, once in the hypervisor, that same page was filled with 0's. 
It seems to me that the dom0 kernel is somehow not mapping the pages properly
and is actually faulting in a dom0 page on the writes, so that it is a
*different* page than the one the HV looks at later.  So I'm trying to find out
what the difference between our kernel and the Xensource one is in this regard.

Chris Lalancette

Comment 5 Eduardo Habkost 2007-10-05 17:34:09 UTC
Created attachment 217751 [details]
Use the nopfn() handler where nopage() was being used on privcmd

This patch (that needs the previous fix that was included on RHEL-5 to be
applied) changes the privcmd mmap vma to use nopfn(), that is supposed to be
used for VM_PFNMAP vmas.

This should make the program receive a SIGBUS (as the original nopage() handler
was supposed to do) instead of getting a all-zeros page.

Patch untested, as I don't have a RHEL-5 system for testing here right now.

Comment 6 RHEL Program Management 2007-10-16 03:44:01 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 7 Eduardo Habkost 2007-10-25 17:16:13 UTC
(In reply to comment #4)
> After some further investigation, it is actually userland's responsibility 
to
> fill in the initial pagetables and set _PAGE_PRESENT on the entries, which 
it
> is, in fact, doing.  I dumped the table right before hypercalling, and it 
looked
> correct.  However, once in the hypervisor, that same page was filled with 
0's. 

This happens because of the missing VM_PFNMAP flag, that is necessary. It 
doesn't happen upstream because it doesn't have this patch, present on RHEL-5:

linux-2.6-mm-tracking-dirty-pages.patch

VM_PFNMAP was already supposed to be on the mapping upstream, but 
tracking-dirty-pages make it fatal. If the flag is not set, the VM doesn't 
know it is a special mapping and tries to catch the writes to the mapping by 
write-protecting it. Then, do_wp_page() gets confused and maps a zeroed page 
in the privcmd vma.

Comment 8 Markus Armbruster 2007-10-25 17:59:44 UTC
The patch in comment#5 applied to 2.6.18-54.el5 works for me: the test program
gets SIGBUS, and I can still start guests.


Comment 10 Don Zickus 2007-12-21 20:17:45 UTC
in 2.6.18-62.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 11 Markus Armbruster 2008-01-03 15:43:47 UTC
-62 compiled by myself works fine with the test program from comment#1.


Comment 14 errata-xmlrpc 2008-05-21 14:53:50 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html