+++ This bug was initially created as a clone of Bug #249409 +++ +++ This bug was initially created as a clone of Bug #248947 +++ -- Additional comment from armbru on 2007-07-24 10:56 EST -- Created an attachment (id=159853) Simple test program to map and read arbitrary mfns Drop into tools/xenfb, compile with $ gcc -g -O0 -Wall -D__XEN_TOOLS__ -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -I../../tools/libxc -I../../tools/xenstore -I../../linux-2.6-xen-sparse/include -o crash249409 crash249409.c ../libxc/libxenctrl.a ../xenstore/libxenstore.a Then this crashes a freshly booted dom0 not running any guests: # ./crash249409 1 3435973664 The crash happens on access of the mapped page. Serial console: mm.c:1923:d0 Unknown domain '1' ----------- [cut here ] --------- [please bite here ] --------- Kernel BUG at mm/memory.c:2290 invalid opcode: 0000 [1] SMP last sysfs file: /devices/pci0000:00/0000:00:08.0/irq CPU 0 Modules linked in: netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 dm_multipath video sbs backlight i2c_ec button battery asus_acpi ac parport_pc lp parport snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event sg snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm forcedeth ide_cd serio_raw snd_timer pcspkr snd k8_edac edac_mc shpchp floppy cdrom soundcore 8250_pnp 8250 serial_core k8temp hwmon snd_page_alloc i2c_nforce2 i2c_core dm_snapshot dm_zero dm_mirror dm_mod sata_nv libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd Pid: 4401, comm: crash249409 Not tainted 2.6.18-34.el5xen #1 RIP: e030:[<ffffffff80208b30>] [<ffffffff80208b30>] __handle_mm_fault+0x379/0xf46 RSP: e02b:ffff8800ce7b9de8 EFLAGS: 00010202 RAX: ffffffff80514840 RBX: 0000000000000560 RCX: 00003ffffffff000 RDX: 00000000ce399560 RSI: 0000000000000067 RDI: ffff8800ea4320c0 RBP: ffff8800ea4320c0 R08: 0000000000000003 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000 R13: ffff8800ce399560 R14: 00002aaaaaaac000 R15: ffff8800ce378138 FS: 00002aaaaaac3670(0000) GS:ffffffff80599000(0000) knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000 Process crash249409 (pid: 4401, threadinfo ffff8800ce7b8000, task ffff8800ea4ab7a0) Stack: 00000000fffffff2 0000000080275de1 ffff8800ea4320c0 ffff8800cefc7aa8 0000000000001000 00002aaaaaaac000 ffff8800ea4320c0 ffffffff80261889 ffff8800ea432128 ffffffff80221ee8 Call Trace: [<ffffffff80261889>] _spin_lock_irqsave+0x9/0x14 [<ffffffff80221ee8>] __up_read+0x19/0x7f [<ffffffff802641db>] do_page_fault+0xe48/0x11dc [<ffffffff8030b071>] file_has_perm+0x94/0xa3 [<ffffffff8025d823>] error_exit+0x0/0x6e Code: 0f 0b 68 ee 50 47 80 c2 f2 08 49 8b 87 90 00 00 00 48 c7 44 RIP [<ffffffff80208b30>] __handle_mm_fault+0x379/0xf46 RSP <ffff8800ce7b9de8> <0>Kernel panic - not syncing: Fatal exception -- Additional comment from clalance on 2007-08-24 09:33 EST -- Created an attachment (id=172412) Patch to fix this crash This patch is the same one that is attached to BZ 253583, but I'm going to use this BZ to track the kernel changes. While this patch does prevent the crash, Markus correctly points out that with this patch, an invalid xc_map_foreign_batch will map pages of zeros, instead of cleanly failing. What should really happen is a failure of mapping, instead of the successful mapping. Nevertheless, I believe we need this patch, and we should open another BZ to track the appropriate failure scenario. ------------------------------------------------------------------------------ This is that "other BZ", to track fixing the mapping for real. While the current patch prevents the crash, it doesn't do the right thing. What should really happen is that the HV should detect this improper mapping, pass the error to the dom0 kernel, which would pass it on to the userland app as a failure case.
Created attachment 182941 [details] Simple test program to map and read arbitrary mfns The foreign map is implemented as ioctl in privcmd_ioctl(). It uses direct_remap_pfn_range() to do the actual work. direct_remap_pfn_range() diagnoses failed hypercalls correctly and returns an error code. privcmd_ioctl() case IOCTL_PRIVCMD_MMAP, which implements xc_map_foreign_range(), returns the error code. Case IOCTL_PRIVCMD_MMAPBATCH, which implements xc_map_foreign_batch(), does not. Instead, it changes the mfns that failed in the argument array. This appears to be undocumented. Some users of xc_map_foreign_batch() check the argument array for failure, some don't. I'm inclined to classify this broken-as-designed and close NOTABUG. Revised test program attached.
On further reflection, I think we have a couple of issues here, and the bug should remain open to track them. One is behavior of IOCTL_PRIVCMD_MMAPBATCH: when it can't map a page, it changes the mfn to indicate the error, but the exact error code is not communicated. That's indeed broken as designed, and any fix needs to come from upstream. It also maps a page of zeros, which doesn't make much sense. Upstream code is different, and Chris Lalancette suspects that it doesn't map anything in that case. We better find out. Another are the users of xc_map_foreign_batch() that neglect to check for failed mappings. I'm working with upstream to get them fixed there.
OK, I've been looking at this in more detail. First, I took a Xensource 3.1 i386 kernel, and ran it with a RHEL-5.1 HV and RHEL-5.1 userspace tools. In terms of basic functionality, things work, so it is a good test. When running the "crash249409" program on the upstream Xen kernel, it is indeed the case that the offending program gets a SIGBUS when it tries to map these invalid pages. So this seems to be the behavior we want to shoot for. Now, in RHEL-5, what I think we actually want to do is to put the do_no_page handler back into privcmd, and remove VM_PFNMAP from the vma having to do with privcmd. This fixes the problem on x86_64, and should also cause the appropriate SIGBUS to be returned to a bad process (as in upstream Xen). I have to make sure this is the case. However, when we do what I said above, all guests stop working on i386 (both PV and HVM). Interestingly, dom0 boots up absolutely fine. However, when trying to start a PV guest, they always fail with: [root@dhcp83-141 xenfb]# xm create -c rhel5u1pv Using config file "/etc/xen/rhel5u1pv". Error: (1, 'Internal error', 'arch_setup_bootlate: pin_table failed (pfn 0xe73, rc=1)\n') [root@dhcp83-141 xenfb]# Digging deeper, what ends up happening is that the hypercall to pin_table() is failing. That hypercall is MMUEXT, which gets handled by the hypervisor: do_mmuext_op() -> get_page_and_type_from_pagenr() -> get_page_type() -> alloc_page_type() -> alloc_l3_table() In alloc_l3_table(), there is: if ( is_pv_32bit_domain(d) && (i == 3) ) { if ( !(l3e_get_flags(pl3e[i]) & _PAGE_PRESENT) || (l3e_get_flags(pl3e[i]) & l3_disallow_mask(d)) || !get_page_and_type_from_pagenr(l3e_get_pfn(pl3e[i]), PGT_l2_page_table | PGT_pae_xen_l2, d) ) goto fail; After adding some more debugging to this, what's happening is that the flags of this page don't have _PAGE_PRESENT set; so this test fails, the hypercall fails, and the guest creation fails. Note that with the Xensource kernel and our HV and tools, this works fine, so it is pretty clearly a bug in the RHEL-5 kernel. I've been combing through the page table stuff in RHEL-5, and I haven't found the problem, but I haven't found it yet. It seems pretty clear that the kernel needs to mark the pages as PAGE_PRESENT once they get mmap'ed and the IOCTL is run; but I don't see how or where that would be done yet. Chris Lalancette
After some further investigation, it is actually userland's responsibility to fill in the initial pagetables and set _PAGE_PRESENT on the entries, which it is, in fact, doing. I dumped the table right before hypercalling, and it looked correct. However, once in the hypervisor, that same page was filled with 0's. It seems to me that the dom0 kernel is somehow not mapping the pages properly and is actually faulting in a dom0 page on the writes, so that it is a *different* page than the one the HV looks at later. So I'm trying to find out what the difference between our kernel and the Xensource one is in this regard. Chris Lalancette
Created attachment 217751 [details] Use the nopfn() handler where nopage() was being used on privcmd This patch (that needs the previous fix that was included on RHEL-5 to be applied) changes the privcmd mmap vma to use nopfn(), that is supposed to be used for VM_PFNMAP vmas. This should make the program receive a SIGBUS (as the original nopage() handler was supposed to do) instead of getting a all-zeros page. Patch untested, as I don't have a RHEL-5 system for testing here right now.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
(In reply to comment #4) > After some further investigation, it is actually userland's responsibility to > fill in the initial pagetables and set _PAGE_PRESENT on the entries, which it > is, in fact, doing. I dumped the table right before hypercalling, and it looked > correct. However, once in the hypervisor, that same page was filled with 0's. This happens because of the missing VM_PFNMAP flag, that is necessary. It doesn't happen upstream because it doesn't have this patch, present on RHEL-5: linux-2.6-mm-tracking-dirty-pages.patch VM_PFNMAP was already supposed to be on the mapping upstream, but tracking-dirty-pages make it fatal. If the flag is not set, the VM doesn't know it is a special mapping and tries to catch the writes to the mapping by write-protecting it. Then, do_wp_page() gets confused and maps a zeroed page in the privcmd vma.
The patch in comment#5 applied to 2.6.18-54.el5 works for me: the test program gets SIGBUS, and I can still start guests.
in 2.6.18-62.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
-62 compiled by myself works fine with the test program from comment#1.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0314.html