Bug 771764 (CVE-2012-0028)

Summary: CVE-2012-0028 kernel: futex: clear robust_list on execve
Product: [Other] Security Response
Reporter: Petr Matousek <pmatouse>
Component: vulnerability
Assignee: Red Hat Product Security <security-response-team>
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: unspecified
CC: agordeev, anton, arozansk, bhu, davej, dhoward, fhrbata, gansalmon, itamar, jkacur, jonathan, jwboyer, kernel-maint, kernel-mgr, lgoncalv, lwang, madhu.chinakonda, plougher, sforsber, rt-maint, vgoyal, williams
Keywords: Security
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2012-05-04 08:48:04 UTC
Bug Depends On: 750283, 771774, 789370
Bug Blocks: 770893

Description Petr Matousek 2012-01-04 22:08:20 UTC
Move "exit_robust_list" into mm_release() and clear them

We don't want to get rid of the futexes just at exit() time, we want to
drop them when doing an execve() too, since that gets rid of the
previous VM image too.

Doing it at mm_release() time means that we automatically always do it
when we disassociate a VM map from the task.

Upstream patches:
8141c7f3e7aee618312fa1c15109e1219de784a7
fc6b177dee33365ccb29fe6d2092223cf8d679f9
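
In essence, the two commits move the robust-list cleanup into mm_release() and nullify the pointers after the walk. An abridged sketch of the resulting kernel/fork.c logic (see the commits themselves for the real thing):

    void mm_release(struct task_struct *tsk, struct mm_struct *mm)
    {
            /* Get rid of any futexes whenever the task drops its VM
             * image -- on exit() *and* on execve(). */
    #ifdef CONFIG_FUTEX
            if (unlikely(tsk->robust_list)) {
                    exit_robust_list(tsk);
                    tsk->robust_list = NULL;
            }
    #ifdef CONFIG_COMPAT
            if (unlikely(tsk->compat_robust_list)) {
                    compat_exit_robust_list(tsk);
                    tsk->compat_robust_list = NULL;
            }
    #endif
    #endif
            /* ... the rest of mm_release() continues as before ... */
    }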

Comment 1 Kurt Seifried 2012-01-04 22:20:23 UTC
Assigned CVE-2012-0028 as per http://www.openwall.com/lists/oss-security/2012/01/04/18

Comment 4 Petr Matousek 2012-02-01 16:40:24 UTC
Below are the most relevant comments extracted from our per-product tracking bug (which we do not open to the public). These comments describe the steps undertaken to find the root cause of the problem.

---------------------------------------------------------------------------

Description of problem:

On an ia64 HVM guest I see this:

1) after the guest is installed and rebooted, virt-manager's UI shows the guest console in a VNC window
2) I can see the console messages before firstboot.
3) Then firstboot starts and I go through all the steps.
4) At the end firstboot exits and the OS should load the GDM login window.
5) At this point the VNC widget disconnects from the guest and can't connect anymore.

Version-Release number of selected component (if applicable):
2.6.18-294

How reproducible:
1/1

Steps to Reproduce:
1. See above. Tried with 5.7 Server ia64 host and 5.8 Beta 1.0 guest. 
2. Created HVM guest and used NFS iso as the source.

Additional info:
Booting in runlevel 3 has no issues; I can see the login prompt. 

After logging into runlevel 3 and doing init 5, Xorg + GDM load just fine. 

I'm not sure if this has something to do with the guest (kernel or Xorg drivers) or with the host (VNC widget).

This is on host hp-rx2660-01.rhts.eng.brq.redhat.com.

--- Additional comment from mrezanin on 2011-12-13 09:05:23 EST ---

The problem is that qemu crashes. So the guest is on from Xen's point of view, but not actually running. You can verify this with 'xm list' -- there should be no state shown for the guest.

--- Additional comment from mrezanin on 2011-12-16 10:23:21 EST ---

I retested the problem with a RHEL 5.7 guest -- it works OK, without problems. I updated the guest kernel to the RHEL 5.8 Beta version and it is still OK. So this is definitely a regression, but not a Xen one.


As the logs show, the problem is a bad I/O request -- addr: 33a8, size: 8. Package bisecting revealed the problem to be in the glibc update (glibc, glibc-common, nscd). Further bisecting within this is needed to find the cause of the problem. The problem is somewhere between glibc 2.5-66 and 2.5-72. I was unable to test a glibc older than 2.5-72.

Reassign to glibc for bisecting.

--- Additional comment from drjones on 2011-12-21 07:52:03 EST ---

glibc changelog for -67

* Wed Aug 17 2011 Andreas Schwab <schwab> - 2.5-67
- ldd: never run file directly (#531160)
- Implement greedy matching of weekday and month names (#657570)
- Fix incorrect numeric settings (#675259)
- Implement new mode for NIS passwd.adjunct.byname table (#678318)
- Query NIS domain only when needed (#703345)
- Count total processors using sysfs (#706894)
- Translate clone error if necessary (#707998)
- Workaround kernel clobbering robust list (#711531)

Back on the xen side, Miro says he found a qemu log

inp: bad size: 4004 8

which means that something in the guest attempted to read ioport 0x4004, and attempted to read 8 bytes at once, which isn't legal. That I/O address isn't mapped in the guest either, see the guest's ioports below.

Now we also see that the reason qemu exits is that it was designed to do so in this case:

unsigned long do_inp(CPUState *env, unsigned long addr, unsigned long size)
{
    switch(size) {
    case 1:
        return cpu_inb(env, addr);
    case 2:
        return cpu_inw(env, addr);
    case 4:
        return cpu_inl(env, addr);
    default:
        fprintf(logfile, "inp: bad size: %lx %lx\n", addr, size);
        exit(-1);
    }
}   

I don't see how that error is related to the glibc version unless the _inb() function and friends (which are written in C for ia64) are somehow compiled differently with -67. Would "ldd: never run file directly (#531160)" have changed anything like that?

---

# cat /proc/ioports 
00000000-00000cf7 : PCI Bus 0000:00
  00000170-00000177 : ide1
  000001f0-000001f7 : ide0
  00000376-00000376 : ide1
  000003c0-000003df : vga+
  000003f6-000003f6 : ide0
  000003f8-000003ff : serial
00000d00-0000ffff : PCI Bus 0000:00
  00001f40-00001f7f : 0000:00:01.2
    00001f40-00001f43 : ACPI PM1a_EVT_BLK
    00001f44-00001f45 : ACPI PM1a_CNT_BLK
    00001f48-00001f4b : ACPI PM_TMR
  00002000-000020ff : 0000:00:04.0
    00002000-000020ff : 8139cp
  00002100-000021ff : 0000:00:03.0
    00002100-000021ff : xen-platform-pci
  00002200-0000220f : 0000:00:01.1
    00002200-00002207 : ide0
    00002208-0000220f : ide1

--- Additional comment from lersek on 2011-12-22 06:19:45 EST ---

(In reply to comment #23)

> If that's true, can we use the qemu state to map back to the executing code in
> the guest kernel?

We discussed this with Drew yesterday. Let me build a xen package that hangs (pauses) qemu-dm instead of exiting on that error path. This way the insn emulation will take a very long time from the guest's point of view, and we should have a chance to dump the guest's core for analysis with crash.

--- Additional comment from lersek on 2011-12-22 09:23:09 EST ---

Guys in the CC, any help would be greatly appreciated. Thanks!

Analysing the guest coredump. Looking at "docs/misc/dump-core-format.txt" and
"tools/libxc/xc_core.c" in xen-userspace:

    +--------------------------------------------------------+
    |ELF header                                              |
    +--------------------------------------------------------+
    |section headers                                         |
    |    [...]                                               |
    +--------------------------------------------------------+
    |.note.Xen:note section                                  |
    |    [...]                                               |
    +--------------------------------------------------------+
    |.xen_prstatus                                           |
    |       vcpu_guest_context_t[nr_vcpus]                   |
    +--------------------------------------------------------+


Hypervisor source, "include/public/arch-ia64.h", extract:

    struct vcpu_guest_context {
        unsigned long flags;       /* VGCF_* flags */
        struct cpu_user_regs user_regs;

    struct cpu_user_regs {
        /* The following registers are saved by SAVE_MIN: */
        unsigned long b6;  /* scratch */
        unsigned long b7;  /* scratch */

        unsigned long ar_csd; /* used by cmp8xchg16 (scratch) */
        unsigned long ar_ssd; /* reserved for future use (scratch) */

        unsigned long r8;  /* scratch (return value register 0) */
        unsigned long r9;  /* scratch (return value register 1) */
        unsigned long r10; /* scratch (return value register 2) */
        unsigned long r11; /* scratch (return value register 3) */

        unsigned long cr_ipsr; /* interrupted task's psr */
        unsigned long cr_iip;  /* interrupted task's instruction pointer */

So we have to skip 10 u64's to get at the instruction pointer.
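
(To sanity-check that arithmetic from userspace, a small sketch mirroring just the fields quoted above -- assuming LP64 and no implicit padding:)

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    struct cpu_user_regs_prefix {          /* leading fields only */
        uint64_t b6, b7;
        uint64_t ar_csd, ar_ssd;
        uint64_t r8, r9, r10, r11;
        uint64_t cr_ipsr;
        uint64_t cr_iip;
    };

    struct vcpu_guest_context_prefix {
        uint64_t flags;
        struct cpu_user_regs_prefix user_regs;
    };

    int main(void)
    {
        /* Prints 0x50: ten u64's in, as expected. */
        printf("cr_iip at offset 0x%zx\n",
               offsetof(struct vcpu_guest_context_prefix,
                        user_regs.cr_iip));
        return 0;
    }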

    [root@hp-rx2660-02 dump]# readelf -x \
        .xen_prstatus 2011-1222-0707.03-virt2.27.core

    Hex dump of section '.xen_prstatus':
      0x00000000 00000000 00000000 00000000 00000000 ................
      0x00000010 a0000001 000102b0 a0000001 000102e0 ................
      0x00000020 00000000 00000000 00000000 00000000 ................
      0x00000030 00000000 00000263 00000000 00000000 ........c.......
      0x00000040 20000000 018ea500 00000000 00000000 ...............
      0x00000050 a0000001 000c2160 00005010 08da6010 .`...P..`!......
      0x00000060 00000000 00000000 80000000 0000050d ................
      [...]

The instruction pointer is a0000001000c2160. In crash:

    crash> dis 0xa0000001000c2160
    0xa0000001000c2160 <exit_robust_list+128>:      [MII]       ld8 r9=[r33]

(Also, the offending process is Xorg, according to crash.)

With addr2line:

    [root@hp-rx2660-02 dump]# addr2line -ife \
        /usr/lib/debug/lib/modules/2.6.18-300.el5/vmlinux <<< a0000001000c2160

    fetch_robust_entry
    /usr/src/debug/kernel-2.6.18/linux-2.6.18-300.el5.ia64/kernel/futex.c:1989
    exit_robust_list
    /usr/src/debug/kernel-2.6.18/linux-2.6.18-300.el5.ia64/kernel/futex.c:2016

So, the problem is triggered by the glibc commit linked in bug 711531 comment
4: http://repo.or.cz/w/glibc.git/commitdiff/6f8326ca

Due to the patch, glibc calls sys_set_robust_list() in the child after fork(),
if I understand correctly, which sets "robust_list" in the task_struct of the
process. This will make the kernel walk the list of robust mutexen still held
(not yet unlocked) when the child dies (see "Documentation/robust-futex-ABI.txt").
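
In userspace terms, the glibc change boils down to something like this sketch (hypothetical; real glibc keeps the list head in TLS and re-registers it from its fork machinery):

    #include <linux/futex.h>       /* struct robust_list_head */
    #include <sys/syscall.h>
    #include <unistd.h>

    static struct robust_list_head head;   /* per-thread in real life */

    /* Re-register the robust list with the kernel, e.g. in the child
     * after fork(); this is what sets task_struct.robust_list. */
    static void register_robust_list(void)
    {
        head.list.next = &head.list;        /* empty circular list */
        syscall(SYS_set_robust_list, &head, sizeof head);
    }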

This walk is implemented by exit_robust_list(), which is exactly the function
where we die during the emulation on IA64 [kernel/kernel/futex.c]:

    /*
     * Walk curr->robust_list (very carefully, it's a userspace list!)
     * and mark any locks found there dead, and notify any waiters.
     *
     * We silently return on any sign of list-walking problem.
     */
    void exit_robust_list(struct task_struct *curr)
    {
            struct robust_list_head __user *head = curr->robust_list;
            struct robust_list __user *entry, *next_entry, *pending;
            unsigned int limit = ROBUST_LIST_LIMIT, pi, next_pi, pip;
            unsigned long futex_offset;
            int rc;

            /*
             * Fetch the list head (which was registered earlier, via
             * sys_set_robust_list()):
             */
[2016]      if (fetch_robust_entry(&entry, &head->list.next, &pi))
                    return;

    /*
     * Fetch a robust-list pointer. Bit 0 signals PI futexes:
     */
    static inline int fetch_robust_entry(struct robust_list __user **entry,
                                         struct robust_list __user **head,
                                         int *pi)
    {
            unsigned long uentry;

[1989]      if (get_user(uentry, (unsigned long *)head))
                    return -EFAULT;

get_user() [include/asm-ia64/uaccess.h]
-> __get_user_check()
  -> __do_get_user()
    -> __get_user_size()

__get_user_size() has two implementations, dependent on the macro
ASM_SUPPORTED. The feature test macro is defined in
[include/asm-ia64/gcc_intrin.h]. The corresponding __get_user_size()
implementation seems to use r8 and r9; my eyes sting looking at it so I won't
quote it. Anyway, refer back to the disassembly above:

    crash> dis 0xa0000001000c2160
    0xa0000001000c2160 <exit_robust_list+128>:      [MII]       ld8 r9=[r33]

qemu-dm crashes during get_user(), with a size arg of 8 bytes. Drew says it
might be an address corruption problem -- the instruction could be decoded all
fine, but the address is frobnicated so that the emulator thinks it's in MMIO
range, and tries to translate it to port IO, and qemu-dm chokes on the 8-byte
size.

r33 is a "strange" register, to say the least; the IA64 manual states that
GR32 to GR127 is "IA-32 code execution space" and that "Convention" is
"undefined" -- "All registers in the current and prior register frames are
left in an undefined state after IA-32 execution. Software must preserve these
values before entering the IA-32 instruction set".

OTOH I couldn't find anything above r31 in the "cpu_user_regs" structure in
the hypervisor's "include/public/arch-ia64.h" source file. Grepping the hv
tree further for "r33" yields hits in:

- arch/ia64/linux/memcpy_mck.S:
  "Itanium 2-optimized version of memcpy and copy_user function"

- arch/ia64/linux/sn/kernel/pio_phys.S:
  pio_phys_write_mmr, pio_atomic_phys_write_mmrs

- arch/ia64/xen/ivt.S:
  "interruption vector table"

- include/asm-ia64/dom_fw.h:
  "Xen domain firmware emulation"

We should figure out if this is a glibc-kernel problem (like not detecting
invalid userspace data in the kernel), or a kernel-hypervisor problem (like
misdecoding a valid kernel access in the hypervisor).

--- Additional comment from lersek on 2011-12-22 17:25:43 EST ---

I was intrigued how get_user() could blow up -- why does it trap to the hypervisor *at all*? Thankfully, "include/asm-ia64/uaccess.h" has an intro comment that explains __get_user_size():

/*
 * This file defines various macros to transfer memory areas across
 * the user/kernel boundary.  This needs to be done carefully because
 * this code is executed in kernel mode and uses user-specified
 * addresses.  Thus, we need to be careful not to let the user to
 * trick us into accessing kernel memory that would normally be
 * inaccessible.  This code is also fairly performance sensitive,
 * so we want to spend as little time doing safety checks as
 * possible.
 *
 * To make matters a bit more interesting, these macros sometimes also
 * called from within the kernel itself, in which case the address
 * validity check must be skipped.  The get_fs() macro tells us what
 * to do: if get_fs()==USER_DS, checking is performed, if
 * get_fs()==KERNEL_DS, checking is bypassed.
 *
 * Note that even if the memory area specified by the user is in a
 * valid address range, it is still possible that we'll get a page
 * fault while accessing it.  This is handled by filling out an
 * exception handler fixup entry for each instruction that has the
 * potential to fault.  When such a fault occurs, the page fault
 * handler checks to see whether the faulting instruction has a fixup
 * associated and, if so, sets r8 to -EFAULT and clears r9 to 0 and
 * then resumes execution at the continuation point.
 *
 * Based on <asm-alpha/uaccess.h>.
 *
 * Copyright (C) 1998, 1999, 2001-2004 Hewlett-Packard Co
 *	David Mosberger-Tang <davidm.com>
 */

The page fault could trap to the hypervisor. Instead of giving it back to the domU fault handler, the hypervisor could try to translate it to MMIO.

Drew identified a hypervisor step -- emulate_io_inst() -- very close to qemu-dm in comment 16. It's called from the following function [arch/ia64/vmx/vmx_process.c]:

    /* We came here because the H/W VHPT walker failed to find an entry */
    IA64FAULT
    vmx_hpw_miss(u64 vadr , u64 vec, REGS* regs)

This function has two call sites, looking like:

                [ conditions snipped ]
                emulate_io_inst(v,((vadr<<1)>>1),4);   //  UC
                return IA64_FAULT;

and

                if (data->pl >= ((regs->cr_ipsr >> IA64_PSR_CPL0_BIT) & 3))
                    emulate_io_inst(v, gppa, data->ma);
                else {
                    vcpu_set_isr(v, misr.val);
                    data_access_rights(v, vadr);
                }
                return IA64_FAULT;

That is, whichever emulate_io_inst() call gets exercised, the vmx_hpw_miss() handler returns with IA64_FAULT. I'll make a big jump here, but if this ultimately returns control to the domU kernel, then fetch_robust_entry() will return with -EFAULT, short-circuiting exit_robust_list(), and the domU simply won't care -- it's prepared for that.

This is consistent with Mirek's testing: when he removed the exit(-1) call from qemu-dm, everything worked okay. The emulated IO instruction should have turned into a NOP (because qemu-dm didn't do anything except complain), and the guest kernel happily processed the expected EFAULT in exit_robust_list().

I think this is a hypervisor (or qemu-dm) problem. Either the hypervisor shouldn't interpret the access as MMIO, or qemu-dm should be able to handle it. Let's look at vmx_hpw_miss() in upstream.

--- Additional comment from lersek on 2011-12-22 19:04:51 EST ---

I believe we can get the address-to-be-fetched from the current core; the call
in comment 27 should give us the starting point.

    crash> struct -o robust_list_head
    struct robust_list_head {
       [0] struct robust_list list;
       [8] long int futex_offset;
      [16] struct robust_list *list_op_pending;
    }
    SIZE: 24

    crash> struct -o robust_list
    struct robust_list {
      [0] struct robust_list *next;
    }
    SIZE: 8

So, 

(char *)&(curr->robust_list->list.next) == (char *)curr->robust_list + 0 + 0
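
(Double-checking those offsets from userspace, with a mirror of the structs crash printed -- assuming LP64:)

    #include <stddef.h>
    #include <stdio.h>

    struct robust_list { struct robust_list *next; };

    struct robust_list_head {
        struct robust_list list;
        long futex_offset;
        struct robust_list *list_op_pending;
    };

    int main(void)
    {
        /* Prints 0 and 24, matching the crash output above. */
        printf("list.next at %zu, total size %zu\n",
               offsetof(struct robust_list_head, list.next),
               sizeof(struct robust_list_head));
        return 0;
    }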

    crash> set 1030
        PID: 1030
    COMMAND: "Xorg"
       TASK: e00000000d618000  [THREAD_INFO: e00000000d619040]
        CPU: 0
      STATE: TASK_RUNNING (ACTIVE)

    crash> struct task_struct.robust_list e00000000d618000
      robust_list = 0x20000000018ea500

This is the (user) address we try to dereference. However:

    crash> vtop -u 0x20000000018ea500
    VIRTUAL           PHYSICAL        
    20000000018ea500  (not accessible)

(Just to be sure:

    crash> vtop -k 0x20000000018ea500
    VIRTUAL           PHYSICAL        
    20000000018ea500  (not a kernel virtual address)
)

This could be the cause of the fault / trap.

Side note: referring back to the .xen_prstatus ELF section dump in comment 27,
the above value is present in the dump at offset 0x40:

      0x00000040 20000000 018ea500 00000000 00000000 ...............
      
This should correspond to r11. I tried to confirm it by disassembling
exit_robust_list(), but it's Greek to me.

--- Additional comment from lersek on 2011-12-23 07:46:32 EST ---

If the fault is indeed mistakenly classified as needing MMIO, then
emulate_io_inst() shouldn't be called at all. Therefore the ldfp8 instruction
is the result of a mis-parse. The disassembly from "crash" again (full bundle):

0xa0000001000c2160 <exit_robust_list+128>:      [MII]       ld8 r9=[r33]
0xa0000001000c2161 <exit_robust_list+129>:                  nop.i 0x0
0xa0000001000c2162 <exit_robust_list+130>:                  nop.i 0x0;;

crash> px  *(long unsigned *)0xa0000001000c2160
$3 = 0x101842004801

crash> px  *(long unsigned *)0xa0000001000c2168
$4 = 0x4000000000200

$4.$3 == 0004000000000200.0000101842004801

In binary, from left to right:

00000000000001000000000000000000000000000 -- 41 bits, slot 2
00000000000001000000000000000000000000000 -- 41 bits, slot 1
01000000011000010000100000000001001000000 -- 41 bits, slot 0
00001                                     -- 5 bits, template

The template value determines MII (M-Unit for slot 0, I unit for the others),
plus there are no stops between the slots. Slot 1 and slot 2 have identical
contents, corresponding to "nop.i". They probably only pad the bundle to three
slots.

0000 0 000 000001 0 00000000000000000000 000000
 I18 i  x3   x6   y        imm20a          qp

("y" distinguishes between "nop.i" and "hint.i".)

slot 0:

0100 0 000011 00 0 0100001 0000000 0001001 000000
M1/2 m   x6   h. x    r3      r2     r1      qp

opcode: 4              --> M1 or M2
m     : 0
x6    : 0000.11 binary --> ld8 M2
hint  : 00 binary      --> no load hint
x     : 0
r3    : 33 decimal
r1    : 9 decimal

That is, the disassembler in "crash" does not lie.
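
The same decode can be scripted as a cross-check. A small standalone sketch, using the M1 field layout above and the $3 value crash printed (slot 0 and the template live entirely in the low word):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t lo = 0x0000101842004801ULL;    /* $3 above */

        unsigned template = lo & 0x1f;                    /* 00001 */
        uint64_t slot0 = (lo >> 5) & ((1ULL << 41) - 1);

        /* M1 fields, counting from the LSB of the 41-bit slot:
         * qp(6) r1(7) r2(7) r3(7) x(1) hint(2) x6(6) m(1) major(4) */
        unsigned r1    = (slot0 >>  6) & 0x7f;
        unsigned r3    = (slot0 >> 20) & 0x7f;
        unsigned x6    = (slot0 >> 30) & 0x3f;
        unsigned major = (slot0 >> 37) & 0xf;

        /* Prints: template=1 major=4 x6=3 -> ld8 r9=[r33] */
        printf("template=%u major=%u x6=%u -> ld%u r%u=[r%u]\n",
               template, major, x6, 1u << (x6 & 3), r1, r3);
        return 0;
    }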

--- Additional comment from lersek on 2011-12-23 13:17:06 EST ---

emulate_io_inst() does not actually misparse the ld8 instruction as ldfp8 -- it recognizes it correctly. Refer back to the previous comment (comment 40) for the data. The branch is not easy to find because the transfer size (8 bytes) is not spelled out in cleartext; it is expressed as "1 << (x6 & 0x03)".

    // Integer Load/Store
    if(inst.M1.major==4&&inst.M1.m==0&&inst.M1.x==0){
        inst_type = SL_INTEGER;  //
        size=(inst.M1.x6&0x3);
        if((inst.M1.x6>>2)>0xb){      // write
            dir=IOREQ_WRITE;     //write
            vcpu_get_gr_nat(vcpu,inst.M4.r2,&data);
        }else if((inst.M1.x6>>2)<0xb){   //  read
            dir=IOREQ_READ;
        }
    }

[... a bunch of else-if branches, then ...]

    size = 1 << size;
    if(dir==IOREQ_WRITE){
        mmio_access(vcpu, padr, &data, size, ma, dir);
    }else{
        mmio_access(vcpu, padr, &data, size, ma, dir);
        if(inst_type==SL_INTEGER){       //gp
            vcpu_set_gr(vcpu,inst.M1.r1,data,0);
        }else{
            panic_domain(NULL, "Don't support ldfd now !");
        }
    }
    vcpu_increment_iip(vcpu);

--- Additional comment from lersek on 2011-12-23 15:31:51 EST ---

Created attachment 549385 [details]
add debug logging to ia64 fault handler and mmio insn decoder

The output of this patch seems to confirm the above. The last three hv dmesg
lines before the hang are:

(XEN) mmio.c:458:d1 callsite=2 vadr=0x20000000018ea500
                    ^                 ^
                    |                 Xorg.robust_list, see comment 36
                    |
                    2nd emulate_io_inst() call in vmx_hpw_miss()
                    (comment 32, but see here too, with the debug patch:)

(XEN) mmio.c:463:d1 gppa=0xe0cea500 vec=2 d_ppn=0xe0ce8 d_ma=4 d_ps=14

 313  /* We came here because the H/W VHPT walker failed to find an entry */
 314  IA64FAULT
 315  vmx_hpw_miss(u64 vadr , u64 vec, REGS* regs)
 316  {

 330    else if (vec == 2)
 331      type = DSIDE_TLB;

vec == 2, data TLB.

 351    if((data=vtlb_lookup(v, vadr,type))!=0){

vadr is found in the data TLB.

 352      if (v->domain != dom0 && type == DSIDE_TLB) {
 353        if (misr.sp) { /* Refer to SDM Vol2 Table 4-10,4-12 */
 354          if ((data->ma == VA_MATTR_UC) || (data->ma == VA_MATTR_UCE))
 355            return vmx_handle_lds(regs);
 356        }
 357        gppa = (vadr & ((1UL << data->ps) - 1)) +
 358               (data->ppn >> (data->ps - 12) << data->ps);

"gppa" -- guest physical page address -- is computed as follows. The data TLB
entry specifies a page size (d_ps == 14, 16KB). The lower 14 bits are taken
from the virtual address (->0x2500). The low 2 bits in the physical page number
(d_ppn == 0xe0ce8) are cleared, and then it is shifted 12 bits to the left
(->0xe0ce8000). Finally they are combined to gppa == 0xe0cea500.

 359        if (__gpfn_is_io(v->domain, gppa >> PAGE_SHIFT)) {

PAGE_SHIFT is 14 in our configuration, the guest page frame number is 0x3833A.
It is found to be an IO mapped page. I'll have to see why. An interesting
tidbit is that qemu-dm logs:

    inp: bad size: 33a8 8 -- pausing

"33a8" looks awfully similar to 0x3833A -- keep the low 16 bits of the latter
and do a 4-bit ROL in those 16 bits, and you end up with 33A8.

 360          if (misr.sp)
 361            panic_domain(NULL, "ld.s on I/O page not with UC attr."
 362                         " pte=0x%lx\n", data->page_flags);

So misr.sp was false above too (at line 353).

 363          if (data->pl >= ((regs->cr_ipsr >> IA64_PSR_CPL0_BIT) & 3)) {

This is a privilege level check.

 364            scratch.callsite = 2;
 365            scratch.vadr = vadr;
 366            scratch.gppa = gppa;
 367            scratch.vec = vec;
 368            scratch.d_ppn = data->ppn;
 369            scratch.d_ma = data->ma;
 370            scratch.d_ps = data->ps;
 371            emulate_io_inst(v, gppa, data->ma);

(XEN) mmio.c:467:d1 padr=0xe0cea500 ma=4 cr_iip=0xa0000001000c2160 slot=0
                    ^                    ^                         ^
                    gppa                 domU get_user(),          slot in
                                         see comment 27            bundle

      bundle=0x00040000000002000000101842004801
               ^
               the insn bundle containing the ld8 and the two NOPs, see
               comment 40.

"ma" is "memory attribute", 4 means VA_MATTR_UC, uncacheable (see table 4-11).

----o----

"callsite == 1" was never hit. The guest hit "callsite == 2" 2201 times
(including the fatal, last one). The first 2200 occasions (that succeeded) were
all like this:

(XEN) mmio.c:458:d1 callsite=2 vadr=0xc0000000000b....
(XEN) mmio.c:463:d1 gppa=0xb.... vec=2 d_ppn=0xb8 d_ma=4 d_ps=24
(XEN) mmio.c:467:d1 padr=0xb.... ma=4 cr_iip=0xa0000001002e9... slot=0
      bundle=[whatever]

cr_iip always points into __copy_user() in the domU, and the physical address /
physical frame number suggests it's video RAM. Logging depends on the transfer
size being 8, so it's interesting why qemu-dm satisfies the first 2200 read
requests.

... I believe it's because they take a different path in the hypervisor --
mmio_access() special-cases GPFN_LOW_MMIO:

low_mmio_access()
-> hvm_buffered_io_intercept()

static struct hvm_buffered_io_range buffered_stdvga_range = {0xA0000, 0x20000};

All 2200 successful requests fall into this buffered range.

----o----

So, is the problem that "vadr" is found in the virtual data TLB? Or is it that
the retrieved frame number qualifies as (high) MMIO?

The virtual address in Xorg's robust_list definitely shouldn't be recognized as
MMIO. Again, gppa is 0xe0cea500, which is above 3596 MB (decimal). The HVM
guest in question has 1 GB RAM, so I'm inclined to believe the hypervisor will
consider such a physical address as MMIO. Therefore I suspect the data TLB has
a bogus entry. We're doomed.

--- Additional comment from lersek on 2011-12-25 06:20:52 EST ---

Created attachment 549475 [details]
hv debug logging v2

... in ia64 fault handler and mmio insn decoder, mmio_access(),
lookup_domain_mpa()

(In reply to comment #42)

> the guest page frame number is 0x3833A.
> It is found to be an IO mapped page. I'll have to see why. An interesting
> tidbit is that qemu-dm logs:
>
>     inp: bad size: 33a8 8 -- pausing
>
> "33a8" looks awfully similar to 0x3833A -- keep the low 16 bits of the latter
> and do a 4-bit ROL in those 16 bits, and you end up with 33A8.

(XEN) mmio.c:472:d1 callsite=2 vadr=0x20000000018ea500
(XEN) mmio.c:479:d1 gppa=0xe0cea500 vec=2 d_ppn=0xe0ce8 d_ma=4 d_ps=14
      misr_rs=0 vpsr_dt=1 vpsr_rt=1
(XEN) mmio.c:483:d1 padr=0xe0cea500 ma=4 cr_iip=0xa0000001000c2160 slot=0
      bundle=0x00040000000002000000101842004801
(XEN) mmio.c:242:d1 mmio_access(): entry
(XEN) mm.c:712:d1 lookup_domain_mpa(): entry
(XEN) mm.c:718:d1 lookup_domain_mpa(): ptep=f000000181c699d0
(XEN) mm.c:726:d1 lookup_domain_mpa(): pte=0x5010000000000761
(XEN) mm.c:731:d1 lookup_domain_mpa(): pte present
(XEN) mmio.c:248:d1 mmio_access(): iot=0x5000000000000000

vmx_hpw_miss() -- non-phys mode call, comment 42
-> emulate_io_inst() -- for ld8, comment 41
  -> mmio_access()
    -> __gpfn_is_io -- returns GPFN_LEGACY_IO == 0x5000000000000000
    -> legacy_io_access()

legacy_io_access() is the one to turn 0xE0CEA500 into 0x33A8 when composing the
ioreq for qemu-dm:

    #define TO_LEGACY_IO(pa)  (((pa)>>12<<2)|((pa)&0x3))

    p->addr = TO_LEGACY_IO(pa&0x3ffffffUL);

11100000110011101010010100000000 bin = 0xE0CEA500
      ^^^^^^^^^^^^^^          ^^
      00110011101010          00 bin = 0x33A8
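
Plugging the numbers in confirms the mangling; a trivial standalone check:

    #include <stdint.h>
    #include <stdio.h>

    #define TO_LEGACY_IO(pa)  (((pa)>>12<<2)|((pa)&0x3))

    int main(void)
    {
        uint64_t gppa = 0xe0cea500ULL;
        /* Only the low 26 bits take part, as in the hypervisor call
         * site above. Prints 0x33a8 -- the "port" qemu-dm choked on. */
        printf("0x%llx\n",
               (unsigned long long)TO_LEGACY_IO(gppa & 0x3ffffffUL));
        return 0;
    }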

--- Additional comment from lersek on 2011-12-26 06:17:44 EST ---

Created attachment 549579 [details]
hv debug logging v3

debug patch:
- ia64 fault handler and mmio insn decoder
- mmio_access(), lookup_domain_mpa()
- vmx_vcpu_thash() -- log "vhpt_adr" output
- guest_vhpt_lookup()

(In reply to comment #42)

>  351    if((data=vtlb_lookup(v, vadr,type))!=0){
>
> vadr is found in the data TLB.
>
>  352      if (v->domain != dom0 && type == DSIDE_TLB) {
>  353        if (misr.sp) { /* Refer to SDM Vol2 Table 4-10,4-12 */
>  354          if ((data->ma == VA_MATTR_UC) || (data->ma == VA_MATTR_UCE))
>  355            return vmx_handle_lds(regs);
>  356        }
>  357        gppa = (vadr & ((1UL << data->ps) - 1)) +
>  358               (data->ppn >> (data->ps - 12) << data->ps);
>
> "gppa" -- guest physical page address -- is computed as follows. The data TLB
> entry specifies a page size (d_ps == 14, 16KB). The lower 14 bits are taken
> from the virtual address (->0x2500). The low 2 bits in the physical page
> number (d_ppn == 0xe0ce8) are cleared, and then it is shifted 12 bits to the
> left (->0xe0ce8000). Finally they are combined to gppa == 0xe0cea500.

(In reply to comment #44)

> I may have to catch all references to the faulting vaddr (Xorg's robust_list)
> in the virtual TLB.

So, why does vtlb_lookup() return 0xe0ce8 for the PPN? This patch helps track
that back a bit more.

There are three mentions of guest iip=0xa0000001000c2160 in the log (which is
the problematic ld8 of *robust_list, virtual address 0x20000000018ea500).
There's a big gap between the 1st and (2nd+3rd), the latter two are adjacent.

The first DTLB miss (translation failure):

(XEN) vtlb.c:275:d1 guest_vhpt_lookup(): data=f000000180274100
(XEN) vmx_process.c:444:d1 vmx_hpw_miss(): vhpt_adr=0x3ffe0000000031d0
(XEN) vtlb.c:555:d1 thash_purge_and_insert(): pte=0x1000000e3181a1
      itir=0x3ab938 ifa=0x20000000018ea500 type=1 iip=0xa0000001000c2160
      caller=vmx_hpw_miss

vmx_hpw_miss() is called for a DTLB miss.
- We're not in physical mode,
- the vadr is not found in the vtlb,
- we advance through vmx_hpw_miss() until we reach the following:

        vmx_vcpu_thash(v, vadr, &vhpt_adr);

The virtual address is hashed to vhpt_addr (= 0x3ffe0000000031d0).

        if (!guest_vhpt_lookup(vhpt_adr, &pteval)) {

It is looked up in the guest VHPT, and found; a PTE is returned, containing a
PPN (here 0xe318, corresponding to about 227 MB),

            /* VHPT successfully read.  */
            if (!(pteval & _PAGE_P)) {
                if (vpsr.ic) {
                    vcpu_set_isr(v, misr.val);
                    dtlb_fault(v, vadr);
                    return IA64_FAULT;
                } else {
                    nested_dtlb(v);
                    return IA64_FAULT;
                }
            } else if ((pteval & _PAGE_MA_MASK) != _PAGE_MA_ST) {
                vcpu_get_rr(v, vadr, &rr);
                itir = rr & (RR_RID_MASK | RR_PS_MASK);

                if (0x20000000018ea500 == vadr) {
                    gdprintk(XENLOG_DEBUG, "vmx_hpw_miss(): vhpt_adr=0x%lx\n",
                             vhpt_adr);
                }

                thash_purge_and_insert(v, pteval, itir, vadr, DSIDE_TLB,
                                       __FUNCTION__);
                return IA64_NO_FAULT;

and then inserted in the vtlb.

I think what happens is basically: translation fails at first, we copy it from
the guest VHPT to the TLB, and then we make progress. All good.

The second & third misses are our problem -- actually I think the same guest
instruction faults twice in quick succession. Of those, the first happens just
like above:

(XEN) vtlb.c:275:d1 guest_vhpt_lookup(): data=f000000180276100
(XEN) vmx_process.c:444:d1 vmx_hpw_miss(): vhpt_adr=0x3ffe0000000031d0
(XEN) vtlb.c:555:d1 thash_purge_and_insert(): pte=0x100000e0ce85b1
      itir=0x3bc138 ifa=0x20000000018ea500 type=1 iip=0xa0000001000c2160
      caller=vmx_hpw_miss

The vadr that fails to translate is the same (0x20000000018ea500, contents of
robust_list), at the same domU instruction. The vadr hashes to the same
vhpt_adr (0x3ffe0000000031d0), and it is again found in the guest VHPT, but
with different contents: the PTE returned this time specifies PPN 0xe0ce8,
which is what we copy in the VTLB.

Then the fault kicks in again immediately (probably), and we retrieve the PPN
from the VTLB we just copied over from the guest VHPT:

(XEN) mmio.c:472:d1 callsite=2 vadr=0x20000000018ea500
(XEN) mmio.c:479:d1 gppa=0xe0cea500 vec=2 d_ppn=0xe0ce8 d_ma=4 d_ps=14
      misr_rs=0 vpsr_dt=1 vpsr_rt=1
(XEN) mmio.c:483:d1 padr=0xe0cea500 ma=4 cr_iip=0xa0000001000c2160 slot=0
      bundle=0x00040000000002000000101842004801
(XEN) mmio.c:242:d1 mmio_access(): entry
(XEN) mm.c:712:d1 lookup_domain_mpa(): entry
(XEN) mm.c:718:d1 lookup_domain_mpa(): ptep=f000000181c699d0
(XEN) mm.c:726:d1 lookup_domain_mpa(): pte=0x5010000000000761
(XEN) mm.c:731:d1 lookup_domain_mpa(): pte present
(XEN) mmio.c:248:d1 mmio_access(): iot=0x5000000000000000

and we decide the PPN is IO-mapped.

In short, the source of the problematic PPN is the guest VHPT; it's there where
we copy the PPN from, into the VTLB. We retrieve the PPN from the VTLB finally,
and it causes us to compose a bogus ioreq for qemu-dm.

I believe this code implements (virtualizes) the algorithm described in chapter
"4.1 Virtual Addressing" of the Itanium SDM vol2.

I backported the fixed assembly from c/s 17086 to guest_vhpt_lookup(), assuming
guest_vhpt_lookup() returned a garbage PTE (and so PPN) because of the messed
up gcc assembly template. Unfortunately, the backport didn't help at all, it
seems the guest VHPT indeed specifies the bad PPN.

An interesting point is the difference in the VCPU's ITIR ("Interruption TLB
Insertion Register"), which is basically an input param to the vmx_hpw_miss()
handler.

first (valid) thash insertion case:    itir=0x3ab938 
second (bad) thash insertion case:     itir=0x3bc138 

The difference is in the subsequence called "key" (bits 8 to 31). "On an
instruction or data translation fault, this field is set to the accessed Region
Identifier (RR.rid)" (Table 3-8.)

Why on earth does the VCPU think it is accessing a different Region? The vadr
is the same, so the Virtual Region Number (the most significant 3 bits in the
vadr) is the same, selecting the same (0th) region register. So, the contents
of that region register must have changed. In that case, however, the hash of
the virtual address (vhpt_adr, computed by vmx_vcpu_thash(), see above) should
have changed as well, because the region ID is an input to the hash as well.
However, vhpt_adr is identical (0x3ffe0000000031d0).

--- Additional comment from lersek on 2011-12-26 14:30:23 EST ---

I'm stuck figuring out where the guest VHPT is populated.

This could be the data flow:

guest VHPT ---> vtlb ---> HVM domU mm_struct  ---> legacy MMIO
            ^         ^   (GPFN to mach. PTE)
            |         |
           1st       2nd
       translation transl.
           fault    fault

The guest page frame number 0xe0ce8 comes from the guest VHPT. I'm unable to find what writes it there (ie. more to the left on the diagram).

--- Additional comment from lersek on 2011-12-26 16:12:01 EST ---

Re-reading comment 36 (the userspace virtual address is not even resolvable in the guest!), this could be a "stale guest VHPT" problem. If the 1st translation fault didn't find the virtual address's hash in the guest VHPT, we wouldn't even reach the 2nd translation: vmx_hpw_miss() would return IA64_FAULT, and the get_user() call in the guest would complete.

I think when the virtual address 0x20000000018ea500 (= robust_list) is unmapped, the guest VHPT is not updated (flushed?) quickly enough. guest_vhpt_lookup() should return failure for the vadr's hashed value.

--- Additional comment from lersek on 2011-12-28 11:08:03 EST ---

I think we're making progress!

(1) I changed the guest kernel like this:

--------v--------
diff --git a/kernel/futex.c b/kernel/futex.c
index e4aa3d5..3a7d52b 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1992,7 +1992,8 @@ static inline int fetch_robust_entry(struct robust_list __user **entry,
 
 #ifdef CONFIG_IA64
        ia64_ptr(2, head, 3);
-       ia64_srlz_d();
+       ia64_ptcga(head, 3 << 2);
+       ia64_srlz_i();
 #endif
        if (get_user(uentry, (unsigned long *)head))
                return -EFAULT;
--------^--------

Inserting ptc.ga (purge global translation cache) before attempting to fetch
*robust_list had the desired effect I wrote about in comment 49:

(In reply to comment #49)

> If the 1st translation fault didn't find the virtual address's hash in the
> guest VHPT, we wouldn't even reach the 2nd translation: vmx_hpw_miss() would
> return IA64_FAULT

ptc.ga is trapped by the hypervisor. It is emulated by:

vmx_emulate() [arch/ia64/vmx/vmx_virt.c]
-> vmx_emul_ptc_ga()
  -> vmx_vcpu_ptc_ga() [arch/ia64/vmx/vmmu.c]
    -> vmx_vcpu_ptc_l()
      -> thash_purge_entries() [arch/ia64/vmx/vtlb.c]
        -> vhpt_purge()

which indeed kills off the virtual address from the guest VHPT: when the guest
kernel tries to dereference the vaddr, the subsequent vmx_hpw_miss() in the HV
fails to look it up in either the VTLB or the VHPT -- guest_vhpt_lookup():

(XEN) vtlb.c:275:d13 guest_vhpt_lookup(): data=0000000000000000
(XEN) vtlb.c:281:d13 guest_vhpt_lookup(): data2=0000000000000000

and vmx_hpw_miss() ends up calling dvhpt_fault() --> _vhpt_fault() in
"arch/ia64/vmx/vmx_interrupt.c". The latter function sets up IFA (= faulting
vadr), ITIR, IHA, and injects an IA64_VHPT_TRANS_VECTOR guest interruption:

Hypervisor: [include/asm-ia64/ia64_int.h]

    #define	IA64_VHPT_TRANS_VECTOR			0x0000

domU kernel: [arch/ia64/kernel/ivt.S]

    // 0x0000 Entry 0 (size 64 bundles) VHPT Translation (8,20,47)
    ENTRY(vhpt_miss)

The vhpt_miss() assembly-language domU function executes an "itc.d r18"
instruction -- "insert translation cache". This traps back to the hypervisor,
and the hypervisor inserts the PTE (coming from the domU!) in the VTLB:

(XEN) mm.c:712:d13 lookup_domain_mpa(): entry
(XEN) mm.c:718:d13 lookup_domain_mpa(): ptep=f000001368dc19d0
(XEN) mm.c:726:d13 lookup_domain_mpa(): pte=0x5010000000000761
(XEN) mm.c:731:d13 lookup_domain_mpa(): pte present
(XEN) vtlb.c:555:d13 thash_purge_and_insert(): pte=0x10100000e0ce85b1
      itir=0x3c7138 ifa=0x20000000018ea500 type=1 iip=0xa000000100000130
      caller=vmx_vcpu_itc_d

(iip=0xa000000100000130 points to the itc.d in vhpt_miss() in my test kernel;
I verified it with objdump -D.)

vmx_emulate() [arch/ia64/vmx/vmx_virt.c]
-> vmx_emul_itc_d()
  -> vmx_vcpu_itc_d() [arch/ia64/vmx/vmmu.c]
    -> thash_purge_and_insert() [arch/ia64/vmx/vtlb.c]

Then the *robust_list read is retried, and resolved from the VTLB as before.

(XEN) mm.c:712:d13 lookup_domain_mpa(): entry
(XEN) mm.c:718:d13 lookup_domain_mpa(): ptep=f000001368dc19d0
(XEN) mm.c:726:d13 lookup_domain_mpa(): pte=0x5010000000000761
(XEN) mm.c:731:d13 lookup_domain_mpa(): pte present
(XEN) mmio.c:472:d13 callsite=2 vadr=0x20000000018ea500
(XEN) mmio.c:479:d13 gppa=0xe0cea500 vec=2 d_ppn=0xe0ce8 d_ma=4 d_ps=14
      misr_rs=0 vpsr_dt=1 vpsr_rt=1
(XEN) mmio.c:483:d13 padr=0xe0cea500 ma=4 cr_iip=0xa0000001000c21a0 slot=0
      bundle=0x00040000000002000000101842004801
(XEN) mmio.c:242:d13 mmio_access(): entry
(XEN) mm.c:712:d13 lookup_domain_mpa(): entry
(XEN) mm.c:718:d13 lookup_domain_mpa(): ptep=f000001368dc19d0
(XEN) mm.c:726:d13 lookup_domain_mpa(): pte=0x5010000000000761
(XEN) mm.c:731:d13 lookup_domain_mpa(): pte present
(XEN) mmio.c:248:d13 mmio_access(): iot=0x5000000000000000


Summary:
--------

(0) the "robust_list" userspace address is *not* in the VTLB, in either case.

(1) normally, it is in the guest VHPT, and on access it is copied from the
gVHPT to the VTLB, and resolved from there.

(2) when I force the vaddr out of the gVHPT, the miss goes "deeper", ie. back
to the domU, and this time the guest kernel actively reinserts the same bogus 
pseudo-physical page number into the VTLB, originating from the *domU PTE* for
the virtual address. Then on access the address is resolved from the VTLB as
before.

It very much seems like a guest kernel problem.

--- Additional comment from lersek on 2011-12-28 14:53:22 EST ---

Created attachment 549849 [details]
walk_vadr() -- walk robust_list through the pagetable (domU kernel)

Getting close.

I scavenged mapped_kernel_page_is_present() [arch/ia64/mm/fault.c] and wrote
walk_vadr(). Each time sys_set_robust_list() is called or fetch_robust_entry()
is called, the patch walks the pagetables and prints the PTE. This can be
correlated with the hypervisor debug log.

(Please excuse the stupid " %16s" format specifier, I meant "%.16s" -- I didn't
want to specify the field width (padding) but the precision for %s: the
precision puts a limit on the number of printed bytes. I wasn't sure if
task_struct.comm is NUL-terminated or not. Thankfully it is.)
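
(The distinction in one line: a width only pads, while a precision caps the number of bytes read, which is what makes it safe for a possibly unterminated buffer:)

    #include <stdio.h>

    int main(void)
    {
        char comm[4] = { 'X', 'o', 'r', 'g' };  /* no trailing NUL */
        printf("%.4s\n", comm);   /* safe: reads at most 4 bytes   */
        /* printf("%4s\n", comm); would keep reading until a NUL   */
        return 0;
    }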

Reproducing the problem again, this is the end of the hv log (see comment 52
for more):

(XEN) mmio.c:472:d21 callsite=2 vadr=0x20000000018ea500
(XEN) mmio.c:479:d21 gppa=0xe0cea500 vec=2 d_ppn=0xe0ce8 d_ma=4 d_ps=14
      misr_rs=0 vpsr_dt=1 vpsr_rt=1
(XEN) mmio.c:483:d21 padr=0xe0cea500 ma=4 cr_iip=0xa0000001000c2860 slot=0
      bundle=0x00040000000002000000101848004801

vadr=0x20000000018ea500 is the usual contents of the robust_list pointer, gppa
has the usual offender PPN (coming from the usual bad guest PTE).
cr_iip=0xa0000001000c2860 corresponds to the get_user() call in
fetch_robust_entry(), right after the walk_vadr() invocation this patch
introduced there.

So, what does the guest kernel log? A bunch of sys_set_robust_list() and
fetch_robust_entry() lines. The revelation comes when we

- grab the last line, printed right before the problematic get_user() call (see
cr_iip above), and

- check all other lines for the same PID, in order to see the original
sys_set_robust_list() for the same process:

walk_vadr(): rhgb[1084]: sys_set_robust_list(): vadr=0x20000000018ea500
    pgdp=e00000000dd10800 pgd=0xd8d4000
    pudp=e00000000dd10800 pud=0xd8d4000
    pmdp=e00000000d8d4000 pmd=0x3ce2c000
    ptep=e00000003ce2f1d0 pte=0x1000000e7485e1

walk_vadr(): Xorg[1084]: fetch_robust_entry(): vadr=0x20000000018ea500
    pgdp=e00000003c40c800 pgd=0x1e5c000
    pudp=e00000003c40c800 pud=0x1e5c000
    pmdp=e000000001e5c000 pmd=0x3cbf4000
    ptep=e00000003cbf71d0 pte=0x100000e0ce85b1

The task's name (task_struct.comm) has changed! Although PIDs are recycled,
during such a short time it must have been an exec(). The vadr is the same
(consistent with the hypervisor), and compare the guest PTE (at Xorg exit time)
against the bad PPN in the hypervisor log:

      pte=0x100000e0ce85b1
    d_ppn=0x      e0ce8

So, rhgb[1084] is *forked* by something (let's call it "rhgb^"), then
rhgb[1084] *executes* Xorg[1084], without resetting the "robust_list" pointer
in the task struct. When Xorg (a completely independent process image) exits,
the same userspace virtual address happens to be mapped in it, but the backing
PTE is completely different.

      rhgb^    -- sets robust_list

        |
        |
      fork()   -- *used* to clear robust_list, but not after the glibc patch
        |         for bug 711531.
        |
        V

    rhgb[1084]

        |
        |
      exec()   -- *never* clears robust_list
        |
        |
        V

    Xorg[1084]

The patch for bug 711531 completed the inheritance chain from rhgb^ to
Xorg[1084]. It must be broken again by fixing exec(). Candidate patch should
follow soon.
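
For illustration, a hypothetical userspace demonstration of that inheritance chain, driving the set_robust_list()/get_robust_list() syscalls directly -- the program registers a list head, re-executes itself, and asks the kernel what survived the execve():

    #include <linux/futex.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static struct robust_list_head head;

    int main(int argc, char **argv)
    {
        if (argc > 1 && strcmp(argv[1], "after-exec") == 0) {
            struct robust_list_head *h = NULL;
            size_t len = 0;
            syscall(SYS_get_robust_list, 0, &h, &len);
            /* Unfixed kernels print the stale pre-exec pointer here;
             * with the fix, robust_list is NULL after execve(). */
            printf("robust_list after exec: %p\n", (void *)h);
            return 0;
        }
        head.list.next = &head.list;
        syscall(SYS_set_robust_list, &head, sizeof head);
        execl("/proc/self/exe", argv[0], "after-exec", (char *)NULL);
        perror("execl");
        return 1;
    }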

--- Additional comment from lersek on 2012-01-02 11:21:27 EST ---

Created attachment 550268 [details]
Move "exit_robust_list" into mm_release(); nullify lists after cleanup (v2)

This is a backport of upstream commits 8141c7f3 & fc6b177d:

    We don't want to get rid of the futexes just at exit() time, we want
    to drop them when doing an execve() too, since that gets rid of the
    previous VM image too.

    Doing it at mm_release() time means that we automatically always do it
    when we disassociate a VM map from the task.

    The robust list pointers of user space held futexes are kept intact
    over an exec() call. When the exec'ed task exits exit_robust_list() is
    called with the stale pointer. The risk of corruption is minimal, but
    still it is incorrect to keep the pointers valid. Actually glibc
    should uninstall the robust list before calling exec() but we have to
    deal with it anyway.

    Nullify the pointers after [compat_]exit_robust_list() has been
    called.

The fault caused by the stale robust_list pointer was spuriously resolved
to MMIO in Xen HVM guests on IA64.

Inclusion of <linux/compat.h> in kernel/fork.c depends on !__GENKSYMS__ in
order to avoid kABI breakage.

----o----

--- Additional comment from lersek on 2012-01-06 04:17:33 EST ---

(In reply to comment #71)
> (In reply to comment #68)
>> (In reply to comment #63)
>>> ...
>>> Can you please also test the kernel as described in bug 711531
>>> comment 0? (It doesn't matter if it's a guest or the host, or even bare
>>> metal.)

> Is this supposed to be tested with a privileged user?

It should make no difference.

I think the test program checks if robust mutexen are indeed robust <http://www.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_getrobust.html>. Basically, a mutex can be shared by threads in the same process, or by threads in different processes (process-shared mutex).

If the mutex is process shared and robust, and thread T1 in process P1 locks it, and then process P1 exits or executes another image, then the next thread T2 in process P2 trying to acquire the mutex will get EOWNERDEAD. (Other contenders will block on the mutex). Then T2 can clean up the shared state, mark the mutex consistent, and unlock the mutex. This way global progress can be made.

If it's not the entire process P1 to exit, just thread T1 (which also enables the single-process case, ie. threads T1 and T2 are in the same process P1), and T1 exits while holding the mutex (returns from the thread start function or calls pthread_exit()), then returning EOWNERDEAD to T2 in pthread_mutex_lock() is optional -- preventing the deadlock in this case is not mandatory for the system.

The test program checks four cases. There are two dimensions and all variations are tested. The first dimension is the above, ie. whether T1 and T2 share P1, or belong to different processes (via fork()). Linux seems to provide (otherwise not mandated, but allowed) EOWNERDEAD even for threads sharing the same process.

The second dimension is whether the test program tries to "patch up" glibc dynamically, ie. if it calls the set_robust_list() syscall manually right after fork(). The test program uses pthread_atfork() for this <http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_atfork.html>, it copies the robust_list pointer from the parent to the child if glibc doesn't take care of that itself. The test program was originally a reproducer for the glibc bug -- when it says in case #4: "forks with NO post-fork set_robust_list -- hangs (no owner-death cleanup)", it demonstrates the missing set_robust_list() call after fork() in glibc.

With the glibc patch applied this dimension should make no difference, all four test cases and the entire test run should complete.
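
For flavor, a minimal sketch of the first dimension -- a process-shared robust mutex whose owner dies while holding it. It uses the POSIX names glibc adopted later (2.12+); the RHEL 5-era functions carried an _np suffix:

    #include <errno.h>
    #include <pthread.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* The mutex lives in memory shared across fork(). */
        pthread_mutex_t *m = mmap(NULL, sizeof *m,
                                  PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (m == MAP_FAILED)
            return 1;

        pthread_mutexattr_t a;
        pthread_mutexattr_init(&a);
        pthread_mutexattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
        pthread_mutexattr_setrobust(&a, PTHREAD_MUTEX_ROBUST);
        pthread_mutex_init(m, &a);

        if (fork() == 0) {            /* "T1" in process P1 */
            pthread_mutex_lock(m);
            _exit(0);                 /* dies holding the lock */
        }
        wait(NULL);

        int rc = pthread_mutex_lock(m);    /* "T2" in process P2 */
        if (rc == EOWNERDEAD)              /* owner died: clean up, */
            pthread_mutex_consistent(m);   /* mark consistent,      */
        pthread_mutex_unlock(m);           /* and unlock as usual   */
        return rc == EOWNERDEAD ? 0 : 1;   /* hangs on broken glibc */
    }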

I also executed the test program (in the HVM Xen guest that I used to debug the bug), on the brew kernel from comment 61 (_gks2), after the guest booted successfully into runlevel 5. The test program exited successfully (which matches Qixiang's results in comment 68), so I think we can consider this verified.


--- Additional comment from lersek on 2012-02-01 11:04:29 EST ---

We tracked down where gppa=0xe0cea500 comes from: why is that
(pseudo-)physical page mapped into Xorg at all?

Xorg contains a bunch of ioperm() calls
[hw/xfree86/os-support/linux/lnx_video.c]:
- xf86EnableIO()
- xf86DisableIO()
- xf86DisableInterrupts()
- xf86EnableInterrupts()

xf86EnableIO() is the most probable caller.

ioperm() is a libc function which depends on appropriate privileges (Petr
tested its EPERM condition in the IA64 HVM guest). On IA64, ioperm() is
implemented like this [sysdeps/unix/sysv/linux/ia64/ioperm.c] -- see the
sketch after this list:

- "get I/O base physical address from ar.k0 as per PRM" --> phys_io_base,
- open /dev/mem --> this gives access to (pseudo-)physical memory,
  if the process has appropriate privileges,
- get a length from "io_offset(MAX_PORT)" --> returns 64 MB (see also
  chapter 10.7 "I/O Port Space Model" in the Itanium(R) SDM rev 2.3),
- mmap offset range [phys_io_base, phys_io_base + len) from /dev/mem,
- close the file descriptor.
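
A rough sketch of the mapping step (error handling and the ar.k0 read elided; io_offset() here just encodes the 1 KB-per-port spacing mentioned below):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MAX_PORT      0x10000
    #define io_offset(p)  ((unsigned long)(p) << 10)  /* 1 KB/port */

    static void *map_io_ports(unsigned long phys_io_base)
    {
        int fd = open("/dev/mem", O_RDWR);
        if (fd < 0)
            return MAP_FAILED;
        /* 64 MB window over the platform's I/O port space. */
        void *base = mmap(NULL, io_offset(MAX_PORT),
                          PROT_READ | PROT_WRITE, MAP_SHARED,
                          fd, (off_t)phys_io_base);
        close(fd);    /* the mapping outlives the descriptor */
        return base;
    }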

(Should Xorg execute another image with such a mapping: memory mappings are
dropped on execve(), and the fd is closed early enough. Xorg is not
multi-threaded, so the "usual" FD_CLOEXEC (or close()) race against other
threads shouldn't apply.)

The mappings of the Xorg process (from /proc/PID/maps, some fields omitted):
- virtual address range: 2000000000c00000-2000000004c00000
- consequently, size: 64 MB
- file: /dev/mem
- start offset in file: e0000000

From /proc/iomem (in the guest):
  e0000000-e033dcf7 : PCI Bus 0000:00 I/O Ports 00000000-00000cf7
  e0340d00-e3ffffff : PCI Bus 0000:00 I/O Ports 00000d00-0000ffff

Also 64 MB (e0000000 .. e3ffffff). Plus, 0x400 bytes (= 1 KB) are allocated
for each port.

Again, gppa=0xe0cea500, falling in the second iomem range. Xorg must have
attempted to access the following port:

     (gppa - iomem_range_base) / bytes_per_port + port_base
  == (0xe0cea500 - 0xe0340d00) / 0x400 + 0xD00
  == 0x9A9800 / 0x400 + 0xD00
  == 0x26A6 + 0xD00
  == 0x33A6

Which is quite close to the 0x33A8 described in comment 45. There's some
"non-linearity" between the two port ranges above; if we rebase the formula
onto the first range:

     (0xe0cea500 - 0xe0000000) / 0x400 + 0x0 ~= 0x33A9

In summary,
- itc.d and its dependencies are kernel-space only in the guest,
- qemu-dm's reaction to an ill-sized MMIO request could perhaps be hardened,
  but that needs more analysis,
- Xorg had the gppa mapped because that's how ioperm() works on IA64.

Comment 5 errata-xmlrpc 2012-02-09 16:41:53 UTC
This issue has been addressed in the following products:

  Red Hat Enterprise Linux 5

Via RHSA-2012:0107 https://rhn.redhat.com/errata/RHSA-2012-0107.html

Comment 7 Eugene Teo (Security Response) 2012-02-15 05:22:27 UTC
Statement:

This issue did not affect the Linux kernel as shipped with Red Hat Enterprise Linux 4 as it did not have support for robust futexes. It did not affect Red Hat Enterprise Linux 6 and Red Hat Enterprise MRG as they have the backported fixes. This has been addressed in Red Hat Enterprise Linux 5 via https://rhn.redhat.com/errata/RHSA-2012-0107.html.

Comment 9 errata-xmlrpc 2012-03-06 17:44:00 UTC
This issue has been addressed in the following products:

  Red Hat Enterprise Linux 5.6 EUS - Server Only

Via RHSA-2012:0358 https://rhn.redhat.com/errata/RHSA-2012-0358.html