Bug 633940

Summary: Booting SMP kernel on single cpu, unhandled user address faults in futex lock,cmpxchg
Product: Red Hat Enterprise Linux 5 Reporter: David Bein <d.bein>
Component: kernel Assignee: Red Hat Kernel Manager <kernel-mgr>
Status: CLOSED DUPLICATE QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: low    
Version: 5.4   
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-02-27 09:46:02 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
- Adds an alternate exception lookup table for UP case. Safe for hotplug cpu. (Flags: none)
- Defer calls to free_init_pages() until alternative exception table is created. (Flags: none)
- A different approach based on 2.6.23 -> 2.6.27 (maybe later) (Flags: none)

Description David Bein 2010-09-14 17:46:28 UTC
Description of problem:

When an SMP RH5 kernel boots on a single cpu, code runs that replaces
the "lock",<operation> sequences with "nop",<operation>. In general this
works fine, and it is meant to improve performance by avoiding the
expense of locked bus operations when there is no other cpu that
requires the operation to be interlocked.
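
As a rough illustration of the replacement described above, the following Python sketch models kernel text as a byte array and patches recorded lock prefixes (0xf0) to single-byte nops (0x90). The function name and the lock_offsets list are hypothetical stand-ins for the kernel's recorded lock-prefix list, not the actual kernel code.

```python
LOCK_PREFIX = 0xF0  # x86 "lock" prefix byte
NOP = 0x90          # x86 single-byte nop


def patch_locks_for_up(text, lock_offsets):
    """Replace each recorded lock prefix with a nop; return how many were patched."""
    patched = 0
    for off in lock_offsets:
        if text[off] == LOCK_PREFIX:
            text[off] = NOP
            patched += 1
    return patched


# Toy "kernel text": lock cmpxchg %ecx,(%rdx) encodes as f0 0f b1 0a,
# followed here by a ret (c3).
text = bytearray([0xF0, 0x0F, 0xB1, 0x0A, 0xC3])
lock_offsets = [0]  # hypothetical: offset of the one recorded lock prefix
count = patch_locks_for_up(text, lock_offsets)
```

Note that the instruction length does not change: the prefix byte simply becomes a separate one-byte instruction in front of the cmpxchg.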

However, the futex code [specifically the PI futex code] does atomic
handoffs on user addresses, complete with exception handling in case
the user page is not resident and needs to be faulted in to complete
the operation. In general this works because there is a preceding
get_user(uval, uaddr) [before locking down mmap_sem] to acquire the
initial value of the futex. If, however, acquiring the mmap_sem for read
blocks, it is possible that the page containing the futex word will fault,
and if so, the fixup table addresses are off-by-1 if the UP lock => nop
replacement has been done. In short, the faulting address is the recorded
exception address+1, so the exception table lookup never finds it, and the
kernel does not handle the fault and OOPSes on the page fault (as if the
kernel were doing something bad).
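
The off-by-1 can be sketched with a small Python model of an exact-match exception table lookup. The addresses and the function name below are made up for illustration; only the shape of the lookup (sorted table, exact match on the faulting instruction address) is taken from the description above.

```python
import bisect


def search_exception_table(table, fault_ip):
    """table: list of (insn_addr, fixup_addr) sorted by insn_addr.

    Returns the fixup address on an exact match, otherwise None.
    """
    addrs = [insn for insn, _ in table]
    i = bisect.bisect_left(addrs, fault_ip)
    if i < len(table) and addrs[i] == fault_ip:
        return table[i][1]
    return None


LOCK_ADDR = 0x1000  # hypothetical address of the lock prefix byte
FIXUP = 0x2000      # hypothetical fixup handler address
table = [(LOCK_ADDR, FIXUP)]

# SMP: "lock" is a prefix of cmpxchg, so the instruction starts at LOCK_ADDR.
smp_result = search_exception_table(table, LOCK_ADDR)

# UP after patching: the nop is its own instruction, so the cmpxchg that
# faults starts one byte later and the lookup misses.
up_result = search_exception_table(table, LOCK_ADDR + 1)
```

In this model the SMP lookup returns the fixup address, while the UP lookup returns None, which corresponds to the unhandled fault.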

Here is an example of what that looks like:

<1>Unable to handle kernel paging request at 000000000bdd99b8 RIP: 
<1> [<ffffffff80060b93>] do_futex+0x1333/0x1490
<4>PGD 346146067 PUD 32bbcc067 PMD 341c73067 PTE 800000032e5a2065
<0>Oops: 0003 [1] SMP 
<1>last sysfs file: /class/scsi_host/host0/proc_name

[0]kdb> rd
     r15 = 0x0000000000000000      r14 = 0x000000000bdd99b8 
     r13 = 0x000000008000591d      r12 = 0x0000000000000000 
     rbp = 0xffffffff8060fc78      rbx = 0x0000000000000000 
     r11 = 0x0000000000000000      r10 = 0x00000000ffffffff 
      r9 = 0x0000000000000015       r8 = 0xffff81032e5ca000 
     rax = 0x000000008000591d      rcx = 0x0000000000000000 
     rdx = 0x00000000fffffff2      rsi = 0x00000000fdfbd800 
     rdi = 0xffffffff8060fc70 orig_rax = 0xffffffffffffffff 
     rip = 0xffffffff80060b93       cs = 0x0000000000000010 
  eflags = 0x0000000000210246      rsp = 0xffff81032e5cbc90 
      ss = 0x0000000000000000 &regs = 0xffff81032e5cbbf8
[0]kdb> bt
Stack traceback for pid 22813
0xffff81032e5af100    22813    22675  1    0   R  0xffff81032e5af390 *process
rsp                rip                Function (args)
0xffff81032e5cbc78 0xffffffff80060b93 do_futex+0x1333
0xffff81032e5cbca0 0xffffffff80060a14 do_futex+0x11b4
0xffff81032e5cbce0 0xffffffff800797be unlock_page+0x2e
0xffff81032e5cbd30 0xffffffff8007cb1b filemap_nopage+0x19b
0xffff81032e5cbe38 0xffffffff8003cab0 default_wake_function
0xffff81032e5cbe70 0xffffffff8002e607 do_page_fault+0x4b7
0xffff81032e5cbe80 0xffffffff8033d0d8 thread_return+0x62
0xffff81032e5cbec0 0xffffffff8000f4c3 do_gettimeofday+0x43
0xffff81032e5cbf10 0xffffffff80061349 compat_sys_futex+0x119
0xffff81032e5cbf80 0xffffffff80030c72 ia32_sysret


Version-Release number of selected component (if applicable): Any RHEL5 kernel.


How reproducible:

It is non-trivial to reproduce this, but it should be clear that
futex_atomic_cmpxchg_inatomic() records the exception address at the
start of the "lock","cmpxchg ..." sequence, and AFAIK as long as the
"lock" prefix is there, a fault is reported at that saved exception
address. If the "lock" is replaced with a "nop", then the fault
address is the address of the cmpxchg instruction, and the exception
lookup will fail if it faults [which is the hard part]. I suspect
that a thread doing something to the address space of the process
that requires the write lock on mmap_sem is the culprit. All that has to
happen is for the page to fault when the atomic update of the futex
takes place, and the kernel will OOPS [as above].


Additional info:

I am working on a piece of code that for i386/x86_64 would adjust the
exception table addresses dynamically [after they have been sorted].
Alternatively, it would be easy to augment the exception table lookup
with another table that is computed from the intersection of the nop
addresses and the exception table for the kernel. That would add slight
overhead to exception lookup, but only when the main exception table
lookup fails. We are talking about < 16 additional exception table
entries if we augment the search (after failing the main kernel and
module lookup).
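
A minimal Python sketch of the augmented lookup idea follows. It assumes the exception table can be modeled as a mapping from instruction address to fixup address and that the lock-prefix addresses are available as a list; all names and addresses are hypothetical and this is not the attached patch.

```python
def build_up_alt_table(extable, lock_addrs):
    """Map (lock address + 1) -> fixup for every lock site with an exception entry."""
    return {addr + 1: extable[addr] for addr in lock_addrs if addr in extable}


def find_fixup(extable, alt_table, fault_ip):
    """Try the main table first; consult the alternate table only on a miss."""
    fixup = extable.get(fault_ip)
    if fixup is None:
        fixup = alt_table.get(fault_ip)
    return fixup


# Hypothetical data: three exception entries, three lock sites, two in common.
extable = {0x1000: 0x9000, 0x1040: 0x9010, 0x1100: 0x9020}
lock_addrs = [0x1000, 0x1080, 0x1100]
alt_table = build_up_alt_table(extable, lock_addrs)
```

Only the lock sites that also appear in the exception table produce alternate entries, which is why the extra table stays small.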

Comment 1 David Bein 2010-09-14 17:49:25 UTC
Even though the bug is filed against x86_64 kernels, it is also an issue
for i386 kernels [i686]. The alternative instruction logic is shared
between the i686 and x86_64 kernels.

Comment 2 David Bein 2010-09-14 20:59:10 UTC
Created attachment 447322
Adds an alternate exception lookup table for UP case. Safe for hotplug cpu.

Comment 3 David Bein 2010-09-14 21:00:47 UTC
Comment on attachment 447322
Adds an alternate exception lookup table for UP case. Safe for hotplug cpu.

NOTE: This set of patches is relative to rh5.5:

2.6.18-194.11.3.el5

Comment 4 David Bein 2010-09-15 01:05:37 UTC
For full coverage in the presence of the boot options noreplacement or
smp-alt-boot, or if smp_alt_once is set to 1 in alternative_instructions()
[because the maximum number of possible cpus is < 2], we need to defer the
call which frees [__smp_alt_begin => __smp_alt_end] because that range
happens to include [__smp_locks => __smp_locks_end]. Computing the
alternative exception addresses requires __smp_locks => __smp_locks_end to
be untouched. The call to free_init_pages() poisons the pages, which breaks
the logic in alternatives_smp_check_exceptions(). It does not fault or
anything, but it also does not create the UP alternate exception table
[which is the whole point of this dance].

The next attachment is relative to the first one and delays actual calls
to free_init_pages() on the SMP alternatives segment until the calculated
alternate exceptions are completed. No change in semantics, just defer
the free_init_pages() until things have settled down a bit.

Comment 5 David Bein 2010-09-15 01:07:31 UTC
Created attachment 447367
Defer calls to free_init_pages() until alternative exception table is created.

This patch is relative to the previous patch for arch/i386/kernel/alternative.c .

Comment 6 David Bein 2010-09-15 17:14:26 UTC
It looks like others have hit this one before:

https://bugzilla.redhat.com/show_bug.cgi?id=429412

https://bugzilla.redhat.com/show_bug.cgi?id=431823

Digging a bit deeper, the fault was caused by _PAGE_RW being clear
in the pte. I have yet to track down why that happened, but maybe
COW handling in a fork?

Either way, it is because of the off-by-1 in the exception table
entries for various atomic cmpxchg instruction sequences in the
futex code.
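
The oops in the description appears consistent with this. As a hand-check, the Python sketch below decodes the reported error code (0003) and PTE (800000032e5a2065) using the standard x86 bit positions. The helper names and dictionary keys are illustrative only.

```python
# x86 page-fault error code bits
PF_PROT = 0x1   # 1 = protection violation on a present page
PF_WRITE = 0x2  # 1 = the access was a write
PF_USER = 0x4   # 1 = the access came from user mode

# x86 page table entry bits
_PAGE_PRESENT = 0x1
_PAGE_RW = 0x2
_PAGE_USER = 0x4


def decode_error_code(code):
    return {
        "page_present": bool(code & PF_PROT),
        "write": bool(code & PF_WRITE),
        "user_mode": bool(code & PF_USER),
    }


def decode_pte(pte):
    return {
        "present": bool(pte & _PAGE_PRESENT),
        "writable": bool(pte & _PAGE_RW),
        "user": bool(pte & _PAGE_USER),
    }


fault = decode_error_code(0x0003)        # "Oops: 0003" from the report
pte = decode_pte(0x800000032E5A2065)     # PTE value from the report
```

Under this reading, the fault is a kernel-mode write to a page that is present and user-accessible but not writable.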

Comment 7 David Bein 2010-09-15 19:38:19 UTC
Created attachment 447555
A different approach based on 2.6.23 -> 2.6.27 (maybe later)

At the expense of preserving the "lock",<operator> sequences in the futex
code, a different approach is to hardwire the "lock" sequences for the
futex code which means that the SMP alternatives code will never touch them.
This is the approach taken by some kernel.org releases, notably 2.6.24
and at least to 2.6.27 [as of this date]. The latest kernel.org kernels
have some other scheme for handling this entirely.

This is a simpler approach to fixing the problem and is standalone.
It is unclear if anyone will notice the overhead on UP for the
lock instructions for just the futex operators.
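
To show why hardwiring works, here is a small Python model, under the assumption that the UP patching only touches lock prefixes whose addresses were recorded. The site offsets, table contents, and function names are hypothetical.

```python
LOCK, NOP = 0xF0, 0x90


def apply_up_alternatives(text, smp_locks):
    """Patch only the lock prefixes whose offsets were recorded."""
    for off in smp_locks:
        if text[off] == LOCK:
            text[off] = NOP


def faulting_ip(text, site):
    """Address reported if the cmpxchg at `site` faults.

    With the lock prefix intact the instruction starts at `site`;
    after patching to nop, the cmpxchg starts one byte later.
    """
    return site + 1 if text[site] == NOP else site


# Two "lock cmpxchg %ecx,(%rdx)" sequences (f0 0f b1 0a) back to back.
text = bytearray([0xF0, 0x0F, 0xB1, 0x0A, 0xF0, 0x0F, 0xB1, 0x0A])
GENERIC_SITE, FUTEX_SITE = 0, 4
smp_locks = [GENERIC_SITE]        # the hardwired futex site is not recorded
extable = {FUTEX_SITE: 0x9000}    # hypothetical fixup address

apply_up_alternatives(text, smp_locks)
fixup = extable.get(faulting_ip(text, FUTEX_SITE))
```

The generic site is patched to a nop, while the futex site keeps its lock prefix, so its fault address still matches the exception table entry.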

Comment 8 David Bein 2010-11-14 06:21:37 UTC
It looks like this was fixed in:

https://rhn.redhat.com/errata/RHSA-2010-0839.html

The other bug for this is:

https://bugzilla.redhat.com/show_bug.cgi?id=633170

The choice was to hardwire the lock prefix in futex.h
which happens to be the same as:

https://bugzilla.redhat.com/attachment.cgi?id=447555

This bug should be marked as a duplicate of 633170.

Comment 9 Jes Sorensen 2013-02-27 09:46:02 UTC

*** This bug has been marked as a duplicate of bug 633170 ***

Comment 10 David Bein 2014-06-09 11:47:36 UTC
Since this is a dup, nothing is needed at this point.