Description of problem:

When an SMP RHEL5 kernel boots on a single CPU, code runs at boot to replace the "lock",<operation> sequences with "nop",<operation>. In general this works fine and offers a performance benefit by avoiding the expense of locked bus operations when no other CPU requires the operation to be interlocked. However, the futex code [specifically the PI futex code] does atomic handoffs on user addresses, complete with exception handling in case the user page is not resident and needs to be faulted in to complete the operation. In general this works because there is a preceding get_user(uval, uaddr) [before locking down mmap_sem] to acquire the initial value of the futex. If, however, acquiring mmap_sem for read blocks, it is possible that the page containing the futex word will fault, and if so, the fixup table addresses are off-by-1 once the UP lock => nop replacement has been done. In short, the exception address is the original address+1; it is never found, so the kernel does not handle the fault and OOPSes on the page fault (as if the kernel were doing something bad).
Here is an example of what that looks like:

<1>Unable to handle kernel paging request at 000000000bdd99b8 RIP:
<1> [<ffffffff80060b93>] do_futex+0x1333/0x1490
<4>PGD 346146067 PUD 32bbcc067 PMD 341c73067 PTE 800000032e5a2065
<0>Oops: 0003 [1] SMP
<1>last sysfs file: /class/scsi_host/host0/proc_name
[0]kdb> rd
r15 = 0x0000000000000000  r14 = 0x000000000bdd99b8  r13 = 0x000000008000591d
r12 = 0x0000000000000000  rbp = 0xffffffff8060fc78  rbx = 0x0000000000000000
r11 = 0x0000000000000000  r10 = 0x00000000ffffffff  r9  = 0x0000000000000015
r8  = 0xffff81032e5ca000  rax = 0x000000008000591d  rcx = 0x0000000000000000
rdx = 0x00000000fffffff2  rsi = 0x00000000fdfbd800  rdi = 0xffffffff8060fc70
orig_rax = 0xffffffffffffffff  rip = 0xffffffff80060b93  cs = 0x0000000000000010
eflags = 0x0000000000210246  rsp = 0xffff81032e5cbc90  ss = 0x0000000000000000
&regs = 0xffff81032e5cbbf8
[0]kdb> bt
Stack traceback for pid 22813
0xffff81032e5af100 22813 22675 1 0 R 0xffff81032e5af390 *process
rsp                rip                Function (args)
0xffff81032e5cbc78 0xffffffff80060b93 do_futex+0x1333
0xffff81032e5cbca0 0xffffffff80060a14 do_futex+0x11b4
0xffff81032e5cbce0 0xffffffff800797be unlock_page+0x2e
0xffff81032e5cbd30 0xffffffff8007cb1b filemap_nopage+0x19b
0xffff81032e5cbe38 0xffffffff8003cab0 default_wake_function
0xffff81032e5cbe70 0xffffffff8002e607 do_page_fault+0x4b7
0xffff81032e5cbe80 0xffffffff8033d0d8 thread_return+0x62
0xffff81032e5cbec0 0xffffffff8000f4c3 do_gettimeofday+0x43
0xffff81032e5cbf10 0xffffffff80061349 compat_sys_futex+0x119
0xffff81032e5cbf80 0xffffffff80030c72 ia32_sysret

Version-Release number of selected component (if applicable):
Any RHEL5 kernel.

How reproducible:
It is non-trivial to reproduce this, but it should be clear that futex_atomic_cmpxchg_inatomic() records its exception-table entry at the address of the "lock","cmpxchg ..." sequence, and AFAIK as long as the "lock" prefix is there, a fault will resolve to the saved exception address.
If the "lock" is replaced with a "nop", then the fault address is the address of the cmpxchg instruction, and the exception lookup will fail if it faults [which is the hard part]. I suspect that a thread doing something to the address space of the process that requires the write lock on mmap_sem is the culprit. All that has to happen is for the page to be non-resident when the atomic update of the futex takes place, and it will OOPS [as above].

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
I am working on a piece of code that, for i386/x86_64, would adjust the exception table addresses dynamically [after they have been sorted]. Alternatively, it would be easy to augment the exception table lookup with another table computed from the intersection of the nop addresses and the kernel's exception table. Doing that would add slight overhead to exception lookup, but only when the main exception table lookup fails. We are talking about < 16 additional exception table entries if we augment the search (after the main kernel and module lookups fail).
Even though the bug is filed against x86_64 kernels, it is also an issue for i386 kernels [i686]. The alternative instruction logic is shared between the i686 and x86_64 kernels.
Created attachment 447322 [details] Adds an alternate exception lookup table for UP case. Safe for hotplug cpu.
Comment on attachment 447322 [details] Adds an alternate exception lookup table for UP case. Safe for hotplug cpu. NOTE: This set of patches is relative to rh5.5: 2.6.18-194.11.3.el5
For full coverage in the presence of the boot options noreplacement or smp-alt-boot, or if smp_alt_once is set to 1 in alternative_instructions() [because the maximum number of possible CPUs is < 2], we need to defer the call that frees [__smp_alt_begin => __smp_alt_end], because that range happens to include [__smp_locks => __smp_locks_end]. Computing the alternative exception addresses requires __smp_locks => __smp_locks_end to be untouched: the call to free_init_pages() poisons the pages, which breaks the logic in alternatives_smp_check_exceptions(). It does not fault or anything, but it also does not create the UP alternate exception table [which is the whole point of this dance]. The next attachment is relative to the first one and delays the actual calls to free_init_pages() on the SMP alternatives segment until the calculated alternate exceptions are completed. No change in semantics, just defer the free_init_pages() until things have settled down a bit.
Created attachment 447367 [details] Defer calls to free_init_pages() until alternative exception table is created. This patch is relative to the previous patch for arch/i386/kernel/alternative.c .
It looks like others have hit this one before:
https://bugzilla.redhat.com/show_bug.cgi?id=429412
https://bugzilla.redhat.com/show_bug.cgi?id=431823

Digging a bit deeper, the fault was caused by _PAGE_RW being clear in the pte. I have yet to track down why that happened, but maybe COW handling in a fork? Either way, the OOPS is due to the off-by-1 in the exception table entries for the various atomic cmpxchg instruction sequences in the futex code.
Created attachment 447555 [details] A different approach based on 2.6.23 -> 2.6.27 (maybe later)

At the expense of keeping the "lock",<operation> sequences in the futex code even on UP, a different approach is to hardwire the "lock" prefixes for the futex code, which means the SMP alternatives code will never touch them. This is the approach taken by some kernel.org releases, notably 2.6.24 and at least through 2.6.27 [as of this date]. The latest kernel.org kernels handle this with some other scheme entirely. This is a simpler, standalone fix for the problem. It is unclear whether anyone will notice the overhead of the locked instructions on UP for just the futex operations.
It looks like this was fixed in:
https://rhn.redhat.com/errata/RHSA-2010-0839.html

The other bug for this is:
https://bugzilla.redhat.com/show_bug.cgi?id=633170

The choice was to hardwire the lock prefix in futex.h, which happens to be the same as:
https://bugzilla.redhat.com/attachment.cgi?id=447555

This bug should be marked as a duplicate of 633170.
*** This bug has been marked as a duplicate of bug 633170 ***
Since this is a dup, nothing is needed at this point.