162797 – dangling __kernel_sigreturn makes signal return unreliable

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	5
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Ingo Molnar
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2005-07-08 18:29 UTC by John Reiser
Modified:	2007-11-30 22:11 UTC
CC List:	7 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2006-03-29 20:57:05 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
start.S assembly code for testcase (1.36 KB, text/plain) 2005-07-19 21:39 UTC, John Reiser
sigreturn.c for testcase (1.55 KB, text/plain) 2005-07-19 21:39 UTC, John Reiser
put vDSO at STACK_TOP (1.01 KB, patch) 2005-12-15 01:27 UTC, John Reiser
vDSO: random, STACK_TOP, just below mm->start_code (3.91 KB, patch) 2005-12-17 05:05 UTC, John Reiser
exec-shield options for vDSO placement on x86 (6.59 KB, patch) 2006-03-17 00:45 UTC, John Reiser

Description John Reiser 2005-07-08 18:29:42 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050524 Fedora/1.0.4-4 Firefox/1.0.4

Description of problem:
Return from signal handler is unreliable because the kernel uses a dangling value for __kernel_sigreturn when setting up the code for return from signal handler.

setup_frame() in arch/i386/kernel/signal.c uses
        restorer = current->mm->context.vdso + (long)&__kernel_sigreturn;
whenever !(.sa_flags & SA_RESTORER).  Unfortunately, context.vdso is
never updated when the user changes the mapping for that page, the mapping
is not protected against being changed, and because /proc/PID/auxv is read-only
the user cannot inform the kernel of the change.  So any *sigaction() that does
not specify SA_RESTORER creates a time bomb for the return from the signal
handler.  Some users want to move the AT_SYSINFO_EHDR page in order to maximize
contiguous page ranges when dealing with large arrays.

The kernel must allow the user to move the AT_SYSINFO page.
[If not, then the kernel must object to any mmap/mprotect/munmap/mremap that affects the AT_SYSINFO page.]  The alternatives I see are: (1) the kernel detects mmap(vaddr, PAGE_SIZE, PROT_EXEC, MAP_FIXED, fd, 0) where fd is /proc/self/auxv, then adjusts __kernel_sigreturn; (2) the user tells the kernel by writing to /proc/PID/auxv (using some protocol such as: seek to AT_SYSINFO_EHDR * sizeof(void *), write the new binary value); or (3) the user tells the kernel through a new syscall.


Version-Release number of selected component (if applicable):
kernel-2.6.11-1.1369_FC4

How reproducible:
Always

Steps to Reproduce:
1. At Elf32_Ehdr.e_entry, immediately after execve(): find AT_SYSINFO_EHDR, copy that page to a new one, update AT_SYSINFO and AT_SYSINFO_EHDR to point to the new page, discard the old page.
2. Call *sigaction() with a handler but without SA_RESTORER.
3. Receive a signal, and attempt to return from the handler.
  

Actual Results:  Return from the handler faults SIGSEGV because the original AT_SYSINFO_EHDR page no longer exists.

Expected Results:  Kernel should not rely on dangling __kernel_sigreturn when setting up return from signal handler.  Either kernel should use a fallback, "always works" mechanism, or allow [and require] the user to tell the kernel when the user moves the AT_SYSINFO page.

Additional info:

Comment 1 Roland McGrath 2005-07-08 19:07:57 UTC

It seems dubious to me that the user should reasonably expect to unmap the
kernel-supplied page and not have things go all to hell.  If the application
needs to ensure that certain ranges of its address space are free, the most
proper thing to do is reserve those by using PT_LOAD segments in the executable
ELF file.
Segments that reserve address space without loading anything can have p_flags=0
to get PROT_NONE mappings that you can later unmap or mmap over.
It's possible the kernel has some bug dealing with this, but that would be a
separate kernel bug and if a fix is necessary it should be done there.

Unless you can convince me otherwise, I'm inclined to resolve this NOTABUG.

Comment 2 John Reiser 2005-07-08 20:11:21 UTC

I agree that ranges with fixed addresses known in advance should be reserved
using PT_LOAD.  The problem is for ranges not known in advance, namely all the
holes left over after mapping the executable and the PT_INTERP, particularly
when ET_DYN, 0==p_vaddr, and randomization is involved.  This is a frequent case
for PT_INTERP [that has not been prelinked], and an increasingly-common case for
-fPIE main executables.  The kernel randomly places the AT_SYSINFO page
somewhere in the holes, and this often fragments the address space
unnecessarily.  Splitting a 100MB hole into a {30MB hole, 4KB AT_SYSINFO, 70MB
hole} is costly because a 75MB array can no longer use that address space if the
4KB AT_SYSINFO page cannot be moved.

Here is a compromise that I could live with: adjust the default policy for
AT_SYSINFO placement to be 1 page below that of the first PT_LOAD for the
PT_INTERP (and especially for a ET_DYN PT_INTERP), else 1 page below that of the
first PT_LOAD of the main executable.  In particular, if I prelink my PT_INTERP
(or use an ET_EXEC PT_INTERP) then in effect I also prelink the AT_SYSINFO page
at 4KB less.  This would dramatically reduce the unnecessary fragmentation of
the address space, while still retaining some security benefits of randomization
(if either the PT_INTERP or the main execve() were ET_DYN with 0==p_vaddr).  It
would also give administrators and users more control, and in an understandable way.

Comment 3 John Reiser 2005-07-11 19:21:05 UTC

There is a performance aspect, too.  The random placement of the AT_SYSINFO page
(linux-gate.so.1) by the kernel can disrupt much of the prelinking for a typical
configuration of shared libraries.  The kernel places AT_SYSINFO after seeing at
most the main execve() and the PT_INTERP.  If the kernel happens to choose a page
that is preferred by a prelinked .so which is mapped-in later by the usual
PT_INTERP ld-linux.so.2, then ld-linux must relocate that .so somewhere else.
Doing so invokes another randomization by the kernel, which may step on pages
preferred by subsequent prelinked .so, and the result can cascade many times.  It
is not uncommon for KDE, Gnome, even "bare" X11 applications to use a dozen or
more prelinked .so.  Being forced to abandon the prelinked address costs CPU
time and may reduce page sharing.  If the placement policy for AT_SYSINFO were
"1 page below [or above] PT_INTERP" then a local administrator (or prelink
itself) could take care to avoid that page.  The result would be that the
randomization would be controlled by the prelink policy (and not directly by the
kernel unless 0==p_vaddr), and the kernel would not "accidentally" disrupt
placement of an entire configuration of .so.

Comment 4 John Reiser 2005-07-13 01:51:10 UTC

Here is an example which shows that the kernel placement of linux-gate.so.1 does
interfere with a prelinked glibc, forcing libc.so.6 to be relocated at runtime.
In this specific case, where glibc is the only .so besides ld-linux.so.2, the
frequency was 7.4% (74 of 1000 runs).
-----
for i in 0 1 2 3 4 5 6 7 8 9; do
  for j in 0 1 2 3 4 5 6 7 8 9; do
    for k in 0 1 2 3 4 5 6 7 8 9; do
      ldd /bin/cat
    done
  done
done  |  grep libc  |  sort  |  uniq -c
-----
     74         libc.so.6 => /lib/libc.so.6 (0x00111000)
    926         libc.so.6 => /lib/libc.so.6 (0x009ee000)
-----

Comment 5 Dave Jones 2005-07-15 21:05:15 UTC

[This comment has been added as a mass update for all FC4 kernel bugs.
 If you have migrated this bug from an FC3 bug today, ignore this comment.]

Please retest your problem with today's 2.6.12-1.1398_FC4 update.

If your problem involved being unable to boot, or some hardware not being
detected correctly, please make sure your /etc/modprobe.conf is correct *BEFORE*
installing any kernel updates.
If in doubt, you can recreate this file using:

mv /etc/sysconfig/hwconf /etc/sysconfig/hwconf.bak
mv /etc/modprobe.conf /etc/modprobe.conf.bak
kudzu


Thank you.

Comment 6 John Reiser 2005-07-19 21:38:16 UTC

kernel-2.6.12-1.1398_FC4 on i686 still has the original problem: dangling
__kernel_sigreturn.

I will attach two small files which form a reproducible testcase: start.S and
sigreturn.c.
$ gcc -g -o sigreturn -nostartfiles -nostdlib start.S sigreturn.c
$ ./sigreturn
Segmentation fault   # because linux-gate.so.1 page was moved
$

Comment 7 John Reiser 2005-07-19 21:39:17 UTC

Created attachment 116950
start.S assembly code for testcase

Comment 8 John Reiser 2005-07-19 21:39:55 UTC

Created attachment 116951
sigreturn.c for testcase

Comment 9 Dave Jones 2005-09-30 06:05:43 UTC

Mass update to all FC4 bugs:

An update has been released (2.6.13-1.1526_FC4) which rebases to a new upstream
kernel (2.6.13.2). As there were ~3500 changes upstream between this and the
previous kernel, it's possible your bug has been fixed already.

Please retest with this update, and update this bug if necessary.

Thanks.

Comment 10 John Reiser 2005-09-30 16:09:08 UTC

The testcase of comment #6 still fails (gives Segmentation fault because
__kernel_sigreturn dangles after the user moves the AT_SYSINFO page) under
kernel-2.6.13-1.1526_FC4.

The performance test of comment #4 showed:
     68         libc.so.6 => /lib/libc.so.6 (0x00111000)
    932         libc.so.6 => /lib/libc.so.6 (0x009ee000)
which is 6.8% interference of AT_SYSINFO page with pre-linked glibc.  (This will
become worse if the app uses more than one pre-linked shared lib.)

Comment 11 John Reiser 2005-09-30 16:38:38 UTC

bugzilla mail says  jreiser changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO_REPORTER           |NEEDINFO

but I did not "press any buttons" [except the "Save Changes"] when creating
Comment #10.  Perhaps >this< comment will be enough to respond to the "NEEDINFO"
status.

Comment 12 Dave Jones 2005-11-10 19:01:50 UTC

2.6.14-1.1637_FC4 has been released as an update for FC4.
Please retest with this update, as a large amount of code has been changed in
this release, which may have fixed your problem.

Thank you.

Comment 13 John Reiser 2005-11-11 16:42:23 UTC

kernel-2.6.14-1.1637_FC4 still gives SIGSEGV because __kernel_sigreturn dangles
after the user moves the AT_SYSINFO page (testcase of comment #6).

The performance test of comment #4 showed:
     77         libc.so.6 => /lib/libc.so.6 (0x00111000)
    923         libc.so.6 => /lib/libc.so.6 (0x009e4000)
which is 7.7% interference of AT_SYSINFO page with pre-linked glibc.  The
degradation will get worse as the process uses more pre-linked shared libraries
(such as any KDE or Gnome application).

Comment 14 John Reiser 2005-12-15 01:27:39 UTC

Created attachment 122261
put vDSO at STACK_TOP

This patch, linux-2.6-x86-vdso-stacktop.patch against kernel-2.6.14-1.1760_FC5,
is a workaround that puts the vDSO at STACK_TOP, 1 page below TASK_SIZE.
Because the vDSO page is executable and will reside at the highest user address,
exec_shield has no effect; so set /proc/sys/kernel/exec-shield (or the
compile-time variable exec_shield in kernel/sysctl.c) to zero.

Comment 15 John Reiser 2005-12-17 05:05:56 UTC

Created attachment 122364
vDSO: random, STACK_TOP, just below mm->start_code

This patch puts the vDSO at STACK_TOP when exec-shield is 0.  Otherwise,
another bit in exec_shield chooses between random placement and the page just
below current->mm->start_code.

Comment 16 Dave Jones 2006-02-03 06:29:36 UTC

This is a mass-update to all currently open kernel bugs.

A new kernel update has been released (Version: 2.6.15-1.1830_FC4)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO_REPORTER state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks' time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

Thank you.

Comment 17 John Reiser 2006-02-03 23:06:40 UTC

The issues persist with kernel-2.6.15-1.1826.2.10_FC5 on i686.  [Changed Version
to fc5test2.]  kernel-2.6.15-1.1895_FC5 hangs starting udev on Athlon SiS730
[see bugzilla #179601.]

The performance degradation of Comment #4 remains about 7%:
     74         libc.so.6 => /lib/libc.so.6 (0x00111000)
    926         libc.so.6 => /lib/libc.so.6 (0x006db000)

Comment 18 John Reiser 2006-03-17 00:45:36 UTC

Created attachment 126260
exec-shield options for vDSO placement on x86

Two more bits in /proc/sys/kernel/exec-shield control placement of vDSO on x86:
random, just below STACK_TOP, just below .text of main executable, or just below
.text of PT_INTERP (ld.so).

Comment 19 John Reiser 2006-03-17 00:48:40 UTC

kernel-2.6.15-1.2054_FC5 still has the same properties here.

The performance problem (randomly-placed vDSO interferes with prelinking) sticks
out prominently.  The Firefox browser often takes over 15 seconds to start (from
click on icon in menu bar, until some content is visible from local file:///
home page), which includes over 10 seconds after "Starting Web Browser"
disappears but before window appears.  Thus it looks like launch has failed.

In contrast, with the patch of Comment #18, Firefox always launches in less than
5 seconds.  The vDSO is placed just below .text of the PT_INTERP (ld.so), just
below the .text of the main executable, or just below STACK_TOP; this is
controlled by new bits in exec-shield.

Comment 20 Dave Jones 2006-03-29 20:57:05 UTC

Fixed in 2080_FC5 and rawhide (and the pending FC4 update).
Thanks for persevering on this one John.
