Bug 1410097 - kernel: too small userspace stack allocated for PIE binary
Summary: kernel: too small userspace stack allocated for PIE binary
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Rik van Riel
QA Contact: Li Wang
URL:
Whiteboard:
Depends On:
Blocks: RHEL-hardening-cflags
 
Reported: 2017-01-04 12:58 UTC by Florian Weimer
Modified: 2018-02-22 11:26 UTC (History)
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-06 10:19:31 UTC
Target Upstream Version:


Attachments (Terms of Use)
reproducer.tar.xz (419.62 KB, application/octet-stream)
2017-01-04 12:58 UTC, Florian Weimer
reproducer (8.59 KB, application/x-sharedlib)
2017-01-04 17:29 UTC, Florian Weimer


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 708563 0 unspecified CLOSED 2.6.38.6-27.fc15.x86_64 kernel doesn't work with PIE when ASLR is disabled 2021-02-22 00:41:40 UTC

Internal Links: 708563

Description Florian Weimer 2017-01-04 12:58:32 UTC
Created attachment 1237119 [details]
reproducer.tar.xz

Description of problem:

If kernel.randomize_va_space=0, the kernel allocates a very small stack for some binaries.
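For reference, the sysctl can be inspected without root, and ASLR can also be disabled for a single process via util-linux setarch (the second command is illustrative; any program works in place of grep):

```shell
# Current ASLR mode: 2 = full randomization (the default), 0 = disabled
cat /proc/sys/kernel/randomize_va_space

# Disable ASLR for one invocation only, without touching the sysctl;
# the stack VMA then lands at a fixed address on every run
setarch "$(uname -m)" --addr-no-randomize grep stack /proc/self/maps
```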

Version-Release number of selected component (if applicable):

kernel-2.6.32-642.el6.x86_64
kernel-2.6.32-642.11.1.el6.x86_64

How reproducible:

Always with the attached binary.

Steps to Reproduce:
1. Enter a Red Hat Enterprise Linux 7 chroot.
2. Download the reproducer.tar.xz and unpack it (it unpacks to the current directory).
3. ./genautomata i386.md insn-conditions.md > /dev/null

Actual results:

Execution terminates with SIGBUS due to a stack overflow.

Expected results:

Execution completes successfully.

Additional info:

I don't know if this can happen for non-PIE binaries or with kernel.randomize_va_space=2 (the default).

Running under GDB yields:

Program received signal SIGBUS, Bus error.
0x00007ffff7fdfd95 in pass_state_graph (start_state=0x7fffff2f2260, 
    applied_func=0x7ffff7fdaaa0 <incr_states_and_arcs_nums(state_t)>)
    at ../../gcc/genautomata.c:5820
5820    {
(gdb) print $rsp
$1 = (void *) 0x7ffffffd0000

/proc/PID/maps shows that the stack is tiny:

…
7ffffffd0000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]
…

ulimit -a shows a decent stack size:

…
stack size              (kbytes, -s) 10240
…
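The two numbers can be compared directly; a rough sketch that reads the current shell's own maps file (POSIX shell arithmetic, no gawk needed):

```shell
# Soft stack limit as seen by the shell, in KiB
ulimit -s

# Size of the main stack VMA actually mapped right now, in KiB
range=$(grep '\[stack\]' /proc/$$/maps | cut -d' ' -f1)
start=${range%-*}; end=${range#*-}
echo "$(( (0x$end - 0x$start) / 1024 )) KiB currently mapped"
```

The mapped size is normally far below the rlimit; the rlimit only bounds how far the kernel will grow the VMA on demand.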

The genautomata program headers don't look suspicious to me:

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000000040 0x0000000000000040 0x0001f8 0x0001f8 R E 0x8
  INTERP         0x000238 0x0000000000000238 0x0000000000000238 0x00001c 0x00001c R   0x1
        [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x030b24 0x030b24 R E 0x200000
  LOAD           0x031a48 0x0000000000231a48 0x0000000000231a48 0x004660 0x005638 RW  0x200000
  DYNAMIC        0x035be8 0x0000000000235be8 0x0000000000235be8 0x000200 0x000200 RW  0x8
  NOTE           0x000254 0x0000000000000254 0x0000000000000254 0x000044 0x000044 R   0x4
  GNU_EH_FRAME   0x02c8c0 0x000000000002c8c0 0x000000000002c8c0 0x000994 0x000994 R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x031a48 0x0000000000231a48 0x0000000000231a48 0x0045b8 0x0045b8 R   0x1
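For completeness, a listing like the one above comes from readelf (binutils assumed; the path is illustrative, any ELF binary works):

```shell
# -l prints the program headers, -W keeps 64-bit addresses untruncated
readelf -lW ./genautomata   # e.g. /bin/sh works too
```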

Comment 2 Florian Weimer 2017-01-04 13:15:17 UTC
This also happens if the genautomata binary is run under GDB, with an unchanged kernel.randomize_va_space setting (at the default value 2).

Comment 3 Florian Weimer 2017-01-04 16:39:53 UTC
“ulimit -s unlimited” works around this issue because the address space layout changes.

NB: This is about the kernel allocation of the main stack for *userspace*.

Comment 4 Rik van Riel 2017-01-04 17:00:10 UTC
The stack is currently small, but I see nothing in your bug showing that the kernel did not leave space to expand it. What does the /proc/<pid>/maps file look like when the task gets its SIGBUS?

It looks like, as long as the access done by user space is within 64kB below the bottom of the stack, the kernel will automatically expand the stack to cover the newly touched stack space:

	if (error_code & PF_USER) {
		/*
		 * Accessing the stack below %sp is always a bug.
		 * The large cushion allows instructions like enter
		 * and pusha to work. ("enter $65535, $31" pushes
		 * 32 pointers and then decrements %sp by 65535.)
		 */
		if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
			bad_area(regs, error_code, address);
			return;
		}
	}
	if (unlikely(expand_stack(vma, address))) {
		bad_area(regs, error_code, address);
		return;
	}
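As a sanity check on the snippet above, the cushion permitted below %sp works out to just over 64 KiB on x86-64, where sizeof(unsigned long) is 8:

```shell
# 64 KiB plus room for the 32 pointers pushed by "enter $65535, $31"
echo $(( 65536 + 32 * 8 ))   # 65792
```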

Furthermore, when stack expansion fails, or when the kernel refuses to expand the stack, userspace should be getting a segfault, not a bus error.

Let me see if I can reproduce your issue in a RHEL6 guest here.

Comment 5 Florian Weimer 2017-01-04 17:11:48 UTC
(In reply to Rik van Riel from comment #4)
> The stack is currently small, but I see nothing in your bug showing that the
> kernel did not leave space to expand it. What does the /proc/<pid>/maps file
> look like when the task gets its SIGBUS?

This is the line posted in the description.

> It looks like as long as the access done by user space is within 64kB of the
> bottom of the stack, the kernel will automatically expand the stack to cover
> the newly touched stack space:

I should have pasted more context from the maps file.  This is from the point of the crash:

7ffff8205000-7ffffffcf000 rw-p 00000000 00:00 0                          [heap]
7ffffffd0000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]

So it looks like the brk heap blocks stack expansion.

I still don't think this is an application issue (e.g. a memory leak) because the kernel only left a ~125 MiB gap for the heap.
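Indeed, taking the two lines above at face value, the remaining room between the top of the brk heap and the bottom of the stack VMA is a single 4 KiB page:

```shell
# stack bottom minus heap top, from the maps excerpt above
echo $(( 0x7ffffffd0000 - 0x7ffffffcf000 ))   # 4096
```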

Comment 6 Florian Weimer 2017-01-04 17:26:43 UTC
Based on that, I have a much simpler reproducer.  The C source follows.  Compile it under RHEL 7 with: gcc -fpie -pie reproducer.c

#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>

static char *first_address;

static void
recurse (int depth)
{
  char foo[4000];
  if (first_address == NULL)
    first_address = foo;
  printf ("depth %d stack address %p distance %td KiB\n",
          depth, &foo, (first_address - foo) / 1024);
  recurse (depth + 1);
  asm volatile ("" ::: "memory"); /* Prevent tail recursion.  */
}

int
main (void)
{
  void **heap_filler = 0;
  size_t total = 0;
  while (total < 130 * 1024 * 1024)
    {
      size_t sz = 16000;
      void **next = malloc (sz);
      if (next == NULL)
        break;
      total += sz;
      *next = heap_filler;
      heap_filler = next;
    }
  printf ("heap-allocated %zu bytes\n", total);

  recurse (0);
}

Output:

heap-allocated 136320000 bytes
depth 0 stack address 0x7fffffffd610 distance 0 KiB
depth 1 stack address 0x7fffffffc650 distance 3 KiB
depth 2 stack address 0x7fffffffb690 distance 7 KiB
depth 3 stack address 0x7fffffffa6d0 distance 11 KiB
depth 4 stack address 0x7fffffff9710 distance 15 KiB
depth 5 stack address 0x7fffffff8750 distance 19 KiB
depth 6 stack address 0x7fffffff7790 distance 23 KiB
depth 7 stack address 0x7fffffff67d0 distance 27 KiB
depth 8 stack address 0x7fffffff5810 distance 31 KiB
depth 9 stack address 0x7fffffff4850 distance 35 KiB
depth 10 stack address 0x7fffffff3890 distance 39 KiB
depth 11 stack address 0x7fffffff28d0 distance 43 KiB
depth 12 stack address 0x7fffffff1910 distance 47 KiB
depth 13 stack address 0x7fffffff0950 distance 51 KiB
depth 14 stack address 0x7ffffffef990 distance 55 KiB
depth 15 stack address 0x7ffffffee9d0 distance 59 KiB
depth 16 stack address 0x7ffffffeda10 distance 63 KiB
depth 17 stack address 0x7ffffffeca50 distance 66 KiB
depth 18 stack address 0x7ffffffeba90 distance 70 KiB
depth 19 stack address 0x7ffffffeaad0 distance 74 KiB
depth 20 stack address 0x7ffffffe9b10 distance 78 KiB
depth 21 stack address 0x7ffffffe8b50 distance 82 KiB
depth 22 stack address 0x7ffffffe7b90 distance 86 KiB
depth 23 stack address 0x7ffffffe6bd0 distance 90 KiB
depth 24 stack address 0x7ffffffe5c10 distance 94 KiB
depth 25 stack address 0x7ffffffe4c50 distance 98 KiB
depth 26 stack address 0x7ffffffe3c90 distance 102 KiB
depth 27 stack address 0x7ffffffe2cd0 distance 106 KiB
depth 28 stack address 0x7ffffffe1d10 distance 110 KiB
depth 29 stack address 0x7ffffffe0d50 distance 114 KiB
depth 30 stack address 0x7ffffffdfd90 distance 118 KiB
depth 31 stack address 0x7ffffffdedd0 distance 122 KiB
depth 32 stack address 0x7ffffffdde10 distance 126 KiB
depth 33 stack address 0x7ffffffdce50 distance 129 KiB
depth 34 stack address 0x7ffffffdbe90 distance 133 KiB
depth 35 stack address 0x7ffffffdaed0 distance 137 KiB
depth 36 stack address 0x7ffffffd9f10 distance 141 KiB
depth 37 stack address 0x7ffffffd8f50 distance 145 KiB
depth 38 stack address 0x7ffffffd7f90 distance 149 KiB
depth 39 stack address 0x7ffffffd6fd0 distance 153 KiB
depth 40 stack address 0x7ffffffd6010 distance 157 KiB
depth 41 stack address 0x7ffffffd5050 distance 161 KiB
depth 42 stack address 0x7ffffffd4090 distance 165 KiB
depth 43 stack address 0x7ffffffd30d0 distance 169 KiB
depth 44 stack address 0x7ffffffd2110 distance 173 KiB
depth 45 stack address 0x7ffffffd1150 distance 177 KiB
depth 46 stack address 0x7ffffffd0190 distance 181 KiB
depth 47 stack address 0x7ffffffcf1d0 distance 185 KiB
depth 48 stack address 0x7ffffffce210 distance 189 KiB
depth 49 stack address 0x7ffffffcd250 distance 192 KiB
depth 50 stack address 0x7ffffffcc290 distance 196 KiB
depth 51 stack address 0x7ffffffcb2d0 distance 200 KiB
Bus error (core dumped)

Comment 7 Florian Weimer 2017-01-04 17:29:26 UTC
Created attachment 1237250 [details]
reproducer

Compiled x86_64 program from source code in comment #6.  This should run on RHEL 6 even outside a RHEL 7 chroot because it only references symbols provided by the RHEL 6 glibc.

Comment 8 Rik van Riel 2017-01-04 18:45:47 UTC
That second reproducer shows the bug quite easily, indeed!

It dies with the heap being just over 128MB in size, and the stack being only a little over 200kB in size.

7ffff81ff000-7ffff8200000 rw-p 00001000 fd:00 305164                     /root/reproducer
7ffff8200000-7ffffffc9000 rw-p 00000000 00:00 0                          [heap]
7ffffffca000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

$ echo $((0x7ffffffc9000 - 0x7ffff8200000))
131895296

$ echo $((0x7ffffffff000 - 0x7ffffffca000))
217088

In short, it looks like a PIE binary is mapped near the end of the address space, with just 128MB of space for the heap and 10MB for the stack, instead of finding it a place much lower in the address space, where the heap has room to grow (without running into the stack).

Comment 9 Rik van Riel 2017-01-04 18:47:34 UTC
For example, looking at /proc/$$/maps (bash) on RHEL6 shows me this:

008dc000-008e5000 rw-p 000dc000 fd:00 264590                             /bin/bash
00d06000-00d48000 rw-p 00000000 00:00 0                                  [heap]
3c5b200000-3c5b220000 r-xp 00000000 fd:00 351                            /lib64/ld-2.12.so
... (many lines)
7fd024c70000-7fd024c77000 r--s 00000000 fd:00 134326                     /usr/lib64/gconv/gconv-modules.cache
7fd024c77000-7fd024c78000 rw-p 00000000 00:00 0 
7fff10be5000-7fff10bfa000 rw-p 00000000 00:00 0                          [stack]
7fff10bff000-7fff10c00000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

In other words, not only is there an absolutely gigantic gap for the heap, there is also a properly sized gap for the stack.

Let's see how we can fix this for RHEL6 x86-64...

Comment 10 Rik van Riel 2017-01-04 19:44:49 UTC
Your reproducer fails once it reaches the ulimit stack limit on Fedora 25, kernel 4.8.8-300.fc25.x86_64. The binary gets loaded low in memory (0x555555554000 this test run; there should be some randomization in there).

We need to ensure that brk() never overflows the 128MB allocated for the heap, running into the stack area. The malloc() implementation in glibc should fall back to mmap() once brk() fails to return more memory, making this transparent to userspace. However, mapping the binary at a much lower virtual memory address would give the heap several TB of space to grow before anything would happen :)

There are several changesets in the upstream kernel that seem relevant:


commit d1fd836dcf00d2028c700c7e44d2c23404062c90
Author: Kees Cook <keescook@chromium.org>
Date:   Tue Apr 14 15:48:07 2015 -0700

    mm: split ET_DYN ASLR from mmap ASLR
    
    This fixes the "offset2lib" weakness in ASLR for arm, arm64, mips,
    powerpc, and x86.  The problem is that if there is a leak of ASLR from
    the executable (ET_DYN), it means a leak of shared library offset as
    well (mmap), and vice versa.  Further details and a PoC of this attack
    is available here:
    
      http://cybersecurity.upv.es/attacks/offset2lib/offset2lib.html

commit a87938b2e246b81b4fb713edb371a9fa3c5c3c86
Author: Michael Davidson <md@google.com>
Date:   Tue Apr 14 15:47:38 2015 -0700

    fs/binfmt_elf.c: fix bug in loading of PIE binaries
    
    With CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE enabled, and a normal top-down
    address allocation strategy, load_elf_binary() will attempt to map a PIE
    binary into an address range immediately below mm->mmap_base.
    
    Unfortunately, load_elf_binary() does not take account of the need to
    allocate sufficient space for the entire binary which means that, while
    the first PT_LOAD segment is mapped below mm->mmap_base, the subsequent
    PT_LOAD segment(s) end up being mapped above mm->mmap_base into the area
    that is supposed to be the "gap" between the stack and the binary.


I will see how much (or little) we want to backport here. It could be a one-liner, but that would make RHEL6 different from upstream...

Al, any opinions?

Comment 11 Neil Horman 2017-01-04 19:50:25 UTC
Rik, that's correct.  IIRC, in RHEL6, for reasons I can't recall, text is mapped high in the process address space, near the stack, along with the heap (possibly to allow a greater contiguous address space for programmatic memory mapping?  not sure).  Either way, I think it was based on the assumption that the stack ulimit let you have confidence in what the stack growth would be.  Setting the stack size to unlimited forces the kernel and glibc to map a process such that text and heap are much lower in the address space, so the stack can grow accordingly.

I think the most direct answer to our building problem here is to just ensure that limits.conf is configured on the builders so that the stack size is unlimited (or set to a very large value).
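A sketch of that workaround; the limits.conf line assumes pam_limits is in use on the builders:

```shell
# One-off, for the current shell and its children
ulimit -s unlimited

# Persistent equivalent, one line in /etc/security/limits.conf:
#   *    soft    stack    unlimited
```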

Comment 12 Florian Weimer 2017-01-04 20:21:52 UTC
(In reply to Rik van Riel from comment #10)
> Your reproducer fails once it reaches the ulimit stack limit on Fedora 25,
> kernel 4.8.8-300.fc25x86_64. The binary gets loaded low in memory
> (0x555555554000 this test run, should have some randomization in there).
> 
> We need to ensure that brk() never overflows the 128MB allocated for the
> heap, running into the stack area. The malloc() implementation in glibc
> should fall back to mmap() once brk() fails to return more memory, making
> this transparent to userspace.

It does that to some extent (I haven't checked how efficient the fallback is, i.e. whether glibc properly switches over to arenas, instead of using mmap for individual allocations).  Sorry, my reproducer isn't very clear regarding this.

The problem is that at this point, the heap has already been exhausted and grown as closely as possible to the stack, disregarding the configured stack size.  This means that the stack cannot grow further when the deep recursion starts.

The brk behavior (consuming address space which could also be used by the stack due to the configured stack size) is probably not *that* buggy, assuming that there is a strict address space limit and you can't satisfy all potential heap and stack allocations from the gap at once.  The real problem, IMHO, is that the gap is so small.  glibc can't really work around either problem while still using brk for the main arena.  We just don't know when to stop using the main arena so that we don't trigger the stack overlap.

(In reply to Neil Horman from comment #11)
> I think the most direct answer to our building problem here is to just
> ensure that limits.conf is configured on the builders such that stack space
> has unlimited (or a very large value)

This seems to have the side effect of increasing the gap, yes.  Setting it to a few GiB should prevent build issues on 64-bit architectures.  I don't know yet if this problem actually applies to 32-bit architectures, where this approach would be far less compelling.  But the i386 compat mode on i686 puts the heap at a completely different place, so that it's not next to the stack, and the problem cannot really arise anymore:

00112000-0c944000 rw-p 00000000 00:00 0                                  [heap]
…
fabfb000-ffffe000 rw-p 00000000 00:00 0                                  [stack]

It's odd that i386 (in compat mode) with its limited address space doesn't use the stack/heap sharing trick used in the x86_64 case.

Maybe it's just a side effect of where the main executable is put in the address space (in the middle on current Fedora, and on RHEL 6 high for x86_64, low for i386).  The brk heap must start after the data segment, otherwise Emacs will probably no longer work (please don't ask).

Comment 13 Florian Weimer 2017-01-04 20:27:48 UTC
This may be related (it went into Linux 3.2):

commit a3defbe5c337dbc6da911f8cc49ae3cc3b49b453
Author: Jiri Kosina <jkosina@suse.cz>
Date:   Wed Nov 2 13:37:41 2011 -0700

    binfmt_elf: fix PIE execution with randomization disabled
    
    The case of address space randomization being disabled in runtime through
    randomize_va_space sysctl is not treated properly in load_elf_binary(),
    resulting in SIGKILL coming at exec() time for certain PIE-linked binaries
    in case the randomization has been disabled at runtime prior to calling
    exec().
    
    Handle the randomize_va_space == 0 case the same way as if we were not
    supporting .text randomization at all.

    Based on original patch by H.J. Lu and Josh Boyer.

Cc: Josh for comments.

Comment 19 Rik van Riel 2017-01-05 14:03:43 UTC
I believe we want this at least for x86-64.

I could see leaving i686, PPC, and s390 alone, just because of where we are in the release cycle.

Comment 20 Josh Boyer 2017-01-05 14:12:35 UTC
(In reply to Florian Weimer from comment #13)
> This may be related (it went into Linux 3.2):
> 
> commit a3defbe5c337dbc6da911f8cc49ae3cc3b49b453
> Author: Jiri Kosina <jkosina@suse.cz>
> Date:   Wed Nov 2 13:37:41 2011 -0700
> 
>     binfmt_elf: fix PIE execution with randomization disabled
>     
>     The case of address space randomization being disabled in runtime through
>     randomize_va_space sysctl is not treated properly in load_elf_binary(),
>     resulting in SIGKILL coming at exec() time for certain PIE-linked
> binaries
>     in case the randomization has been disabled at runtime prior to calling
>     exec().
>     
>     Handle the randomize_va_space == 0 case the same way as if we were not
>     supporting .text randomization at all.
> 
>     Based on original patch by H.J. Lu and Josh Boyer.
> 
> Cc: Josh for comments.

Oof.  That was 5 years ago.  I don't remember much about it, but I think it was a result of this bug report:

https://bugzilla.redhat.com/show_bug.cgi?id=708563

Comment 24 Jan Kurik 2017-12-06 10:19:31 UTC
Red Hat Enterprise Linux 6 is in the Production 3 Phase. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Note that a strong business justification will be required for re-evaluation. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com/

