1623519 – glibc: stdlib/tst-setcontext9 test suite failure

Bug 1623519 - glibc: stdlib/tst-setcontext9 test suite failure

Summary: glibc: stdlib/tst-setcontext9 test suite failure

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Sub Component:
Version:	29
Hardware:	i686
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Florian Weimer
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1623536
TreeView+	depends on / blocked

Reported:	2018-08-29 14:20 UTC by Florian Weimer
Modified:	2018-11-07 08:08 UTC (History)
CC List:	10 users (show)
Fixed In Version:	glibc-2.28.9000-8.fc30 glibc-2.28-16.fc29
Clone Of:
Clones:	1623536 (view as bug list)
Environment:
Last Closed:	2018-11-07 08:08:36 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Sourceware	23717	0	None	None	None	2018-09-26 22:26:20 UTC

Description Florian Weimer 2018-08-29 14:20:13 UTC

From the glibc-2.28-9.fc29.i686 build log:

=====FAIL: stdlib/tst-setcontext9.out=====
making contexts
swap contexts
start f1
swap contexts in f2
set context in f1
set context
Didn't expect signal from child: got `Segmentation fault'

This was observed with kernel 4.17.3-200.fc28.x86_64.

I've seen it on rawhide and on other architectures as well (but not x86_64).

Comment 1 Florian Weimer 2018-08-29 15:42:52 UTC

The value of %ebx stored on the stack, used after resuming f2 and calling puts ("end f2") is corrupted.

The corruption appears to happen here:

Hardware watchpoint 6: *(int *)0xffffd44c

Old value = 1448452096
New value = 1448438070
0xf7e613ee in swapcontext () from /lib/libc.so.6

#0  0xf7e613ee in swapcontext () from /lib/libc.so.6
#1  0x56556936 in f1 () at tst-setcontext9.c:47
#2  0xf7e6134f in makecontext () from /lib/libc.so.6
#3  0x5655a100 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

(gdb) info symb 0x5655a100
ctx in section .bss of …/stdlib/tst-setcontext9

It seems that the allocated stack is reused.  I don't know why this happens.

Comment 2 Carlos O'Donell 2018-08-30 11:43:15 UTC

This is specific to the toolchain used here. It doesn't reproduce for i686 using the F28 toolchain. I'm going to dig a bit more here.

Comment 3 Florian Weimer 2018-08-30 11:50:04 UTC

(In reply to Carlos O'Donell from comment #2)
> This is specific to the toolchain used here. It doesn't reproduce for i686
> using the F28 toolchain. I'm going to dig a bit more here.

Did you use the RPM build flags?  I think it needs PIE to trigger on i686.  The bug will likely reproduce on other architectures which spill the GOT pointer, too.

Comment 4 Carlos O'Donell 2018-08-30 17:54:57 UTC

Fails in an i686 F29 mock chroot, running on F28. So it does indeed seem to be a toolchain issue. Looking at the differences now.

Comment 5 Carlos O'Donell 2018-08-31 03:50:09 UTC

(In reply to Florian Weimer from comment #1)
> The value of %ebx stored on the stack, used after resuming f2 and calling
> puts ("end f2") is corrupted.

I can confirm that using the proper GOT value as computed from the start of the function we can find the .rodata pointer and fined the string constant 'end f2' correctly:

(gdb) p/x $edx
$29 = 0x804c000
      ^^^^^^^^^ $edx is PC, plus offset == .got.plt

(gdb) p/x $edx - 0x1fcc
$27 = 0x804a034
      ^^^^^^^^^ got relative address of rodata constant.

(gdb) x/7c 0x804a034
0x804a034:      101 'e' 110 'n' 100 'd' 32 ' '  102 'f' 50 '2'  0 '\000'

However, at the end of f2 we have:

   0x0804947a <+122>:   mov    0x18(%esp),%esi

(gdb) p/x $esi
$33 = 0x8049548

Should have been 0x804c000 as above.

As you note it looks like the stack is being reused and overwritten.

In a "working" version of the test program the compiler picks $ebx, and because this is callee-saves it doesn't have to save and restore this to the stack.

Quick review of f2 stack manipulations (bytes):
- call __x86.get_pc_thunk.dx (store pc in edx)
- push esi (Down 0x4, total 0x4) [save esi]
- push ebx (Down 0x4, total 0x8) [save ebx]
- sub 32 (Down 0x20, total 0x28)
- compute .got.plt address and store to esp+0x18

esp+24 = saved esi
esp+20 = saved ebx
esp+1c to esp+1f = empty (4 bytes)
esp+18 = saved .got.plt address
...
esp+0 = nothing yet

- .got.plt address located at esp+0x18
- push eax (Down 0x4, total 0x2c)
- pop edx (Up 0x4, total 0x28)
- pop ecx (Up 0x4, total 0x24)
- $ebx / GOT located at $esp+0x14
- push eax (Down 0x4, total 0x28)
- push eax (Down 0x4, total 0x2c)
- .got.plt address located at $esp+0x1c
- swapcontext call
- add 0x10 esp (Up 0x10, total 0x1c)
- .got.plt address located at $esp+0xc
- sub 0xc esp (Down 0xc, total 0x28)
- .got.plt address located at esp+0x18

Totals stack manipulations in f2() look correct, and
access via 'mov    0x18(%esp),%esi' looks correct.

A double check around the swapcontext():

(gdb) c
Continuing.
swap contexts in f2

Breakpoint 7, 0x0804943c in f2 () at tst-setcontext9.c:34
34	  if (swapcontext (&ctx[4], &ctx[2]) != 0)
(gdb) x/4x $esp + 0x1c
0xffffc57c:	0x00	0xc0	0x04	0x08
(gdb) c
Continuing.
set context in f1
set context

Breakpoint 6, 0x08049449 in f2 () at tst-setcontext9.c:34
34	  if (swapcontext (&ctx[4], &ctx[2]) != 0)
(gdb) x/4x $esp + 0x1c
0xffffc57c:	0x48	0x95	0x04	0x08
(gdb) 

Shows the stack is corrupted during the call and the on-stack value is not preserved.

(gdb) p/x *(int *)0xffffc57c
$42 = 0x804c000
(gdb) watch *(int *)0xffffc57c
Hardware watchpoint 8: *(int *)0xffffc57c
(gdb) c
Continuing.

Hardware watchpoint 8: *(int *)0xffffc57c

Old value = 134529024
New value = 134517990
0xf7e2d32a in swapcontext () from /lib/libc.so.6
(gdb) bt
#0  0xf7e2d32a in swapcontext () from /lib/libc.so.6
#1  0x080494e6 in f1 () at tst-setcontext9.c:47
#2  0xf7e2d28b in makecontext () from /lib/libc.so.6
#3  0x0804c080 in ?? ()
#4  0xf7e06211 in __libc_start_main () from /lib/libc.so.6
#5  0x0804c000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) 

And this makes perfect sense because...

(In reply to Florian Weimer from comment #1)
> It seems that the allocated stack is reused.  I don't know why this happens.

... the test is buggy.

This makes complete sense. 

We are in f2(), which is a callee of f1(), and when we swapcontext() to f1() via &ctx[2], all we are doing is moving esp back up to the starting value it had when we created &ctx[2], and there is enough work done in f1() after getcontext() that it will clobber all the values on the stack in f2() since f1() need not have made the stack adjustment early in the prologue (if it did then we would have entered f2() with enough stack gap to avoid the clobber).

All of this is exacerbated by -fpie which requires lots of saves/restores from the stack to get all the required addresses computed. It was bound to fail.

In fact there is only 32-bytes of stack between the start of f1() and the start of f2().

The simplest solution that retains the present test structure is:

* Split f1 into f1a and f1b.
* In f1a adjust ctx[2] to have it's own stack via makecontext.
* From f1a call f2() directly.
* Have f1b check for if (done) {...} else exit (EXIT_FAILURE);.

The lesson here is that we should not setcontext back to a caller if you don't know if there is enough stack between the caller and the calee.

I'll hand this back to you so you can look at this in your timezone.

Comment 6 Carlos O'Donell 2018-08-31 03:54:03 UTC

(In reply to Carlos O'Donell from comment #5)
> The lesson here is that we should not setcontext back to a caller if you
> don't know if there is enough stack between the caller and the calee.

... if you plan to return to the callee with another setcontext.

Comment 7 Carlos O'Donell 2018-09-20 02:49:16 UTC

Fixed upstream in master.

commit 791b350dc725545e3f9b5db0f97ebdbc60c9735f (HEAD -> master, origin/master, origin/HEAD)
Author: Carlos O'Donell <carlos>
Date:   Wed Sep 5 01:16:42 2018 -0400

    Fix tst-setcontext9 for optimized small stacks.
    
    If the compiler reduces the stack usage in function f1 before calling
    into function f2, then when we swapcontext back to f1 and continue
    execution we may overwrite registers that were spilled to the stack
    while f2 was executing.  Later when we return to f2 the corrupt
    registers will be reloaded from the stack and the test will crash.  This
    was most commonly observed on i686 with __x86.get_pc_thunk.dx and
    needing to save and restore $edx.  Overall i686 has few registers and
    the spilling to the stack is bound to happen, therefore the solution to
    making this test robust is to split function f1 into two parts f1a and
    f1b, and allocate f1b it's own stack such that subsequent execution does
    not overwrite the stack in use by function f2.
    
    Tested on i686 and x86_64.
    
    Signed-off-by: Carlos O'Donell <carlos>

Comment 8 Florian Weimer 2018-09-26 22:21:09 UTC

The test still fails on POWER in glibc-2.28.9000-7.fc30.  The upstream fix is insufficient.

Comment 9 Carlos O'Donell 2018-09-27 13:10:16 UTC

(In reply to Florian Weimer from comment #8)
> The test still fails on POWER in glibc-2.28.9000-7.fc30.  The upstream fix
> is insufficient.

I will investigate this. I assume the failure can be seen on ppc64le.

Comment 10 Carlos O'Donell 2018-09-27 13:11:50 UTC

(In reply to Carlos O'Donell from comment #9)
> (In reply to Florian Weimer from comment #8)
> > The test still fails on POWER in glibc-2.28.9000-7.fc30.  The upstream fix
> > is insufficient.
> 
> I will investigate this. I assume the failure can be seen on ppc64le.

I see Andreas has fixed this already:

commit f841c97e515a1673485a2b12b3c280073d737890
Author: Andreas Schwab <schwab>
Date:   Thu Sep 27 11:12:13 2018 +0200

    Fix stack overflow in tst-setcontext9 (bug 23717)
    
    The function f1a, executed on a stack of size 32k, allocates an object of
    size 32k on the stack.  Make the stack variables static to reduce
    excessive stack usage.

And cherry picked into 2.28.

commit 3a67c72c1512f778304a5644dea2fcf5bdece274
Author: Andreas Schwab <schwab>
Date:   Thu Sep 27 12:37:06 2018 +0200

    Fix stack overflow in tst-setcontext9 (bug 23717)
    
    The function f1a, executed on a stack of size 32k, allocates an object of
    size 32k on the stack.  Make the stack variables static to reduce
    excessive stack usage.
    
    (cherry picked from commit f841c97e515a1673485a2b12b3c280073d737890)

Comment 11 Florian Weimer 2018-09-27 13:15:22 UTC

Correct, fixes are under way.

Comment 12 Fedora Update System 2018-11-03 07:01:05 UTC

glibc-2.28-17.fc29 has been submitted as an update to Fedora 29. https://bodhi.fedoraproject.org/updates/FEDORA-2018-0e5874aba9

Comment 13 Fedora Update System 2018-11-04 06:38:29 UTC

glibc-2.28-17.fc29 has been pushed to the Fedora 29 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-0e5874aba9

Comment 14 Fedora Update System 2018-11-07 02:40:46 UTC

glibc-2.28-17.fc29 has been pushed to the Fedora 29 stable repository. If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.