Bug 435337

Summary: pthread_attr_setstacksize is incorrect on PPC and S390
Product: Red Hat Enterprise Linux 5
Reporter: Tom Lane <tgl>
Component: glibc
Assignee: Jakub Jelinek <jakub>
Status: CLOSED NOTABUG
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: high
Version: 5.1
CC: David.Holmes, dwmw2, fweimer, gbenson, hhorak, jwboyer, langel
Target Milestone: rc
Keywords: Reopened
Target Release: ---
Hardware: ppc64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2008-03-11 09:28:45 UTC
Attachments:
Description Flags
Oddity none

Description Tom Lane 2008-02-28 19:24:35 UTC
Description of problem:
The thread stack actually provided is 64K less than requested.

Version-Release number of selected component (if applicable):
glibc-2.5-18.ppc

How reproducible:
100%

Steps to Reproduce:
size_t thread_stack = 0x80000;
pthread_attr_setstacksize(&connection_attrib,thread_stack);
pthread_create(...);
Check the memory map of the process, e.g. via /proc/<pid>/maps
  
Actual results:
Area allocated for the thread stack is 0x70000, with a 0x10000 dead zone in front of it
(maybe the bug amounts to mis-counting the dead zone as part of the stack?)

Expected results:
Usable size of thread stack must be at least as large as requested.

Additional info:
I am not certain if this is a glibc or kernel bug.  I am seeing it under RHEL5.1 release, and it is
also affecting PPC builds in Fedora rawhide --- but I'm told the Koji build machines are using
a RHEL5 kernel so maybe the problem comes from there.  The problem appeared on the Koji
machines sometime between 2007-Dec-13 and 2008-Jan-08, and I was not seeing it when
I rebuilt mysql in RHEL-5 brew on 2007-Dec-13 either.

I don't see any comparable problem on non-PPC arches, so the 64K vs 4K page size business
may well factor in here somewhere.  Also, I failed to replicate it on PPC (not PPC64) hardware
using F9-alpha or recent rawhide kernel.

I'm rating this as high priority because it makes mysql vulnerable to crashing in situations where
it ought to recover with a "statement too complex" type of error.  It wouldn't be out of the realm
of reason to call it a security bug.
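
The reproduction steps above can be sketched as a small helper that measures the mapping from inside the new thread instead of inspecting /proc by hand. This is only a sketch under assumptions: the names mapping_size_containing, probe and measure_thread_stack are made up for illustration, it relies on the Linux-specific /proc/self/maps, and the kernel may merge adjacent mappings that share permissions, so the result is an approximation of the readable/writable region holding the stack, not an exact usable size.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: size in bytes of the /proc/self/maps entry that
   contains addr, or 0 if no entry is found.  */
static size_t mapping_size_containing(const void *addr)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512];
    size_t size = 0;

    if (f == NULL)
        return 0;
    while (fgets(line, sizeof line, f) != NULL) {
        unsigned long lo, hi;
        /* Each entry starts with "start-end" in hexadecimal.  */
        if (sscanf(line, "%lx-%lx", &lo, &hi) == 2
            && (uintptr_t)addr >= lo && (uintptr_t)addr < hi) {
            size = hi - lo;
            break;
        }
    }
    fclose(f);
    return size;
}

/* Thread body: look up the mapping that holds one of our own locals.  */
static void *probe(void *arg)
{
    char marker;  /* lives on this thread's stack */
    *(size_t *)arg = mapping_size_containing(&marker);
    return NULL;
}

/* Hypothetical helper: create a thread with the requested stack size and
   return the size of the mapping its stack lives in, or 0 on error.  */
static size_t measure_thread_stack(size_t requested)
{
    pthread_attr_t attr;
    pthread_t th;
    size_t measured = 0;

    if (pthread_attr_init(&attr) != 0)
        return 0;
    if (pthread_attr_setstacksize(&attr, requested) != 0
        || pthread_create(&th, &attr, probe, &measured) != 0) {
        pthread_attr_destroy(&attr);
        return 0;
    }
    pthread_join(th, NULL);
    pthread_attr_destroy(&attr);
    return measured;
}
```

Going by the actual results reported above, measure_thread_stack(0x80000) would be expected to come back as roughly 0x70000 on an affected machine. Build with -pthread.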

Comment 1 Jakub Jelinek 2008-02-28 19:42:31 UTC
Please read
man 3p pthread_attr_getguardsize
The default guard size is one page, in RHEL5-ish and some other kernels ppc has
64K page, F8/F9 ppc64 kernels use AFAIK 4K pages.
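
To make the arithmetic concrete, here is a minimal sketch of how a program could query the page size and the default guard size, together with a deliberately simplified model of the behaviour described in this comment. The helper names are hypothetical, and the model ignores rounding and any per-architecture adjustments.

```c
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: system page size in bytes, or 0 if unavailable.  */
static size_t page_size_bytes(void)
{
    long ps = sysconf(_SC_PAGESIZE);
    return ps > 0 ? (size_t)ps : 0;
}

/* Hypothetical helper: guard size of a freshly initialised attribute
   object, i.e. the default used when the program sets none.
   Returns 0 on error.  */
static size_t default_guard_bytes(void)
{
    pthread_attr_t attr;
    size_t guard = 0;

    if (pthread_attr_init(&attr) != 0)
        return 0;
    if (pthread_attr_getguardsize(&attr, &guard) != 0)
        guard = 0;
    pthread_attr_destroy(&attr);
    return guard;
}

/* Simplified model (an assumption, not the library's code): the guard
   area is carved out of the requested stack size.  */
static size_t usable_if_guard_inside(size_t requested, size_t guard)
{
    return requested > guard ? requested - guard : 0;
}
```

With a 64K page, usable_if_guard_inside(0x80000, 0x10000) gives 0x70000, which is the figure in the original report; with a 4K page the same request loses only 0x1000.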

Comment 2 Tom Lane 2008-02-28 19:54:59 UTC
Well, *something* in this area changed recently, because mysql worked fine on RHEL5 up through 
December.  Can you tell me exactly what did change?

Comment 3 Jakub Jelinek 2008-02-28 20:25:39 UTC
Nothing changed really.  The only possible change would be if the buildboxes
were using RHEL4 or some other kernels until December.

Comment 4 Tom Lane 2008-02-28 21:05:03 UTC
BTW, I do not agree with your reading of the specification.  While neither the pthread_attr_setstacksize nor pthread_attr_getguardsize spec pages say in so many words whether the guard area is to be subtracted 
from the requested stack size, the guardsize page says that "... the implementation allocates *extra* 
memory at the overflow end of the stack ..." which to me implies the guard area is *in addition to* the 
stack size.  This is also in accord with common sense; if the guard area is supposed to be included in the 
stack size, wouldn't there be a large warning on the setstacksize page to remind people to allow for it 
when selecting their stack size?  It'd certainly mean that no one could correctly use setstacksize without 
being aware of the guardsize parameter, but there's not even a cross-reference to it on the setstacksize 
page.

So I remain of the opinion that RHEL5's behavior is broken.

Comment 5 David Woodhouse 2008-02-28 23:09:33 UTC
We were building on hosts with 4KiB pages till recently.

Comment 6 Tom Lane 2008-02-28 23:42:50 UTC
BTW, it appears that the brew machines are still using 4KB pages?  I just tested a scratch build and the 
failure doesn't seem to occur in brew.  Any info on when/if brew is likely to transition to 64KB pages?

Comment 7 David Woodhouse 2008-02-29 01:16:21 UTC
I believe they plan to move to RHEL5 for the build system some time "soon". This
would mean 64KiB pages.

Comment 8 Gary Benson 2008-02-29 10:40:12 UTC
IcedTea uses pthread_attr_setstacksize and assumes that the amount it asked for
is the amount it got.  Are you saying that this has *never* been the case?

Comment 9 Tom Lane 2008-02-29 20:28:26 UTC
I have just finished experimenting with RHTS machines.  Using the RHEL5-U1 releases, I find that the 
exact requested stacksize is allocated on i386, x86_64, and ia64.  Only ppc and s390x subtract the guard 
space.  (I didn't try s390 separately.)

Seeing that all the mainstream architectures allocate the full requested stack space, I think your position 
that this is not a bug is completely untenable.  It is hardly likely that any program out there will be 
expecting that it has to add on the guard area.

Comment 10 Jakub Jelinek 2008-02-29 20:45:57 UTC
Then your testing wasn't very good.
Try say:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>   /* system() */
#include <unistd.h>

/* Thread body: dump this process's memory map so the stack and guard
   mappings can be inspected.  */
void *tf (void *arg)
{
  char buf[64];
  snprintf (buf, sizeof buf, "cat /proc/%d/maps", (int) getpid ());
  system (buf);
  return arg;
}

int
main (void)
{
  pthread_attr_t a;
  pthread_attr_init (&a);
  /* Request a 16 MiB stack with a 10 MiB guard area.  */
  pthread_attr_setstacksize (&a, 16 * 1024 * 1024);
  pthread_attr_setguardsize (&a, 10 * 1024 * 1024);
  pthread_t th;
  pthread_create (&th, &a, tf, NULL);
  pthread_join (th, NULL);
  return 0;
}

and you'll see that the guard area is part of the stack sized allocation on all
architectures.

Comment 11 Tom Lane 2008-02-29 21:33:03 UTC
That's a useful test case but I don't think it proves your point.  What I'm seeing with it on my x86_64 box
is that the allocated stack space is one page (4K) larger than it should be according to your argument.
Since one page is the default and typical guard area, the net effect is that a program that is ignorant of
the guard area parameter will get a stack that is exactly the size it asked for.  Thus, I stand by my opinion
that few programs out there will be expecting this behavior.



Comment 12 Jakub Jelinek 2008-02-29 21:46:23 UTC
That's just on i686 and x86_64, iff stacksize is multiple of 64K, one page is
added to avoid page aliasing performance degradation.

Comment 13 Lillian Angel 2008-02-29 21:46:52 UTC
My IcedTea builds all succeed on ppc, ppc64, x86, x86_64. But on the koji
machines, the ppc build fails no matter what. I am certain we are having the
same problem.

Comment 14 Tom Lane 2008-02-29 22:12:24 UTC
Hm, I think this also shows that ia64 is just plain broken.  Consider this variant of your test program:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void *tf (void *arg)
{
  char buf[64];
  snprintf (buf, sizeof buf, "cat /proc/%d/maps", (int) getpid ());
  system (buf);
  sleep(4);
  return arg;
}

int
main (void)
{
  pthread_attr_t a;
  pthread_t th1;
  pthread_t th2;
  pthread_t th3;

  pthread_attr_init (&a);
  pthread_attr_setstacksize (&a, 16 * 1024 * 1024);
//  pthread_attr_setguardsize (&a, 10 * 1024 * 1024);

  pthread_create (&th1, &a, tf, NULL);
  sleep(1);
  pthread_create (&th2, &a, tf, NULL);
  sleep(1);
  pthread_create (&th3, &a, tf, NULL);

  pthread_join (th1, NULL);
  pthread_join (th2, NULL);
  pthread_join (th3, NULL);

  return 0;
}


Running this on a RHEL5-U1 ia64 RHTS machine, the printout shows the thread stack space as

2000000000b18000-2000000000b1c000 ---p 2000000000b18000 00:00 0 
2000000000b1c000-200000000131c000 rw-p 2000000000b1c000 00:00 0 

then

2000000000b18000-2000000000b1c000 ---p 2000000000b18000 00:00 0 
2000000000b1c000-2000000001b18000 rw-p 2000000000b1c000 00:00 0 
2000000001b18000-2000000001b1c000 ---p 2000000001b18000 00:00 0 
2000000001b1c000-200000000231c000 rw-p 2000000001b1c000 00:00 0 

then

2000000000b18000-2000000000b1c000 ---p 2000000000b18000 00:00 0 
2000000000b1c000-2000000001b18000 rw-p 2000000000b1c000 00:00 0 
2000000001b18000-2000000001b1c000 ---p 2000000001b18000 00:00 0 
2000000001b1c000-2000000002b18000 rw-p 2000000001b1c000 00:00 0 
2000000002b18000-2000000002b1c000 ---p 2000000002b18000 00:00 0 
2000000002b1c000-200000000331c000 rw-p 2000000002b1c000 00:00 0 

At each step the newest thread only seems to be getting 8MB not 16 as requested.
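
For checking the sizes in printouts like the ones above, a small parser for a maps line can help. This is a sketch only: parse_maps_line is a hypothetical name, and it treats any mapping with no access permissions ("---") as guard space, which is an assumption that happens to fit this output.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: parse one /proc/<pid>/maps line.  Stores the size
   of the mapping in *size and sets *is_guard to 1 when the mapping has no
   access permissions.  Returns 0 on success, -1 if the line is malformed.  */
static int parse_maps_line(const char *line, unsigned long long *size,
                           int *is_guard)
{
    unsigned long long lo, hi;
    char perms[5];

    if (sscanf(line, "%llx-%llx %4s", &lo, &hi, perms) != 3 || hi < lo)
        return -1;
    *size = hi - lo;
    *is_guard = (strncmp(perms, "---", 3) == 0);
    return 0;
}
```

Fed the first printout, this gives 0x4000 (16K) for the "---p" line and 0x800000 (8MB) for the "rw-p" line. In the later printouts the first thread's "rw-p" region has grown to 0xFFC000, i.e. 16MB less 16K.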




Comment 15 Tom Lane 2008-02-29 22:22:58 UTC
BTW, this last might help explain some bizarre coding I found in mysql:

#if defined(__ia64__) || defined(__ia64)
  /*
    Peculiar things with ia64 platforms - it seems we only have half the
    stack size in reality, so we have to double it here
  */
  pthread_attr_setstacksize(&thr_attr,thread_stack*2);
#else
  pthread_attr_setstacksize(&thr_attr,thread_stack);
#endif

I had thought that this was either nuts or due to insufficient understanding of the guard area
issue, but when I replace this code with something that just adds the guard area size, it crashes
--- on ia64 only.
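
The "just add the guard area size" replacement mentioned here might look roughly like the following. It is a sketch, not the actual mysql change: set_usable_stacksize is a hypothetical name, and whether the extra allowance is needed at all depends on the C library, its version, and the architecture.

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical helper: request `wanted` bytes of usable stack by adding
   the attribute's current guard size to the stack size.  Assumes the
   implementation carves the guard out of the requested size, as reported
   in this bug.  Returns 0 on success or an errno-style code.  */
static int set_usable_stacksize(pthread_attr_t *attr, size_t wanted)
{
    size_t guard = 0;
    int rc = pthread_attr_getguardsize(attr, &guard);

    if (rc != 0)
        return rc;
    if (wanted > SIZE_MAX - guard)
        return EINVAL;  /* the sum would overflow size_t */
    return pthread_attr_setstacksize(attr, wanted + guard);
}
```

Call it after any pthread_attr_setguardsize so that it sees the guard size that will really be used. As noted above, this alone was not sufficient on ia64.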

Comment 16 Jakub Jelinek 2008-02-29 22:30:13 UTC
ia64 has two stacks for each thread, normal stack and register stack.  Normal
stack grows down, register stack grows up, guard page(s) if any are in the middle.

Comment 17 Tom Lane 2008-02-29 22:35:52 UTC
So how does that explain the change in the size of the previous thread's already-allocated stack?

Comment 18 Tom Lane 2008-02-29 22:37:39 UTC
Oh, never mind, I see what you're saying: there's no guard space between one thread's normal stack and
the next one's register stack.  Bizarre.

Comment 19 Tom Lane 2008-02-29 22:46:29 UTC
One more question, if I may.  It looks like on ia64, if you setstacksize to some reasonably-round number,
you get exactly half of that for normal stack and half less the guard area for the register stack.  Correct?
How can one know if this is enough register stack?  The stack depth limiting techniques in both mysql 
and postgresql will (I believe) measure normal stack accurately, but they've got no handle on register stack 
depth AFAICS.  Can the register stack grow faster than normal stack?  Or even as fast?
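
The split described in this comment can be written down as a worked calculation. This models only the reading proposed here, which nobody in the thread confirmed; the struct and function names are hypothetical and nothing below is taken from glibc.

```c
#include <stddef.h>

/* Hypothetical model of the ia64 layout as read in this comment.  */
struct ia64_stack_split {
    size_t normal_bytes;    /* normal stack, grows down */
    size_t register_bytes;  /* register stack, grows up */
};

/* Half of the requested size goes to the normal stack; the other half,
   less the guard area, goes to the register stack.  */
static struct ia64_stack_split ia64_split_model(size_t stacksize, size_t guard)
{
    struct ia64_stack_split s;
    size_t half = stacksize / 2;

    s.normal_bytes = half;
    s.register_bytes = half > guard ? half - guard : 0;
    return s;
}
```

For a 16MB request with a 16K guard this gives 0x800000 of normal stack and 0x7FC000 of register stack. Their sum, 0xFFC000, matches the size of the merged "rw-p" mapping in the ia64 printout from comment 14.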

Comment 20 Josh Boyer 2008-03-06 15:27:41 UTC
(In reply to comment #12)
> That's just on i686 and x86_64, iff stacksize is multiple of 64K, one page is
> added to avoid page aliasing performance degradation.

So we have varying behavior depending on the stacksize set?  Reading the two man
pages involved we have:

pthread_attr_getguardsize:

"If a thread’s stack is  created with guard protection, the implementation
allocates extra memory at the overflow end of the stack as a buffer against
stack  overflow of the stack pointer."

pthread_attr_setstacksize:

"The  stacksize attribute shall define the minimum stack size (in bytes)
allocated for the created threads stack."

Note that getguardsize explicitly states that the implementation should allocate
_extra_ memory, and setstacksize should provide the _minimum_ stack size in bytes.

To my reading, this means that glibc should pad out the guard page
unconditionally.  I'm a bit confused as to how the current behavior can be
considered conforming to POSIX.  Jakub could you explain that please?

With a less than strict reading, certain parts might be a bit ambiguous so
perhaps this needs to go to the standards committee for clarification.  In the
meantime however, programmers need to be aware of the current behavior.  Could
we perhaps add a brief section to the man page of pthread_attr_setstacksize that
describes its interaction with the guard page?

Comment 21 Gary Benson 2008-03-06 16:27:21 UTC
Given the IA64 situation I've come to the conclusion that the current
implementation is the best you could do.  The alternative would be to end up
with threads allocating way more stack than expected, just to hide some complexity
from application developers.

Comment 22 Gary Benson 2008-03-06 16:32:00 UTC
Created attachment 297069 [details]
Oddity

Out of interest, I noticed it's impossible to allocate a stack that's a power
of two pages in size on i386 and x86_64 machines: you get exactly one page more
than you asked for:

  to-gcj-1:[~]$ cat /etc/fedora-release
  Fedora release 8 (Werewolf)
  to-gcj-1:[~]$ uname -a
  Linux to-gcj-1.yyz.redhat.com 2.6.23.1-42.fc8 #1 SMP Tue Oct 30 13:18:33 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
  to-gcj-1:[~]$ gcc -o sticky-stacker -lpthread sticky-stacker.c && ./sticky-stacker
  Requested 512000, got 512000
  Requested 516096, got 516096
  Requested 520192, got 520192
  Requested 524288, got 528384
  Requested 528384, got 528384
  Requested 532480, got 532480
  Requested 536576, got 536576

Is this expected?

Comment 23 Jakub Jelinek 2008-03-06 16:52:06 UTC
Yes, that's expected:
          /* To avoid aliasing effects on a larger scale than pages we
             adjust the allocated stack size if necessary.  This way
             allocations directly following each other will not have
             aliasing problems.  */
#if MULTI_PAGE_ALIASING != 0
          if ((size % MULTI_PAGE_ALIASING) == 0)
            size += pagesize_m1 + 1;
#endif
and
libc/nptl/sysdeps/i386/i686/Makefile:CFLAGS-pthread_create.c += -DMULTI_PAGE_ALIASING=65536
libc/nptl/sysdeps/x86_64/Makefile:CFLAGS-pthread_create.c += -DMULTI_PAGE_ALIASING=65536
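
The effect of that adjustment on the numbers in comment 22 can be reproduced with a standalone sketch. aliasing_adjusted_size is a hypothetical name, and the sketch assumes that pagesize_m1 in the quoted snippet is the page size minus one, so that pagesize_m1 + 1 is one full page.

```c
#include <stddef.h>

/* Hypothetical helper mirroring the quoted snippet: when the size is an
   exact multiple of the aliasing distance, add one page.  A distance of 0
   disables the adjustment, like MULTI_PAGE_ALIASING == 0.  */
static size_t aliasing_adjusted_size(size_t size, size_t pagesize,
                                     size_t multi_page_aliasing)
{
    if (multi_page_aliasing != 0 && size % multi_page_aliasing == 0)
        size += pagesize;
    return size;
}
```

With 4096-byte pages and an aliasing distance of 65536, a request of 524288 becomes 528384, while 520192 and 528384 are left alone, which agrees with the output in comment 22. The trigger is any multiple of 64K, not only powers of two.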


Comment 24 Gary Benson 2008-03-06 17:41:38 UTC
Cool, I thought it would be but I wanted to check.

Comment 25 Josh Boyer 2008-03-10 12:58:29 UTC
> To my reading, this means that glibc should pad out the guard page
> unconditionally.  I'm a bit confused as to how the current behavior can be
> considered conforming to POSIX.  Jakub could you explain that please?
> 
> With a less than strict reading, certain parts might be a bit ambiguous so
> perhaps this needs to go to the standards committee for clarification.  In the
> meantime however, programmers need to be aware of the current behavior.  Could
> we perhaps add a brief section to the man page of pthread_attr_setstacksize that
> describes its interaction with the guard page?

Jakub, any comments on this at all?



Comment 26 Jakub Jelinek 2008-03-11 09:28:45 UTC
I've talked to Ulrich about this and he says this is intentional and not
violating POSIX.

Comment 27 Tom Lane 2008-03-11 16:25:55 UTC
It still desperately needs a documentation change, as suggested at comment #20.

Comment 28 David Holmes 2008-03-13 23:54:44 UTC
I agree with Tom Lane and Josh Boyer, the guard pages should be in addition to
the stack usable by the thread. Gary Benson just raised this issue with OpenJDK
because our code expects the glibc guard-page to be outside the stack requested
by setstacksize, or as reported by getstacksize.