494631 – glibc malloc change breaks emacs 23 and mysql

Summary: glibc malloc change breaks emacs 23 and mysql

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	glibc
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	---
Assignee:	Jakub Jelinek
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	F11Preview

Reported:	2009-04-07 16:51 UTC by Andy Wingo
Modified:	2018-04-11 06:46 UTC
CC List:	11 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-04-16 21:15:51 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Patch to reduce stack usage in __get_nprocs if stack limit is very low (1.20 KB, patch) 2009-04-15 12:28 UTC, Jakub Jelinek
Patch to truncate very long lines. (976 bytes, patch) 2009-04-15 12:35 UTC, Jakub Jelinek

Description Andy Wingo 2009-04-07 16:51:53 UTC

Hi,

glibc-2.9.90-12 configures glibc with the --enable-experimental-malloc option. For some reason, this breaks Emacs 23 at runtime. For example, starting an emacs from git from a few days ago always fails with "bad regex" errors, though which error is nondeterministic.

I tried recompiling emacs from source, but because emacs is run as part of its build procedure that failed too, hanging at the temacs bootstrap stage. NB, the C part of emacs compiled, but failed again at runtime.

Downgrading to glibc 2.9.90-8 solves the problem for me.

Thanks,

Andy

Comment 1 Panu Matilainen 2009-04-08 09:13:57 UTC

There are some indicators that glibc-2.9.90-12 (experimental malloc?) breaks rpm too:
- on April 1st when glibc-2.9.90-12 was built, rawhide composes started acting up, more precisely rpm macro expansions show memory corruption like:
    error: Unterminated {: {%{_keyringpath}/*.ke
      0< /tmp/yumdir.21410/
    error: Unterminated {: {%{_dbpathrp÷(
      0< /tmp/yumdir.21410/
- there have been a few failed koji builds with the same symptoms
- https://www.redhat.com/archives/fedora-test-list/2009-April/msg00515.html reports the same symptoms, attempted F10 -> rawhide update broke early on and the system is essentially F10 but with rawhide glibc

Couple of data points:
- this seems to be limited to 32bit x86, at least so far nobody has reported seeing this breakage on other archs
- rpm macro expansion does a downsize realloc(), which is perhaps not all that common a thing to do (the macro expansion size is unknown, so it grabs a large buffer and then reallocs down to the actual needed size when done expanding)
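
The downsize realloc() pattern described in the last data point can be sketched as follows. This is a minimal illustration, not rpm's code: `expand_then_shrink` and its parameters are hypothetical names, and the `memcpy` stands in for the real macro expansion.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not rpm's code: build a result in a generously
   sized scratch buffer, then shrink the allocation to the bytes used. */
char *expand_then_shrink(const char *src, size_t scratch_size)
{
    size_t used = strlen(src) + 1;          /* bytes needed, including NUL */
    assert(scratch_size >= used);           /* scratch must hold the result */

    char *buf = malloc(scratch_size);       /* final size unknown: grab a lot */
    if (buf == NULL)
        return NULL;
    memcpy(buf, src, used);                 /* stand-in for the real expansion */

    char *shrunk = realloc(buf, used);      /* the downsize realloc() */
    return shrunk != NULL ? shrunk : buf;   /* on failure the old block stays valid */
}
```

The caller owns the returned buffer and releases it with free(). A shrinking realloc() exercises a different allocator path from a plain malloc()/free() pair, which is why it is worth noting as a data point.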

Comment 2 Jakub Jelinek 2009-04-08 09:22:52 UTC

#c1 is unrelated. Yes, it could be due to the malloc changes, but it has nothing else in common with the emacs issue.  emacs is broken because it uses malloc_{get,set}_state, saving state when run with one version of glibc and restoring it with a different one, which effectively makes most changes to malloc's internals an ABI change for emacs.
I don't think rpm uses this.
So for #c1 it would be better to track it under a separate bug and, if possible, give Ulrich an easy reproducer.

Comment 3 Tom Lane 2009-04-08 20:12:09 UTC

FWIW, this change seems to have also broken mysql, but only on x86_64.  I was tearing my hair for most of yesterday evening and today, but the failure disappeared after updating from glibc-2.9.90-12 to -14.
I didn't get much further than identifying that it dumped core down inside a calloc call.

Could we perhaps not enable "experimental" stuff during beta?

Comment 4 Andy Wingo 2009-04-09 02:38:34 UTC

I do not think the emacs issue is purely one of a changing malloc ABI, given that with glibc 2.9.90-12, even a freshly-recompiled emacs doesn't bootstrap.

I am on x86-32.

Comment 5 Tom Lane 2009-04-10 16:45:51 UTC

The mysql failure is back with -15 :-(

Comment 6 Jesse Keating 2009-04-13 18:26:45 UTC

Putting this on the blocker.  Jakub, we really need a resolution here, final freeze is tomorrow.

Comment 7 Valdis Kletnieks 2009-04-14 03:21:59 UTC

I may be seeing this one here as well - admittedly, it's against 64-bit nightly trunk builds of Firefox 3.6a1pre with the beta 64-bit Adobe Flash plugin. But data points:

1) It started with -12, and goes away if I put back an older glibc.
2) The firefox crash is pretty obviously a stray/duff pointer - a SIGSEGV

Comment 8 Jakub Jelinek 2009-04-14 14:20:56 UTC

emacs doesn't seem to build with either glibc-2.9.90-14 or -15, at least not on x86_64, with F10 kernel.  Both segfault, so I'd say the bug is on the emacs side.

Comment 9 Tom Lane 2009-04-15 00:18:55 UTC

mysql failure still there with -16 ... so apparently the "final freeze" snapshot is going to include a known broken mysql.

Comment 10 Jakub Jelinek 2009-04-15 09:14:39 UTC

Regarding emacs, after upgrading to a non-broken F10 kernel I was able to build
emacs against glibc-2.9.90-11 on x86_64 just fine and run the resulting dumped emacs against glibc-2.9.90-16, and also dump emacs against glibc-2.9.90-16 and run the resulting dumped binary, so I can't reproduce any issues.

For mysql, are there any details on how it can be reproduced?  What kind of debugging have you done to ensure it isn't mysql's fault, which would be very likely?  The changes in malloc are not very large; they only affect which arena is selected for which thread and whether locking is needed or atomic operations are used instead in some cases.

Comment 11 Jakub Jelinek 2009-04-15 10:54:56 UTC

I managed to reproduce the mysql test failure.  The problem is that it creates threads with very small stack sizes:
4142  mmap(NULL, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fa870662000
4142  mprotect(0x7fa870662000, 4096, PROT_NONE) = 0
4142  clone(child_stack=0x7fa8706651f0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fa8706659e0, tls=0x7fa870665910, child_tidptr=0x7fa8706659e0) = 4174
4142  futex(0x24b4014, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
4174  set_robust_list(0x7fa8706659f0, 0x18) = 0
4174  --- SIGSEGV (Segmentation fault) @ 0 (0) ---

Don't know what exactly was passed to pthread_attr_setstacksize yet, but clearly from the 16KB allocated the low 4KB is used for guard page, then 8688 bytes from
the top of guard page till child stack and the rest is struct pthread/TLS blocks etc.
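
The arithmetic behind these numbers can be checked with a short sketch. The two helper names are hypothetical; the addresses and sizes are the ones from the strace output above.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of usable stack between the top of the guard region and the
   initial stack pointer passed to clone(). The stack grows downward. */
uint64_t usable_stack_bytes(uint64_t map_base, uint64_t guard_size,
                            uint64_t child_stack)
{
    uint64_t guard_top = map_base + guard_size;
    assert(child_stack >= guard_top);
    return child_stack - guard_top;
}

/* Bytes above the initial stack pointer (struct pthread, TLS blocks, etc.). */
uint64_t bytes_above_stack(uint64_t map_base, uint64_t map_size,
                           uint64_t child_stack)
{
    uint64_t map_end = map_base + map_size;
    assert(child_stack <= map_end);
    return map_end - child_stack;
}
```

With map_base 0x7fa870662000, a 4096-byte guard and child_stack 0x7fa8706651f0, the first helper gives 8688 and the second 3600, and 4096 + 8688 + 3600 = 16384. A single 8192-byte local buffer therefore leaves only 496 bytes for every other stack frame in the thread.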

What changed is that malloc now makes a new call in reused_arena:

  static int narenas_limit;
  if (narenas_limit == 0)
    {
      if (mp_.arena_max != 0)
        narenas_limit = mp_.arena_max;
      else
        {
          int n  = __get_nprocs ();

and __get_nprocs does:

int
__get_nprocs ()
{
  /* XXX Here will come a test for the new system call.  */

  char buffer[8192];
  char *const buffer_end = buffer + sizeof (buffer);
...

which is already too much for the extremely limited stack.  While what mysql does is very stupid, I'm afraid we need to tolerate it, because on x86_64 PTHREAD_STACK_MIN is still 16KB, not more.  Unfortunately, doing
if (__libc_use_alloca (8192)) buffer = alloca (8192); else buffer = malloc (8192);
here isn't going to work well, because when __get_nprocs is called from inside malloc, that would be a recursive call.

Comment 12 Jakub Jelinek 2009-04-15 12:28:08 UTC

Created attachment 339675
Patch to reduce stack usage in __get_nprocs if stack limit is very low

Comment 13 Jakub Jelinek 2009-04-15 12:35:20 UTC

Created attachment 339679
Patch to truncate very long lines.

The second patch is to avoid next_line returning parts of the same line for very
long lines, instead it just truncates them at 3/4 of the buffer size.  Looking at all users of this function, they all look just at < 30 initial bytes of the
line anyway, never more, so when we truncate lines after 384 bytes, it should
work IMHO better than pretending there was a line break inside of very long
line.
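
The truncation idea can be sketched as follows. This is a hypothetical illustration working on an in-memory string, not glibc's next_line or the attached patch; `next_line_truncated` and its parameters are made-up names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration, not glibc's next_line(): copy the next line
   at *cursor into out (capacity cap), keeping at most 3/4 of cap bytes.
   The tail of an over-long line is skipped, so the following call starts
   at the next real line. Returns 0 at end of input, 1 otherwise. */
int next_line_truncated(const char **cursor, char *out, size_t cap)
{
    assert(cap >= 4);
    const char *p = *cursor;
    if (*p == '\0')
        return 0;

    size_t len = strcspn(p, "\n");            /* length of the full line */
    size_t limit = cap / 4 * 3;               /* truncate at 3/4 of the buffer */
    size_t keep = len < limit ? len : limit;

    memcpy(out, p, keep);
    out[keep] = '\0';                         /* keep < cap, so this fits */

    p += len;                                 /* skip the discarded tail */
    if (*p == '\n')
        p++;
    *cursor = p;
    return 1;
}
```

With a 512-byte buffer the limit is 384 bytes, matching the figure above. Because callers only inspect the first few bytes of each line, dropping the tail loses nothing they use, whereas returning the tail as if it were a new line could make it match by accident.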

Comment 14 Ulrich Drepper 2009-04-15 16:04:10 UTC

I have added the two patches plus a third to handle programs with small stacks.  Jakub will likely build a new glibc today.

Comment 15 Tom Lane 2009-04-15 17:00:33 UTC

Thanks for investigating this.

I poked around in the mysql sources, and it appears that it's selecting a thread stack size of 64K on x86_64 in the program that's failing.  So while I agree that it's a bad idea for malloc to be eating 8K of stack, I'm not entirely convinced that that explains the crash.  Is it possible that the code is being called recursively?

We've seen before that the mysql guys are ridiculously optimistic about the amount of stack space required, and I'm actually carrying a patch to increase the stack size request in another place in their code.  I'm tempted to patch it higher here too, but I'm not 100% sure I'm looking at the right place.

Comment 16 Tom Lane 2009-04-15 18:16:15 UTC

Ah, ignore the above --- I just found another place that was asking for only 16K stack, and increasing that request seems to make the crash go away.  Still, it's a bit mystifying that this fails only on x86_64; the problem should be as bad or worse on the other three arches.  Is that new code in malloc somehow x86_64 specific?  Is it still true that the stack guard space is silently subtracted from the stack size request (and if so, what happens when the request is <= one page?)

Comment 17 Tom Lane 2009-04-15 18:42:56 UTC

To clarify: I'm wondering why this code works at all on PPC, seeing that (last I heard) our PPC builders have 64K page size.  The algorithm it's using to choose the thread stack size is

1. Start with an 8K basic request (or in the other place I found earlier, a 32K basic request).
2. Double the request if sizeof(pointer) == 8.
3. If less than PTHREAD_STACK_MIN, increase to PTHREAD_STACK_MIN.
4. pthread_attr_setstacksize with this.  Don't change the default guard size.

Unless we're using quite a large PTHREAD_STACK_MIN value on PPC(64), or there's some other special rule on those platforms, this really ought to crash and burn everywhere given that malloc needs at least 8K.  So I'm still confused why it fails to fail on three out of four platforms.
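
Steps 1 to 3 of that algorithm can be restated as a small sketch. The function and parameter names are hypothetical and not taken from the mysql sources; the pointer size and minimum are passed in so that other platforms can be tried.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical restatement of steps 1-3; the names are not mysql's. */
size_t choose_stack_size(size_t base_request, size_t pointer_size,
                         size_t stack_min)
{
    assert(base_request > 0);
    size_t size = base_request;       /* step 1: basic request */
    if (pointer_size == 8)
        size *= 2;                    /* step 2: double on 64-bit */
    if (size < stack_min)
        size = stack_min;             /* step 3: clamp to PTHREAD_STACK_MIN */
    return size;
}
```

With the 16KB x86_64 PTHREAD_STACK_MIN mentioned in comment 11, an 8K basic request yields 16K and a 32K basic request yields 64K, which matches the two stack sizes found in the mysql sources. A much larger platform minimum would dominate the result in step 3.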

Comment 18 Ulrich Drepper 2009-04-15 18:52:33 UTC

(In reply to comment #17)
> To clarify: I'm wondering why this code works at all on PPC, seeing that (last
> I heard) our PPC builders have 64K page size.  The algorithm it's using to
> choose the thread stack size is

The pthread_attr_setstacksize function (and the others, which shouldn't be used) checks that the size is not below PTHREAD_STACK_MIN.  On ppc the limit is 131072.
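
A minimal sketch of how that check shows up to a caller, assuming a POSIX threads implementation that rejects undersized requests with EINVAL as the specification requires. `try_set_stack_size` is a hypothetical wrapper name.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical wrapper: returns 0 on success, otherwise the error code
   reported by the pthread attribute calls. */
int try_set_stack_size(size_t size)
{
    assert(size > 0);                 /* zero is never a meaningful request */

    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_attr_setstacksize(&attr, size);   /* EINVAL if below the minimum */
    int rc_destroy = pthread_attr_destroy(&attr);
    return rc != 0 ? rc : rc_destroy;
}
```

A request of 1 byte is expected to be refused with EINVAL, while a 1 MiB request should be accepted. Code that ignores the return value and then creates the thread gets the default stack size rather than the one it asked for.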

Comment 19 Tom Lane 2009-04-15 21:58:58 UTC

I've successfully built mysql in HEAD (which still has glibc -16) using a patch that knocks up that 8K request to 32K (64K on 64-bit platforms).  So that confirms the diagnosis.  This can be closed out as far as mysql is concerned.

Comment 20 Andy Wingo 2009-04-16 09:50:55 UTC

I have no more problems with emacs and glibc 2.9.90-15. My symptom of this problem is fixed. Thanks, Jakub & all.

Comment 21 Jesse Keating 2009-04-16 19:00:32 UTC

Putting on F11Preview list, as something(s) need to get tagged for F11 to be in preview.

Comment 22 Jakub Jelinek 2009-04-16 21:15:51 UTC

glibc-2.9.90-19 in rawhide has much smaller stack requirements for malloc (instead of 8KB just 512B if stacksize is low).

Comment 23 Tom Lane 2009-04-16 22:37:05 UTC

Will this be pushed into F-11 final as well?  If not I need to request an update of mysql, because it's entirely broken in F-11 ATM.

Comment 24 Jakub Jelinek 2009-04-16 22:42:14 UTC

koji latest-pkg dist-f11 glibc
Build                                     Tag                   Built by
----------------------------------------  --------------------  ----------------
glibc-2.9.90-19                           dist-f11              jakub

Comment 25 Ulrich Drepper 2009-04-16 22:43:17 UTC

(In reply to comment #23)
> Will this be pushed into F-11 final as well?  If not I need to request an
> update of mysql, because it's entirely broken in F-11 ATM.  

This definitely has to go into F11GA.  And there will be at least one more RPM renaming the whole set to glibc-2.10.  We intend to do this right before the final deadline.
