Bug 494631
Summary: | glibc malloc change breaks emacs 23 and mysql | |
---|---|---|---
Product: | [Fedora] Fedora | Reporter: | Andy Wingo <wingo>
Component: | glibc | Assignee: | Jakub Jelinek <jakub>
Status: | CLOSED RAWHIDE | QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | high | Docs Contact: |
Priority: | low | |
Version: | rawhide | CC: | dcantrell, drepper, fweimer, jakub, mcepl, mcepl, mishu, pmatilai, tgl, valdis.kletnieks, wtogami
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2009-04-16 21:15:51 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 476775 | |
Attachments: | | |
Description
Andy Wingo
2009-04-07 16:51:53 UTC
There are some indicators that glibc-2.9.90-12 (experimental malloc?) breaks rpm too:
- on April 1st, when glibc-2.9.90-12 was built, rawhide composes started acting up; more precisely, rpm macro expansions show memory corruption like:
    error: Unterminated {: {%{_keyringpath}/*.ke 0< /tmp/yumdir.21410/
    error: Unterminated {: {%{_dbpathrpĂ·( 0< /tmp/yumdir.21410/
- there have been a few failed koji builds with the same symptoms
- https://www.redhat.com/archives/fedora-test-list/2009-April/msg00515.html reports the same symptoms: an attempted F10 -> rawhide update broke early on, and the system is essentially F10 but with rawhide glibc

Couple of data points:
- this seems to be limited to 32bit x86, at least so far nobody has reported seeing this breakage on other archs
- rpm macro expansion does a downsize realloc(), which is perhaps not all that common a thing to do (the macro expansion size is unknown, so it grabs a large buffer and then reallocs down to the actual needed size when done expanding)

#c1 is unrelated. Yes, it could be due to the malloc changes, but it doesn't have anything else in common with the emacs issue. emacs is broken because it uses malloc_{get,set}_state, saving when run with one version of glibc and restoring with a different one, which effectively turns most changes to malloc internals into ABI changes for emacs. I don't think rpm uses this. So for #c1 it would be better to track it under a separate bug and, if possible, for Ulrich to have some easy reproducer.

FWIW, this change seems to have also broken mysql, but only on x86_64. I was tearing my hair for most of yesterday evening and today, but the failure disappeared after updating from glibc-2.9.90-12 to -14. I didn't get much further than identifying that it dumped core down inside a calloc call. Could we perhaps not enable "experimental" stuff during beta?

I do not think the emacs issue is purely one of a changing malloc ABI, given that with glibc 2.9.90-12, even a freshly-recompiled emacs doesn't bootstrap. I am on x86-32.
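The downsizing realloc() pattern described for rpm macro expansion above can be sketched roughly as follows. This is only an illustration of the allocation pattern, not rpm's actual code: the function name `expand_into_fitted_buffer` and the use of a plain copy as a stand-in for macro expansion are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not rpm's code: produce a string whose final
   length is not known up front by filling a generously sized scratch
   buffer, then shrinking the allocation to what was actually used. */
char *expand_into_fitted_buffer(const char *text, size_t scratch_size)
{
    size_t used = strlen(text) + 1;      /* bytes needed, including NUL */
    assert(used <= scratch_size);        /* scratch must be big enough */

    char *buf = malloc(scratch_size);    /* over-allocate up front */
    if (buf == NULL)
        return NULL;

    memcpy(buf, text, used);             /* stand-in for the real expansion */

    /* Downsizing realloc: the block may be trimmed in place or moved. */
    char *fitted = realloc(buf, used);
    if (fitted == NULL)
        return buf;                      /* original block is still valid */
    return fitted;
}
```

A caller would pass a scratch size such as 16384 and free() the returned pointer when done. The shrink step is the part that exercises a less common allocator path.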
The mysql failure is back with -15 :-(

Putting this on the blocker. Jakub, we really need a resolution here, final freeze is tomorrow.

I may be seeing this one here as well - admittedly, it's against 64-bit nightly trunk builds of Firefox 3.6a1pre with the beta 64-bit Adobe Flash plugin. But data points:
1) It started with -12, and goes away if I put back an older glibc.
2) The firefox crash is pretty obviously a stray/duff pointer - a SIGSEGV

emacs doesn't seem to build with either glibc-2.9.90-14 or -15, at least not on x86_64, with F10 kernel. Both segfault, so I'd say the bug is on the emacs side.

mysql failure still there with -16 ... so apparently the "final freeze" snapshot is going to include a known broken mysql.

Regarding emacs, after upgrading to a non-broken F10 kernel I was able to build emacs against glibc-2.9.90-11 on x86_64 just fine and run the resulting dumped emacs against glibc-2.9.90-16, and also dump emacs against glibc-2.9.90-16 and run the resulting dumped binary, so I can't reproduce any issues.
For mysql, are there any details on how it can be reproduced? What kind of debugging have you done to ensure it isn't mysql's fault, which would be very likely? The changes in malloc are not very large; they only affect which arena is selected for which thread and whether locking is needed or atomic operations are used instead in some cases.

I managed to reproduce the mysql test failure.
The problem is that it creates threads with very small stack sizes:

    4142 mmap(NULL, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fa870662000
    4142 mprotect(0x7fa870662000, 4096, PROT_NONE) = 0
    4142 clone(child_stack=0x7fa8706651f0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fa8706659e0, tls=0x7fa870665910, child_tidptr=0x7fa8706659e0) = 4174
    4142 futex(0x24b4014, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
    4174 set_robust_list(0x7fa8706659f0, 0x18) = 0
    4174 --- SIGSEGV (Segmentation fault) @ 0 (0) ---

Don't know what exactly was passed to pthread_attr_setstacksize yet, but clearly from the 16KB allocated the low 4KB is used for the guard page, then there are 8688 bytes from the top of the guard page to the child stack, and the rest is struct pthread/TLS blocks etc.

What changed is that in malloc we now newly call in reused_arena:

    static int narenas_limit;
    if (narenas_limit == 0)
      {
        if (mp_.arena_max != 0)
          narenas_limit = mp_.arena_max;
        else
          {
            int n = __get_nprocs ();

and __get_nprocs does:

    int
    __get_nprocs ()
    {
      /* XXX Here will come a test for the new system call. */
      char buffer[8192];
      char *const buffer_end = buffer + sizeof (buffer);
      ...

which is already too much for the extremely limited stack. While what mysql does is very stupid, I'm afraid we need to tolerate it, because on x86_64 PTHREAD_STACK_MIN is still 16KB, not more.

Unfortunately doing here

    if (__libc_use_alloca (8192))
      buffer = alloca (8192);
    else
      buffer = malloc (8192);

isn't going to work well, as when it is called from inside of malloc, that would be a recursive call.

Created attachment 339675 [details]
Patch to reduce stack usage in __get_nprocs if stack limit is very low
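The stack budget read off the strace output above can be double-checked with a few subtractions. The sketch below only re-does that arithmetic; the struct and function names are made up for the example, and the addresses are the ones shown in the mmap, mprotect and clone lines.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical breakdown of one thread mapping: guard page at the
   bottom, usable stack in the middle, thread descriptor/TLS on top. */
struct stack_layout {
    uint64_t guard;       /* bytes made PROT_NONE by mprotect */
    uint64_t usable;      /* bytes between guard top and child_stack */
    uint64_t descriptor;  /* bytes above child_stack (struct pthread, TLS) */
};

struct stack_layout split_thread_mapping(uint64_t base, uint64_t size,
                                         uint64_t guard, uint64_t child_stack)
{
    /* child_stack must lie inside the mapping, above the guard page. */
    assert(child_stack >= base + guard && child_stack <= base + size);

    struct stack_layout l;
    l.guard = guard;
    l.usable = child_stack - (base + guard);
    l.descriptor = (base + size) - child_stack;
    return l;
}
```

Called with base 0x7fa870662000, size 16384, guard 4096 and child_stack 0x7fa8706651f0, this gives 8688 usable bytes and 3600 bytes above the stack pointer, so an 8192-byte local buffer leaves fewer than 500 bytes for every other frame on that thread.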
Created attachment 339679 [details]
Patch to truncate very long lines.
The second patch is to avoid next_line returning parts of the same line for very
long lines; instead it just truncates them at 3/4 of the buffer size. Looking at
all users of this function, they all look just at < 30 initial bytes of the
line anyway, never more, so when we truncate lines after 384 bytes, it should
work IMHO better than pretending there was a line break inside of a very long
line.
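The idea of truncating an over-long line instead of handing back its pieces as if they were separate lines can be illustrated with a small stdio-based reader. This is a sketch only and is not glibc's next_line or the attached patch: the function name `read_truncated_line` and the use of fgets are assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical reader: copy at most cap-1 bytes of the next line into
   out, silently drop the rest of that line, and return 1.
   Returns 0 at end of input. */
int read_truncated_line(FILE *fp, char *out, size_t cap)
{
    assert(cap >= 2);
    if (fgets(out, (int) cap, fp) == NULL)
        return 0;

    size_t len = strlen(out);
    if (len > 0 && out[len - 1] == '\n') {
        out[len - 1] = '\0';             /* the whole line fit */
        return 1;
    }

    /* The line was longer than the buffer: discard its tail so the
       next call starts at a real line boundary. */
    int c;
    while ((c = fgetc(fp)) != EOF && c != '\n')
        ;
    return 1;
}
```

Because callers only inspect a short prefix of each line, losing the tail is harmless, while returning the tail as a "new line" could make it look like an extra record.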
I have added the two patches plus a third to handle programs with small stacks. Jakub will likely build a new glibc today.

Thanks for investigating this. I poked around in the mysql sources, and it appears that it's selecting a thread stack size of 64K on x86_64 in the program that's failing. So while I agree that it's a bad idea for malloc to be eating 8K of stack, I'm not entirely convinced that that explains the crash. Is it possible that the code is being called recursively?
We've seen before that the mysql guys are ridiculously optimistic about the amount of stack space required, and I'm actually carrying a patch to increase the stack size request in another place in their code. I'm tempted to patch it higher here too, but I'm not 100% sure I'm looking at the right place.

Ah, ignore the above --- I just found another place that was asking for only 16K stack, and increasing that request seems to make the crash go away.
Still, it's a bit mystifying that this fails only on x86_64; the problem should be as bad or worse on the other three arches. Is that new code in malloc somehow x86_64 specific? Is it still true that the stack guard space is silently subtracted from the stack size request (and if so, what happens when the request is <= one page?)

To clarify: I'm wondering why this code works at all on PPC, seeing that (last I heard) our PPC builders have 64K page size. The algorithm it's using to choose the thread stack size is
1. Start with an 8K basic request (or in the other place I found earlier, a 32K basic request).
2. Double the request if sizeof(pointer) == 8.
3. If less than PTHREAD_STACK_MIN, increase to PTHREAD_STACK_MIN.
4. pthread_attr_setstacksize with this. Don't change the default guard size.
Unless we're using quite a large PTHREAD_STACK_MIN value on PPC(64), or there's some other special rule on those platforms, this really ought to crash and burn everywhere given that malloc needs at least 8K.
So I'm still confused why it fails to fail on three out of four platforms.

(In reply to comment #17)
> To clarify: I'm wondering why this code works at all on PPC, seeing that (last
> I heard) our PPC builders have 64K page size. The algorithm it's using to
> choose the thread stack size is

The pthread_attr_setstacksize function (and the others, which shouldn't be used) check that the size is not below PTHREAD_STACK_MIN. On ppc the limit is 131072.

I've successfully built mysql in HEAD (which still has glibc -16) using a patch that knocks up that 8K request to 32K (64K on 64-bit platforms). So that confirms the diagnosis. This can be closed out as far as mysql is concerned.

I have no more problems with emacs and glibc 2.9.90-15. My symptom of this problem is fixed. Thanks, Jakub & all.

Putting on F11Preview list, as something(s) need to get tagged for F11 to be in preview.

glibc-2.9.90-19 in rawhide has much smaller stack requirements for malloc (instead of 8KB just 512B if stacksize is low).

Will this be pushed into F-11 final as well? If not I need to request an update of mysql, because it's entirely broken in F-11 ATM.

koji latest-pkg dist-f11 glibc
Build                                     Tag                   Built by
----------------------------------------  --------------------  ----------------
glibc-2.9.90-19                           dist-f11              jakub

(In reply to comment #23)
> Will this be pushed into F-11 final as well? If not I need to request an
> update of mysql, because it's entirely broken in F-11 ATM.

This definitely has to go into F11GA. And there will be even at least one more RPM renaming the whole set to glibc-2.10. We intend to do this right before the final deadline.