Bug 1418919
Summary: | malloc_printerr() deadlock, when calling malloc_printerr() again | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | chenwei <chenwei68>
Component: | glibc | Assignee: | glibc team <glibc-bugzilla>
Status: | CLOSED CURRENTRELEASE | QA Contact: | qe-baseos-tools-bugs
Severity: | unspecified | Docs Contact: |
Priority: | unspecified | |
Version: | 7.2 | CC: | ashankar, codonell, fweimer, mnewsome, pfrankli, wanjiankang, yhongjun08
Target Milestone: | rc | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-07-20 07:37:01 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |
Description
chenwei
2017-02-03 07:31:53 UTC
External Bug ID: Sourceware 21046 (not 21045)
https://sourceware.org/bugzilla/show_bug.cgi?id=21046

We looked at the code and suspect that a corrupt arena ends up on the free list, or that an arena is corrupted while on the free list, because the corruption is detected during a free (deallocation) call. It's hard to tell if that's the cause because all evidence is usually destroyed when the hang happens.

I'd prefer if we removed the corrupt bit completely and just conjured up a new arena with _int_new_arena before doing the backtrace. That should simplify the logic considerably. A fresh arena also reduces the risk of additional corruption introduced by threads which are concurrently modifying data in the arena. (But what I *really* prefer is to get rid of the backtrace and IO stream flushing, but you can't have everything.)

(In reply to Florian Weimer from comment #3)
> We looked at the code and suspect that a corrupt arena ends up on the free
> list, or an arena is corrupted while on the free list because corruption is
> detected during a free (deallocation) call. It's hard to tell if that's the
> cause because all evidence is usually destroyed when the hang happens.
>
> I'd prefer if we removed the corrupt bit completely and just conjure up a
> new arena with _int_new_arena before doing the backtrace. That should
> simplify the logic considerably. A fresh arena also reduces the risk of
> additional corruption introduced by threads which are concurrently modifying
> data in the arena. (But what I *really* prefer is to get rid of the
> backtrace and IO stream flushing, but you can't have everything.)

Thanks for your reply.
If such a hang happens again (a rare case... ╮(╯▽╰)╭), what detailed information should be collected? (We can use gdb to debug.)
A core file will not be generated in this situation.

(In reply to chenwei from comment #4)
> Thanks for your reply.
> If such hang happens again(rare case...╮(╯▽╰)╭), what details information
> should be collected?(we can use gdb to debug).

I think it will be very difficult to collect helpful information at this point because of the cause we suspect (a corrupt arena on the arena free list).

If you are still on an old glibc release (you mentioned 2.17-68 somewhere), then the free list could have become cyclic, and we have a GDB script to detect that:

https://sourceware.org/bugzilla/show_bug.cgi?id=19048#c12

A cyclic free list would make it far more likely that the hang happens (assuming our theory is correct). The cyclic free list should still be visible at the time of the hang. It should even be detectable *without* the hang, against a still-running process. (The behavior is sticky, i.e., if the list is cyclic, it remains so.)

> Since core file will not be generated in this situation.

You can usually generate a core file with the gcore utility, or by sending SIGABRT to the hanging process (if coredumps are enabled).

We're marking this CLOSED/CURRENTRELEASE because we believe the latest version of glibc for RHEL7 contains a fix for what we believe to be the issue you are encountering. Please upgrade, try to reproduce this again, and reopen the issue if you can reproduce it. You will need at least glibc-2.17-113.

This problem happened quite often in https://github.com/distcc/distcc/releases/download/v3.3.2/distcc-3.3.2.tar.gz, where "free" is called quite often when there are frequent requests from agent to server.
My glibc version is 2.17-222

(gdb) bt
#0 0x00007f2bd9034e18 in pthread_once () from /lib64/libpthread.so.0
#1 0x00007f2bd8d6fbec in backtrace () from /lib64/libc.so.6
#2 0x00007f2bd8cd3ce4 in __libc_message () from /lib64/libc.so.6
#3 0x00007f2bd8cdbd37 in malloc_consolidate () from /lib64/libc.so.6
#4 0x00007f2bd8cdd205 in _int_malloc () from /lib64/libc.so.6
#5 0x00007f2bd8ce1254 in calloc () from /lib64/libc.so.6
#6 0x00007f2bd96c232f in _dl_new_object () from /lib64/ld-linux-x86-64.so.2
#7 0x00007f2bd96bd2b4 in _dl_map_object_from_fd () from /lib64/ld-linux-x86-64.so.2
#8 0x00007f2bd96bf778 in _dl_map_object () from /lib64/ld-linux-x86-64.so.2
#9 0x00007f2bd96cb3a4 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#10 0x00007f2bd96c68d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#11 0x00007f2bd96cac8b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#12 0x00007f2bd8d98012 in do_dlopen () from /lib64/libc.so.6
#13 0x00007f2bd96c68d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#14 0x00007f2bd8d980d2 in __libc_dlopen_mode () from /lib64/libc.so.6
#15 0x00007f2bd8d6fad5 in init () from /lib64/libc.so.6
#16 0x00007f2bd9034e20 in pthread_once () from /lib64/libpthread.so.0
#17 0x00007f2bd8d6fbec in backtrace () from /lib64/libc.so.6
#18 0x00007f2bd8cd3ce4 in __libc_message () from /lib64/libc.so.6
#19 0x00007f2bd8cda574 in malloc_printerr () from /lib64/libc.so.6
#20 0x0000000000409f81 in dcc_free_argv (argv=0x29eabd0) at src/argutil.c:87
#21 0x000000000040516b in dcc_run_job (out_fd=0, in_fd=<optimized out>) at src/serve.c:1146
#22 dcc_service_job (in_fd=in_fd@entry=5, out_fd=out_fd@entry=5, cli_addr=cli_addr@entry=0x7ffc4b4335a0, cli_len=<optimized out>) at src/serve.c:245
#23 0x00000000004045f2 in dcc_preforked_child (listen_fd=3) at src/prefork.c:187
#24 dcc_create_kids (listen_fd=3) at src/prefork.c:130
#25 0x00000000004046b7 in dcc_preforking_parent (listen_fd=3) at src/prefork.c:90
#26 0x0000000000404129 in dcc_standalone_server () at src/dparent.c:159
#27 0x0000000000403735 in main (argc=<optimized out>, argv=0x7ffc4b4337d8) at src/daemon.c:233

[root@es81-distcc8--4 ~]# rpm -qa|grep glibc
glibc-devel-2.17-222.el7.i686
glibc-devel-2.17-222.el7.x86_64
glibc-2.17-222.el7.x86_64
glibc-common-2.17-222.el7.x86_64
glibc-2.17-222.el7.i686
glibc-headers-2.17-222.el7.x86_64

(In reply to yangHongjun from comment #7)
> this problem happened quite often in
> https://github.com/distcc/distcc/releases/download/v3.3.2/distcc-3.3.2.tar.gz
> where "free" called quite often when there are frequent request from
> agent to server. My glibc version is 2.17-222
>
> (gdb) bt
> #0 0x00007f2bd9034e18 in pthread_once () from /lib64/libpthread.so.0
> #1 0x00007f2bd8d6fbec in backtrace () from /lib64/libc.so.6
> #2 0x00007f2bd8cd3ce4 in __libc_message () from /lib64/libc.so.6
> #3 0x00007f2bd8cdbd37 in malloc_consolidate () from /lib64/libc.so.6
> #4 0x00007f2bd8cdd205 in _int_malloc () from /lib64/libc.so.6
> #5 0x00007f2bd8ce1254 in calloc () from /lib64/libc.so.6

We make every attempt to shut down the application gracefully, but we cannot do this in all cases. We fixed some cases of this in the previous releases. It looks like you've found another corruption case which can cause the hang.

This is a bug in your application (or a glibc bug, but we don't believe it is). You are corrupting the malloc memory pool (arena->heap->chunk) and the application is shutting down at this point.

In future versions of RHEL we have entirely removed the backtracing from the shutdown path, so it should no longer hang like this. We have not considered doing this in RHEL7 because it would be a change in existing behaviour that users depend upon. Instead, an external abort handler should be used to gather a process backtrace upon failure.
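To illustrate what gathering failure information from outside the failing process could look like, here is a minimal sketch in C. It is an assumption about one possible shape of an "external" handler, not what RHEL or distcc actually ships: a parent process forks the job and inspects the wait status, so no reporting code runs inside the process whose heap is corrupted. The names `run_supervised`, `aborting_job` and `clean_job` are hypothetical and exist only for this example.

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `job` in a child process and return the number
 * of the signal that killed it, 0 if it exited on its own, -1 on error.
 * The report is produced by the parent, whose heap is not affected by
 * whatever went wrong in the child. */
static int run_supervised(void (*job)(void))
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        job();
        _exit(0);   /* child finished without being killed */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    if (WIFSIGNALED(status)) {
        fprintf(stderr, "child %ld was killed by signal %d\n",
                (long)pid, WTERMSIG(status));
        return WTERMSIG(status);
    }
    return 0;
}

/* Stand-in for a job that glibc aborts after detecting heap corruption. */
static void aborting_job(void)
{
    abort();
}

/* Stand-in for a job that completes normally. */
static void clean_job(void)
{
}
```

Calling `run_supervised(aborting_job)` returns `SIGABRT`, while `run_supervised(clean_job)` returns 0. A real deployment would more likely rely on the system's crash collection service than on a hand-written supervisor like this one.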
> [root@es81-distcc8--4 ~]# rpm -qa|grep glibc
> glibc-devel-2.17-222.el7.i686
> glibc-devel-2.17-222.el7.x86_64
> glibc-2.17-222.el7.x86_64
> glibc-common-2.17-222.el7.x86_64
> glibc-2.17-222.el7.i686
> glibc-headers-2.17-222.el7.x86_64

Thank you for this information. This is the most recent public release, and contains all the fixes to date for correcting this type of hang.

Have you tried setting MALLOC_CHECK_=2 to abort the program and let a local system abort handler do the work to record the error?

Hi Carlos,

Thanks a lot for your quick reply. I did not expect it so soon, so sorry for my late reply.

After setting MALLOC_CHECK_=2 in my distcc systemd service file, free no longer hangs there. Curiously, though, I cannot see any abort log in my /var/log/message (as a check, I freed the same memory twice, and that abort message did show up in /var/log/message). I do not know whether the problem is gone or something else happened.

Anyway, many thanks for your "MALLOC_CHECK_=2" :)

Hi all,

I have reproduced the problem using a script together with a C file. After running for a while, the process hangs.
The process stack is as follows:

pthread_once () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:94
94 jmp 6b
(gdb) bt
#0 pthread_once () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:94
#1 0x00007fb3c0df982c in __GI___backtrace (array=array@entry=0x7ffe76e91fa0, size=size@entry=64) at ../sysdeps/x86_64/backtrace.c:103
#2 0x00007fb3c0d64354 in __libc_message (do_abort=2, fmt=fmt@entry=0x7fb3c0e6e168 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:176
#3 0x00007fb3c0d6cebd in malloc_printerr (ar_ptr=0x7fb3c10a9740 <main_arena>, ptr=0x2431df0, str=0x7fb3c0e6b8cf "malloc(): memory corruption", action=<optimized out>) at malloc.c:5036
#4 _int_malloc (av=av@entry=0x7fb3c10a9740 <main_arena>, bytes=bytes@entry=560) at malloc.c:3482
#5 0x00007fb3c0d6fb36 in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at malloc.c:3254
#6 0x00007fb3c12de0c5 in allocate_dtv (result=0x7fb2377fe700) at dl-tls.c:317
#7 __GI__dl_allocate_tls (mem=mem@entry=0x7fb2377fe700) at dl-tls.c:533
#8 0x00007fb3c10b8961 in allocate_stack (stack=<synthetic pointer>, pdp=<synthetic pointer>, attr=0x7ffe76e966f0) at allocatestack.c:570
#9 __pthread_create_2_1 (newthread=0x7ffe76e95fe8, attr=0x7ffe76e966f0, start_routine=0x400acd <threadStart>, arg=0x0) at pthread_create.c:451
#10 0x0000000000400d12 in main ()

Created attachment 1483450
script file
Created attachment 1483451
C process file
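The cyclic arena free list discussed earlier in this report can be illustrated with a short sketch. The GDB script linked in that comment is not reproduced here and may work differently; the following only shows the general idea of detecting a cycle in a singly linked list, using Floyd's tortoise-and-hare algorithm. The `struct arena` type and the `free_list_is_cyclic` function are hypothetical stand-ins for this example, not glibc's own definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an arena: only the free-list link matters
 * for this example. */
struct arena {
    struct arena *next_free;
};

/* Floyd's cycle detection: `fast` advances two links per step, `slow`
 * advances one. If the list ends, `fast` reaches NULL; if the list is
 * cyclic, `fast` eventually catches up with `slow`. */
static bool free_list_is_cyclic(const struct arena *head)
{
    const struct arena *slow = head;
    const struct arena *fast = head;

    while (fast != NULL && fast->next_free != NULL) {
        slow = slow->next_free;
        fast = fast->next_free->next_free;
        if (slow == fast)
            return true;
    }
    return false;
}
```

The check needs no extra memory and only reads the list, which fits the observation above that a cyclic list stays cyclic and can be inspected against a still-running process.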