Bug 1418919 - malloc_printerr() deadlock, when calling malloc_printerr() again
Summary: malloc_printerr() deadlock, when calling malloc_printerr() again
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: glibc
Version: 7.2
Hardware: x86_64
OS: Linux
unspecified
unspecified
Target Milestone: rc
: ---
Assignee: glibc team
QA Contact: qe-baseos-tools-bugs
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-02-03 07:31 UTC by chenwei
Modified: 2018-09-15 08:25 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-07-20 07:37:01 UTC
Target Upstream Version:


Attachments (Terms of Use)
info thr and thread 3/5/8/20/22 (23.64 KB, text/plain)
2017-02-03 07:31 UTC, chenwei
no flags Details
script file (72 bytes, text/plain)
2018-09-15 07:04 UTC, wanjiankang
no flags Details
C process file (2.76 KB, text/plain)
2018-09-15 07:09 UTC, wanjiankang
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Sourceware 21046 0 P2 RESOLVED bug16159 seems to have happened again 2020-07-14 11:57:20 UTC
Sourceware 21754 0 P2 RESOLVED malloc: Perform as little work as possible after heap consistency check failures 2020-07-14 11:57:20 UTC

Description chenwei 2017-02-03 07:31:53 UTC
Created attachment 1247372 [details]
info thr and thread 3/5/8/20/22

Description of problem:
I met the deadlock problem with centOS 7.2 with glibc-2.17, which has already back ported the patch fff94fa(Avoid deadlock in malloc on backtrace)

Version-Release number of selected component (if applicable):
version info in spec as below.
%define glibcsrcdir glibc-2.17-c758a686
%define glibcversion 2.17
%define glibcrelease 68%{?dist}
with glibc-rh1207032.patch(https://git.centos.org/blob/rpms!!glibc.git/12745e32c184da0bced206e6a8919af40ef78d29/SOURCES!glibc-rh1207032.patch%3Bjsessionid=1k84nayc1txqyqqj19uvmj1pj)

How reproducible:
rarely

Steps to Reproduce:
1. M_CHECK_ACTION parameter is default value 3;
2. calloc an array. e.g. the array size is X;
3. write the content of array by using index value, such as X + n (n > 0);
4. free the array (The expection detected);

Actual results:
The process hangs.

Expected results:
The process aborts.

Additional info:
Unfortunately, I didn't collect all threads' backtrace info, but collected "info thr".
Detail please see attchment.
==================== hung thread ===========================
(gdb) thr 20
[Switching to thread 20 (Thread 0x7fa5a8edc700 (LWP 193380))]
#0  pthread_once () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:94
94		jmp	6b
(gdb) bt
#0  pthread_once () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:94
#1  0x00007fa5b6e606dc in __GI___backtrace (array=array@entry=0x7fa5a8eda2b0, size=size@entry=64) at ../sysdeps/x86_64/backtrace.c:103
#2  0x00007fa5b6dcb314 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7fa5b6ed4e48 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:176
#3  0x00007fa5b6dd1144 in malloc_printerr (action=<optimized out>, str=0x7fa5b6ed4fe0 "malloc(): smallbin double linked list corrupted", ptr=<optimized out>, ar_ptr=<optimized out>) at malloc.c:5036
#4  0x00007fa5b6dd3f97 in _int_malloc (av=0x7fa588000020, bytes=7809933825095528041) at malloc.c:3396
#5  0x00007fa5b6dd6194 in __GI___libc_malloc (bytes=56) at malloc.c:2909
#6  0x00007fa5ba948238 in _dl_map_object_deps (map=map@entry=0x7fa5bab4a538, preloads=preloads@entry=0x0, npreloads=npreloads@entry=0, trace_mode=trace_mode@entry=0, open_mode=open_mode@entry=-2147483648) at dl-deps.c:515
#7  0x00007fa5ba94e89b in dl_open_worker (a=a@entry=0x7fa5a8edae98) at dl-open.c:269
#8  0x00007fa5ba94a1b4 in _dl_catch_error (objname=objname@entry=0x7fa5a8edae88, errstring=errstring@entry=0x7fa5a8edae90, mallocedp=mallocedp@entry=0x7fa5a8edae80, operate=operate@entry=0x7fa5ba94e700 <dl_open_worker>, 
    args=args@entry=0x7fa5a8edae98) at dl-error.c:177
#9  0x00007fa5ba94e1ab in _dl_open (file=0x7fa5b6ed13c6 "libgcc_s.so.1", mode=-2147483647, caller_dlopen=<optimized out>, nsid=-2, argc=2, argv=0x7ffcac68a628, env=0x7ffcac68a640) at dl-open.c:650
#10 0x00007fa5b6e86fd2 in do_dlopen (ptr=ptr@entry=0x7fa5a8edb0a0) at dl-libc.c:87
#11 0x00007fa5ba94a1b4 in _dl_catch_error (objname=0x7fa5a8edb080, errstring=0x7fa5a8edb090, mallocedp=0x7fa5a8edb070, operate=0x7fa5b6e86f90 <do_dlopen>, args=0x7fa5a8edb0a0) at dl-error.c:177
#12 0x00007fa5b6e87092 in dlerror_run (args=0x7fa5a8edb0a0, operate=0x7fa5b6e86f90 <do_dlopen>) at dl-libc.c:46
#13 __GI___libc_dlopen_mode (name=name@entry=0x7fa5b6ed13c6 "libgcc_s.so.1", mode=mode@entry=-2147483647) at dl-libc.c:163
#14 0x00007fa5b6e605c5 in init () at ../sysdeps/x86_64/backtrace.c:52
#15 0x00007fa5b7123bb0 in pthread_once () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:103
#16 0x00007fa5b6e606dc in __GI___backtrace (array=array@entry=0x7fa5a8edb360, size=size@entry=64) at ../sysdeps/x86_64/backtrace.c:103
#17 0x00007fa5b6dcb314 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7fa5b6ed4e48 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:176
#18 0x00007fa5b6dd26d3 in malloc_printerr (ar_ptr=0x7fa588000020, ptr=<optimized out>, str=0x7fa5b6ed4e90 "free(): invalid next size (normal)", action=3) at malloc.c:5036
#19 _int_free (av=0x7fa588000020, p=<optimized out>, have_lock=0) at malloc.c:3856
#20 0x00007fa5b9e300ba in virFree (ptrptr=ptrptr@entry=0x7fa5a8edbaf8) at util/viralloc.c:582
#21 0x00007fa5b9e994e8 in virTypedParamsFree (params=0x7fa58800a970, nparams=<optimized out>) at util/virtypedparam.c:1318
#22 0x00007fa5bab8c0c4 in remoteDispatchDomainGetSchedulerParametersFlags (server=0x7fa5bb2fa260, msg=0x7fa588023ab0, ret=0x7fa588021e00, args=<optimized out>, rerr=0x7fa5a8edbc90, client=<optimized out>) at remote.c:1639
#23 remoteDispatchDomainGetSchedulerParametersFlagsHelper (server=0x7fa5bb2fa260, client=<optimized out>, msg=0x7fa588023ab0, rerr=0x7fa5a8edbc90, args=<optimized out>, ret=0x7fa588021e00) at remote_dispatch.h:5853
#24 0x00007fa5b9faabf2 in virNetServerProgramDispatchCall (msg=0x7fa588023ab0, client=0x7fa594003160, server=0x7fa5bb2fa260, prog=0x7fa5bb309650) at rpc/virnetserverprogram.c:437
#25 virNetServerProgramDispatch (prog=0x7fa5bb309650, server=server@entry=0x7fa5bb2fa260, client=0x7fa594003160, msg=0x7fa588023ab0) at rpc/virnetserverprogram.c:307
#26 0x00007fa5babb096d in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x7fa5bb2fa260) at rpc/virnetserver.c:137
#27 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x7fa5bb2fa260) at rpc/virnetserver.c:158
#28 0x00007fa5b9e957f1 in virThreadPoolWorker (opaque=opaque@entry=0x7fa5bb2ef540) at util/virthreadpool.c:167
#29 0x00007fa5b9e94b08 in virThreadHelper (data=<optimized out>) at util/virthread.c:219
#30 0x00007fa5b711edc5 in start_thread (arg=0x7fa5a8edc700) at pthread_create.c:308
#31 0x00007fa5b6e4c71d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Comment 1 chenwei 2017-02-03 08:34:01 UTC
External Bug ID: Sourceware 21046 not 21045
https://sourceware.org/bugzilla/show_bug.cgi?id=21046

Comment 3 Florian Weimer 2017-05-17 12:51:11 UTC
We looked at the code and suspect that a corrupt arena ends up on the free list, or an arena is corrupted while on the free list because corruption is detected during a free (deallocation) call.  It's hard to tell if that's the cause because all evidence is usually destroyed when the hang happens.

I'd prefer if we removed the corrupt bit completely and just conjure up a new arena with _int_new_arena before doing the backtrace.  That should simplify the logic considerably.  A fresh arena also reduces the risk of additional corruption introduced by threads which are concurrently modifying data in the arena.  (But what I *really* prefer is to get rid of the backtrace and IO stream flushing, but you can't have everything.)

Comment 4 chenwei 2017-05-18 01:15:59 UTC
(In reply to Florian Weimer from comment #3)
> We looked at the code and suspect that a corrupt arena ends up on the free
> list, or an arena is corrupted while on the free list because corruption is
> detected during a free (deallocation) call.  It's hard to tell if that's the
> cause because all evidence is usually destroyed when the hang happens.
> 
> I'd prefer if we removed the corrupt bit completely and just conjure up a
> new arena with _int_new_arena before doing the backtrace.  That should
> simplify the logic considerably.  A fresh arena also reduces the risk of
> additional corruption introduced by threads which are concurrently modifying
> data in the arena.  (But what I *really* prefer is to get rid of the
> backtrace and IO stream flushing, but you can't have everything.)

Thanks for your reply.
If such hang happens again(rare case...╮(╯▽╰)╭), what details information should be collected?(we can use gdb to debug).
Since core file will not be generated in this situation.

Comment 5 Florian Weimer 2017-05-18 06:57:51 UTC
(In reply to chenwei from comment #4)
> Thanks for your reply.
> If such hang happens again(rare case...╮(╯▽╰)╭), what details information
> should be collected?(we can use gdb to debug).

I think it will be very difficult to collect helpful information at this point because the cause we suspect (corrupt arena on the arena free list).

If you are still on an old glibc release (you mentioned 2.17-68 somewhere), then the free list could have become cyclic, and we have a GDB script to detect that:

  https://sourceware.org/bugzilla/show_bug.cgi?id=19048#c12

A cyclic free list would make it far more likely that the hang happens (assuming our theory is correct).  The cyclic free list should still be visible at the time of the hang.  It should even be detectable *without* the hang, against a still-running process.  (The behavior is sticky, i.e., if the list is cyclic, it remains so.)

> Since core file will not be generated in this situation.

You can usually generate a core file with the gcore utility, or by sending SIGABRT to the hanging process (if coredumps are enabled).

Comment 6 Carlos O'Donell 2017-07-20 07:37:01 UTC
We're marking this CLOSED/CURRENTRELEASE because we believe the latest version of glibc for RHEL7 contains a fix for what we believe to be the issue you are encountering. Please upgrade, try to reproduce this again, and reopen the issue if you can reproduce. You will need at least glibc-2.17-113.

Comment 7 yangHongjun 2018-09-07 09:38:43 UTC
this problem happened quite often in https://github.com/distcc/distcc/releases/download/v3.3.2/distcc-3.3.2.tar.gz where "free" called quite often when there are frequent request from agent to server. My glibc version is 2.17-222 

(gdb) bt
#0  0x00007f2bd9034e18 in pthread_once () from /lib64/libpthread.so.0
#1  0x00007f2bd8d6fbec in backtrace () from /lib64/libc.so.6
#2  0x00007f2bd8cd3ce4 in __libc_message () from /lib64/libc.so.6
#3  0x00007f2bd8cdbd37 in malloc_consolidate () from /lib64/libc.so.6
#4  0x00007f2bd8cdd205 in _int_malloc () from /lib64/libc.so.6
#5  0x00007f2bd8ce1254 in calloc () from /lib64/libc.so.6
#6  0x00007f2bd96c232f in _dl_new_object () from /lib64/ld-linux-x86-64.so.2
#7  0x00007f2bd96bd2b4 in _dl_map_object_from_fd () from /lib64/ld-linux-x86-64.so.2
#8  0x00007f2bd96bf778 in _dl_map_object () from /lib64/ld-linux-x86-64.so.2
#9  0x00007f2bd96cb3a4 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#10 0x00007f2bd96c68d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#11 0x00007f2bd96cac8b in _dl_open () from /lib64/ld-linux-x86-64.so.2
#12 0x00007f2bd8d98012 in do_dlopen () from /lib64/libc.so.6
#13 0x00007f2bd96c68d4 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#14 0x00007f2bd8d980d2 in __libc_dlopen_mode () from /lib64/libc.so.6
#15 0x00007f2bd8d6fad5 in init () from /lib64/libc.so.6
#16 0x00007f2bd9034e20 in pthread_once () from /lib64/libpthread.so.0
#17 0x00007f2bd8d6fbec in backtrace () from /lib64/libc.so.6
#18 0x00007f2bd8cd3ce4 in __libc_message () from /lib64/libc.so.6
#19 0x00007f2bd8cda574 in malloc_printerr () from /lib64/libc.so.6
#20 0x0000000000409f81 in dcc_free_argv (argv=0x29eabd0) at src/argutil.c:87
#21 0x000000000040516b in dcc_run_job (out_fd=0, in_fd=<optimized out>) at src/serve.c:1146
#22 dcc_service_job (in_fd=in_fd@entry=5, out_fd=out_fd@entry=5, cli_addr=cli_addr@entry=0x7ffc4b4335a0, cli_len=<optimized out>)
    at src/serve.c:245
#23 0x00000000004045f2 in dcc_preforked_child (listen_fd=3) at src/prefork.c:187
#24 dcc_create_kids (listen_fd=3) at src/prefork.c:130
#25 0x00000000004046b7 in dcc_preforking_parent (listen_fd=3) at src/prefork.c:90
#26 0x0000000000404129 in dcc_standalone_server () at src/dparent.c:159
#27 0x0000000000403735 in main (argc=<optimized out>, argv=0x7ffc4b4337d8) at src/daemon.c:233

[root@es81-distcc8--4 ~]# rpm -qa|grep glibc
glibc-devel-2.17-222.el7.i686
glibc-devel-2.17-222.el7.x86_64
glibc-2.17-222.el7.x86_64
glibc-common-2.17-222.el7.x86_64
glibc-2.17-222.el7.i686
glibc-headers-2.17-222.el7.x86_64

Comment 8 Carlos O'Donell 2018-09-07 10:22:56 UTC
(In reply to yangHongjun from comment #7)
> this problem happened quite often in
> https://github.com/distcc/distcc/releases/download/v3.3.2/distcc-3.3.2.tar.
> gz where "free" called quite often when there are frequent request from
> agent to server. My glibc version is 2.17-222 
> 
> (gdb) bt
> #0  0x00007f2bd9034e18 in pthread_once () from /lib64/libpthread.so.0
> #1  0x00007f2bd8d6fbec in backtrace () from /lib64/libc.so.6
> #2  0x00007f2bd8cd3ce4 in __libc_message () from /lib64/libc.so.6
> #3  0x00007f2bd8cdbd37 in malloc_consolidate () from /lib64/libc.so.6
> #4  0x00007f2bd8cdd205 in _int_malloc () from /lib64/libc.so.6
> #5  0x00007f2bd8ce1254 in calloc () from /lib64/libc.so.6

We make every attempt to shut down the application gracefully, but we cannot do this in all cases. We fixed some cases of this in the previous releases.

It looks like you've found another corruption case which can cause the hang. This is a bug in your application (or a glibc bug, but we don't believe it is). You are corrupting the malloc memory pool (arena->heap->chunk) and the application is shutting down at this point.

In future versions of RHEL we have entirely removed the backtracing from the shutdown path and therefore should no longer hang like this. We have not considered doing this in RHEL7 because it would be a change in existing behaviour that users depend upon. Instead a external abort handler should be used to gather a process backtrace upon failure.
 
> [root@es81-distcc8--4 ~]# rpm -qa|grep glibc
> glibc-devel-2.17-222.el7.i686
> glibc-devel-2.17-222.el7.x86_64
> glibc-2.17-222.el7.x86_64
> glibc-common-2.17-222.el7.x86_64
> glibc-2.17-222.el7.i686
> glibc-headers-2.17-222.el7.x86_64

Thank you for this information. This is the most recent public release, and contians all the fixes to date for correcting this type of hang.

Have you tried setting MALLOC_CHECK_=2 to abort the program and let a local system abort handler do the work to record the error?

Comment 9 yangHongjun 2018-09-14 09:09:12 UTC
Hi, Carlos

Thanks a lot for your quick reply which i did not expect it is so quick, so , sorry for my late reply.

after setting MALLOC_CHECK_=2 in my distcc systemd service file, free won't hang there, but curiously, i can not see any abort log in my /var/log/message(i tried free the same memory twice to probe abort message in /var/log/message successfully). I do not know if this problem gone or something happened.

Anyway, many thanks for your "MALLOC_CHECK_=2" :)

Comment 10 wanjiankang 2018-09-15 07:01:47 UTC
Hi,All;
    I have reproduced the problem by a script with a C file.After runing for a while,
the process will hang.
    The informations content of process stack is as follows:
pthread_once () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:94
94              jmp     6b
(gdb) bt
#0  pthread_once () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:94
#1  0x00007fb3c0df982c in __GI___backtrace (array=array@entry=0x7ffe76e91fa0, size=size@entry=64) at ../sysdeps/x86_64/backtrace.c:103
#2  0x00007fb3c0d64354 in __libc_message (do_abort=2, fmt=fmt@entry=0x7fb3c0e6e168 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:176
#3  0x00007fb3c0d6cebd in malloc_printerr (ar_ptr=0x7fb3c10a9740 <main_arena>, ptr=0x2431df0, str=0x7fb3c0e6b8cf "malloc(): memory corruption", action=<optimized out>) at malloc.c:5036
#4  _int_malloc (av=av@entry=0x7fb3c10a9740 <main_arena>, bytes=bytes@entry=560) at malloc.c:3482
#5  0x00007fb3c0d6fb36 in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at malloc.c:3254
#6  0x00007fb3c12de0c5 in allocate_dtv (result=0x7fb2377fe700) at dl-tls.c:317
#7  __GI__dl_allocate_tls (mem=mem@entry=0x7fb2377fe700) at dl-tls.c:533
#8  0x00007fb3c10b8961 in allocate_stack (stack=<synthetic pointer>, pdp=<synthetic pointer>, attr=0x7ffe76e966f0) at allocatestack.c:570
#9  __pthread_create_2_1 (newthread=0x7ffe76e95fe8, attr=0x7ffe76e966f0, start_routine=0x400acd <threadStart>, arg=0x0) at pthread_create.c:451
#10 0x0000000000400d12 in main ()

Comment 11 wanjiankang 2018-09-15 07:04:09 UTC
Created attachment 1483450 [details]
script file

Comment 12 wanjiankang 2018-09-15 07:09:29 UTC
Created attachment 1483451 [details]
C process file


Note You need to log in before you can comment on or make changes to this bug.