Description of problem:
The customer is using "ps" within a script to monitor processes and noticed that "ps" died once because glibc detected a double free or corruption.

Version-Release number of selected component (if applicable):
procps-3.2.7-8.1.el5 on x86_64

How reproducible:
Cannot reproduce

Steps to Reproduce:
1. Run "ps -e -o user -o pid -o ppid -o args"

Actual results:
*** glibc detected *** ps: double free or corruption (out): 0x000000000bacf680 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3912e6f444]
/lib64/libc.so.6(cfree+0x8c)[0x3912e72a6c]
ps[0x402393]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3912e1d8a4]
ps[0x4019f9]

Expected results:
No error

Additional info:
We do not have a core file for that crash. As far as I know, the problem occurred only once for the customer, and we have not found a way to reproduce it.

I tried to reproduce the issue using a very long cmdline containing multibyte characters with the "ja_JP.UTF-8" locale, without success.

The customer suspected that the length of the cmdline string could be the problem, but:

1) The size of the cmdline in /proc/$PID/cmdline is limited to PAGE_SIZE in the kernel's "fs/proc/base.c":

    static int proc_pid_cmdline(struct task_struct *task, char * buffer)
    {
            int res = 0;
            unsigned int len;
            struct mm_struct *mm = get_task_mm(task);
    ...
            len = mm->arg_end - mm->arg_start;

            if (len > PAGE_SIZE)
                    len = PAGE_SIZE;

So the cmdline cannot exceed 4096 bytes, and:

2) The code in proc/readproc.c can deal with any size of cmdline anyway.

I also suspected the use of UTF-8 (as the locale used is "ja_JP.UTF-8") with a longer cmdline, but could not reproduce the problem that way either.
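For illustration, the clamp in the kernel excerpt above can be restated as a small userspace function. This is only a sketch: clamped_cmdline_len() is a hypothetical name, and the page size is passed in as a parameter instead of coming from the kernel's PAGE_SIZE.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical userspace restatement of the clamp in the kernel excerpt:
 * the reported length is arg_end - arg_start, capped at one page.
 */
size_t clamped_cmdline_len(size_t arg_start, size_t arg_end, size_t page_size)
{
    assert(arg_end >= arg_start);   /* the argument area cannot be negative */

    size_t len = arg_end - arg_start;
    if (len > page_size)
        len = page_size;
    return len;
}
```

With a 4096-byte page, a 10000-byte argument area is reported as 4096 bytes, while a 100-byte one is reported unchanged.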
The backtrace in the log gives:

*** glibc detected *** ps: double free or corruption (out): 0x000000000bacf680 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3912e6f444]
/lib64/libc.so.6(cfree+0x8c)[0x3912e72a6c]
ps[0x402393]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3912e1d8a4]
ps[0x4019f9]
======= Memory map: ========
00400000-00413000 r-xp 00000000 08:02 3124268 /bin/ps
00613000-00614000 rw-p 00013000 08:02 3124268 /bin/ps
00614000-00634000 rw-p 00614000 00:00 0
0bacd000-0baee000 rw-p 0bacd000 00:00 0
3912a00000-3912a1a000 r-xp 00000000 08:02 1887841 /lib64/ld-2.5.so
3912c19000-3912c1a000 r--p 00019000 08:02 1887841 /lib64/ld-2.5.so
3912c1a000-3912c1b000 rw-p 0001a000 08:02 1887841 /lib64/ld-2.5.so
3912e00000-3912f46000 r-xp 00000000 08:02 1887842 /lib64/libc-2.5.so
3912f46000-3913146000 ---p 00146000 08:02 1887842 /lib64/libc-2.5.so
3913146000-391314a000 r--p 00146000 08:02 1887842 /lib64/libc-2.5.so
391314a000-391314b000 rw-p 0014a000 08:02 1887842 /lib64/libc-2.5.so
391314b000-3913150000 rw-p 391314b000 00:00 0
3913200000-3913202000 r-xp 00000000 08:02 1887843 /lib64/libdl-2.5.so
3913202000-3913402000 ---p 00002000 08:02 1887843 /lib64/libdl-2.5.so
3913402000-3913403000 r--p 00002000 08:02 1887843 /lib64/libdl-2.5.so
3913403000-3913404000 rw-p 00003000 08:02 1887843 /lib64/libdl-2.5.so
3913600000-391360d000 r-xp 00000000 08:02 1887857 /lib64/libproc-3.2.7.so
391360d000-391380d000 ---p 0000d000 08:02 1887857 /lib64/libproc-3.2.7.so
391380d000-391380e000 rw-p 0000d000 08:02 1887857 /lib64/libproc-3.2.7.so
391380e000-3913822000 rw-p 391380e000 00:00 0
3922400000-392240d000 r-xp 00000000 08:02 1887848 /lib64/libgcc_s-4.1.2-20070626.so.1
392240d000-392260d000 ---p 0000d000 08:02 1887848 /lib64/libgcc_s-4.1.2-20070626.so.1
392260d000-392260e000 rw-p 0000d000 08:02 1887848 /lib64/libgcc_s-4.1.2-20070626.so.1
2aaaaaaab000-2aaaaaaac000 rw-p 2aaaaaaab000 00:00 0
2aaaaaac7000-2aaaaaaeb000 rw-p 2aaaaaac7000 00:00 0
2aaaaaaeb000-2aaaaaaec000 ---p 2aaaaaaeb000 00:00 0
2aaaaaaec000-2aaaaaaed000 rw-p 2aaaaaaec000 00:00 0
2aaaaab07000-2aaaaab11000 r-xp 00000000 08:02 1887580 /lib64/libnss_files-2.5.so
2aaaaab11000-2aaaaad10000 ---p 0000a000 08:02 1887580 /lib64/libnss_files-2.5.so
2aaaaad10000-2aaaaad11000 r--p 00009000 08:02 1887580 /lib64/libnss_files-2.5.so
2aaaaad11000-2aaaaad12000 rw-p 0000a000 08:02 1887580 /lib64/libnss_files-2.5.so
2aaaac000000-2aaaac021000 rw-p 2aaaac000000 00:00 0
2aaaac021000-2aaab0000000 ---p 2aaaac021000 00:00 0
7fff37085000-7fff3709a000 rw-p 7fff37085000 00:00 0 [stack]
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 [vdso]

Signal 6 (ABRT) caught by ps (procps version 3.2.7).
Please send bug reports to <feedback.net> or <albert.net>

Checking the address that led to the free() error gives:

(gdb) list *0x402393
0x402393 is in main (ps/display.c:343).
338         case TF_show_proc:                   // normal non-thread output
339           while(readproc(ptp,&buf)){
340             if(want_this_proc(&buf)){
341               show_one_proc(&buf, proc_format_list);
342             }
343             if(buf.cmdline) free((void*)*buf.cmdline); // ought to reuse
344             if(buf.environ) free((void*)*buf.environ); // ought to reuse
345           }
346           break;

I checked the code but did not spot any obvious problem. Those buffers are allocated by readproc(), which resolves to simple_readproc(); simple_readproc() in turn uses file2strvec() to allocate the memory and read the data from the proc file. I do not see want_this_proc() or show_one_proc() allocating or deallocating memory, so I don't think the problem comes from these functions. They do seem to use functions that are not multibyte-aware, such as strlen(), to compute output column alignment, but that should cause nothing worse than misaligned output.
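The free() call at display.c:343 releases *buf.cmdline (the first string) rather than buf.cmdline (the vector), which suggests that the strings and the pointer table share a single allocation whose base address is the first string. The sketch below is a simplified model of such a layout under that assumption; it is not the procps file2strvec() code, and build_strvec() and free_strvec() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not the procps implementation: pack `tot` bytes of
 * NUL-separated strings and a NULL-terminated pointer table into ONE malloc
 * block. The strings sit at the base of the block and the table follows
 * them, so vec[0] is also the address that must be handed to free().
 */
char **build_strvec(const char *data, size_t tot)
{
    /* The slot count below is only correct if the data ends with a NUL. */
    assert(tot > 0 && data[tot - 1] == '\0');

    size_t nstr = 0;                        /* one string per NUL byte */
    for (size_t i = 0; i < tot; i++)
        if (data[i] == '\0')
            nstr++;

    /* Round the string area up so the pointer table is properly aligned. */
    size_t ptr_off = (tot + sizeof(char *) - 1) / sizeof(char *) * sizeof(char *);
    char *block = malloc(ptr_off + (nstr + 1) * sizeof(char *));
    if (block == NULL)
        return NULL;

    memcpy(block, data, tot);
    char **vec = (char **)(block + ptr_off);

    size_t k = 0;
    vec[k++] = block;                       /* first string == block base */
    for (size_t i = 0; i + 1 < tot; i++)    /* skip the final NUL */
        if (block[i] == '\0')
            vec[k++] = block + i + 1;       /* next string starts after a NUL */
    vec[k] = NULL;                          /* table terminator */
    return vec;
}

/* Release a vector built above: free the base, i.e. the first string. */
void free_strvec(char **vec)
{
    if (vec != NULL)
        free(*vec);
}
```

In this model, input that does not end with a NUL byte would need one more table slot than the count reserves, which is why the sketch asserts on it instead of writing past the block.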
I've now provided the customer with a slightly modified version of procps that does *not* catch the SIGSEGV and SIGABRT signals, so that we have a chance to capture a core file if the problem reoccurs. In parallel, I am escalating this issue to Bugzilla to get Engineering's opinion, as I have failed to find what could have caused the memory corruption reported by glibc in "ps".
This request was evaluated by Red Hat Product Management for inclusion, but this component is not scheduled to be updated in the current Red Hat Enterprise Linux release. If you would like this request to be reviewed for the next minor release, ask your support representative to set the next rhel-x.y flag to "?".
Hello, since the problem occurred only *once* and the issue tracker is closed, can I close this as WORKSFORME?
Created attachment 359482
reproducer program

Attaching reproducer and procedure.

To reproduce:

1) Build the two executables, create_zombie and dummy_sleep:
   $ make

2) Run "dummy_sleep" in a loop:
   $ for i in `seq 1 1 10000`; do ./create_zombie 2 & done

3) In a separate terminal/console, run "ps -eo pid,args" in a loop:
   $ while $(ps -eo pid,args > log.txt); do /bin/true; done

Actual results:
ps will abort after a few seconds with:
*** glibc detected *** ps: double free or corruption (out) ***

Expected results:
ps does not abort

Additional info:
The problem is related to the patch from bug #134516 ("ps truncates line to 2048 characters"), and more precisely to this change:
https://bugzilla.redhat.com/show_bug.cgi?id=134516#c24

Using:
    while ((n = read(fd, buf, sizeof buf - 1)) > 0)
instead of:
    while ((n = read(fd, buf, sizeof buf - 1)) >= 0)
does not trigger the corruption, but I am not entirely sure why...
Created attachment 359498
Proposed patch

I think what happens is the following.

With "while ((n = read(fd, buf, sizeof buf - 1)) >= 0)", a read that returns 0 still enters the loop body, and "end_of_file" is set to 1 by:

    if (n < (int)(sizeof buf - 1))
        end_of_file = 1;

At the same time, with n = 0, buf[n-1] is buf[-1], a byte outside the data just read, so its value is unpredictable. Whenever that byte happens to be null, the following test is false:

    if (end_of_file && buf[n-1])    /* last read char not null */
        buf[n++] = '\0';            /* so append null-terminator */

So no null-terminator is inserted, and that breaks the computation of the string array entries later in the code.

Adding a test for n == 0 avoids the problem:

    if (end_of_file && (n == 0 || buf[n-1]))    /* last read char not null */
        buf[n++] = '\0';                        /* so append null-terminator */

The reproducer works fine with that patch.
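The difference between the two tests can be shown in isolation. The sketch below models only the terminator step and is not the procps code; terminate_original() and terminate_patched() are hypothetical names. Because the original condition is "buf[n-1] is non-null", an empty read goes unterminated exactly when the byte before the buffer is zero, so the outcome depends on unrelated memory. To keep the demonstration well defined, the caller must pass a pointer at least one byte into a larger array, so that buf[-1] is a valid element.

```c
#include <assert.h>

/*
 * Hypothetical models of the terminator step only. `buf` must point at
 * least one byte INTO a larger array so that reading buf[-1] is defined.
 * Both return the new byte count.
 */

/* Original test: with n == 0 it inspects buf[-1], a byte read() never wrote. */
int terminate_original(char *buf, int n, int end_of_file)
{
    assert(n >= 0);
    if (end_of_file && buf[n - 1])  /* last read char not null */
        buf[n++] = '\0';            /* so append null-terminator */
    return n;
}

/* Patched test: an empty read always gets a terminator. */
int terminate_patched(char *buf, int n, int end_of_file)
{
    assert(n >= 0);
    if (end_of_file && (n == 0 || buf[n - 1]))
        buf[n++] = '\0';
    return n;
}
```

With a zero byte in front of the buffer, terminate_original(buf, 0, 1) returns 0 and leaves the data empty and unterminated, whereas terminate_patched(buf, 0, 1) returns 1 with buf[0] set to '\0'. With a non-zero byte in front, both variants append the terminator, which would be consistent with the crash being intermittent.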
The same problem is present in RHEL-4 (bug #521200). The same patch fixes it.
Fixed in procps-3.2.7-12.el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2010-0200.html