487700 – double free or corruption detected in ps

Summary: double free or corruption detected in ps

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	procps
Sub Component:
Version:	5.3
Hardware:	All
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	rc
Target Release:	5.5
Assignee:	Daniel Novotny
QA Contact:	BaseOS QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	499522

Reported:	2009-02-27 15:00 UTC by Olivier Fourdan
Modified:	2018-12-01 16:00 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-03-30 08:06:15 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
reproducer program (1.16 KB, application/octet-stream) 2009-09-02 09:01 UTC, Olivier Fourdan
Proposed patch (655 bytes, patch) 2009-09-02 10:04 UTC, Olivier Fourdan

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Knowledge Base (Solution)	29770	0	None	None	None	Never
Red Hat Product Errata	RHBA-2010:0200	0	normal	SHIPPED_LIVE	procps bug fix and enhancement update	2010-03-29 12:26:34 UTC

Description Olivier Fourdan 2009-02-27 15:00:01 UTC

Description of problem:

The customer uses "ps" within a script to monitor processes and noticed that "ps" died once because glibc detected a double free or corruption.

Version-Release number of selected component (if applicable):

procps-3.2.7-8.1.el5 on x86_64

How reproducible:

Cannot reproduce

Steps to Reproduce:

1. run "ps -e -o user -o pid -o ppid -o args"
  
Actual results:

*** glibc detected *** ps: double free or corruption (out): 0x000000000bacf680 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3912e6f444]
/lib64/libc.so.6(cfree+0x8c)[0x3912e72a6c]
ps[0x402393]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3912e1d8a4]
ps[0x4019f9]

Expected results:

No error

Additional info:

We do not have a core file for that crash, and as far as I know, the problem occurred only once for the customer and we have not found a way to reproduce that.

I tried to reproduce the issue using a very long cmdline containing multibyte characters with the "ja_JP.UTF-8" locale, without success.

Customer suspected the length of the cmdline string could be the problem but:

1) The size of the cmdline in /proc/$PID/cmdline is limited to PAGE_SIZE in the kernel's "fs/proc/base.c":

    static int proc_pid_cmdline(struct task_struct *task, char * buffer)
    {
            int res = 0;
            unsigned int len;
            struct mm_struct *mm = get_task_mm(task);
            ...
            len = mm->arg_end - mm->arg_start;

            if (len > PAGE_SIZE)
                    len = PAGE_SIZE;

So the cmdline cannot exceed 4096 bytes, and:

2) the code in proc/readproc.c can deal with any size of cmdline anyway.

I also suspected the use of UTF-8 (as the locale used is "ja_JP.UTF-8") with a longer cmdline, but could not reproduce the problem that way either.

The backtrace in the log gives:

    *** glibc detected *** ps: double free or corruption (out): 0x000000000bacf680 ***
    ======= Backtrace: =========
    /lib64/libc.so.6[0x3912e6f444]
    /lib64/libc.so.6(cfree+0x8c)[0x3912e72a6c]
    ps[0x402393]
    /lib64/libc.so.6(__libc_start_main+0xf4)[0x3912e1d8a4]
    ps[0x4019f9]
    ======= Memory map: ========
    00400000-00413000 r-xp 00000000 08:02 3124268                            /bin/ps
    00613000-00614000 rw-p 00013000 08:02 3124268                            /bin/ps
    00614000-00634000 rw-p 00614000 00:00 0 
    0bacd000-0baee000 rw-p 0bacd000 00:00 0 
    3912a00000-3912a1a000 r-xp 00000000 08:02 1887841                        /lib64/ld-2.5.so
    3912c19000-3912c1a000 r--p 00019000 08:02 1887841                        /lib64/ld-2.5.so
    3912c1a000-3912c1b000 rw-p 0001a000 08:02 1887841                        /lib64/ld-2.5.so
    3912e00000-3912f46000 r-xp 00000000 08:02 1887842                        /lib64/libc-2.5.so
    3912f46000-3913146000 ---p 00146000 08:02 1887842                        /lib64/libc-2.5.so
    3913146000-391314a000 r--p 00146000 08:02 1887842                        /lib64/libc-2.5.so
    391314a000-391314b000 rw-p 0014a000 08:02 1887842                        /lib64/libc-2.5.so
    391314b000-3913150000 rw-p 391314b000 00:00 0 
    3913200000-3913202000 r-xp 00000000 08:02 1887843                        /lib64/libdl-2.5.so
    3913202000-3913402000 ---p 00002000 08:02 1887843                        /lib64/libdl-2.5.so
    3913402000-3913403000 r--p 00002000 08:02 1887843                        /lib64/libdl-2.5.so
    3913403000-3913404000 rw-p 00003000 08:02 1887843                        /lib64/libdl-2.5.so
    3913600000-391360d000 r-xp 00000000 08:02 1887857                        /lib64/libproc-3.2.7.so
    391360d000-391380d000 ---p 0000d000 08:02 1887857                        /lib64/libproc-3.2.7.so
    391380d000-391380e000 rw-p 0000d000 08:02 1887857                        /lib64/libproc-3.2.7.so
    391380e000-3913822000 rw-p 391380e000 00:00 0 
    3922400000-392240d000 r-xp 00000000 08:02 1887848                        /lib64/libgcc_s-4.1.2-20070626.so.1
    392240d000-392260d000 ---p 0000d000 08:02 1887848                        /lib64/libgcc_s-4.1.2-20070626.so.1
    392260d000-392260e000 rw-p 0000d000 08:02 1887848                        /lib64/libgcc_s-4.1.2-20070626.so.1
    2aaaaaaab000-2aaaaaaac000 rw-p 2aaaaaaab000 00:00 0 
    2aaaaaac7000-2aaaaaaeb000 rw-p 2aaaaaac7000 00:00 0 
    2aaaaaaeb000-2aaaaaaec000 ---p 2aaaaaaeb000 00:00 0 
    2aaaaaaec000-2aaaaaaed000 rw-p 2aaaaaaec000 00:00 0 
    2aaaaab07000-2aaaaab11000 r-xp 00000000 08:02 1887580                    /lib64/libnss_files-2.5.so
    2aaaaab11000-2aaaaad10000 ---p 0000a000 08:02 1887580                    /lib64/libnss_files-2.5.so
    2aaaaad10000-2aaaaad11000 r--p 00009000 08:02 1887580                    /lib64/libnss_files-2.5.so
    2aaaaad11000-2aaaaad12000 rw-p 0000a000 08:02 1887580                    /lib64/libnss_files-2.5.so
    2aaaac000000-2aaaac021000 rw-p 2aaaac000000 00:00 0 
    2aaaac021000-2aaab0000000 ---p 2aaaac021000 00:00 0 
    7fff37085000-7fff3709a000 rw-p 7fff37085000 00:00 0                      [stack]
    ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0                  [vdso]


    Signal 6 (ABRT) caught by ps (procps version 3.2.7).
    Please send bug reports to <feedback.net> or <albert.net>

Checking the address that led to the free() error gives:

    (gdb) list *0x402393
    0x402393 is in main (ps/display.c:343).
    338       case TF_show_proc:                   // normal non-thread output
    339         while(readproc(ptp,&buf)){
    340           if(want_this_proc(&buf)){
    341             show_one_proc(&buf, proc_format_list);
    342           }
    343           if(buf.cmdline) free((void*)*buf.cmdline); // ought to reuse
    344           if(buf.environ) free((void*)*buf.environ); // ought to reuse
    345         }
    346         break;


I checked the code but did not spot any obvious problem. Those buffers are allocated by readproc(), which resolves to simple_readproc(); simple_readproc() in turn uses file2strvec() to allocate the memory and read the data from the proc file.
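The free((void*)*buf.cmdline) call in the listing above frees the first string rather than the vector itself, which implies that the strings and the pointer table share one allocation. What follows is a minimal sketch of such a layout, assuming NUL-separated input. The helper name strvec_from_buffer() is hypothetical and this is not the procps implementation. It also shows why the final NUL matters: the size of the pointer table is derived from the NUL bytes.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not procps code: turn `len` bytes of NUL-separated
 * strings into a NULL-terminated vector stored in ONE heap block.
 * Layout: [string bytes][padding][pointer table]; vec[0] is the start of
 * the block, so free(vec[0]) (that is, free(*vec)) releases everything. */
char **strvec_from_buffer(const char *data, size_t len)
{
    /* Refuse input whose last string is unterminated: the table size below
     * is derived from the NUL bytes, so a missing terminator would leave
     * the table too small for the strings actually present. */
    if (len == 0 || data[len - 1] != '\0')
        return NULL;

    size_t nstr = 0;
    for (size_t i = 0; i < len; i++)
        if (data[i] == '\0')
            nstr++;

    /* Round the string area up so the pointer table is aligned. */
    size_t align = sizeof(char *);
    size_t table_off = (len + align - 1) / align * align;

    char *block = malloc(table_off + (nstr + 1) * sizeof(char *));
    if (block == NULL)
        return NULL;
    memcpy(block, data, len);

    char **vec = (char **)(block + table_off);
    size_t k = 0;
    vec[k++] = block;                       /* first string = block start */
    for (size_t i = 0; i + 1 < len; i++)    /* skip the final NUL */
        if (block[i] == '\0')
            vec[k++] = block + i + 1;       /* next string follows a NUL */
    vec[k] = NULL;                          /* vector terminator */
    return vec;
}
```

With this layout a caller walks vec[0], vec[1], ... until NULL and releases the whole block with a single free(*vec), the same shape as the free() calls in ps/display.c.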

I do not see want_this_proc() or show_one_proc() allocating or deallocating memory, so I don't think the problem comes from these functions (although they seem to use functions that are not multibyte-aware, such as strlen(), to compute output column alignment, that should not cause any problem other than wrong output alignment).

I have now provided the customer with a slightly modified version of procps that does *not* catch the SIGSEGV and SIGABRT signals, so that we have a chance to capture a core file if the problem recurs. In parallel, I am escalating this issue to Bugzilla to get Engineering's opinion, as I cannot find what could have caused the memory corruption that glibc reported in "ps".

Comment 1 RHEL Program Management 2009-03-26 17:27:18 UTC

This request was evaluated by Red Hat Product Management for
inclusion, but this component is not scheduled to be updated in
the current Red Hat Enterprise Linux release. If you would like
this request to be reviewed for the next minor release, ask your
support representative to set the next rhel-x.y flag to "?".

Comment 3 Daniel Novotny 2009-05-11 11:09:36 UTC

Hello, since the problem occurred only *once* and the issue tracker is closed, can I close this as WORKSFORME?

Comment 6 Olivier Fourdan 2009-09-02 09:01:10 UTC

Created attachment 359482 [details]
reproducer program

Attaching reproducer and procedure.

To reproduce:

1) Build the two executables create_zombie and dummy_sleep:

   $ make

2) Run "create_zombie" in a loop:

   $ for i in `seq 1 1 10000`; do ./create_zombie 2 & done

3) In a separate terminal/console, run ps -eo pid,args in a loop

   $ while $(ps -eo pid,args > log.txt); do /bin/true; done
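The attached reproducer is not shown here, but its name suggests that it relies on zombie processes. On Linux, a zombie no longer has an address space, so reading its /proc/<pid>/cmdline returns zero bytes. The following sketch is a hypothetical illustration, not the attached program: it creates one zombie and reports how many bytes read() returns for it. The function name zombie_cmdline_bytes() is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical illustration (not the attached reproducer): fork a child,
 * let it become a zombie, and return how many bytes read() yields from its
 * /proc/<pid>/cmdline. Returns -1 on any error. */
long zombie_cmdline_bytes(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0);                       /* child: exit immediately */

    /* Wait until the child has exited, but leave it unreaped (a zombie). */
    siginfo_t info;
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) != 0)
        return -1;

    char path[64], buf[256];
    long n = -1;
    snprintf(path, sizeof path, "/proc/%ld/cmdline", (long)pid);
    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
        n = (long)read(fd, buf, sizeof buf);
        close(fd);
    }

    waitpid(pid, NULL, 0);              /* reap the zombie */
    return n;
}
```

A zero-length first read like this is the n = 0 case that the loop in the "ps" library code has to handle.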

Actual results:

ps will abort after a few seconds with a: 

  *** glibc detected *** ps: double free or corruption (out) ***

Expected results:

  ps does not abort

Additional info:

The problem is related to the patch from bug#134516 ("ps truncates line to 
2048 characters") and more precisely to that change:

  https://bugzilla.redhat.com/show_bug.cgi?id=134516#c24

Using:

  while ((n = read(fd, buf, sizeof buf - 1)) > 0)

Instead of:

  while ((n = read(fd, buf, sizeof buf - 1)) >= 0)

does not trigger the corruption but I am not entirely sure why...
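For comparison, here is a minimal sketch of a read loop written with "> 0". It is not the procps code, and the name read_all() and its signature are assumptions. The loop body never runs for a zero-length read, so no code inside it ever sees n = 0.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: read everything from `fd` into a heap buffer.
 * Returns 0 on success (with *out == NULL and *out_len == 0 for an empty
 * file) and -1 on read or allocation failure. EINTR is not retried. */
int read_all(int fd, char **out, size_t *out_len)
{
    char chunk[2048];
    char *data = NULL;
    size_t total = 0;
    ssize_t n;

    /* "> 0" leaves the loop on end of file (0) and on error (-1) alike,
     * so the body only ever handles chunks with at least one byte. */
    while ((n = read(fd, chunk, sizeof chunk)) > 0) {
        char *grown = realloc(data, total + (size_t)n);
        if (grown == NULL) {
            free(data);
            return -1;
        }
        data = grown;
        memcpy(data + total, chunk, (size_t)n);
        total += (size_t)n;
    }
    if (n < 0) {                        /* distinguish error from EOF */
        free(data);
        return -1;
    }
    *out = data;
    *out_len = total;
    return 0;
}
```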

Comment 8 Olivier Fourdan 2009-09-02 10:04:37 UTC

Created attachment 359498 [details]
Proposed patch

I think what happens is the following:

With "while ((n = read(fd, buf, sizeof buf - 1)) >= 0)", "end_of_file" is set to 1 by:

        if (n < (int)(sizeof buf - 1))
            end_of_file = 1;
 
At the same time, with n = 0, buf[n-1] is buf[-1], which points outside the data just read (uninitialized data as far as this code is concerned), so the value of buf[n-1] is likely to be non-null and the test is therefore false:

        if (end_of_file && buf[n-1])            /* last read char not null */
            buf[n++] = '\0';                    /* so append null-terminator */

So no null-terminator is inserted. And that breaks the computation of the string array entries later in the code.

Adding a test for n == 0 avoids the problem:

        if (end_of_file && (n == 0 || buf[n-1]))/* last read char not null */
            buf[n++] = '\0';                    /* so append null-terminator */

The reproducer works fine with that patch.
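The patched condition can be isolated in a small function to check its behaviour for each case. This is a sketch only: terminate_last_chunk() is a hypothetical name, and the real code performs the test inline in the read loop.

```c
/* Hypothetical helper mirroring the patched test: given `n` bytes just read
 * into `buf` (which must have room for one more byte), append a NUL if this
 * was the last chunk and it is empty or does not already end in a NUL.
 * Returns the new byte count. */
int terminate_last_chunk(char *buf, int n, int end_of_file)
{
    /* n == 0 is tested first, so buf[n - 1] is never evaluated as buf[-1]. */
    if (end_of_file && (n == 0 || buf[n - 1] != '\0'))
        buf[n++] = '\0';
    return n;
}
```

An empty last chunk now contributes a single NUL, an unterminated chunk gains one, and a chunk that already ends in NUL is left unchanged.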

Comment 10 Tomas Smetana 2009-09-04 09:41:10 UTC

Same problem present in RHEL-4 (bug #521200). Same patch fixes the problem.

Comment 14 Daniel Novotny 2009-11-19 14:39:15 UTC

fixed in procps-3.2.7-12.el5

Comment 18 errata-xmlrpc 2010-03-30 08:06:15 UTC

An advisory has been issued which should resolve the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0200.html
