Bug 487700

Summary: double free or corruption detected in ps
Product: Red Hat Enterprise Linux 5
Reporter: Olivier Fourdan <ofourdan>
Component: procps
Assignee: Daniel Novotny <dnovotny>
Status: CLOSED ERRATA
QA Contact: BaseOS QE <qe-baseos-auto>
Severity: medium
Priority: high
Version: 5.3
CC: albert, bhubbard, cward, kem, mosvald, psplicha, rvokal, tao
Target Milestone: rc
Keywords: Patch
Target Release: 5.5
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-03-30 08:06:15 UTC
Bug Blocks: 499522
Attachments:
  reproducer program (flags: none)
  Proposed patch (flags: none)

Description Olivier Fourdan 2009-02-27 15:00:01 UTC
Description of problem:

Customer is using "ps" within a script to monitor processes and noticed that "ps" died once because glibc detected a double free or corruption.

Version-Release number of selected component (if applicable):

procps-3.2.7-8.1.el5 on x86_64

How reproducible:

Cannot reproduce

Steps to Reproduce:

1. run "ps -e -o user -o pid -o ppid -o args"
  
Actual results:

*** glibc detected *** ps: double free or corruption (out): 0x000000000bacf680 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3912e6f444]
/lib64/libc.so.6(cfree+0x8c)[0x3912e72a6c]
ps[0x402393]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x3912e1d8a4]
ps[0x4019f9]

Expected results:

No error

Additional info:

We do not have a core file for that crash. As far as I know, the problem occurred only once for the customer, and we have not found a way to reproduce it.

I tried to reproduce the issue using a very long cmdline containing multibyte characters with the "ja_JP.UTF-8" locale, without success.

Customer suspected the length of the cmdline string could be the problem but:

1) The size of the cmdline in /proc/$PID/cmdline is limited to PAGE_SIZE in the kernel's "fs/proc/base.c":

    static int proc_pid_cmdline(struct task_struct *task, char * buffer)
    {
           int res = 0;
           unsigned int len;
           struct mm_struct *mm = get_task_mm(task);
           ...
           len = mm->arg_end - mm->arg_start;

           if (len > PAGE_SIZE)
                   len = PAGE_SIZE;

So the cmdline read from /proc cannot exceed PAGE_SIZE (4096 bytes on x86_64), and:

2) the code in proc/readproc.c can deal with any size of cmdline anyway.

I also suspected the use of UTF-8 (the locale used is "ja_JP.UTF-8") combined with a longer cmdline, but could not reproduce the problem that way either.

The backtrace in the log gives:

    *** glibc detected *** ps: double free or corruption (out): 0x000000000bacf680 ***
    ======= Backtrace: =========
    /lib64/libc.so.6[0x3912e6f444]
    /lib64/libc.so.6(cfree+0x8c)[0x3912e72a6c]
    ps[0x402393]
    /lib64/libc.so.6(__libc_start_main+0xf4)[0x3912e1d8a4]
    ps[0x4019f9]
    ======= Memory map: ========
    00400000-00413000 r-xp 00000000 08:02 3124268                            /bin/ps
    00613000-00614000 rw-p 00013000 08:02 3124268                            /bin/ps
    00614000-00634000 rw-p 00614000 00:00 0 
    0bacd000-0baee000 rw-p 0bacd000 00:00 0 
    3912a00000-3912a1a000 r-xp 00000000 08:02 1887841                        /lib64/ld-2.5.so
    3912c19000-3912c1a000 r--p 00019000 08:02 1887841                        /lib64/ld-2.5.so
    3912c1a000-3912c1b000 rw-p 0001a000 08:02 1887841                        /lib64/ld-2.5.so
    3912e00000-3912f46000 r-xp 00000000 08:02 1887842                        /lib64/libc-2.5.so
    3912f46000-3913146000 ---p 00146000 08:02 1887842                        /lib64/libc-2.5.so
    3913146000-391314a000 r--p 00146000 08:02 1887842                        /lib64/libc-2.5.so
    391314a000-391314b000 rw-p 0014a000 08:02 1887842                        /lib64/libc-2.5.so
    391314b000-3913150000 rw-p 391314b000 00:00 0 
    3913200000-3913202000 r-xp 00000000 08:02 1887843                        /lib64/libdl-2.5.so
    3913202000-3913402000 ---p 00002000 08:02 1887843                        /lib64/libdl-2.5.so
    3913402000-3913403000 r--p 00002000 08:02 1887843                        /lib64/libdl-2.5.so
    3913403000-3913404000 rw-p 00003000 08:02 1887843                        /lib64/libdl-2.5.so
    3913600000-391360d000 r-xp 00000000 08:02 1887857                        /lib64/libproc-3.2.7.so
    391360d000-391380d000 ---p 0000d000 08:02 1887857                        /lib64/libproc-3.2.7.so
    391380d000-391380e000 rw-p 0000d000 08:02 1887857                        /lib64/libproc-3.2.7.so
    391380e000-3913822000 rw-p 391380e000 00:00 0 
    3922400000-392240d000 r-xp 00000000 08:02 1887848                        /lib64/libgcc_s-4.1.2-20070626.so.1
    392240d000-392260d000 ---p 0000d000 08:02 1887848                        /lib64/libgcc_s-4.1.2-20070626.so.1
    392260d000-392260e000 rw-p 0000d000 08:02 1887848                        /lib64/libgcc_s-4.1.2-20070626.so.1
    2aaaaaaab000-2aaaaaaac000 rw-p 2aaaaaaab000 00:00 0 
    2aaaaaac7000-2aaaaaaeb000 rw-p 2aaaaaac7000 00:00 0 
    2aaaaaaeb000-2aaaaaaec000 ---p 2aaaaaaeb000 00:00 0 
    2aaaaaaec000-2aaaaaaed000 rw-p 2aaaaaaec000 00:00 0 
    2aaaaab07000-2aaaaab11000 r-xp 00000000 08:02 1887580                    /lib64/libnss_files-2.5.so
    2aaaaab11000-2aaaaad10000 ---p 0000a000 08:02 1887580                    /lib64/libnss_files-2.5.so
    2aaaaad10000-2aaaaad11000 r--p 00009000 08:02 1887580                    /lib64/libnss_files-2.5.so
    2aaaaad11000-2aaaaad12000 rw-p 0000a000 08:02 1887580                    /lib64/libnss_files-2.5.so
    2aaaac000000-2aaaac021000 rw-p 2aaaac000000 00:00 0 
    2aaaac021000-2aaab0000000 ---p 2aaaac021000 00:00 0 
    7fff37085000-7fff3709a000 rw-p 7fff37085000 00:00 0                      [stack]
    ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0                  [vdso]


    Signal 6 (ABRT) caught by ps (procps version 3.2.7).
    Please send bug reports to <feedback.net> or <albert.net>

Checking the address that led to the free() error gives:

    (gdb) list *0x402393
    0x402393 is in main (ps/display.c:343).
    338       case TF_show_proc:                   // normal non-thread output
    339         while(readproc(ptp,&buf)){
    340           if(want_this_proc(&buf)){
    341             show_one_proc(&buf, proc_format_list);
    342           }
    343           if(buf.cmdline) free((void*)*buf.cmdline); // ought to reuse
    344           if(buf.environ) free((void*)*buf.environ); // ought to reuse
    345         }
    346         break;


I checked the code but did not spot any obvious problem. Those buffers are allocated by readproc(), which calls simple_readproc(), which in turn uses file2strvec() to allocate the memory and read the data from the proc file.
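
The single free() of *buf.cmdline in the listing above suggests that the strings and the pointer vector share one allocation. A minimal sketch of such a layout follows, assuming the raw data is a run of NUL-separated strings ending in '\0'. The name strvec_from_buffer is hypothetical and this is not the procps code; it only illustrates why the pointer computation depends on the final terminator being present.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not the procps implementation: copies len bytes of
 * NUL-separated strings (the last byte must be '\0') into ONE malloc'd block
 * laid out as [string data][padding][pointer vector][NULL]. */
char **strvec_from_buffer(const char *data, size_t len)
{
    size_t count = 0, padded, i;
    char *block, **vec, **q;

    assert(data != NULL && len > 0 && data[len - 1] == '\0');

    for (i = 0; i < len; i++)           /* one string per terminator */
        if (data[i] == '\0')
            count++;

    /* round the data size up so the pointer vector is properly aligned */
    padded = (len + sizeof(char *) - 1) / sizeof(char *) * sizeof(char *);

    block = malloc(padded + (count + 1) * sizeof(char *));
    if (block == NULL)
        return NULL;
    memcpy(block, data, len);

    vec = q = (char **)(block + padded);
    *q++ = block;                       /* vec[0] is the start of the block */
    for (i = 0; i + 1 < len; i++)       /* skip the final terminator */
        if (block[i] == '\0')
            *q++ = block + i + 1;
    *q = NULL;                          /* NULL pointer ends the vector */
    return vec;
}
```

Because vec[0] points at the start of the block, free(*vec) releases the strings and the vector together. If the final '\0' were missing, the string count and the pointers derived from it would no longer match the data.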

I do not see want_this_proc() or show_one_proc() allocating or deallocating memory, so I don't think the problem comes from these functions (they do seem to use functions that are not multibyte-aware, such as strlen(), to compute output column alignment, but that should cause nothing worse than misaligned output).

I've now provided the customer with a slightly modified version of procps that does *not* catch the SIGSEGV and SIGABRT signals, so that we have a chance to capture a core file if the problem recurs. In parallel, I am escalating this issue to Bugzilla to get Engineering's opinion, as I cannot find what could have caused the memory corruption that glibc reported in "ps".

Comment 1 RHEL Program Management 2009-03-26 17:27:18 UTC
This request was evaluated by Red Hat Product Management for
inclusion, but this component is not scheduled to be updated in
the current Red Hat Enterprise Linux release. If you would like
this request to be reviewed for the next minor release, ask your
support representative to set the next rhel-x.y flag to "?".

Comment 3 Daniel Novotny 2009-05-11 11:09:36 UTC
Hello, since the problem occurred only *once* and the issue tracker is closed, can I close this as WORKSFORME?

Comment 6 Olivier Fourdan 2009-09-02 09:01:10 UTC
Created attachment 359482
reproducer program

Attaching reproducer and procedure.

To reproduce:

1) Build the two executables create_zombie and dummy_sleep:

   $ make

2) Run "create_zombie" in a loop:

   $ for i in `seq 1 1 10000`; do ./create_zombie 2 & done

3) In a separate terminal/console, run "ps -eo pid,args" in a loop:

   $ while $(ps -eo pid,args > log.txt); do /bin/true; done

Actual results:

ps aborts after a few seconds with:

  *** glibc detected *** ps: double free or corruption (out) ***

Expected results:

  ps does not abort

Additional info:

The problem is related to the patch from bug#134516 ("ps truncates line to
2048 characters") and, more precisely, to this change:

  https://bugzilla.redhat.com/show_bug.cgi?id=134516#c24

Using:

  while ((n = read(fd, buf, sizeof buf - 1)) > 0)

Instead of:

  while ((n = read(fd, buf, sizeof buf - 1)) >= 0)

does not trigger the corruption but I am not entirely sure why...
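
One observable difference between the two conditions is that read() returns 0 at end of file, so only the ">= 0" form enters the loop body with n == 0. The sketch below counts such iterations. The helper name zero_length_iterations is hypothetical, and an empty file such as /dev/null stands in for a proc file that yields no data, which is an assumption about when this case arises.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: returns how many times a read loop body is entered
 * with n == 0. accept_zero != 0 models the ">= 0" loop condition,
 * accept_zero == 0 models the "> 0" condition. Returns -1 if open fails. */
int zero_length_iterations(const char *path, int accept_zero)
{
    char buf[2048];
    int fd, hits = 0;
    ssize_t n;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    for (;;) {
        n = read(fd, buf, sizeof buf - 1);
        if (accept_zero ? n < 0 : n <= 0)
            break;                  /* loop condition is false */
        if (n == 0) {
            hits++;                 /* body entered with no data in buf */
            break;                  /* leave explicitly to avoid spinning */
        }
    }
    close(fd);
    return hits;
}
```

With "/dev/null" as the path, the ">= 0" variant reaches the body once with n == 0, while the "> 0" variant never does.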

Comment 8 Olivier Fourdan 2009-09-02 10:04:37 UTC
Created attachment 359498
Proposed patch

I think what happens is the following:

With "while ((n = read(fd, buf, sizeof buf - 1)) >= 0)", "end_of_file" is set to 1 by:

        if (n < (int)(sizeof buf - 1))
            end_of_file = 1;
 
At the same time, with n = 0, buf[n-1] is buf[-1], which lies outside the data that was read, so its value is arbitrary. If that byte happens to be null, the test is false:

        if (end_of_file && buf[n-1])            /* last read char not null */
            buf[n++] = '\0';                    /* so append null-terminator */

So no null-terminator is inserted, which breaks the computation of the string array entries later in the code.

Adding a test for n == 0 avoids the problem:

        if (end_of_file && (n == 0 || buf[n-1]))/* last read char not null */
            buf[n++] = '\0';                    /* so append null-terminator */

The reproducer works fine with that patch.
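
The effect of the added n == 0 guard can be sketched in isolation. The helpers terminate_chunk and empty_read_gets_terminator are hypothetical names, not procps functions. To keep the read of buf[-1] well defined in this demonstration, the buffer is placed one byte into a larger array, which is an assumption made purely for illustration.

```c
#include <assert.h>
#include <string.h>

/* Applies the end-of-file terminator test to a chunk of n bytes in buf and
 * returns the new length. patched selects the test with the n == 0 guard.
 * For n == 0 the unpatched test reads buf[-1], so the caller must make that
 * byte readable. */
int terminate_chunk(char *buf, int n, int end_of_file, int patched)
{
    assert(buf != NULL && n >= 0);
    if (patched) {
        if (end_of_file && (n == 0 || buf[n - 1]))
            buf[n++] = '\0';
    } else {
        if (end_of_file && buf[n - 1])
            buf[n++] = '\0';
    }
    return n;
}

/* Simulates an empty final read. stray is the byte sitting just before the
 * buffer. Returns 1 if a terminator ended up in buf[0], else 0. */
int empty_read_gets_terminator(int patched, char stray)
{
    char storage[8];
    char *buf = storage + 1;    /* buf[-1] is storage[0]: a defined read here */

    memset(storage, 'X', sizeof storage);
    storage[0] = stray;
    return terminate_chunk(buf, 0, 1, patched) == 1 && buf[0] == '\0';
}
```

In this sketch the unpatched test appends a terminator only when the stray byte before the buffer is non-zero, so the outcome depends on unrelated memory. The patched test appends it regardless.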

Comment 10 Tomas Smetana 2009-09-04 09:41:10 UTC
The same problem is present in RHEL-4 (bug #521200). The same patch fixes it.

Comment 14 Daniel Novotny 2009-11-19 14:39:15 UTC
Fixed in procps-3.2.7-12.el5

Comment 18 errata-xmlrpc 2010-03-30 08:06:15 UTC
An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0200.html