Bug 498126 - multi-threaded application dumps a core with wrong thread information
Status: CLOSED DUPLICATE of bug 503553
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: i686 Linux
Priority: low
Severity: medium
Target Milestone: rc
Assigned To: Neil Horman
QA Contact: Red Hat Kernel QE team

Reported: 2009-04-28 20:59 EDT by costlow
Modified: 2009-06-08 11:00 EDT (History)
Doc Type: Bug Fix
Last Closed: 2009-06-04 06:32:29 EDT

Attachments
test program that when crashed shows only 1 thread instead of 2 (325 bytes, application/octet-stream)
2009-04-28 20:59 EDT, costlow
test program that when crashed shows only 1 thread instead of 2 (304 bytes, application/octet-stream)
2009-04-28 21:02 EDT, costlow
patch from other bugzilla (792 bytes, patch)
2009-06-02 12:16 EDT, Neil Horman
Description costlow 2009-04-28 20:59:31 EDT
Created attachment 341680
test program that when crashed shows only 1 thread instead of 2

Description of problem:
The attached C program, when compiled for 32 bits, crashes and dumps core.  When you load the core file into gdb, 'info threads' reports the wrong number of threads: 1 instead of 2.

When compiled for 64 bits, or compiled for 32 bits and run under a 64 bit kernel, the problem does not occur.

To make things more confusing, if the 'double foo = 12.0' line, which has no functional effect, is removed, the problem does not occur.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux 5.3


How reproducible:
Always

Steps to Reproduce:
1. Compile the attached program: cc -o test test.c -lpthread
2. Make sure core dumps are enabled: ulimit -c unlimited
3. Run the program: ./test
4. Open the core in gdb: gdb ./test core.<pid>
5. Run 'info threads'.  If you see 1 thread, you have hit the bug.  If you see 2 threads, the dump is correct.
  
Actual results:
1 thread shown

Expected results:
2 threads shown.

Additional info:
6. Edit test.c to remove the 'double foo = 12.0' line.
7. Recompile.
8. Run the program.
9. Open the new core in gdb and notice that you now see 2 threads.
Comment 1 costlow 2009-04-28 21:02:29 EDT
Created attachment 341681
test program that when crashed shows only 1 thread instead of 2
Comment 2 costlow 2009-04-28 21:06:49 EDT
Note that this is only a problem when doing post-mortem analysis of the core file with gdb.  If you run the executable directly under gdb, you will always correctly see 2 threads.
Comment 3 Hiroto Shibuya 2009-05-01 10:51:34 EDT
This is a regression in the 2.6.18-128 kernel.  Previous kernels do not exhibit 
this problem. 

I tracked this down to this single patch:

linux-2.6-misc-pipe-support-to-proc-sys-net-core_pattern.patch

and, within it, to this specific code that the patch adds to fs/binfmt_elf.c:

static int alignfile(struct file *file, loff_t *foffset)
{
        char buf[4] = { 0, };
        int extra = roundup(*foffset,4);
        extra -= *foffset;

        if ((extra > 0) && (extra < 4))
                DUMP_WRITE(buf, extra, foffset);
        return 1;
}

Now, this code is perfectly correct.  But I have observed that there 
are situations where "extra" comes out as 5 or 6, which is supposed to be 
impossible when you are rounding up to a multiple of 4; the expected values
in those cases are 1 or 2.

This results in unaligned data in the core file after certain note 
segment names, especially "LINUX" in the floating point register dump.  

This is the way it is supposed to be aligned:
 
00000c00  00 00 00 00 00 00 00 00  00 e0 4f 8d 97 6e 12 83  |..........O..n..|
00000c10  f5 3f 00 dc a1 45 b6 f3  fd d4 f8 3f 06 00 00 00  |.?...E.....?....|
00000c20  00 02 00 00 7f 2b e6 46  4c 49 4e 55 58 00 00 00  |.....+.FLINUX...|
00000c30  7f 03 20 02 00 00 1d 05  78 84 04 08 73 00 00 00  |.. .....x...s...|

Note the three null bytes after "LINUX".

And here is the corrupted core file, where only one null byte follows "LINUX":

000008c0  00 00 00 00 00 00 00 00  00 e0 4f 8d 97 6e 12 83  |..........O..n..|
000008d0  f5 3f 00 dc f7 53 e3 a5  9b c4 f8 3f 06 00 00 00  |.?...S.....?....|
000008e0  00 02 00 00 7f 2b e6 46  4c 49 4e 55 58 00 7f 03  |.....+.FLINUX...|

readelf clearly reports the corruption if you try to dump out the 
note segments:

$ readelf -n core.3494

Notes at offset 0x00000354 with length 0x00000adc:
  Owner         Data size       Description
  CORE          0x00000090      NT_PRSTATUS (prstatus structure)
  CORE          0x0000007c      NT_PRPSINFO (prpsinfo structure)
  CORE          0x00000090      NT_AUXV (auxiliary vector)
  CORE          0x0000006c      NT_FPREGSET (floating point registers)
  LINUX         0x00000200      NT_PRXFPREG (user_xfpregs structure)
readelf: Warning: corrupt note found at offset 46c into core notes
readelf: Warning:  type: 4f430000, namesize: 00900000, descsize: 00010000
Comment 4 Hiroto Shibuya 2009-05-01 10:58:15 EDT
Changing the above function to the latest in trunk does NOT fix the problem:

static int alignfile(struct file *file, loff_t *foffset)
{
        static const char buf[4] = { 0, };
        DUMP_WRITE(buf, roundup(*foffset, 4) - *foffset, foffset);
        return 1;
}

(or rather, it corrupts the file in a different way, by adding 5 or 6 bytes of padding)

The following code DOES work around the compiler issue. (It was a wild guess, 
and I don't know why it works.) 

static int alignfile(struct file *file, loff_t *foffset)
{
        char buf[4] = { 0, };
        int extra = roundup((int)*foffset,4);
        extra -= (int)*foffset;

        if ((extra > 0) && (extra < 4))
                DUMP_WRITE(buf, extra, foffset);
        return 1;
}
Comment 5 Hiroto Shibuya 2009-05-01 11:13:35 EDT
I should say that I believe this is a gcc issue in compiling the kernel code 
rather than a kernel issue.
Comment 6 Hiroto Shibuya 2009-05-01 11:39:12 EDT
Here are examples of failed alignment from my debug printk in alignfile:

alignfile: didn't align extra:4 roundup:0xcdc foffset:0xcd8
alignfile: didn't align extra:4 roundup:0xd6c foffset:0xd68
alignfile: didn't align extra:4 roundup:0xe10 foffset:0xe0c
alignfile: didn't align extra:4 roundup:0xe90 foffset:0xe8c
alignfile: didn't align extra:6 roundup:0xea4 foffset:0xe9e
alignfile: didn't align extra:6 roundup:0x10a4 foffset:0x109e
alignfile: didn't align extra:5 roundup:0x10b4 foffset:0x10af
alignfile: didn't align extra:5 roundup:0x1144 foffset:0x113f
alignfile: didn't align extra:4 roundup:0x1154 foffset:0x1150
alignfile: didn't align extra:4 roundup:0x11c0 foffset:0x11bc
alignfile: didn't align extra:6 roundup:0x11d4 foffset:0x11ce
alignfile: didn't align extra:6 roundup:0x13d4 foffset:0x13ce
alignfile: didn't align extra:5 roundup:0x13e4 foffset:0x13df
alignfile: didn't align extra:5 roundup:0x1474 foffset:0x146f
alignfile: didn't align extra:4 roundup:0x1484 foffset:0x1480
alignfile: didn't align extra:4 roundup:0x14f0 foffset:0x14ec
alignfile: didn't align extra:6 roundup:0x1504 foffset:0x14fe
alignfile: didn't align extra:6 roundup:0x1704 foffset:0x16fe
alignfile: didn't align extra:5 roundup:0x1714 foffset:0x170f
alignfile: didn't align extra:5 roundup:0x17a4 foffset:0x179f
alignfile: didn't align extra:4 roundup:0x17b4 foffset:0x17b0
alignfile: didn't align extra:4 roundup:0x1820 foffset:0x181c
alignfile: didn't align extra:6 roundup:0x1834 foffset:0x182e
alignfile: didn't align extra:6 roundup:0x1a34 foffset:0x1a2e
Comment 7 Neil Horman 2009-06-02 11:23:59 EDT
Please test with the patch in bug 503553.  I think this is a dup of that bug.
Comment 8 Hiroto Shibuya 2009-06-02 11:34:22 EDT
I get an authorization error trying to access bug 503553.  Can you loosen up the access on that bug?
Comment 9 Neil Horman 2009-06-02 12:16:18 EDT
Created attachment 346281
patch from other bugzilla

Here's the patch from that bug for you.
Comment 10 Hiroto Shibuya 2009-06-03 14:53:53 EDT
That patch worked for me.  Thank you.
Comment 11 Neil Horman 2009-06-04 06:32:29 EDT
Copy that.  Thanks.

*** This bug has been marked as a duplicate of bug 503553 ***
