Description of problem:
A program compiled in an x86 environment and then moved to an AMD x86_64 (Opteron) system can cause a kernel seg fault. This happens when the program attempts to dump core. The problem is in file fs/binfmt_elf.c, in which array element notes[3] is not filled and yet is referenced. This array is placed on the stack. Should that stack location contain zero, a seg fault will occur. Once you hit the seg fault, it happens every time. If you recompile the kernel you may no longer hit it - though garbage data on the stack will still be used.

Version-Release number of selected component (if applicable):
2.4.21-1.1931.2.393.entsmp

How reproducible:
Always

Steps to Reproduce:
1. On an x86 system compile the test program like so:
   gcc self_test.c -o self_test -lpthread
2. Copy the executable to an Opteron box and execute like so:
   ulimit -c unlimited
   LD_ASSUME_KERNEL=2.4.1 self_test 1

Actual results:
System may take a seg fault and generate an Oops message, depending on how data lands on the stack.

Expected results:
Test program should core dump.

Additional info:
The problem is in file "fs/binfmt_elf.c". The following line of code in procedure "elf_core_dump" may pass an uninitialized value:

    sz += notesize(&notes[i]);

This was analyzed by examining a stack dump and observing the following instruction as the first instruction of strlen:

    cmpb $0x0,(%rdi)

This was executed with %rdi as null. The failure may cause a seg fault or may simply use a garbage value - this is entirely dependent on stack contents, as "notes[]" is on the stack in this routine.

Here is an excerpt of the current code, followed by a rewrite which fixes the problem and is simpler code as well. (The actual code in the file is much longer. I removed all but the incorrect code to illustrate the error.)

----- bad code ---------------------------
static int elf_core_dump(long signr, struct pt_regs * regs, struct file * file)
{
    ...
    int numnote = 5;
    struct memelfnote notes[5];

    fill_note(&notes[0], "CORE", NT_PRSTATUS, sizeof(prstatus), &prstatus);
    fill_note(&notes[1], "CORE", NT_PRPSINFO, sizeof(psinfo), &psinfo);
    fill_note(&notes[2], "CORE", NT_TASKSTRUCT, sizeof(*current), current);

#ifndef __x86_64__
    /* Try to dump the FPU. */
    if ((prstatus.pr_fpvalid = elf_core_copy_task_fpregs(current, &fpu))) {
        fill_note(&notes[3], "CORE", NT_PRFPREG, sizeof(fpu), &fpu);
    } else {
        --numnote;
    }
#else
    numnote --;
#endif

#ifdef ELF_CORE_COPY_XFPREGS
    if (elf_core_copy_task_xfpregs(current, &xfpu)) {
        fill_note(&notes[4], "LINUX", NT_PRXFPREG, sizeof (xfpu), xfpu);
    } else {
        --numnote;
    }
#else
    numnote --;
#endif

    for(i = 0; i < numnote; i++)
        sz += notesize(&notes[i]);
------------------------------------------

The problem is with the two ifdef sections. On the AMD x86_64 the first ifdef will be skipped. This leaves notes[3] uninitialized. But the second ifdef is not skipped, thus notes[4] is filled in and the value of "numnote" is 4. This means that the for loop will call "notesize(&notes[3])", resulting in notesize calling strlen on an element of a structure that was never initialized. If the stack contained zero we get a seg fault, otherwise garbage data is used.

The fix follows. It has the advantage of making the code simpler.

----- fixed code -------------------------
    int numnote = 0;
    struct memelfnote notes[5];

    fill_note(&notes[numnote++], "CORE", NT_PRSTATUS, sizeof (prstatus), &prstatus);
    fill_note(&notes[numnote++], "CORE", NT_PRPSINFO, sizeof (psinfo), &psinfo);
    fill_note(&notes[numnote++], "CORE", NT_TASKSTRUCT, sizeof (*current), current);

#ifndef __x86_64__
    /* Try to dump the FPU. */
    if ((prstatus.pr_fpvalid = elf_core_copy_task_fpregs(current, &fpu))) {
        fill_note(&notes[numnote++], "CORE", NT_PRFPREG, sizeof(fpu), &fpu);
    }
#endif

#ifdef ELF_CORE_COPY_XFPREGS
    if (elf_core_copy_task_xfpregs(current, &xfpu)) {
        fill_note(&notes[numnote++], "LINUX", NT_PRXFPREG, sizeof (xfpu), &xfpu);
    }
#endif
------------------------------------------
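For anyone who wants to see the counting-up pattern outside the kernel, here is a minimal user-space sketch. It is not the kernel code: struct note, note_size(), collect_notes(), total_size(), the note type values and the have_fpregs/have_xfpregs flags are all hypothetical stand-ins (the flags play the role of the two ifdef sections). It only illustrates that when the index is incremented at each fill, slots 0..n-1 are always initialized, so the strlen() in the size loop never sees an unfilled entry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_NOTES 5

/* Hypothetical stand-in for the kernel's struct memelfnote. */
struct note {
    const char *name;
    int type;
    size_t datasz;
    const void *data;
};

/* Illustrative note types; the values are placeholders for this sketch. */
enum { T_PRSTATUS = 1, T_PRPSINFO, T_TASKSTRUCT, T_PRFPREG, T_PRXFPREG };

static void fill_note(struct note *n, const char *name, int type,
                      size_t datasz, const void *data)
{
    n->name = name;
    n->type = type;
    n->datasz = datasz;
    n->data = data;
}

/* Simplified size: name plus terminator plus payload. Like the kernel's
 * notesize(), it calls strlen() on the name, so it must only ever be
 * given a note that was filled. */
static size_t note_size(const struct note *n)
{
    return strlen(n->name) + 1 + n->datasz;
}

/* Fill notes in order, incrementing the count at each fill. The two
 * flags stand in for the "#ifndef __x86_64__" and
 * "#ifdef ELF_CORE_COPY_XFPREGS" sections. Returns the number filled. */
static int collect_notes(struct note notes[MAX_NOTES],
                         int have_fpregs, int have_xfpregs)
{
    static const char payload[8] = {0};  /* dummy 8-byte payload */
    int n = 0;

    fill_note(&notes[n++], "CORE", T_PRSTATUS, sizeof payload, payload);
    fill_note(&notes[n++], "CORE", T_PRPSINFO, sizeof payload, payload);
    fill_note(&notes[n++], "CORE", T_TASKSTRUCT, sizeof payload, payload);
    if (have_fpregs)
        fill_note(&notes[n++], "CORE", T_PRFPREG, sizeof payload, payload);
    if (have_xfpregs)
        fill_note(&notes[n++], "LINUX", T_PRXFPREG, sizeof payload, payload);

    assert(n <= MAX_NOTES);
    return n;
}

/* Sum the sizes of the first n notes, all of which are filled. */
static size_t total_size(const struct note *notes, int n)
{
    size_t sz = 0;
    for (int i = 0; i < n; i++)
        sz += note_size(&notes[i]);
    return sz;
}
```

Calling collect_notes(notes, 0, 1) models the x86_64 case from this report: the "LINUX" note lands in slot 3 instead of slot 4, the count is 4, and total_size() walks only filled entries.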
Created attachment 97109 [details] File self_test.c shown in example.
Created attachment 97110 [details] Patch file to fix.
Created attachment 97262 [details]
Updated patch

Seems like a similar problem exists in elf_dump_thread_status. This patch also fixes that instance. I can't really test due to lack of a hammer box, but it does compile.
The second patch looks good to me and seems to work as advertised (the core dump is gdb'able as well). I'll run it by the elf/gdb folks just for sanity's sake then submit it for U2.
Patch slated for U2
The fix for this problem was committed to the RHEL3 U2 patch pool on Thursday, 12-Feb-2004, for kernel version 2.4.21-9.8.
*** Bug 117941 has been marked as a duplicate of this bug. ***
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2004-188.html