Bug 230339 - The fatal error "Segmentation fault" happens when lots of continuous processes of mount.nfs4 are executed.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: i386 Linux
Priority: medium Severity: high
Assigned To: Dave Anderson
QA Contact: Brian Brock
Duplicates: 247169 294141
Depends On:
Blocks:
Reported: 2007-02-28 09:01 EST by shichao
Modified: 2010-10-22 09:24 EDT (History)
Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Last Closed: 2007-11-07 14:42:09 EST
Attachments
The patch for the binfmt_elf of the kernel (788 bytes, patch)
2007-02-28 09:01 EST, shichao

Description shichao 2007-02-28 09:01:42 EST
Description of problem:

  When mount.nfs4 is executed many times in succession, some of the runs
fail with the retval "EINVAL" and the error message "Segmentation fault".

Version-Release number of selected component (if applicable):
 kernel

How reproducible:
The failure usually appears within a long series of consecutive mount.nfs4
executions, with the same retval "EINVAL" and error message
"Segmentation fault".

Steps to Reproduce:
1. Execute an ELF-format executable file (for example, mount.nfs4).
2. Repeat step 1 about 2000 to 8192 times.
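
The two steps above could be automated with a small user-space helper such as the following. This is only a sketch and not part of the original report: the function name count_segv_runs is made up for illustration, and it assumes a POSIX system with fork(), execv() and waitpid(). It counts how many runs end with SIGSEGV, which is how a process killed by force_sig(SIGSEGV, current) during exec would appear to its parent.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs `path` with `argv` `runs` times in a row.
 * Returns the number of runs that were terminated by SIGSEGV, or -1
 * if fork() or waitpid() itself fails. */
static int count_segv_runs(const char *path, char *const argv[], int runs)
{
    int segv = 0;

    for (int i = 0; i < runs; i++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {
            execv(path, argv);
            _exit(127); /* reached only if execv() returned an error */
        }

        int status;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        /* A run killed by SIGSEGV is counted as a failure case. */
        if (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV)
            segv++;
    }
    return segv;
}
```

A caller would pass the path of the ELF executable under test (for example the installed location of mount.nfs4, which varies by system) and a run count in the 2000 to 8192 range from step 2. A non-zero return value means the failure was reproduced.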

Expected results:

I investigated the problem and found that, when the "Segmentation fault"
error occurs, the cause lies in the binary loader of the kernel.

I think the "Segmentation fault" is caused by a design limitation in
load_elf_binary() and load_elf_interp().

The flow for obtaining the elf_interp map address and validating it is as follows.
(execute mount.nfs4)
sys_execve
  |- do_execve
       |- search_binary_handler
            |- linux_binfmt = elf_format
            |- elf_format->load_elf_binary
                 |- elf_entry = load_elf_interp()
                 |     |- vaddr = eppnt->p_vaddr;
                 |     |- kernel_read(..., &eppnt, ...)
                 |     |- map_addr = elf_map()
                 |     |- BAD_ADDR(map_addr) check
                 |     |- load_addr = map_addr - ELF_PAGESTART(vaddr);
                 |     |- return load_addr
                 |
                 |- if (!BAD_ADDR(elf_entry))
                 |     |- elf_entry = elf_entry + loc->interp_elf_ex.e_entry;
                 |
                 |- if (BAD_ADDR(elf_entry))
                       |- force_sig(SIGSEGV, current);
                       |- retval = -EINVAL;

In do_execve(), after setting up some data structures, the kernel invokes
search_binary_handler() to find the corresponding ELF binary loader for
mount.nfs4. The loader then reads the ELF executable image into memory,
segment by segment and section by section, including the interp segment. In
our test, when the "Segmentation fault" of mount.nfs4 occurred, the elf_entry
value that load_elf_binary() received from load_elf_interp() was invalid. It
was judged to be a BAD_ADDR, so the kernel forced the signal "SIGSEGV" on the
mount.nfs4 process and exited with the retval "EINVAL". That is how the error
occurred.

In load_elf_interp(), eppnt->p_vaddr is the virtual address of the
mapped segment. It is a fixed address; for mount.nfs4 it is 7503872.
The map_addr value is returned by elf_map(). Because mount.nfs4 is an
ET_DYN (shared object file) executable program, map_addr is a random
mapped address returned by elf_map().
In the normal case, map_addr is above vaddr, and the relocation
adjustment address is returned as "map_addr - ELF_PAGESTART(vaddr)".
After the check on whether elf_entry is a BAD_ADDR, elf_entry is
adjusted to a user virtual address of the current process by
"elf_entry + loc->interp_elf_ex.e_entry". The adjusted elf_entry address
is used as the pointer to the startup routine of the process.
Unfortunately, because the map_addr returned by elf_map() is random, it
can be less than vaddr, and that is when the problem occurs.
Over many consecutive mount operations, the failure case usually
appears. In the failure case, map_addr was 7499776, which is less than
the vaddr of 7503872. Because both values are of type unsigned long,
"map_addr - ELF_PAGESTART(vaddr)" wrapped around and the resulting
load_addr was 4294963200. In the check on whether elf_entry is a
BAD_ADDR, this load_addr was considered a BAD_ADDR because it is larger
than TASK_SIZE. The SIGSEGV was therefore sent before the adjustment to
the user virtual address, and the process exited.
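
The arithmetic in the failure case can be checked with a short user-space C sketch. It is not kernel code: uint32_t stands in for the 32-bit unsigned long on i386, the page size (4096) and TASK_SIZE (0xC0000000) are assumed values for that platform, the ">=" comparison is an assumption about how BAD_ADDR() is defined, and the sketch_* names are made up for illustration.

```c
#include <stdint.h>

/* Assumed values for 32-bit i386 with 4 KiB pages and a 3 GiB user
 * address space; the real kernel definitions may differ by version. */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_TASK_SIZE 0xC0000000u

/* Stand-in for ELF_PAGESTART(): round an address down to its page start. */
static uint32_t sketch_pagestart(uint32_t addr)
{
    return addr & ~(SKETCH_PAGE_SIZE - 1u);
}

/* Stand-in for the BAD_ADDR() test described in the report. */
static int sketch_bad_addr(uint32_t addr)
{
    return addr >= SKETCH_TASK_SIZE;
}

/* Relocation adjustment as described for load_elf_interp(). The
 * subtraction is done on unsigned 32-bit values, so it wraps modulo
 * 2^32 when map_addr is below the page start of vaddr. */
static uint32_t sketch_load_addr(uint32_t map_addr, uint32_t vaddr)
{
    return map_addr - sketch_pagestart(vaddr);
}
```

With the values from the failure case, sketch_load_addr(7499776, 7503872) gives 4294963200 (that is, 2^32 - 4096), and sketch_bad_addr() reports that value as bad because it is above the assumed TASK_SIZE of 3221225472.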

The bug is that the BAD_ADDR check happens before the user virtual
address adjustment "elf_entry + loc->interp_elf_ex.e_entry". The address
actually used as the pointer to the startup routine is the adjusted
virtual address; the elf_entry returned from load_elf_interp() is only a
relative address based on the load_base address. The BAD_ADDR check on
elf_entry should therefore be made only after the adjustment
"elf_entry + loc->interp_elf_ex.e_entry".
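
The difference between the two orderings can be illustrated with the following user-space sketch. It is not the attached patch and not kernel code: the demo_* names are made up, uint32_t stands in for the 32-bit unsigned long, TASK_SIZE is assumed to be 0xC0000000, and error codes returned by load_elf_interp() are ignored for simplicity.

```c
#include <stdint.h>

#define DEMO_TASK_SIZE 0xC0000000u /* assumed i386 user-space limit */

/* Stand-in for the BAD_ADDR() test described in the report. */
static int demo_bad_addr(uint32_t addr)
{
    return addr >= DEMO_TASK_SIZE;
}

/* Ordering described as faulty: validate the relocation adjustment
 * first, then add e_entry. Returns 0 and stores the entry point on
 * success, or -1 where the kernel would send SIGSEGV. */
static int demo_entry_check_first(uint32_t load_addr, uint32_t e_entry,
                                  uint32_t *entry)
{
    if (demo_bad_addr(load_addr))
        return -1;
    *entry = load_addr + e_entry;
    return demo_bad_addr(*entry) ? -1 : 0;
}

/* Ordering proposed in the report: add e_entry first (the sum wraps
 * modulo 2^32), then validate the adjusted address. */
static int demo_entry_adjust_first(uint32_t load_addr, uint32_t e_entry,
                                   uint32_t *entry)
{
    uint32_t adjusted = load_addr + e_entry;

    if (demo_bad_addr(adjusted))
        return -1;
    *entry = adjusted;
    return 0;
}
```

For example, with the load_addr of 4294963200 from the failure case and a hypothetical e_entry of 7505824 (a value chosen only for illustration, slightly above the vaddr of 7503872), the check-first ordering rejects the load, while the adjust-first ordering wraps around to the entry address 7501728, which is below the assumed TASK_SIZE and passes the check.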

For this problem in binfmt_elf, I have made a patch that addresses the
design limitation.

In my testing, the "Segmentation fault" problem no longer occurs after the
patch is applied.
The attachment is the patch for the kernel of RHEL5 Beta2.
Comment 1 shichao 2007-02-28 09:01:42 EST
Created attachment 148915 [details]
The patch for the binfmt_elf of the kernel
Comment 3 Eric Sandeen 2007-03-14 12:06:54 EDT
Something akin to the attached patch exists upstream, but the
linux-2.6-execshield.patch patch does this:

@@ -443,8 +491,7 @@ static unsigned long load_elf_interp(str
                        goto out_close;
        }

-       *interp_load_addr = load_addr;
-       error = ((unsigned long)interp_elf_ex->e_entry) + load_addr;
+       error = load_addr;

 out_close:
        kfree(elf_phdata);

... will have to get w/ the author of that patch to see what's going on.
Comment 4 Eric Sandeen 2007-03-14 12:16:22 EDT
Ingo, the suggested patch here actually reverts part of the exec-shield patch. 
I'm out of my area of expertise here, any comments?

Thanks,
-Eric
Comment 6 RHEL Product and Program Management 2007-04-03 18:05:01 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 7 Dave Anderson 2007-04-04 11:32:09 EDT
Hello Shichao,

It's unclear how this scenario can be reproduced and tested,
so I have placed the kernel src.rpm and two i686 binary rpms 
(PAE and non-PAE) containing your unmodified patch here:

  http://people.redhat.com/anderson/BZ_230339

    kernel-2.6.18-13.el5.bz230339.1.i686.rpm
    kernel-PAE-2.6.18-13.el5.bz230339.1.i686.rpm
    kernel-2.6.18-13.el5.bz230339.1.src.rpm

Can you please test and verify them?

Comment 8 shichao 2007-04-09 20:42:42 EDT
I have tested your kernel*-bz230339 packages.
I confirm that the patch really does solve the problem. The patch
is suitable.
So, can my patch be applied to the latest RHEL5 product?
Comment 11 Don Zickus 2007-06-12 14:45:42 EDT
in 2.6.18-24.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
Comment 13 Prarit Bhargava 2007-09-17 11:42:35 EDT
*** Bug 247169 has been marked as a duplicate of this bug. ***
Comment 14 Prarit Bhargava 2007-09-18 10:00:58 EDT
*** Bug 294141 has been marked as a duplicate of this bug. ***
Comment 15 Nathan G. Grennan 2007-09-18 15:36:27 EDT
My bug 294141 was made a duplicate of this one, but there seem to be two
different changes to the exec-shield patch. There is the change in comment 3 of
this bug, and then there is the change in comment 7 of 246623. I have tried the
change in comment 7 of 246623, and it works for me.

Are these just two ways to fix the same bug, or are they actually two different bugs?
Comment 16 Dave Anderson 2007-09-18 15:44:29 EDT
There are two ways to fix the same bug.  The fix that went into
the upcoming RHEL5.1 release was put into place in June, before
it was addressed in a different manner upstream and in Fedora.

Comment 17 Mike Gahagan 2007-09-20 17:37:34 EDT
Confirmed the fix is in the -47.el5 kernel; it looks like the patch was already
tested by the reporter some time ago.
Comment 19 errata-xmlrpc 2007-11-07 14:42:09 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html
