Bug 242453 - kernel doesn't support correctly fork()
Summary: kernel doesn't support correctly fork()
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.3
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Red Hat Kernel Manager
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2007-06-04 12:00 UTC by Benedikt Schaefer
Modified: 2007-11-17 01:14 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-06-05 11:54:46 UTC
Target Upstream Version:
Embargoed:



Description Benedikt Schaefer 2007-06-04 12:00:17 UTC
Description of problem:
The application VECTIS uses the fork() system call and no longer runs over
native InfiniBand with RHEL4U3 and Voltaire Gridstack-4.1.5_3.

Version-Release number of selected component (if applicable):
OS: RHEL4U3
kernel: 2.6.9-34.ELsmp
IB: Voltaire Gridstack-4.1.5_3
application: vectis-3.10p2
MPI: hpmpi-2.02.05.00-20061003r

How reproducible:
Every time VECTIS is run with HP-MPI.

Steps to Reproduce:
1. phase5 -V 3.10p2 -np 8 -hf MPI_HOSTS piston.INP
  
Actual results:
VECTIS starts, but nothing runs. All processes appear to be waiting for something.

Expected results:


Additional info:
Some information from Ricardo:
I believe the problem on the cluster running VECTIS is due to the use
of fork() within VECTIS. If I remove all system calls from the code then
it runs ok. I've done some research on this and I have found that OFED
1.1 does not properly support fork(). Problems can be found on any
kernel version but for 2.6.12 and lower it is not supported at all.
Above 2.6.12 there is limited support.

Comment 1 Neil Horman 2007-06-04 20:28:06 UTC
Can you elaborate on what exactly you mean by saying "OFED does not properly
support fork()"?  

Comment 2 Benedikt Schaefer 2007-06-05 07:13:04 UTC
Sorry, this is the information we received from the software vendor, Ricardo.
I have no further details.

best regards
Benedikt

Comment 3 Benedikt Schaefer 2007-06-05 08:07:57 UTC
From the OFED 1.1 Release Notes:

2. Fork support from kernel 2.6.12 and above is available provided
   that applications do not use threads. The fork() is supported as long
   as parent process does not run before child exits or calls exec().
   The former can be achieved by calling wait(childpid); the latter can be
   achieved by application specific means.  Posix system() call is
   supported.
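
The restriction quoted above (the parent must not run until the child has exited or called exec()) can be sketched in C as follows. This is only an illustration under stated assumptions: `run_and_wait` is a hypothetical helper name, not part of OFED or VECTIS, and the sketch uses waitpid() where the release notes write wait(childpid). It does not by itself make fork() safe on a kernel older than 2.6.12.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start argv[0] in a child process and block the
 * parent until the child has exited. Returns the child's exit code,
 * or -1 on failure or abnormal termination. */
int run_and_wait(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        /* Child: call exec() immediately and do nothing else first. */
        execvp(argv[0], argv);
        _exit(127);             /* reached only if exec failed */
    }

    /* Parent: do no other work until the child is gone. */
    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {
            perror("waitpid");
            return -1;
        }
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pass a NULL-terminated argument vector, for example `{"sh", "-c", "exit 3", NULL}`, and receive the child's exit code once it has finished.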

From Open-MPI FAQ:

24. Can I use system() or fork() in an MPI application that uses the
OpenFabrics support?

The answer is, unfortunately, complicated.

      *  If you have a Linux kernel before version 2.6.16: no. Some
        distros may provide patches for older versions (e.g, RHEL4 may
        someday receive a hotfix).
        
      * If you have a version of OFED before v1.2: sort of.
        Specifically, newer kernels with OFED 1.0 and OFED 1.1 may
        generally allow the use of system() and/or the use of fork() as
        long as the parent does nothing until the child exits.
        
      * If you have a Linux kernel >= v2.6.16 and OFED >= v1.2 and Open
        MPI >=v1.2.1: yes. 
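
As a rough aid for the kernel part of that checklist, an application could compare the running kernel release against 2.6.16 before relying on fork(). The sketch below is an assumption-laden illustration: `release_at_least` and `running_kernel_at_least` are hypothetical names, and a plain version comparison says nothing about distribution backports (the FAQ itself notes that RHEL4 might receive a patch) or about the OFED and Open MPI versions.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Hypothetical helper: parse a release string such as "2.6.9-34.ELsmp"
 * and compare it with major.minor.patch.
 * Returns 1 if release >= the given version, 0 if older,
 * and -1 if the string cannot be parsed. */
int release_at_least(const char *release, int major, int minor, int patch)
{
    int a = 0, b = 0, c = 0;

    /* Accept "X.Y" as well as "X.Y.Z"; a missing patch level counts as 0. */
    if (sscanf(release, "%d.%d.%d", &a, &b, &c) < 2)
        return -1;
    if (a != major)
        return a > major;
    if (b != minor)
        return b > minor;
    return c >= patch;
}

/* Same check against the kernel that is currently running. */
int running_kernel_at_least(int major, int minor, int patch)
{
    struct utsname u;

    if (uname(&u) != 0)
        return -1;
    return release_at_least(u.release, major, minor, patch);
}
```

For the kernel in this report, `release_at_least("2.6.9-34.ELsmp", 2, 6, 16)` returns 0, which matches the first FAQ bullet.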




Comment 4 Neil Horman 2007-06-05 11:54:46 UTC
This sounds like an MPI / Open Fabrics Limitation rather than a Kernel issue. 
Note the end of the FAQ entry:

"NOTE: Arbitrary fork() support is not supported in the OpenFabrics software
stack. If you use fork() in your application, you must not touch any registered
memory before calling some form of exec() to launch another process. Calling
system() is safe."
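
For cases where the application only needs to launch an external command, the quoted note suggests system() as the safe route. A minimal sketch, assuming the command can be expressed as a shell string, is shown here; `run_shell` is a hypothetical wrapper name used only for illustration.

```c
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical wrapper: run cmd through system() and return the
 * command's exit code, or -1 if it could not be run or did not
 * exit normally. system() does not return until the command ends. */
int run_shell(const char *cmd)
{
    int status = system(cmd);

    if (status == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because system() returns only after the command has finished, the calling process does no other work in the meantime.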

I've checked the upstream git tree and found no changesets that appear to
correct bugs specific to Open-MPI or OFED.  If there is a specific upstream
commit that enhances OpenFabrics or Open-MPI, we can look into backporting it,
but as it stands now, this appears to me to be an MPI/OpenFabrics limitation
rather than a kernel bug.

