Bug 98815

Summary: fork()ing from a threaded program makes memory accesses unreliable
Product: Red Hat Enterprise Linux 2.1
Version: 2.1
Component: kernel
Hardware: ia64
OS: Linux
Severity: high
Priority: medium
Status: CLOSED ERRATA
Reporter: Johan Walles <johan.walles>
Assignee: Jason Baron <jbaron>
QA Contact: Brian Brock <bbrock>
CC: anderson, arun-public, dseberge, g_saab, knoel, lwoodman, riel, tony.luck
Doc Type: Bug Fix
Last Closed: 2003-08-21 17:40:48 UTC
Attachments:
- Repro case. Works fine on ia32.
- Backport of relevant fixes from 2.4.21 to 2.4.18-e31
- Updated patch incorporating Red Hat suggestion
- Patch adapted for 2.4.18-e33

Description Johan Walles 2003-07-09 06:39:53 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3b) Gecko/20030210

Description of problem:
When I call fork() from a pthread, other (unrelated) threads sometimes start
behaving badly (crashing / hanging).

Version-Release number of selected component (if applicable):
Linux version 2.4.18-e.25smp (bhcompile.redhat.com) (gcc version
2.96 20000731 (Red Hat Linux 7.2 2.96-112.7.2)) #1 SMP Thu Feb 6 15:33:05 EST 2003

How reproducible:
Always

Steps to Reproduce:
1. Spawn a couple of pthreads
2. Call fork() from one of them
3. If everything keeps working, goto 1

Actual Results:  One or more of the running pthreads either crash or hang.

Expected Results:  The threads not involved in the fork() shouldn't have been
affected by the fork() at all.

Additional info:

I'll attach a repro case as well.

Comment 1 Johan Walles 2003-07-09 06:45:13 UTC
Created attachment 92816 [details]
Repro case.  Works fine on ia32.

This program has 16 threads wasting time sorting numbers and fighting over a
global lock.  After 10 seconds, a 17th thread wakes up and starts calling
fork() over and over.

Within a minute of the fork()ing thread waking up, the program either segfaults
or just hangs ("top" says CPU usage goes down to zero for all threads).

The program is supposed to run forever, which it does on ia32 (for low values
of "forever" :-).

Comment 2 Johan Walles 2003-07-09 07:19:45 UTC
FWIW, according to Dave Mosberger, the program runs fine on 2.5.73.

Also, we have only seen this on SMP boxes; my attempts at reproducing it on a
single-CPU machine have failed.


Comment 3 Johan Walles 2003-07-15 09:54:28 UTC
Severity changed to high because of multiple adverse impacts to customers.  It
makes running Ant (using any Java VM) very unstable.

We (BEA) are hurting.


Comment 4 Johan Walles 2003-07-17 08:56:51 UTC
Tony Luck / Intel alerted me to a newly found memory ordering bug in glibc's
malloc().

Thus, I tried replacing the calls to malloc() and free() inside wasteTime() with
a local int array to avoid calling those functions.  However, the program still
crashes / hangs for us.  Of course, if fork() has calls to malloc() / free(),
that change to the repro case doesn't really change anything.
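
For illustration, the kind of change tried (a hypothetical sketch; the
testcase's actual wasteTime() isn't reproduced in this report):

static void wasteTime(void)
{
	int numbers[512];	/* was: int *numbers = malloc(512 * sizeof(int)); */
	int i;

	for (i = 0; i < 512; i++)
		numbers[i] = rand();
	qsort(numbers, 512, sizeof(int), compare_ints);	/* comparator as in the repro sketch above */
	/* was: free(numbers); */
}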

Some info on one system this repros on:

Kernel:
Linux version 2.4.18-e.25smp (bhcompile.redhat.com) (gcc version
2.96 20000731 (Red Hat Linux 7.2 2.96-112.7.2)) #1 SMP Thu Feb 6 15:33:05 EST 2003

Glibc:
Name        : glibc                        Relocations: (not relocateable)
Version     : 2.2.4                             Vendor: Red Hat, Inc.
Release     : 31.7                          Build Date: Thu 12 Dec 2002 04:08:05 PM CET
Install date: Fri 28 Mar 2003 09:59:05 AM CET      Build Host: rocky.devel.redhat.com
Group       : System Environment/Libraries   Source RPM: glibc-2.2.4-31.7.src.rpm
Size        : 34485354                         License: LGPL
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
Summary     : The GNU libc libraries.

CPUs (it has four of these):
processor  : 2
vendor     : GenuineIntel
arch       : IA-64
family     : Itanium 2
model      : 0
revision   : 7
archrev    : 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz    : 896.238999
itc MHz    : 896.238999
BogoMIPS   : 1342.17


Comment 5 Johan Walles 2003-07-17 13:40:35 UTC
Just tried to set a breakpoint in malloc() after the threads had started.  The
program crashes without hitting that breakpoint.

Thus, AFAICT this problem seems unrelated to the bug recently found in malloc().


Comment 6 Tony Luck 2003-07-18 16:49:55 UTC
My change to the test case (getting rid of the malloc/free in wasteTime()) only
made the problem take longer to happen (57 minutes).  So I agree that the
malloc() problem is not the issue here; sorry for the distraction.

I did find that a kernel.org 2.4.20 kernel plus the Mosberger patch has the
problem, but 2.4.21 does not (the test ran for 15+ hours without crashing, and
all threads were still using cpu time).  Looking at the changes between 2.4.20
and 2.4.21, the only obvious TLB-related change was Erich Focht's fix for
wrapping around the mmu contexts.  A subtly different version of this patch is
already in Red Hat's e.25 kernel.  I backported Erich's change from 2.4.21 to
2.4.20, and so far the test program looks to be running normally (run time just
got to 1 hour).

I'm trying to figure out why Erich's fix would make a difference here, and 
whether the subtle differences that are in the 2.4.18-e25 version are 
significant.

Comment 7 Tony Luck 2003-07-18 20:23:49 UTC
Ignore that last comment too ... my patched 2.4.20 test just went into the
"not using any cpu time" mode over lunchtime.

Comment 8 Tony Luck 2003-07-18 22:38:33 UTC
Here's a better theory (from reading the code, instead of peering at diffs).

When a process/thread forks, we run through dup_mmap() in kernel/fork.c, which
is responsible for copying the address space from the current process to the
child (marking all the data pages readonly to force copy-on-write semantics).
When it is done, it calls flush_tlb_mm() to make sure that there are no stale
TLB entries.  On an SMP system that gets us to arch/ia64/kernel/smp.c, where
the smp_flush_tlb_mm() routine calls local_flush_tlb_mm() to deal with the
current cpu and then uses smp_call_function() to do the same on all the other
processors in the system.
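
Schematically, the path looks something like this (a simplified sketch based
on the description above, not a verbatim copy of the e.25 source):

/* arch/ia64/kernel/smp.c, roughly: */
void
smp_flush_tlb_mm (struct mm_struct *mm)
{
	/* flush stale TLB entries on this cpu ... */
	local_flush_tlb_mm(mm);

	/* ... and ask every other cpu to do the same: */
	smp_call_function((void (*)(void *)) local_flush_tlb_mm, mm, 1, 1);
}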

On versions of the kernel where there is a problem, local_flush_tlb_mm() looks 
like this:
        if (mm == current->active_mm) {
                get_new_mmu_context(mm);
                reload_context(mm);
        }
Now we have a multi-threaded application, so several other cpus may find that
the test in that "if" statement is true; they will all, more or less
simultaneously, get a new context (allocate a new context number and assign it
to mm->context) and then run reload_context() (updating region registers
rr0-rr4).

In 2.4.21 and 2.5 there is a new routine "activate_context()" which deals with 
these races by checking and reloading the context repeatedly until all the 
racing threads have settled down.
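
Roughly, that routine looks like this (a simplified sketch; see
include/asm-ia64/mmu_context.h in those trees for the real code):

static inline void
activate_context (struct mm_struct *mm)
{
	mm_context_t context;

	do {
		context = get_mmu_context(mm);
		reload_context(context);
		/* if another cpu flushed and reallocated mm->context while
		 * we were loading the region registers, go around again: */
	} while (context != mm->context);
}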


Comment 9 Tony Luck 2003-07-21 17:44:38 UTC
Created attachment 93031 [details]
backport of relevant fixes from 2.4.21 to 2.4.18-e31

Working from my theory on Friday that this problem is caused by races in
local_flush_tlb_mm(), I made a quick & dirty backport of the relevant pieces
from 2.4.21 (no promises that this patch includes a minimal set of such
changes).  The patch is against Red Hat 2.4.18-e.31.

The BEA testcase ran for 30 minutes on this kernel before the kernel hit a
problem (debugging now, but it must be a goof in my patch).

Comment 10 Tony Luck 2003-07-21 18:49:32 UTC
When my 2.4.18-e31 kernel with the 2.4.21 backported change hung, three of the
four cpus were spinning waiting for held locks (two of them for the same
task_rq_lock()).  My guess is that I broke some O(1) scheduler stuff with the
code that I backported from 2.4.21.

But I still think that I'm on the right track.  Comments, anyone?

Comment 11 Dave Anderson 2003-07-21 19:07:40 UTC
Tony,

Can you try this patch -- applied on top of your patch -- which we've added to
the 2.4.21-based kernel for our RHEL3 beta:

--- linux-2.4.21/include/asm-ia64/mmu_context.h.orig    2003-07-09 13:42:15.000000000 -0400
+++ linux-2.4.21/include/asm-ia64/mmu_context.h 2003-07-09 13:43:23.000000000 -0400
@@ -64,12 +64,14 @@
 static inline mm_context_t
 get_mmu_context (struct mm_struct *mm)
 {
+       unsigned long flags;
+
        mm_context_t context = mm->context;

        if (context)
                return context;

-       spin_lock(&ia64_ctx.lock);
+       spin_lock_irqsave(&ia64_ctx.lock, flags);
        {
                /* re-check, now that we've got the lock: */
                context = mm->context;
@@ -79,7 +81,7 @@
                        mm->context = context = ia64_ctx.next++;
                }
        }
-       spin_unlock(&ia64_ctx.lock);
+       spin_unlock_irqrestore(&ia64_ctx.lock, flags);
        return context;
 }

It was a completely different how-to-cause-it scenario, but 
without it, the system could deadlock because the runqueue
lock is held across the call to context_switch().

This is the description from that case:

--------------------------------------------------------------------------------
One processor grabs the runqueue lock, calls context_switch
which calls switch_mm which calls get_mmu_context which attempts to acquire
ia64_ctx.lock.

     spinlock(&ia64_ctx)
     get_mmu_context()
     switch_mm()
     context_switch()
     schedule() acquires runqueue lock

Another processor has called activate_mm(), which has called get_mmu_context()
and acquired ia64_ctx.lock.  It takes an interrupt from one of the 8 SCSI HBAs.
The scsi_mod driver calls end_buffer_io_sync(), which calls __wake_up(), which
calls try_to_wake_up(), which tries to acquire the runqueue lock held by the
other processor.

     runqueue_lock() blocked on the runqueue lock
     try_to_wake_up()
     __wakeup()
     end_buffer_io_sync()
     a0000000 000301f0
     Interrupt
     spinlock(&ia64_ctx)
     get_mmu_context()
     activate_mm()

All of the other processors are spinning trying to acquire the ia64_ctx lock.
--------------------------------------------------------------------------------
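
In other words, it's a classic two-lock cycle with an interrupt handler as one
of the parties, and taking ia64_ctx.lock with local interrupts disabled removes
the interrupt edge from the cycle.  Schematically (not actual kernel code; the
function name here is made up for illustration):

/*
 * CPU A: holds runqueue lock ............. wants ia64_ctx.lock
 * CPU B: holds ia64_ctx.lock .. (SCSI irq) wants runqueue lock
 *
 * With spin_lock_irqsave(), CPU B can't take the interrupt while it
 * holds ia64_ctx.lock, so the lock is always released and CPU A can
 * make progress.
 */
static mm_context_t
alloc_context_irqsafe (struct mm_struct *mm)	/* hypothetical name */
{
	unsigned long flags;
	mm_context_t context;

	spin_lock_irqsave(&ia64_ctx.lock, flags);	/* local irqs off */
	context = mm->context;
	if (!context)
		mm->context = context = ia64_ctx.next++;
	spin_unlock_irqrestore(&ia64_ctx.lock, flags);	/* irqs back on */
	return context;
}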

It can't hurt to try it with your patch...


Comment 12 Tony Luck 2003-07-21 19:15:59 UTC
Sure ... I wasn't very certain about taking that part of the 2.4.21 change;
I'll apply your patch and spin up the BEA test right away.

Comment 13 Tony Luck 2003-07-21 20:33:32 UTC
Created attachment 93034 [details]
updated patch incorporating Red Hat suggestion

Looking good.  This kernel has lasted 1 hour 7 minutes running the BEA test.

Here is the updated version of the patch.

Comment 14 Tony Luck 2003-07-21 22:05:18 UTC
Ok, the test system is now at 2.5 hours, and is running just fine.

What do we need to do next?

Is this patch acceptable to Red Hat to include in a patch update to BEA?

Can BEA test out this patch to make sure that it fixes the problems that you 
have seen with your actual application? I'd hate to find out later that we 
fixed it for this test program, but not for real applications!

I'm getting on a plane soon to head to the Ottawa Linux Symposium, so I'll be 
out of contact for a while. Someone else at Intel can continue work on this, 
but we need to know what (if anything) is needed.

Comment 15 Need Real Name 2003-07-22 11:59:05 UTC
We (BEA) will have someone test the patch here to verify and respond with 
results.

Comment 16 Dave Anderson 2003-07-22 13:41:45 UTC
Created attachment 93043 [details]
patch adapted for 2.4.18-e33

Comment 17 Dave Anderson 2003-07-22 13:44:38 UTC
The patch for BEA looks fine.

It doesn't apply cleanly to 2.4.18-e33 (for other reasons), so I created one
that does, and am testing it here with the provided test program.

Comment 18 Dave Anderson 2003-07-23 12:29:06 UTC
The attached -e33 patch has been running the repro test here
on a 4-CPU system successfully for over 24 hours; I've changed
the component to kernel and reassigned it to Jason Baron for
inclusion in a quarterly errata.

We'd still like to hear from BEA re: their test using Tony's
patch w/their real application.

Comment 19 Tony Luck 2003-07-29 21:43:30 UTC
What is the planned date of the next quarterly errata (that will include this 
fix)?

Comment 20 David Seberger 2003-08-01 14:36:55 UTC
The BEA JRockit team has verified that the proposed patch fixes the underlying
problem. At this time, BEA would like to request that this patch be made 
publicly available at the earliest opportunity. BEA has multiple customer 
outages because of this problem.


Comment 21 David Seberger 2003-08-08 11:10:10 UTC
BEA has completed testing 2.4.18-e36. We have found no problems with it. 

Comment 22 Mark J. Cox 2003-08-21 17:40:48 UTC
An erratum has been issued which should resolve the problem described in this
bug report.  This report is therefore being closed with a resolution of ERRATA.
For more information on the solution and/or where to find the updated files,
please follow the link below.  You may reopen this bug report if the solution
does not work for you.

http://rhn.redhat.com/errata/RHSA-2003-198.html