Bug 98815
Summary: | fork()ing from a threaded program makes memory accesses unreliable
---|---
Product: | Red Hat Enterprise Linux 2.1
Component: | kernel
Version: | 2.1
Hardware: | ia64
OS: | Linux
Status: | CLOSED ERRATA
Severity: | high
Priority: | medium
Reporter: | Johan Walles <johan.walles>
Assignee: | Jason Baron <jbaron>
QA Contact: | Brian Brock <bbrock>
CC: | anderson, arun-public, dseberge, g_saab, knoel, lwoodman, riel, tony.luck
Target Milestone: | ---
Target Release: | ---
Doc Type: | Bug Fix
Last Closed: | 2003-08-21 17:40:48 UTC
Description
Johan Walles
2003-07-09 06:39:53 UTC
Created attachment 92816 [details]
Repro case. Works fine on ia32.
This program has 16 threads wasting time sorting numbers and fighting over a
global lock. After 10 seconds, a 17th thread wakes up and starts calling
fork() over and over.
Within a minute of the fork()ing thread waking up, the program either segfaults
or just hangs ("top" says CPU usage goes down to zero for all threads).
The program is supposed to run forever, which it does on ia32 (for low values
of "forever" :-).
FWIW, according to Dave Mosberger, the program runs fine on 2.5.73. Also, we have only seen this on SMP boxes; my attempts at reproducing it on a single-CPU machine have failed.

Severity changed to high because of multiple adverse impacts to customers. It makes running Ant (using any Java VM) very unstable. We (BEA) are hurting.

Tony Luck of Intel alerted me to a newly found memory-ordering bug in glibc's malloc(). I therefore tried replacing the calls to malloc() and free() inside wasteTime() with a local int array to avoid calling those functions. However, the program still crashes / hangs for us. Of course, if fork() itself calls malloc() / free(), that change to the repro case doesn't really change anything.

Some info on one system this repros on:

Kernel:
```
Linux version 2.4.18-e.25smp (bhcompile.redhat.com) (gcc version 2.96 20000731 (Red Hat Linux 7.2 2.96-112.7.2)) #1 SMP Thu Feb 6 15:33:05 EST 2003
```

Glibc:
```
Name        : glibc                     Relocations: (not relocateable)
Version     : 2.2.4                     Vendor: Red Hat, Inc.
Release     : 31.7                      Build Date: Thu 12 Dec 2002 04:08:05 PM CET
Install date: Fri 28 Mar 2003 09:59:05 AM CET
Build Host  : rocky.devel.redhat.com
Group       : System Environment/Libraries
Source RPM  : glibc-2.2.4-31.7.src.rpm
Size        : 34485354                  License: LGPL
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
Summary     : The GNU libc libraries.
```

CPUs (it has four of these):
```
processor  : 2
vendor     : GenuineIntel
arch       : IA-64
family     : Itanium 2
model      : 0
revision   : 7
archrev    : 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz    : 896.238999
itc MHz    : 896.238999
BogoMIPS   : 1342.17
```

Just tried to set a breakpoint in malloc() after the threads had started. The program crashes without hitting that breakpoint. Thus, AFAICT this problem seems unrelated to the bug recently found in malloc(). My change to the test case (getting rid of the malloc/free in wasteTime()) only made the problem take longer to happen (57 minutes).
So I agree that the malloc() problem is not the issue here; sorry for the distraction.

I did find that a kernel.org 2.4.20 kernel with the Mosberger patches has the problem, but 2.4.21 does not (the test ran for 15+ hours without crashing, and all threads were still using CPU time). Looking at the changes between 2.4.20 and 2.4.21, the only obvious TLB-related change was Erich Focht's fix for wrapping around the mmu contexts. A subtly different version of this patch is already in Red Hat's e.25 kernel. I backported Erich's change from 2.4.21 to 2.4.20, and so far the test program looks to be running normally (run time just got to 1 hour). I'm trying to figure out why Erich's fix would make a difference here, and whether the subtle differences in the 2.4.18-e.25 version are significant.

Ignore that last comment too ... my patched 2.4.20 test just went into the "not using any cpu time" mode across lunchtime.

Here's a better theory (from reading the code, instead of peering at diffs). When a process/thread forks, we run through dup_mmap() in kernel/fork.c, which is responsible for copying the address space from the current process to the child (marking all the data pages read-only to force copy-on-write semantics). When it is done, it calls flush_tlb_mm() to make sure that there are no stale TLB entries. On an SMP system that gets us to arch/ia64/kernel/smp.c, where the smp_flush_tlb_mm() routine calls local_flush_tlb_mm() to deal with the current cpu, and then uses smp_call_function() to do the same on all the other processors in the system.

On versions of the kernel where there is a problem, local_flush_tlb_mm() looks like this:

```
if (mm == current->active_mm) {
	get_new_mmu_context(mm);
	reload_context(mm);
}
```

Now we have a multi-threaded application, so several other cpus may find that the test in that "if" statement is true, so they will all, more or less simultaneously, get a new context (allocate a new context number and assign it to mm->context) and then run reload_context() (updating region registers rr0-rr4). In 2.4.21 and 2.5 there is a new routine, activate_context(), which deals with these races by checking and reloading the context repeatedly until all the racing threads have settled down.

Created attachment 93031 [details]
backport of relevant fixes from 2.4.21 to 2.4.18-e31
Working from my theory on Friday that this problem is caused by races in local_flush_tlb_mm(), I made a quick-and-dirty backport of the relevant pieces from 2.4.21 (no promises that this patch is a minimal set of such changes). The patch is against Red Hat 2.4.18-e.31.

The BEA test case ran for 30 minutes on this kernel before the kernel hit a problem (debugging now, but it must be a goof in my patch).
When my 2.4.18-e31 kernel with the 2.4.21 backported change hung, three of the four cpus were spinning waiting for held locks (two of them for the same task_rq_lock()). My guess is that I broke some O(1) scheduler stuff with the code I backported from 2.4.21. But I still think that I'm on the right track. Comments, anyone?

Tony, can you try this patch -- applied on top of your patch -- which we've added to the 2.4.21-based kernel for our RHEL3 beta:

```
--- linux-2.4.21/include/asm-ia64/mmu_context.h.orig	2003-07-09 13:42:15.000000000 -0400
+++ linux-2.4.21/include/asm-ia64/mmu_context.h	2003-07-09 13:43:23.000000000 -0400
@@ -64,12 +64,14 @@
 static inline mm_context_t
 get_mmu_context (struct mm_struct *mm)
 {
+	unsigned long flags;
+
 	mm_context_t context = mm->context;
 
 	if (context)
 		return context;
 
-	spin_lock(&ia64_ctx.lock);
+	spin_lock_irqsave(&ia64_ctx.lock, flags);
 	{
 		/* re-check, now that we've got the lock: */
 		context = mm->context;
@@ -79,7 +81,7 @@
 			mm->context = context = ia64_ctx.next++;
 		}
 	}
-	spin_unlock(&ia64_ctx.lock);
+	spin_unlock_irqrestore(&ia64_ctx.lock, flags);
 	return context;
 }
```

It was a completely different how-to-cause-it scenario, but without it, the system could deadlock because the runqueue lock is held across the call to context_switch(). This is the description from that case:

--------------------------------------------------------------------------------
One processor grabs the runqueue lock, calls context_switch(), which calls switch_mm(), which calls get_mmu_context(), which attempts to acquire ia64_ctx.lock:

```
spinlock(&ia64_ctx)
get_mmu_context()
switch_mm()
context_switch()
schedule()           acquires runqueue lock
```

Another processor has called activate_mm(), which has called get_mmu_context() and acquired ia64_ctx.lock. It takes an interrupt from one of the 8 SCSI HBAs. The scsi_mod driver calls end_buffer_io_sync(), which calls __wakeup(), which calls try_to_wake_up(), which tries to acquire the runqueue lock held by the other processor:
```
runqueue_lock()      blocked on the runqueue lock
try_to_wake_up()
__wakeup()
end_buffer_io_sync()
a0000000000301f0     Interrupt
spinlock(&ia64_ctx)
get_mmu_context()
activate_mm()
```

All of the other processors are spinning trying to acquire the ia64_ctx lock.
--------------------------------------------------------------------------------

It can't hurt to try it with your patch...

Sure ... I wasn't very certain about taking that part of the 2.4.21 change. I'll apply your patch and spin up the BEA test right away.

Created attachment 93034 [details]
updated patch incorporating RedHat suggestion
Looking good. This kernel has lasted 1 hour 7 minutes running the BEA test.
Here is the updated version of the patch.
Ok, the test system is now at 2.5 hours and is running just fine. What do we need to do next? Is this patch acceptable to Red Hat to include in a patch update for BEA?

Can BEA test out this patch to make sure that it fixes the problems that you have seen with your actual application? I'd hate to find out later that we fixed it for this test program, but not for real applications!

I'm getting on a plane soon to head to the Ottawa Linux Symposium, so I'll be out of contact for a while. Someone else at Intel can continue work on this, but we need to know what (if anything) is needed.

We (BEA) will have someone test the patch here to verify and respond with results.

Created attachment 93043 [details]
patch adapted for 2.4.18-e33
The patch for BEA looks fine. It doesn't apply cleanly to 2.4.18-e33 (for other reasons), so I created one that does, and am testing it here with the provided test program.

The attached -e33 patch has been running the repro test here on a 4-CPU system successfully for over 24 hours; I've changed the component to kernel and reassigned it to Jason Baron for inclusion in a quarterly errata. We'd still like to hear from BEA re: their test using Tony's patch with their real application.

What is the planned date of the next quarterly errata (that will include this fix)?

The BEA JRockit team has verified that the proposed patch fixes the underlying problem. At this time, BEA would like to request that this patch be made publicly available at the earliest opportunity. BEA has multiple customer outages because of this problem.

BEA has completed testing 2.4.18-e36. We have found no problems with it.

An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2003-198.html