Bug 477705

Summary: __reclaim_stacks deadlock with first attempt at fixing O_DIRECT vs fork bug
Product: Red Hat Enterprise Linux 5
Reporter: Andrea Arcangeli <aarcange>
Component: glibc
Assignee: Jakub Jelinek <jakub>
Status: CLOSED ERRATA
QA Contact: BaseOS QE <qe-baseos-auto>
Severity: high
Priority: urgent
Version: 5.4
CC: codonell, ddomingo, dmair, drepper, jmoyer, jplans, ltroan, pmuller, riek, riel, syeghiay
Target Milestone: rc
Hardware: All
OS: Linux
URL: https://bugzilla.redhat.com/show_bug.cgi?id=471613
Doc Type: Bug Fix
Last Closed: 2009-09-02 11:43:58 UTC
Bug Blocks: 477770

Description Andrea Arcangeli 2008-12-22 23:29:37 UTC
Description of problem:

Trying to understand how __reclaim_stacks can be safe given that it does not start with an lll_unlock.

How reproducible:

Apply patch in bug #471613 and run the testcase in bug #471613


Actual results:

Deadlock in __reclaim_stacks: the two lists end up joined, resulting in an infinite loop.

Expected results:

__reclaim_stacks should get a coherent copy of the two lists while the fork() syscall runs, hence it requires the lll_lock to be held around the fork syscall execution.

Comment 1 Jeff Moyer 2009-01-07 20:17:41 UTC
Let me try to make this a bit more clear.  My apologies if the cut-n-paste did not fare well.

Basically, the way things work today, the kernel fork() path will
end up marking pages read-only to trigger COW faults.  This can
happen at any point during the fork() system call.  So, in
theory, there could be a race between the current libc thread
list modifications and the setting of the pte as readonly.

Consider a process that is multi-threaded.  Threads are being
created and destroyed, and there is one thread (spawned from the
initial thread) that is making fork() calls.  It seems like the
following is possible:

t1           t2                                             forked-child
fork start
             pthread_join()
             __deallocate_stack()
               list_del(&pd->list)
               queue_stack*
                 head->next = newp
mark pte RO
                 <rest of list add operation>
fork end
                                                            __reclaim_stacks()
                                                              list_for_each(runp, &stack_cache)
                                                                infinite loop

*this requires that either the compiler or the CPU reorders the
 list_add instructions.
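
For reference, below is a minimal sketch (not glibc's actual code; list_t and
list_add_sketch are illustrative stand-ins) of the insertion the timeline above
refers to.  If the store to head->next becomes visible before the stores that
initialize the new element's own links, as described in the footnote, the child
can follow head->next to an element whose ->next still points into the other
list, so the walk never returns to the list head.

/* Illustrative stand-in for a circular doubly-linked list node.  */
typedef struct list_head
{
  struct list_head *next;
  struct list_head *prev;
} list_t;

/* Insert NEWP right after HEAD.  In program order the stores are (1)-(4);
   if (4) becomes visible to the fork-time page copy before (1)/(2), the
   copied list contains a half-initialized element.  */
static void
list_add_sketch (list_t *newp, list_t *head)
{
  newp->next = head->next;        /* (1) */
  newp->prev = head;              /* (2) */
  head->next->prev = newp;        /* (3) */
  head->next = newp;              /* (4) */
}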

Comment 2 Jakub Jelinek 2009-01-07 20:28:15 UTC
Ulrich is working on a patch.  The solution isn't locking stack_cache_lock around fork though, but recovering from in-progress stack_{used,cache} manipulations inside of __reclaim_stacks.

Comment 3 Ulrich Drepper 2009-01-07 23:54:07 UTC
> __reclaim_stacks should get a coherent copy of the two lists while the fork()
> syscall runs, hence it requires the lll_lock to be held around the fork syscall execution.

I think Jakub already mentioned this: a lock is not possible, since fork can be called from signal handlers.

I've added upstream code to fix the lists, if this is necessary.  There shouldn't be any problem like this anymore.  I see no reason why Jakub could backport this patch for 5.4.

Comment 4 Ulrich Drepper 2009-01-07 23:56:22 UTC
could *NOT* backport this patch.

Comment 5 Jakub Jelinek 2009-01-08 14:05:14 UTC
Yeah.  For now please try rawhide glibc (2.9.90-2).

Comment 6 Jeff Moyer 2009-01-08 16:15:17 UTC
I'm running with Andrea's kernel patch and version 2.9.90-2 of glibc and so far have not seen any hangs.  I'll keep the test running for as long as I can, but it looks good so far.  Previously, we'd see a hang within 2 or 3 invocations.

Comment 7 Jeff Moyer 2009-01-08 16:52:11 UTC
OK, the program finally hung, but I'm having trouble attaching with gdb (it attaches, lists the threads, and then just hangs).  I'll continue to try to get some useful debugging information from it.  I do know that it is *not* stuck in the kernel; it is spinning in userspace, using 100% cpu (as we've seen in other instances of this problem with the older libc).

Comment 8 Jeff Moyer 2009-01-08 17:11:10 UTC
OK, I managed to force a core dump of the hung process.  Here's the backtrace:

(gdb) where
#0  __reclaim_stacks () at allocatestack.c:867
#1  0x00002ab7a6af627f in __libc_fork ()
    at ../nptl/sysdeps/unix/sysv/linux/fork.c:170
#2  0x0000000000400f29 in fork_thread (arg=0x0) at dma_thread.c:113
#3  0x00002ab7a68396da in start_thread (arg=<value optimized out>)
    at pthread_create.c:297
#4  0x00002ab7a6b3444d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#5  0x0000000000000000 in ?? ()

I can attach the binary and core file if that's useful.

Comment 9 Ulrich Drepper 2009-01-08 18:05:24 UTC
We would need a glibc build with this additional patch.  This would help to see whether there is an interrupted operation in this situation or not.


--- allocatestack.c	07 Jan 2009 15:48:33 -0800	1.83
+++ allocatestack.c	08 Jan 2009 10:04:16 -0800	
@@ -849,8 +849,6 @@ __reclaim_stacks (void)
 	  elem->next->prev = elem->prev;
 	  elem->prev->next = elem->next;
 	}
-
-      in_flight_stack = 0;
     }
 
   /* Mark all stacks except the still running one as free.  */
@@ -918,6 +916,8 @@ __reclaim_stacks (void)
   /* There is one thread running.  */
   __nptl_nthreads = 1;
 
+  in_flight_stack = 0;
+
   /* Initialize the lock.  */
   stack_cache_lock = LLL_LOCK_INITIALIZER;
 }

Comment 10 Ulrich Drepper 2009-01-08 18:10:36 UTC
In any case, it would be good if you could track the stack_used list in the debugger:

p &stack_used
p stack_used
p stack_used->next
p stack_used->next->next
...
p stack_used->next->next ... ->next

until you reach a loop.
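
As an aside, instead of walking the chain by hand in gdb, a debug build could
use a small tortoise-and-hare check like the sketch below.  It reuses the
illustrative list_t from the sketch under comment 1; list_has_cycle is an
invented name, not a glibc function.

/* Returns 1 if following ->next from HEAD enters a cycle that never comes
   back to HEAD (the broken state that makes the list walk spin forever),
   0 if the chain terminates at HEAD as a healthy circular list does.  */
static int
list_has_cycle (list_t *head)
{
  list_t *slow = head->next;
  list_t *fast = head->next;

  while (fast != head && fast->next != head)
    {
      slow = slow->next;
      fast = fast->next->next;
      if (slow == fast)
        return 1;
    }
  return 0;
}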

Comment 11 Jeff Moyer 2009-01-08 18:17:58 UTC
Given Andrea's comments in bug 471613, I'm going to postpone any further testing until we have a new kernel patch.

To answer one of your questions, I did walk through the lists until I hit a loop, and again it was the case that a list item pointed to by an element of the stack_used list was actually on the stack_cache list.  So the loop happens because we are looping through the stack_cache list, and the terminating condition is elem == stack_used.

Comment 12 Andrea Arcangeli 2009-01-08 18:28:17 UTC
My post on bug 471613 came after reading the glibc commit: it won't enforce ordered writes in the CPU with respect to other CPUs, and I quickly realized we can't let the memcpy run out of order while glibc is still running, because the memcpy can't be made to run in order (well, on most CPUs it will, but on alpha smp_rmb_barrier_depends isn't a noop, so it would be unsolvable there). For this out-of-order mechanism to be safe against the memcpy, I'll have to stop any access to the page that is being copied, to be sure the compiler barriers in the commit are meaningful.

Other than this, it may still be a bug in the patch; this isn't very simple. But at least, once I change the kernel patch to mark the pte readonly and flush the tlb before the memcpy, it becomes only a compiler-level issue, as if the process were single threaded.

Comment 13 Ulrich Drepper 2009-01-08 18:42:21 UTC
Maybe Andrea's thoughts are along the same lines as mine, maybe not.

Jeff's analysis, that a list item referenced from stack_cache is actually on the stack_used list, can only be true if the list_add operation is not committed to memory in order.  We are sure that the compiler doesn't cache anything in this code.  So my question is this:

In COW situations like fork(): if I have data in the cache of one core, and another core running a thread in the same process then marks the same page COW because of a fork() syscall, will the cache content of the first core be flushed first?

If this doesn't happen, it would explain the results, and it would mean userlevel code cannot do anything, since there is no way for userlevel code to flush all cached data for the address space.

Comment 14 Jakub Jelinek 2009-01-08 21:24:57 UTC
Maybe related: GCC recently changed the x86_64 (and i386) __sync_synchronize () to emit an mfence (or a locked stack access if mfence isn't available); perhaps glibc's atomic_*barrier on i386/x86_64 needs to do the same?
See http://gcc.gnu.org/PR36793
For many other glibc targets it already expands to one or another memory barrier instruction (ppc, ia64, sparc, ...).

Comment 15 Jakub Jelinek 2009-01-08 21:43:55 UTC
For atomic_write_barrier, I'd say sfence should be the right insn for x86_64; for i?86, lock addl $1,dummyvar.
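
For illustration, the two variants could be spelled roughly like this as GCC
inline assembly.  These are illustrative definitions only, not glibc's actual
atomic_write_barrier macros, and the i?86 variant uses a dummy stack location
rather than a named dummy variable.

#if defined __x86_64__
/* sfence orders all earlier stores before all later stores.  */
# define write_barrier() __asm__ __volatile__ ("sfence" ::: "memory")
#elif defined __i386__
/* Without assuming SSE, a locked read-modify-write of a dummy location
   acts as a full barrier.  */
# define write_barrier() \
  __asm__ __volatile__ ("lock; addl $0, (%%esp)" ::: "cc", "memory")
#endif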

Comment 16 Andrea Arcangeli 2009-01-08 22:03:58 UTC
On topic, but not really related. The problem is that for memory barriers to be effective, both CPUs must use them. On one side we have glibc, on the other side we have the kernel reading the page, and we can't make the kernel's fork path depend on whatever SMP memory ordering userland uses. That's why I'll have to stop userland from modifying the page while fork is copying it. So for this bug, the compiler barrier (a "memory" clobber with no asm insn) is more than enough.
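
For reference, the compiler barrier referred to here is conventionally spelled
as an empty asm with a "memory" clobber (this is the common idiom, not a quote
from the actual patch):

/* Prevents the compiler from caching or reordering memory accesses across
   this point; emits no instruction, so it constrains nothing at CPU level.  */
#define barrier() __asm__ __volatile__ ("" ::: "memory")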

Whether you need sfence for atomic_write_barrier I don't know, but not for this at least ;)

As long as the current fix covers all possible corruption sources, that's more than enough on the glibc side. I'll get back to looking into the glibc patch (it's not trivial) after I have a kernel patch that I think the current glibc should be able to run safely on.

thanks!

Comment 17 Ulrich Drepper 2009-01-08 22:15:42 UTC
(In reply to comment #16)
> That's why I'll have to
> stop userland to modify the page, while fork is copying the page.

For my education: "stopping userland" means stopping/descheduling all of the threads in the process?


> If you need sfence for atomic_write_barrier I don't know, but not for this at
> least ;)

I don't think it's needed.

Comment 18 Andrea Arcangeli 2009-01-09 01:52:24 UTC
Hello,

With "stopping userland" I meant: just before copying the page inside fork, any userland thread running in other cpus that would attempt to access the page under copy, will trigger page faults. So any thread that could modify the page contents, will be prevented to run in userland while the copy is in progress. It'll enter kernel. It's granular per-page and they're not descheduled, the page fault handler will spin on the PT lock while the copy runs. This is a slow path only running for pages under O_DIRECT that are modified simultanously by another thread while fork runs. It's all about getting right the partial O_DIRECT on 512bytes, the other 512bytes belongs to glibc but we can't cow the page in the parent or I/O is lost, which is why pte in the parent has to be writeable at all times that PT lock is released. (today in linux fork sets the pte readonly for parent and child so the parent threads trigger cows before O_DIRECT is complete, which is the bug)

Comment 19 Ulrich Drepper 2009-01-09 02:04:10 UTC
Thanks for the explanation.

I think this doesn't explain what Jeff saw, though.  I guess I'll ask on a mailing list, just to be sure.

Comment 25 Andrea Arcangeli 2009-01-27 22:39:57 UTC
That's good news, because I couldn't find an explanation for the hang in the kernel patch after my last update (which ensures no writes can happen from any cpu/thread whatsoever while the page is being copied/cowed inside fork; only the direct-io can modify the page during the copy, but what the child gets is undefined).

However, more pages than needed would get copied with that fix, because of the pagevec queues that pin the pages, and locking quickly becomes messy if I want to take those into account. So for rhel I'm more inclined toward a fix based on a PG_ bitflag, even if it'll only cover o-direct (which is the most important case, and in general this problem shouldn't be kernel exploitable). For mainline, things are very different, and I'm more inclined to add a new atomic counter, which should still allow gup-fast to exist there.

Comment 27 Andrea Arcangeli 2009-01-27 23:48:11 UTC
Atomicity in the child is guaranteed per page, not for the whole address space, correct. Hopefully it's possible to use a list algorithm that exploits the per-page atomicity guarantee to extract coherent information from the whole address space, as long as no object spans a page boundary (which should be possible to enforce at the glibc level). I'm not saying this is easy, but stopping the whole thread pool when a thread forks, while possible, won't perform too well.
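
A minimal sketch of that constraint (struct node and alloc_node_within_page
are invented names for illustration): aligning an allocation to a power-of-two
boundary that is at least as large as the object, and that divides the page
size, guarantees the object cannot straddle a page boundary, so the kernel's
per-page copy atomicity covers it as a whole.

#include <stdlib.h>

struct node
{
  struct node *next;
  struct node *prev;
  /* ...payload, kept small enough that the whole struct fits in 64 bytes... */
};

static struct node *
alloc_node_within_page (void)
{
  void *p = NULL;

  /* 64 is a power of two dividing the page size and sizeof (struct node)
     <= 64, hence [p, p + sizeof (struct node)) lies within one page.  */
  if (posix_memalign (&p, 64, sizeof (struct node)) != 0)
    return NULL;
  return p;
}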

Comment 31 Andrea Arcangeli 2009-01-28 17:06:38 UTC
I don't really like that we have nearly unfixable memleaks, though. Yes, it's never going to be a practical problem given the small window for the race to trigger, and the child may actually exec soon after fork returns, releasing all the leaked stacks, but it's still hard to love a memleaking solution.

I was thinking along these lines:

1) create a lock that instead changes from 0 to thread id with cmpxchg or similar. Tid has to be a per-thread local ID obviously.

2) this lock will replace lll_lock with an implementation like this:

void __lll_lock()
{
   do { wait_and_try_to_take_lock(); } while (lock_value != tid());
}

int lll_lock()
{
   if (lock_value == tid())
      return 0;

   __lll_lock();
   return 1;
}

The unlock operation will take the lll_lock retval as parameter.

void lll_unlock(int need_unlock)
{
   if (need_unlock)
      __lll_unlock();
}

The trick here is that the fork critical section won't modify anything (it takes the lock only to prevent _other_ threads from modifying the list), so if a signal handler runs nested inside the fork critical section we're totally fine and lll_lock will succeed recursively.

The only problem I have no easy solution for is the fact that fork should use __lll_lock, not lll_lock. And I worry that fork may actually run from a signal handler, and the signal handler could be invoked inside some other critical section, in the middle of some list manipulation. So for this to work another trick is required: have fork jump back to the normal context if lock_value == tid(), and have __lll_unlock return to fork just before unlocking. Not sure if that's feasible.

But I'd really like to know if it's possible to find a solution that doesn't involve dealing with a half-corrupted list in the child, second-guessing where the parent was interrupted. Furthermore, if a lock were taken, given that the atomicity is per-page and lock_value lives in a single page, it would be as if we were stopping all threads (in fact we'd be descheduling all of them when they try to take the lock). So if we can solve the bit about fork running nested on top of a list manipulation, I think I would like this a lot more. Perhaps this was already considered and turned down as an impossible problem to solve, I don't know.

It's clear the current lock implementation cannot work, I'm not arguing about that; what I'm not sure about is whether another kind of lock can work or not.
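
To make the idea concrete, here is a compilable sketch of such an
ownership-recursive lock using GCC __sync builtins and the raw gettid syscall.
It only illustrates the proposal in this comment, not glibc code: owner_tid,
my_tid and the *_sketch names are invented, and a real implementation would
futex-wait instead of spinning.

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

static volatile long owner_tid;    /* 0 means unlocked */

static long
my_tid (void)
{
  return (long) syscall (SYS_gettid);
}

static void
__lll_lock_sketch (void)
{
  long tid = my_tid ();
  /* Spin until we win the 0 -> tid transition.  */
  while (__sync_val_compare_and_swap (&owner_tid, 0, tid) != 0)
    ;
}

/* Returns 1 if the caller actually took the lock, 0 if it already owned it
   (e.g. the fork path re-entered from a signal handler).  */
static int
lll_lock_sketch (void)
{
  if (owner_tid == my_tid ())
    return 0;
  __lll_lock_sketch ();
  return 1;
}

static void
lll_unlock_sketch (int need_unlock)
{
  if (need_unlock)
    /* Hand the lock back: tid -> 0.  */
    (void) __sync_bool_compare_and_swap (&owner_tid, my_tid (), 0);
}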

Comment 34 Larry Troan 2009-03-10 12:03:48 UTC
Product Management is asking if bug 471613 sufficiently closes the window on this problem to avoid a patch to glibc in the 5.3.z errata stream.

If we need a patch to glibc for the 5.3.z errata stream then when can we get it?

Comment 37 Andrea Arcangeli 2009-03-10 13:57:51 UTC
The latest patches for bug #471613 on RHEL5 seem to hide this problem, right. So an update doesn't seem urgent for glibc, especially given it wouldn't be a full fix. For a full fix I think it'd be a lot better to create a lock based on sigjump as described in comment #31.

Unfortunately the fix for bug #471613 forwarded to mainline doesn't hide the bug as well with gup-fast. For whatever reason, in mainline I still hit this glibc deadlock despite having introduced PG_gup, whereas introducing PG_gup was enough to hide the problem in glibc when applied to rhel5. I'll keep debugging the mainline patch; perhaps something's wrong with the PG_gup logic there. At least the kernel side is safe (forkscrew works perfectly), and the hugetlb part as well. The only problem left in the mainline port is this __reclaim_stacks thing with dma_thread, which for whatever reason became reproducible again. Trying to figure out why...

Comment 38 Andrea Arcangeli 2009-03-10 15:18:22 UTC
Interestingly, I noticed that if I use a statically linked version of dma_thread I can't reproduce the hang anymore, even with the mainline kernel fix. So it may be some offset that comes out different with the mainline kernel compared to the rhel5 kernel.

Comment 49 errata-xmlrpc 2009-09-02 11:43:58 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1415.html