Bug 1302389

Summary: Deadlocks Occurring in Shared Mutex Code
Product: Red Hat Enterprise Linux 6
Component: glibc
Version: 6.7
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: unspecified
Reporter: Alan Matsuoka <alanm>
Assignee: Torvald Riegel <triegel>
QA Contact: qe-baseos-tools-bugs
CC: ashankar, codonell, dkochuka, fweimer, metze, mnewsome, pandrade, pfrankli, triegel
Target Milestone: rc
Target Release: ---
Type: Bug
Regression: ---
Last Closed: 2017-04-03 15:48:10 UTC
Bug Blocks: 1269194

Attachments: reproducer (flags: none)

Description Alan Matsuoka 2016-01-27 17:05:07 UTC
Description of problem:
The test code creates an anonymous block of shared memory, initializes a robust process-shared mutex in the memory and forks child processes. Each child process publishes its process ID in the shared memory and then acquires and releases the mutex before exiting normally. The parent picks a child process, waits for it to publish its PID and then terminates it via kill. This creates a race condition where the child may or may not receive an asynchronous signal while acquiring, holding or releasing the mutex. The parent attempts to keep a certain number of child processes running by waiting for finished children and forking new ones.
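
For reference, the setup described above boils down to something like the following minimal sketch (illustrative only, not the attached reproducer; the structure layout, iteration count, busy-wait and choice of SIGKILL are invented for brevity):

#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

struct shared_area {
    pthread_mutex_t mutex;      /* robust, process-shared */
    volatile pid_t  child_pid;  /* child publishes its PID here */
};

int main(void)
{
    /* Anonymous shared mapping, inherited by forked children. */
    struct shared_area *shared = mmap(NULL, sizeof(*shared),
                                      PROT_READ | PROT_WRITE,
                                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) { perror("mmap"); return 1; }

    /* Robust process-shared mutex living inside the shared mapping. */
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(&shared->mutex, &attr);
    pthread_mutexattr_destroy(&attr);

    for (int i = 0; i < 50; i++) {
        shared->child_pid = 0;
        pid_t pid = fork();
        if (pid < 0) { perror("fork"); return 1; }
        if (pid == 0) {
            /* Child: publish PID, then acquire and release the mutex. */
            shared->child_pid = getpid();
            if (pthread_mutex_lock(&shared->mutex) == EOWNERDEAD)
                pthread_mutex_consistent(&shared->mutex);  /* previous owner died */
            pthread_mutex_unlock(&shared->mutex);
            _exit(0);
        }
        /* Parent: wait for the PID, then kill the child at an arbitrary
         * point while it is acquiring, holding or releasing the mutex. */
        while (shared->child_pid == 0)
            ;   /* busy-wait, for brevity only */
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
    }
    return 0;
}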

The observed behaviour on multi-core hosts is that the program sometimes deadlocks, with “ps -l” reporting several processes in the tree blocked on the futex WCHAN. The expected behaviour is that the program runs to completion in a reasonable amount of time.

We’ve also noticed that in some cases applying “strace -p” to one of the blocked child processes causes the program to proceed further, although not necessarily to completion. Therefore, as a debugging aid the parent process also forks a “dummy” child which does not interact with the mutex but does have the shared memory mapped. This allows gdb to attach to this particular child process without interfering with any blocked futex system calls, and inspect the state of the mutex. See example session below:

$ gcc -Wall -std=gnu99 -g -o robust_test robust_test.c -lpthread
$ ./robust_test &
[1] 21937
$ Done 0 iterations, 11 pre_lock, 3 locks, 2 unlocks, 2 post_unlock 0 cleanups, 1 not done
Done 5 iterations, 21 pre_lock, 8 locks, 7 unlocks, 7 post_unlock 0 cleanups, 6 not done
Done 10 iterations, 46 pre_lock, 28 locks, 27 unlocks, 27 post_unlock 0 cleanups, 11 not done
Done 15 iterations, 76 pre_lock, 56 locks, 55 unlocks, 53 post_unlock 2 cleanups, 14 not done

$ ps -l
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
0 S 780379 20009 45137  0 80   0 -  6749 wait   pts/14   00:00:00 bash
0 S 780379 21937 20009  0 80   0 -  1526 futex_ pts/14   00:00:00 robust_test
1 S 780379 21938 21937  0 80   0 -  1525 pipe_w pts/14   00:00:00 robust_test
1 S 780379 22003 21937  0 80   0 -  1526 futex_ pts/14   00:00:00 robust_test
1 Z 780379 22007 21937  0 80   0 -     0 exit   pts/14   00:00:00 robust_test <defunct>
1 Z 780379 22008 21937  0 80   0 -     0 exit   pts/14   00:00:00 robust_test <defunct>
1 S 780379 22009 21937  0 80   0 -  1526 futex_ pts/14   00:00:00 robust_test
1 S 780379 22010 21937  0 80   0 -  1526 futex_ pts/14   00:00:00 robust_test
1 S 780379 22011 21937  0 80   0 -  1526 futex_ pts/14   00:00:00 robust_test
1 S 780379 22013 21937  0 80   0 -  1526 futex_ pts/14   00:00:00 robust_test
1 Z 780379 22014 21937  0 80   0 -     0 exit   pts/14   00:00:00 robust_test <defunct>
0 R 780379 22015 20009  0 80   0 -  2820 -      pts/14   00:00:00 ps
0 S 780379 45137 42638  0 80   0 -  6779 wait   pts/14   00:00:00 bash

$ strace -p 22013
Process 22013 attached - interrupt to quit
futex(0x7f6a66369000, FUTEX_WAIT, 2147505585, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7f6a66369000, FUTEX_WAKE, 1)    = 1
exit_group(0)                           = ?
Process 22013 detached
$ Done 20 iterations, 105 pre_lock, 81 locks, 80 unlocks, 76 post_unlock 4 cleanups, 17 not done
Done 25 iterations, 115 pre_lock, 86 locks, 85 unlocks, 81 post_unlock 4 cleanups, 22 not done
Done 30 iterations, 139 pre_lock, 105 locks, 104 unlocks, 100 post_unlock 4 cleanups, 27 not done
Done 35 iterations, 163 pre_lock, 124 locks, 123 unlocks, 119 post_unlock 4 cleanups, 32 not done

$ gdb -p 21938
GNU gdb (GDB) 7.10
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 21938
Reading symbols from robust_test...done.
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
0x00000035ab00e740 in __read_nocancel () from /lib64/libpthread.so.0
(gdb) up
#1  0x0000000000401389 in main (argc=1, argv=0x7fff038d30d8) at robust_test.c:251
251               read(shared->pipe_fds[0], &dummy, sizeof(dummy));
(gdb) print shared->top_mutex
$1 = {
  __data = {
    __lock = 0, 
    __count = 1, 
    __owner = 0, 
    __nusers = 0, 
    __kind = 144, 
    __spins = 0, 
    __list = {
      __prev = 0x0, 
      __next = 0x0
    }
  }, 
  __size = "\000\000\000\000\001", '\000' <repeats 11 times>, "\220", '\000' <repeats 22 times>, 
  __align = 4294967296
}
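
Based on the backtrace above, the dummy child simply sits in a read() on a pipe that is never written to, while still having the shared mapping inherited from fork(). An illustrative sketch of that arrangement (not the actual robust_test.c code; the function name is invented):

#include <sys/types.h>
#include <unistd.h>

/* The dummy child never touches the mutex; it inherits the shared mapping
 * via fork() and blocks on a pipe read, so gdb can attach to it and print
 * the shared mutex state without interrupting any futex system calls in
 * the other children. */
static pid_t fork_dummy_child(int pipe_read_fd)
{
    pid_t pid = fork();
    if (pid == 0) {
        char byte;
        read(pipe_read_fd, &byte, sizeof(byte));   /* blocks indefinitely */
        _exit(0);
    }
    return pid;   /* the parent keeps the write end open but never writes */
}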

Version-Release number of selected component (if applicable):

6.7 
How reproducible:
Always on hosts with multiple CPU cores; occasionally on hosts with only two CPU cores.

Host has glibc-2.12-1.166.el6_7.3 and kernel 2.6.32-573.7.1.el6.x86_64.

Steps to Reproduce:
1. Compile and run the attached reproducer.

Actual results:

The program runs into a deadlock.
Expected results:

The program runs to completion.

Additional info:

Issue: the program goes into a deadlock (waiting on a futex). Please read through the case description and comment #2, which explains two scenarios.
This happens on a system with two CPUs; however, the second test case (comment #2) happens on both single-CPU and multi-CPU systems.

This is reproducible on my test RHEL 6.7 system with the latest glibc and kernel versions.

We need help analysing the program and taking this to Engineering if it turns out to be a bug.
kernel-2.6.32-573.8.1.el6.x86_64
glibc-2.12-1.166.el6_7.3.x86_64

# ./robust_test 
Done 0 iterations, 6 pre_lock, 2 locks, 1 unlocks, 1 post_unlock 0 cleanups, 0 not done
Done 5 iterations, 31 pre_lock, 21 locks, 20 unlocks, 19 post_unlock 1 cleanups, 4 not done
Done 10 iterations, 42 pre_lock, 26 locks, 25 unlocks, 24 post_unlock 1 cleanups, 9 not done
Done 15 iterations, 52 pre_lock, 31 locks, 30 unlocks, 29 post_unlock 1 cleanups, 14 not done
Done 20 iterations, 62 pre_lock, 36 locks, 35 unlocks, 34 post_unlock 1 cleanups, 19 not done
Done 25 iterations, 72 pre_lock, 41 locks, 40 unlocks, 39 post_unlock 1 cleanups, 24 not done
Done 30 iterations, 94 pre_lock, 58 locks, 57 unlocks, 56 post_unlock 1 cleanups, 29 not done
Done 35 iterations, 109 pre_lock, 68 locks, 67 unlocks, 66 post_unlock 1 cleanups, 34 not done
Done 40 iterations, 119 pre_lock, 73 locks, 72 unlocks, 71 post_unlock 1 cleanups, 39 not done
Done 45 iterations, 129 pre_lock, 78 locks, 77 unlocks, 76 post_unlock 1 cleanups, 44 not done
Success 144 pre_lock, 96 locks, 96 unlocks, 95 post_unlock 1 cleanups, 48 not done

# ./robust_test 
Done 0 iterations, 5 pre_lock, 2 locks, 1 unlocks, 1 post_unlock 0 cleanups, 0 not done
Done 5 iterations, 19 pre_lock, 8 locks, 7 unlocks, 6 post_unlock 0 cleanups, 4 not done
Done 10 iterations, 30 pre_lock, 14 locks, 13 unlocks, 12 post_unlock 0 cleanups, 9 not done
Done 15 iterations, 40 pre_lock, 19 locks, 18 unlocks, 17 post_unlock 0 cleanups, 14 not done
Done 20 iterations, 51 pre_lock, 25 locks, 24 unlocks, 23 post_unlock 0 cleanups, 19 not done
Done 25 iterations, 61 pre_lock, 30 locks, 29 unlocks, 28 post_unlock 0 cleanups, 24 not done
Done 30 iterations, 71 pre_lock, 35 locks, 34 unlocks, 33 post_unlock 0 cleanups, 29 not done

This happens occasionally; you may need to repeat the ./robust_test run a few times (it happens roughly once in every 5-6 runs).
# ps axwll
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
0     0  2774  2333  20   0   6096   512 futex_ S+   pts/0      0:00 ./robust_test
1     0  2775  2774  20   0   6092   156 pipe_w S+   pts/0      0:00 ./robust_test
1     0  2811  2774  20   0      0     0 exit   Z+   pts/0      0:00 [robust_test] <defunct>
1     0  2812  2774  20   0   6096   184 futex_ S+   pts/0      0:00 ./robust_test
1     0  2813  2774  20   0   6096   184 futex_ S+   pts/0      0:00 ./robust_test
1     0  2814  2774  20   0      0     0 exit   Z+   pts/0      0:00 [robust_test] <defunct>
1     0  2815  2774  20   0   6096   184 futex_ S+   pts/0      0:00 ./robust_test
1     0  2816  2774  20   0   6096   184 futex_ S+   pts/0      0:00 ./robust_test
1     0  2817  2774  20   0   6096   184 futex_ S+   pts/0      0:00 ./robust_test
1     0  2818  2774  20   0      0     0 exit   Z+   pts/0      0:00 [robust_test] <defunct>


Running this under Valgrind's Helgrind tool reported an attempt to re-lock a non-recursive lock that is already held:

valgrind -v --tool=helgrind --log-file=test3.log ./robust_test

==36495== Thread #1: Attempt to re-lock a non-recursive lock I already hold

==36495== ---Thread-Announcement------------------------------------------
==36495==
==36495== Thread #1 is the program's root thread
==36495==
==36495== ----------------------------------------------------------------
==36495==
==36495== Thread #1: Attempt to re-lock a non-recursive lock I already hold
==36495==    at 0x4A0EAB9: pthread_mutex_lock (in /opt/rh/devtoolset-3/root/usr/lib64/valgrind/vgpreload_helgrind-amd64-linux.so)
==36495==    by 0x40106A: lock_mutex (robust_test.c:172)
==36495==    by 0x401175: run_child (robust_test.c:212)
==36495==    by 0x4011D2: fork_child (robust_test.c:227)
==36495==    by 0x401450: main (robust_test.c:297)
==36495==  Lock was previously acquired
==36495==    at 0x4A0EB61: pthread_mutex_lock (in /opt/rh/devtoolset-3/root/usr/lib64/valgrind/vgpreload_helgrind-amd64-linux.so)
==36495==    by 0x40106A: lock_mutex (robust_test.c:172)
==36495==    by 0x4013BA: main (robust_test.c:287)
==36495==

Changing the mutex type in the attached reproducer from PTHREAD_MUTEX_NORMAL to PTHREAD_MUTEX_RECURSIVE
[alanm@vmw102 C01551754]$ diff -dcp robust_test.c robust_testR.c 
*** robust_test.c       2015-12-10 09:50:09.000000000 -0500
--- robust_testR.c      2015-12-10 10:40:08.804550857 -0500
*************** void init_mutex(pthread_mutex_t* mutex)
*** 71,77 ****
  {
        pthread_mutexattr_t attr;
        CHECKED(pthread_mutexattr_init, &attr);
!       CHECKED(pthread_mutexattr_settype, &attr, PTHREAD_MUTEX_NORMAL);
        CHECKED(pthread_mutexattr_setpshared, &attr, PTHREAD_PROCESS_SHARED);
        CHECKED(pthread_mutexattr_setrobust, &attr, PTHREAD_MUTEX_ROBUST);
  
--- 71,77 ----
  {
        pthread_mutexattr_t attr;
        CHECKED(pthread_mutexattr_init, &attr);
!       CHECKED(pthread_mutexattr_settype, &attr, PTHREAD_MUTEX_RECURSIVE);
        CHECKED(pthread_mutexattr_setpshared, &attr, PTHREAD_PROCESS_SHARED);
        CHECKED(pthread_mutexattr_setrobust, &attr, PTHREAD_MUTEX_ROBUST);

still caused it to go into a deadlock.
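
For clarity, the modified initialization amounts to the following self-contained sketch (the reproducer's CHECKED() wrapper is replaced here with plain error checks, and the function name is invented):

#include <pthread.h>
#include <stdlib.h>

/* Robust, process-shared, *recursive* mutex initialization, matching the
 * change shown in the diff above. */
static void init_mutex_recursive(pthread_mutex_t *mutex)
{
    pthread_mutexattr_t attr;
    if (pthread_mutexattr_init(&attr) != 0) abort();
    /* The only change under test: RECURSIVE instead of NORMAL. */
    if (pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE) != 0) abort();
    if (pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) != 0) abort();
    if (pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST) != 0) abort();
    if (pthread_mutex_init(mutex, &attr) != 0) abort();
    pthread_mutexattr_destroy(&attr);
}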

There are a couple of upstream BZs that seem to apply to this:
https://sourceware.org/bugzilla/show_bug.cgi?id=19004
and https://sourceware.org/bugzilla/show_bug.cgi?id=19402

Comment 1 Alan Matsuoka 2016-01-27 17:05:52 UTC
Created attachment 1118834 [details]
reproducer

Comment 27 Carlos O'Donell 2017-04-03 15:48:10 UTC
Summary:
========

RHEL 6.10 is in production phase 3 and only selected Urgent Priority Bug Fixes will be considered. The upstream bugs that have been fixed are all pre-existing issues in RHEL 6 and not new regressions. Fixing these bugs in RHEL 6 could have significant destabilizing effects on the robust mutex implementation. It is with this consideration that I'm marking this issue as CLOSED/WONTFIX for RHEL 6.

We will continue to enhance RHEL 7 with any fixes required for robust mutex support.

Technical details:
==================

The process-shared robust mutex support was audited by the Red Hat Platform Tools Base OS team (glibc team), and it was found that there were several key defects that could cause deadlocks and hangs in the implementation under certain conditions.

The team has since fixed several upstream issues, including:

Fix lost wakeups with robust mutex:
https://sourceware.org/bugzilla/show_bug.cgi?id=20973

Fix x86 robust mutex assembly code:
https://sourceware.org/bugzilla/show_bug.cgi?id=20985

Deadlock with robust shared mutex:
https://sourceware.org/bugzilla/show_bug.cgi?id=19402

Correction of compiler barriers around robust shared mutex code:
Commit 8f9450a0b7a9e78267e8ae1ab1000ebca08e473e

The most complex problem here is that the existing x86 assembly-optimized robust mutex code is simply wrong with regard to condition handling. Its removal allows the use of the generic C version, relying on modern compiler optimizations to achieve an optimal balance between provable correctness and performance. This kind of change is not possible in the RHEL 6 release, since it would drastically change the behaviour of the robust mutex.

To compound things, there is still one more case which is too costly to fix without kernel help. One might call it a design flaw in the algorithm shared between the kernel and userspace:
"File corruption race condition in robust mutex unlocking"
https://sourceware.org/bugzilla/show_bug.cgi?id=14485

Bug 14485 is not intended to be alarmist, and the race window is considered exceedingly small, but it is still there, like other latent bugs we don't know about in the implementation.

Thus the amount of change required, and the future design constraints, make this risky for RHEL 6. The risks are quite high that the changes could destabilize products already built upon RHEL 6. In RHEL 7 we have a lot more flexibility when it comes to fixing these issues (except 14485, which needs a broader fix upstream).

As a final note, here is Torvald Riegel's excellent technical summary of the current status:
https://sourceware.org/ml/libc-alpha/2016-12/msg00950.html

Again, we recommend working with Red Hat to ensure RHEL 7 meets your needs with respect to robust mutexes.