
Bug 726142

Summary: openais deadlocks during rapid start/stop of process
Product: Red Hat Enterprise Linux 5
Component: openais
Version: 5.6
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Steven Dake <sdake>
Assignee: Jan Friesse <jfriesse>
QA Contact: Cluster QE <mspqa-list>
CC: aconway, cluster-maint, edamato, jkortus, jwest, msvoboda
Target Milestone: rc
Keywords: ZStream
Fixed In Version: openais-0.80.6-32.el5
Doc Type: Bug Fix
Doc Text:
In rare cases, rapidly starting and stopping the openais service sometimes resulted in a deadlock. This update adapts the underlying source code to prevent such a deadlock, resolving this issue. Deadlock situations no longer happen during a rapid start/stop process.
Last Closed: 2012-02-21 05:21:47 UTC
Bug Blocks: 739081, 739082, 739083
Attachments:
- core file created by sending segv to the deadlocked process
- Reproducer script
- Proposed patch 1 - Handle EINTR
- Proposed patch 2 - Don't deadlock on ipc_disconnect

Description Steven Dake 2011-07-27 16:38:03 UTC
Description of problem:
openais deadlocks during a rapid start/stop of the process. Alan can add more details of the test case.

Version-Release number of selected component (if applicable):
openais-0.80.6-28.el5_6.1


How reproducible:
Not sure.

Steps to Reproduce:
1. Need more detail from Alan Conway.

Comment 1 Steven Dake 2011-07-27 17:03:44 UTC
Created attachment 515559 [details]
core file created by sending segv to the deadlocked process

Comment 2 Steven Dake 2011-07-27 17:07:44 UTC
(gdb) thread apply all bt

Thread 10 (Thread 18873):
#0  0x00000031374cb696 in poll () from /lib64/libc.so.6
#1  0x000000000041cca5 in prioritized_timer_thread (data=<value optimized out>)
    at timer.c:125
#2  0x000000313800673d in start_thread () from /lib64/libpthread.so.0
#3  0x00000031374d44bd in clone () from /lib64/libc.so.6

Thread 9 (Thread 18876):
#0  0x000000313800a048 in pthread_mutex_unlock () from /lib64/libpthread.so.0
#1  0x0000003139c0a64a in _Unwind_Find_FDE () from /lib64/libgcc_s.so.1
#2  0x0000003139c07625 in ?? () from /lib64/libgcc_s.so.1
#3  0x0000003139c08a0e in ?? () from /lib64/libgcc_s.so.1
#4  0x0000003139c08e1b in _Unwind_ForcedUnwind () from /lib64/libgcc_s.so.1
#5  0x000000313800d160 in __pthread_unwind () from /lib64/libpthread.so.0
#6  0x0000003138005400 in sigcancel_handler () from /lib64/libpthread.so.0
#7  <signal handler called>
#8  0x000000313800aee9 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#9  0x0000000000417b2f in worker_thread (thread_data_in=<value optimized out>)
    at wthread.c:73
#10 0x000000313800673d in start_thread () from /lib64/libpthread.so.0
#11 0x00000031374d44bd in clone () from /lib64/libc.so.6

Thread 8 (Thread 28753):
#0  0x00000031374d5887 in semop () from /lib64/libc.so.6
#1  0x000000000041b835 in pthread_ipc_consumer (conn=0x1ed76900) at ipc.c:412
#2  0x000000313800673d in start_thread () from /lib64/libpthread.so.0
#3  0x00000031374d44bd in clone () from /lib64/libc.so.6

Thread 7 (Thread 29052):
#0  0x00000031374d5887 in semop () from /lib64/libc.so.6
#1  0x000000000041b835 in pthread_ipc_consumer (conn=0x1ed76250) at ipc.c:412
#2  0x000000313800673d in start_thread () from /lib64/libpthread.so.0
#3  0x00000031374d44bd in clone () from /lib64/libc.so.6

Thread 6 (Thread 29338):
#0  0x00000031374d5887 in semop () from /lib64/libc.so.6
#1  0x000000000041b835 in pthread_ipc_consumer (conn=0x2aaab00009f0)
    at ipc.c:412
#2  0x000000313800673d in start_thread () from /lib64/libpthread.so.0
#3  0x00000031374d44bd in clone () from /lib64/libc.so.6

Thread 5 (Thread 32532):
#0  0x000000313800a048 in pthread_mutex_unlock () from /lib64/libpthread.so.0
#1  0x00000000004158fb in hdb_handle_get (handle=<value optimized out>, 
    iovec=0x40e37e00, iov_len=1) at ../include/hdb.h:152
#2  totempg_groups_joined_reserve (handle=<value optimized out>, 
    iovec=0x40e37e00, iov_len=1) at totempg.c:1157
#3  0x000000000041b894 in pthread_ipc_consumer (conn=0x2aaaf1440f20)
    at ipc.c:427
#4  0x000000313800673d in start_thread () from /lib64/libpthread.so.0
#5  0x00000031374d44bd in clone () from /lib64/libc.so.6

Thread 4 (Thread 627):
#0  0x00000031374d5887 in semop () from /lib64/libc.so.6
#1  0x000000000041b835 in pthread_ipc_consumer (conn=0x2aaaf1649f20)
    at ipc.c:412
#2  0x000000313800673d in start_thread () from /lib64/libpthread.so.0
#3  0x00000031374d44bd in clone () from /lib64/libc.so.6

Thread 3 (Thread 1013):
#0  0x000000313800d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003138008e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003138008cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000041586b in totempg_groups_joined_reserve (handle=0, iovec=0x80, 
    iov_len=0) at totempg.c:1155
#4  0x000000000041b894 in pthread_ipc_consumer (conn=0x2aaafe9f1a00)
    at ipc.c:427
#5  0x000000313800673d in start_thread () from /lib64/libpthread.so.0
#6  0x00000031374d44bd in clone () from /lib64/libc.so.6

Thread 2 (Thread 1091):
#0  0x000000313800d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003138008e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003138008cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000041aeed in ipc_thread_active (conn=0x2aaac90ffc10) at ipc.c:187
#4  0x000000000041b81d in pthread_ipc_consumer (conn=0x2aaac90ffc10)
    at ipc.c:408
#5  0x000000313800673d in start_thread () from /lib64/libpthread.so.0
#6  0x00000031374d44bd in clone () from /lib64/libc.so.6

Thread 1 (Thread 18872):
#0  0x000000313800d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003138008e1a in _L_lock_1034 () from /lib64/libpthread.so.0
#2  0x0000003138008cdc in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000000041af51 in ipc_disconnect (conn_info=0x2aaac90ffc10)
    at ipc.c:802
#4  0x000000000041bd30 in msg_send (conn=0x2aaac90ffc10, 
    iov=<value optimized out>, iov_len=1, locked=1) at ipc.c:1078
#5  0x000000000041c5ae in outq_flush (handle=<value optimized out>, 
    fd=<value optimized out>, revent=1, data=0x2aaac90ffc10) at ipc.c:1118
#6  poll_handler_connection (handle=<value optimized out>, 
    fd=<value optimized out>, revent=1, data=0x2aaac90ffc10) at ipc.c:735
#7  0x0000000000405ce0 in poll_run (handle=0) at aispoll.c:402
#8  0x00000000004186be in main (argc=<value optimized out>, 
    argv=<value optimized out>) at main.c:628

Comment 3 Steven Dake 2011-07-27 17:11:36 UTC
Thread 1 is returning an error condition at ipc_disconnect:802. Unfortunately, errno is not available. It would be helpful to know what this error condition is; perhaps this code is not typically exercised.
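For reference, the backtrace is consistent with a classic self-deadlock: msg_send() is called with locked=1 (the connection mutex already held), the send fails, and ipc_disconnect() tries to take the same non-recursive mutex again. Below is a minimal, self-contained sketch of that pattern (hypothetical names, not the openais code; an ERRORCHECK mutex is used so the relock reports EDEADLK instead of blocking the way the real daemon does).

#include <errno.h>
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t conn_mutex;

/* Stands in for ipc_disconnect(): locks the connection mutex. */
static void disconnect_sketch(void)
{
    int res = pthread_mutex_lock(&conn_mutex); /* second lock, same thread */
    if (res == EDEADLK)
        printf("relock of a held mutex: the real daemon blocks here\n");
    else if (res == 0)
        pthread_mutex_unlock(&conn_mutex);
}

/* Stands in for msg_send() in the locked == 1 case. */
static void send_sketch(void)
{
    pthread_mutex_lock(&conn_mutex);   /* caller already holds the mutex */
    /* ... send() fails with an unexpected errno ... */
    disconnect_sketch();               /* tries to take the lock again */
    pthread_mutex_unlock(&conn_mutex);
}

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&conn_mutex, &attr);
    send_sketch();
    return 0;
}

Compile with gcc -pthread; with a default (non-errorcheck) mutex the same sequence hangs, which matches thread 1 sitting in __lll_lock_wait.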

Comment 4 Alan Conway 2011-07-27 17:50:51 UTC
Created attachment 515571 [details]
Reproducer script

Script to reproduce the failure; run it in a loop. The failure was seen only once, after 2 days of running the test continuously.

Comment 5 Alan Conway 2011-07-27 18:50:44 UTC
To run the test: 

make
while true; do ./runme.sh; done

Comment 6 Steven Dake 2011-08-08 21:57:53 UTC
Honza

Please try to reproduce this and determine the value of errno in thread 1 at ipc_disconnect:802.

Comment 8 Jan Friesse 2011-08-15 13:03:38 UTC
Steve,
I was not able to reproduce this problem after roughly 3 days of heavily running the attached test case, but with a simple modification of msg_send in ipc.c to set err to -1 and errno to a random value, the deadlock really was there. So there is definitely a bug, and I'm pretty sure the error is something like ECONNRESET.

My proposed solution:
- Use a reentrant mutex for conn_info->mutex (sketched below). This shouldn't change any code logic, and it's a very small change.
- Or apply the (res == -1 && errno == EAGAIN) handling to every res == -1. This ensures that disconnection is handled by the upper layers and/or later, and that non-critical errors are also handled by retry.
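For reference, the first option would look roughly like this (a sketch of the idea, not the attached patch):

#include <pthread.h>

/* Sketch: initialize the connection mutex as recursive so the same
 * thread can relock it, e.g. when ipc_disconnect() is reached from
 * msg_send() with the lock already held. */
int conn_mutex_init(pthread_mutex_t *mutex)
{
    pthread_mutexattr_t attr;
    int res;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    res = pthread_mutex_init(mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    return res;
}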

Steve, what's your opinion there?

Comment 9 Steven Dake 2011-08-15 15:11:28 UTC
Do you mean PTHREAD_MUTEX_RECURSIVE? That seems reasonable. Another option would be to create an ipc_disconnect_unlocked() function for the locked == 1 case (a rough sketch follows). As you fix the bug, please also handle EINTR by retrying the send operation in ipc.c, in case that is the cause of the ipc_disconnect call.

ECONNRESET is a possibility too (which would just call ipc_disconnect).
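The ipc_disconnect_unlocked() idea, in outline (hypothetical signatures; the struct is trimmed to the one relevant field):

#include <pthread.h>

struct conn_info {
    pthread_mutex_t mutex;  /* only the field that matters here */
};

/* Disconnect body; the caller must already hold conn_info->mutex. */
void ipc_disconnect_unlocked(struct conn_info *conn_info)
{
    (void)conn_info;  /* ... actual teardown work would go here ... */
}

/* Locking wrapper for callers that do not hold the mutex. */
void ipc_disconnect(struct conn_info *conn_info)
{
    pthread_mutex_lock(&conn_info->mutex);
    ipc_disconnect_unlocked(conn_info);
    pthread_mutex_unlock(&conn_info->mutex);
}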

Comment 11 Jan Friesse 2011-08-17 14:37:01 UTC
Created attachment 518695 [details]
Proposed patch 1 - Handle EINTR

This patch handles EINTR by retrying the send (outlined below).
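The retry idea in outline (illustrative names; the real change is in the attachment above):

#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Sketch: a send() interrupted by a signal (EINTR) is simply retried
 * instead of being treated as a fatal error that triggers
 * ipc_disconnect(). */
ssize_t send_retry(int fd, const void *buf, size_t len, int flags)
{
    ssize_t res;

    do {
        res = send(fd, buf, len, flags);
    } while (res == -1 && errno == EINTR);

    return res;
}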

Comment 12 Jan Friesse 2011-08-17 14:37:53 UTC
Created attachment 518696 [details]
Proposed patch 2 - Don't deadlock on ipc_disconnect

The patch handles the deadlock by preventing a double lock.

Comment 13 Jan Friesse 2011-08-17 14:39:13 UTC
Alan,
because I was not able to reproduce the issue, can you please test the two attached patches and confirm/deny that they solve the problem?

Thanks,
  Honza

Comment 20 Alan Conway 2011-09-20 13:07:19 UTC
Re comment 13: I can't readily reproduce this either; I only saw it once, after running the test for 2 days.

Comment 21 Miroslav Svoboda 2011-11-03 18:04:41 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
In rare cases, rapidly starting and stopping the openais service sometimes resulted in a deadlock. This update adapts the underlying source code to prevent such a deadlock, resolving this issue. Deadlock situations no longer happen during a rapid start/stop process.

Comment 23 errata-xmlrpc 2012-02-21 05:21:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0180.html