Bug 748351

Summary: corosync unloading hang on "__lll_lock_wait ()"
Product: [Fedora] Fedora Reporter: Shining <nshi_nb>
Component: corosync Assignee: Steven Dake <sdake>
Status: CLOSED UPSTREAM QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low Docs Contact:
Priority: unspecified    
Version: rawhide CC: agk, fdinitto, jfriesse, sdake
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-06-11 16:31:59 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Shining 2011-10-24 08:51:44 UTC
Description of problem:
  corosync hangs while unloading its services.
  Log from corosync.log:
   --------------------------------
Oct 24 15:45:02 corosync [SERV  ] Unloading all Corosync service engines.
Oct 24 15:45:02 corosync [SERV  ] Service engine unloaded: corosync extended virtual synchrony service
Oct 24 15:45:02 corosync [SERV  ] Service engine unloaded: corosync configuration service
Oct 24 15:45:02 corosync [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Oct 24 15:45:02 corosync [SERV  ] Service engine unloaded: corosync cluster config database access v1.01
Oct 24 15:45:02 corosync [SERV  ] Service engine unloaded: corosync profile loading service
Oct 24 15:45:02 corosync [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Oct 24 15:45:02 corosync [SERV  ] Service engine unloaded: gcw cluster membership service A.01.01
Oct 24 15:45:02 corosync [CIB   ] [DEBUG]: cib_exec_exit_fn [ENTER]
Oct 24 15:45:02 corosync [CIB   ] [DEBUG]: cib_exec_exit_fn [LEAVE]
Oct 24 15:45:02 corosync [SERV  ] Service engine unloaded: gcw cib service 1.0.0
Oct 24 15:45:02 corosync [CRM   ] [DEBUG]: crm_exec_exit_fn [ENTER]
Oct 24 15:45:02 corosync [CRM   ] [DEBUG]: crm_exec_exit_fn [LEAVE]
Oct 24 15:45:02 corosync [SERV  ] Service engine unloaded: gcw crm service 1.0.0
Oct 24 15:45:02 corosync [TOTEM ] sending join/leave message
Oct 24 15:45:02 corosync [MAIN  ] Corosync Cluster Engine exiting with status 0 at main.c:1810.
   --------------------------------
   (I think the second-to-last line in corosync.log points to the cause of the problem.)

   gdb info:
   --------------------------------
0x000000371c20d4c4 in __lll_lock_wait () from /lib64/libpthread.so.0
   --------------------------------


Version-Release number of selected component (if applicable):
  corosync v1.3.4
  OS: CentOS 5.6 x86_64

How reproducible:
  I have four nodes. Each one runs a simple shell script that repeatedly starts and stops the corosync service on that node. After several rounds of start/stop, corosync hangs on some of the nodes.
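
  The script itself is not attached. A minimal sketch of such a start/stop loop might look like the following. It assumes the daemon is controlled through the init script ("service corosync start" / "service corosync stop"); the function name cycle_corosync and the CTL variable are made up for illustration.

```shell
#!/bin/sh
# Hypothetical reproduction loop; not the script actually used on the nodes.
# CTL is the command prefix that controls the daemon (assumed: the init script).
CTL=${CTL:-"service corosync"}

# cycle_corosync ROUNDS [PAUSE]
# Start and stop the daemon ROUNDS times, sleeping PAUSE seconds after each step.
# Returns non-zero at the first round in which start or stop reports failure.
cycle_corosync() {
    rounds=$1
    pause=${2:-5}
    i=1
    while [ "$i" -le "$rounds" ]; do
        echo "round $i: start"
        $CTL start || { echo "start failed in round $i" >&2; return 1; }
        sleep "$pause"
        echo "round $i: stop"
        # If corosync hangs while unloading, the loop stalls here.
        $CTL stop || { echo "stop failed in round $i" >&2; return 1; }
        sleep "$pause"
        i=$((i + 1))
    done
}
```

  Running, for example, "cycle_corosync 50 5" on each node at the same time approximates the test. The last "round N: stop" line printed before the loop stalls shows in which round the hang occurred.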

Steps to Reproduce:
1.
2.
3.
  
Actual results:



Expected results:


Additional info:

Comment 1 Steven Dake 2011-10-24 14:31:02 UTC
Please run "thread apply all bt" in gdb so we can get a proper backtrace of all threads.  You will need the debuginfo packages installed for this to work properly.
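
One way to capture that output non-interactively is sketched below. It assumes gdb is installed, that the hung process is still running, and that its PID is known; the function name dump_threads and the GDB variable are hypothetical helpers, not part of corosync.

```shell
#!/bin/sh
# Hypothetical helper for collecting a full thread backtrace from a hung process.
# GDB can be overridden if the debugger lives somewhere unusual.
GDB=${GDB:-gdb}

# dump_threads PID OUTFILE
# Attach to PID in batch mode, run "thread apply all bt", and detach.
# Everything gdb prints (stdout and stderr) is written to OUTFILE.
dump_threads() {
    pid=$1
    outfile=$2
    "$GDB" -p "$pid" -batch -ex "thread apply all bt" > "$outfile" 2>&1
}
```

For example, "dump_threads "$(pidof corosync)" corosync-backtrace.txt" writes the trace to a file that can be attached to this bug, assuming exactly one corosync process is running. On yum-based systems the debuginfo packages can usually be installed with "debuginfo-install corosync" from yum-utils.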

Comment 2 Steven Dake 2012-06-11 16:31:59 UTC
This has been fixed in upstream pacemaker and corosync.