Bug 566467 - aisexec still running after "service cman stop"
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
5.3
All Linux
Priority: urgent  Severity: urgent
: rc
: ---
Assigned To: Steven Dake
Cluster QE
Keywords: Regression, TestBlocker, ZStream
Depends On:
Blocks: 581603 581604
Reported: 2010-02-18 10:05 EST by Nate Straz
Modified: 2016-04-26 11:21 EDT (History)
6 users

See Also:
Fixed In Version: openais-0.80.6-16.el5
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-03-30 03:48:49 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments
patch to fix the problem (3.69 KB, patch)
2010-03-09 14:29 EST, Steven Dake

Description Nate Straz 2010-02-18 10:05:31 EST
Description of problem:

While running loops of starting and stopping cman on all nodes, one node still had aisexec running after `service cman stop`, and cman_tool commands would hang.

(gdb) thr a a bt

Thread 2 (Thread 0x40faffe0 (LWP 16173)):
#0  0x0000003009acaf36 in __poll (fds=0x0, nfds=0, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x000000000041cc05 in prioritized_timer_thread (data=<value optimized out>) at timer.c:125
#2  0x000000300a206617 in start_thread (arg=<value optimized out>) at pthread_create.c:297
#3  0x0000003009ad3d1d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2ac31d139f00 (LWP 16172)):
#0  0x000000300a20ad09 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000417e11 in worker_thread_group_wait (worker_thread_group=0x731910) at wthread.c:171
#2  0x0000003009a333a5 in exit (status=0) at exit.c:75
#3  0x0000003009a1d99b in __libc_start_main (main=0x418360 <main>, argc=1, ubp_av=0x7fff5f23c168, init=<value optimized out>, fini=<value optimized out>,
    rtld_fini=<value optimized out>, stack_end=0x7fff5f23c158) at libc-start.c:262
#4  0x0000000000405a19 in _start ()


Version-Release number of selected component (if applicable):
openais-0.80.6-14.el5
cman-2.0.115-30.el5

How reproducible:
Unknown

Steps to Reproduce:
1. start and stop cman on all nodes, verifying each step
  
Actual results:


Expected results:


Additional info:
Comment 1 Nate Straz 2010-02-18 10:08:32 EST
Looking at the backtrace info makes me think that this may be a glibc problem.

Version-Release number of selected component (if applicable):
glibc-2.5-47
Comment 2 Nate Straz 2010-02-18 15:47:58 EST
I hit this again while running the laryngitis test, which shuts down aisexec via `cman_tool leave`.  A subsequent `cman_tool status` hung.  Here are the logs and gdb backtraces:

Feb 18 15:20:54 west-04 qarshd[6040]: Running cmdline: cman_tool leave force -t 5
Feb 18 15:21:01 west-04 openais[5957]: [TOTEM] The token was lost in the OPERATIONAL state.
Feb 18 15:21:01 west-04 openais[5957]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes).
Feb 18 15:21:01 west-04 openais[5957]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Feb 18 15:21:01 west-04 openais[5957]: [TOTEM] entering GATHER state from 2.
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] entering GATHER state from 0.
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] Creating commit token because I am the rep.
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] Saving state aru 43 high seq received 43
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] Storing new sequence id for ring fb8
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] entering COMMIT state.
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] entering RECOVERY state.
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] position [0] member 10.16.34.104:
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] previous ring seq 4020 rep 10.16.34.103
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] aru 43 high delivered 43 received flag 1
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] position [1] member 10.16.34.105:
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] previous ring seq 4020 rep 10.16.34.103
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] aru 43 high delivered 43 received flag 1
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] position [2] member 10.16.34.107:
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] previous ring seq 4020 rep 10.16.34.103
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] aru 43 high delivered 43 received flag 1
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] position [3] member 10.16.34.108:
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] previous ring seq 4020 rep 10.16.34.103
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] aru 43 high delivered 43 received flag 1
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] Did not need to originate any messages in recovery.
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] Sending initial ORF token
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] New Configuration:
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.104)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.105)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.107)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.108)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] Members Left:
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.103)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] Members Joined:
Feb 18 15:21:21 west-04 openais[5957]: [CMAN ] quorum lost, blocking activity
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] New Configuration:
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.104)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.105)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.107)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.108)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] Members Left:
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] Members Joined:
Feb 18 15:21:21 west-04 openais[5957]: [SYNC ] This node is within the primary component and will provide service.
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] entering OPERATIONAL state.
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] got nodejoin message 10.16.34.104
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] got nodejoin message 10.16.34.105
Feb 18 15:21:21 west-04 xinetd[2734]: EXIT: qarsh status=0 pid=6040 duration=27(sec)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] got nodejoin message 10.16.34.107
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] got nodejoin message 10.16.34.108
Feb 18 15:21:21 west-04 openais[5957]: [CPG  ] got joinlist message from node 8
Feb 18 15:21:21 west-04 openais[5957]: [CPG  ] got joinlist message from node 4
Feb 18 15:21:21 west-04 openais[5957]: [CPG  ] got joinlist message from node 5
Feb 18 15:21:21 west-04 openais[5957]: [CPG  ] got joinlist message from node 7
Feb 18 15:21:22 west-04 openais[5957]: [SERV ] Unloading all openais components
Feb 18 15:21:22 west-04 openais[5957]: [SERV ] Unloading openais component: openais_confdb v0 (19/10)
Feb 18 15:21:22 west-04 openais[5957]: [SERV ] Unloading openais component: openais_cpg v0 (18/8)
Feb 18 15:21:22 west-04 openais[5957]: [SERV ] Unloading openais component: openais_cfg v0 (17/7)
Feb 18 15:21:22 west-04 openais[5957]: [SERV ] Unloading openais component: openais_msg v0 (16/6)
[root@west-04 ~]# tail -n 100 /var/log/messages
Feb 18 15:20:51 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.107)
Feb 18 15:20:51 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.108)
Feb 18 15:20:51 west-04 openais[5957]: [CLM  ] Members Left:
Feb 18 15:20:51 west-04 openais[5957]: [CLM  ] Members Joined:
Feb 18 15:20:51 west-04 openais[5957]: [SYNC ] This node is within the primary component and will provide service.
Feb 18 15:20:51 west-04 openais[5957]: [TOTEM] entering OPERATIONAL state.
Feb 18 15:20:51 west-04 openais[5957]: [CLM  ] got nodejoin message 10.16.34.103
Feb 18 15:20:51 west-04 openais[5957]: [CLM  ] got nodejoin message 10.16.34.104
Feb 18 15:20:51 west-04 openais[5957]: [CLM  ] got nodejoin message 10.16.34.105
Feb 18 15:20:51 west-04 openais[5957]: [CLM  ] got nodejoin message 10.16.34.107
Feb 18 15:20:51 west-04 openais[5957]: [CLM  ] got nodejoin message 10.16.34.108
Feb 18 15:20:51 west-04 openais[5957]: [CPG  ] got joinlist message from node 8
Feb 18 15:20:51 west-04 openais[5957]: [CPG  ] got joinlist message from node 3
Feb 18 15:20:51 west-04 openais[5957]: [CPG  ] got joinlist message from node 4
Feb 18 15:20:51 west-04 openais[5957]: [CPG  ] got joinlist message from node 5
Feb 18 15:20:51 west-04 openais[5957]: [CPG  ] got joinlist message from node 7
Feb 18 15:20:54 west-04 xinetd[2734]: START: qarsh pid=6038 from=10.16.32.100
Feb 18 15:20:54 west-04 qarshd[6038]: Talking to peer 10.16.32.100:48050
Feb 18 15:20:54 west-04 qarshd[6038]: Running cmdline: cman_tool status
Feb 18 15:20:54 west-04 xinetd[2734]: EXIT: qarsh status=0 pid=6038 duration=0(sec)
Feb 18 15:20:54 west-04 xinetd[2734]: START: qarsh pid=6040 from=10.16.32.100
Feb 18 15:20:54 west-04 qarshd[6040]: Talking to peer 10.16.32.100:48051
Feb 18 15:20:54 west-04 qarshd[6040]: Running cmdline: cman_tool leave force -t 5
Feb 18 15:21:01 west-04 openais[5957]: [TOTEM] The token was lost in the OPERATIONAL state.
Feb 18 15:21:01 west-04 openais[5957]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes).
Feb 18 15:21:01 west-04 openais[5957]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Feb 18 15:21:01 west-04 openais[5957]: [TOTEM] entering GATHER state from 2.
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] entering GATHER state from 0.
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] Creating commit token because I am the rep.
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] Saving state aru 43 high seq received 43
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] Storing new sequence id for ring fb8
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] entering COMMIT state.
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] entering RECOVERY state.
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] position [0] member 10.16.34.104:
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] previous ring seq 4020 rep 10.16.34.103
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] aru 43 high delivered 43 received flag 1
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] position [1] member 10.16.34.105:
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] previous ring seq 4020 rep 10.16.34.103
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] aru 43 high delivered 43 received flag 1
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] position [2] member 10.16.34.107:
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] previous ring seq 4020 rep 10.16.34.103
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] aru 43 high delivered 43 received flag 1
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] position [3] member 10.16.34.108:
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] previous ring seq 4020 rep 10.16.34.103
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] aru 43 high delivered 43 received flag 1
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] Did not need to originate any messages in recovery.
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] Sending initial ORF token
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] New Configuration:
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.104)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.105)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.107)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.108)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] Members Left:
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.103)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] Members Joined:
Feb 18 15:21:21 west-04 openais[5957]: [CMAN ] quorum lost, blocking activity
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] New Configuration:
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.104)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.105)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.107)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ]  r(0) ip(10.16.34.108)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] Members Left:
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] Members Joined:
Feb 18 15:21:21 west-04 openais[5957]: [SYNC ] This node is within the primary component and will provide service.
Feb 18 15:21:21 west-04 openais[5957]: [TOTEM] entering OPERATIONAL state.
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] got nodejoin message 10.16.34.104
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] got nodejoin message 10.16.34.105
Feb 18 15:21:21 west-04 xinetd[2734]: EXIT: qarsh status=0 pid=6040 duration=27(sec)
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] got nodejoin message 10.16.34.107
Feb 18 15:21:21 west-04 openais[5957]: [CLM  ] got nodejoin message 10.16.34.108
Feb 18 15:21:21 west-04 openais[5957]: [CPG  ] got joinlist message from node 8
Feb 18 15:21:21 west-04 openais[5957]: [CPG  ] got joinlist message from node 4
Feb 18 15:21:21 west-04 openais[5957]: [CPG  ] got joinlist message from node 5
Feb 18 15:21:21 west-04 openais[5957]: [CPG  ] got joinlist message from node 7
Feb 18 15:21:22 west-04 openais[5957]: [SERV ] Unloading all openais components
Feb 18 15:21:22 west-04 openais[5957]: [SERV ] Unloading openais component: openais_confdb v0 (19/10)
Feb 18 15:21:22 west-04 openais[5957]: [SERV ] Unloading openais component: openais_cpg v0 (18/8)
Feb 18 15:21:22 west-04 openais[5957]: [SERV ] Unloading openais component: openais_cfg v0 (17/7)
Feb 18 15:21:22 west-04 openais[5957]: [SERV ] Unloading openais component: openais_msg v0 (16/6)
Feb 18 15:21:22 west-04 openais[5957]: [SERV ] Unloading openais component: openais_lck v0 (15/5)
Feb 18 15:21:22 west-04 openais[5957]: [SERV ] Unloading openais component: openais_evt v0 (14/4)
Feb 18 15:21:22 west-04 openais[5957]: [SERV ] Unloading openais component: openais_ckpt v0 (13/3)
Feb 18 15:21:22 west-04 openais[5957]: [SERV ] Unloading openais component: openais_amf v0 (12/2)
Feb 18 15:21:22 west-04 openais[5957]: [SERV ] Unloading openais component: openais_clm v0 (11/1)
Feb 18 15:21:22 west-04 openais[5957]: [SERV ] Unloading openais component: openais_evs v0 (10/0)
Feb 18 15:21:22 west-04 openais[5957]: [SERV ] Unloading openais component: openais_cman v0 (9/9)
Feb 18 15:21:22 west-04 openais[5957]: [SERV ] AIS Executive exiting (reason: CMAN requested shutdown).


(gdb) thr a a bt

Thread 2 (Thread 0x4027dfe0 (LWP 5958)):
#0  0x0000003f716caf36 in __poll (fds=0x0, nfds=0, timeout=-1)
    at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x000000000041cc05 in prioritized_timer_thread (data=<value optimized out>) at timer.c:125
#2  0x0000003f71e06617 in start_thread (arg=<value optimized out>) at pthread_create.c:297
#3  0x0000003f716d3d1d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2b2158f6ef00 (LWP 5957)):
#0  0x0000003f71e0ad09 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000417e11 in worker_thread_group_wait (worker_thread_group=0x731910)
    at wthread.c:171
#2  0x0000003f716333a5 in exit (status=0) at exit.c:75
#3  0x0000003f7161d99b in __libc_start_main (main=0x418360 <main>, argc=1,
    ubp_av=0x7fff0104e448, init=<value optimized out>, fini=<value optimized out>,
    rtld_fini=<value optimized out>, stack_end=0x7fff0104e438) at libc-start.c:262
#4  0x0000000000405a19 in _start ()
Comment 4 Steven Dake 2010-02-18 18:09:44 EST
Not a regression in glibc, kernel, or openais.

The stack trace isn't complete, so it is not possible to determine the full call path in thread 1 that leads to worker_thread_group_wait.

There are two scenarios here:

1)
The worker thread code calls pthread_cancel followed by pthread_cond_destroy while the worker thread is exiting.  POSIX specifies that calling pthread_cond_destroy on a condition variable that is currently referenced by another thread has undefined results.  This could be the problem.

2)
It is possible that the poll loop which drives the delivery of messages to totemsrp delivers a token, which calls worker_thread_group_wait.  Before that call, another thread of execution triggers a shutdown operation, which destroys the variables and information the totemnet.c code uses in the first thread.  There is no mutual exclusion between the totempg delivery of messages from the poll abstraction and the top-level totempg_finalize api; none is needed except in this particular case.

Unfortunately, fixing #2 is challenging, since introducing a mutex in the delivery of UDP messages has a significant performance impact.
Comment 5 Dean Jansa 2010-02-19 15:23:49 EST
I just hit this same issue on my 5 node x86_64 cluster.

(gdb) thr a a bt

Thread 2 (Thread 0x41cd3fe0 (LWP 13069)):
#0  0x0000003049ccaf36 in poll () from /lib64/libc.so.6
#1  0x000000000041cc05 in ?? ()
#2  0x000000304a406617 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003049cd3d1d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2b6b903cff00 (LWP 13068)):
#0  0x000000304a40ad09 in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x0000000000417e11 in worker_thread_group_wait ()
#2  0x0000003049c333a5 in exit () from /lib64/libc.so.6
#3  0x0000003049c1d99b in __libc_start_main () from /lib64/libc.so.6
#4  0x0000000000405a19 in _start ()


The same test case passed in 5.4, marking this a regression.
Comment 6 Steven Dake 2010-02-19 16:36:43 EST
This is not a regression.  The code is obviously incorrect and hasn't changed since 5.0.  Other aspects of the system could have changed, resulting in timing changes that expose this issue.
Comment 10 Nate Straz 2010-03-09 10:15:16 EST
I'm making much better progress with scratch build openais-0.80.6-15.el5_5.10
Comment 11 Steven Dake 2010-03-09 14:29:38 EST
Created attachment 398901 [details]
patch to fix the problem

------------------------------------------------------------------------
r2121 | sdake | 2010-03-09 12:27:35 -0700 (Tue, 09 Mar 2010) | 4 lines

Fix lockup in atexit handler registered by print.c that occurs less than 1% of
the time.  This is resolved by calling a special version of log_flush before
calling exit, and not relying on the atexit handler to call pthread_cond_wait.
Comment 14 Nate Straz 2010-03-17 13:54:06 EDT
Verified with openais-0.80.6-16.el5.  I've made it through whiplash, laryngitis, and several cluster_setup/cluster_clean cycles and have not reproduced the bug with the new package.
Comment 16 errata-xmlrpc 2010-03-30 03:48:49 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2010-0180.html
