Description of problem:
After being shot during a revolver run, the three nodes that got shot hung
during startup in the clvmd init script.

Version-Release number of selected component (if applicable):
lvm2-cluster-2.02.06-1.0.RHEL4.i386
GFS-kernel-hugemem-2.6.9-57.0.i686
cman-kernel-hugemem-2.6.9-44.7.i686
cmirror-kernel-hugemem-2.6.9-5.3.i686
dlm-kernel-hugemem-2.6.9-42.7.i686

How reproducible:
Unknown

Steps to Reproduce:
1. Start dlm cluster with GFS file system
2. Start load on all nodes
3. reboot -fin on a node or two

Actual results:
Nodes hung during restart:

Starting ccsd:                                             [  OK  ]
Starting cman: CMAN 2.6.9-44.7 (built Jun 2 2006 17:49:45) installed
               DLM 2.6.9-42.7 (built Jun 2 2006 17:55:25) installed
                                                           [FAILED]
Loading clustered mirror log: dm_cmirror: disagrees about version of symbol struct_module
FATAL: Error inserting dm_cmirror (/lib/modules/2.6.9-39.ELhugemem/kernel/cluster/dm-cmirror.ko): Invalid module format
                                                           [FAILED]
Starting lock_gulmd:                                       [WARNING]
Starting fence domain:                                     [FAILED]
Starting clvmd:

Expected results:
Nodes should restart without issues.

Additional info:
It appears I hit split brain.

[root@morph-02 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    5   X   morph-01
   2    1    5   M   morph-02
   3    1    5   X   morph-05
   4    1    5   X   morph-03
   5    1    5   M   morph-04

[root@morph-02 ~]# cman_tool status
Protocol version: 5.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Cluster Member: Yes
Membership state: State-Transition: Master is morph-04
Nodes: 2
Expected_votes: 5
Total_votes: 2
Quorum: 3
Active subsystems: 12
Node name: morph-02
Node ID: 2
Node addresses: 10.15.89.62

[root@morph-04 ~]# cman_tool status
Protocol version: 5.0.1
Config version: 1
Cluster name: morph-cluster
Cluster ID: 41652
Cluster Member: Yes
Membership state: State-Transition: Master is morph-02
Nodes: 2
Expected_votes: 5
Total_votes: 2
Quorum: 3
Active subsystems: 12
Node name: morph-04
Node ID: 5
Node addresses: 10.15.89.64
Created attachment 130808 [details]
Proposed patch to fix

Are you in a position to test a patch? The attached works for me, but then I can't reproduce it anyway.
I have the time if you can build me a cman-kernel package for i686.
OK, get one from http://people.redhat.com/pcaulfie/
I tried the cman-kernel-smp package and was still able to make the clvmd init script hang. Unfortunately, all nodes are back in the cluster and the cluster is quorate. Time to move sshd and xinetd up in the boot order.
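[For reference, a minimal sketch of one way to do that on RHEL4 via the init-script chkconfig header; the priority numbers below are assumptions, not necessarily what was used on the test nodes:

    # Hypothetical: raise sshd's start priority from S55 to S18 so it comes
    # up before the cluster init scripts. Assumes the stock header line
    # "# chkconfig: 2345 55 25" in /etc/init.d/sshd.
    sed -i 's/^# chkconfig: 2345 55 25$/# chkconfig: 2345 18 25/' /etc/init.d/sshd
    chkconfig sshd off && chkconfig sshd on   # recreate rc symlinks from the header
    ls /etc/rc3.d/ | grep sshd                # should now show S18sshd

The same edit would apply to xinetd's init script.]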
I made sshd start earlier in the boot sequence and was able to reproduce the clvmd hang again. All nodes are in the cluster, so I'm not going to worry about the cman problems I listed above. On the nodes where clvmd was hung I found cman_userjoin and clvmd in D state:

 2620 ?        D      0:00 [cman_userjoin]
 2629 ?        Ds     0:00 clvmd

I need to get the kernel-debuginfo installed so I can use crash; I can't attach to the hung clvmd process with gdb.
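[For reference, a sketch of the crash workflow against a live system; the debuginfo package name and vmlinux path are the usual RHEL4 conventions but should be treated as assumptions:

    # Install the debuginfo matching the running kernel (exact package
    # name/NVR assumed -- it must match the booted hugemem kernel)
    rpm -ivh kernel-debuginfo-2.6.9-*.i686.rpm

    # Run crash against the live kernel using the debuginfo vmlinux
    crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux

    # Inside crash, find the D-state (UN) tasks and dump their stacks:
    #   crash> ps | grep UN
    #   crash> bt 2629        # clvmd
    #   crash> bt 2620        # cman_userjoin
]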
The output from crash looks totally useless :(

crash> bt 2630
PID: 2630   TASK: f5d1aeb0   CPU: 0   COMMAND: "clvmd"
 #0 [f5aebe14] schedule at c02d20e5
 #1 [f5aebe78] wait_for_completion at c02d2212
 #2 [f5aebecc] kcl_join_service at f892b210
 #3 [f5aebee4] new_lockspace at f89a853a
 #4 [f5aebf0c] dlm_new_lockspace at f89a8613
 #5 [f5aebf28] register_lockspace at f89a22c5
 #6 [f5aebf50] do_user_create_lockspace at f89a2b03
 #7 [f5aebf5c] dlm_write at f89a36b2
 #8 [f5aebf88] vfs_write at c015b08d
 #9 [f5aebfa4] sys_write at c015b155
#10 [f5aebfc0] system_call at c02d4218
    EAX: 00000004  EBX: 00000007  ECX: bff4eb90  EDX: 0000006d
    DS:  007b      ESI: 080922d2  ES:  007b      EDI: 080922d7
    SS:  007b      ESP: bff4eb74  EBP: bff4fcd8
    CS:  0073      EIP: 004c57a2  ERR: 00000004  EFLAGS: 00000246

crash> bt 2621
PID: 2621   TASK: f5def830   CPU: 0   COMMAND: "cman_userjoin"
 #0 [f5b24f0c] schedule at c02d20e5
 #1 [f5b24f70] wait_for_completion at c02d2212
 #2 [f5b24fc4] kcl_join_service at f892b210
 #3 [f5b24fdc] user_join_async at f892bb6c
 #4 [f5b24ff0] kernel_thread_helper at c01041f3

Any thoughts on where to go with this? I can reproduce it reliably, but it takes an hour or so to hit. I have access to the machines while hung via ssh, with crash available. I would like to know what to try next.
That's a totally different bug to the last one if all the nodes are joining. Can you post the output of "cman_tool services" and /proc/cluster/dlm_debug? Also are you sure it's not just clvmd waiting for the cluster to become quorate (sorry, but I have to ask)?
I think there really are two different bugs here:
1. the clvmd hang
2. the weird cman state
(2) doesn't always occur when (1) happens.

Here is the output of cman_tool services for each node:

morph-01::
Service          Name                           GID LID State     Code
Fence Domain:    "default"                       11   2 run       U-1,10,3  [2 1]
DLM Lock Space:  "clvmd"                         12   3 run       U-1,10,4  [2 1]
DLM Lock Space:  "morph-cluster0"                14   5 run       -         [2 1]
DLM Lock Space:  "morph-cluster1"                16   7 run       -         [2 1]
DLM Lock Space:  "morph-cluster2"                18   9 run       -         [2 1]
GFS Mount Group: "morph-cluster0"                15   6 run       -         [2 1]
GFS Mount Group: "morph-cluster1"                17   8 run       -         [2 1]
GFS Mount Group: "morph-cluster2"                19  10 run       -         [2 1]
User:            "usrm::manager"                 13   4 recover 4 -         [2 1]

morph-02::
Service          Name                           GID LID State     Code
Fence Domain:    "default"                       11   2 run       U-1,10,3  [2 1]
DLM Lock Space:  "clvmd"                         12   3 run       U-1,10,4  [2 1]
DLM Lock Space:  "morph-cluster0"                14   5 run       -         [2 1]
DLM Lock Space:  "morph-cluster1"                16   7 run       -         [2 1]
DLM Lock Space:  "morph-cluster2"                18   9 run       -         [2 1]
GFS Mount Group: "morph-cluster0"                15   6 run       -         [2 1]
GFS Mount Group: "morph-cluster1"                17   8 run       -         [2 1]
GFS Mount Group: "morph-cluster2"                19  10 run       -         [2 1]
User:            "usrm::manager"                 13   4 recover 2 -         [2 1]

morph-03::
Service          Name                           GID LID State     Code
Fence Domain:    "default"                        0   2 join      S-1,80,5  []
DLM Lock Space:  "clvmd"                         12   3 join      S-4,4,1   [2 1 4]

morph-04::
Service          Name                           GID LID State     Code
Fence Domain:    "default"                        0   2 join      S-1,80,5  []
DLM Lock Space:  "clvmd"                          0   3 join      S-1,80,5  []

morph-05::
Service          Name                           GID LID State     Code
Fence Domain:    "default"                       11   2 join      S-4,4,1   [2 1 3]
DLM Lock Space:  "clvmd"                          0   3 join      S-1,80,5  []

dlm_debug contents:

morph-01::
morph-cluster1 resend 1b02d4 lq 1 flg 200000 node -1/-1 "
morph-cluster1 resent 5 requests
morph-cluster1 recover event 31 finished
morph-cluster1 grant lock on lockqueue 3
morph-cluster1 process_lockqueue_reply id 540119 state 0
morph-cluster1 grant lock on lockqueue 3
morph-cluster1 process_lockqueue_reply id 4e0068 state 0
morph-cluster1 grant lock on lockqueue 3
morph-cluster1 process_lockqueue_reply id 5400b8 state 0
morph-cluster1 grant lock on lockqueue 3
morph-cluster1 process_lockqueue_reply id 68031e state 0
morph-cluster1 grant lock on lockqueue 3
morph-cluster1 process_lockqueue_reply id 480137 state 0
morph-cluster1 grant lock on lockqueue 3
morph-cluster1 process_lockqueue_reply id 580152 state 0
morph-cluster1 grant lock on lockqueue 3
morph-cluster1 process_lockqueue_reply id 600110 state 0
morph-cluster1 grant lock on lockqueue 3
morph-cluster1 process_lockqueue_reply id 600200 state 0
morph-cluster2 grant lock on lockqueue 3
morph-cluster2 process_lockqueue_reply id 3c0341 state 0

morph-02::
f2 lq 1 nodeid -1
morph-cluster1 marked 2 requests
morph-cluster1 purge locks of departed nodes
morph-cluster0 rebuilt 3 locks
morph-cluster1 purged 3728 locks
morph-cluster1 update remastered resources
morph-cluster0 recover event 57 done
morph-cluster1 updated 3 resources
morph-cluster1 rebuild locks
morph-cluster1 rebuilt 3 locks
morph-cluster1 recover event 57 done
morph-cluster0 move flags 0,0,1 ids 51,57,57
morph-cluster0 process held requests
morph-cluster0 processed 0 requests
morph-cluster0 resend marked requests
morph-cluster0 resend 6902c4 lq 1 flg 200000 node -1/-1 "
morph-cluster0 resent 1 requests
morph-cluster1 move flags 0,0,1 ids 53,57,57
morph-cluster1 process held requests
morph-cluster1 processed 0 requests
morph-cluster1 resend marked requests
morph-cluster1 resend 8601f4 lq 1 flg 200000 node -1/-1 "
morph-cluster1 resend 8d03f2 lq 1 flg 200000 node -1/-1 "
morph-cluster1 resent 2 requests
morph-cluster1 recover event 57 finished
morph-cluster0 recover event 57 finished

morph-03:: <empty>
morph-04:: <empty>
morph-05:: <empty>

All nodes are in the cluster (in this case), so I'm assuming it's quorate. But in this case (2) did not happen.
Yes, this is certainly two bugs. If you're not seeing the cman split any more (2), then I'll commit that patch for U4.

The other looks like it might possibly be fencing related. Are there any fencing messages on the remaining nodes?
I haven't run the test case enough to see if (2) still happens some of the time. There weren't as many fence messages as I expected. Here are the parts from /var/log/messages on morph-01 and morph-02, the surviving nodes, just after the other nodes were rebooted with "reboot -fin".

morph-01::
Jun 14 13:22:46 morph-01 qarshd[4413]: Running cmdline: tcpdump -s0 -x port 6809
Jun 14 13:22:46 morph-01 kernel: device eth0 entered promiscuous mode
Jun 14 13:23:06 morph-01 kernel: CMAN: removing node morph-03 from the cluster : Missed too many heartbeats
Jun 14 13:23:12 morph-01 kernel: CMAN: removing node morph-05 from the cluster : No response to messages
Jun 14 13:23:24 morph-01 kernel: CMAN: node morph-04 has been removed from the cluster : No response to messages
Jun 14 13:23:34 morph-01 kernel: CMAN: quorum lost, blocking activity
Jun 14 13:23:34 morph-01 clurgmgrd[3239]: <emerg> #1: Quorum Dissolved
Jun 14 13:23:37 morph-01 ccsd[2657]: Cluster is not quorate.  Refusing connection.
Jun 14 13:23:37 morph-01 ccsd[2657]: Error while processing connect: Connection refused
Jun 14 13:23:47 morph-01 ccsd[2657]: Cluster is not quorate.  Refusing connection.
Jun 14 13:23:47 morph-01 ccsd[2657]: Error while processing connect: Connection refused
Jun 14 13:23:48 morph-01 qarshd[4424]: Talking to peer 10.15.80.50:50338
Jun 14 13:23:48 morph-01 qarshd[4424]: Running cmdline: cat /proc/cluster/nodes
Jun 14 13:23:53 morph-01 qarshd[4426]: Talking to peer 10.15.80.50:50340
Jun 14 13:23:53 morph-01 qarshd[4426]: Running cmdline: cat /proc/cluster/nodes
Jun 14 13:23:57 morph-01 ccsd[2657]: Cluster is not quorate.  Refusing connection.
Jun 14 13:23:57 morph-01 ccsd[2657]: Error while processing connect: Connection refused
Jun 14 13:23:58 morph-01 qarshd[4429]: Talking to peer 10.15.80.50:50342
Jun 14 13:23:58 morph-01 qarshd[4429]: Running cmdline: cat /proc/cluster/nodes
Jun 14 13:24:07 morph-01 ccsd[2657]: Cluster is not quorate.  Refusing connection.
Jun 14 13:24:07 morph-01 ccsd[2657]: Error while processing connect: Connection refused
Jun 14 13:24:10 morph-01 kernel: CMAN: node morph-03 rejoining
Jun 14 13:24:11 morph-01 kernel: CMAN: quorum regained, resuming activity
Jun 14 13:24:11 morph-01 clurgmgrd[3239]: <notice> Quorum Achieved
Jun 14 13:24:11 morph-01 clurgmgrd[3239]: <info> Magma Event: Membership Change
Jun 14 13:24:11 morph-01 ccsd[2657]: Cluster is quorate.  Allowing connections.
Jun 14 13:24:11 morph-01 clurgmgrd[3239]: <info> State change: Local UP
Jun 14 13:24:11 morph-01 clurgmgrd[3239]: <info> State change: morph-02 UP
Jun 14 13:24:11 morph-01 clurgmgrd[3239]: <info> State change: morph-03 UP
Jun 14 13:24:11 morph-01 clurgmgrd[3239]: <info> Loading Service Data
Jun 14 13:24:12 morph-01 clurgmgrd[3239]: <info> Initializing Services
Jun 14 13:24:25 morph-01 kernel: CMAN: node morph-05 rejoining
Jun 14 13:24:35 morph-01 kernel: CMAN: node morph-04 rejoining

morph-02::
Jun 14 13:22:54 morph-02 qarshd[4561]: Running cmdline: tcpdump -s0 -x port 6809
Jun 14 13:23:13 morph-02 kernel: CMAN: node morph-03 has been removed from the cluster : Missed too many heartbeats
Jun 14 13:23:24 morph-02 kernel: CMAN: removing node morph-05 from the cluster : No response to messages
Jun 14 13:23:30 morph-02 kernel: CMAN: removing node morph-04 from the cluster : No response to messages
Jun 14 13:23:41 morph-02 clurgmgrd[3192]: <emerg> #1: Quorum Dissolved
Jun 14 13:23:41 morph-02 kernel: CMAN: quorum lost, blocking activity
Jun 14 13:23:55 morph-02 qarshd[4569]: Talking to peer 10.15.80.50:36362
Jun 14 13:23:55 morph-02 qarshd[4569]: Running cmdline: cat /proc/cluster/nodes
Jun 14 13:24:00 morph-02 qarshd[4571]: Talking to peer 10.15.80.50:36364
Jun 14 13:24:00 morph-02 qarshd[4571]: Running cmdline: cat /proc/cluster/nodes
Jun 14 13:24:05 morph-02 qarshd[4573]: Talking to peer 10.15.80.50:36366
Jun 14 13:24:05 morph-02 qarshd[4573]: Running cmdline: cat /proc/cluster/nodes
Jun 14 13:24:17 morph-02 kernel: CMAN: node morph-03 rejoining
Jun 14 13:24:18 morph-02 ccsd[2573]: Cluster is quorate.  Allowing connections.
Jun 14 13:24:18 morph-02 kernel: CMAN: quorum regained, resuming activity
Jun 14 13:24:18 morph-02 fenced[2648]: fencing deferred to morph-01
Jun 14 13:24:32 morph-02 kernel: CMAN: node morph-05 rejoining
Jun 14 13:24:42 morph-02 kernel: CMAN: node morph-04 rejoining

I still have the tcpdump for these two nodes; would you like to see it for the time slice from 13:22:54 to 13:24:42?
No thanks, I don't need another tcpdump unless the cman problem recurs. Odd about the fence messages: morph-01 should be in charge of fencing, but there don't seem to be any messages about it. I think we need to see what fenced is doing here.
I'm not sure how you want me to do this. I'm trying to run fenced with the -D option, but I can't get it to stay up reliably with that option. Also, I don't see anything useful in the output. Here's the output from fenced on morph-04 when it was deferred to.

fenced: 1150471093 members:
fenced: 1150471093 nodeid = 1 "morph-04"
fenced: 1150471093 nodeid = 2 "morph-05"
fenced: 1150471093 nodeid = 3 "morph-02"
fenced: 1150471093 nodeid = 4 "morph-01"
fenced: 1150471093 nodeid = 5 "morph-03"
fenced: 1150471093 do_recovery stop 70 start 71 finish 70
fenced: 1150471093 finish:
fenced: 1150471093 event_id = 71
fenced: 1150471093 last_stop = 70
fenced: 1150471093 last_start = 71
fenced: 1150471093 last_finish = 71
fenced: 1150471093 node_count = 0
fenced: 1150471714 stop:
fenced: 1150471714 event_id = 0
fenced: 1150471714 last_stop = 71
fenced: 1150471714 last_start = 71
fenced: 1150471714 last_finish = 71
fenced: 1150471714 node_count = 0
fenced: 1150471748 start:
fenced: 1150471748 event_id = 96
fenced: 1150471748 last_stop = 71
fenced: 1150471748 last_start = 96
fenced: 1150471748 last_finish = 71
fenced: 1150471748 node_count = 2
fenced: 1150471748 start_type = failed
fenced: 1150471748 members:
fenced: 1150471748 nodeid = 1 "morph-04"
fenced: 1150471748 nodeid = 5 "morph-03"
fenced: 1150471748 do_recovery stop 71 start 96 finish 71
fenced: 1150471748 add node 2 to list 0
fenced: 1150471748 add node 3 to list 0
fenced: 1150471748 add node 4 to list 0
fenced: 1150471749 reduce victim morph-01
fenced: 1150471763 reduce victim morph-02
fenced: 1150471778 reduce victim morph-05
fenced: 1150471778 delay of 30s leaves 0 victims
fenced: 1150471778 finish:
fenced: 1150471778 event_id = 96
fenced: 1150471778 last_stop = 71
fenced: 1150471778 last_start = 96
fenced: 1150471778 last_finish = 96
fenced: 1150471778 node_count = 0
fenced: 1150471783 stop:
fenced: 1150471783 event_id = 0
fenced: 1150471783 last_stop = 96
fenced: 1150471783 last_start = 96
fenced: 1150471783 last_finish = 96
fenced: 1150471783 node_count = 0
fenced: 1150471783 start:
fenced: 1150471783 event_id = 97
fenced: 1150471783 last_stop = 96
fenced: 1150471783 last_start = 97
fenced: 1150471783 last_finish = 96
fenced: 1150471783 node_count = 3
fenced: 1150471783 start_type = join
fenced: 1150471783 members:
fenced: 1150471783 nodeid = 1 "morph-04"
fenced: 1150471783 nodeid = 5 "morph-03"
fenced: 1150471783 nodeid = 2 "morph-05"
fenced: 1150471783 do_recovery stop 96 start 97 finish 96
fenced: 1150471783 finish:
fenced: 1150471783 event_id = 97
fenced: 1150471783 last_stop = 96
fenced: 1150471783 last_start = 97
fenced: 1150471783 last_finish = 97
fenced: 1150471783 node_count = 0
fenced: 1150471784 stop:
fenced: 1150471784 event_id = 0
fenced: 1150471784 last_stop = 97
fenced: 1150471784 last_start = 97
fenced: 1150471784 last_finish = 97
fenced: 1150471784 node_count = 0
fenced: 1150471784 start:
fenced: 1150471784 event_id = 99
fenced: 1150471784 last_stop = 97
fenced: 1150471784 last_start = 99
fenced: 1150471784 last_finish = 97
fenced: 1150471784 node_count = 4
fenced: 1150471784 start_type = join
fenced: 1150471784 members:
fenced: 1150471784 nodeid = 1 "morph-04"
fenced: 1150471784 nodeid = 5 "morph-03"
fenced: 1150471784 nodeid = 2 "morph-05"
fenced: 1150471784 nodeid = 3 "morph-02"
fenced: 1150471784 do_recovery stop 97 start 99 finish 97
fenced: 1150471784 finish:
fenced: 1150471784 event_id = 99
fenced: 1150471784 last_stop = 97
fenced: 1150471784 last_start = 99
fenced: 1150471784 last_finish = 99
fenced: 1150471784 node_count = 0
fenced: 1150471787 stop:
fenced: 1150471787 event_id = 0
fenced: 1150471787 last_stop = 99
fenced: 1150471787 last_start = 99
fenced: 1150471787 last_finish = 99
fenced: 1150471787 node_count = 0
fenced: 1150471787 start:
fenced: 1150471787 event_id = 101
fenced: 1150471787 last_stop = 99
fenced: 1150471787 last_start = 101
fenced: 1150471787 last_finish = 99
fenced: 1150471787 node_count = 5
fenced: 1150471787 start_type = join
fenced: 1150471787 members:
fenced: 1150471787 nodeid = 1 "morph-04"
fenced: 1150471787 nodeid = 5 "morph-03"
fenced: 1150471787 nodeid = 2 "morph-05"
fenced: 1150471787 nodeid = 3 "morph-02"
fenced: 1150471787 nodeid = 4 "morph-01"
fenced: 1150471787 do_recovery stop 99 start 101 finish 99
fenced: 1150471787 finish:
fenced: 1150471787 event_id = 101
fenced: 1150471787 last_stop = 99
fenced: 1150471787 last_start = 101
fenced: 1150471787 last_finish = 101
fenced: 1150471787 node_count = 0
fenced: 1150471869 stop:
fenced: 1150471869 event_id = 0
fenced: 1150471869 last_stop = 101
fenced: 1150471869 last_start = 101
fenced: 1150471869 last_finish = 101
fenced: 1150471869 node_count = 0
fenced: 1150471869 start:
fenced: 1150471869 event_id = 124
fenced: 1150471869 last_stop = 101
fenced: 1150471869 last_start = 124
fenced: 1150471869 last_finish = 101
fenced: 1150471869 node_count = 4
fenced: 1150471869 start_type = leave
fenced: 1150471869 members:
fenced: 1150471869 nodeid = 1 "morph-04"
fenced: 1150471869 nodeid = 5 "morph-03"
fenced: 1150471869 nodeid = 2 "morph-05"
fenced: 1150471869 nodeid = 3 "morph-02"
fenced: 1150471869 do_recovery stop 101 start 124 finish 101
fenced: 1150471869 add node 4 to list 2
fenced: 1150471869 finish:
fenced: 1150471869 event_id = 124
fenced: 1150471869 last_stop = 101
fenced: 1150471869 last_start = 124
fenced: 1150471869 last_finish = 124
fenced: 1150471869 node_count = 0
fenced: 1150471884 stop:
fenced: 1150471884 event_id = 0
fenced: 1150471884 last_stop = 124
fenced: 1150471884 last_start = 124
fenced: 1150471884 last_finish = 124
fenced: 1150471884 node_count = 0
fenced: 1150471884 start:
fenced: 1150471884 event_id = 125
fenced: 1150471884 last_stop = 124
fenced: 1150471884 last_start = 125
fenced: 1150471884 last_finish = 124
fenced: 1150471884 node_count = 3
fenced: 1150471884 start_type = leave
fenced: 1150471884 members:
fenced: 1150471884 nodeid = 1 "morph-04"
fenced: 1150471884 nodeid = 5 "morph-03"
fenced: 1150471884 nodeid = 2 "morph-05"
fenced: 1150471884 do_recovery stop 124 start 125 finish 124
fenced: 1150471884 add node 3 to list 2
fenced: 1150471884 finish:
fenced: 1150471884 event_id = 125
fenced: 1150471884 last_stop = 124
fenced: 1150471884 last_start = 125
fenced: 1150471884 last_finish = 125
fenced: 1150471884 node_count = 0
fenced: 1150471899 stop:
fenced: 1150471899 event_id = 0
fenced: 1150471899 last_stop = 125
fenced: 1150471899 last_start = 125
fenced: 1150471899 last_finish = 125
fenced: 1150471899 node_count = 0
fenced: 1150471899 start:
fenced: 1150471899 event_id = 126
fenced: 1150471899 last_stop = 125
fenced: 1150471899 last_start = 126
fenced: 1150471899 last_finish = 125
fenced: 1150471899 node_count = 2
fenced: 1150471899 start_type = leave
fenced: 1150471899 members:
fenced: 1150471899 nodeid = 1 "morph-04"
fenced: 1150471899 nodeid = 5 "morph-03"
fenced: 1150471899 do_recovery stop 125 start 126 finish 125
fenced: 1150471899 add node 2 to list 2
fenced: 1150471899 finish:
fenced: 1150471899 event_id = 126
fenced: 1150471899 last_stop = 125
fenced: 1150471899 last_start = 126
fenced: 1150471899 last_finish = 126
fenced: 1150471899 node_count = 0
fenced: 1150472130 stop:
fenced: 1150472130 event_id = 0
fenced: 1150472130 last_stop = 126
fenced: 1150472130 last_start = 126
fenced: 1150472130 last_finish = 126
fenced: 1150472130 node_count = 0
fenced: 1150472130 start:
fenced: 1150472130 event_id = 127
fenced: 1150472130 last_stop = 126
fenced: 1150472130 last_start = 127
fenced: 1150472130 last_finish = 126
fenced: 1150472130 node_count = 3
fenced: 1150472130 start_type = join
fenced: 1150472130 members:
fenced: 1150472130 nodeid = 1 "morph-04"
fenced: 1150472130 nodeid = 5 "morph-03"
fenced: 1150472130 nodeid = 4 "morph-01"
fenced: 1150472130 do_recovery stop 126 start 127 finish 126
fenced: 1150472130 finish:
fenced: 1150472130 event_id = 127
fenced: 1150472130 last_stop = 126
fenced: 1150472130 last_start = 127
fenced: 1150472130 last_finish = 127
fenced: 1150472130 node_count = 0
fenced: 1150472137 stop:
fenced: 1150472137 event_id = 0
fenced: 1150472137 last_stop = 127
fenced: 1150472137 last_start = 127
fenced: 1150472137 last_finish = 127
fenced: 1150472137 node_count = 0
fenced: 1150472137 start:
fenced: 1150472137 event_id = 128
fenced: 1150472137 last_stop = 127
fenced: 1150472137 last_start = 128
fenced: 1150472137 last_finish = 127
fenced: 1150472137 node_count = 4
fenced: 1150472137 start_type = join
fenced: 1150472137 members:
fenced: 1150472137 nodeid = 1 "morph-04"
fenced: 1150472137 nodeid = 5 "morph-03"
fenced: 1150472137 nodeid = 4 "morph-01"
fenced: 1150472137 nodeid = 3 "morph-02"
fenced: 1150472137 do_recovery stop 127 start 128 finish 127
fenced: 1150472137 finish:
fenced: 1150472137 event_id = 128
fenced: 1150472137 last_stop = 127
fenced: 1150472137 last_start = 128
fenced: 1150472137 last_finish = 128
fenced: 1150472137 node_count = 0
fenced: 1150472176 stop:
fenced: 1150472176 event_id = 0
fenced: 1150472176 last_stop = 128
fenced: 1150472176 last_start = 128
fenced: 1150472176 last_finish = 128
fenced: 1150472176 node_count = 0
fenced: 1150472176 start:
fenced: 1150472176 event_id = 129
fenced: 1150472176 last_stop = 128
fenced: 1150472176 last_start = 129
fenced: 1150472176 last_finish = 128
fenced: 1150472176 node_count = 5
fenced: 1150472176 start_type = join
fenced: 1150472176 members:
fenced: 1150472176 nodeid = 1 "morph-04"
fenced: 1150472176 nodeid = 5 "morph-03"
fenced: 1150472176 nodeid = 4 "morph-01"
fenced: 1150472176 nodeid = 3 "morph-02"
fenced: 1150472176 nodeid = 2 "morph-05"
> I'm not sure how you want me to do this.

No, me neither. I was hoping Dave would chip in with some ideas :-)
The cman_tool services info above shows a pretty messed up cluster; I can't make too much sense of it. I suspect what happened is that after the nodes failed, things went bad. Then they came back and tried to start everything up again, and that made things worse. At that point, it's really hard to decipher what the original buggered state was.

In order to figure this out, I think we'll want to stop everything and diagnose the problem when things first go bad, and not let the nodes restart services. Because you're killing 3 of 5 nodes and losing quorum, the killed nodes need to be allowed to rejoin the cluster, but that's all. So, we want to do something like this:

- kill 3 nodes
- after the 2 live nodes have seen they're dead, save the output of cman_tool services on those 2 nodes
- let the 3 nodes reboot and rejoin the cluster, but not start anything else (like fenced)
- watch cman_tool services on the 2 nodes to see if they successfully recover for the 3 failed nodes (that have now rejoined the cluster) -- e.g. with a capture loop like the sketch below
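[A minimal capture-loop sketch for the "save/watch cman_tool services" steps above; the log path and interval are arbitrary assumptions:

    # Run on each of the 2 surviving nodes; timestamped snapshots make it
    # possible to find the first bad transition afterwards.
    while true; do
        echo "=== $(date '+%F %T') ==="
        cman_tool services
        sleep 5
    done >> /tmp/cman_services.log 2>&1
]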
* Started cluster with morph-0[1-5] w/ 3 GFS file systems mounted.
* Disabled cluster init scripts on all nodes.
* Powered off morph-03, morph-04, and morph-05 via APC power switch.
* morph-01 and morph-02 removed the failed nodes from the cluster.

morph-01:
Jun 19 11:43:24 morph-01 kernel: CMAN: node morph-05 has been removed from the cluster : Missed too many heartbeats
Jun 19 11:43:27 morph-01 kernel: CMAN: removing node morph-04 from the cluster : Missed too many heartbeats
Jun 19 11:43:33 morph-01 kernel: CMAN: removing node morph-03 from the cluster : No response to messages
Jun 19 11:43:46 morph-01 kernel: CMAN: quorum lost, blocking activity

# cman_tool services
Service          Name                           GID LID State     Code
Fence Domain:    "default"                        1   2 recover 0 -  [1 5]
DLM Lock Space:  "clvmd"                          3   4 recover 0 -  [1 5]
DLM Lock Space:  "morph-cluster0"                 4   5 recover 0 -  [1 5]
DLM Lock Space:  "morph-cluster1"                 6   7 recover 0 -  [1 5]
DLM Lock Space:  "morph-cluster2"                 8   9 recover 0 -  [1 5]
GFS Mount Group: "morph-cluster0"                 5   6 recover 0 -  [1 5]
GFS Mount Group: "morph-cluster1"                 7   8 recover 0 -  [1 5]
GFS Mount Group: "morph-cluster2"                 9  10 recover 0 -  [1 5]

morph-02:
Jun 19 11:43:31 morph-02 kernel: CMAN: removing node morph-05 from the cluster : Missed too many heartbeats
Jun 19 11:43:37 morph-02 kernel: CMAN: removing node morph-03 from the cluster : No response to messages
Jun 19 11:43:43 morph-02 kernel: CMAN: removing node morph-04 from the cluster : No response to messages
Jun 19 11:43:53 morph-02 kernel: CMAN: quorum lost, blocking activity

# cman_tool services
Service          Name                           GID LID State     Code
Fence Domain:    "default"                        1   2 recover 0 -  [1 5]
DLM Lock Space:  "clvmd"                          3   4 recover 0 -  [1 5]
DLM Lock Space:  "morph-cluster0"                 4   5 recover 0 -  [1 5]
DLM Lock Space:  "morph-cluster1"                 6   7 recover 0 -  [1 5]
DLM Lock Space:  "morph-cluster2"                 8   9 recover 0 -  [1 5]
GFS Mount Group: "morph-cluster0"                 5   6 recover 0 -  [1 5]
GFS Mount Group: "morph-cluster1"                 7   8 recover 0 -  [1 5]
GFS Mount Group: "morph-cluster2"                 9  10 recover 0 -  [1 5]

* Powered on nodes morph-03, morph-04, and morph-05.
* Started services ccsd and cman on morph-0[3-5].
* All nodes joined the cluster; morph-02 deferred to morph-01 for fencing.

morph-01:
[root@morph-01 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    5   M   morph-01
   2    1    5   M   morph-03
   3    1    5   M   morph-05
   4    1    5   M   morph-04
   5    1    5   M   morph-02

[root@morph-01 ~]# cman_tool services
Service          Name                           GID LID State     Code
Fence Domain:    "default"                        1   2 run       -  [1 5]
DLM Lock Space:  "clvmd"                          3   4 run       -  [1 5]
DLM Lock Space:  "morph-cluster0"                 4   5 run       -  [1 5]
DLM Lock Space:  "morph-cluster1"                 6   7 run       -  [1 5]
DLM Lock Space:  "morph-cluster2"                 8   9 run       -  [1 5]
GFS Mount Group: "morph-cluster0"                 5   6 run       -  [1 5]
GFS Mount Group: "morph-cluster1"                 7   8 run       -  [1 5]
GFS Mount Group: "morph-cluster2"                 9  10 run       -  [1 5]

morph-02:
[root@morph-02 ~]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    5   M   morph-01
   2    1    5   M   morph-03
   3    1    5   M   morph-05
   4    1    5   M   morph-04
   5    1    5   M   morph-02

[root@morph-02 ~]# cman_tool services
Service          Name                           GID LID State     Code
Fence Domain:    "default"                        1   2 run       -  [1 5]
DLM Lock Space:  "clvmd"                          3   4 run       -  [1 5]
DLM Lock Space:  "morph-cluster0"                 4   5 run       -  [1 5]
DLM Lock Space:  "morph-cluster1"                 6   7 run       -  [1 5]
DLM Lock Space:  "morph-cluster2"                 8   9 run       -  [1 5]
GFS Mount Group: "morph-cluster0"                 5   6 run       -  [1 5]
GFS Mount Group: "morph-cluster1"                 7   8 run       -  [1 5]
GFS Mount Group: "morph-cluster2"                 9  10 run       -  [1 5]
So that sequence of events worked OK. I think we need a more exact sequence of events that leads to the hung cluster state. The info in comment #9 seems to show some mix of nodes leaving, joining, and failing, and I suspect that part of that appeared after things had already gone bad. If we can get a specific log/trace of a revolver run that leads to a hang, that would be a good start. One problem I've had with revolver in the past is that it doesn't always detect or quit when things first break, so we may need to add some new checking or logging, either to kill it when things break or to have a record of the state when things first broke.
What is the best way to determine when "things break"? It would be wonderful to hear from the developers what to check for and record, and when.
Yeah, that's pretty difficult, because with many bugs there's no obvious sign apart from nothing happening for a "long time".

After nodes have been killed and recovery is to happen, one good check that everything is back to a good/normal state is that all groups in cman_tool services have both a state of "run" and a code of "-" (see the sketch below).

We don't necessarily need to know exactly when things break, either. If we print the state of cman_tool services at various stages of the test, that can help when we go back and reconstruct what happened in what order.

In this case, we should just start with reproducing the problem, looking at the end result and the revolver log, and see where we'd like revolver to print some extra cman_tool services. We may also find some spots in the sequence where we'd like revolver to pause and wait for some specific state of the services.
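[A minimal sketch of that run/"-" check; the retry interval and timeout are arbitrary assumptions:

    #!/bin/sh
    # Sketch of the "everything is back to normal" check: every group in
    # cman_tool services must be in state "run" with recovery code "-".
    check_services() {
        cman_tool services | awk '
            /^(Fence Domain|DLM Lock Space|GFS Mount Group|User):/ {
                if ($0 !~ / run +-/) pending++   # any other state/code pair
            }
            END { exit pending != 0 }
        '
    }

    # Poll for up to 5 minutes before declaring the recovery stuck
    for i in $(seq 1 60); do
        check_services && { echo "all groups in run state"; exit 0; }
        sleep 5
    done
    echo "groups still not recovered" >&2
    exit 1
]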
I've reproduced (1) and here is the end result from cman_tool services:

morph-01::
Fence Domain:    "default"                        1   2 join      S-4,4,1   [3 2 1]
DLM Lock Space:  "clvmd"                          0   3 join      S-1,80,5  []

morph-02::
Fence Domain:    "default"                        1   2 run       U-1,10,1  [3 2]
DLM Lock Space:  "clvmd"                          3   3 run       U-1,10,5  [3 2]
DLM Lock Space:  "morph-cluster0"                 4   5 run       -         [3 2]
DLM Lock Space:  "morph-cluster1"                 6   7 run       -         [3 2]
DLM Lock Space:  "morph-cluster2"                 8   9 run       -         [3 2]
GFS Mount Group: "morph-cluster0"                 5   6 run       -         [3 2]
GFS Mount Group: "morph-cluster1"                 7   8 run       -         [3 2]
GFS Mount Group: "morph-cluster2"                 9  10 run       -         [3 2]
User:            "usrm::manager"                 10   4 recover 2 -         [3 2]

morph-03::
Fence Domain:    "default"                        0   2 join      S-1,480,5 []
DLM Lock Space:  "clvmd"                          3   3 join      S-4,4,1   [2 3 5]

morph-04::
Fence Domain:    "default"                        0   2 join      S-1,80,5  []
DLM Lock Space:  "clvmd"                          0   3 join      S-1,80,5  []

morph-05::
Fence Domain:    "default"                        1   2 run       U-1,10,1  [3 2]
DLM Lock Space:  "clvmd"                          3   3 run       U-1,10,5  [3 2]
DLM Lock Space:  "morph-cluster0"                 4   5 run       -         [3 2]
DLM Lock Space:  "morph-cluster1"                 6   7 run       -         [3 2]
DLM Lock Space:  "morph-cluster2"                 8   9 run       -         [3 2]
GFS Mount Group: "morph-cluster0"                 5   6 run       -         [3 2]
GFS Mount Group: "morph-cluster1"                 7   8 run       -         [3 2]
GFS Mount Group: "morph-cluster2"                 9  10 run       -         [3 2]
User:            "usrm::manager"                 10   4 recover 4 -         [3 2]
Here is a description of what revolver does:

Start with a cluster with file systems mounted
Start an I/O load on all nodes and all file systems
for each iteration
    wait for five minutes to get the lock count up
    pick a set of nodes to shoot (duellist)
    for victim in duellist
        run reboot -fin on the node
        wait a few seconds
    for victim in duellist
        ping victim to make sure it died, else reboot -fin again
    for victim in duellist
        for node not in duellist
            verify victim has X in /proc/cluster/nodes on node
    wait for all nodes in duellist to be pingable again
    wait for all nodes in duellist to be qarshable again
    for node not in duellist
        check /proc/cluster/services for Fence in "run" state
        check /proc/cluster/services for DLM in "run" state
        check /proc/cluster/services for GFS in "run" state
    for victim in duellist
        make sure vgdisplay runs on victim without error
        remount file systems on victim
I'm fairly certain I see what's happening here. This is a side effect of the fix for bz 162014. rgmanager is ignored in most of our testing/reasoning about how gfs/dlm/fencing services work, but rgmanager does have an effect in a situation like this simply by the fact that it creates an sm service.

What's happening here is that the rgmanager service is sitting in the recover state, and sm doesn't allow new groups to form until all recoveries are done. I suspect rgmanager is sitting in the recover state because it's polling /proc/cluster/services waiting for all the failed nodes to be fenced. They have been fenced, but I'm guessing rgmanager is just looking at the "state" value, seeing it's not "run", and concluding that the nodes haven't been recovered. It's not in the run state because someone is joining it.

If I'm correct about rgmanager, we might be able to fix this problem by making it decipher /proc/cluster/services more precisely (a sketch of that idea follows below). Or, we could add a check in SM to not consider user-level services when blocking activity for pending recoveries. Changing rgmanager may have some other positive effects, but it might be difficult to really determine what's been fenced by parsing /proc/cluster/services.

Lon, does this speculation sound correct, and if so would you prefer to tinker with the proc parsing or have me change cman/sm? I don't have a preference.
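[For reference, a minimal sketch of the "decipher more precisely" idea, based only on the outputs earlier in this report; the interpretation of the State/Code columns here is an assumption:

    # Hypothetical check: don't require State == "run" for the fence domain.
    # In the outputs above, a group that another node is merely joining shows
    # a transition code (S-.../U-...) but not the "recover" state, so only an
    # actual "recover" state is treated as fencing/recovery still pending.
    line=$(grep '^Fence Domain' /proc/cluster/services)
    case "$line" in
        *recover*) echo "fence recovery still in progress" ;;
        *)         echo "fence domain recovered (run, or join in progress)" ;;
    esac
]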
Dave, that sounds very likely. When I set up for revolver I turn on all of the init scripts for cluster stuff, including rgmanager, but I don't start every service. rgmanager usually isn't started during cluster_setup for revolver. After a node is shot, rgmanager is started from the init scripts. This could explain why it doesn't always trip on the first try. I'm going to run two experiments to try to verify this:
1. Start rgmanager before running revolver.
2. Turn off the rgmanager init script before running revolver.
rgmanager doesn't usually parse /proc/cluster/services (except to get members of other user-level groups); its recovery is not (or should not be) started until fencing completes. It's sitting in the recover state because the main thread is somehow blocked and not handling one of the node-death events - it probably got the "inquorate" event before the "group change" event, or vice-versa. When 3/5 have died, the cluster lost quorum, right? So there might be some issue in this case with rgmanager that I haven't seen before. I'll look into it this afternoon or tomorrow morning.

This does not look like a CMAN issue, but changing it to ignore user services in the recover state when higher-priority services (fence, etc.) are trying to form might eliminate this kind of bug report in the future.

If possible, could you get a backtrace of rgmanager when this happens?
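[One way to grab those backtraces: clurgmgrd runs as two processes and the worker is threaded, so dump every thread of each. The gdb commands are standard; which output turns out useful is an assumption:

    # Find both clurgmgrd processes
    ps -C clurgmgrd

    # Attach to each PID in turn and dump all thread stacks
    gdb /usr/sbin/clurgmgrd <PID>
    (gdb) thread apply all bt
    (gdb) detach
    (gdb) quit
]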
Results of experiment 1: It still took until the second iteration for clvmd startup to hang. The output from cman_tool services was similar to #20. Results of experiment 2: Revolver made it through 10 iterations without problems. Going for 50...
Here are the backtraces for clurgmgrd from two surviving nodes:

morph-01:
[root@morph-01 ~]# ps -C clurgmgrd
  PID TTY          TIME CMD
 3258 ?        00:00:00 clurgmgrd
 3259 ?        00:00:00 clurgmgrd

[root@morph-01 ~]# gdb /usr/sbin/clurgmgrd 3258
(gdb) bt
#0  0x002887a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x0041b393 in __waitpid_nocancel () from /lib/tls/libpthread.so.0
#2  0x0805966f in watchdog_init () at watchdog.c:73
#3  0x0804d37a in main (argc=1, argv=0xbff3f7f4) at main.c:740
#4  0x002bade3 in __libc_start_main () from /lib/tls/libc.so.6
#5  0x0804a75d in _start ()

[root@morph-01 ~]# gdb /usr/sbin/clurgmgrd 3259
(gdb) info threads
  3 Thread -1208480848 (LWP 4109)  0x002887a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
  2 Thread -1218970704 (LWP 4456)  0x002887a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
  1 Thread -1208478016 (LWP 3259)  0x002887a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) thread 1
[Switching to thread 1 (Thread -1208478016 (LWP 3259))]
(gdb) bt
#0  0x002887a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00417b26 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0804bdc2 in rg_wait_threads () at rg_locks.c:271
#3  0x0804ecd7 in rg_doall (request=10, block=1, debugfmt=0x8063773 "Emergency stop of %s") at groups.c:747
#4  0x0804cd38 in handle_cluster_event (fd=4) at main.c:487
#5  0x0804ceb2 in event_loop (clusterfd=4) at main.c:537
#6  0x0804d71d in main (argc=1, argv=0xbff3f7f4) at main.c:843
#7  0x002bade3 in __libc_start_main () from /lib/tls/libc.so.6
#8  0x0804a75d in _start ()
(gdb) thread 2
[Switching to thread 2 (Thread -1218970704 (LWP 4456))]
(gdb) bt
#0  0x002887a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00417b26 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0804bdc2 in rg_wait_threads () at rg_locks.c:271
#3  0x0804ecd7 in rg_doall (request=10, block=1, debugfmt=0x8063c14 "Initializing %s\n") at groups.c:747
#4  0x0804f8f4 in init_resource_groups (reconfigure=0) at groups.c:1095
#5  0x0804bfe5 in node_event (local=1, nodeID=1, nodeStatus=1) at main.c:152
#6  0x080593bd in node_event_thread (arg=0x0) at nodeevent.c:56
#7  0x00415371 in start_thread () from /lib/tls/libpthread.so.0
#8  0x0036de5e in clone () from /lib/tls/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread -1208480848 (LWP 4109))]
(gdb) bt
#0  0x002887a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x003668b1 in ___newselect_nocancel () from /lib/tls/libc.so.6
#2  0x0805b897 in vf_event_loop (my_node_id=1) at vft.c:855
#3  0x0805bbd8 in vf_server (arg=0x8ffe170) at vft.c:943
#4  0x00415371 in start_thread () from /lib/tls/libpthread.so.0
#5  0x0036de5e in clone () from /lib/tls/libc.so.6

morph-03:
[root@morph-03 ~]# ps -C clurgmgrd
  PID TTY          TIME CMD
 3139 ?        00:00:00 clurgmgrd
 3140 ?        00:00:00 clurgmgrd

[root@morph-03 ~]# gdb /usr/sbin/clurgmgrd 3139
(gdb) bt
#0  0x004c57a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00658393 in __waitpid_nocancel () from /lib/tls/libpthread.so.0
#2  0x0805966f in watchdog_init () at watchdog.c:73
#3  0x0804d37a in main (argc=1, argv=0xbfe9d1a4) at main.c:740
#4  0x004f7de3 in __libc_start_main () from /lib/tls/libc.so.6
#5  0x0804a75d in _start ()

[root@morph-03 ~]# gdb /usr/sbin/clurgmgrd 3140
(gdb) info threads
  3 Thread -1208554576 (LWP 3913)  0x004c57a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
  2 Thread -1219044432 (LWP 4234)  0x004c57a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
  1 Thread -1208551744 (LWP 3140)  0x004c57a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb) thread 1
[Switching to thread 1 (Thread -1208551744 (LWP 3140))]
(gdb) bt
#0  0x004c57a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x005a38b1 in ___newselect_nocancel () from /lib/tls/libc.so.6
#2  0x0804ce6f in event_loop (clusterfd=4) at main.c:529
#3  0x0804d71d in main (argc=1, argv=0xbfe9d1a4) at main.c:843
#4  0x004f7de3 in __libc_start_main () from /lib/tls/libc.so.6
#5  0x0804a75d in _start ()
(gdb) thread 2
[Switching to thread 2 (Thread -1219044432 (LWP 4234))]
(gdb) bt
#0  0x004c57a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00654b26 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2  0x0804bdc2 in rg_wait_threads () at rg_locks.c:271
#3  0x0804ecd7 in rg_doall (request=10, block=1, debugfmt=0x8063c14 "Initializing %s\n") at groups.c:747
#4  0x0804f8f4 in init_resource_groups (reconfigure=0) at groups.c:1095
#5  0x0804bfe5 in node_event (local=1, nodeID=4, nodeStatus=1) at main.c:152
#6  0x080593bd in node_event_thread (arg=0x0) at nodeevent.c:56
#7  0x00652371 in start_thread () from /lib/tls/libpthread.so.0
#8  0x005aae5e in clone () from /lib/tls/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread -1208554576 (LWP 3913))]
(gdb) bt
#0  0x004c57a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x005a38b1 in ___newselect_nocancel () from /lib/tls/libc.so.6
#2  0x0805b897 in vf_event_loop (my_node_id=4) at vft.c:855
#3  0x0805bbd8 in vf_server (arg=0x9f63008) at vft.c:943
#4  0x00652371 in start_thread () from /lib/tls/libpthread.so.0
#5  0x005aae5e in clone () from /lib/tls/libc.so.6
thread (refcnt) leak in rgmanager...
Not really a leak, but effectively the same thing: node_event_thread was bumping the refcount and then waiting for it to reach zero later on.
Created attachment 131306 [details]
Don't twiddle the rg thread refcount; this is not the right thread to do it in (only rg threads should twiddle the rg thread count)
Fixes are actually in CVS.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0557.html