Bug 187777

Summary: dlm_emergency_shutdown caused by another node being rebooted
Product: [Retired] Red Hat Cluster Suite
Component: dlm
Version: 4
Reporter: Corey Marthaler <cmarthal>
Assignee: Christine Caulfield <ccaulfie>
QA Contact: Cluster QE <mspqa-list>
CC: ccaulfie, cluster-maint, jbacik
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Hardware: All
OS: Linux
Fixed In Version: RHBA-2006-0518
Doc Type: Bug Fix
Last Closed: 2006-05-25 21:04:47 UTC
Bug Blocks: 180185
Attachments: Proposed patch to fix

Description Corey Marthaler 2006-04-03 15:54:32 UTC
Description of problem:
This may be similar to a few other dlm bugs, but I couldn't find an open
duplicate of this issue, so I'm opening a new one.

Three-node x86_64 cluster (link-01, link-02, link-08) with revolver I/O load
running; link-01 gets rebooted, and that causes the following on link-02:

WARNING: dlm_emergency_shutdown
dlm: process_cluster_request invalid lockspace 1000006 from 3 req 9
dlm: process_cluster_request invalid lockspace 1000006 from 3 req 1
WARNING: dlm_emergency_shutdown
SM: 00000001 sm_stop: SG still joined
SM: 01000003 sm_stop: SG still joined
SM: 02000005 sm_stop: SG still joined
SM: 03000008 sm_stop: SG still joined
dlm: dlm_unlock: lkid 6603c1 lockspace not found
lvmd marked 0 requests
clvmd purge locks of departed nodes
clvmd purged 0 locks
clvmd update remastered resources
gfs0 purged 2 locks
gfs0 update remastered resources
clvmd updated 2 resources
clvmd rebuild locks
clvmd rebuilt 0 locks
clvmd recover event 16 done
gfs1 purged 1 locks
gfs1 update remastered resources
gfs0 updated 1911 resources
gfs0 rebuild locks
clvmd move flags 0,0,1 ids 10,16,16
clvmd process held requests
clvmd processed 0 requests
clvmd resend marked requests
clvmd resent 0 requests
clvmd recover event 16 finished
gfs1 updated 1913 resources
gfs1 rebuild locks
gfs1 rebuilt 0 locks
gfs1 recover event 16 done
gfs0 rebuilt 813 locks
gfs0 recover event 16 done
gfs0 move flags 0,0,1 ids 11,16,16
gfs0 process held requests
gfs0 processed 0 requests
gfs0 resend marked requests
gfs0 resent 0 requests
gfs0 recover event 16 finished
gfs1 move flags 0,0,1 ids 13,16,16
gfs1 process held requests
gfs1 processed 0 requests
gfs1 resend marked requests
gfs1 resent 0 requests
gfs1 recover event 16 finished
f lkf 2000 wait 1
6784 ex plock 0
6785 ex plock 0
6784 en punlock 7,4c
6784 remove 7,4c
6784 ex punlock 0
6784 en plock 7,4a
6785 en punlock 7,4a
6785 remove 7,4a
6785 ex punlock 0
6785 en plock 7,4c
6785 req 7,4c ex 0-7fffffffffffffff lkf 2000 wait 1
6784 req 7,4a ex 0-7fffffffffffffff lkf 2000 wait 1
6785 ex plock 0
6784 ex plock 0
6785 en punlock 7,4c
6785 remove 7,4c
6785 ex punlock 0
6785 en plock 7,49
6785 req 7,49 ex 0-7fffffffffffffff lkf 2000 wait 1
6785 ex plock 0
6784 en punlock 7,4a
6784 remove 7,4a
6784 ex punlock 0
6784 en plock 7,4b
6784 req 7,4b ex 0-7fffffffffffffff lkf 2000 wait 1
6784 ex plock 0
6785 en punlock 7,49
6785 remove 7,49
6785 ex punlock 0
6785 en plock 7,4a
6785 req 7,4a ex 0-7fffffffffffffff lkf 2000 wait 1
6784 en punlock 7,4b
6784 remove 7,4b
6784 ex punlock 0
6784 en plock 7,4a
6785 ex plock 0
6785 en punlock 7,4a
6785 remove 7,4a
6785 ex punlock 0
6785 en plock 7,4d
6785 req 7,4d ex 0-7fffffffffffffff lkf 2000 wait 1
6785 ex plock 0
6784 req 7,4a ex 0-7fffffffffffffff lkf 2000 wait 1
6784 ex plock 0
6785 en punlock 7,4d
6785 remove 7,4d
6785 ex punlock 0
6785 en plock 7,4a
6784 en punlock 7,4a
6784 remove 7,4a
6784 ex punlock 0
6784 en plock 7,4d
6785 req 7,4a ex 0-7fffffffffffffff lkf 2000 wait 1
6784 req 7,4d ex 0-7fffffffffffffff lkf 2000 wait 1
6784 ex plock 0
6785 ex plock 0
6784 en punlock 7,4d
6784 remove 7,4d
6784 ex punlock 0
6784 en plock 7,4b
6784 req 7,4b ex 0-7fffffffffffffff lkf 2000 wait 1
6784 ex plock 0
6785 en punlock 7,4a
6785 remove 7,4a
6785 ex punlock 0
6785 en plock 7,49
6785 req 7,49 ex 0-7fffffffffffffff lkf 2000 wait 1
6785 ex plock 0
6784 en punlock 7,4b
6784 remove 7,4b
6784 ex punlock 0
6784 en plock 7,4c
6784 req 7,4c ex 0-7fffffffffffffff lkf 2000 wait 1
6784 ex plock 0
6785 en punlock 7,49
6785 remove 7,49
6785 ex punlock 0
6785 en plock 7,4c
6784 en punlock 7,4c
6784 remove 7,4c
6784 ex punlock 0
6784 en plock 7,4a
6784 req 7,4a ex 0-7fffffffffffffff lkf 2000 wait 1
6785 req 7,4c ex 0-7fffffffffffffff lkf 2000 wait 1
6785 ex plock 0
6784 ex plock 0
6785 en punlock 7,4c
6785 remove 7,4c
6785 ex punlock 0
6785 en plock 7,4d
6785 req 7,4d ex 0-7fffffffffffffff lkf 2000 wait 1
6785 ex plock 0
6784 en punlock 7,4a
6784 remove 7,4a
6784 ex punlock 0
6784 en plock 7,4c
6784 req 7,4c ex 0-7fffffffffffffff lkf 2000 wait 1
6784 ex plock 0
6785 en punlock 7,4d
6785 remove 7,4d
6785 ex punlock 0
6785 en plock 7,4a
6785 req 7,4a ex 0-7fffffffffffffff lkf 2000 wait 1
6785 ex plock 0
6784 en punlock 7,4c
6784 remove 7,4c
6784 ex punlock 0
6784 en plock 7,4b
6784 req 7,4b ex 0-7fffffffffffffff lkf 2000 wait 1
6784 ex plock 0
6785 en punlock 7,4a
6785 remove 7,4a
6785 ex punlock 0
6785 en plock 7,4d
6785 req 7,4d ex 0-7fffffffffffffff lkf 2000 wait 1
6785 ex plock 0
6784 en punlock 7,4b
6784 remove 7,4b
6784 ex punlock 0
6784 en plock 7,4b
6784 req 7,4b ex 0-7fffffffffffffff lkf 2000 wait 1
6785 en punlock 7,4d
6785 remove 7,4d
6785 ex punlock 0
6785 en plock 7,4d
6785 req 7,4d ex 0-7fffffffffffffff lkf 2000 wait 1
6784 ex plock 0
6785 ex plock 0
6784 en punlock 7,4b
6784 remove 7,4b
6784 ex punlock 0
6784 en plock 7,4c
6784 req 7,4c ex 0-7fffffffffffffff lkf 2000 wait 1
6784 ex plock 0
6785 en punlock 7,4d
6785 remove 7,4d
6785 ex punlock 0
6785 en plock 7,49
6785 req 7,49 ex 0-7fffffffffffffff lkf 2000 wait 1
6784 en punlock 7,4c
6784 remove 7,4c
6784 ex punlock 0
6784 en plock 7,49
6784 ex plock -4
6784 en punlock 7,49
6784 ex punlock 0
6785 ex plock 0
6785 en punlock 7,49
6785 remove 7,49
6785 ex punlock 0
2661 pr_start last_stop 14 last_start 16 last_finish 14
2661 pr_start count 2 type 1 event 16 flags 21a
2661 pr_start cb jid 1 id 1
2661 pr_start 16 done 0
2649 pr_start last_stop 12 last_start 16 last_finish 12
2649 pr_start count 2 type 1 event 16 flags 21a
2649 pr_start cb jid 1 id 1
2649 pr_start 16 done 0
2664 recovery_done jid 1 msg 308 11a
2664 recovery_done nodeid 1 flg 1b
2664 recovery_done start_done 16
2653 recovery_done jid 1 msg 308 11a
2653 recovery_done nodeid 1 flg 1b
2653 recovery_done start_done 16
2649 pr_finish flags 1a
2661 pr_finish flags 1a

lock_dlm:  Assertion failed on line 357 of file
/usr/src/build/714650-x86_64/BUILD/gfs-kernel-2.6.9-49/smp/src/dlm/lock.c
lock_dlm:  assertion:  "!error"
lock_dlm:  time = 4542632363
gfs1: error=-22 num=2,1fd27 lkf=0 flags=84

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lock:357
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: parport_pc lp parport autofs4 i2c_dev i2c_core lock_dlm(U)
gfs(U) lock_harness(U) dlm(U) cman(U) md5 ipv6 sunrpc ds yenta_socket
pcmcia_core button battery ac ohci_hcd hw_random shpchp tg3 floppy dm_snapshot
dm_zero dm_mirror ext3 jbd dm_mod qla2300 qla2xxx scsi_transport_fc sd_mod scsi_mod
Pid: 2663, comm: gfs_glockd Tainted: G   M  2.6.9-34.ELsmp
RIP: 0010:[<ffffffffa0260819>] <ffffffffa0260819>{:lock_dlm:do_dlm_unlock+167}
RSP: 0018:00000100368b1dd8  EFLAGS: 00010212
RAX: 0000000000000001 RBX: 000001000f42f080 RCX: ffffffff803d9e48
RDX: ffffffff803d9e48 RSI: 0000000000000246 RDI: ffffffff803d9e40
RBP: 00000000ffffffea R08: ffffffff803d9e48 R09: 000001000f42f080
R10: 0000000100000000 R11: ffffffff8011dfe0 R12: 0000010007f0f2dc
R13: ffffff00001a4000 R14: ffffffffa025cec0 R15: 0000010007f0f2b0
FS:  0000002a95564b00(0000) GS:ffffffff804d7b00(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a9556c000 CR3: 0000000000101000 CR4: 00000000000006e0
Process gfs_glockd (pid: 2663, threadinfo 00000100368b0000, task 000001003f502030)
Stack: 0000000000000084 0000000100000000 ffffffff80190740 ffffff00001a4000
       0000010007f0f2b0 ffffffffa0260b96 ffffffff803df640 ffffffffa022797c
       0000000000000001 ffffffffa021e3d9
Call Trace:<ffffffff80190740>{generic_drop_inode+0}
<ffffffffa0260b96>{:lock_dlm:lm_dlm_unlock+21}
       <ffffffffa022797c>{:gfs:gfs_lm_unlock+41}
<ffffffffa021e3d9>{:gfs:gfs_glock_drop_th+290}
       <ffffffffa021cb7c>{:gfs:run_queue+314}
<ffffffffa021cdd0>{:gfs:unlock_on_glock+37}
       <ffffffffa021cec6>{:gfs:gfs_reclaim_glock+234}
<ffffffffa021175a>{:gfs:gfs_glockd+61}
       <ffffffff801333c8>{default_wake_function+0}
<ffffffff801333c8>{default_wake_function+0}
       <ffffffff80110e17>{child_rip+8} <ffffffff8011dfe0>{flat_send_IPI_mask+0}
       <ffffffffa021171d>{:gfs:gfs_glockd+0} <ffffffff80110e0f>{child_rip+0}


Code: 0f 0b 52 51 26 a0 ff ff ff ff 65 01 48 c7 c7 57 51 26 a0 31
RIP <ffffffffa0260819>{:lock_dlm:do_dlm_unlock+167} RSP <00000100368b1dd8>
 <0>Kernel panic - not syncing: Oops


Version-Release number of selected component (if applicable):
[root@link-08 ~]# uname -ar
Linux link-08 2.6.9-34.ELsmp #1 SMP Fri Feb 24 16:56:28 EST 2006 x86_64 x86_64
x86_64 GNU/Linux
[root@link-08 ~]# rpm -q dlm-kernel
dlm-kernel-2.6.9-41.7
[root@link-08 ~]# rpm -q dlm
dlm-1.0.0-5

Comment 1 Corey Marthaler 2006-04-03 15:59:14 UTC
Here's link-08's view of the world:

[root@link-08 ~]# cat /proc/cluster/nodes
Node  Votes Exp Sts  Name
   1    1    3   X   link-01
   2    1    3   X   link-02
   3    1    3   M   link-08
[root@link-08 ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 recover 0 -
[3]

DLM Lock Space:  "clvmd"                             3   3 recover 0 -
[3]

DLM Lock Space:  "gfs0"                              4   4 recover 0 -
[3]

DLM Lock Space:  "gfs1"                              6   6 recover 0 -
[3]

GFS Mount Group: "gfs0"                              5   5 recover 0 -
[3]

GFS Mount Group: "gfs1"                              7   7 recover 0 -
[3]

User:            "usrm::manager"                     8   8 recover 0 -
[3]


[root@link-08 cluster]# cat dlm_debug
d -1
gfs1 mark 650054 lq 1 nodeid -1
gfs1 marked 2 requests
gfs1 purge locks of departed nodes
gfs1 purged 1 locks
gfs1 update remastered resources
clvmd move flags 0,0,1 ids 2,8,8
clvmd process held requests
clvmd processed 0 requests
clvmd resend marked requests
clvmd resent 0 requests
clvmd recover event 8 finished
gfs0 rebuilt 1099 locks
gfs0 recover event 8 done
gfs1 updated 1913 resources
gfs1 rebuild locks
gfs0 move flags 0,0,1 ids 3,8,8
gfs0 process held requests
gfs0 processed 0 requests
gfs0 resend marked requests
gfs0 resent 0 requests
gfs0 recover event 8 finished
gfs1 rebuilt 1919 locks
gfs1 recover event 8 done
gfs1 move flags 0,0,1 ids 5,8,8
gfs1 process held requests
gfs1 processed 0 requests
gfs1 resend marked requests
gfs1 resend 5901bb lq 1 flg 200000 node -1/-1 "       7
gfs1 resend 650054 lq 1 flg 200000 node -1/-1 "       5
gfs1 resent 2 requests
gfs1 recover event 8 finished
gfs1 move flags 1,0,0 ids 8,8,8
clvmd move flags 1,0,0 ids 8,8,8
gfs0 move flags 1,0,0 ids 8,8,8


[root@link-08 cluster]# cat dlm_stats
DLM stats (HZ=1000)

Lock operations:     197760
Unlock operations:   176703
Convert operations:  627646
Completion ASTs:    1002106
Blocking ASTs:            5

Lockqueue        num  waittime   ave
WAIT_RSB      104635    418221     3
WAIT_CONV          7         8     1
WAIT_GRANT     26678     17879     0
WAIT_UNLOCK    11447     91841     8
Total         142767    527949     3


[root@link-08 cluster]# cat sm_debug
cover state 3
03000008 recover state 5
00000001 remove node 2 count 1
01000006 remove node 2 count 1
01000004 remove node 2 count 1
01000003 remove node 2 count 1
02000007 remove node 2 count 1
02000005 remove node 2 count 1
03000008 remove node 2 count 1


[root@link-08 cluster]# cat status
Protocol version: 5.0.1
Config version: 2
Cluster name: LINK_128
Cluster ID: 19208
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 3
Total_votes: 1
Quorum: 2  Activity blocked
Active subsystems: 10
Node name: link-08
Node addresses: 10.15.89.158



Comment 2 David Teigland 2006-04-10 14:32:39 UTC
cman has shut down on link-02; /var/log/messages might give some more info on
why. The dlm response is correct.

Comment 3 Corey Marthaler 2006-05-02 19:05:24 UTC
Just hit this again on the link cluster, and the following nodes all ended up
withdrawing and panicking: link-02, link-04, link-07, link-08. This was after
link-01 and link-07 were shot and brought back into the cluster.


Here are the requested /var/log/messages excerpts from all nodes.

link-01:
May  2 10:38:07 link-01 kernel: CMAN: Waiting to join or form a Linux-cluster  
May  2 10:38:08 link-01 kernel: CMAN: sending membership request
May  2 10:38:08 link-01 ccsd[3832]: Connected to cluster infrastruture via:
CMAN/SM Plugin v1.1.5
May  2 10:38:08 link-01 ccsd[3832]: Initial status:: Inquorate                 
May  2 10:38:09 link-01 kernel: CMAN: got node link-03
May  2 10:38:09 link-01 kernel: CMAN: got node link-04                         
May  2 10:38:09 link-01 kernel: CMAN: got node link-08
May  2 10:38:09 link-01 kernel: CMAN: got node link-05                         
May  2 10:38:09 link-01 kernel: CMAN: got node link-02
May  2 10:38:24 link-01 kernel: CMAN: node link-07 rejoining                    
May  2 10:38:25 link-01 kernel: CMAN: Being told to leave the cluster by node 6
May  2 10:38:25 link-01 kernel: CMAN: we are leaving the cluster.              
May  2 10:38:25 link-01 kernel: WARNING: dlm_emergency_shutdown
May  2 10:38:25 link-01 kernel: WARNING: dlm_emergency_shutdown  


link-02:               
May  2 10:33:58 link-02 kernel: CMAN: node link-07 has been removed from the
cluster : Missed too many heartbeats
May  2 10:34:09 link-02 kernel: CMAN: removing node link-01 from the cluster :
No response to messages
May  2 10:34:16 link-02 fenced[6033]: fencing deferred to link-03
May  2 10:35:05 link-02 kernel: GFS: fsid=LINK_CLUSTER:link1.5: jid=4: Trying to acquire journal lock...
May  2 10:35:05 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=4: Trying to acquire journal lock...
May  2 10:35:05 link-02 kernel: GFS: fsid=LINK_CLUSTER:link1.5: jid=4: Busy
May  2 10:35:05 link-02 kernel: GFS: fsid=LINK_CLUSTER:link1.5: jid=2: Trying to acquire journal lock...
May  2 10:35:05 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=4: Busy
May  2 10:35:05 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=2: Trying to acquire journal lock...
May  2 10:35:05 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=2: Looking at journal...
May  2 10:35:05 link-02 kernel: GFS: fsid=LINK_CLUSTER:link1.5: jid=2: Busy
May  2 10:35:05 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=2: Acquiring the transaction lock...
May  2 10:35:06 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=2: Replaying journal...
May  2 10:35:06 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=2: Replayed 148 of 148 blocks
May  2 10:35:06 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=2: replays = 148, skips = 0, sames = 0
May  2 10:35:06 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=2: Journal replayed in 1s
May  2 10:35:06 link-02 kernel: GFS: fsid=LINK_CLUSTER:link0.5: jid=2: Done
May  2 10:39:06 link-02 kernel: CMAN: node link-01 rejoining
May  2 10:39:08 link-02 kernel: CMAN: node link-07 rejoining
May  2 10:39:23 link-02 kernel: CMAN: removing node link-01 from the cluster : Shutdown
May  2 10:39:41 link-02 kernel: CMAN: node link-08 has been removed from the cluster : Inconsistent cluster view
May  2 10:39:59 link-02 kernel: CMAN: node link-03 has been removed from the cluster : Inconsistent cluster view
May  2 10:40:25 link-02 kernel: CMAN: node link-04 has been removed from the cluster : Inconsistent cluster view
May  2 10:40:46 link-02 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
May  2 10:40:46 link-02 kernel: WARNING: dlm_emergency_shutdown
May  2 10:40:46 link-02 kernel: WARNING: dlm_emergency_shutdown
May  2 10:40:46 link-02 kernel: SM: 00000001 sm_stop: SG still joined
May  2 10:40:46 link-02 ccsd[3966]: Cluster manager shutdown.  Attemping to reconnect...
May  2 10:40:46 link-02 kernel: SM: 01000002 sm_stop: SG still joined
May  2 10:40:46 link-02 kernel: SM: 02000004 sm_stop: SG still joined
May  2 10:41:15 link-02 ccsd[3966]: Unable to connect to cluster infrastructure after 30 seconds.
May  2 10:41:37 link-02 kernel: dlm: dlm_unlock: lkid 110200 lockspace not found


link-03:
May  2 11:56:40 link-03 kernel: CMAN: node link-07 has been removed from the cluster : Missed too many heartbeats
May  2 11:56:51 link-03 kernel: CMAN: removing node link-01 from the cluster : No response to messages
May  2 11:57:28 link-03 fenced[3510]: link-01 not a cluster member after 30 sec post_fail_delay
May  2 11:57:28 link-03 fenced[3510]: link-07 not a cluster member after 30 sec post_fail_delay
May  2 11:57:28 link-03 fenced[3510]: fencing node "link-01"
May  2 11:57:33 link-03 fenced[3510]: fence "link-01" success
May  2 11:57:38 link-03 fenced[3510]: fencing node "link-07"
May  2 11:57:41 link-03 fenced[3510]: fence "link-07" success
May  2 11:57:48 link-03 kernel: GFS: fsid=LINK_CLUSTER:link1.0: jid=4: Trying to acquire journal lock...
May  2 11:57:48 link-03 kernel: GFS: fsid=LINK_CLUSTER:link0.1: jid=4: Trying to acquire journal lock...
May  2 11:57:48 link-03 kernel: GFS: fsid=LINK_CLUSTER:link1.0: jid=4: Busy
May  2 11:57:48 link-03 kernel: GFS: fsid=LINK_CLUSTER:link1.0: jid=2: Trying to acquire journal lock...
May  2 11:57:48 link-03 kernel: GFS: fsid=LINK_CLUSTER:link0.1: jid=4: Busy
May  2 11:57:48 link-03 kernel: GFS: fsid=LINK_CLUSTER:link0.1: jid=2: Trying to acquire journal lock...
May  2 11:57:48 link-03 kernel: GFS: fsid=LINK_CLUSTER:link1.0: jid=2: Busy
May  2 11:57:48 link-03 kernel: GFS: fsid=LINK_CLUSTER:link0.1: jid=2: Busy
May  2 12:01:49 link-03 kernel: CMAN: node link-01 rejoining
May  2 12:01:50 link-03 kernel: CMAN: node link-07 rejoining
May  2 12:02:21 link-03 kernel: CMAN: removing node link-01 from the cluster :
No response to messages
May  2 12:02:24 link-03 kernel: CMAN: removing node link-08 from the cluster :
Inconsistent cluster view
May  2 12:02:42 link-03 kernel: CMAN: we are leaving the cluster. Inconsistent
cluster view
May  2 12:02:42 link-03 kernel: WARNING: dlm_emergency_shutdown
May  2 12:02:42 link-03 kernel: WARNING: dlm_emergency_shutdown
May  2 12:02:42 link-03 kernel: SM: 00000001 sm_stop: SG still joined
May  2 12:02:42 link-03 kernel: SM: 01000002 sm_stop: SG still joined
May  2 12:02:42 link-03 kernel: SM: 02000004 sm_stop: SG still joined
May  2 12:02:42 link-03 ccsd[3400]: Cluster manager shutdown.  Attemping to
reconnect...
May  2 12:03:00 link-03 ccsd[3400]: Unable to connect to cluster infrastructure
after 30 seconds.
May  2 12:03:30 link-03 ccsd[3400]: Unable to connect to cluster infrastructure
after 60 seconds.


link-04:
May  2 05:56:01 link-04 kernel: CMAN: node link-07 has been removed from the cluster : Missed too many heartbeats
May  2 05:56:12 link-04 kernel: CMAN: removing node link-01 from the cluster : No response to messages
May  2 05:56:19 link-04 fenced[5446]: fencing deferred to link-03
May  2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=4: Trying to acquire journal lock...
May  2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=4: Trying to acquire journal lock...
May  2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=4: Looking at journal...
May  2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=4: Looking at journal...
May  2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=4: Acquiring the transaction lock...
May  2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=4: Acquiring the transaction lock...
May  2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=4: Replaying journal...
May  2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=4: Replayed 0 of 169 blocks
May  2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=4: replays = 0, skips = 2, sames = 167
May  2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=4: Journal replayed in 1s
May  2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=4: Done
May  2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=2: Trying to acquire journal lock...
May  2 05:57:08 link-04 kernel: GFS: fsid=LINK_CLUSTER:link1.3: jid=2: Busy
May  2 05:57:09 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=4: Replaying journal...
May  2 05:57:09 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=4: Replayed 0 of 3 blocks
May  2 05:57:09 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=4: replays = 0, skips = 0, sames = 3
May  2 05:57:09 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=4: Journal replayed in 1s
May  2 05:57:09 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=4: Done
May  2 05:57:09 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=2: Trying to acquire journal lock...
May  2 05:57:09 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=2: Looking at journal...
May  2 05:57:09 link-04 kernel: GFS: fsid=LINK_CLUSTER:link0.3: jid=2: Done
May  2 06:01:09 link-04 kernel: CMAN: node link-01 rejoining
May  2 06:01:11 link-04 kernel: CMAN: node link-07 rejoining
May  2 06:01:41 link-04 kernel: CMAN: removing node link-01 from the cluster : No response to messages
May  2 06:01:44 link-04 kernel: CMAN: node link-08 has been removed from the cluster : Inconsistent cluster view
May  2 06:02:02 link-04 kernel: CMAN: node link-03 has been removed from the cluster : Inconsistent cluster view
May  2 06:02:28 link-04 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
May  2 06:02:28 link-04 kernel: WARNING: dlm_emergency_shutdown



link-05:
May  2 04:47:02 link-05 kernel: CMAN: node link-07 has been removed from the
cluster : Missed too many heartbeats
May  2 04:47:13 link-05 kernel: CMAN: removing node link-01 from the cluster :
No response to messages
May  2 04:47:20 link-05 fenced[5968]: fencing deferred to link-03
May  2 04:48:10 link-05 kernel: GFS: fsid=LINK_CLUSTER:link1.6: jid=4: Trying to
acquire journal lock...
May  2 04:48:10 link-05 kernel: GFS: fsid=LINK_CLUSTER:link0.6: jid=4: Trying to
acquire journal lock...
May  2 04:48:10 link-05 kernel: GFS: fsid=LINK_CLUSTER:link1.6: jid=4: Busy
May  2 04:48:10 link-05 kernel: GFS: fsid=LINK_CLUSTER:link1.6: jid=2: Trying to
acquire journal lock...
May  2 04:48:10 link-05 kernel: GFS: fsid=LINK_CLUSTER:link0.6: jid=4: Busy
May  2 04:48:10 link-05 kernel: GFS: fsid=LINK_CLUSTER:link0.6: jid=2: Trying to
acquire journal lock...
May  2 04:48:10 link-05 kernel: GFS: fsid=LINK_CLUSTER:link1.6: jid=2: Busy
May  2 04:48:10 link-05 kernel: GFS: fsid=LINK_CLUSTER:link0.6: jid=2: Busy
May  2 04:52:10 link-05 kernel: CMAN: node link-01 rejoining
May  2 04:52:12 link-05 kernel: CMAN: node link-07 rejoining
May  2 04:52:42 link-05 kernel: CMAN: removing node link-01 from the cluster :
No response to messages
May  2 04:52:45 link-05 kernel: CMAN: node link-08 has been removed from the
cluster : Inconsistent cluster view
May  2 04:53:03 link-05 kernel: CMAN: removing node link-03 from the cluster :
Inconsistent cluster view
May  2 04:53:29 link-05 kernel: CMAN: removing node link-04 from the cluster :
Inconsistent cluster view
May  2 04:53:50 link-05 kernel: CMAN: removing node link-02 from the cluster :
Inconsistent cluster view
May  2 04:58:53 link-05 kernel: CMAN: too many transition restarts - will die
May  2 04:58:53 link-05 kernel: CMAN: we are leaving the cluster. Inconsistent
cluster view
May  2 04:58:53 link-05 kernel: WARNING: dlm_emergency_shutdown
May  2 04:58:54 link-05 kernel: WARNING: dlm_emergency_shutdown
May  2 04:58:54 link-05 kernel: SM: 00000001 sm_stop: SG still joined
May  2 04:58:54 link-05 kernel: SM: 01000002 sm_stop: SG still joined
May  2 04:58:54 link-05 kernel: SM: 02000004 sm_stop: SG still joined
May  2 04:58:54 link-05 ccsd[3894]: Cluster manager shutdown.  Attemping to
reconnect...
May  2 04:59:20 link-05 ccsd[3894]: Unable to connect to cluster infrastructure
after 30 seconds.
May  2 04:59:50 link-05 ccsd[3894]: Unable to connect to cluster infrastructure
after 60 seconds.

link-07:
May  2 06:55:23 link-07 kernel: CMAN: Waiting to join or form a Linux-cluster
May  2 06:55:24 link-07 ccsd[4092]: Connected to cluster infrastruture via:
CMAN/SM Plugin v1.1.5
May  2 06:55:24 link-07 ccsd[4092]: Initial status:: Inquorate
May  2 06:55:24 link-07 kernel: CMAN: sending membership request
May  2 06:55:25 link-07 kernel: CMAN: sending membership request
May  2 06:55:41 link-07 kernel: CMAN: got node link-03
May  2 06:55:41 link-07 kernel: CMAN: got node link-04
May  2 06:55:41 link-07 kernel: CMAN: got node link-02
May  2 06:55:41 link-07 kernel: CMAN: got node link-05
May  2 06:55:41 link-07 kernel: CMAN: got node link-01
May  2 06:55:41 link-07 kernel: CMAN: got node link-08
May  2 06:55:56 link-07 kernel: CMAN: removing node link-01 from the cluster :
No response to messages
May  2 06:55:59 link-07 kernel: CMAN: node link-08 has been removed from the
cluster : Inconsistent cluster view
May  2 06:56:00 link-07 kernel: CMAN: got WAIT barrier not in phase 1
TRANSITION.1330 (2)
May  2 06:56:06 link-07 kernel: CMAN: quorum regained, resuming activity
May  2 06:56:06 link-07 ccsd[4092]: Cluster is quorate.  Allowing connections.
May  2 06:56:17 link-07 kernel: CMAN: node link-03 has been removed from the
cluster : Inconsistent cluster view
May  2 06:56:43 link-07 kernel: CMAN: node link-04 has been removed from the
cluster : Inconsistent cluster view
May  2 06:57:04 link-07 kernel: CMAN: node link-02 has been removed from the
cluster : Inconsistent cluster view
May  2 07:02:21 link-07 kernel: CMAN: removing node link-05 from the cluster :
Missed too many heartbeats


link-08:
May  2 10:26:04 link-08 kernel: CMAN: removing node link-07 from the cluster :
Missed too many heartbeats
May  2 10:26:10 link-08 kernel: CMAN: removing node link-01 from the cluster :
No response to messages
May  2 10:26:22 link-08 fenced[6112]: fencing deferred to link-03
May  2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=4: Trying to
acquire journal lock...
May  2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link0.0: jid=4: Trying to
acquire journal lock...
May  2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=4: Busy
May  2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=2: Trying to
acquire journal lock...
May  2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=2: Looking
at journal...
May  2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link0.0: jid=4: Busy
May  2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link0.0: jid=2: Trying to
acquire journal lock...
May  2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link0.0: jid=2: Busy
May  2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=2: Acquiring
the transaction lock...
May  2 10:27:12 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=2: Replaying
journal...
May  2 10:27:13 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=2: Replayed
765 of 1123 blocks
May  2 10:27:13 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=2: replays =
765, skips = 142, sames = 216
May  2 10:27:13 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=2: Journal
replayed in 1s
May  2 10:27:13 link-08 kernel: GFS: fsid=LINK_CLUSTER:link1.1: jid=2: Done
May  2 10:31:13 link-08 kernel: CMAN: node link-01 rejoining
May  2 10:31:14 link-08 kernel: CMAN: node link-07 rejoining
May  2 10:31:36 link-08 kernel: CMAN: removing node link-01 from the cluster :
No response to messages
May  2 10:31:48 link-08 kernel: CMAN: we are leaving the cluster. Inconsistent
cluster view
May  2 10:31:48 link-08 kernel: WARNING: dlm_emergency_shutdown
May  2 10:31:48 link-08 kernel: dlm: link0: remote_stage error -105 6e02a9
May  2 10:31:48 link-08 kernel: dlm: link0: remote_stage error -105 880212
May  2 10:31:48 link-08 kernel: dlm: link1: remote_stage error -105 480354
May  2 10:31:48 link-08 kernel: WARNING: dlm_emergency_shutdown
May  2 10:31:48 link-08 kernel: SM: 00000001 sm_stop: SG still joined
May  2 10:31:48 link-08 kernel: SM: 01000002 sm_stop: SG still joined
May  2 10:31:48 link-08 kernel: SM: 02000004 sm_stop: SG still joined
May  2 10:31:48 link-08 ccsd[4045]: Cluster manager shutdown.  Attemping to
reconnect...


Comment 4 Christine Caulfield 2006-05-03 12:55:35 UTC
If that "Inconsistent cluster view" is the cause of the DLM shutdowns then this
is a CMAN bug and a damnably difficult one to isolate IIRC.

Something similar was seen on the mailing list a couple of weeks ago. I was sent
a tcpdump, and around 50-70% of the network packets were going missing in that
case. I doubt that's happening here (and hope it isn't)!

I've set revolver running on my 8-node bench cluster and it's been running all
day now.


Comment 5 Christine Caulfield 2006-05-05 08:26:23 UTC
This is an extension of bz#177163.

Here's how to reproduce it:

1. Start up a few nodes (minimum 3)
2. Let them run until the sequence numbers get > 32767 (OK, this is hard to
verify, but with some code finagling it's easy to fake)
3. Remove one node
4. Restart that node (with sequence numbers starting from 0)

Boom. The whole cluster falls apart.
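
[Editorial note: the toy program below is an illustration only, with made-up
names; it is not the cnxman.c/membership.c code. It is meant to make the
"> 32767" threshold in step 2 concrete: once the live cluster's sequence
counter has passed 32767, a node that rejoins with its counter restarted at 0
can be judged stale or fresh differently depending on whether a peer ends up
treating the counter as signed 16-bit or unsigned, and the members stop
agreeing -- which is one plausible way to end up with "Inconsistent cluster
view". The real fix is in the attached patch.]

/* Toy illustration only -- hypothetical names, not the cman-kernel code. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint16_t cluster_seq = 40000; /* surviving nodes are past 32767 */
        uint16_t rejoin_seq  = 1;     /* rebooted node restarts at 0/1  */

        /* A peer comparing the counters unsigned sees the newcomer as stale... */
        printf("unsigned view: newcomer stale = %d\n",
               rejoin_seq < cluster_seq);

        /* ...but 40000 viewed through a signed 16-bit type is -25536, so a
         * peer taking the signed view reaches the opposite answer. Nodes
         * that disagree about which messages are current cannot keep a
         * consistent membership view. */
        printf("signed view:   newcomer stale = %d\n",
               (int16_t)rejoin_seq < (int16_t)cluster_seq);

        return 0;
}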


Comment 6 Christine Caulfield 2006-05-05 09:37:34 UTC
Created attachment 128650 [details]
Proposed patch to fix

This is the patch I'm currently testing; if it runs OK over the weekend I'll
commit it.

Anyone else please feel free to try it :-)

Comment 7 Christine Caulfield 2006-05-09 09:27:49 UTC
This ran OK over the weekend and in other tests.

-rRHEL4
Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.25; previous revision: 1.42.2.24
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v  <--  membership.c
new revision: 1.44.2.23; previous revision: 1.44.2.22
done

-rSTABLE
Checking in cnxman.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/cnxman.c,v  <--  cnxman.c
new revision: 1.42.2.12.4.1.2.12; previous revision: 1.42.2.12.4.1.2.11
done
Checking in membership.c;
/cvs/cluster/cluster/cman-kernel/src/Attic/membership.c,v  <--  membership.c
new revision: 1.44.2.18.6.5; previous revision: 1.44.2.18.6.4
done


Comment 11 Red Hat Bugzilla 2006-05-25 21:04:47 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0518.html