From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4.3) Gecko/20040803

Description of problem:
5 node cluster (morph-01 - morph-05), all running I/O load to 4 filesystems.
Revolver shoots 3 nodes (morph-03, morph-04, morph-05). morph-01 and morph-02
detect that the nodes are gone:

morph-01:
Nov 11 11:36:09 morph-01 kernel: CMAN: quorum lost, blocking activity

morph-02:
Nov 11 11:37:02 morph-02 kernel: CMAN: no HELLO from morph-03, removing from the cluster
Nov 11 11:37:06 morph-02 kernel: CMAN: node morph-05 is not responding - removing from the cluster
Nov 11 11:37:06 morph-02 kernel: CMAN: node morph-04 is not responding - removing from the cluster
Nov 11 11:37:08 morph-02 kernel: CMAN: quorum lost, blocking activity

Revolver waits a couple of minutes for the nodes to come back up and, when they do, attempts to bring them back into the cluster. When the cman join is attempted on the shot nodes, morph-01 decides to fence all three of them:

morph-01:
Nov 11 11:40:32 morph-01 kernel: CMAN: node morph-05 rejoining
Nov 11 11:40:33 morph-01 kernel: CMAN: quorum regained, resuming activity
Nov 11 11:40:33 morph-01 fenced[4015]: fencing node "morph-05"
Nov 11 11:40:33 morph-01 kernel: CMAN: node morph-03 rejoining
Nov 11 11:40:37 morph-01 kernel: CMAN: node morph-04 rejoining
Nov 11 11:40:38 morph-01 fenced[4015]: fence "morph-05" success
Nov 11 11:40:39 morph-01 fenced[4015]: fencing node "morph-04"
Nov 11 11:40:41 morph-01 kernel: CMAN: node morph-05 is not responding - removing from the cluster
Nov 11 11:40:43 morph-01 fenced[4015]: fence "morph-04" success
Nov 11 11:40:44 morph-01 fenced[4015]: fencing node "morph-03"
Nov 11 11:40:48 morph-01 fenced[4015]: fence "morph-03" success
Nov 11 11:40:53 morph-01 kernel: CMAN: node morph-03 is not responding - removing from the cluster

morph-02:
Nov 11 11:41:31 morph-02 kernel: CMAN: node morph-05 rejoining
Nov 11 11:41:32 morph-02 kernel: CMAN: quorum regained, resuming activity
Nov 11 11:41:32 morph-02 kernel: CMAN: node morph-03 rejoining
Nov 11 11:41:32 morph-02 fenced[3848]: fencing deferred to 1
Nov 11 11:41:36 morph-02 kernel: CMAN: node morph-04 rejoining
Nov 11 11:41:52 morph-02 kernel: CMAN: node morph-03 is not responding - removing from the cluster

Both morph-01 and morph-02 then start replaying journals and recovering the filesystems. morph-02 then panics due to what appears to be a corrupted filesystem:

Nov 11 11:41:59 morph-02 kernel: GFS: fsid=morph-cluster:corey0.1: jid=3: Replaying journal...
  mh_magic = 0x01161970
  mh_type = 8
  mh_generation = 0
  mh_format = 800
  mh_incarn = 0
  lh_flags = 0x00000000
  lh_pad = 0
  lh_first = 65775200
  lh_sequence = 1003
  lh_tail = 65775008
  lh_last_dump = 65774992
  lh_reserved = 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  start = 65775120
Kernel panic - not syncing: GFS: Assertion failed on line 362 of file /usr/src/cluster/gfs-kernel/src/gfs/recovery.c
GFS: assertion: "header.lh_first == start"
GFS: time = 1100194920
GFS: fsid=morph-cluster:corey0.1

How reproducible:
Didn't try
I'll try to reproduce this and maybe narrow down where the bug really lives (cman, fenced, or gfs).
FWIW, gfs_fsck was able to make this fs mountable again
I was able to reproduce the delayed fencing (after a cluster rejoin is attempted) but not yet the fs corruption.
What version of the code is this? A fix I added on 11/12 appears to be missing.

Here's what should happen:
- 5 nodes in cluster (A, B, X, Y, Z), all running fenced, dlm, and gfs
- kill 3 (X, Y, Z), leaving 2 (A, B)
- the cluster loses quorum, meaning all services are suspended on A+B (this includes fencing, dlm, and gfs services; no fencing should occur, no dlm recovery should occur, and no gfs journal replay should occur, until...)
- X is brought back into the cluster; Y+Z are left inactive
- 3 of 5 nodes (A+B+X) now satisfies quorum
- on A+B, the services are now unsuspended and they do recovery:
  * first, fence domain recovery: any node that was in the cluster but is not any longer is fenced. This means that Y+Z are fenced. X, which was also killed but has also just joined the cluster, should /not/ be fenced. Here's where the bug is: the fencing daemon was incorrectly fencing node X in addition to correctly fencing Y+Z. I fixed this on 11/12/04 in response to a report by Patrick on cluster-list. [Note: I should add an additional delay before fencing Y+Z in this situation, in the hope that they'll have enough time to rejoin the cluster and avoid being fenced, like X.]
  * second, dlm recovery occurs
  * third, gfs recovery occurs; A and B are responsible for recovering the journals used by X, Y, and Z. X will not be allowed to mount until these recoveries are complete.

I don't have an explanation for the gfs assertion. There was another obscure bug I fixed on 11/15/04 where a machine would be allowed to mount gfs while it was /joining/ the fencing domain but was not yet a full domain member. This could result in a machine having gfs mounted, being killed, but not getting fenced. It's a possible explanation for this, although I'm not sure of the details.
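For illustration only, the fence-domain recovery rule described above can be sketched as simple set logic. This is a hypothetical Python sketch, not the actual fenced source; the function name `fence_victims` and both parameter names are my own invention. It shows why a node like X that rejoins before recovery runs must not be selected, which is exactly what the pre-11/12 code got wrong.

```python
def fence_victims(members_before_outage, members_at_recovery):
    """Return the nodes to fence: former domain members that have NOT
    rejoined the cluster by the time quorum returns and recovery runs.
    (Hypothetical sketch of the rule, not the real fenced code.)"""
    return set(members_before_outage) - set(members_at_recovery)

# Scenario from the comment above: A and B survive; X, Y, Z are killed;
# X rejoins before fence domain recovery starts.
before = {"A", "B", "X", "Y", "Z"}
now = {"A", "B", "X"}  # X made it back in time; Y and Z did not
print(sorted(fence_victims(before, now)))  # prints ['Y', 'Z']; X is spared
```

The buggy behavior amounted to also including X in the victim set even though it was a current cluster member; the proposed extra delay just widens the window in which Y and Z can get back into `members_at_recovery`.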
Dave, You mentioned a possible additional delay in the comment above so that nodes Y and Z can avoid being fenced like X. Did that ever go in? I still see cases where I have 6 nodes, 3 get shot and brought back within a few seconds of each other and one or two of them end up fenced.
Never mind, I found it: the post_fail_delay variable.
Can we mark this resolved or not a bug now?
Setting the post_join_delay and post_fail_delay variables to 15 seconds or more appears to fix this issue.
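For anyone hitting this later: these fenced parameters are set on the fence_daemon element in /etc/cluster/cluster.conf. A minimal sketch, assuming the standard cluster.conf layout (the cluster name and config_version here are placeholders; the 15-second values come from the comment above):

```xml
<?xml version="1.0"?>
<cluster name="morph-cluster" config_version="1">
  <!-- post_join_delay: seconds fenced waits after a node joins the
       fence domain before fencing victims, giving freshly rebooted
       nodes time to rejoin and avoid being fenced.
       post_fail_delay: seconds fenced waits after a member fails
       before fencing it. -->
  <fence_daemon post_join_delay="15" post_fail_delay="15"/>
  <!-- clusternodes, fencedevices, etc. omitted -->
</cluster>
```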